# Working with Text Data and Naive Bayes in scikit-learn

## Agenda

**Working with text data**

- Representing text as data
- Reading SMS data
- Vectorizing SMS data
- Examining the tokens and their counts
- Bonus: Calculating the "spamminess" of each token

**Naive Bayes classification**

- Building a Naive Bayes model
- Comparing Naive Bayes with logistic regression

## Part 1: Representing text as data

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [None]:
# start with a simple example


### CountVectorizer Process 
1. Import
2. Instantiate (and set parameters)
3. Use `.fit` to learn the vocabulary --> ONLY ON TRAINING DATA
4. Transform (vectorize the data)
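
The four steps above can be sketched on a tiny corpus. This is a minimal illustration, not the worksheet's answer key: the three messages in `simple_train` are made up, and the vectorizer uses its default parameters (lowercasing, tokens of two or more characters).

```python
from sklearn.feature_extraction.text import CountVectorizer  # 1. import

# Hypothetical three-message training set, made up for illustration
simple_train = ["call you tonight", "Call me a cab", "please call me... PLEASE!"]

vect = CountVectorizer()                         # 2. instantiate (default parameters)
vect.fit(simple_train)                           # 3. learn the vocabulary from training data only
simple_train_dtm = vect.transform(simple_train)  # 4. build the document-term matrix

# vocabulary_ maps token -> column index; sorting the tokens gives column order
vocab = sorted(vect.vocabulary_)
dense = simple_train_dtm.toarray()

print(vocab)
print(dense)
```

With the defaults, the single-character token "a" is dropped and "PLEASE" is counted together with "please".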

In [None]:
# learn the 'vocabulary' of the training data

### instantiate

### fit --> "learns" vocabulary (creates an empty DTM with column headers)


In [None]:
# take a look at the features --> "LEARNED VOCAB"


In [None]:
# transform training data into a 'document-term matrix'


In [None]:
# print the sparse matrix


In [None]:
# convert sparse matrix to a dense matrix


In [None]:
# examine the vocabulary and document-term matrix together


In [None]:
# create a document-term matrix on your own


In [None]:
### instantiate

### fit to learn vocab (it sets the vocab in local memory)

### transform (takes that local memory of words and encodes it into a matrix)

### DTM to DF


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
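
To make the "ignoring the relative position" point concrete, here is a small sketch with two made-up sentences that contain the same words in a different order. Under a bag-of-words representation they should get identical rows.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents: same words, different order
docs = ["dog bites man", "man bites dog"]

dtm = CountVectorizer().fit_transform(docs).toarray()

# Word order is discarded, so both rows hold the same counts
same = (dtm[0] == dtm[1]).all()
print(dtm)
print(same)
```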

In [None]:
# transform testing data into a document-term matrix (using existing vocabulary)
# why don't we call .fit?


In [None]:
# examine the vocabulary and document-term matrix together


### Why are you getting an empty dataframe?!!!! Think about how you instantiate CountVectorizer.

You will have to re-instantiate `CountVectorizer` and go through the instantiate, fit, and transform steps on the training data before you can transform the test data.

In [None]:
# start with a simple example

## instantiate

# fit

# transform training data into a 'document-term matrix'

# transform testing data into a document-term matrix (using existing vocabulary)
# DON'T CALL FIT ON TEST DATA!!!!! --> you can copy and paste from two cells above



# examine the vocabulary and document-term matrix together


**Summary:**

- `vect.fit(train)` learns the vocabulary of the training data
- `vect.transform(train)` uses the fitted vocabulary to build a document-term matrix from the training data
- `vect.transform(test)` uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)
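
The last bullet can be checked with a short sketch. The training and test messages here are made up; the point is only that a token absent from the training vocabulary ("don" from "don't") adds no column and no count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data, made up for illustration
simple_train = ["call you tonight", "Call me a cab", "please call me... PLEASE!"]
simple_test = ["please don't call me"]

vect = CountVectorizer()
vect.fit(simple_train)                 # vocabulary comes from training data only

test_dtm = vect.transform(simple_test) # no fit here: reuse the fitted vocabulary
test_row = test_dtm.toarray()[0]

print(sorted(vect.vocabulary_))
print(test_row)
```

The test matrix has the same six columns as the training matrix, which is what a model fitted on the training matrix expects.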

## Part 2: Reading SMS data
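
The cells in this part load the SMS file, encode the label, and look at message length. As a rough sketch of the encoding and length steps, the following uses a made-up DataFrame in place of the real file. The column names `label` and `message` are assumptions, not taken from the data set.

```python
import pandas as pd

# Hypothetical stand-in for the SMS data; column names are assumptions
sms = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "message": ["see you at lunch", "WIN a free prize now", "ok"],
})

# convert label to a numeric variable: ham = 0, spam = 1
sms["label_num"] = sms["label"].map({"ham": 0, "spam": 1})

# length of each message in characters
sms["length"] = sms["message"].str.len()

print(sms)
```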

In [None]:
# read tab-separated file


In [None]:
# convert label to a numeric variable
# ham = 0, spam = 1


In [None]:
# What is the length of each message?


In [None]:
# Plot a pretty picture of the length and show summary statistics
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# plot

# sum stats


In [None]:
# plot ham and spam in two separate charts


In [None]:
# define X and y


In [None]:
# split into training and testing sets


## Part 3: Vectorizing SMS data
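
The cells below fit on the training messages and then transform both sets. A minimal sketch on made-up messages, showing that `fit` followed by `transform` gives the same matrix as `fit_transform`, and that the test matrix keeps the training columns:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical messages, made up for illustration
X_train = ["win a free prize now", "are we meeting for lunch"]
X_test = ["free lunch for you"]

# two-step version: fit, then transform
vect = CountVectorizer()
vect.fit(X_train)
dtm_two_step = vect.transform(X_train)

# one-step version: fit_transform
vect2 = CountVectorizer()
dtm_one_step = vect2.fit_transform(X_train)

# both approaches produce the same document-term matrix
same = (dtm_two_step.toarray() == dtm_one_step.toarray()).all()

# test data is only transformed, never fitted
X_test_dtm = vect2.transform(X_test)
print(same, X_test_dtm.shape)
```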

In [None]:
# instantiate the vectorizer


In [None]:
# learn training data vocabulary, then create document-term matrix


In [None]:
# alternative: combine fit and transform into a single step


In [None]:
# transform testing data (using fitted vocabulary) into a document-term matrix (DON'T USE .FIT ON TEST DATA!!!)


## Part 4: Examining the tokens and their counts
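
As a sketch of the counting step in this part, the following sums each column of a small document-term matrix and pairs the totals with their tokens. The three documents are made up, and the token list is read from `vocabulary_` rather than a feature-name method so the snippet does not depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, made up for illustration
docs = ["free prize free cash", "lunch at noon", "free lunch"]

vect = CountVectorizer()
dtm = vect.fit_transform(docs)

tokens = sorted(vect.vocabulary_)             # tokens in column order
counts = np.asarray(dtm.sum(axis=0)).ravel()  # column sums = total count per token

# DataFrame of tokens with their counts, most frequent first
token_counts = pd.DataFrame({"token": tokens, "count": counts})
token_counts = token_counts.sort_values("count", ascending=False)
print(token_counts)
```

Summing the sparse matrix directly avoids building the full dense array, which matters once the vocabulary has thousands of columns.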

In [None]:
# store token names (get_feature_names_out() requires scikit-learn >= 1.0)
X_train_tokens = vect.get_feature_names_out()

In [None]:
# first 50 tokens
X_train_tokens[:50]

In [None]:
# last 50 tokens
X_train_tokens[-50:]

In [None]:
# view X_train_dtm as a dense matrix


In [None]:
# count how many times EACH token appears across ALL messages in X_train_dtm
import numpy as np


In [None]:
# create a DataFrame of tokens with their counts


## Bonus: Calculating the "spamminess" of each token
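
The cells below build a spam-to-ham ratio per token. Here is one possible sketch of that calculation on a made-up four-message corpus. The column names `label` and `message` and the result column `spam_ratio` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus; column names are assumptions
sms = pd.DataFrame({
    "label": ["ham", "ham", "spam", "spam"],
    "message": ["lunch at noon", "see you at lunch",
                "claim your free prize", "free prize claim now"],
})

# separate DataFrames for ham and spam
sms_ham = sms[sms["label"] == "ham"]
sms_spam = sms[sms["label"] == "spam"]

# learn the vocabulary of ALL messages
vect = CountVectorizer()
vect.fit(sms["message"])

# per-class document-term matrices, then per-token totals
ham_counts = np.asarray(vect.transform(sms_ham["message"]).sum(axis=0)).ravel()
spam_counts = np.asarray(vect.transform(sms_spam["message"]).sum(axis=0)).ravel()

token_counts = pd.DataFrame(
    {"ham": ham_counts, "spam": spam_counts},
    index=sorted(vect.vocabulary_),  # tokens in column order
)

# add one to both counts to avoid dividing by zero
token_counts["ham"] += 1
token_counts["spam"] += 1

# ratio of spam-to-ham for each token
token_counts["spam_ratio"] = token_counts["spam"] / token_counts["ham"]
print(token_counts.sort_values("spam_ratio", ascending=False))
```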

In [None]:
# create separate DataFrames for ham and spam


In [None]:
# learn the vocabulary of ALL messages and save it


In [None]:
# create document-term matrices for ham and spam


In [None]:
# count how many times EACH token appears across ALL ham messages
import numpy as np


In [None]:
# count how many times EACH token appears across ALL spam messages


In [None]:
# create a DataFrame of tokens with their separate ham and spam counts


In [None]:
# add one to ham and spam counts to avoid dividing by zero (in the step that follows)


In [None]:
# calculate ratio of spam-to-ham for each token


In [None]:
# observe spam messages that contain the word 'claim'


## Part 5: Building a Naive Bayes model

We will use [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
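
To illustrate "discrete features", here is a small sketch that fits `MultinomialNB` directly on a hand-written count matrix. The two columns stand for hypothetical counts of the words "free" and "meeting"; the numbers are made up.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts per message: columns = ["free", "meeting"]
X = np.array([[3, 0],
              [2, 0],
              [0, 2],
              [0, 3]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

nb = MultinomialNB()  # default smoothing (alpha=1.0)
nb.fit(X, y)

# a message heavy on "free" and a message with one "meeting"
pred = nb.predict(np.array([[4, 0], [0, 1]]))
print(pred)
```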

### NLP with ML Process

**Pre-processing word data**

1. Import `CountVectorizer`
2. Instantiate `CountVectorizer`
3. Fit on `X_train`
4. Transform `X_train`
5. Transform `X_test`

**Model building process**

1. Import
2. Instantiate model
3. Fit on the training document-term matrix (`X_train_dtm`) and `y_train`
4. Predict on the test document-term matrix (`X_test_dtm`)
5. Evaluate
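
Both lists can be run end to end on a toy problem. This is a sketch under stated assumptions: the six messages and their labels are made up, so the scores say nothing about performance on the real SMS data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Hypothetical data (1 = spam, 0 = ham), made up for illustration
X_train = ["win a free prize now", "free cash claim now",
           "are we meeting for lunch", "see you at lunch tomorrow"]
y_train = [1, 1, 0, 0]
X_test = ["claim your free prize", "lunch tomorrow at noon"]
y_test = [1, 0]

# pre-processing: fit on training text only, then transform both sets
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# model building: instantiate, fit, predict
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

# evaluate: accuracy, confusion matrix, and AUC from predicted spam probabilities
accuracy = metrics.accuracy_score(y_test, y_pred_class)
confusion = metrics.confusion_matrix(y_test, y_pred_class)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
auc = metrics.roc_auc_score(y_test, y_pred_prob)

print(accuracy)
print(confusion)
print(auc)
```

AUC is computed from the predicted probabilities rather than the class predictions, since it measures how well the model ranks spam above ham.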



In [None]:
# train a Naive Bayes model using X_train_dtm

# import

# instantiate

# fit


In [None]:
# make class predictions for X_test_dtm

# predict


In [None]:
# calculate accuracy of class predictions



In [None]:
# confusion matrix


In [None]:
# calculate AUC


In [None]:
# print out the classification table


In [None]:
# print message text for the false positives


In [None]:
# print message text for the false negatives


In [None]:
# what do you notice about the false negatives?


Note: The EDA section came from the following blog: https://adataanalyst.com/scikit-learn/countvectorizer-sklearn-example/