This is week 1 of Markham's text analysis course. In this lesson we look at how to create a Bayesian spam detector.


[Here](https://www.dataschool.io/mltext-videos/) is the link to the private blog post that has all the videos. 

In [219]:
#import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [220]:
#import iris
from sklearn.datasets import load_iris

We create the X and y matrices. Note that we make the X a capital letter because it is a matrix and the y is just a single vector.

In [221]:
#create labels and features
iris = load_iris()
X = iris.data
y = iris.target

Print out the shapes of the two arrays. The shape attribute returns a tuple, which is why the shape of y has a trailing comma in it: it is a one-element tuple.

In [222]:
print(X.shape)
print(y.shape)

(150, 4)
(150,)


Next we create a Pandas DataFrame. We could load the data in from a Pandas DataFrame but here we just use it to examine our data. Note that we aren't storing the DataFrame in a variable, just getting a look at the first 5 rows.

In [223]:
#create DataFrame for examining first 5 rows of features but not labels.
pd.DataFrame(X, columns=iris.feature_names).head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


Pandas is built on top of NumPy arrays, so when you pass a Pandas object to sklearn it knows there is a NumPy array underneath. You can pass either.

In [224]:
#import the nearest neighbors model from sklearn
from sklearn.neighbors import KNeighborsClassifier

In [225]:
#instantiate and fit
knn = KNeighborsClassifier()
knn.fit(X,y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [226]:
#get predictions
prediction = knn.predict([[3,5,4,2]])
print(prediction)
prediction.shape

[1]


(1,)

## Representing Text as Numerical Data

In [227]:
#example messages for training (SMS messages)
simple_train = ['call me tonight', 'Call me a cab', 'please call me...PLEASE!']
print(simple_train)

['call me tonight', 'Call me a cab', 'please call me...PLEASE!']


So here is the problem with text. You have symbols that are your data. But sklearn wants numbers as data and it wants them in a fixed length. So you have to 'vectorize'.  

CountVectorizer() is not a model but uses the same API as a model, so it has the same sort of methods and the same sort of four step procedure: import, instantiate, fit, and then transform in place of predict.

In [228]:
#import the vectorizing function
from sklearn.feature_extraction.text import CountVectorizer

In [229]:
#instantiate
vect = CountVectorizer()

In [230]:
#fit the vectorizer (learn the vocabulary)
vect.fit(simple_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [231]:
#get the words that we will be counting for the model
vect.get_feature_names()

['cab', 'call', 'me', 'please', 'tonight']

So the fit method for CountVectorizer just learns the names of the words, setting up the framework of the data set.  

Notice that the 'a' is removed, but not because it is a stop word. The default 'token_pattern' only counts strings of at least two word characters as tokens.
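
To see why 'a' disappears, the default pattern from the `CountVectorizer` repr can be tried directly with Python's `re` module. This is only a sketch of the tokenizing step (lowercase, then regex match), not the library's full processing, and the helper name `tokenize` is made up for illustration:

```python
import re

# Default token_pattern from the CountVectorizer repr: two or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    # Lowercase first (lowercase=True is the default), then extract the tokens
    return re.findall(TOKEN_PATTERN, text.lower())

tokens = tokenize("Call me a cab")
print(tokens)  # 'a' is dropped because it is a single character
```

Punctuation is never matched by `\w`, which is also why the dots and the exclamation mark in 'please call me...PLEASE!' never show up as features.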

The next step is the transform. For a model, fit means learning the relationship between X and y; for a vectorizer, it means learning the vocabulary. Transform can mean any number of things depending on the object it is called on.  

So fit has learned the vocabulary of the data set, and transform will count the occurrences of those feature names. Transform takes in one thing and returns another, transformed thing. Here it takes simple_train, which is still a list of strings, and turns it into counts of the features: a document-term matrix.
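
The division of labor between the two steps can be sketched in plain Python. This is a toy re-implementation under the default settings (lowercasing, tokens of two or more word characters), not sklearn's actual code, and the function names `tokenize`, `fit` and `transform` are stand-ins chosen to mirror the methods:

```python
import re

def tokenize(text):
    # Same idea as the default settings: lowercase, tokens of 2+ word characters
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def fit(documents):
    # "fit" only learns the vocabulary: the sorted set of tokens seen in training
    return sorted({token for doc in documents for token in tokenize(doc)})

def transform(documents, vocabulary):
    # "transform" counts how often each vocabulary word occurs in each document
    index = {word: col for col, word in enumerate(vocabulary)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for token in tokenize(doc):
            if token in index:  # tokens outside the vocabulary are ignored
                row[index[token]] += 1
        matrix.append(row)
    return matrix

simple_train = ['call me tonight', 'Call me a cab', 'please call me...PLEASE!']
vocabulary = fit(simple_train)
dtm = transform(simple_train, vocabulary)
print(vocabulary)
print(dtm)
```

The result has one row per document and one column per vocabulary word, which is the fixed-length numeric input sklearn needs.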

In [233]:
#obtain the matrix of documents and word counts
simple_train_dtm = vect.transform(simple_train)
print(simple_train_dtm.shape)

(3, 5)


The dtm will be stored as a sparse matrix, which is just a matrix that doesn't store the zeros but only the locations and values of the non-zero elements. This is a way to deal with the 'sparseness of language': most documents contain only a small fraction of the vocabulary.
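
The idea can be illustrated with a plain dictionary keyed by (row, column). This is only a sketch of the concept; scipy's compressed sparse row format uses a different, more compact layout. The matrix values are the counts for the three example messages:

```python
# Dense version of a small document-term matrix (rows = documents, columns = words)
dense = [
    [0, 1, 1, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 2, 0],
]

# Keep only the non-zero cells, keyed by (row, column)
sparse = {
    (r, c): value
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
}

total_cells = len(dense) * len(dense[0])
print(sparse)
print(f"stored {len(sparse)} of {total_cells} cells")
```

With three short messages the saving is small, but with thousands of messages and thousands of vocabulary words almost every cell is zero.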

In [235]:
#the dtm is saved as a sparse matrix. Show dtm and data type
print(simple_train_dtm)
print(type(simple_train_dtm))

  (0, 1)	1
  (0, 2)	1
  (0, 4)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2
<class 'scipy.sparse.csr.csr_matrix'>


Now transform the matrix to a pandas DataFrame from a sparse matrix

In [236]:
#sparse matrices have a 'toarray()' method
print(simple_train_dtm.toarray())
#use toarray() to get a DataFrame
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())

[[0 1 1 0 1]
 [1 1 1 0 0]
 [0 1 1 2 0]]


Unnamed: 0,cab,call,me,please,tonight
0,0,1,1,0,1
1,1,1,1,0,0
2,0,1,1,2,0


Now we want to test our model. This requires some test data. The test data has to be in the same form as the training data: the same number of columns, with the same meaning as the training data's columns.

In [237]:
#test data: a 1 string list
simple_test = ["please don't call me."]

In [238]:
print(simple_test)

["please don't call me."]


In [240]:
#get the numeric matrix for the test data
simple_test_dtm = vect.transform(simple_test)

In [241]:
print(simple_test_dtm)

  (0, 1)	1
  (0, 2)	1
  (0, 3)	1


In [242]:
#show the test data in a data frame
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())

Unnamed: 0,cab,call,me,please,tonight
0,0,1,1,1,0


Notice that the word "don't" is not there. That is because there was no "don't" in the original data so the dtm for the training data and, consequently, for the test data, does not have "don't" in it. Since you didn't train the original model on the frequency of the term "don't", the number of times "don't" occurs in the test data is irrelevant. The model has nothing to say about it.

Markham uses the example of the Iris data set examined above. You have a model built on predicting species built on lengths and widths. If you come along with test data that includes a new feature, say, color, the model says, "Fine, but I don't know anything about the relationship between color and species so there is no point in recording that feature in the test data." This is a much more intuitive explanation than the long discussions of the test data "seeping into" the training set.  

This is also why we have to do the train_test_split before we do the vectorization.  

The issue wasn't discussed in this lesson but I am wondering how this affects the k-fold cross-validation procedure. It seems that there the vectorization is done before the k-fold splits into training and test sets so there would be a problem, no?  
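
One common way around this concern, sketched here under the assumption that sklearn's `make_pipeline` and `cross_val_score` are used (neither appears in this lesson), is to put the vectorizer inside a pipeline so that it is refit on the training part of each fold. The corpus and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = spam, 0 = ham
messages = [
    'win a free prize now', 'call me tonight',
    'free entry claim your prize', 'see you at lunch',
    'claim your free ringtone', 'are you home yet',
]
labels = [1, 0, 1, 0, 1, 0]

# The pipeline refits the vectorizer on the training part of every fold,
# so the held-out fold never contributes to the vocabulary
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipe, messages, labels, cv=3, scoring='accuracy')
print(scores)
```

Vectorizing the whole data set first and then cross-validating the classifier alone would let every fold's vocabulary include words from its own held-out messages.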

## Part 3: Reading the SMS data

Get the url to the data (though we have cloned the repository since doing this notebook and so this step is no longer strictly necessary) and a DataFrame with read_table(). 

In [246]:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv'
sms = pd.read_table(url, header=None, names=['label','message'])

The first step is always to examine the shape.

In [247]:
print(sms.shape)
print(type(sms))

(5572, 2)
<class 'pandas.core.frame.DataFrame'>


In [248]:
sms.head(10)

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


Look at the distribution of the dependent variable. 

In [250]:
#get the counts and percentages of the dependent variable
print(sms.label.value_counts())
sms.label.value_counts(normalize=True)

ham     4825
spam     747
Name: label, dtype: int64


ham     0.865937
spam    0.134063
Name: label, dtype: float64

Now we want to make the dependent variable into a number. In this case we want to make the spam into one and the ham into zero. We use the Series method 'map'.

In [251]:
sms['label_num'] = sms.label.map({'ham': 0, 'spam': 1})

In [252]:
sms.head()

Unnamed: 0,label,message,label_num
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


Now we prepare the data for CountVectorizer. We are going to keep the data labeled X and y. 

In [253]:
X = sms.message
y = sms.label_num

In [254]:
print(X.shape)
print(y.shape)

(5572,)
(5572,)


Now that we have a real data set to work on we are going to have to do the train-test split before we vectorize the string data. 

In [31]:
from sklearn.model_selection import train_test_split

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [33]:
print(X_train.shape)
print(X_test.shape)

(4179,)
(1393,)


You have to split before vectorizing because if you vectorized the whole data set, the vocabulary would include words that appear only in the test set, right? In reality we want the test set to look like new data that we have never encountered before, which would almost certainly contain words the model has never seen.

In [34]:
vect = CountVectorizer()

In [35]:
vect.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [36]:
X_train_dtm = vect.transform(X_train)

In [37]:
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1393x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 17604 stored elements in Compressed Sparse Row format>

## Part 5: Building a Naive Bayes model
So we are going to model this with a multinomial naive Bayes model. It usually takes integer counts as the independent variables but it can also work with fractional counts such as tf-idf. It will not accept negative features in any case.  

The module is naive_bayes. There is no 'bayes' module. I don't think anyone uses anything but the naive_bayes module, so anytime you are using a naive Bayes model the module is going to be naive_bayes. In the actual name of the model it is the 'naive Bayes' part that gets abbreviated to two letters (NB), not the 'Multinomial' part. I am guessing that is because, since it is in the 'naive_bayes' module, that information can be assumed.

In [41]:
#import and instantiate a multinomial naive bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [43]:
#train the model and time it with the %time magic command (a single run; %timeit would repeat the run many times)
%time nb.fit(X_train_dtm, y_train)

CPU times: user 5 ms, sys: 2.43 ms, total: 7.43 ms
Wall time: 5.25 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Here is a slight departure in Markham's notation where we only use the y_pred_class as the name of the predictions. I think it would be clearer to call it y_test_pred but what do I know. I am going to do it his way to see if there is some larger advantage from his naming scheme. 

In [46]:
#make the predictions for the y's 
y_pred_class = nb.predict(X_test_dtm)

He imports the entire metrics module because he is going to use a lot of the functions from that module. 

In [52]:
#score your predictions
# from sklearn.metrics import accuracy_score
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.98851399856424982

In [53]:
#print the confusion matrix.
metrics.confusion_matrix(y_test, y_pred_class)

array([[1203,    5],
       [  11,  174]])

So what does this mean? Remember this is reading things computer style so 0 comes before 1 on each axis. So the upper left hand corner is 0,0, or true negatives. We had 1203 true negatives, or correctly classified 'ham'. We had 174 true positives, or correctly classified 'spam'. The classification failures are on the other diagonal, 16 mistakes in total.  
The first axis, the rows, is the ground truth and the second axis, the columns, is the classification. So, the things that were spam but were classified as ham (0) are in the lower left hand corner, namely, 11, or 11 false negatives. The things that were actually ham (0 on the rows axis) but were classified as spam (1) are in the upper right hand corner, namely, 5, or 5 false positives.
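
The four cells can be unpacked and checked by hand. The numbers are copied from the confusion matrix output above; sensitivity and specificity are extra summaries not computed in the lesson and are included here only as an illustration:

```python
# Confusion matrix from the notebook output: rows = truth, columns = prediction
confusion = [[1203, 5],
             [11, 174]]

tn, fp = confusion[0]  # actual ham: predicted ham, predicted spam
fn, tp = confusion[1]  # actual spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
sensitivity = tp / (tp + fn)  # share of actual spam that was caught
specificity = tn / (tn + fp)  # share of actual ham that was left alone

print(total, fp + fn)
print(round(accuracy, 6), round(sensitivity, 4), round(specificity, 4))
```

The total matches the 1393 test messages, and the accuracy agrees with the value returned by `metrics.accuracy_score`.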

So, a false positive means the message was incorrectly classified as spam. There are 5 of those. We can read the messages that were incorrectly identified as spam by going to the test data. A false positive got a 1 in y_pred_class but a zero in y_test. We can use that fact to filter the raw X_test data. We are going to keep to the convention of putting the truth variable before the prediction variable.

In [58]:
#print messages for false positives (incorrectly classified as spam)
X_test[y_test < y_pred_class]

574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object

A false negative means that the message was incorrectly classified as ham. There are 11 of those. That means that a message got classified as ham in the y_pred_class, or 0, but was a 1 in the y_test data. Again, we can use this fact to filter the raw data stating the ground truth variable first. 
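
The `<` and `>` comparisons work only because the labels are 0 and 1. An equivalent and more explicit way to write the two filters is sketched here, with small made-up stand-ins for `X_test`, `y_test` and `y_pred_class`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_test, y_test and y_pred_class
X_test = pd.Series(['msg a', 'msg b', 'msg c', 'msg d'], index=[10, 3, 7, 1])
y_test = pd.Series([0, 1, 0, 1], index=[10, 3, 7, 1])
y_pred_class = np.array([1, 1, 0, 0])

# False positives: truly ham (0) but predicted spam (1)
false_positives = X_test[(y_test == 0) & (y_pred_class == 1)]
# False negatives: truly spam (1) but predicted ham (0)
false_negatives = X_test[(y_test == 1) & (y_pred_class == 0)]

print(false_positives)
print(false_negatives)
```

Both forms select the same rows; the explicit masks just spell out which class is the truth and which is the prediction.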

In [59]:
#print messages for false negatives (incorrectly classified as ham when they were 
#actually spam)
X_test[y_test > y_pred_class]

3132    LookAtMe!: Thanks for your purchase of a video...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
684     Hi I'm sue. I am 20 years old and work as a la...
1875    Would you like to see my XXX pics they are so ...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
2821    INTERFLORA - It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object

What do we notice about the false negatives? 

In [60]:
X_test[3132]

"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."

They are longer? We only have 16 instances to study so we can't make any solid conclusions, but naive bayes might be getting lost in all these 'hammy' words. Old time spam messages would have the spam plus a lot of words tacked on from, say, a book, that would have a lot of un-spammy words. 

In [64]:
#calculate predicted probabilities for each observation
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([  2.87744864e-03,   1.83488846e-05,   2.07301295e-03, ...,
         1.09026171e-06,   1.00000000e+00,   3.98279868e-09])

What does this show us? First, naive Bayes gives us really poorly calibrated probabilities, all extremes, very close to either 0 or 1.  

The AUC is still good: it depends only on how well the probabilities rank spam above ham, not on their calibration, and the accuracy is high in the first place.

In [65]:
metrics.roc_auc_score(y_test, y_pred_prob)

0.98664310005369615

Do the probabilities add up to 1? No, because the probabilities are at the observation level. So the probabilities for all the observations or any subset of the observations are not constrained to add up to 1, but the probabilities for each *observation* will add up to 1.

In [66]:
nb.predict_proba(X_test_dtm)

array([[  9.97122551e-01,   2.87744864e-03],
       [  9.99981651e-01,   1.83488846e-05],
       [  9.97926987e-01,   2.07301295e-03],
       ..., 
       [  9.99998910e-01,   1.09026171e-06],
       [  1.86697467e-10,   1.00000000e+00],
       [  9.99999996e-01,   3.98279868e-09]])

In [68]:
#add up the probabilities for each observation
predictions = nb.predict_proba(X_test_dtm)

for i in predictions[:10]:
    print(i[0] + i[1])

1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0


## Part 6: Comparing Naive Bayes with logistic regression

In [75]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [76]:
%time logreg.fit(X_train_dtm, y_train)

CPU times: user 35.8 ms, sys: 5.71 ms, total: 41.5 ms
Wall time: 41.4 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Note that the time is about 8 times longer than Naive Bayes (41.4 ms vs. 5.25 ms wall time).

In [77]:
y_pred_class = logreg.predict(X_test_dtm)

In [78]:
metrics.accuracy_score(y_test, y_pred_class)

0.9877961234745154

Now let's get the predicted probabilities from the regression model.

In [83]:
logreg.predict_proba(X_test_dtm)[:10]

array([[  9.87304442e-01,   1.26955581e-02],
       [  9.96528168e-01,   3.47183239e-03],
       [  9.93834835e-01,   6.16516539e-03],
       [  9.89535451e-01,   1.04645491e-02],
       [  9.88807519e-01,   1.11924808e-02],
       [  9.99269754e-01,   7.30246487e-04],
       [  9.90520731e-01,   9.47926922e-03],
       [  9.95547924e-01,   4.45207614e-03],
       [  9.94175638e-01,   5.82436234e-03],
       [  9.98997753e-01,   1.00224702e-03]])

Note that this function gives you a predicted probability for both classes. We don't want that. We only want the predicted probabilities that it will classify something as spam, i.e., predict a value of 1. So we take all rows and only the second column from the output of the `predict_proba()` method.

In [85]:
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([ 0.01269556,  0.00347183,  0.00616517, ...,  0.03354907,
        0.99725053,  0.00157706])

Notice that these values are less extreme, or better calibrated. Now we can get the AUC, the area under the ROC curve. Note that it is calculated from the predicted probabilities rather than the predicted classes.
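
Why probabilities rather than classes? AUC can be read as the fraction of (spam, ham) pairs in which the spam message gets the higher score. The sketch below computes it that way in plain Python; `auc_from_scores` is a made-up helper and the labels and scores are hypothetical, so this is an illustration of the idea rather than how `roc_auc_score` is implemented:

```python
def auc_from_scores(y_true, y_score):
    # AUC equals the fraction of (positive, negative) pairs in which the
    # positive example gets the higher score; ties count as half
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.3, 0.2, 0.4]
auc_prob = auc_from_scores(y_true, y_prob)

# Hard 0/1 predictions at a 0.5 threshold throw away the ranking information
y_class = [1 if p > 0.5 else 0 for p in y_prob]
auc_class = auc_from_scores(y_true, y_class)

print(auc_prob, auc_class)
```

In this toy case every probability is below 0.5, so the hard predictions are all 0 and carry no ranking at all, while the probabilities still order most pairs correctly.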

In [88]:
metrics.roc_auc_score(y_test, y_pred_prob)

0.99368176123143015

## Part 7: Calculate the 'Spamminess' of each token
This is a neat little trick you can do with Naive Bayes. The question we are trying to answer is why did certain messages get flagged as spam? Were the individual words 'thought of' as spammy words by the model?

In [100]:
#store the vocabulary of X_train
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)

7456

Note that 7456 is the number of words in the vocabulary that was pulled out by the `.fit()` method. 

In [101]:
#examine the first 50 and last 50 tokens
print(X_train_tokens[:50]) #use print to keep it from printing out 50 rows

['00', '000', '008704050406', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02072069400', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07008009200', '07090201529', '07090298926', '07123456789', '07732584351', '07734396839', '07742676969', '0776xxxxxxx', '07781482378', '07786200117', '078', '07801543489', '07808', '07808247860', '07808726822', '07815296484', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '0796xxxxxx', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705']


In [102]:
print(X_train_tokens[-50:])

['yer', 'yes', 'yest', 'yesterday', 'yet', 'yetunde', 'yijue', 'ym', 'ymca', 'yo', 'yoga', 'yogasana', 'yor', 'yorge', 'you', 'youdoing', 'youi', 'youphone', 'your', 'youre', 'yourjob', 'yours', 'yourself', 'youwanna', 'yowifes', 'yoyyooo', 'yr', 'yrs', 'ything', 'yummmm', 'yummy', 'yun', 'yunny', 'yuo', 'yuou', 'yup', 'zac', 'zaher', 'zealand', 'zebra', 'zed', 'zeros', 'zhong', 'zindgi', 'zoe', 'zoom', 'zouk', 'zyada', 'èn', '〨ud']


In [98]:
#Naive Bayes counts the number of times each token appears in each class. 
nb.feature_count_

array([[  0.,   0.,   0., ...,   1.,   1.,   1.],
       [  5.,  23.,   2., ...,   0.,   0.,   0.]])

Now look at what is going on here. The two classes are 0, ham, and 1, spam. The first row is the number of times each token showed up in ham messages and the second row is the number of times it showed up in spam messages. The columns are the tokens, i.e., the words. So the first token, '00', showed up zero times in ham messages but 5 times in messages that were in truth spam. A pretty spammy word.

In [103]:
#rows represent classes, columns represent tokens 
nb.feature_count_.shape

(2, 7456)

Now we use a little numpy to store the two different rows in separate objects. 

In [112]:
#number of times each token appears across all HAM messages
ham_token_count = nb.feature_count_[0,:]
ham_token_count

myHam_token_count = nb.feature_count_[:1,:]

ham_token_count == myHam_token_count

array([[ True,  True,  True, ...,  True,  True,  True]], dtype=bool)

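So both of these methods give the same values, though the slice `[:1,:]` keeps a 2-D shape of (1, 7456), which is why the comparison output is nested.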

In [119]:
#number of times each token appears across all SPAM messages
spam_token_count = nb.feature_count_[1, :]
spam_token_count

array([  5.,  23.,   2., ...,   0.,   0.,   0.])

Now we want to put this into a DataFrame. Here, each word will be a row. The variables will be the token itself, the number of times it appears in ham messages and the number of times it appears in spam messages. So, three columns (with the token used as the index).

In [123]:
#create a DataFrame with the words and the token counts
tokens = pd.DataFrame({'token': X_train_tokens, 'ham': ham_token_count, 'spam': spam_token_count}).set_index('token')

In [124]:
tokens.head()

Unnamed: 0_level_0,ham,spam
token,Unnamed: 1_level_1,Unnamed: 2_level_1
00,0.0,5.0
000,0.0,23.0
008704050406,0.0,2.0
0121,0.0,1.0
01223585236,0.0,1.0


In [125]:
#get a random selection of the rows in the DataFrame
tokens.sample(5, random_state=6)

Unnamed: 0_level_0,ham,spam
token,Unnamed: 1_level_1,Unnamed: 2_level_1
very,64.0,2.0
nasty,1.0,1.0
villa,0.0,1.0
beloved,1.0,0.0
textoperator,0.0,2.0


Now we want to develop a spamminess score. To do that we have to account for the class imbalance in order to give the words that appear in spam a greater weight. The other thing we have to do is avoid the possibility of dividing by zero. To do that we just add 1 to everything.

In [126]:
#add 1 to ham and spam counts to avoid dividing by zero. 
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1
tokens.sample(5, random_state=6)

Unnamed: 0_level_0,ham,spam
token,Unnamed: 1_level_1,Unnamed: 2_level_1
very,65.0,3.0
nasty,2.0,2.0
villa,1.0,2.0
beloved,2.0,1.0
textoperator,1.0,3.0


Then we have to normalize. We take the count of each token in a class and divide it by the number of messages in that class. This is the whole point of normalizing: a single spam message counts more toward the 'spamminess' of a token because there are fewer spam messages in the first place. We are going to vectorize this operation, and first we have to get the total counts for each of the classes.

In [139]:
spam = sum(y)
ham = len(y) - spam
print(spam)
print(ham)
print(spam + ham)

747
4825
5572


In [142]:
len(X_train) + len(X_test)

5572

Ok, I was barking up the wrong tree. To get the total number of members of each class we need to go to the nb model and get the class counts. Naive Bayes stores this information as an attribute of the fitted model. This is from the training data. 

In [152]:
nb.class_count_[0] + nb.class_count_[1]

4179.0

Which should be the length of the training data set.

In [153]:
len(y_train)

4179

So we could also get the number of messages in each class by doing the method I tried above but only on the training data.

In [154]:
spam = sum(y_train)
ham = len(y_train) - spam
print(spam)
print(ham)
print(spam + ham)

562
3617
4179


In [155]:
#Markham's method of getting the class counts
nb.class_count_

array([ 3617.,   562.])

In [156]:
#get frequencies
tokens['ham'] = tokens['ham'] / nb.class_count_[0] # he uses tokens.ham for calculation
tokens['spam'] = tokens['spam'] / nb.class_count_[1]

In [157]:
tokens.sample(5, random_state=6)

Unnamed: 0_level_0,ham,spam
token,Unnamed: 1_level_1,Unnamed: 2_level_1
very,0.017971,0.005338
nasty,0.000553,0.003559
villa,0.000276,0.003559
beloved,0.000553,0.001779
textoperator,0.000276,0.005338


Now we want a ratio for each word of its spamminess to its hamminess.

In [159]:
tokens['spam_ratio'] = tokens.spam / tokens.ham
tokens.sample(5, random_state=6)

Unnamed: 0_level_0,ham,spam,spam_ratio
token,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
very,0.017971,0.005338,0.297044
nasty,0.000553,0.003559,6.435943
villa,0.000276,0.003559,12.871886
beloved,0.000553,0.001779,3.217972
textoperator,0.000276,0.005338,19.307829
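
The 'very' row can be reproduced by hand from the raw counts (64 in ham, 2 in spam) and the training class sizes (3617 ham, 562 spam), all taken from the outputs in this notebook. The variable names are chosen for illustration:

```python
# Training-set class sizes reported by nb.class_count_ in this notebook
n_ham_messages = 3617
n_spam_messages = 562

# Raw counts for the token 'very': 64 times in ham, 2 times in spam
ham_count, spam_count = 64, 2

# Add 1 to each count so that a zero count never ends up in a denominator
ham_freq = (ham_count + 1) / n_ham_messages
spam_freq = (spam_count + 1) / n_spam_messages

# A ratio above 1 leans spam, a ratio below 1 leans ham
spam_ratio = spam_freq / ham_freq
print(round(ham_freq, 6), round(spam_freq, 6), round(spam_ratio, 6))
```

Dividing by the class sizes is what keeps the ratio from being swamped by the much larger number of ham messages.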


Now this is really cool! What you have here is the essence of naive Bayes. What naive Bayes does is learn something like the 'spam_ratio' column from the data and use it to make predictions! It takes the spam ratio for every word that occurs in a message, combines them (either by multiplying them or by adding their logs) and makes the decision based on how high the result is.
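
A rough sketch of that combining step follows. It uses the simplified ratios from this notebook rather than sklearn's exact internals (MultinomialNB normalizes by the total token count in each class and applies `alpha` smoothing, so its numbers differ), and the function name `spam_log_odds` is made up for illustration:

```python
import math

# Spam ratios copied from the tokens table in this notebook
spam_ratio = {'claim': 572.798932, 'prize': 489.131673, 'very': 0.297044}

# Prior odds of spam from the training class counts (562 spam, 3617 ham)
prior_log_odds = math.log(562 / 3617)

def spam_log_odds(tokens):
    # Multiplying ratios is the same as adding their logs;
    # tokens without a known ratio are skipped in this sketch
    score = prior_log_odds
    for token in tokens:
        if token in spam_ratio:
            score += math.log(spam_ratio[token])
    return score

for message in (['claim', 'prize'], ['very']):
    score = spam_log_odds(message)
    print(message, round(score, 2), 'spam' if score > 0 else 'ham')
```

A score above zero means the evidence from the words outweighs the prior that most messages are ham. Working with logs also avoids multiplying many tiny numbers together.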

Someone in the class asked about the trailing underscore on nb.class_count_ and whether that indicates it is an attribute. The answer is that the fact that there are no parentheses is what tells you it is an attribute. The trailing underscore tells you that the attribute is calculated by sklearn. Specifically, anything with a trailing underscore is something that is calculated by the `.fit()` method and so is something that does not exist until the `fit()` method is run.
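
This is easy to check with `hasattr`. The count matrix and labels below are tiny made-up values, used only to have something to fit:

```python
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical count matrix: 3 documents, 2 features
X = [[1, 0], [0, 2], [3, 1]]
y = [0, 1, 1]

nb = MultinomialNB()
exists_before_fit = hasattr(nb, 'class_count_')  # not set by the constructor

nb.fit(X, y)
exists_after_fit = hasattr(nb, 'class_count_')   # created by fit

print(exists_before_fit, exists_after_fit)
print(nb.class_count_)
```

Constructor parameters such as `alpha`, by contrast, have no trailing underscore and exist as soon as the object is instantiated.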

Now let's look at the tokens DataFrame to get more info about what makes a token spammy.

In [161]:
tokens.sort_values('spam_ratio', ascending=False)

Unnamed: 0_level_0,ham,spam,spam_ratio
token,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
claim,0.000276,0.158363,572.798932
prize,0.000276,0.135231,489.131673
150p,0.000276,0.087189,315.361210
tone,0.000276,0.085409,308.925267
guaranteed,0.000276,0.076512,276.745552
18,0.000276,0.069395,251.001779
cs,0.000276,0.065836,238.129893
www,0.000553,0.129893,234.911922
1000,0.000276,0.056940,205.950178
awarded,0.000276,0.053381,193.078292


So this is a good way to get an idea of what words make things spam. We can get the spam ratio for a single word using loc. 

In [162]:
tokens.loc['dating', 'spam_ratio']

83.667259786476862

The threshold for deciding whether a message is spam or ham is 0.5 by default and can only be changed manually. You can't set it in the predict method. You have to get the predicted probabilities and apply your own cutoff to them (he usually uses a .where clause; pandas?).  

Naive Bayes tends to be more accurate than logistic regression on small data sets.
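
One way to apply a custom cutoff, assuming NumPy's `np.where` is the tool (the lesson does not settle which function is used), is sketched below. The probabilities and the 0.3 threshold are made up for illustration:

```python
import numpy as np

# Hypothetical predicted probabilities of spam (column 1 of predict_proba)
y_pred_prob = np.array([0.003, 0.45, 0.62, 0.997, 0.30])

# Default behaviour: class 1 when the probability exceeds 0.5
default_class = np.where(y_pred_prob > 0.5, 1, 0)

# Custom, lower threshold: flags more messages as spam
threshold = 0.3
custom_class = np.where(y_pred_prob > threshold, 1, 0)

print(default_class)
print(custom_class)
```

Lowering the threshold trades false negatives for false positives, so the right value depends on which mistake is more costly.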

## Part 8: Creating a DataFrame from individual text files
Suppose you want to use your own text files and you have them in two different folders, say, ham and spam. How do you get them into a data frame so you can analyze them?

The steps are: 

- You have the text files in folders marked with the category names
- Use the glob.glob function to get a list of file names in the two folders
- Multiply the lists [0] and [1] by the lengths of the respective lists of file names
- Put the text of each file into a list of strings
- Build the DataFrame with a dictionary where the keys are column names and the values are the lists of labels and strings

Figure out the path to the spam and ham files. 

In [174]:
!pwd

/Users/michaelreinhardme.com/ds/markham/nlp


In [193]:
!ls ../MLtext2/MLtext2/data/ham_files

email1.txt email3.txt email5.txt


In [204]:
#glob to create a list of file names
import glob
ham_filenames = glob.glob('../MLtext2/MLtext2/data/ham_files/*.txt')
print(ham_filenames)
spam_filenames = glob.glob('../MLtext2/MLtext2/data/spam_files/*.txt')
print(spam_filenames)

['../MLtext2/MLtext2/data/ham_files/email1.txt', '../MLtext2/MLtext2/data/ham_files/email3.txt', '../MLtext2/MLtext2/data/ham_files/email5.txt']
['../MLtext2/MLtext2/data/spam_files/email2.txt', '../MLtext2/MLtext2/data/spam_files/email4.txt']


In [205]:
#read the file contents of one folder into a list of strings
ham_list = []
for file in ham_filenames:
    with open(file) as f:
        ham_list.append(f.read())
print(ham_list)

['This is a ham email.\nIt has 2 lines.\n', 'This is another ham email.\n', 'This is yet another ham email.\n']


In [206]:
spam_list = []
for file in spam_filenames:
    with open(file) as f:
        spam_list.append(f.read())
        
print(spam_list)

['This is a spam email.\n', 'This is another spam email.\n']


In [209]:
#concatenate the lists of strings
features = ham_list + spam_list
print(features)

['This is a ham email.\nIt has 2 lines.\n', 'This is another ham email.\n', 'This is yet another ham email.\n', 'This is a spam email.\n', 'This is another spam email.\n']


In [207]:
#build the list of target values: 0 for each ham file, 1 for each spam file
target_values = [0]*len(ham_list) + [1]*len(spam_list)
print(target_values)

[0, 0, 0, 1, 1]


In [216]:
#make a DataFrame
df = pd.DataFrame({'label':target_values, 'message':features})

In [217]:
df

Unnamed: 0,label,message
0,0,This is a ham email.\nIt has 2 lines.\n
1,0,This is another ham email.\n
2,0,This is yet another ham email.\n
3,1,This is a spam email.\n
4,1,This is another spam email.\n
