# Goal: Train a Naive Bayes model to classify future SMS messages as either spam or ham.

Steps:

1.  Convert the labels ham and spam to a binary indicator variable (0/1)

2.  Convert the message text to a sparse matrix of TF-IDF vectors

3.  Fit a Naive Bayes Classifier

4.  Measure your success using roc_auc_score



In [48]:
import pandas as pd
import nltk
import math
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
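from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection
# replaces it and is aliased here so later cells keep working unchanged.
from sklearn import model_selection as cross_validation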

In [22]:
df = pd.read_csv("SMSSpamCollection", sep='\t', names=['spam', 'txt'])

In [23]:
#run this only once
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/JG/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [24]:
df.head()

|   | spam | txt |
|---|------|-----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


#### Convert Ham and Spam to 0/1 as binary indicator variables.

In [25]:
df['spam'] = pd.get_dummies(df.spam)['spam'] #ham = 0, spam = 1
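The `get_dummies` call above depends on the pandas version (newer releases return boolean columns by default) and fails if the cell is run a second time, because the 'spam' column no longer holds the strings. As an alternative, here is a minimal sketch of an explicit mapping. The `toy` frame and `label_map` are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the SMS data.
toy = pd.DataFrame({"spam": ["ham", "spam", "ham"], "txt": ["a", "b", "c"]})

# Explicit mapping: ham -> 0, spam -> 1. Any unexpected label becomes NaN,
# which makes bad rows easy to spot.
label_map = {"ham": 0, "spam": 1}
toy["spam"] = toy["spam"].map(label_map)

print(toy["spam"].tolist())  # [0, 1, 0]
```

The mapping yields integer labels rather than the 0.0/1.0 floats shown in this notebook's output; either form works as a classification target.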

In [29]:
df.head()

|   | spam | txt |
|---|------|-----|
| 0 | 0.0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0.0 | Ok lar... Joking wif u oni... |
| 2 | 1.0 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0.0 | U dun say so early hor... U c already then say... |
| 4 | 0.0 | Nah I don't think he goes to usf, he lives aro... |


Let's split off our dependent variable. y will be the 'spam' column, where 0 = ham and 1 = spam.

In [31]:
y = df.spam

##### Convert the text to a sparse matrix of TF-IDF vectors.

In [32]:
stopset = set(stopwords.words('english'))

In [33]:
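# Convert the collection of raw documents into a matrix of TF-IDF features
# built from unigrams and bigrams. Newer scikit-learn releases expect
# stop_words as a list, so the set is converted.
vectorizer = TfidfVectorizer(stop_words=list(stopset),
                             use_idf=True, ngram_range=(1, 2))
# Learn the vocabulary and IDF weights, and return the document-term matrix.
X = vectorizer.fit_transform(df.txt)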

In [35]:
X.shape

(5572, 40222)
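Each of the 5572 rows is one message and each of the 40222 columns is one unigram or bigram weight. To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF, assuming scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization). The toy `docs` and the helper `tfidf` are hypothetical and cover unigrams only.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized toy corpus (unigrams only).
docs = [["free", "prize"], ["free", "lunch"], ["lunch", "today"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return the L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf(docs[0])
print(vec)
```

In the first document, "prize" appears in fewer documents than "free", so it receives the larger weight.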

In [36]:
y.shape

(5572,)

##### Fit a Naive Bayes classifier

In [40]:
X_train, X_test,y_train, y_test = train_test_split(X, y, random_state=45)
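With no `test_size` given, `train_test_split` holds out 25% of the rows. Spam datasets are usually imbalanced, so it can help to keep the class ratio the same in both parts. The sketch below shows the `stratify` option on hypothetical toy arrays (`X_toy`, `y_toy`, with a made-up 13% positive rate); it is a possible variation, not what the original cell does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 13 of them labeled 1 (spam).
y_toy = np.array([0] * 87 + [1] * 13)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the spam share roughly equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=45
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```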

In [41]:
clf = naive_bayes.MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
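The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter. As a rough illustration of what the classifier computes, here is a minimal pure-Python sketch of multinomial Naive Bayes scoring. The term counts, priors, and vocabulary are hypothetical, and the sketch uses integer counts rather than the TF-IDF weights used in this notebook.

```python
import math

# Hypothetical per-class term counts gathered from training messages.
counts = {
    "ham": {"free": 1, "lunch": 4, "win": 0},
    "spam": {"free": 5, "lunch": 0, "win": 3},
}
priors = {"ham": 0.5, "spam": 0.5}  # assumed equal class priors
vocab = ["free", "lunch", "win"]
alpha = 1.0  # additive smoothing, so unseen terms never get probability zero

def log_posterior(cls, message_counts):
    """Unnormalized log P(class | message) under the multinomial model."""
    total = sum(counts[cls].values())
    score = math.log(priors[cls])
    for term, n in message_counts.items():
        p_term = (counts[cls][term] + alpha) / (total + alpha * len(vocab))
        score += n * math.log(p_term)
    return score

message = {"free": 1, "win": 1}
predicted = max(counts, key=lambda cls: log_posterior(cls, message))
print(predicted)  # spam
```

Without smoothing, the zero count for "win" under ham would make the ham score undefined (log of zero); with `alpha = 1.0` it is merely small.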

In [42]:
roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])

0.99059899637934312
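ROC AUC is computed from the predicted probability of the positive class, which is why the call uses column 1 of `predict_proba` rather than hard predictions. One way to read the number: it is the probability that a randomly chosen spam message gets a higher score than a randomly chosen ham message. The sketch below computes that directly on hypothetical toy scores; the helper `roc_auc` is illustrative and is not how scikit-learn implements the metric.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75.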

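Let's run k-fold cross-validation with 50 folds. cross_val_score reports accuracy by default for a classifier, so the scores below are accuracies, not ROC AUC values.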

In [50]:
scores = cross_validation.cross_val_score(clf, X, y, cv=50)

In [51]:
scores

array([ 0.94642857,  0.97321429,  0.96428571,  1.        ,  0.97321429,
        0.96428571,  0.95535714,  0.96428571,  0.97321429,  0.96428571,
        0.96428571,  0.95535714,  0.9375    ,  0.97321429,  0.96428571,
        0.95535714,  0.98214286,  0.96428571,  0.97321429,  0.98214286,
        0.96428571,  0.94642857,  0.97321429,  0.96428571,  0.96428571,
        0.94594595,  0.94594595,  0.95495495,  0.98198198,  0.95495495,
        0.94594595,  0.94594595,  1.        ,  0.95495495,  0.94594595,
        0.97297297,  0.97297297,  0.96396396,  0.95495495,  0.96396396,
        0.93693694,  1.        ,  0.95495495,  0.98198198,  0.93693694,
        0.97297297,  0.99099099,  1.        ,  0.93636364,  0.96363636])
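Two details of the run above are worth noting: the TF-IDF vocabulary was learned from the full dataset before the folds were formed, and the default scorer for a classifier is accuracy. The following is a hedged sketch of one possible variation that puts the vectorizer inside a pipeline (so each fold fits it on its own training messages) and asks for ROC AUC. The toy `texts` and `labels` are hypothetical stand-ins for `df.txt` and `df.spam`, and 4 folds are used only because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for df.txt and df.spam.
texts = [
    "win a free prize now", "free entry to win cash",
    "claim your free cash prize", "urgent call now to win",
    "see you at lunch", "are you home yet",
    "call me when you get home", "lunch tomorrow sounds good",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is part of the model, so it is refit inside every fold.
model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    MultinomialNB(),
)

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=45)
auc_scores = cross_val_score(model, texts, labels, cv=cv, scoring="roc_auc")
print(auc_scores.mean())
```

On the real data, the same pattern would take `df.txt` and `df.spam` in place of the toy lists.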

In [52]:
mean_score = scores.mean()
std_dev = scores.std()
std_error = std_dev / math.sqrt(scores.shape[0])
# 2.262 is the two-sided 95% t critical value for 9 degrees of freedom (10 scores).
# With 50 folds (49 degrees of freedom) the critical value is about 2.01,
# so this interval is slightly wider than necessary.
ci = 2.262 * std_error
lower_bound = mean_score - ci
upper_bound = mean_score + ci

print("Score is %f +/-  %f" % (mean_score, ci))
print("Approximate 95 percent confidence interval for the mean fold score: %f to %f" % (lower_bound, upper_bound))

Score is 0.964461 +/-  0.005309
Approximate 95 percent confidence interval for the mean fold score: 0.959152 to 0.969769
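The multiplier 2.262 used above is the t critical value for 10 samples; for 50 fold scores the value is closer to 2.01. A minimal sketch of looking the value up instead of hard-coding it follows, assuming SciPy is available. The helper `mean_confidence_interval` is hypothetical, and it uses the sample standard deviation (`ddof=1`), which differs slightly from the `scores.std()` call in the cell above.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(scores, confidence=0.95):
    """Return (mean, lower, upper) for a t-based interval on the mean score."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    mean = scores.mean()
    std_error = scores.std(ddof=1) / np.sqrt(n)
    # Two-sided critical value with n - 1 degrees of freedom.
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    half_width = t_crit * std_error
    return mean, mean - half_width, mean + half_width

t9 = stats.t.ppf(0.975, df=9)    # critical value for 10 scores
t49 = stats.t.ppf(0.975, df=49)  # critical value for 50 scores
print(round(t9, 3), round(t49, 3))
```

Calling `mean_confidence_interval(scores)` on the 50 fold scores would give the interval directly. Because the folds share most of their training data, the scores are not independent, so any such interval is best treated as a rough guide.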
