## Goal: Train a Naive Bayes model to classify future SMS messages as either spam or ham

Steps:

1.  Convert the labels ham and spam to a binary indicator variable (0/1)

2.  Convert the `txt` column to a sparse matrix of TF-IDF vectors

3.  Fit a Naive Bayes Classifier

4.  Measure your success using roc_auc_score



In [29]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score

In [73]:
df= pd.read_csv("SMSSpamCollection",sep='\t', names=['spam', 'txt'])

In [74]:
df.head()

Unnamed: 0,spam,txt
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [75]:
# Convert the values in the spam column to numeric labels (spam = 1, ham = 0).
# Equivalent explicit form:
#df.loc[df['spam'] == 'spam', 'spam'] = 1
#df.loc[df['spam'] == 'ham', 'spam'] = 0
df['spam'] = pd.get_dummies(df.spam)['spam'].astype(int)

In [79]:
df.head()

Unnamed: 0,spam,txt
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [80]:
# The dependent variable is spam: 1 for spam, 0 for ham
y = df.spam

In [81]:
# TF-IDF vectorizer, just like before
# (recent scikit-learn versions expect stop_words as a list, not a set)
stopset = sorted(set(stopwords.words('english')))
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=stopset)

In [82]:
# Convert df.txt from text to features
X = vectorizer.fit_transform(df.txt)

In [83]:
# Let's check the shapes
print(y.shape)
print(X.shape)

(5572L,)
(5572, 8605)
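
To make the TF-IDF step a bit more concrete, here is a minimal pure-Python sketch of the weighting on a toy corpus. It assumes the smoothed IDF formula that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. The `tfidf` helper and the toy messages are made up for illustration, and the plain whitespace tokenisation is a simplification, so the numbers will not match `TfidfVectorizer` exactly on real text.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]  # simplified tokeniser
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency (assumed default formula).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

# Hypothetical toy messages, not taken from the SMS dataset.
toy_docs = ["free prize win", "see you at home", "win a free prize now"]
vectors = tfidf(toy_docs)
```

In the third toy message, `now` appears in only one document while `free` appears in two, so `now` ends up with the larger weight. That is the whole point of the IDF term: rarer words count for more.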


In [84]:
# Let's split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [85]:
# Let's train the Naive Bayes classifier
clf = naive_bayes.MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
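
For intuition about what the classifier is doing, here is a rough pure-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing, matching the `alpha=1.0` shown above. It is a simplification, not scikit-learn's implementation: it works on raw word counts rather than TF-IDF weights, and `train_nb`, `predict_nb`, and the toy messages are hypothetical names and data chosen for illustration (Python 3).

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {t for d in docs for t in d.split()}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.split())
    model = {}
    for label, n in class_counts.items():
        total = sum(term_counts[label].values())
        log_prior = math.log(n / len(docs))
        # Additive smoothing keeps unseen words from getting probability 0.
        log_lik = {
            t: math.log((term_counts[label][t] + alpha) / (total + alpha * len(vocab)))
            for t in vocab
        }
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        # Words outside the training vocabulary are ignored.
        scores[label] = log_prior + sum(log_lik[t] for t in doc.split() if t in log_lik)
    return max(scores, key=scores.get)

# Hypothetical toy training set: 1 = spam, 0 = ham.
toy_docs = ["win free prize", "free lottery prize", "see you at home", "welcome home"]
toy_labels = [1, 1, 0, 0]
toy_model = train_nb(toy_docs, toy_labels)
```

With this toy model, `predict_nb(toy_model, "free prize")` returns 1 and `predict_nb(toy_model, "welcome home")` returns 0.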

In [86]:
# Let's evaluate the model with ROC AUC on the test set
roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])

0.98558587451336743
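
One way to read that number: ROC AUC is the probability that a randomly chosen spam message gets a higher predicted score than a randomly chosen ham message, with ties counted as half. The sketch below computes it by brute force over all positive/negative pairs. The `roc_auc` helper is a hypothetical name for illustration, and it is far slower than `roc_auc_score` on real data.

```python
def roc_auc(y_true, scores):
    """Pairwise ROC AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a misordered pair 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Small made-up example: three of the four pairs are ordered correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)  # 0.75
```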

In [87]:
# Let's try the model on two new messages

import numpy as np
spam_array = np.array(["welcome home", "You have won a lottery prize"])
spam_review_vector = vectorizer.transform(spam_array)
print(clf.predict(spam_review_vector))

[ 0.  1.]


We can see that "welcome home" is classified as ham (not spam), which is why its predicted value is 0, and "You have won a lottery prize" is classified as spam, which is why its predicted value is 1.