## Goal: Train a Naive Bayes model to classify future SMS messages as either spam or ham.

Steps:

1.  Convert the labels ham and spam to a binary indicator variable (0/1)

2.  Convert the message text (the txt column) to a sparse matrix of TF-IDF vectors

3.  Fit a Naive Bayes classifier

4.  Measure success on a held-out test set using roc_auc_score

In [18]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
# sklearn.cross_validation was removed from scikit-learn; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score

In [19]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r"D:\UIS-LEARN\Spring 2016/SMSSpamCollection", sep='\t', names=['spam', 'txt'])

In [22]:
df.head(10)

|   | spam | txt |
|---|------|-----|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
| 5 | 1 | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | 0 | Even my brother is not like to speak with me. ... |
| 7 | 0 | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | 1 | WINNER!! As a valued network customer you have... |
| 9 | 1 | Had your mobile 11 months or more? U R entitle... |


In [21]:
# Convert the labels "ham" and "spam" to a binary indicator variable (0/1).
# Assigning the mapped column back avoids relying on an in-place replace through attribute access.
df['spam'] = df['spam'].map({'ham': 0, 'spam': 1})

In [7]:
df.head()

|   | spam | txt |
|---|------|-----|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [9]:
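# TF-IDF vectorizer. The NLTK stopword list must be available (nltk.download('stopwords')).
# stop_words is passed as a list because recent scikit-learn versions reject a set.
stopset = list(stopwords.words('english'))
vectorizer = TfidfVectorizer(stop_words=stopset,
                             use_idf=True, lowercase=True, strip_accents='ascii')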

In [10]:
# The dependent variable is spam, where 1 is a spam SMS and 0 is ham.
y = df.spam

In [11]:
# Convert df.txt from raw text to TF-IDF features.
X = vectorizer.fit_transform(df.txt)

In [12]:
print(y.shape)
print(X.shape)

(5572,)
(5572, 8605)
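
To make the shape above more concrete, here is a minimal sketch of what `fit_transform` returns, run on a made-up three-message corpus rather than the SMS data. The names `toy_corpus`, `toy_vectorizer` and `toy_X` are hypothetical, and the vectorizer uses default settings (no stopword list), so it only approximates the configuration in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-message corpus, not taken from the SMS dataset.
toy_corpus = ["free entry win cup", "ok joking", "win free prize"]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_corpus)

# One row per message, one column per distinct term.
print(toy_X.shape)

# vocabulary_ maps each term to its column index; list the terms in column order.
print(sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get))

# A dense view is fine for a toy matrix; avoid it on the full 5572-row matrix.
print(toy_X.toarray().round(2))
```

Each row is a message and each column a term, so the 8605 columns in the real matrix correspond to the distinct terms left after stopword removal. With the default settings each row is scaled to unit length, and most entries are zero, which is why the result is stored as a sparse matrix.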


In [13]:
#build test and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
#Train naive bayes classifier
clf= naive_bayes.MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [15]:
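# ROC AUC on the test set (a ranking metric, not accuracy).
# roc_auc_score needs scores, so pass the predicted probability of class 1 (spam).
roc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print("Naive Bayes AUC = ", roc)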

Naive Bayes AUC =  0.984319201856
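
One way to read this number: ROC AUC equals the fraction of (spam, ham) pairs in which the spam message receives the higher score. The sketch below computes that fraction directly in plain Python on made-up labels and scores. `pairwise_auc`, `toy_labels` and `toy_scores` are hypothetical names, and this is an illustration of the interpretation, not how scikit-learn computes the metric internally.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (spam, ham) pairs where spam gets the higher score.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and spam probabilities, not output from the model above.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

Three of the four spam/ham pairs are ordered correctly, giving 0.75. This is also why the notebook passes `predict_proba` output rather than hard 0/1 predictions: the metric depends on how the scores rank the messages.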


An AUC of about 0.984 on the held-out test set suggests the model separates spam from ham well. Next, check the model on a single SMS: it should output 1 for spam or 0 for ham.

In [17]:
import numpy as np
sms_review_array= np.array(["Ok lar... Joking wif u oni..."])
sms_review_vector= vectorizer.transform(sms_review_array)
print(clf.predict(sms_review_vector))

[0]


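The model classifies the given SMS as not spam (0), which is correct. Note that this message is row 1 of the dataset shown above, so this is a sanity check rather than a test on unseen text.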
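
For classifying new messages, one possible refinement is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so that raw strings can be passed directly and the TF-IDF vocabulary is learned from training rows only (in the cells above, the vectorizer is fit on all of `df.txt` before the split). The sketch below assumes made-up training messages; `train_texts`, `train_labels`, `spam_pipeline` and `new_sms` are hypothetical names, and the vectorizer uses default settings rather than the stopword list used earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data (1 = spam, 0 = ham), not the SMS dataset.
train_texts = [
    "win a free prize now",
    "free entry to win cash",
    "ok see you at lunch",
    "are you coming home later",
]
train_labels = [1, 1, 0, 0]

# fit() learns the TF-IDF vocabulary and the Naive Bayes parameters
# from the training texts only.
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(train_texts, train_labels)

# Raw strings go straight in; the pipeline applies transform() before predicting.
new_sms = ["free cash prize", "see you later"]
preds = spam_pipeline.predict(new_sms)
spam_probs = spam_pipeline.predict_proba(new_sms)[:, 1]  # estimated P(spam)
print(preds)
print(spam_probs)
```

On the real data the same pattern would be `spam_pipeline.fit(txt_train, y_train)` after splitting `df.txt` and `df.spam`, where `txt_train` is again a hypothetical name for the training portion of the raw text.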