## Goal: Train a Naive Bayes model to classify future SMS messages as either spam or ham.
<p style="color:blue">
Import pandas, numpy, the Natural Language Toolkit (nltk), and the scikit-learn (sklearn) modules for building the Naive Bayes model and measuring its performance
</p>

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
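# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split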
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score

<p style="color:blue">
Read the SMS training data into memory. The file is tab-separated and has two columns, which we name spam and txt
</p>

In [2]:
df = pd.read_csv("SMSSpamCollection", sep='\t', names=['spam', 'txt'])
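
<p style="color:blue">
As a rough illustration of the file layout described above, the sketch below parses two tab-separated records from an in-memory string with the standard csv module. The sample string and the names sample and rows are assumptions made for illustration; the notebook itself reads the real file with pd.read_csv
</p>

```python
import csv
import io

# Hypothetical two-line sample in the same layout: label<TAB>message text.
sample = "ham\tOk lar... Joking wif u oni...\nspam\tFree entry in 2 a wkly comp\n"

# Split each line on the tab character; QUOTE_NONE keeps quote marks as literal text.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [{"spam": label, "txt": text} for label, text in reader]

for row in rows:
    print(row["spam"], "|", row["txt"])
```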

<p style="color:blue">
Let us peek at the first few records in the dataframe
</p>

In [3]:
df.head()

|   | spam | txt |
|---|------|-----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


<p style="color:blue">
Convert the spam column from categorical labels to a numeric indicator using get_dummies (one-hot encoding), keeping only the 'spam' indicator column: 1 for spam, 0 for ham
</p>

In [4]:
df['spam'] = pd.get_dummies(df.spam)['spam']
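
<p style="color:blue">
To make the encoding step concrete, here is a minimal pure-Python sketch of what one-hot encoding does to a label column. The labels list is a made-up stand-in for the 'spam' column, and the names categories, one_hot and spam_flag are hypothetical; this is not how pandas implements get_dummies internally
</p>

```python
# Hypothetical labels standing in for the 'spam' column.
labels = ["ham", "ham", "spam", "ham"]

# One-hot encoding: one 0/1 indicator column per distinct category.
categories = sorted(set(labels))  # ['ham', 'spam']
one_hot = {c: [1 if label == c else 0 for label in labels] for c in categories}

# Keeping only the 'spam' indicator gives a single 0/1 target column.
spam_flag = one_hot["spam"]
print(spam_flag)
```

<p style="color:blue">
With only two classes the 'ham' indicator is redundant (it is always 1 minus the 'spam' indicator), which is why a single column is enough
</p>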

<p style="color:blue">
Let us again look at the records in the dataframe
</p>

In [5]:
df.head()

|   | spam | txt |
|---|------|-----|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


<p style="color:blue">
Initialize a TF-IDF vectorizer that lowercases the text, strips accents, and removes English stop words
</p>

In [6]:
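# Requires the NLTK stop-word corpus: nltk.download('stopwords')
# stopwords.words returns a list; recent scikit-learn versions expect stop_words as a list, not a set
stopset = stopwords.words('english')
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=stopset)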
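
<p style="color:blue">
The weighting behind the vectorizer can be sketched in plain Python. The example below assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that scikit-learn uses by default; the toy messages and the names docs, doc_freq and tfidf are hypothetical, and tokenization is taken as already done
</p>

```python
import math
from collections import Counter

# Hypothetical tokenized messages, already lowercased with stop words removed.
docs = [["free", "prize", "win"], ["call", "later"], ["free", "call"]]

n_docs = len(docs)
# Document frequency: number of messages that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized message."""
    counts = Counter(doc)
    # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(doc) for doc in docs]
print(vectors[0])
```

<p style="color:blue">
In the first message, 'prize' appears in one message overall while 'free' appears in two, so 'prize' receives the larger weight
</p>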

<p style="color:blue">
'spam' will be our dependent variable with values: 0 (not spam) or 1 (spam)
</p>

In [7]:
y = df.spam

<p style="color:blue">
Use the vectorizer's fit_transform to learn the vocabulary from 'txt' and convert it into a sparse matrix X of TF-IDF features
</p>

In [8]:
X = vectorizer.fit_transform(df.txt)

<p style="color:blue">
Let us check the number of observations and features
</p>

In [9]:
# print() with parentheses runs under both Python 2 and Python 3
print(y.shape)
print(X.shape)

(5572L,)
(5572, 8605)


<p style="color:blue">
Build test and training sets
</p>

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
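
<p style="color:blue">
The idea behind the split can be sketched as a seeded shuffle of row indices followed by a cut. The sketch assumes scikit-learn's default test size of 25 percent, rounded up; split_indices is a hypothetical helper, and its shuffle will not reproduce the exact rows that train_test_split selects
</p>

```python
import math
import random

def split_indices(n_samples, test_fraction=0.25, seed=42):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded shuffle for repeatability
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# 5572 is the number of observations reported above.
train_idx, test_idx = split_indices(5572)
print(len(train_idx), len(test_idx))  # 4179 1393
```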

<p style="color:blue">
Train a Naive Bayes Classifier
</p>

In [11]:
clf = naive_bayes.MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
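
<p style="color:blue">
For intuition, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier with Laplace smoothing (alpha = 1.0, as in the output above). It works on raw word counts from a made-up training set rather than the TF-IDF weights used in this notebook, and the names train, fit and predict are hypothetical, so treat it as an illustration of the technique and not as scikit-learn's implementation
</p>

```python
import math
from collections import Counter

# Hypothetical training data: (tokens, label) with 1 = spam, 0 = ham.
train = [
    (["free", "prize", "win"], 1),
    (["win", "free", "phone"], 1),
    (["call", "me", "later"], 0),
    (["see", "you", "later"], 0),
]
vocab = sorted({t for tokens, _ in train for t in tokens})
alpha = 1.0  # Laplace smoothing constant

def fit(data):
    """Estimate log priors and smoothed log likelihoods for each class."""
    model = {}
    for cls in {label for _, label in data}:
        docs = [tokens for tokens, label in data if label == cls]
        counts = Counter(t for tokens in docs for t in tokens)
        total = sum(counts.values())
        model[cls] = {
            "log_prior": math.log(len(docs) / len(data)),
            "log_like": {
                t: math.log((counts[t] + alpha) / (total + alpha * len(vocab)))
                for t in vocab
            },
        }
    return model

def predict(model, tokens):
    """Return the class with the highest posterior log score."""
    def score(cls):
        params = model[cls]
        # Words outside the training vocabulary are ignored.
        return params["log_prior"] + sum(
            params["log_like"][t] for t in tokens if t in vocab
        )
    return max(model, key=score)

model = fit(train)
print(predict(model, ["free", "phone"]))
```

<p style="color:blue">
Smoothing keeps a word that was never seen in one class from driving that class's probability to zero
</p>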

<p style="color:blue">
Evaluate the model on the test set with roc_auc_score (area under the ROC curve), using the predicted probability of the spam class
</p>

In [12]:
roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])

0.98558587451336732
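
<p style="color:blue">
One way to read this number: ROC AUC is the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message. The sketch below computes it by brute force over all such pairs, counting ties as half; roc_auc is a hypothetical helper and the labels and scores are made up
</p>

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted spam probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75: three of the four pairs are ordered correctly
```

<p style="color:blue">
Because it depends only on how the scores are ranked, this metric is computed from predict_proba rather than from hard 0/1 predictions
</p>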

<p style="color:blue">
Let us verify using a sample message
</p>

In [13]:
spam_array = np.array(['You have won a free mobile phone'])
# Reuse the fitted vectorizer so the message maps onto the training vocabulary
spam_vector = vectorizer.transform(spam_array)
print(clf.predict(spam_vector))

[ 1.]
