# DAT19 Lab 08
Cross Validation & Naive Bayes Lab - SMS Spam Classification
===============
* originally developed by Ankit Jain
* modified by Justin Breucop
* modified by Dylan Hercher

Data source: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

## Naive Bayes and SMS Spam Classification

We've already learned how to classify spam using logistic regression on email word frequencies, with relatively strong results. In this lab we apply Naive Bayes to SMS messages instead.

In [3]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [5]:
## READING IN THE DATA
df = pd.read_csv("../data/SMSSpamCollection.tsv", sep='\t', header=0, index_col=None)

In [6]:
# examine the data
df.head(3)

| | label | msg |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |


In [7]:
df[df.label=='spam'].head(10)

| | label | msg |
|---|---|---|
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |
| 11 | spam | SIX chances to win CASH! From 100 to 20,000 po... |
| 12 | spam | URGENT! You have won a 1 week FREE membership ... |
| 15 | spam | XXXMobileMovieClub: To use your credit, click ... |
| 19 | spam | England v Macedonia - dont miss the goals/team... |
| 34 | spam | Thanks for your subscription to Ringtone UK yo... |
| 42 | spam | 07732584351 - Rodger Burns - MSG = We tried to... |


In [8]:
df.label.value_counts()

ham     4825
spam     747
dtype: int64

In [9]:
df.msg.describe()

count                       5572
unique                      5169
top       Sorry, I'll call later
freq                          30
Name: msg, dtype: object

In [10]:
# Convert the label into a binary variable
# Remember the map function we learned before?
df['label'] = df.label.map({'ham': 0 , 'spam':1})
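
As a quick refresher on `map`, here is a minimal sketch on a toy Series (the values below are made up for illustration and are not taken from the SMS data). It shows one thing worth remembering: any value missing from the dictionary becomes `NaN`, so a typo in a label would silently produce missing values.

```python
import pandas as pd

# toy labels; 'unknown' is deliberately absent from the mapping below
labels = pd.Series(['ham', 'spam', 'ham', 'unknown'])
mapped = labels.map({'ham': 0, 'spam': 1})

# values that are not keys of the dictionary are mapped to NaN
print(mapped.tolist())
print(mapped.isnull().sum())
```

Checking `isnull().sum()` after a mapping like this is a cheap way to confirm that every label was covered.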

In [11]:
df.head()

| | label | msg |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [12]:
# split into training and testing sets with scikit-learn
# by default, 75% of the data goes to training and 25% to testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.msg, df.label, random_state=1)
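
To see the default 75/25 split in isolation, here is a small sketch on synthetic data (the arrays `X` and `y` are hypothetical stand-ins, not the SMS data). It also shows the optional `stratify` argument, which may be worth considering when one class is much rarer than the other, as spam is here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic samples, 20% of them in the positive class
X = np.arange(100)
y = (X < 20).astype(int)

# default split: 75% training, 25% testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
print(len(X_tr))
print(len(X_te))

# stratify=y keeps the class proportions the same in both parts
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, random_state=1, stratify=y)
print(y_te_s.mean())
```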

In [13]:
print(X_train.shape)
print(X_train)

(4179L,)
710     4mths half price Orange line rental & latest c...
3740                           Did you stitch his trouser
2711    Hope you enjoyed your new content. text stop t...
3155    Not heard from U4 a while. Call 4 rude chat pr...
3748    Ü neva tell me how i noe... I'm not at home in...
2389    wiskey Brandy Rum Gin Beer Vodka Scotch Shampa...
3464    i am seeking a lady in the street and a freak ...
772     Lol! U drunkard! Just doing my hair at d momen...
3667    I'm turning off my phone. My moms telling ever...
4955    U coming back 4 dinner rite? Dad ask me so i r...
854     AH POOR BABY!HOPE URFEELING BETTERSN LUV! PROB...
4079                  Gam gone after outstanding innings.
2837                         Nice.nice.how is it working?
1392                  Haha just kidding, papa needs drugs
5533    Hey chief, can you give me a bell when you get...
874     Ugh its been a long day. I'm exhausted. Just w...
4408    Awesome, plan to get here any time after like ...
3990 

In [14]:
X_test.shape

(1393L,)

Now we need to convert the text into feature vectors that can be used for machine learning.
We will use scikit-learn's CountVectorizer to 'convert text into a matrix of token counts':

 http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

#### Let's try a simple example

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
# start with a simple example
train_simple = ['call you tonight',
                'Call me a cab',
                'please call me... PLEASE!']

In [17]:
# learn the 'vocabulary' of the training data
vect = CountVectorizer(decode_error = 'ignore')
vect.fit(train_simple)
# note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
vect.get_feature_names()

[u'cab', u'call', u'me', u'please', u'tonight', u'you']

In [18]:
# transform training data into a 'document-term matrix'
train_simple_dtm = vect.transform(train_simple)
train_simple_dtm.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]], dtype=int64)
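
A short aside on what `transform` returns: the document-term matrix is a SciPy sparse matrix, which stores only the non-zero counts, and `toarray()` expands it into a dense array. The sketch below repeats the three toy sentences so that it runs on its own; it uses the fitted `vocabulary_` attribute to look up a column index.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

docs = ['call you tonight',
        'Call me a cab',
        'please call me... PLEASE!']

vect_demo = CountVectorizer()
dtm = vect_demo.fit_transform(docs)   # fit and transform in one step

print(sp.issparse(dtm))    # the result is a sparse matrix
print(dtm.shape)           # (documents, vocabulary size)
print(dtm.nnz)             # number of stored non-zero entries

# vocabulary_ maps each token to its column index
call_col = vect_demo.vocabulary_['call']
dense = dtm.toarray()
print(dense[:, call_col])  # count of 'call' in each document
```

Sparse storage matters for the full SMS data, where each message uses only a handful of the thousands of tokens in the vocabulary.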

In [19]:
# We can see how we've adjusted our data easily!
# examine the vocabulary and document-term matrix together
print(train_simple)

pd.DataFrame(train_simple_dtm.toarray(), columns=vect.get_feature_names())

['call you tonight', 'Call me a cab', 'please call me... PLEASE!']


| | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 2 | 0 | 0 |


In [20]:
# transform testing data into a document-term matrix (using existing vocabulary)
test_simple = ["please don't call me"]
test_simple_dtm = vect.transform(test_simple)

test_simple_dtm.toarray()
pd.DataFrame(test_simple_dtm.toarray(), columns=vect.get_feature_names())

| | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 |


#### Question:  How does the above test_simple show how things can go wrong?

#### Exercise: Using the dataset below
   * Vectorize the text
   * Store the results in a DataFrame
   * Show word counts (hint: one dataframe describer can do this)
   * Transform the test text

In [22]:
train_exp = ['where is my taco?',
                'did I eat the taco',
                'I can easily eat my way through that whole box of tacos!',
                'I think way too much about tacos, huh',
                'taco, taco, taco!!!'                
               ]
test_exp = [
    'where did he go?', 'how long did the whole thing last', 'lets go eat one taco or multiple tacos'
]

Vectorize the text

In [28]:
vect = CountVectorizer(decode_error = 'ignore')


Store the results in a DataFrame

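| | about | box | can | did | easily | eat | huh | is | much | my | ... | taco | tacos | that | the | think | through | too | way | where | whole |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |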


Show word counts (hint: one dataframe describer can do this)

about      1
box        1
can        1
did        1
easily     1
eat        2
huh        1
is         1
much       1
my         2
of         1
taco       5
tacos      2
that       1
the        1
think      1
through    1
too        1
way        2
where      1
whole      1
dtype: int64

Transform the test text

array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

## Vectorizing our SMS Dataset

Returning to our SMS Spam Dataset

In [None]:
X_train

In [None]:
# instantiate the vectorizer ( use variable name as vect)
vect = CountVectorizer(decode_error = 'ignore')
vect.fit(X_train)
vect.get_feature_names()

In [None]:
# transform training and testing data into document-term matrices (train_dtm and test_dtm)
train_dtm = vect.transform(X_train)
test_dtm = vect.transform(X_test)
print(test_dtm)

In [None]:
# get the feature names and count them
train_features = vect.get_feature_names()
len(train_features)

In [None]:
train_features[:50]

In [None]:
train_features[-50:]

In [None]:
# convert train_dtm to a regular array
train_arr = train_dtm.toarray()
train_arr

In [None]:

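# Revisit NumPy
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr[0, 0])               # single element: row 0, column 0
print(arr[1, 3])               # single element: row 1, column 3
print(arr[0, :])               # entire first row
print(arr[:, 0])               # entire first column
print(np.sum(arr))             # sum of all elements
print(np.sum(arr, axis=0))     # column sums
print(np.sum(arr, axis=1))     # row sums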




In [None]:
# exercise: calculate the number of tokens in the 0th message in train_arr


In [None]:

# exercise: count how many times the 0th token appears across ALL messages in train_arr


In [None]:
# exercise: count how many times EACH token appears across ALL messages in train_arr


In [34]:
# exercise: create a DataFrame of tokens with their counts.


Now let's build the model with Naive Bayes

http://scikit-learn.org/stable/modules/naive_bayes.html

In [None]:
# train a Naive Bayes model using train_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(train_dtm, y_train)

In [None]:
# make predictions on test data using test_dtm
preds = nb.predict(test_dtm)
preds

In [None]:
# compare predictions to true labels
from sklearn import metrics

print(metrics.accuracy_score(y_test, preds))
print(metrics.confusion_matrix(y_test, preds))
# confusion matrix: http://en.wikipedia.org/wiki/Confusion_matrix
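
For reading a confusion matrix, the sketch below uses small made-up label lists (`y_true_demo` and `y_pred_demo` are hypothetical, not model output from this lab). For binary labels 0 and 1, scikit-learn puts true classes in rows and predicted classes in columns, so `ravel()` yields the four cells in the order TN, FP, FN, TP.

```python
from sklearn import metrics

# hypothetical labels: 0 = ham, 1 = spam
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0]

cm = metrics.confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# rows are true classes, columns are predicted classes
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)

# accuracy alone can hide errors on the rare class, so look at these too
print(metrics.accuracy_score(y_true_demo, y_pred_demo))
print(metrics.precision_score(y_true_demo, y_pred_demo))  # tp / (tp + fp)
print(metrics.recall_score(y_true_demo, y_pred_demo))     # tp / (tp + fn)
```

With spam making up a small share of the messages, precision and recall on the spam class are often more informative than overall accuracy.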

In [None]:
# exercise: show the message text for the false positives


In [None]:
# exercise: show the message text for the false negatives


In [35]:
## USING ALL DATA AND CROSS-VALIDATION, run NB again
df = pd.read_csv("../data/SMSSpamCollection.tsv", sep='\t', header=0, index_col=None)
df.label = df.label.map({'ham': 0 , 'spam':1})

X_train, X_test, y_train, y_test = train_test_split(df.msg, df.label, random_state=1)

vect = CountVectorizer(decode_error = 'ignore')

vect.fit(X_train)
vect.get_feature_names()

train_dtm = vect.transform(X_train)
test_dtm = vect.transform(X_test)

In [36]:
df.head(1)

| | label | msg |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |


In [37]:
from sklearn.model_selection import cross_val_score
nb = MultinomialNB()

vect = CountVectorizer(decode_error='ignore')

vect.fit(df.msg)

X_dtm = vect.transform(df.msg)
y = df.label

cross_val_score(nb, X_dtm, y, cv=5)

array([ 0.98026906,  0.98026906,  0.97845601,  0.98114901,  0.97935368])
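
One thing to notice in the cell above is that the vectorizer is fit on every message before cross-validation, so each fold's vocabulary includes tokens from its held-out messages. A possible alternative, sketched here on a handful of made-up messages (the `msgs` and `labels` lists are hypothetical), is to wrap both steps in a pipeline so the vocabulary is re-learned from the training part of each fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical toy messages: 1 = spam, 0 = ham
msgs = ['win a free prize now', 'free entry to win cash',
        'urgent claim your free prize', 'call now to win cash',
        'free cash prize call now', 'urgent win free entry',
        'are you coming home tonight', 'see you at lunch',
        'ok call me later', 'did you eat dinner',
        'i will be home later', 'lunch tomorrow sounds good']
labels = [1] * 6 + [0] * 6

# the pipeline fits the vectorizer inside each fold, on training text only
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipe, msgs, labels, cv=3)

print(scores)
print(scores.mean())
```

On the real data the same pattern would take `df.msg` and `df.label` in place of the toy lists; the scores may differ slightly from those shown above.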

In [38]:
## EXERCISE: CALCULATE THE 'SPAMMINESS' OF EACH TOKEN

# create separate DataFrames for ham and spam ( df_ham and df_spam)
df_ham = df[df.label==0]
df_spam = df[df.label==1]

In [39]:
# learn the vocabulary of ALL messages and save it
vect.fit(df.msg)
all_features = vect.get_feature_names()

In [40]:
# create document-term matrix of ham, then convert to a regular array
ham_dtm = vect.transform(df_ham.msg)
ham_arr = ham_dtm.toarray()

In [None]:
# create document-term matrix of spam, then convert to a regular array
spam_dtm = vect.transform(df_spam.msg)
spam_arr = spam_dtm.toarray()

In [None]:
# count how many times EACH token appears across ALL messages in ham_arr
ham_counts = np.sum(ham_arr, axis=0)
ham_counts

In [None]:
# count how many times EACH token appears across ALL messages in spam_arr
spam_counts = np.sum(spam_arr, axis=0)
spam_counts

In [None]:
# create a DataFrame of tokens with their separate ham and spam counts
all_token_counts = pd.DataFrame({'token':all_features, 'ham':ham_counts, 'spam':spam_counts})
all_token_counts.head()

In [None]:
# add one to ham counts and spam counts so that ratio calculations (below) make more sense
all_token_counts['ham'] = all_token_counts.ham + 1
all_token_counts['spam'] = all_token_counts.spam + 1
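
To see why the add-one step helps, here is a small sketch with made-up counts (the arrays below are hypothetical and are not computed from the SMS data). A token that never appears in ham would give a spam-to-ham ratio of infinity, while adding one to both counts keeps every ratio finite and comparable.

```python
import numpy as np

# hypothetical counts for three tokens
ham_demo = np.array([0, 5, 20])
spam_demo = np.array([7, 5, 0])

# without smoothing, a token absent from ham divides by zero
with np.errstate(divide='ignore'):
    raw_ratio = spam_demo / ham_demo.astype(float)
print(raw_ratio)

# add one to both counts before taking the ratio
smoothed_ratio = (spam_demo + 1.0) / (ham_demo + 1.0)
print(smoothed_ratio)
```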

In [None]:
# calculate ratio of spam-to-ham for each token
all_token_counts['spam_ratio'] = all_token_counts.spam / all_token_counts.ham
all_token_counts.sort_values(by='spam_ratio', ascending=False)

In [None]:
# advanced: implement your own naive bayes classifier
