
Naive Bayes Lab - SMS Spam Classification
===============
originally developed by Ankit Jain

CLASS: Naive Bayes SMS spam classifier using sklearn

Data source: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

In [34]:
# Importing Packages 
import numpy as np
import pandas as pd

In [35]:
## READING IN THE DATA
df = pd.read_csv("data/sms.csv")

In [36]:
# examine the data
df.head(10)

Unnamed: 0,label,msg
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [37]:
df[df.label=='spam'].head(10)

Unnamed: 0,label,msg
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
5,spam,FreeMsg Hey there darling it's been 3 week's n...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...
11,spam,"SIX chances to win CASH! From 100 to 20,000 po..."
12,spam,URGENT! You have won a 1 week FREE membership ...
15,spam,"XXXMobileMovieClub: To use your credit, click ..."
19,spam,England v Macedonia - dont miss the goals/team...
34,spam,Thanks for your subscription to Ringtone UK yo...
42,spam,07732584351 - Rodger Burns - MSG = We tried to...


In [38]:
df.label.value_counts()

ham     4825
spam     747
dtype: int64

In [39]:
df.msg.describe()

count                       5572
unique                      5169
top       Sorry, I'll call later
freq                          30
Name: msg, dtype: object

In [40]:
# Convert the label into a binary variable (ham -> 0, spam -> 1)
# Remember the map function we learned before?
df['label'] = df.label.map({'ham': 0, 'spam': 1})

In [41]:
df.head()

Unnamed: 0,label,msg
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
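
As a quick reminder of how `map` behaves with a dictionary, here is a minimal sketch. The toy Series is invented for illustration and is not part of the SMS data. Any value missing from the dictionary becomes NaN, so it is worth checking for nulls after mapping.

```python
import pandas as pd

# Toy labels, invented for illustration; 'unknown' is deliberately not in the mapping
labels = pd.Series(['ham', 'spam', 'ham', 'unknown'])
mapped = labels.map({'ham': 0, 'spam': 1})

print(mapped.tolist())        # values not found in the dict become NaN
print(mapped.isnull().sum())  # number of labels that failed to map
```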


In [42]:
# split into training and testing sets using scikit-learn
# by default, 75% of the data goes to training and 25% to testing
# (sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.msg, df.label, random_state=1)
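
The default split can be checked on made-up data. This is a small sketch assuming a recent scikit-learn, where `train_test_split` lives in `sklearn.model_selection`. The arrays `X` and `y` are placeholders, not the SMS messages.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up samples standing in for messages and labels
X = np.arange(100)
y = np.array([0] * 85 + [1] * 15)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(len(X_train), len(X_test))  # default test_size is 0.25

# The same random_state reproduces exactly the same split
X_train2, X_test2, _, _ = train_test_split(X, y, random_state=1)
print(np.array_equal(X_test, X_test2))
```

Fixing `random_state` is what makes the train and test sets reproducible from one run of the lab to the next.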

In [43]:
print(X_train.shape)
print(X_train)

(4179L,)
[ '4mths half price Orange line rental & latest camera phones 4 FREE. Had your phone 11mths+? Call MobilesDirect free on 08000938767 to update now! or2stoptxt T&Cs'
 'Did you stitch his trouser'
 'Hope you enjoyed your new content. text stop to 61610 to unsubscribe. help:08712400602450p Provided by tones2you.co.uk'
 ...,
 'CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C YA 2MORO! WHO NEEDS BLOKES'
 'Text & meet someone sexy today. U can find a date or even flirt its up to U. Join 4 just 10p. REPLY with NAME & AGE eg Sam 25. 18 -msg recd@thirtyeight pence'
 'K k:) sms chat with me.']


In [44]:
X_test.shape

(1393L,)

Next, we convert the text into feature vectors that a machine learning model can use.
We will use scikit-learn's CountVectorizer to 'convert text into a matrix of token counts':

 http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [45]:
from sklearn.feature_extraction.text import CountVectorizer

In [46]:
# start with a simple example
train_simple = ['call you tonight',
                'Call me a cab',
                'please call me... PLEASE!']

In [47]:
# learn the 'vocabulary' of the training data
vect = CountVectorizer(decode_error = 'ignore')
vect.fit(train_simple)
vect.get_feature_names()

[u'cab', u'call', u'me', u'please', u'tonight', u'you']

In [48]:
# transform training data into a 'document-term matrix'
train_simple_dtm = vect.transform(train_simple)
train_simple_dtm.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]], dtype=int64)

In [49]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(train_simple_dtm.toarray(), columns=vect.get_feature_names())

Unnamed: 0,cab,call,me,please,tonight,you
0,0,1,0,0,1,1
1,1,1,1,0,0,0
2,0,1,1,2,0,0


In [50]:
# transform testing data into a document-term matrix (using existing vocabulary)
test_simple = ["please don't call me"]
test_simple_dtm = vect.transform(test_simple)
test_simple_dtm.toarray()
pd.DataFrame(test_simple_dtm.toarray(), columns=vect.get_feature_names())

Unnamed: 0,cab,call,me,please,tonight,you
0,0,1,1,1,0,0
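
The fit and transform steps above can be reproduced by hand. The sketch below is not scikit-learn's implementation. It assumes only CountVectorizer's default settings (lowercasing, and tokens of two or more word characters), and the helper names `tokenize`, `fit_vocabulary` and `transform` are hypothetical.

```python
import re
from collections import Counter

# CountVectorizer's default token pattern: runs of 2+ word characters
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_RE.findall(text.lower())

def fit_vocabulary(docs):
    """Return the sorted list of distinct tokens seen in docs."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def transform(docs, vocabulary):
    """Count vocabulary tokens per document; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[tok] for tok in vocabulary])
    return rows

train_simple = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
vocab = fit_vocabulary(train_simple)
print(vocab)
print(transform(train_simple, vocab))
print(transform(["please don't call me"], vocab))
```

Two details explain the matrices above: the single-character token 'a' is dropped, and 'don' (from "don't") is not in the training vocabulary, so it contributes nothing to the test row.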


In [21]:
# instantiate the vectorizer (use the variable name vect) and learn the vocabulary of X_train
vect = CountVectorizer(decode_error = 'ignore')
vect.fit(X_train)
vect.get_feature_names()

[u'00',
 u'000',
 u'008704050406',
 u'0121',
 u'01223585236',
 u'01223585334',
 u'0125698789',
 u'02',
 u'0207',
 u'02072069400',
 u'02073162414',
 u'02085076972',
 u'021',
 u'03',
 u'04',
 u'0430',
 u'05',
 u'050703',
 u'0578',
 u'06',
 u'07',
 u'07008009200',
 u'07090201529',
 u'07090298926',
 u'07123456789',
 u'07732584351',
 u'07734396839',
 u'07742676969',
 u'0776xxxxxxx',
 u'07781482378',
 u'07786200117',
 u'078',
 u'07801543489',
 u'07808',
 u'07808247860',
 u'07808726822',
 u'07815296484',
 u'07821230901',
 u'07880867867',
 u'0789xxxxxxx',
 u'07946746291',
 u'0796xxxxxx',
 u'07973788240',
 u'07xxxxxxxxx',
 u'08',
 u'0800',
 u'08000407165',
 u'08000776320',
 u'08000839402',
 u'08000930705',
 u'08000938767',
 u'08001950382',
 u'08002888812',
 u'08002986030',
 u'08002986906',
 u'08002988890',
 u'08006344447',
 u'0808',
 u'08081263000',
 u'08081560665',
 u'0825',
 u'083',
 u'0844',
 u'08448714184',
 u'0845',
 u'08450542832',
 u'08452810071',
 u'08452810073',
 u'08452810075over18',


In [51]:
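# transform training and testing data into document-term matrices
# (use the variable names train_dtm and test_dtm)
# NOTE: vect must be fitted on X_train before this cell runs. The outputs below have
# only 6 columns, which suggests vect was still fitted on train_simple at the time.
train_dtm = vect.transform(X_train)
test_dtm = vect.transform(X_test)
print(test_dtm)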

  (1, 2)	1
  (1, 5)	1
  (5, 2)	1
  (10, 5)	1
  (16, 5)	1
  (17, 2)	1
  (17, 5)	1
  (18, 1)	1
  (19, 3)	1
  (22, 5)	1
  (24, 5)	1
  (29, 5)	3
  (30, 5)	1
  (32, 2)	1
  (34, 5)	1
  (35, 5)	1
  (36, 5)	2
  (37, 1)	1
  (39, 5)	3
  (40, 1)	1
  (41, 2)	1
  (41, 5)	1
  (42, 5)	1
  (43, 5)	1
  (45, 5)	1
  :	:
  (1339, 5)	2
  (1341, 4)	1
  (1346, 5)	2
  (1347, 5)	2
  (1348, 5)	1
  (1349, 1)	1
  (1350, 5)	1
  (1352, 5)	5
  (1354, 5)	1
  (1357, 2)	1
  (1357, 5)	1
  (1362, 5)	2
  (1363, 3)	1
  (1371, 1)	1
  (1371, 5)	2
  (1374, 1)	1
  (1374, 5)	1
  (1376, 1)	1
  (1377, 1)	1
  (1377, 5)	1
  (1378, 5)	1
  (1382, 5)	2
  (1384, 1)	1
  (1389, 5)	1
  (1392, 2)	1
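
The printout above is a sparse matrix: each line is a `(row, column)` position followed by the count stored there, and zero entries are not stored at all. A minimal sketch with a made-up 3x3 matrix, assuming SciPy is available:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny dense document-term matrix, invented for illustration
dense = np.array([[0, 1, 0],
                  [0, 0, 0],
                  [2, 0, 3]])
sparse = csr_matrix(dense)

rows, cols = sparse.nonzero()
for r, c in zip(rows, cols):
    # Each stored entry corresponds to one "(row, col)  value" line in the printout
    print((int(r), int(c)), int(dense[r, c]))

print(sparse.nnz)  # number of stored non-zero entries
```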


In [52]:
# Get the number of features and the feature names
train_features = vect.get_feature_names()
len(train_features)

6

In [53]:
train_features[:50]

[u'cab', u'call', u'me', u'please', u'tonight', u'you']

In [25]:
train_features[-50:]

[u'yeovil',
 u'yep',
 u'yer',
 u'yes',
 u'yest',
 u'yesterday',
 u'yet',
 u'yetunde',
 u'yijue',
 u'ym',
 u'ymca',
 u'yo',
 u'yoga',
 u'yogasana',
 u'yor',
 u'yorge',
 u'you',
 u'youdoing',
 u'youi',
 u'youphone',
 u'your',
 u'youre',
 u'yourjob',
 u'yours',
 u'yourself',
 u'youwanna',
 u'yowifes',
 u'yoyyooo',
 u'yr',
 u'yrs',
 u'ything',
 u'yummmm',
 u'yummy',
 u'yun',
 u'yunny',
 u'yuo',
 u'yuou',
 u'yup',
 u'zac',
 u'zaher',
 u'zealand',
 u'zebra',
 u'zed',
 u'zeros',
 u'zhong',
 u'zindgi',
 u'zoe',
 u'zoom',
 u'zouk',
 u'zyada']

In [54]:
# convert train_dtm to a regular array
train_arr = train_dtm.toarray()
train_arr

array([[0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1],
       ..., 
       [0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0]], dtype=int64)

In [55]:

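# Revisit NumPy indexing and sums
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr[0, 0])             # element at row 0, column 0
print(arr[1, 3])             # element at row 1, column 3
print(arr[0, :])             # all of row 0
print(arr[:, 0])             # all of column 0
print(np.sum(arr))           # sum of every element
print(np.sum(arr, axis=0))   # column sums
print(np.sum(arr, axis=1))   # row sums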

1
8
[1 2 3 4]
[1 5]
36
[ 6  8 10 12]
[10 26]


In [56]:
# exercise: calculate the number of tokens in the 0th message in train_arr
print(np.sum(train_arr[0, :]))

1


In [57]:

# exercise: count how many times the 0th token appears across ALL messages in train_arr
print(np.sum(train_arr[:, 0]))

1


In [58]:
# exercise: count how many times EACH token appears across ALL messages in train_arr
print(np.sum(train_arr, axis=0))

[   1  443  601  103   47 1660]


In [None]:
# exercise: create a DataFrame of tokens with their counts.
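
One possible approach to this exercise, as a sketch rather than the intended solution: pair each token with its column sum. The token names and totals below are copied from the outputs shown above, and the name `token_df` is made up for illustration.

```python
import pandas as pd

# Token names and per-token totals copied from the outputs above
train_features = ['cab', 'call', 'me', 'please', 'tonight', 'you']
token_counts = [1, 443, 601, 103, 47, 1660]

# In the notebook these would be vect.get_feature_names() and np.sum(train_arr, axis=0)
token_df = pd.DataFrame({'token': train_features, 'count': token_counts})
print(token_df.sort_values('count', ascending=False))
```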


Now let's build the model with Naive Bayes

http://scikit-learn.org/stable/modules/naive_bayes.html

In [59]:
# train a Naive Bayes model using train_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [60]:
# make predictions on test data using test_dtm
preds = nb.predict(test_dtm)
preds

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [61]:
# compare predictions to true labels
from sklearn import metrics
print(metrics.accuracy_score(y_test, preds))
print(metrics.confusion_matrix(y_test, preds))
# confusion matrix: http://en.wikipedia.org/wiki/Confusion_matrix

0.877243359655
[[1204    4]
 [ 167   18]]
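
The accuracy alone hides how the errors are distributed. The sketch below works only from the four numbers in the confusion matrix printed above (rows are true labels, columns are predictions). The variable names are chosen here for illustration.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
tn, fp = 1204, 4    # true ham:  predicted ham, predicted spam
fn, tp = 167, 18    # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tn + tp) / float(total)
recall = tp / float(tp + fn)          # share of spam that was caught
precision = tp / float(tp + fp)       # share of spam predictions that were right
baseline = (tn + fp) / float(total)   # accuracy of always predicting ham

print(round(accuracy, 4))
print(round(recall, 4))
print(round(precision, 4))
print(round(baseline, 4))
```

Only 18 of the 185 spam messages in the test set were caught, and the accuracy is barely above what always predicting ham would achieve.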


In [62]:
# exercise: show the message text for the false positives
X_test[(y_test == 0) & (preds == 1)]

array(['You call him now ok i said call him', "I'm at home. Please call",
       "I'm at work. Please call",
       "Hi. I'm sorry i missed your call. Can you pls call back."], dtype=object)

In [63]:
# exercise: show the message text for the false negatives
X_test[y_test > preds]
# or
X_test[(y_test == 1) & (preds == 0)]

array([ "FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end",
       'Congrats! 2 mobile 3G Videophones R yours. call 09061744553 now! videochat wid ur mates, play java games, Dload polyH music, noline rentl. bx420. ip4. 5we. 150pm',
       'FREE MESSAGE Activate your 500 FREE Text Messages by replying to this message with the word FREE For terms & conditions, visit www.07781482378.com',
       'Someone has conacted our dating service and entered your phone because they fancy you!To find out who it is call from landline 09111030116. PoBox12n146tf15',
       "ree entry in 2 a weekly comp for a chance to win an ipod. Txt POD to 80182 to get entry (std txt rate) T&C's apply 08452810073 for details 18+",
       'Ur cash-balance is currently 500 pounds - to maximize ur cash-in now send GO to 86688 only 150p/msg. CC 08718720201 HG/Suite342/2Lands Row/W1J6HL',
       'This is the 2nd t

In [None]:
## USING ALL DATA AND CROSS-VALIDATION and run NB again
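
A minimal sketch of one way to approach this, assuming a recent scikit-learn with `sklearn.model_selection` and `sklearn.pipeline`. The toy messages and labels are invented for illustration. In the lab, `df.msg` and `df.label` would take their place, with more folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy messages invented for illustration; in the lab use df.msg and df.label
messages = ['win cash now', 'free prize call now', 'claim your free entry',
            'urgent win a free phone', 'see you tonight', 'call me later',
            'are you at home', 'lunch tomorrow ok', 'thanks for the lift',
            'running late sorry']
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The pipeline refits the vectorizer inside each fold, so the vocabulary
# never sees the held-out messages
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipe, messages, labels, cv=2, scoring='accuracy')
print(scores.mean())
```

Wrapping the vectorizer and the classifier in one pipeline is a design choice: fitting the vectorizer on all messages before cross-validating would let the held-out text influence the vocabulary.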


In [None]:
## EXERCISE: CALCULATE THE 'SPAMMINESS' OF EACH TOKEN

# create separate DataFrames for ham and spam (df_ham and df_spam)


In [None]:
# learn the vocabulary of ALL messages and save it


In [None]:
# create document-term matrix of ham, then convert to a regular array


In [None]:
# create document-term matrix of spam, then convert to a regular array


In [None]:
# count how many times EACH token appears across ALL messages in ham_arr


In [None]:
# count how many times EACH token appears across ALL messages in spam_arr


In [None]:
# create a DataFrame of tokens with their separate ham and spam counts


In [None]:
# add one to ham counts and spam counts so that ratio calculations (below) make more sense


In [None]:
# calculate ratio of spam-to-ham for each token
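
The steps of this exercise can be sketched end to end on toy data. This is one possible approach, not the intended solution. The messages are invented, and the names `ham_msgs`, `spam_msgs`, `token_counts` and `spam_ratio` are hypothetical. The token list is read from the fitted `vocabulary_` attribute so the sketch does not depend on which feature-name method a given scikit-learn version provides.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy data invented for illustration; in the lab use df_ham.msg and df_spam.msg
ham_msgs = ['call me tonight', 'see you at home', 'please call me']
spam_msgs = ['free cash call now', 'win free cash']

# Learn the vocabulary from ALL messages so both matrices share columns
vect = CountVectorizer()
vect.fit(ham_msgs + spam_msgs)
tokens = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

# Per-token totals in each class, plus one so no ratio divides by zero
ham_counts = vect.transform(ham_msgs).toarray().sum(axis=0) + 1
spam_counts = vect.transform(spam_msgs).toarray().sum(axis=0) + 1

token_counts = pd.DataFrame({'token': tokens, 'ham': ham_counts, 'spam': spam_counts})
token_counts['spam_ratio'] = token_counts.spam / token_counts.ham.astype(float)
print(token_counts.sort_values('spam_ratio', ascending=False))
```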


In [None]:
# advanced: implement your own naive bayes classifier
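
As a starting point, here is a minimal sketch of a multinomial Naive Bayes classifier with add-alpha smoothing, written with NumPy only. The class name `SimpleMultinomialNB` and the toy matrix are made up for illustration, and this is not scikit-learn's implementation. It scores each class as the log prior plus the count-weighted sum of log token likelihoods.

```python
import numpy as np

class SimpleMultinomialNB(object):
    """Minimal multinomial Naive Bayes with add-alpha smoothing (illustrative)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # log P(class): fraction of training documents in each class
        self.log_prior_ = np.log(np.array([np.mean(y == c) for c in self.classes_]))
        # log P(token | class): smoothed share of each token in the class's total count
        counts = np.array([X[y == c].sum(axis=0) for c in self.classes_]) + self.alpha
        self.log_likelihood_ = np.log(counts / counts.sum(axis=1, keepdims=True))
        return self

    def predict(self, X):
        # Score = log prior + sum over tokens of count * log likelihood
        scores = np.asarray(X, dtype=float).dot(self.log_likelihood_.T) + self.log_prior_
        return self.classes_[np.argmax(scores, axis=1)]

# Toy document-term matrix, columns: ['call', 'free', 'home']
X_toy = [[2, 0, 1], [1, 0, 2], [0, 3, 0], [1, 2, 0]]
y_toy = [0, 0, 1, 1]
model = SimpleMultinomialNB().fit(X_toy, y_toy)
print(model.predict([[0, 2, 0], [1, 0, 1]]))
```

Working in log space avoids multiplying many small probabilities together, which would underflow for long messages. To use it on the lab data, pass a dense array such as `train_dtm.toarray()`.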
