# Working with Text Data and Naive Bayes in scikit-learn

## Agenda

**Working with text data**

- Representing text as data
- Reading SMS data
- Vectorizing SMS data
- Examining the tokens and their counts
- Bonus: Calculating the "spamminess" of each token

**Naive Bayes classification**

- Building a Naive Bayes model
- Comparing Naive Bayes with logistic regression

## Part 1: Representing text as data

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
# start with a simple example
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!', 'help']

In [10]:
# learn the 'vocabulary' of the training data
vect = CountVectorizer()
vect.fit(simple_train)
# vect.get_feature_names_out()
vect.vocabulary_

{'cab': 0, 'call': 1, 'help': 2, 'me': 3, 'please': 4, 'tonight': 5, 'you': 6}
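
Notice that the vocabulary is lowercase, has no punctuation, and is missing the word "a". A minimal sketch of why, assuming the default `CountVectorizer` settings (lowercasing, and a token pattern that keeps only tokens of two or more word characters): `build_analyzer()` returns the function the vectorizer applies to each document. The variable names here are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenizing function used internally
analyzer = CountVectorizer().build_analyzer()

tokens_cab = analyzer('Call me a cab')
tokens_please = analyzer('please call me... PLEASE!')

print(tokens_cab)     # lowercased; the one-character token 'a' is dropped
print(tokens_please)  # punctuation is stripped; 'please' appears twice
```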

In [11]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

<4x7 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [6]:
# print the sparse matrix
print(simple_train_dtm)

  (0, 1)	1
  (0, 5)	1
  (0, 6)	1
  (1, 0)	1
  (1, 1)	1
  (1, 3)	1
  (2, 1)	1
  (2, 3)	1
  (2, 4)	2
  (3, 2)	1


In [14]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()

array([[0, 1, 0, 0, 0, 1, 1],
       [1, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 1, 2, 0, 0],
       [0, 0, 1, 0, 0, 0, 0]])

In [16]:
# examine the vocabulary and document-term matrix together
import pandas as pd
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names_out())

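| | cab | call | help | me | please | tonight | you |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 2 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |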


In [9]:
# create a document-term matrix on your own
simple_train = ["call call Sorry, Ill later", 
                "K Did you me call ah just now", 
                "I call you later, don't have network. If urgnt, sms me"]

In [10]:
#complete your work below
# instantiate vectorizer
# fit
# transform
# convert to dense matrix

vec2 = CountVectorizer(binary=True)
vec2.fit(simple_train)
my_dtm2 = vec2.transform(simple_train)

pd.DataFrame(my_dtm2.toarray(), columns=vec2.get_feature_names_out())

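| | ah | call | did | don | have | if | ill | just | later | me | network | now | sms | sorry | urgnt | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |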
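
The first message contains "call" twice, yet its `call` column holds 1. That is the effect of `binary=True`, which records presence or absence instead of a count. A small sketch comparing the two settings on that one message (the variable names are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["call call Sorry, Ill later"]

counts = CountVectorizer().fit(docs)            # default: raw token counts
flags = CountVectorizer(binary=True).fit(docs)  # binary: 1 if the token is present

call_col = counts.vocabulary_['call']
count_value = counts.transform(docs).toarray()[0, call_col]
flag_value = flags.transform(docs).toarray()[0, call_col]
print(count_value, flag_value)
```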


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
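
To make "ignoring the relative position information" concrete, here is a short sketch with two hypothetical sentences that use the same words in a different order. With single-word tokens they get identical vectors; adding bigrams through the `ngram_range` parameter recovers some of the word order.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['dog bites man', 'man bites dog']

# bag of words: only which words occur (and how often) matters
dtm = CountVectorizer().fit_transform(docs).toarray()
same_vector = (dtm[0] == dtm[1]).all()

# bag of n-grams: unigrams plus bigrams such as 'dog bites' and 'bites dog'
bigram_dtm = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs).toarray()
differ_with_bigrams = (bigram_dtm[0] != bigram_dtm[1]).any()

print(same_vector, differ_with_bigrams)
```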

In [10]:
vect.get_feature_names_out().tolist()

['cab', 'call', 'help', 'me', 'please', 'tonight', 'you']

In [11]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test = ["please don't call me devon"]
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

array([[0, 1, 0, 1, 1, 0, 0]])

In [12]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names_out())

| | cab | call | help | me | please | tonight | you |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |


**Summary:**

- `vect.fit(train)` learns the vocabulary of the training data
- `vect.transform(train)` uses the fitted vocabulary to build a document-term matrix from the training data
- `vect.transform(test)` uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)
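
One way to see which test tokens are ignored is to run the vectorizer's analyzer on the test message and check each token against `vocabulary_`. This is a sketch that rebuilds the simple example from Part 1; `vect_demo` and the other names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!', 'help']
test = "please don't call me devon"

vect_demo = CountVectorizer().fit(train)
analyzer = vect_demo.build_analyzer()

# tokens produced from the test message, and those missing from the fitted vocabulary
test_tokens = analyzer(test)
dropped = [t for t in test_tokens if t not in vect_demo.vocabulary_]

print(test_tokens)
print(dropped)
```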

## Part 2: Reading SMS data

In [13]:
# read tab-separated file
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv'
col_names = ['label', 'message']
sms = pd.read_table(url, sep='\t', header=None, names=col_names)
print(sms.shape)

(5572, 2)


In [14]:
sms.head(5)

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [15]:
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [16]:
# convert label to a numeric variable
sms['label'] = sms.label.map({'ham':0, 'spam':1})

In [17]:
# define X and y
X = sms.message
y = sms.label

In [21]:
# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(4179,) (4179,)
(1393,) (1393,)
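
With the default `test_size` of 0.25, the 5572 messages split into 4179 for training and 1393 for testing. The `stratify=y` argument asks for the same ham/spam proportions in both pieces, which matters because only about 13% of the messages are spam. The sketch below illustrates this on synthetic labels that merely reuse the class counts shown earlier; `X_demo` is a placeholder, not the real messages.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic labels with the same class counts as the SMS data: 4825 ham (0), 747 spam (1)
y_demo = np.array([0] * 4825 + [1] * 747)
X_demo = np.arange(len(y_demo))  # placeholder features, one per label

_, _, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1, stratify=y_demo)

# share of spam overall, in the training set, and in the testing set
overall_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(round(overall_rate, 4), round(train_rate, 4), round(test_rate, 4))
```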


## Part 3: Vectorizing SMS data

In [27]:
# instantiate the vectorizer
vect = CountVectorizer()

In [28]:
# learn training data vocabulary, then create document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm

<4179x7373 sparse matrix of type '<class 'numpy.int64'>'
	with 55328 stored elements in Compressed Sparse Row format>

In [29]:
# alternative: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<4179x7373 sparse matrix of type '<class 'numpy.int64'>'
	with 55328 stored elements in Compressed Sparse Row format>

In [30]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1393x7373 sparse matrix of type '<class 'numpy.int64'>'
	with 17393 stored elements in Compressed Sparse Row format>
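
The "stored elements" figure explains why scikit-learn returns a sparse matrix: almost every cell is zero. A quick sketch of the density calculation, first on the small Part 1 example and then using the numbers reported for `X_train_dtm` above (the variable names are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!', 'help']
dtm = CountVectorizer().fit_transform(docs)

# density = stored (non-zero) cells divided by all cells
n_rows, n_cols = dtm.shape
density = dtm.nnz / (n_rows * n_cols)
print(dtm.nnz, round(density, 3))

# the same arithmetic for the 4179 x 7373 training matrix with 55328 stored elements
sms_density = 55328 / (4179 * 7373)
print(round(sms_density, 4))
```

For the SMS training data, fewer than 0.2% of the cells are non-zero.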

## Part 4: Examining the tokens and their counts

In [31]:
# store token names
X_train_tokens = vect.get_feature_names_out().tolist()

In [32]:
# first 50 tokens
print(X_train_tokens[:50])

['00', '000', '000pes', '008704050406', '0089', '01223585236', '01223585334', '0125698789', '02', '0207', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07046744435', '07090298926', '07099833605', '07123456789', '0721072', '07732584351', '07734396839', '07742676969', '07753741225', '0776xxxxxxx', '07781482378', '07786200117', '077xxx', '07808', '07808247860', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705', '08000938767', '08001950382']


In [33]:
# last 50 tokens
print(X_train_tokens[-50:])

['yet', 'yetty', 'yetunde', 'yhl', 'yifeng', 'yijue', 'ym', 'ymca', 'yo', 'yoga', 'yogasana', 'yor', 'yorge', 'you', 'youdoing', 'youi', 'young', 'younger', 'youphone', 'your', 'youre', 'yourinclusive', 'yourjob', 'yours', 'yourself', 'youwanna', 'yowifes', 'yr', 'yrs', 'ything', 'yummmm', 'yummy', 'yun', 'yunny', 'yuo', 'yuou', 'yup', 'yupz', 'zaher', 'zebra', 'zed', 'zeros', 'zhong', 'zindgi', 'zoe', 'zoom', 'zouk', 'zyada', 'ú1', '〨ud']


In [34]:
# view X_train_dtm as a dense matrix
X_train_dtm.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [35]:
# count how many times EACH token appears across ALL messages in X_train_dtm
import numpy as np
X_train_counts = np.sum(X_train_dtm.toarray(), axis=0)
X_train_counts

array([ 7, 20,  1, ...,  1,  1,  1], dtype=int64)
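
Calling `toarray()` is fine at this size, but it materializes every zero. As a sketch of an alternative, the column sums can be taken on the sparse matrix directly; `np.asarray(...).ravel()` flattens the result into a one-dimensional array. The example uses the small Part 1 documents so it runs on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!', 'help']
dtm = CountVectorizer().fit_transform(docs)

# dense route (as above) versus summing the sparse matrix without densifying it
dense_counts = np.sum(dtm.toarray(), axis=0)
sparse_counts = np.asarray(dtm.sum(axis=0)).ravel()

print(sparse_counts)
```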

In [36]:
# create a DataFrame of tokens with their counts
pd.DataFrame({'token':X_train_tokens, 'count':X_train_counts}).sort_values(by='count', ascending=True)

| | count | token |
|---|---|---|
| 3686 | 1 | juan |
| 4123 | 1 | mailbox |
| 4120 | 1 | mahfuuz |
| 4119 | 1 | mahal |
| 4117 | 1 | magicalsongs |
| 4115 | 1 | maggi |
| 4114 | 1 | magazine |
| 4111 | 1 | madstini |
| 4110 | 1 | madoke |
| 4109 | 1 | madodu |


## Bonus: Calculating the "spamminess" of each token

In [29]:
# create separate DataFrames for ham and spam
sms_ham = sms[sms.label==0] # ham
sms_spam = sms[sms.label==1] # spam

In [30]:
# learn the vocabulary of ALL messages and save it
vect.fit(sms.message)
all_tokens = vect.get_feature_names_out().tolist()

In [31]:
# create document-term matrices for ham and spam
ham_dtm = vect.transform(sms_ham.message)
spam_dtm = vect.transform(sms_spam.message)

In [32]:
ham_dtm.shape, spam_dtm.shape

((4825, 8713), (747, 8713))

In [33]:
# count how many times EACH token appears across ALL ham messages
ham_counts = np.sum(ham_dtm.toarray(), axis=0)

In [34]:
ham_counts

array([0, 0, 1, ..., 1, 0, 1])

In [35]:
# count how many times EACH token appears across ALL spam messages
spam_counts = np.sum(spam_dtm.toarray(), axis=0)

In [36]:
spam_counts

array([10, 29,  0, ...,  0,  1,  0])

In [37]:
all_tokens[0:5]

['00', '000', '000pes', '008704050406', '0089']

In [38]:
# create a DataFrame of tokens with their separate ham and spam counts
token_counts = pd.DataFrame({'token':all_tokens, 'ham':ham_counts, 'spam':spam_counts})

In [39]:
token_counts

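| | ham | spam | token |
|---|---|---|---|
| 0 | 0 | 10 | 00 |
| 1 | 0 | 29 | 000 |
| 2 | 1 | 0 | 000pes |
| 3 | 0 | 2 | 008704050406 |
| 4 | 0 | 1 | 0089 |
| 5 | 0 | 1 | 0121 |
| 6 | 0 | 1 | 01223585236 |
| 7 | 0 | 2 | 01223585334 |
| 8 | 1 | 0 | 0125698789 |
| 9 | 0 | 8 | 02 |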


In [40]:
# add one to ham and spam counts to avoid dividing by zero (in the step that follows)
token_counts['ham'] = token_counts.ham + 1
token_counts['spam'] = token_counts.spam + 1

In [41]:
# calculate ratio of spam-to-ham for each token
token_counts['spam_ratio'] = token_counts.spam / token_counts.ham
token_counts.sort_values(by='spam_ratio', ascending=False)

| | ham | spam | token | spam_ratio |
|---|---|---|---|---|
| 2067 | 1 | 114 | claim | 114.000000 |
| 6113 | 1 | 94 | prize | 94.000000 |
| 352 | 1 | 72 | 150p | 72.000000 |
| 7837 | 1 | 61 | tone | 61.000000 |
| 369 | 1 | 52 | 18 | 52.000000 |
| 3688 | 1 | 51 | guaranteed | 51.000000 |
| 617 | 1 | 45 | 500 | 45.000000 |
| 2371 | 1 | 45 | cs | 45.000000 |
| 299 | 1 | 42 | 1000 | 42.000000 |
| 1333 | 1 | 39 | awarded | 39.000000 |
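
These ratios compare raw counts, even though there are far more ham messages (4825) than spam messages (747). One possible refinement, shown here only as a sketch and not part of the steps above, is to divide each count by the number of messages in its class before taking the ratio. The three rows are copied from the table above; the `*_freq` and `spam_ratio_normalized` column names are made up for this illustration.

```python
import pandas as pd

# smoothed (+1) counts copied from the first three rows of the table above
token_demo = pd.DataFrame({'token': ['claim', 'prize', '150p'],
                           'ham': [1, 1, 1],
                           'spam': [114, 94, 72]})
n_ham, n_spam = 4825, 747  # class sizes from sms.label.value_counts()

# per-message frequencies within each class, then their ratio
token_demo['ham_freq'] = token_demo.ham / n_ham
token_demo['spam_freq'] = token_demo.spam / n_spam
token_demo['spam_ratio_normalized'] = token_demo.spam_freq / token_demo.ham_freq
print(token_demo)
```

Because every ratio is multiplied by the same constant (4825 / 747), the ranking of tokens does not change. What changes is the interpretation: a normalized ratio of 1 means the token is equally common per message in ham and spam.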


In [43]:
# print the first five messages that contain the string 'claim' (case-sensitive match)
claim_messages = sms.message[sms.message.str.contains('claim')]

for message in claim_messages[0:5]:
    print(message, '\n')

WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only. 

Urgent UR awarded a complimentary trip to EuroDisinc Trav, Aco&Entry41 Or £1000. To claim txt DIS to 87121 18+6*£1.50(moreFrmMob. ShrAcomOrSglSuplt)10, LS1 3AJ 

You are a winner U have been specially selected 2 receive £1000 or a 4* holiday (flights inc) speak to a live operator 2 claim 0871277810910p/min (18+)  

PRIVATE! Your 2004 Account Statement for 07742676969 shows 786 unredeemed Bonus Points. To claim call 08719180248 Identifier Code: 45239 Expires 

Todays Voda numbers ending 7548 are selected to receive a $350 award. If you have a match please call 08712300220 quoting claim code 4041 standard rates app 



## Part 5: Building a Naive Bayes model

We will use [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
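
To see what the model learns from word counts, the sketch below fits `MultinomialNB` on a made-up 4 x 3 document-term matrix and reproduces its per-class token probabilities by hand. With smoothing parameter `alpha`, the estimate for a token within a class is (token count in class + alpha) / (total token count in class + alpha * number of tokens). The toy data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy document-term matrix: 4 documents, 3 tokens; labels 0 = ham, 1 = spam
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3],
                  [0, 0, 2]])
y_toy = np.array([0, 0, 1, 1])

alpha = 1.0
nb_toy = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# manual smoothed estimate of P(token | class) for each class
manual = []
for c in (0, 1):
    token_totals = X_toy[y_toy == c].sum(axis=0)
    manual.append((token_totals + alpha) / (token_totals.sum() + alpha * X_toy.shape[1]))
manual = np.array(manual)

# the fitted model stores the log of these probabilities in feature_log_prob_
matches = np.allclose(np.exp(nb_toy.feature_log_prob_), manual)
print(manual)
print(matches)
```

The smoothing plays the same role as adding one to the ham and spam counts in the bonus section: a token never seen in a class still gets a small non-zero probability.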

In [37]:
# train a Naive Bayes model using X_train_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [38]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [39]:
# calculate accuracy of class predictions
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.988513998564


In [41]:
print(metrics.classification_report(y_test, y_pred_class))

             precision    recall  f1-score   support

          0       0.99      1.00      0.99      1206
          1       0.98      0.94      0.96       187

avg / total       0.99      0.99      0.99      1393



In [43]:
metrics.confusion_matrix(y_test, y_pred_class)


array([[1202,    4],
       [  12,  175]])

In [47]:
?metrics.confusion_matrix

In [48]:
# confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

[[1202    4]
 [  12  175]]
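
In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, so for these 0/1 labels the layout is `[[TN, FP], [FN, TP]]`. As a check, the sketch below recomputes accuracy and the spam-class precision and recall from the four numbers printed above.

```python
import numpy as np

# confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[1202, 4],
               [12, 175]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
print(round(accuracy, 4), round(precision, 2), round(recall, 2))
```

The values agree with the accuracy score and with the row for class 1 in the classification report.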


In [49]:
# predict (poorly calibrated) probabilities
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([  7.82887542e-08,   3.02868734e-08,   1.38606514e-11, ...,
         1.00000000e+00,   1.00000000e+00,   2.62417931e-06])

In [50]:
# calculate AUC
print(metrics.roc_auc_score(y_test, y_pred_prob))

0.987471288832
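
AUC can be high even though the probabilities are poorly calibrated, because it depends only on how the scores rank the messages, not on their actual values. A small sketch with made-up labels and scores: applying a strictly increasing transformation changes every score but leaves the AUC untouched.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# hypothetical true labels and predicted scores
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.40, 0.35, 0.20, 0.80, 0.90, 0.99])

# a strictly increasing transformation: values change, ordering does not
distorted = scores ** 8

auc_original = roc_auc_score(y_true, scores)
auc_distorted = roc_auc_score(y_true, distorted)
print(auc_original, auc_distorted)
```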


In [51]:
# print message text for the false positives
X_test[y_test < y_pred_class]

5475    Dhoni have luck to win some big title.so we wi...
2173     Yavnt tried yet and never played original either
4557                              Gettin rdy to ship comp
4382               Mathews or tait or edwards or anderson
Name: message, dtype: object

In [52]:
# print message text for the false negatives
X_test[y_test > y_pred_class]

4213    Missed call alert. These numbers called but le...
3360    Sorry I missed your call let's talk when you h...
2575    Your next amazing xxx PICSFREE1 video will be ...
788     Ever thought about living a good life with a p...
5370    dating:i have had two of these. Only started a...
3530    Xmas & New Years Eve tickets are now on sale f...
2352    Download as many ringtones as u like no restri...
3742                                        2/2 146tf150p
2558    This message is brought to you by GMW Ltd. and...
4144    In The Simpsons Movie released in July 2007 na...
955             Filthy stories and GIRLS waiting for your
1638    0A$NETWORKS allow companies to bill for SMS, s...
Name: message, dtype: object

In [None]:
# what do you notice about the false negatives?
# X_test[3132]

## Part 6: Comparing Naive Bayes with logistic regression

In [None]:
# create a logistic regression model
# import/instantiate/fit


In [None]:
# class predictions and predicted probabilities


In [None]:
# calculate accuracy and AUC
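
As a starting point for the exercise, here is one possible shape for the comparison. It is only a sketch: it runs on a tiny made-up corpus rather than the SMS data, so its scores say nothing about the real results, and `LogisticRegression` is used with its default settings. Swapping in `X_train_dtm`, `y_train`, `X_test_dtm`, and `y_test` would apply the same pattern to the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# tiny hypothetical corpus standing in for the SMS data (1 = spam, 0 = ham)
train_text = ['win a free prize now', 'claim your free prize', 'free entry to win cash',
              'are you home tonight', 'call me when you land', 'see you at lunch']
train_y = [1, 1, 1, 0, 0, 0]
test_text = ['free cash prize', 'lunch tonight with you',
             'claim your prize now', 'call me at home']
test_y = [1, 0, 1, 0]

# fit the vocabulary on the training text only, then transform both sets
vect_demo = CountVectorizer()
train_dtm = vect_demo.fit_transform(train_text)
test_dtm = vect_demo.transform(test_text)

# same workflow for both models: fit, predict classes, predict probabilities, score
results = {}
for name, model in [('naive_bayes', MultinomialNB()),
                    ('logistic_regression', LogisticRegression())]:
    model.fit(train_dtm, train_y)
    pred_class = model.predict(test_dtm)
    pred_prob = model.predict_proba(test_dtm)[:, 1]
    results[name] = (metrics.accuracy_score(test_y, pred_class),
                     metrics.roc_auc_score(test_y, pred_prob))
    print(name, results[name])
```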
