# Working with Text Data and Naive Bayes in scikit-learn

## Agenda

**Working with text data**

- Representing text as data
- Reading SMS data
- Vectorizing SMS data
- Examining the tokens and their counts
- Bonus: Calculating the "spamminess" of each token

**Naive Bayes classification**

- Building a Naive Bayes model
- Comparing Naive Bayes with logistic regression

## Part 1: Representing text as data

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
# start with a simple example
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [4]:
# learn the 'vocabulary' of the training data
vect = CountVectorizer()
vect.fit(simple_train)
vect.get_feature_names()

[u'cab', u'call', u'me', u'please', u'tonight', u'you']
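
To make the fitting step concrete, here is a rough pure-Python approximation of what `CountVectorizer().fit()` does with its default settings: lowercase the text, keep tokens of two or more word characters, and sort the unique tokens. The helper names `tokenize` and `learn_vocabulary` are hypothetical, and the real class supports many more options, so treat this as a sketch rather than the library's implementation.

```python
import re

# Approximation of CountVectorizer's default token pattern:
# runs of two or more word characters (so the single letter 'a' is dropped).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and return its tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())

def learn_vocabulary(documents):
    """Return the sorted list of unique tokens across all documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
vocabulary = learn_vocabulary(simple_train)
print(vocabulary)
# ['cab', 'call', 'me', 'please', 'tonight', 'you']
```

This shows why 'Call' and 'call' collapse into one token and why 'a' never appears in the vocabulary.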

In [5]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

<3x6 sparse matrix of type '<type 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [6]:
# print the sparse matrix: each line is (row, column) followed by the count
# row = index of the document in simple_train
# column = index of the token in vect.get_feature_names()
# e.g. (0, 1)  1 means simple_train[0] contains token 1 ('call') once
# e.g. (2, 3)  2 means simple_train[2] contains token 3 ('please') twice
print(simple_train_dtm)

  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2


In [7]:
# convert sparse matrix to a dense matrix
# each row corresponds to one element of simple_train, each column to one token
# e.g. row 0 is [0, 1, 0, 0, 1, 1]: 'call', 'tonight' and 'you' each appear once in simple_train[0]
# this is the "bag of words" representation
# disadvantage: it completely ignores the order in which words appear
simple_train_dtm.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]])
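
The relationship between the dense array and the sparse listing can be sketched in plain Python. This is an illustration only, assuming the same tokenization approximation as `CountVectorizer`'s defaults; `to_dense_rows` is a hypothetical helper, not a scikit-learn function.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def to_dense_rows(documents, vocabulary):
    """Build one row of token counts per document (hypothetical helper)."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts[token] for token in vocabulary])
    return rows

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
vocabulary = ['cab', 'call', 'me', 'please', 'tonight', 'you']

dense = to_dense_rows(simple_train, vocabulary)

# Sparse view: keep only the non-zero cells as (row, column, count).
nonzero = [(i, j, c) for i, row in enumerate(dense) for j, c in enumerate(row) if c]

print(dense)         # [[0, 1, 0, 0, 1, 1], [1, 1, 1, 0, 0, 0], [0, 1, 1, 2, 0, 0]]
print(len(nonzero))  # 9
```

The nine non-zero cells correspond to the "9 stored elements" reported for the sparse matrix.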

In [8]:
# examine the vocabulary and document-term matrix together
import pandas as pd
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())

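|   | cab | call | me | please | tonight | you |
|---|-----|------|----|--------|---------|-----|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 2 | 0 | 0 |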


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
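
A small sketch of what "ignoring the relative position" means in practice: two sentences with opposite meanings can produce the same bag of words. The example sentences and the helper `bag_of_words` are made up for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count tokens of two or more word characters, ignoring case and order."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

first = bag_of_words("dog bites man")
second = bag_of_words("man bites dog")

# Same tokens with the same counts, so the two representations are identical.
print(first == second)  # True
```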

In [10]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test = ["please don't call me"]
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

array([[0, 1, 1, 1, 0, 0]])

In [11]:
# examine the vocabulary and document-term matrix together
# "don't" is tokenized as 'don' (the single character 't' is dropped); 'don' is ignored because it is not in the fitted vocabulary
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())

|   | cab | call | me | please | tonight | you |
|---|-----|------|----|--------|---------|-----|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 |


**Summary:**

- `vect.fit(train)` learns the vocabulary of the training data
- `vect.transform(train)` uses the fitted vocabulary to build a document-term matrix from the training data
- `vect.transform(test)` uses the fitted vocabulary to build a document-term matrix from the testing data (and **ignores tokens it hasn't seen before**)
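
The last point can be illustrated with a pure-Python sketch of the transform step, assuming the vocabulary learned in Part 1 and an approximation of the default tokenization. The `transform` function here is a hypothetical stand-in, not the scikit-learn method.

```python
import re
from collections import Counter

vocabulary = ['cab', 'call', 'me', 'please', 'tonight', 'you']

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def transform(documents, vocabulary):
    """Count only tokens that are already in the fitted vocabulary."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts[token] for token in vocabulary])
    return rows

test_message = "please don't call me"
unseen = [t for t in tokenize(test_message) if t not in vocabulary]

print(unseen)                                # ['don']
print(transform([test_message], vocabulary)) # [[0, 1, 1, 1, 0, 0]]
```

The unseen token simply has no column to land in, so it contributes nothing to the row.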

## Part 2: Reading SMS data - Exercise

In [12]:
# read tab-separated file
url = '../../DAT-DC-10/data/sms.tsv'
col_names = ['label', 'message']
sms = pd.read_table(url, sep='\t', header=None, names=col_names)


In [13]:
# Investigate the shape of this DataFrame, take a look at a few rows
sms.shape

(5572, 2)

In [14]:
sms.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [15]:
# how many occurrences are there of ham and of spam?
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64
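
These class counts imply a useful baseline. As a quick sketch, the accuracy of a trivial model that always predicts the majority class (often called the null accuracy) can be computed directly from the counts shown above; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
class_counts = {'ham': 4825, 'spam': 747}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy achieved by always predicting the majority class.
null_accuracy = class_counts[majority_class] / total

print(total)                                     # 5572
print(majority_class, round(null_accuracy, 4))   # ham 0.8659
```

Any classifier trained later should be judged against this figure rather than against 50%.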

In [16]:
# convert label to a numeric variable
sms['label'] = sms.label.map({'ham':0, 'spam':1})

In [17]:
# define X and y (train_test_split is imported here for the split in the next cell)
from sklearn.model_selection import train_test_split
X = sms.message
y = sms.label

In [18]:
# split into training and testing sets (by default 75% training, 25% testing, chosen at random)
X_train, X_test, y_train, y_test = train_test_split(X, y)
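
As a rough sketch of what a default split amounts to, the sizes can be worked out by hand: with 5572 messages and a 25% test fraction, 1393 rows go to the test set and 4179 to the training set. The shuffling below uses the standard library with an arbitrary seed; it only mimics the idea and will not reproduce scikit-learn's actual row assignment. For a repeatable split in scikit-learn itself, `train_test_split` accepts a `random_state` argument.

```python
import math
import random

n_samples = 5572
test_fraction = 0.25  # assumed default test fraction when no size is passed

n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

# Shuffle row positions, then cut; the seed value 1 is an arbitrary choice.
indices = list(range(n_samples))
random.Random(1).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(n_train, n_test)  # 4179 1393
```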

## Part 3: Vectorizing SMS data - Exercise

In [19]:
# instantiate the vectorizer
vect = CountVectorizer()

In [20]:
# learn training data vocabulary with vect.fit()
vect.fit(X_train)
# vect.get_feature_names()

# then use vect.transform() to create a document-term matrix called X_train_dtm
X_train_dtm = vect.transform(X_train)
X_train_dtm

<4179x7444 sparse matrix of type '<type 'numpy.int64'>'
	with 55511 stored elements in Compressed Sparse Row format>
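
The reported shape and stored-element count show why a sparse format is used. A short arithmetic sketch, using only the numbers printed above:

```python
# Figures taken from the sparse matrix summary above.
n_rows, n_cols = 4179, 7444
n_stored = 55511

n_cells = n_rows * n_cols
density = n_stored / n_cells

print(n_cells)                        # 31108476
print(round(100 * density, 3), '%')   # 0.178 %
```

Fewer than 0.2% of the cells are non-zero, so storing only those cells saves a great deal of memory compared with a dense array.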

In [23]:
# transform testing data (using fitted vocabulary) into a document-term matrix, call it X_test_dtm
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1393x7444 sparse matrix of type '<type 'numpy.int64'>'
	with 17268 stored elements in Compressed Sparse Row format>

## Part 4: Examining the tokens and their counts

In [24]:
# store token names
X_train_tokens = vect.get_feature_names()

In [25]:
# first 50 tokens
print(X_train_tokens[:50])

[u'00', u'000', u'008704050406', u'0089', u'0121', u'01223585236', u'01223585334', u'0125698789', u'02', u'0207', u'02072069400', u'02073162414', u'02085076972', u'021', u'03', u'04', u'0430', u'05', u'050703', u'0578', u'06', u'07', u'07008009200', u'07090298926', u'07099833605', u'07123456789', u'0721072', u'07732584351', u'07734396839', u'07742676969', u'07753741225', u'0776xxxxxxx', u'07781482378', u'07786200117', u'078', u'07801543489', u'07808', u'07815296484', u'07821230901', u'0789xxxxxxx', u'0796xxxxxx', u'07973788240', u'07xxxxxxxxx', u'08', u'0800', u'08000407165', u'08000776320', u'08000839402', u'08000930705', u'08000938767']


In [26]:
# last 50 tokens
print(X_train_tokens[-50:])

[u'yest', u'yesterday', u'yet', u'yetunde', u'yhl', u'yi', u'yifeng', u'yijue', u'ym', u'ymca', u'yo', u'yoga', u'yogasana', u'yor', u'yorge', u'you', u'young', u'youphone', u'your', u'youre', u'yourinclusive', u'yourjob', u'yours', u'yourself', u'youuuuu', u'youwanna', u'yoville', u'yoyyooo', u'yr', u'yrs', u'ystrday', u'yummmm', u'yummy', u'yun', u'yunny', u'yuo', u'yuou', u'yup', u'zac', u'zaher', u'zealand', u'zebra', u'zed', u'zeros', u'zindgi', u'zoe', u'zoom', u'zyada', u'\xfa1', u'\u3028ud']


In [27]:
# view X_train_dtm as a dense matrix
X_train_dtm.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [28]:
# count how many times EACH token appears across ALL messages in X_train_dtm
import numpy as np
X_train_counts = np.sum(X_train_dtm.toarray(), axis=0)
X_train_counts

array([ 7, 21,  1, ...,  1,  1,  1])

In [29]:
X_train_counts.shape

(7444,)

In [30]:
# create a DataFrame of tokens with their counts
pd.DataFrame({'token': X_train_tokens, 'count': X_train_counts}).sort_values(by='count', ascending=False)

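|      | count | token |
|------|-------|-------|
| 7409 | 1710 | you |
| 6665 | 1706 | to |
| 6560 | 979 | the |
| 952 | 737 | and |
| 3604 | 674 | is |
| 3504 | 670 | in |
| 4482 | 575 | my |
| 4232 | 572 | me |
| 3614 | 566 | it |
| 2846 | 533 | for |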
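
The counting and sorting steps above amount to summing each column of the document-term matrix and ranking the totals. A minimal sketch on the small matrix from Part 1, in plain Python rather than NumPy and pandas:

```python
# Dense document-term matrix and vocabulary from the Part 1 example.
dense = [[0, 1, 0, 0, 1, 1],
         [1, 1, 1, 0, 0, 0],
         [0, 1, 1, 2, 0, 0]]
vocabulary = ['cab', 'call', 'me', 'please', 'tonight', 'you']

# zip(*dense) iterates over columns, so each sum is one token's total count.
column_totals = [sum(column) for column in zip(*dense)]

# Pair tokens with totals and sort from most to least frequent.
ranked = sorted(zip(vocabulary, column_totals), key=lambda pair: pair[1], reverse=True)

print(column_totals)  # [1, 3, 2, 2, 1, 1]
print(ranked[0])      # ('call', 3)
```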


## Bonus: Calculating the "spamminess" of each token

In [31]:
# create separate DataFrames for ham and spam
sms_ham = sms[sms.label==0]
sms_spam = sms[sms.label==1]

In [35]:
# learn the vocabulary of ALL messages and save it
vect.fit(sms.message)
all_tokens = vect.get_feature_names()

In [36]:
# create document-term matrices for ham and spam
ham_dtm = vect.transform(sms_ham.message)
spam_dtm = vect.transform(sms_spam.message)

In [38]:
# count how many times EACH token appears across ALL ham messages
ham_counts = np.sum(ham_dtm.toarray(), axis=0)
# ham_counts

In [39]:
# count how many times EACH token appears across ALL spam messages
spam_counts = np.sum(spam_dtm.toarray(), axis=0)

In [45]:
# create a DataFrame of tokens with their separate ham and spam counts
token_counts = pd.DataFrame({'token':all_tokens, 'ham':ham_counts, 'spam':spam_counts})
token_counts.sort_values(by='ham', ascending=False).head()

|      | ham | spam | token |
|------|-----|------|-------|
| 8668 | 1948 | 297 | you |
| 7806 | 1562 | 691 | to |
| 7674 | 1133 | 206 | the |
| 1097 | 858 | 122 | and |
| 4114 | 823 | 80 | in |


In [140]:
# add one to ham and spam counts to avoid dividing by zero (in the step that follows)
token_counts['ham'] = token_counts.ham + 1
token_counts['spam'] = token_counts.spam + 1

In [141]:
# calculate ratio of spam-to-ham for each token
token_counts['spam_ratio'] = token_counts.spam / token_counts.ham
# sort ascending: the least "spammy" tokens come first
token_counts.sort_values(by='spam_ratio')

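|      | ham | spam | token | spam_ratio |
|------|-----|------|-------|------------|
| 3684 | 319 | 1 | gt | 0.003135 |
| 4793 | 317 | 1 | lt | 0.003155 |
| 3805 | 232 | 1 | he | 0.004310 |
| 6843 | 168 | 1 | she | 0.005952 |
| 4747 | 163 | 1 | lor | 0.006135 |
| 2428 | 151 | 1 | da | 0.006623 |
| 4550 | 136 | 1 | later | 0.007353 |
| 1247 | 90 | 1 | ask | 0.011111 |
| 6626 | 90 | 1 | said | 0.011111 |
| 2714 | 89 | 1 | doing | 0.011236 |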
2714,89,1,doing,0.011236


## Part 5: Building a Naive Bayes model

We will use [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

In [142]:
# train a Naive Bayes model using X_train_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
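
To give a feel for what the fitted model computes, here is a compact pure-Python sketch of multinomial Naive Bayes with add-`alpha` smoothing (matching the `alpha=1.0` shown above). It is a teaching approximation, not scikit-learn's code; the function names are hypothetical and the labels on the toy matrix are made up for illustration.

```python
import math

def fit_multinomial_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log probabilities."""
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_rows = [row for row, label in zip(rows, labels) if label == c]
        log_prior[c] = math.log(len(class_rows) / len(rows))
        # Total count of each token within the class, plus the smoothing term.
        token_totals = [sum(column) + alpha for column in zip(*class_rows)]
        total = sum(token_totals)
        log_likelihood[c] = [math.log(t / total) for t in token_totals]
    return log_prior, log_likelihood

def predict(row, log_prior, log_likelihood):
    """Return the class with the highest log prior plus weighted log likelihood."""
    scores = {
        c: log_prior[c] + sum(x * w for x, w in zip(row, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy document-term matrix from Part 1; the labels are invented (1 = spam).
rows = [[0, 1, 0, 0, 1, 1], [1, 1, 1, 0, 0, 0], [0, 1, 1, 2, 0, 0]]
labels = [0, 0, 1]

log_prior, log_likelihood = fit_multinomial_nb(rows, labels)
print(predict([0, 1, 1, 1, 0, 0], log_prior, log_likelihood))  # 1
print(predict([1, 1, 0, 0, 0, 0], log_prior, log_likelihood))  # 0
```

The "naive" part is the sum over tokens: each token contributes independently to the score, regardless of which other tokens appear.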

In [143]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [144]:
# calculate accuracy of class predictions
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.983488872936


In [145]:
# confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

[[1203    6]
 [  17  167]]
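
The confusion matrix can be unpacked by hand to check the accuracy and to see where the errors fall. In scikit-learn's layout, rows are true classes and columns are predicted classes, so for labels 0 (ham) and 1 (spam) the cells are `[[TN, FP], [FN, TP]]`. The variable names below are illustrative.

```python
# Confusion matrix printed above: rows = actual, columns = predicted.
confusion = [[1203, 6],
             [17, 167]]
(tn, fp), (fn, tp) = confusion

accuracy = (tp + tn) / (tn + fp + fn + tp)
sensitivity = tp / (tp + fn)   # share of spam messages that were caught
specificity = tn / (tn + fp)   # share of ham messages correctly let through

print(round(accuracy, 6))     # 0.983489
print(round(sensitivity, 4))  # 0.9076
print(round(specificity, 4))  # 0.995
```

The model misses 17 spam messages (false negatives) but flags only 6 ham messages as spam (false positives).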


In [146]:
# predict (poorly calibrated) probabilities
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([  9.99998990e-01,   5.27733588e-06,   7.90711592e-06, ...,
         1.31571005e-01,   1.05481616e-09,   9.85453565e-09])

In [147]:
# calculate AUC
print(metrics.roc_auc_score(y_test, y_pred_prob))

0.97634588413
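
One way to read AUC: it is the probability that a randomly chosen spam message receives a higher predicted probability than a randomly chosen ham message. Because it depends only on the ranking of the scores, it remains meaningful even when the probabilities are poorly calibrated. The brute-force sketch below illustrates this definition on a tiny made-up example; `roc_auc` is a hypothetical helper and far less efficient than `metrics.roc_auc_score`.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Made-up labels and scores for illustration.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(y_true, scores))  # 0.75
```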


In [148]:
# print message text for the false positives
X_test[y_test < y_pred_class]

45            No calls..messages..missed calls
4382    Mathews or tait or edwards or anderson
574                     Waiting for your call.
3375                   Also andros ice etc etc
4702                    I liked the new mobile
228             Hey company elama po mudyadhu.
Name: message, dtype: object

In [149]:
# print message text for the false negatives
X_test[y_test > y_pred_class]

672     SMS. ac sun0819 posts HELLO:"You seem cool, wa...
4373    Ur balance is now £600. Next question: Complet...
2575    Your next amazing xxx PICSFREE1 video will be ...
5037    You won't believe it but it's true. It's Incre...
5370    dating:i have had two of these. Only started a...
3742                                        2/2 146tf150p
2354    Please CALL 08712402902 immediately as there i...
3419    LIFE has never been this much fun and great un...
3981                                   ringtoneking 84484
3360    Sorry I missed your call let's talk when you h...
1430    For sale - arsenal dartboard. Good condition b...
4144    In The Simpsons Movie released in July 2007 na...
2823    ROMCAPspam Everyone around should be respondin...
869     Hello. We need some posh birds and chaps to us...
1638    0A$NETWORKS allow companies to bill for SMS, s...
684     Hi I'm sue. I am 20 years old and work as a la...
3391    Please CALL 08712402972 immediately as there i...
Name: message, dtype: object

In [43]:
# what do you notice about the false negatives?
X_test[3132]

"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."

## Part 6: Comparing Naive Bayes with logistic regression

In [44]:
# import/instantiate/fit (C is the inverse regularization strength, so C=1e9 makes the penalty negligible)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm, y_train)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0)

In [45]:
# class predictions and predicted probabilities
y_pred_class = logreg.predict(X_test_dtm)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]

In [46]:
# calculate accuracy and AUC
print(metrics.accuracy_score(y_test, y_pred_class))
print(metrics.roc_auc_score(y_test, y_pred_prob))

0.989231873654
0.994144889923
