# Machine Learning with Text in scikit-learn  

Forked from [justmarkham/pycon-2016-tutorial](http://github.com/justmarkham/pycon-2016-tutorial)

## Agenda

1. Model building in scikit-learn (refresher)
2. Representing text as numerical data
3. Reading a text-based dataset into pandas
4. Vectorizing our dataset
5. Building and evaluating a model
6. Comparing models
7. Examining a model for further insight

## Part 1: Model building in scikit-learn (refresher)

In [1]:
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()

In [2]:
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

**"Features"** are also known as predictors, inputs, or attributes. The **"response"** is also known as the target, label, or output.

In [3]:
# check the shapes of X and y
print(X.shape)
print(y.shape)

(150, 4)
(150,)


**"Observations"** are also known as samples, instances, or records.

In [4]:
# examine the first 5 rows of the feature matrix (including the feature names)
import pandas as pd
pd.DataFrame(X, columns=iris.feature_names).head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [5]:
# examine the response vector
print(y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**.

In [6]:
# import the class
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model (with the default parameters)
knn = KNeighborsClassifier()

# fit the model with data (occurs in-place)
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.

In [7]:
# predict the response for a new observation
knn.predict([[3, 5, 4, 2]])

array([1])
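
To make the fit/predict pattern above less opaque, here is a minimal pure-Python sketch of what a k-nearest-neighbors classifier does at prediction time: measure the distance from the new observation to every training observation, then take a majority vote among the closest `k`. The toy data and the helper name `knn_predict` are illustrative assumptions, not part of scikit-learn, and the real `KNeighborsClassifier` uses optimized search structures rather than this brute-force loop.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict a label by majority vote among the k nearest training points."""
    # Euclidean distance from the new observation to every training observation
    distances = [
        (math.dist(x, x_new), label) for x, label in zip(X_train, y_train)
    ]
    # keep the k closest neighbors and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# hypothetical two-feature training data with two well-separated classes
X_toy = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
y_toy = [0, 0, 0, 1, 1, 1]

# the new observation must have the same two features, in the same order
print(knn_predict(X_toy, y_toy, (1.1, 1.0), k=3))
print(knn_predict(X_toy, y_toy, (5.0, 5.1), k=3))
```

Because the distance is computed feature by feature, a new observation with a different number or ordering of features would either fail or silently produce meaningless distances.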

## Part 2: Representing text as numerical data

In [8]:
# example text for model training (Article Titles)
simple_train = ['9 Shape Tips for Wedding season', 'Barack leaves office in January']

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [9]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [10]:
# learn the 'vocabulary' of the training data (occurs in-place)
vect.fit(simple_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [11]:
# examine the fitted vocabulary
# note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
vect.get_feature_names()

['barack',
 'for',
 'in',
 'january',
 'leaves',
 'office',
 'season',
 'shape',
 'tips',
 'wedding']

In [12]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

<2x10 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [13]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()

array([[0, 1, 0, 0, 0, 0, 1, 1, 1, 1],
       [1, 0, 1, 1, 1, 1, 0, 0, 0, 0]])

In [14]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())

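| | barack | for | in | january | leaves | office | season | shape | tips | wedding |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |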


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
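
The tokenize-and-count strategy quoted above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not scikit-learn's implementation: the helper names (`tokenize`, `fit_vocabulary`, `to_count_rows`) are made up for this example, and the regular expression is copied from the default `token_pattern` shown in the `CountVectorizer` output earlier.

```python
import re
from collections import Counter

# same pattern as CountVectorizer's default token_pattern: words of 2+ characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(docs):
    """Return the sorted list of unique tokens across all documents."""
    return sorted({token for doc in docs for token in tokenize(doc)})

def to_count_rows(docs, vocabulary):
    """Build one row of token counts per document, one column per token."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[token] for token in vocabulary])
    return rows

simple_train = ['9 Shape Tips for Wedding season', 'Barack leaves office in January']
vocabulary = fit_vocabulary(simple_train)
dtm = to_count_rows(simple_train, vocabulary)
print(vocabulary)
print(dtm)
```

Note that the single-character token `9` is dropped by the pattern, and that word order within each title plays no role in the resulting rows.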

In [15]:
# check the type of the document-term matrix
type(simple_train_dtm)

scipy.sparse.csr.csr_matrix

In [16]:
# examine the sparse matrix contents
print(simple_train_dtm)

  (0, 1)	1
  (0, 6)	1
  (0, 7)	1
  (0, 8)	1
  (0, 9)	1
  (1, 0)	1
  (1, 2)	1
  (1, 3)	1
  (1, 4)	1
  (1, 5)	1


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
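
As a rough illustration of why a sparse representation saves memory, the sketch below stores only the nonzero cells of the small training matrix in a dictionary keyed by `(row, column)`. This dictionary-of-keys layout is an assumption chosen for readability; the matrix returned by `CountVectorizer` uses the Compressed Sparse Row format instead, and the helper `get_cell` is hypothetical.

```python
# dense document-term matrix from the two training titles above
dense = [
    [0, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
]

# keep only the nonzero cells, keyed by (row, column)
sparse = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}

def get_cell(sparse_cells, i, j):
    """Look up a cell; anything not stored is implicitly zero."""
    return sparse_cells.get((i, j), 0)

total_cells = len(dense) * len(dense[0])
print(len(sparse), "stored out of", total_cells, "cells")
print(get_cell(sparse, 0, 1), get_cell(sparse, 0, 0))
```

With only two short documents half of the cells are nonzero, so the saving is small here; it grows quickly as the vocabulary gets larger than any single document.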

In [17]:
# example text for model testing
simple_test = ["5 tips and tricks for college"]

In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.

In [18]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

array([[0, 1, 0, 0, 0, 0, 0, 0, 1, 0]])

In [19]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())

| | barack | for | in | january | leaves | office | season | shape | tips | wedding |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |


**Summary:**

- `vect.fit(train)` **learns the vocabulary** of the training data
- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)
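
The last point in the summary, that unseen tokens are ignored, can be sketched without scikit-learn. The `transform` helper below is a hypothetical stand-in for `vect.transform`, and the vocabulary is copied from the fitted output earlier in this part rather than learned in the snippet.

```python
import re
from collections import Counter

def transform(docs, vocabulary):
    """Count only the tokens that are already in the fitted vocabulary."""
    rows = []
    for doc in docs:
        tokens = re.findall(r"\b\w\w+\b", doc.lower())
        counts = Counter(tokens)
        # tokens outside the vocabulary have no column, so they are dropped
        rows.append([counts[token] for token in vocabulary])
    return rows

# vocabulary learned from the training titles (copied from the output above)
vocabulary = ['barack', 'for', 'in', 'january', 'leaves',
              'office', 'season', 'shape', 'tips', 'wedding']

simple_test = ["5 tips and tricks for college"]
test_rows = transform(simple_test, vocabulary)
unseen = [t for t in re.findall(r"\b\w\w+\b", simple_test[0].lower())
          if t not in vocabulary]
print(test_rows)
print(unseen)
```

The row keeps the same ten columns as the training matrix, which is what allows a model trained on the training matrix to accept it.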

## Part 3: Reading a text-based dataset into pandas

In [20]:
# read file into pandas using a relative path
path = 'data/links.csv'
links = pd.read_csv(path, header=0, names=['label', 'title'])

In [21]:
# examine the shape
links.shape

(9500, 2)

In [22]:
# examine the first 10 rows
links.head(10)

| | label | title |
|---|---|---|
| 0 | genuine | Joseph Schooling beats Michael Phelps to claim... |
| 1 | genuine | Bill Clinton: Email controversy is the 'bigges... |
| 2 | genuine | Hacker releases cell phone numbers, personal e... |
| 3 | genuine | Lionel Messi announces Argentina return |
| 4 | genuine | Fighting the male biological clock by banking ... |
| 5 | genuine | The face of the Olympics will never look the same |
| 6 | genuine | Trump: If Clinton wins Pennsylvania, she cheated |
| 7 | genuine | Texas baby found dead after nine hours in hot car |
| 8 | genuine | Malawi is moving 500 elephants across the country |
| 9 | genuine | Thomas Gibson fired from 'Criminal Minds' afte... |


In [23]:
# examine the class distribution
links.label.value_counts()

genuine      6657
clickbait    2843
Name: label, dtype: int64

In [24]:
# convert label to a numerical variable
links['label_num'] = links.label.map({'genuine':0, 'clickbait':1})

In [25]:
# check that the conversion worked
links.head(5)

| | label | title | label_num |
|---|---|---|---|
| 0 | genuine | Joseph Schooling beats Michael Phelps to claim... | 0 |
| 1 | genuine | Bill Clinton: Email controversy is the 'bigges... | 0 |
| 2 | genuine | Hacker releases cell phone numbers, personal e... | 0 |
| 3 | genuine | Lionel Messi announces Argentina return | 0 |
| 4 | genuine | Fighting the male biological clock by banking ... | 0 |


In [26]:
# examine the last 5 rows
links.tail(5)

| | label | title | label_num |
|---|---|---|---|
| 9495 | clickbait | Upworthy Video | 1 |
| 9496 | clickbait | The things some adopted kids are afraid to tal... | 1 |
| 9497 | clickbait | Playgrounds for senior citizens? Genius idea. | 1 |
| 9498 | clickbait | What happens when this 7-year-old elephant reu... | 1 |
| 9499 | clickbait | Upworthy Video | 1 |


In [27]:
# how to define X and y (from the iris data) for use with a MODEL
X = iris.data
y = iris.target
print(X.shape)
print(y.shape)

(150, 4)
(150,)


In [28]:
# how to define X and y (from the links data) for use with COUNTVECTORIZER
X = links.title
y = links.label_num
print(X.shape)
print(y.shape)

(9500,)
(9500,)


In [29]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(7125,)
(2375,)
(7125,)
(2375,)
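
The shapes above follow from the split proportions. As a small worked check, assuming `train_test_split` is using its default test fraction of 0.25 (no `test_size` was passed in the call), the sizes can be reproduced by hand:

```python
import math

n_samples = 9500          # rows in the links DataFrame
test_fraction = 0.25      # assumed default when no size is passed

# the test set size is rounded up; the training set gets the remainder
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

Splitting before vectorizing matters: the vocabulary is then learned from the training titles only, which mimics how the model will meet genuinely new titles.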


## Part 4: Vectorizing our dataset

In [30]:
# instantiate the vectorizer
vect = CountVectorizer()

In [31]:
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

In [32]:
# equivalently: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)

In [33]:
# examine the document-term matrix
X_train_dtm

<7125x9662 sparse matrix of type '<class 'numpy.int64'>'
	with 61386 stored elements in Compressed Sparse Row format>
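
The numbers in this summary show how sparse the training matrix is. The arithmetic below uses only the shape and stored-element count printed above; the variable names are illustrative.

```python
n_rows, n_cols = 7125, 9662   # shape of X_train_dtm
stored = 61386                # nonzero entries reported above

total_cells = n_rows * n_cols
density = stored / total_cells
avg_tokens_per_title = stored / n_rows

print(total_cells)
print(round(density * 100, 3), "% of cells are nonzero")
print(round(avg_tokens_per_title, 1), "distinct tokens per title on average")
```

Well under 1% of the cells hold a count, which is why the matrix is kept in a sparse format instead of being converted with `toarray()`.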

In [34]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<2375x9662 sparse matrix of type '<class 'numpy.int64'>'
	with 18916 stored elements in Compressed Sparse Row format>

## Part 5: Building and evaluating a model

We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
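
For intuition about what the classifier computes from word counts, here is a toy multinomial Naive Bayes in plain Python. It is a sketch under stated assumptions, not scikit-learn's implementation: the function names, the tiny tokenized dataset, and the choice to skip out-of-vocabulary tokens at prediction time are all made up for illustration. The smoothing constant `alpha=1.0` mirrors the `MultinomialNB` default.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit a toy multinomial Naive Bayes model on tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    classes = sorted(set(labels))
    class_counts = Counter(labels)
    token_counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {"vocab": set(vocab), "log_prior": {}, "log_likelihood": {}}
    for c in classes:
        model["log_prior"][c] = math.log(class_counts[c] / len(labels))
        # smoothed denominator: all tokens in the class plus alpha per vocabulary entry
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        model["log_likelihood"][c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return model

def predict(model, doc):
    """Return the class with the highest log prior plus summed token log likelihoods."""
    scores = {}
    for c, log_prior in model["log_prior"].items():
        scores[c] = log_prior + sum(
            model["log_likelihood"][c][tok] for tok in doc if tok in model["vocab"]
        )
    return max(scores, key=scores.get)

# hypothetical tokenized titles: 1 = clickbait, 0 = genuine
toy_docs = [
    ["you", "won't", "believe", "this"],
    ["senate", "passes", "budget"],
    ["this", "will", "amaze", "you"],
    ["court", "rules", "on", "budget"],
]
toy_labels = [1, 0, 1, 0]

toy_model = train_multinomial_nb(toy_docs, toy_labels)
print(predict(toy_model, ["this", "will", "amaze"]))
print(predict(toy_model, ["senate", "budget"]))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a token that never appeared in a class from forcing that class's score to negative infinity.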

In [35]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [36]:
# train the model using X_train_dtm (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.02 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [37]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [38]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.89473684210526316

In [39]:
# classify a new title with the trained model
new_title = vect.transform(['President leaves office in January'])
# new_title = vect.transform(['20 Amazing Funny Gadgets you should try'])
labels = {0: 'genuine', 1: 'clickbait'}
pred = nb.predict(new_title)
print('{} == {}'.format(pred, labels[int(pred[0])]))

[0] == genuine


In [40]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[1492,  162],
       [  88,  633]])
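
The accuracy reported earlier can be recovered from this confusion matrix, along with a few other metrics. The sketch below assumes the usual scikit-learn layout (rows are true classes, columns are predicted classes, with genuine as class 0) and uses only the four counts printed above.

```python
# confusion matrix from above: rows are true classes, columns are predicted classes
tn, fp = 1492, 162   # true genuine: predicted genuine, predicted clickbait
fn, tp = 88, 633     # true clickbait: predicted genuine, predicted clickbait

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)     # share of predicted clickbait that really is clickbait
recall = tp / (tp + fn)        # share of actual clickbait that was caught
null_accuracy = max(tn + fp, fn + tp) / total  # always predicting the majority class

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(null_accuracy, 4))
```

Comparing accuracy with the null accuracy shows how much the model improves on simply labeling every title as genuine.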

In [41]:
# print the titles of the false positives (genuine incorrectly classified as clickbait)
X_test[y_test < y_pred_class]

4935         Think Like a Doctor: A Cough That Won't Stop
5364    Disney Princesses Do Change Girls -- and Boys,...
941          The last VCR will be manufactured this month
5955                     Lessons of Hiroshima and Orlando
473     This wireless speaker is as loud as a rock con...
229     Bride is walked down the aisle by man who rece...
1475    Hughley: Parenting advice from Giuliani is lik...
3657                     The Brain That Couldn't Remember
6386    Almost As Good As A Full Night's Sleep: 25 Dis...
911     Kristen Bell releases first ever photos from $...
3583            Are Smoothies Better for You Than Juices?
5816                  The Week in Pictures: June 17, 2016
6646    Rappers Based Lyrics on Their Credit Card Frau...
98      Facebook will now show you ads even if you use...
337     10 strange sports you didn't know were in the ...
5297    You're Going to Sell Your Home. Should You Men...
2656    Comfort dogs' from around the U.S. are providi...
5631    How Ma

In [42]:
# print the titles of the false negatives (clickbait incorrectly classified as genuine)
X_test[y_test > y_pred_class]

8460    Princess Charlotte Made Her First Public Appea...
6788    State Lawmaker's Son Dies On World's Tallest W...
7501    Turkey's Military Says It Has Taken Control Of...
8017    Jesse Williams Gave A Powerful Speech About Ra...
8649          Nice Guys Can Commit Domestic Violence, Too
6867    Let's Take A Moment To Talk About The Men's Ol...
8780    Kenya's unique approach to rape prevention sho...
8218    Hundreds Of Professors Sign Letter Condeming Y...
8826    Meet Jordan, whose love of black cats helps he...
7294     26 Photos Of Prince George Bossing It As A Royal
9180        A letter to my mother-in-law about my 3 boys.
6777           Oh My God, Beach Volleyball Is Magnificent
7298    Several Dead And Injured In A German Shopping ...
9039    Another major country lifted its gay blood-don...
7843    Stop Trying To Rescue Baby Animals, Wildlife O...
7864    The Official Ranking Of All Of Jessica Simpson...
8670                              7 Times Trump Was Right
9160    Listen

In [43]:
# example false negative
X_test[9479]

"Scientists in Antarctica photographed a new species of crab. It's extraordinary he even exists."

In [44]:
# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([  2.21282792e-03,   6.82442823e-02,   9.99995069e-01, ...,
         9.89507319e-01,   1.53084126e-03,   1.42533725e-05])

In [45]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.94528374033780171
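
One way to read the AUC is as the probability that a randomly chosen clickbait title receives a higher predicted probability than a randomly chosen genuine title. The brute-force sketch below computes it that way on a small made-up example; `pairwise_auc` is a hypothetical helper, and `roc_auc_score` arrives at the same quantity through the ROC curve instead of enumerating pairs.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

# made-up labels and predicted probabilities
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_y, toy_scores))
```

Because only the ranking of the scores matters, AUC can be high even when the probabilities themselves are poorly calibrated.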

## Part 6: Comparing models

We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):

> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
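
The logistic function mentioned in the quote maps a weighted sum of token counts to a probability between 0 and 1. The following sketch shows that mapping with invented weights; the helper `clickbait_probability`, the three-token vocabulary, and all coefficient values are assumptions for illustration, not values learned by the model in this tutorial.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def clickbait_probability(counts, weights, intercept):
    """Probability of the positive class for one row of token counts."""
    z = intercept + sum(w * x for w, x in zip(weights, counts))
    return sigmoid(z)

# hypothetical coefficients for three tokens, plus an intercept
weights = [2.0, -1.5, 0.5]
intercept = -1.0

p_high = clickbait_probability([1, 0, 1], weights, intercept)
p_low = clickbait_probability([0, 1, 0], weights, intercept)
print(round(p_high, 3), round(p_low, 3))
```

A positive weight pushes the probability toward clickbait whenever its token appears, and a negative weight pushes it toward genuine; the class prediction is simply whether the probability exceeds 0.5.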

In [46]:
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [47]:
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)

CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 47.4 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [48]:
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)

In [49]:
# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([ 0.00469539,  0.05418011,  0.98822526, ...,  0.74744416,
        0.02012881,  0.00312435])

In [50]:
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)

0.92042105263157892

In [51]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.95880788304568254

## Part 7: Examining a model for further insight

We will examine our **trained Naive Bayes model** to calculate the approximate **"clickbaitness" of each token**.

In [52]:
# store the vocabulary of X_train
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)

9662

In [53]:
# examine the first 50 tokens
print(X_train_tokens[0:50])

['000', '00s', '01pm', '06pm', '09pm', '10', '100', '101', '102', '10m', '10th', '11', '110', '112', '116', '12', '120', '125', '126', '13', '130', '130m', '13b', '13pm', '13th', '14', '140', '142', '15', '150', '15k', '16', '160', '161', '166', '17', '174', '17th', '18', '187', '18pm', '19', '1900s', '196', '1960s', '1964', '1970s', '1982', '1993', '1994']


In [54]:
# examine the last 50 tokens
print(X_train_tokens[-50:])

['yiannopoulos', 'yield', 'yo', 'yoga', 'yogaday', 'yogurt', 'yoko', 'york', 'yorker', 'you', 'young', 'younger', 'youngest', 'your', 'youree', 'yours', 'yourself', 'yourselves', 'youth', 'youths', 'youtube', 'yulia', 'yuliya', 'yup', 'yves', 'zabar', 'zac', 'zack', 'zara', 'zealand', 'zen', 'zeppelin', 'zero', 'zika', 'zimbabwe', 'zimbabwean', 'zimmer', 'zip', 'zodiac', 'zombie', 'zombies', 'zone', 'zones', 'zoo', 'zookeeper', 'zoos', 'zootopia', 'zubabox', 'zucchini', 'zuckerberg']


In [55]:
# Naive Bayes counts the number of times each token appears in each class
nb.feature_count_

array([[ 27.,   0.,   1., ...,   0.,   3.,   4.],
       [  3.,  11.,   0., ...,   1.,   0.,   0.]])

In [56]:
# rows represent classes, columns represent tokens
nb.feature_count_.shape

(2, 9662)

In [57]:
# number of times each token appears across all genuine titles
genuine_token_count = nb.feature_count_[0, :]
genuine_token_count

array([ 27.,   0.,   1., ...,   0.,   3.,   4.])

In [58]:
# number of times each token appears across all clickbait titles
clickbait_token_count = nb.feature_count_[1, :]
clickbait_token_count

array([  3.,  11.,   0., ...,   1.,   0.,   0.])

In [59]:
# create a DataFrame of tokens with their separate genuine and clickbait counts
tokens = pd.DataFrame({'token':X_train_tokens, 'genuine':genuine_token_count, 'clickbait':clickbait_token_count}).set_index('token')
tokens.head()

| token | clickbait | genuine |
|---|---|---|
| 000 | 3.0 | 27.0 |
| 00s | 11.0 | 0.0 |
| 01pm | 0.0 | 1.0 |
| 06pm | 0.0 | 2.0 |
| 09pm | 0.0 | 1.0 |


In [60]:
# examine 5 random DataFrame rows
tokens.sample(5, random_state=6)

| token | clickbait | genuine |
|---|---|---|
| torrent | 0.0 | 1.0 |
| retailers | 0.0 | 1.0 |
| 19 | 85.0 | 5.0 |
| subjects | 0.0 | 2.0 |
| market | 0.0 | 11.0 |


In [61]:
# Naive Bayes counts the number of observations in each class
nb.class_count_

array([ 5003.,  2122.])

Before we can calculate the "clickbaitness" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**.
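
As a worked example of the adjustment described above, the calculation for a single token can be done by hand. The counts for the token `19` (5 genuine, 85 clickbait) and the class counts (5003 genuine, 2122 clickbait) are taken from the outputs above; the variable names are illustrative.

```python
# raw counts for the token '19' and the number of training titles per class
genuine_count, clickbait_count = 5.0, 85.0
n_genuine, n_clickbait = 5003.0, 2122.0

# add 1 so that a token missing from one class never causes a division by zero
genuine_adj = genuine_count + 1
clickbait_adj = clickbait_count + 1

# divide by the class sizes so the larger genuine class does not dominate
genuine_freq = genuine_adj / n_genuine
clickbait_freq = clickbait_adj / n_clickbait

clickbait_ratio = clickbait_freq / genuine_freq
print(round(genuine_freq, 6), round(clickbait_freq, 6), round(clickbait_ratio, 6))
```

A ratio above 1 means the token is relatively more common in clickbait titles, and a ratio below 1 means it is relatively more common in genuine ones.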

In [62]:
# add 1 to genuine and clickbait counts to avoid dividing by 0
tokens['genuine'] = tokens.genuine + 1
tokens['clickbait'] = tokens.clickbait + 1
tokens.sample(5, random_state=6)

| token | clickbait | genuine |
|---|---|---|
| torrent | 1.0 | 2.0 |
| retailers | 1.0 | 2.0 |
| 19 | 86.0 | 6.0 |
| subjects | 1.0 | 3.0 |
| market | 1.0 | 12.0 |


In [63]:
# convert the genuine and clickbait counts into frequencies
tokens['genuine'] = tokens.genuine / nb.class_count_[0]
tokens['clickbait'] = tokens.clickbait / nb.class_count_[1]
tokens.sample(5, random_state=6)

| token | clickbait | genuine |
|---|---|---|
| torrent | 0.000471 | 0.0004 |
| retailers | 0.000471 | 0.0004 |
| 19 | 0.040528 | 0.001199 |
| subjects | 0.000471 | 0.0006 |
| market | 0.000471 | 0.002399 |


In [64]:
# calculate the ratio of clickbait-to-genuine for each token
tokens['clickbait_ratio'] = tokens.clickbait / tokens.genuine
tokens.sample(5, random_state=6)

| token | clickbait | genuine | clickbait_ratio |
|---|---|---|---|
| torrent | 0.000471 | 0.0004 | 1.178841 |
| retailers | 0.000471 | 0.0004 | 1.178841 |
| 19 | 0.040528 | 0.001199 | 33.793434 |
| subjects | 0.000471 | 0.0006 | 0.785894 |
| market | 0.000471 | 0.002399 | 0.196473 |


In [65]:
# examine the DataFrame sorted by clickbait_ratio
# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier
tokens.sort_values('clickbait_ratio', ascending=False)

| token | clickbait | genuine | clickbait_ratio |
|---|---|---|---|
| 21 | 0.045712 | 0.000200 | 228.695099 |
| upworthy | 0.039114 | 0.000200 | 195.687559 |
| 23 | 0.028275 | 0.000200 | 141.460886 |
| buzzfeed | 0.026861 | 0.000200 | 134.387842 |
| obsessed | 0.014609 | 0.000200 | 73.088124 |
| laugh | 0.013195 | 0.000200 | 66.015080 |
| funny | 0.010368 | 0.000200 | 51.868992 |
| 17 | 0.028746 | 0.000600 | 47.939522 |
| af | 0.009425 | 0.000200 | 47.153629 |
| hacks | 0.008954 | 0.000200 | 44.795947 |


In [66]:
# look up the clickbait_ratio for a given token
tokens.loc['amazing', 'clickbait_ratio']

37.722902921771912