# Tutorial: Machine Learning with Text in scikit-learn

## Outline

1. Model building in scikit-learn (preliminary)
1. Representing text as numerical data (refresher)
1. Reading a text-based dataset into pandas
1. Vectorizing our dataset
1. Building and evaluating a model
1. Comparing models
1. Examining a model for further insight

In [None]:
# for Python 2: use print only as a function
from __future__ import print_function

## Part 1: Model building in scikit-learn (preliminary)

In [None]:
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()

In [None]:
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

**"Features"** are also known as predictors, inputs, or attributes. The **"response"** is also known as the target, label, or output.

In [None]:
# check the shapes of X and y
print(X.shape)
print(y.shape)

**"Observations"** are also known as samples, instances, or records.

In [None]:
# examine the first 5 rows of the feature matrix (including the feature names)
import pandas as pd


In [None]:
# examine the response vector


In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**.

In [None]:
# import the class
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model (with the default parameters)


# fit the model with data (occurs in-place)


In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.
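
As a minimal sketch of the instantiate, fit, and predict steps described above (not necessarily the exact cells used in this tutorial), the following fits a default `KNeighborsClassifier` on the iris data. The measurements in `new_observation` are made-up values chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# load the feature matrix (X) and response vector (y)
iris = load_iris()
X, y = iris.data, iris.target

# instantiate the model with its default parameters
knn = KNeighborsClassifier()

# fit the model with data (the model object is updated in-place)
knn.fit(X, y)

# hypothetical new flower with the same 4 features, in the same order,
# as the training data: sepal length, sepal width, petal length, petal width
new_observation = [[3, 5, 4, 2]]
prediction = knn.predict(new_observation)

# the prediction is an array of encoded classes; map it back to species names
print(prediction)
print(iris.target_names[prediction])
```

Passing an observation with a different number of features than the training data raises an error, which is why the feature layout must match.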

In [None]:
# predict the response for a new observation


## Part 2: Representing text as numerical data (refresher)

In [None]:
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [None]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [None]:
# learn the 'vocabulary' of the training data (occurs in-place)


In [None]:
# examine the fitted vocabulary


In [None]:
# transform training data into a 'document-term matrix'


In [None]:
# convert sparse matrix to a dense matrix


In [None]:
# examine the vocabulary and document-term matrix together


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while **completely ignoring the relative position information** of the words in the document.
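
The Bag of Words representation described above can be sketched with the three training messages from this section. This is a minimal illustration using the default `CountVectorizer` settings; the variable name `tokens` is introduced here only for display purposes.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# learn the vocabulary, then build the document-term matrix
vect = CountVectorizer()
vect.fit(simple_train)
simple_train_dtm = vect.transform(simple_train)

# list the tokens in column order (vocabulary_ maps token -> column index)
tokens = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(tokens)
# ['cab', 'call', 'me', 'please', 'tonight', 'you']

# one row per document, one column per token
print(simple_train_dtm.toarray())
# [[0 1 0 0 1 1]
#  [1 1 1 0 0 0]
#  [0 1 1 2 0 0]]
```

With the defaults, text is lowercased, punctuation is dropped, and single-character tokens such as "a" are excluded. Word order within each message is not recorded.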

In [None]:
# check the type of the document-term matrix


In [None]:
# examine the sparse matrix contents


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
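
To make the sparse representation concrete, here is a small sketch that inspects the document-term matrix for the same three training messages. It only stores the non-zero entries; the name `n_cells` is introduced here for illustration.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
dtm = CountVectorizer().fit_transform(simple_train)

# the document-term matrix is a scipy sparse matrix, not a NumPy array
print(sparse.issparse(dtm))

# compare the number of stored (non-zero) values with the total number of cells
n_cells = dtm.shape[0] * dtm.shape[1]
print(dtm.nnz, 'stored values out of', n_cells, 'cells')

# list the coordinates and value of every stored entry
rows, cols = dtm.nonzero()
for r, c in zip(rows, cols):
    print((int(r), int(c)), int(dtm[r, c]))
```

On a corpus this small the saving is modest, but on a realistic corpus with tens of thousands of columns almost every cell is zero and the sparse format is far more compact.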

In [None]:
# example text for model testing
simple_test = ["please don't call me"]

In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.

In [None]:
# transform testing data into a document-term matrix (using existing vocabulary)


In [None]:
# examine the vocabulary and document-term matrix together


**Summary:**

- `vect.fit(train)` **learns the vocabulary** of the training data
- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)
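
The summary above can be sketched end to end with the example messages from this section. The vectorizer is fitted on the training messages only, so the test message is mapped onto the same six columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
simple_test = ["please don't call me"]

# learn the vocabulary from the training data only
vect = CountVectorizer()
vect.fit(simple_train)

# reuse the fitted vocabulary for the testing data
simple_test_dtm = vect.transform(simple_test)

# columns are: cab, call, me, please, tonight, you
print(simple_test_dtm.toarray())
# [[0 1 1 1 0 0]]
```

The token "don" (from "don't") was never seen during fitting, so it has no column and is silently dropped.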

## Part 3: Reading a text-based dataset into pandas
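
The steps in this part (inspecting the labels, encoding them numerically, defining `X` and `y`, and splitting the data) can be sketched on a small stand-in DataFrame. The eight messages below are made up for illustration and are not taken from `data/sms.tsv`; the column name `label_num` and the `random_state` value are likewise assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical stand-in with the same two columns as the SMS dataset
sms = pd.DataFrame({
    'label': ['ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam'],
    'message': [
        'are we still on for lunch',
        'free prize waiting, call now',
        'call me when you get home',
        'see you at the office tomorrow',
        'win cash now, free entry',
        'running late, start without me',
        'can you pick up some milk',
        'free ringtone, text now',
    ],
})

# examine the class distribution
print(sms.label.value_counts())

# convert the label to a numerical variable
sms['label_num'] = sms.label.map({'ham': 0, 'spam': 1})

# CountVectorizer expects a 1-D collection of raw text,
# so X is a single column (a Series) rather than a 2-D feature matrix
X = sms.message
y = sms.label_num
print(X.shape, y.shape)

# split X and y into training and testing sets before vectorizing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape, X_test.shape)
```

Splitting before vectorizing means the vocabulary is later learned from the training messages alone, which mirrors how the model will meet unseen text.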

In [None]:
# read file into pandas using a relative path
path = 'data/sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])

In [None]:
# alternative: read file into pandas from a URL
# url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
# sms = pd.read_table(url, header=None, names=['label', 'message'])

In [None]:
# examine the shape


In [None]:
# examine the first 10 rows


In [None]:
# examine the class distribution


In [None]:
# convert label to a numerical variable


In [None]:
# check that the conversion worked


In [None]:
# how to define X and y (from the iris data) for use with a MODEL


In [None]:
# how to define X and y (from the SMS data) for use with COUNTVECTORIZER


In [None]:
# split X and y into training and testing sets


## Part 4: Vectorizing our dataset
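
As a rough sketch of what the cells in this part do, the example below checks that `fit` followed by `transform` gives the same matrix as `fit_transform`, and that the testing data is transformed with the vocabulary learned from the training data. The short message lists are hypothetical stand-ins for `X_train` and `X_test`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical stand-ins for the training and testing messages
X_train = ['free prize, call now', 'see you at lunch', 'call me later']
X_test = ['claim your free prize', 'lunch later?']

# two steps: learn the vocabulary, then create the document-term matrix
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

# one step: fit and transform together
X_train_dtm_one_step = CountVectorizer().fit_transform(X_train)

# both approaches produce the same matrix
print(np.array_equal(X_train_dtm.toarray(), X_train_dtm_one_step.toarray()))
# True

# the testing data is only transformed, never fitted
X_test_dtm = vect.transform(X_test)
print(X_train_dtm.shape, X_test_dtm.shape)
```

Both matrices have the same number of columns, because the columns are defined by the training vocabulary.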

In [None]:
# instantiate the vectorizer
vect = CountVectorizer()

In [None]:
# learn training data vocabulary, then use it to create a document-term matrix


In [None]:
# equivalently: combine fit and transform into a single step


In [None]:
# examine the document-term matrix


In [None]:
# transform testing data (using fitted vocabulary) into a document-term matrix


## Part 5: Building and evaluating a model

We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

![](./figure/naive_bayes.png)
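
A minimal sketch of training and evaluating multinomial Naive Bayes on token counts follows. The tiny corpus and its labels (1 for spam, 0 for ham) are invented for illustration, so the resulting accuracy says nothing about performance on the real SMS data.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical training and testing messages (1 = spam, 0 = ham)
train_messages = [
    'free prize waiting, call now',
    'win cash now, free entry',
    'are we still on for lunch',
    'call me when you get home',
    'free ringtone, text now',
    'see you at the office tomorrow',
]
train_labels = [1, 1, 0, 0, 1, 0]
test_messages = ['free cash prize', 'see you at lunch']
test_labels = [1, 0]

# vectorize: fit on the training data, transform both sets
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(train_messages)
X_test_dtm = vect.transform(test_messages)

# train the model and make class predictions
nb = MultinomialNB()
nb.fit(X_train_dtm, train_labels)
y_pred_class = nb.predict(X_test_dtm)

# accuracy: the fraction of test messages classified correctly
print(metrics.accuracy_score(test_labels, y_pred_class))

# confusion matrix: rows are actual classes, columns are predicted classes
cm = metrics.confusion_matrix(test_labels, y_pred_class, labels=[0, 1])
print(cm)
```

In a notebook, prefixing the `nb.fit(...)` line with the IPython magic `%time` reports how long training takes.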

In [None]:
# import and instantiate a Multinomial Naive Bayes model


In [None]:
# train the model using X_train_dtm (timing it with an IPython "magic command")


In [None]:
# make class predictions for X_test_dtm


In [None]:
# calculate accuracy of class predictions


In [None]:
# print the confusion matrix (rows are actual classes, columns are predicted classes)
from sklearn import metrics
metrics.confusion_matrix(y_test, y_pred_class)

In [None]:
# plot the confusion matrix as an annotated heatmap
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# pair each actual label with its predicted label
df = pd.DataFrame({"y_Actual": y_test, "y_Predicted": y_pred_class})

# cross-tabulate actual vs. predicted classes
confusion_matrix = pd.crosstab(
    df["y_Actual"], df["y_Predicted"], rownames=["Actual"], colnames=["Predicted"]
)

# fmt="d" prints the cell counts as plain integers
sns.heatmap(confusion_matrix, annot=True, fmt="d")
plt.show()

In [None]:
# print message text for the false positives (ham incorrectly classified as spam)


In [None]:
# print message text for the false negatives (spam incorrectly classified as ham)


In [None]:
# example false negative


## Part 6: Comparing models

We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):

> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

![](./figure/logistic_regression.jpg)
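
The comparison in this part can be sketched by training both models on the same document-term matrix. The corpus below is made up for illustration (1 for spam, 0 for ham), and `LogisticRegression` is used with its default parameters, which is an assumption rather than a tuned choice.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# hypothetical training and testing messages (1 = spam, 0 = ham)
train_messages = [
    'free prize waiting, call now',
    'win cash now, free entry',
    'are we still on for lunch',
    'call me when you get home',
    'free ringtone, text now',
    'see you at the office tomorrow',
]
train_labels = [1, 1, 0, 0, 1, 0]
test_messages = ['free cash prize', 'see you at lunch']
test_labels = [1, 0]

vect = CountVectorizer()
X_train_dtm = vect.fit_transform(train_messages)
X_test_dtm = vect.transform(test_messages)

# fit both models on identical inputs so that the comparison is fair
accuracies = {}
for name, model in [('naive_bayes', MultinomialNB()),
                    ('logistic_regression', LogisticRegression())]:
    model.fit(X_train_dtm, train_labels)
    y_pred_class = model.predict(X_test_dtm)
    accuracies[name] = metrics.accuracy_score(test_labels, y_pred_class)
print(accuracies)

# logistic regression also exposes predicted probabilities for each class;
# column 1 holds the probability of class 1 (spam)
logreg = LogisticRegression().fit(X_train_dtm, train_labels)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
print(y_pred_prob)
```

On a real dataset, timing both `fit` calls is also worthwhile, since training speed is one of the practical differences between the two models.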

In [None]:
# import and instantiate a logistic regression model


In [None]:
# train the model using X_train_dtm


In [None]:
# make class predictions for X_test_dtm


In [None]:
# calculate accuracy


## Part 7: Examining a model for further insight

We will examine our **trained Naive Bayes model** to calculate the approximate **"spamminess" of each token**.

In [None]:
# store the vocabulary of X_train


In [None]:
# examine the first 50 tokens


In [None]:
# examine the last 50 tokens


In [None]:
# Naive Bayes counts the number of times each token appears in each class


In [None]:
# rows represent classes, columns represent tokens


In [None]:
# number of times each token appears across all HAM messages


In [None]:
# number of times each token appears across all SPAM messages


In [None]:
# create a DataFrame of tokens with their separate ham and spam counts


In [None]:
# examine 5 random DataFrame rows


In [None]:
# Naive Bayes counts the number of observations in each class


Before we can calculate the "spamminess" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**.
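
One way to sketch the whole calculation is shown below, using the fitted model's `feature_count_` and `class_count_` attributes. The five messages are invented for illustration, and the column name `spam_ratio` follows the comments in the cells of this part.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical corpus: 3 ham messages (0) and 2 spam messages (1)
messages = [
    'see you at lunch',
    'call me later',
    'are you home yet',
    'win a free prize now',
    'free cash call now',
]
labels = [0, 0, 0, 1, 1]

vect = CountVectorizer()
dtm = vect.fit_transform(messages)
nb = MultinomialNB().fit(dtm, labels)

# feature_count_ has one row per class (row 0 = ham, row 1 = spam)
# and one column per token, in vocabulary order
token_names = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
tokens = pd.DataFrame(
    {'ham': nb.feature_count_[0], 'spam': nb.feature_count_[1]},
    index=token_names,
)

# add 1 to every count so that no token has a zero count in either class
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1

# divide by the number of messages in each class to correct for class imbalance
tokens['ham'] = tokens.ham / nb.class_count_[0]
tokens['spam'] = tokens.spam / nb.class_count_[1]

# ratio of spam frequency to ham frequency for each token
tokens['spam_ratio'] = tokens.spam / tokens.ham

print(tokens.sort_values('spam_ratio', ascending=False))
print(tokens.loc['free', 'spam_ratio'])
```

For example, "free" appears twice in spam and never in ham, so its smoothed frequencies are 3/2 and 1/3, giving a ratio of about 4.5. Without the added 1, the ham frequency would be zero and the ratio undefined.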

In [None]:
# add 1 to ham and spam counts to avoid producing zero probability


In [None]:
# convert the ham and spam counts into frequencies


In [None]:
# calculate the ratio of spam-to-ham for each token


In [None]:
# examine the DataFrame sorted by spam_ratio
# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier


In [None]:
# look up the spam_ratio for a given token
