# Text Classification with Supervised Machine Learning

The goal of text classification is to categorize texts into any number of predefined categories. This method is most similar to traditional content analysis, or text coding, in that it does the same thing as a team of trained coders: place texts into categories. Unlike regression, the outcome in text classification must be categorical. 

Using supervised machine learning, we give an algorithm a dataset of training examples that say "here are specific features, and this is the category it belongs to". The algorithm trains a model to recognize important features and determine the category of an object (for us, a document); this model can then be used to predict the class of a new object given its known features. 

To train and evaluate our models, we will use [`scikit-learn`](http://scikit-learn.org/stable/supervised_learning.html), which provides a full machine learning pipeline and many convenient functions in Python.

We will apply supervised machine learning to a corpus distributed by Ted Underwood and Jordan Sellers in support of their own [literary historical study](https://tedunderwood.com/2015/05/18/how-quickly-do-literary-standards-change/) of nineteenth- and early-twentieth-century volumes of poetry. Some poems were reviewed in prestigious magazines, while others weren't; being reviewed is our outcome variable. Even a negative review indicates valuable, critical engagement, so predicting whether a poem is reviewed will tell us something about the vocabulary of literary prestige.

## Learning Goals
* Get comfortable with the basic vocabulary of text classification
* Understand the intuition behind supervised machine learning
* Learn how to implement logistic regression and its basic mechanics
* Learn how to test for accuracy
* Use scikit-learn to identify important features for each category
* Gain foundational knowledge for continued learning

## Outline

* [Introduction](#intro)
    * [Text classification in action](#example)
    * [Text classification basics](#basics)
* [Data preparation](#preparation)
    * [Import modules](#import)
    * [Read and preprocess data](#preprocess)
    * [Divide data into training and test sets](#split)
    * [Vectorize texts](#vectorize)
* [Classification with supervised machine learning](#supervised)
    * [Train logistic classifier model](#model)
    * [Prediction on new data](#prediction)
    * [Identify features](#features)

## Key Terms

* *supervised machine learning* :
    * Building a model to infer objects' classifications from labeled training data based on input variables, known as features. Learning a function that maps features to outputs.
* *model training* :
    * Using a machine learning algorithm to train a classifier to predict categories on unseen sets
* *features*:
    * Way of representing the object that will be classified. For images, features are often pixels. For text, features are usually word counts or weighted word counts, but they can also be things like a word's part of speech, proportion of capitals, or specific words that are used.
* *unsupervised machine learning*:
    * Learning patterns or structure across a set of variables through their statistical associations. Uses include discovering latent clusters or dimensions, reducing dimensionality, outlier detection, probability density estimation.
* *loss function*:
    * Machine learning models seek to minimize this function, which measures model effectiveness at estimating the relationship between an input (features) and output variable. This is typically expressed as a difference between the predicted value and the actual value. 
* *text vectorization*:
    * Converting natural language documents into numbers that can be fed into a classifier. Usually this means converting a corpus into a document-term matrix.
* *document-term matrix (DTM)*:
    * A common text vectorization method: essentially a big table where the rows are articles, the columns are words, and each cell indicates the number of times a given word appears in a given article. The cells can be either raw counts or weighted proportions, as with term frequency-inverse document frequency (TF-IDF) weighting, which gives a word greater weight both when it is more frequent in a text AND when it is rare across the corpus.
* *train-test split*:
    * Dividing labeled data into a training set and a test set. Often done once in supervised machine learning model training, but can be done several times independently as in cross-validation (we'll see this tomorrow)
* *training set*:
    * A selection of labeled data that is used to train the machine learning algorithm
* *test set*:
    * A selection of labeled data that is used to test the accuracy of the machine learning algorithm
* *unseen set*:
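    * A selection of *unlabeled* data; the trained model predicts the label for these data. (The term *holdout set* is sometimes used here too, though it more commonly refers to labeled data withheld from training for evaluation, like the test set.)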
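
The cross-validation idea mentioned under *train-test split* can be sketched in plain Python. This is only an illustration of how the data gets divided several times; the function name `k_fold_indices` is made up for this example, and in practice you would shuffle the data first and let `scikit-learn` handle the bookkeeping.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_items=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each item lands in exactly one test fold, so every labeled example is used for evaluation once and for training in the other rounds.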

# Introduction<a id='intro'></a>

## Text classification in action<a id='example'></a>

Imagine that you work at [YouTube](https://www.youtube.com/) (if you haven't heard of it, YouTube is a video-sharing website). Your job is to remove comments on videos that are spam (unsolicited and inappropriate comments). You look through each video and read the comments yourself, deciding which are spam and which are not spam. Perhaps you see comments like those below. Which would you consider to be spam and which not spam?

- _Hey @dancer317, love ur videos so much! Thanks for all the tips on dancing!_
- _OUR  LASER PRINTER/FAX/COPIER TONER CARTRIDGE PRICES NOW AS LOW AS 39 DOLLARS. SPECIALS WEEKLY ON ALL LASER PRINTER SUPPLIES. WE CARRY MOST ALL LASER PRINTER CARTRIDGES, FAX SUPPLIES AND COPIER TONERS AT WAREHOUSE PRICES_
- _I'm not sold on your first point about crossing national boundaries, but I see what you mean about non-economic alternatives._
- _Some of the most beautiful women in the world bare it all for you. Denise Richards, Britney  Spears, Jessica Simpson, and many more. CLICK HERE FOR NUDE CELEBS_

How did you decide which were spam and which weren't? Maybe one thing you noted was the high number of words in all capitals. The topics can also give you a clue, as the spam-like comments talk about selling things and nudity, which are often found in spam comments.

However you decided, we can think about the task you were doing like this:

<img src='../assets/human-classification.jpg' />

You took a comment written in English, and you classified it into one of two classes: spam or not spam. This is text classification, performed by a human. Wouldn't it be nice to have a computer do this for you? That would look like this: 

<img src='../assets/computer-classification.jpg' />

How are we going to do this? We could count the number of times each YouTube comment mentions nudity or tries to sell something, and we could measure the proportion of capital letters. Using this approach, we would get two numbers for each comment, one for each of these _features_. We could also use your human judgements in a third column telling us whether that comment is spam or not.

| Comment                                                 | Selling or nudity | Proportion capital letters | Is it spam? |
|---------------------------------------------------------|-------------------|----------------------------|-------------|
| Hey @dancer317, love ur videos so much! Thanks for ...  | 0                 | 0.1                        | No          |
| OUR LASER PRINTER/FAX/COPIER TONER CARTRIDGE PRICES ... | 4                 | 1.0                        | Yes         |
| I'm not sold on your first point ...                    | 1                 | 0.05                       | No          |
|  Some of the most beautiful women in the world ...      | 3                 | 0.15                       | Yes         |
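
As a rough sketch of how the two feature columns in this table might be computed, the snippet below counts keyword hits and measures the share of capital letters. The keyword list `SPAM_KEYWORDS` and the function `comment_features` are hypothetical stand-ins chosen for this example; they will not reproduce the exact numbers in the table.

```python
import re

# Hypothetical keyword list standing in for "selling or nudity" vocabulary
SPAM_KEYWORDS = {"prices", "dollars", "specials", "warehouse", "nude", "click"}

def comment_features(comment):
    """Return (keyword_count, proportion_capitals) for one comment."""
    words = re.findall(r"[a-z']+", comment.lower())
    keyword_count = sum(1 for word in words if word in SPAM_KEYWORDS)
    letters = [ch for ch in comment if ch.isalpha()]
    capitals = [ch for ch in letters if ch.isupper()]
    proportion_capitals = len(capitals) / len(letters) if letters else 0.0
    return keyword_count, proportion_capitals

print(comment_features("CLICK HERE FOR NUDE CELEBS"))
print(comment_features("love ur videos so much!"))
```

Hand-picked features like these are easy to interpret, but they depend on someone guessing the right keywords in advance.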

We can treat these two numbers as geometric coordinates and plot them, with spam comments in red and non-spam comments in green, like so:

<img src='../assets/classification-no-line.jpg' />

A classifier's job is to find a boundary, here a straight line, that separates the two classes as well as possible:

<img src='../assets/classification-with-line.jpg' />
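
To make the idea of a separating line concrete, here is a minimal sketch that classifies the four comments from the table by checking which side of a line they fall on. The weights and bias are hypothetical values picked by hand for this illustration; a real classifier would learn them from the training data.

```python
# Feature values copied from the table: (selling-or-nudity count, proportion capitals)
comments = {
    "Hey @dancer317, love ur videos ...": (0, 0.1),
    "OUR LASER PRINTER/FAX/COPIER ...": (4, 1.0),
    "I'm not sold on your first point ...": (1, 0.05),
    "Some of the most beautiful women ...": (3, 0.15),
}

WEIGHTS = (1.0, 2.0)  # hypothetical: one weight per feature
BIAS = -2.5           # hypothetical: shifts the line away from the origin

def is_spam(features):
    """A comment is spam if it lies on the positive side of the line."""
    score = sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS
    return score > 0

for text, features in comments.items():
    print(text, "->", "spam" if is_spam(features) else "not spam")
```

With these hand-set values, the rule agrees with the human labels in the table for all four comments.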

## Text classification basics<a id='basics'></a>

Text classification requires labeled text, or text that is already categorized into predefined categories. In some cases, this is built into the data; in others, the labels are assigned by hand. Once we have a good number of labeled texts, usually between 200 and 500, we can use supervised machine learning algorithms to train a model to recognize the categories and place the remaining, un-coded texts into a category. This method has two benefits: (1) It allows us to scale our coding up almost indefinitely, and (2) it identifies what *features* (in our case, words) are most defining of each category. This can help us learn more about the content of our categories.

Text classification involves two primary tasks:
- **Turning natural language into numbers.** (This is called _vectorization_.)
- **Training a classifier to use those numbers and distinguish between the classes.**

This is distinct from inductive natural language processing methods such as topic modeling (which discovers the categories making up texts rather than defining them from the outset), clustering texts based on language patterns, or dimensionality reduction. If you're testing hypotheses about buckets or categories of texts, you'll usually use classification; if you're exploring in the dark, you'll use unsupervised machine learning. Text categories to classify can also include genre, language, author, or [affective states][1].

[1]: https://en.wikipedia.org/wiki/Affect_(psychology)

Many applied natural language processing problems can be tackled as text classification:

- [Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis)
- Genre classification
- Language identification
- Authorship attribution
- Is this document relevant to this legal case?
- Is the patient in need of urgent care?

# Data preparation<a id='preparation'></a>

## Import modules<a id='import'></a>

In [None]:
import pandas
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

#scikit-learn is a huge library. We import what we need.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split #shortcut for dividing into train and test data
from sklearn.metrics import confusion_matrix #shows confusion matrix--false/true positives/negatives
from sklearn.metrics import accuracy_score #to assess the accuracy of the algorithm
from sklearn.model_selection import cross_val_score #to compute cross validation for assessment purposes
from sklearn.svm import LinearSVC #Linear Support Vector Classifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV #Logistic regression classifier w/ and w/o cross-validation
from sklearn.naive_bayes import MultinomialNB #Naive Bayes classifier

## Read and preprocess data<a id='preprocess'></a>

In [None]:
#read our texts and turn them into lists
import os
review_path = 'poems/reviewed/'
random_path = 'poems/random/'
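review_files = sorted(os.listdir(review_path)) #sort so the file order is the same on every machine
random_files = sorted(os.listdir(random_path))

def read_text(path):
    #open the file, read it, and close it again
    with open(path, encoding='utf-8') as f:
        return f.read()

review_texts = [read_text(os.path.join(review_path, file_name)) for file_name in review_files]
random_texts = [read_text(os.path.join(random_path, file_name)) for file_name in random_files]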

review_texts[0] #notice the strange output here. These poems are saved in a bag of words format

In [None]:
#transform and concat these lists into a Pandas dataframe
df1 = pandas.DataFrame(review_texts, columns = ['body'])
df1['label'] = "review"
df2 = pandas.DataFrame(random_texts, columns = ['body'])
df2['label'] = "random"
df = pandas.concat([df1,df2], ignore_index=True) #renumber rows so index values are not duplicated
df.head()

### Challenge

Output some summary statistics for this dataframe. 
- How many poems have the review label, and how many bear the random label? 
- What is the total number of words in each category? 
- What is the average number of words per poem in each category?

In [None]:
# your code here

## Divide data into training and test sets<a id='split'></a>

To check for overfitting, we evaluate our classifier on data it was not trained on. To accomplish this, we'll split the data into training and test sets: we'll train on the first 500 poems, and test the accuracy on the rest.

In [None]:
#randomize our rows; frac=1 shuffles every row without hard-coding the corpus size
df = df.sample(frac=1, random_state=0)
df.head(10)

In [None]:
#create two new dataframes
df_train = df[:500]
df_test = df[500:]
df_train['label'].value_counts()
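
The shuffle-then-slice logic of this split can be sketched without pandas. This is a simplified stand-in, assuming 720 placeholder items and a hypothetical helper `shuffle_and_split`; it only shows why shuffling first matters (otherwise all the reviewed poems would come before the random ones).

```python
import random

def shuffle_and_split(items, n_train, seed=0):
    """Shuffle a copy of the items, then cut it into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    return shuffled[:n_train], shuffled[n_train:]

poems = [f"poem_{i}" for i in range(720)]  # placeholder documents
train, test = shuffle_and_split(poems, n_train=500)
print(len(train), len(test))
```

Because the two parts never overlap, accuracy measured on the test part reflects how the model handles texts it has not seen.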

## Vectorize texts<a id='vectorize'></a>

Next, we turn the text into numbers for our classifier. We will use a "bag of words" approach to create our features: frequency counts of all the words that appear in a text. The "bag" name is because we ignore the order of the words; we just care about what words are in the text, as if we threw all the words into a bag and shook them up. To do this, we use `scikit-learn`'s `CountVectorizer` to build a document-term matrix (DTM). You can think of a DTM as a big table where the rows are articles, the columns are words, and each cell indicates the number of times a given word appears in a given article. (With `binary=True`, as in the code below, each cell instead records only whether the word appears at all.)

In [None]:
#transform the 'body' column into a document term matrix
#binary=True records whether each word appears (1 or 0) instead of how often it appears
countvec = CountVectorizer(stop_words = 'english', min_df = 1, binary=True)

training_dtm_tf = countvec.fit_transform(df_train.body)
test_dtm_tf = countvec.transform(df_test.body)

#create an array for labels
training_labels = df_train.label
test_labels = df_test.label
test_labels.value_counts()
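
To see what a document-term matrix looks like on a tiny scale, here is a hand-rolled sketch using only the standard library. It is a simplification, not a reimplementation of `CountVectorizer`: it splits on whitespace and keeps every word, with no stop-word removal. The function name `build_dtm` is made up for this example.

```python
from collections import Counter

def build_dtm(documents):
    """Return (vocabulary, rows) where each row holds word counts for one document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, dtm = build_dtm(["the rose is red", "the violet is blue the sky is blue"])
# Presence/absence version, analogous to passing binary=True
binary_dtm = [[1 if count else 0 for count in row] for row in dtm]

print(vocabulary)
for row in dtm:
    print(row)
```

Every document becomes a row of the same length, which is what lets a classifier compare texts of different sizes.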

# Classification with supervised machine learning<a id='supervised'></a>

## Train logistic classifier model<a id='model'></a>

Now that we've turned our data into numbers, we're ready to feed it into a classifier. Let's start with one of the more common algorithms for classification, called [logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression). It's good to know what our code is doing, so let's go over how this works at a high level.

A logistic model describes the probability of the positive class ($y=1$) with a sigmoid function of the form:
$f(X)=P(y=1|X)=\frac{\exp(X'\theta)}{1+\exp(X'\theta)}$. Once $\theta$ (a vector of word _weights_ or _loadings_) is estimated, we can predict outcomes $\hat{f}$ conditional on the observed word counts $X$. If $\hat{f}>0.5$, the observation is classified as positive. This is a _linear classifier_ because the decision boundary is defined by $\frac{\exp(X'\theta)}{1+\exp(X'\theta)}=0.5$, which after rearranging and taking logs is equivalent to $X'\theta=0$, a linear function of the features $X$.
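
A small numeric sketch may help here. It uses the standard sigmoid $\sigma(z)=1/(1+\exp(-z))$ with $z=X'\theta$, so that a positive weight pushes a document toward the positive class. The weights and the document vector are invented for illustration, and the function names are not part of `scikit-learn`.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_positive_probability(features, weights, intercept=0.0):
    """Probability of the positive class for one document."""
    z = intercept + sum(x * w for x, w in zip(features, weights))
    return sigmoid(z)

weights = [1.5, -2.0, 0.0]  # hypothetical weights for three words
doc = [1, 0, 1]             # which of the three words appear in the document

p = predict_positive_probability(doc, weights)
label = "positive" if p > 0.5 else "negative"
print(round(p, 4), label)
```

Note that `sigmoid(0)` is exactly 0.5, which is why the decision boundary sits where the weighted sum of the features equals zero.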

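Logistic regression is estimated through penalized maximum likelihood: the fitted $\theta$ minimizes the negative log-likelihood of the training labels plus a penalty on large weights (scikit-learn's `LogisticRegression` applies an L2 penalty by default). This quantity is a form of _loss (cost) function_: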

"A cost function is a measure of how wrong the model is in terms of its ability to estimate the relationship between X and y. This is typically expressed as a difference or distance between the predicted value and the actual value. ... The objective of a ML model, therefore, is to find parameters, weights or a structure that minimises the cost function." <br/>-Conor McDonald
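
As a hedged illustration of such a cost function, the sketch below computes a mean log loss plus an L2 penalty for a toy dataset. It is not the exact objective `scikit-learn` optimizes (which scales the terms differently); the name `penalty_strength` and the data are assumptions made for this example, with labels coded as 0 and 1.

```python
import math

def penalized_log_loss(features, labels, weights, penalty_strength=0.1):
    """Mean negative log-likelihood plus an L2 penalty on the weights."""
    total = 0.0
    for x, y in zip(features, labels):
        z = sum(xi * wi for xi, wi in zip(x, weights))
        p = 1.0 / (1.0 + math.exp(-z))  # predicted probability that y == 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    l2_penalty = 0.5 * penalty_strength * sum(w * w for w in weights)
    return total / len(labels) + l2_penalty

X = [[1, 0], [0, 1], [1, 1]]  # toy documents over a two-word vocabulary
y = [1, 0, 1]                 # toy labels

loss_at_zero = penalized_log_loss(X, y, [0.0, 0.0])
loss_fitted = penalized_log_loss(X, y, [2.0, -1.0])
print(loss_at_zero, loss_fitted)
```

Weights that explain the labels lower the loss relative to all-zero weights, while a larger `penalty_strength` makes big weights more expensive and so discourages overfitting.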

In [None]:
#define a container for our chosen algorithm, in this case logistic regression
#one object contains all the functions for fitting data, making predictions, and storing parameters
logit = LogisticRegression()

#fit a model on our training set
logit.fit(training_dtm_tf, training_labels)

#predict the labels on the test set using the trained model
predictions_logit = logit.predict(test_dtm_tf) 
predictions_logit

We can use the built-in `accuracy_score` function to calculate the accuracy of our classifier. Basically, this number represents the proportion of texts an algorithm correctly classifies (we will go over this more tomorrow). 

In [None]:
accuracy_score(test_labels, predictions_logit) #true labels first, then predictions
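
Accuracy is simple enough to compute by hand, which can make the number less mysterious. The sketch below uses made-up labels and also tallies each (true, predicted) pair, which is the information a confusion matrix summarizes. The helper names are hypothetical.

```python
from collections import Counter

def accuracy(true_labels, predicted_labels):
    """Proportion of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

true_labels = ["review", "random", "review", "random", "review"]  # made-up data
predicted = ["review", "random", "random", "random", "review"]

print(accuracy(true_labels, predicted))
print(confusion_counts(true_labels, predicted))
```

Here one reviewed poem is mistaken for a random one, so four of five predictions are correct.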

In [None]:
#let's see if tf-idf weighting improves the accuracy
tfidfvec = TfidfVectorizer(stop_words = 'english', min_df = 1, binary=True)
training_dtm_tfidf = tfidfvec.fit_transform(df_train.body)
test_dtm_tfidf = tfidfvec.transform(df_test.body)

#fit a separate model so that `logit` stays trained on the count-based features
logit_tfidf = LogisticRegression()
logit_tfidf.fit(training_dtm_tfidf, training_labels)
predictions_tfidf = logit_tfidf.predict(test_dtm_tfidf)
accuracy_score(test_labels, predictions_tfidf)
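
For intuition about what TF-IDF weighting does, here is a small standard-library sketch. It follows the smoothed formula that the `TfidfVectorizer` documentation gives for its default settings, $\text{idf}(t)=\ln\frac{1+n}{1+\text{df}(t)}+1$ followed by L2 normalization of each row, but it is a teaching aid on toy data rather than a drop-in replacement. The function name `tfidf_rows` is invented.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return (vocabulary, idf, rows) with L2-normalized TF-IDF weights."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({w for doc in tokenized_docs for w in doc})
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(w for doc in tokenized_docs for w in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocabulary, idf, rows

docs = [["rose", "rose", "thorn"], ["rose", "moon"]]  # toy tokenized documents
vocabulary, idf, rows = tfidf_rows(docs)
print(idf)
print(rows)
```

In the second document, "moon" and "rose" each appear once, yet "moon" receives the larger weight because it occurs in fewer documents across the corpus.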

### Challenge

A common practice in machine learning is to quickly try a bunch of algorithms, see which one works the best, and optimize it. In this spirit...

**Part 1**

Apply these two additional algorithms without bothering to learn how they work: 
- Linear Support Vector Classification (`LinearSVC()`) and 
- Multinomial Naive Bayes (`MultinomialNB()`)
    
To avoid a long delay, with `LinearSVC()` set `max_iter` to 20. 

Train each of these on the training set and evaluate accuracy on the test set.

In [None]:
# your code here

**Part 2**

Try to improve the accuracy of either of these models by changing a few model parameters or the vectorizer function. (Decreases in accuracy are also informative.) This is basically a brute-force method of model optimization, which we'll learn more about tomorrow.

In [None]:
# your code here

## Prediction on new data<a id='prediction'></a>

Often what we want to do next is predict the label for unlabeled texts. Let's predict the label for two poems where we do not know the classification.

In [None]:
dickinson_canonic = """Because I could not stop for Death – 
He kindly stopped for me –  
The Carriage held but just Ourselves –  
And Immortality.

We slowly drove – He knew no haste
And I had put away
My labor and my leisure too,
For His Civility – 

We passed the School, where Children strove
At Recess – in the Ring –  
We passed the Fields of Gazing Grain –  
We passed the Setting Sun – 

Or rather – He passed us – 
The Dews drew quivering and chill – 
For only Gossamer, my Gown – 
My Tippet – only Tulle – 

We paused before a House that seemed
A Swelling of the Ground – 
The Roof was scarcely visible – 
The Cornice – in the Ground – 

Since then – ‘tis Centuries – and yet
Feels shorter than the Day
I first surmised the Horses’ Heads 
Were toward Eternity – """


anthem_patriotic = """O! say can you see, by the dawn's early light,
What so proudly we hailed at the twilight's last gleaming,
Whose broad stripes and bright stars through the perilous fight,
O'er the ramparts we watched, were so gallantly streaming?
And the rockets' red glare, the bombs bursting in air,
Gave proof through the night that our flag was still there;
O! say does that star-spangled banner yet wave
O'er the land of the free and the home of the brave?"""

In [None]:
# Transform these into DTMs with the same feature-columns as previously
unknown_dtm = countvec.transform([dickinson_canonic,anthem_patriotic]).toarray()

In [None]:
# Return binary classification
logit.predict(unknown_dtm)

In [None]:
# Return probability of classification.
# The logistic (sigmoid) function assigns a probability of membership in each category.
# Just how confident is our classifier of its predictions?

logit.predict_proba(unknown_dtm)

In [None]:
#Let's zip this together with the name of the poems to make sense of the output
list(zip(['dickinson', 'anthem'], logit.predict(unknown_dtm), logit.predict_proba(unknown_dtm)))
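
The link between the probabilities and the predicted label can be sketched as follows. `scikit-learn` orders the probability columns to match the classifier's `classes_` attribute, which stores the labels in sorted order; the probability values below are made up, and `label_from_probabilities` is a hypothetical helper.

```python
def label_from_probabilities(probabilities, classes):
    """Pick the class whose probability is largest."""
    best_index = max(range(len(classes)), key=lambda i: probabilities[i])
    return classes[best_index]

classes = ["random", "review"]                   # sorted, as in logit.classes_
made_up_probabilities = [[0.2, 0.8], [0.65, 0.35]]

labels = [label_from_probabilities(row, classes) for row in made_up_probabilities]
print(labels)
```

With two classes, picking the largest probability is the same as asking whether the probability of one class exceeds 0.5.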

### Challenge

Import and process the 'canonic' (albeit unreviewed) volumes of poetry. Use the poetry classifier to predict whether they might have been reviewed.

What do you think of the output? Does it make sense?

In [None]:
# your code here

## Identify features<a id='features'></a>

We can use supervised machine learning to classify unseen documents using the above code. But we can also use it to learn more about the content of each category, by extracting the most defining features of this category. So even if we do not have unseen text, we can use this method to better understand given categories (for example, canonized and non-canonized text). 

In [None]:
# Hand-waving the underlying statistics here...
nb = MultinomialNB() # naive bayes works best for this math
nb.fit(training_dtm_tf, training_labels) # fit model

def most_informative_features(text_class, vectorizer = countvec, classifier = nb, top_n = 20):
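    """Returns feature name and ratio of word probabilities (this class vs. the other) for a given class"""

    #get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    feature_names = vectorizer.get_feature_names_out()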
    class_index = np.where(classifier.classes_==(text_class))[0][0]
    
    class_prob_distro = np.exp(classifier.feature_log_prob_[class_index])
    alt_class_prob_distro = np.exp(classifier.feature_log_prob_[1 - class_index])
    
    odds_ratios = class_prob_distro / alt_class_prob_distro
    odds_with_fns = sorted(zip(odds_ratios, feature_names), reverse = True)
    
    return odds_with_fns[:top_n]

In [None]:
# Look at most informative features for poems that were reviewed
most_informative_features('review')

In [None]:
most_informative_features('random')
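
The ratios returned above can be demystified with a toy calculation. The sketch below estimates smoothed word probabilities per class in the spirit of multinomial Naive Bayes with add-one smoothing (`alpha=1.0`), then divides one class's probabilities by the other's. The word counts are invented, and the helper name is hypothetical.

```python
def smoothed_feature_probs(word_counts, vocabulary, alpha=1.0):
    """Estimate P(word | class) with additive smoothing."""
    total = sum(word_counts.get(w, 0) for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {w: (word_counts.get(w, 0) + alpha) / denom for w in vocabulary}

vocabulary = ["eternity", "love", "flag"]
review_counts = {"eternity": 8, "love": 10, "flag": 0}  # made-up counts
random_counts = {"eternity": 1, "love": 10, "flag": 7}  # made-up counts

review_probs = smoothed_feature_probs(review_counts, vocabulary)
random_probs = smoothed_feature_probs(random_counts, vocabulary)

# Ratio above 1: the word is relatively more typical of the 'review' class
ranked = sorted(((review_probs[w] / random_probs[w], w) for w in vocabulary), reverse=True)
print(ranked)
```

A word used equally often in both classes ends up with a ratio near 1 and tells us little, while smoothing keeps words that never occur in one class from producing a division by zero.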

### Challenge

Look at the top fifty informative words for each category. 

What kinds of patterns do you notice?

In [None]:
# your code here