# Text Classification with Supervised Machine Learning

The goal of text classification is to classify texts into any number of predefined categories. This method is most similar to traditional content analysis, or text coding, in that it does the same thing as a team of trained coders: place texts into categories. 

It does require labeled text, that is, text that has already been assigned to the predefined categories. In some cases, like ours today, the labels are built into the data. Labeling text can also be done by hand, as was the case in the reading for today. Once we have a good number of labeled texts, usually between 200 and 500, we can use supervised machine learning algorithms to train a computer to recognize the categories and place the remaining, un-coded texts into a category. This method has two benefits: (1) it allows us to scale our hand-coding up almost indefinitely, and (2) it identifies which *features* (in our case, words) are most defining of each category. This can help us learn more about the content of our categories.

We will apply supervised machine learning methods to a corpus distributed by Ted Underwood and Jordan Sellers in support of their own literary historical study of nineteenth- and early-twentieth-century volumes of poetry that were reviewed in prestigious magazines versus not at all. (The idea being that even a negative review indicates valuable, critical engagement.)

In essence, our task will be to learn the vocabulary of literary prestige, building on the Canon/Archive paper we read early in the course.

### Learning Goals
* Get you comfortable with the basic vocabulary around supervised machine learning and text classification
* Understand the intuition behind supervised machine learning
* Learn how to implement a few key supervised machine learning algorithms
* Understand how to test for accuracy
* Use scikit-learn to identify important features for each category
* Be equipped with foundational knowledge so you can continue to learn, either on your own or by taking more advanced math, machine learning, or text analysis courses

### Outline
0. Reading in and pre-processing data
1. Training and test sets
2. Supervised Machine Learning Classification
3. Cross Validation
4. Prediction
5. Identifying Features


### Key Terms

* *supervised machine learning* :
    * a machine learning task of inferring a classification from labeled training data.
* *features*:
    * a way of representing the object that will be classified. For images, features are often pixels. For text, features are usually word counts or weighted word counts, but they can also be things like a word's part of speech.
* *training set*:
    * a selection of labeled data that is used to train the machine learning algorithm
* *test set*:
    * a selection of labeled data that is used to test the accuracy of the machine learning algorithm
* *unseen set*:
    * a selection of *unlabeled* data - the machine learning algorithm predicts the label for these data
* *accuracy*:
    * the proportion of texts the algorithm classifies correctly
* *cross validation*:
    * a way to assess how the algorithm will perform on an unseen data set
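
To make these terms concrete, here is a minimal sketch of the whole workflow on a handful of made-up sentences. The texts, the labels ('poem' and 'finance'), and all variable names are hypothetical and chosen only for illustration; the real corpus is introduced below.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hypothetical toy data, invented for illustration only
train_texts = ["rose moon heart", "heart rose love",
               "stock market price", "price profit market"]   # training set
train_labels = ["poem", "poem", "finance", "finance"]
test_texts = ["moon love rose", "profit stock price"]          # test set (labeled)
test_labels = ["poem", "finance"]
unseen_texts = ["heart moon", "market profit"]                 # unseen set (unlabeled)

# Features: word counts learned from the training set only
vectorizer = CountVectorizer()
train_dtm = vectorizer.fit_transform(train_texts)

# Supervised learning: fit a classifier on the labeled training data
clf = MultinomialNB().fit(train_dtm, train_labels)

# Accuracy: proportion of test texts classified correctly
test_predictions = clf.predict(vectorizer.transform(test_texts))
toy_accuracy = accuracy_score(test_labels, test_predictions)

# Prediction on texts whose labels we do not know
unseen_predictions = clf.predict(vectorizer.transform(unseen_texts))
print(toy_accuracy, unseen_predictions)
```

The toy categories share no vocabulary, so the task is far easier than anything we will see with real poems.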
    
### Additional Resources:

The [full documentation](http://scikit-learn.org/stable/supervised_learning.html) for supervised machine learning using scikit-learn

[Identifying what types of blog posts are censored in China](http://gking.harvard.edu/publications/how-censorship-china-allows-government-criticism-silences-collective-expression), using supervised machine learning, Gary King, Jennifer Pan, and Margaret E Roberts


[Literary Pattern Recognition](https://lucian.uchicago.edu/blogs/literarynetworks/files/2015/12/LONG_SO_CI.pdf), Hoyt Long, Richard So

[How Quickly Do Literary Standards Change?](https://tedunderwood.com/2015/05/18/how-quickly-do-literary-standards-change/), Ted Underwood, Jordan Sellers


### 0. Reading in and pre-processing data

In [1]:
#first, import the necessary modules
import pandas
import numpy as np
#scikit-learn is a huge library. We import what we need.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC #Linear Support Vector Classifier
from sklearn.naive_bayes import MultinomialNB #Naive Bayes classifier
from sklearn.neighbors import KNeighborsClassifier #nearest neighbors classifier
from sklearn.metrics import accuracy_score #to assess the accuracy of the algorithm
from sklearn.model_selection import cross_val_score #to compute cross validation for assessment purposes



In [2]:
#read our texts and turn them into lists
import os
review_path = '../data/poems/reviewed/'
random_path = '../data/poems/random/'
review_files = os.listdir(review_path)
random_files = os.listdir(random_path)

review_texts = [open(review_path+file_name).read() for file_name in review_files]
random_texts = [open(random_path+file_name).read() for file_name in random_files]

review_texts[0] #notice the strange output here. These poems are saved in a bag of words format

"the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the

In [3]:
#transform and concat these lists into a Pandas dataframe
df1 = pandas.DataFrame(review_texts, columns = ['body'])
df1['label'] = "review"
df2 = pandas.DataFrame(random_texts, columns = ['body'])
df2['label'] = "random"
df = pandas.concat([df1,df2])
df

| | body | label |
|---|---|---|
| 0 | the the the the the the the the the the the th... | review |
| 1 | the the the the the the the the the the the th... | review |
| 2 | the the the the the the the the the the the th... | review |
| 3 | the the the the the the the the the the the th... | review |
| 4 | the the the the the the the the the the the th... | review |
| 5 | the the the the the the the the the the the th... | review |
| 6 | the the the the the the the the the the the th... | review |
| 7 | the the the the the the the the the the the th... | review |
| 8 | the the the the the the the the the the the th... | review |
| 9 | the the the the the the the the the the the th... | review |


In [4]:
##EX: Output some summary statistics for this dataframe. How many poems with the review label, and how many with the random label?
##What is the total number of words in each category? What is the average number of words per poem in each category?

print(df['label'].value_counts())

random    360
review    360
Name: label, dtype: int64


In [5]:
df['tokens'] = df['body'].str.split()
df['tokens'] = df['tokens'].str.len()
grouped = df.groupby('label')
print(grouped['tokens'].sum())
print(grouped['tokens'].mean())

label
random    7069809
review    8260352
Name: tokens, dtype: int64
label
random    19638.358333
review    22945.422222
Name: tokens, dtype: float64


### 1. Divide data into training and test sets

First we need to create a training set and a test set. After shuffling the rows, we'll train on the first 500 poems and test the accuracy on the remaining 220.

In [6]:
#randomize our rows
df = df.sample(720, random_state=0)
df

| | body | label | tokens |
|---|---|---|---|
| 143 | the the the the the the the the the the the th... | random | 9045 |
| 285 | the the the the the the the the the the the th... | random | 4155 |
| 245 | the the the the the the the the the the the th... | review | 54961 |
| 142 | the the the the the the the the the the the th... | review | 18352 |
| 133 | the the the the the the the the the the the th... | random | 9404 |
| 358 | the the the the the the the the the the the th... | review | 16262 |
| 261 | the the the the the the the the the the the th... | random | 30139 |
| 40 | the the the the the the the the the the the th... | review | 15687 |
| 9 | the the the the the the the the the the the th... | random | 3185 |
| 85 | the the the the the the the the the the the th... | review | 30603 |


In [7]:
#create two new dataframes
df_train = df[:500]
df_test = df[500:]
print(df_test['label'].value_counts())
df_train['label'].value_counts()

review    114
random    106
Name: label, dtype: int64


random    254
review    246
Name: label, dtype: int64
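
An alternative to shuffling and slicing by hand is scikit-learn's `train_test_split`, which can also keep the two labels in the same proportion in both sets (the `stratify` argument). The sketch below uses a small hypothetical dataframe, `toy_df`, as a stand-in for ours; the 30% test size is an assumption chosen to roughly mirror the 500/220 split above.

```python
import pandas
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the notebook's dataframe: 10 rows per label
toy_df = pandas.DataFrame({
    "body": ["text %d" % i for i in range(20)],
    "label": ["review"] * 10 + ["random"] * 10,
})

# Shuffle and split in one step; stratify keeps label proportions equal
# in the training and test sets. random_state makes the split repeatable.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.3, stratify=toy_df["label"], random_state=0)

print(toy_train["label"].value_counts())
print(toy_test["label"].value_counts())
```

With stratification, neither set ends up with a lopsided share of one label, which a plain random slice can produce by chance.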

### 2. Supervised Machine Learning Classification

Next we need to create a document-term matrix (DTM) from the poems, and an array containing the classification label for each poem (for us, this column is called 'label').

In [8]:
#transform the 'body' column into a document term matrix
#binary=True: each word counts at most once per poem (presence/absence rather than frequency)
tfidfvec = TfidfVectorizer(stop_words = 'english', min_df = 1, binary=True)
countvec = CountVectorizer(stop_words = 'english', min_df = 1, binary=True)

training_dtm_tf = countvec.fit_transform(df_train.body)
test_dtm_tf = countvec.transform(df_test.body)

#create an array for labels
training_labels = df_train.label
test_labels = df_test.label
test_labels.value_counts()

review    114
random    106
Name: label, dtype: int64
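
To see what a document-term matrix looks like, and what the `binary=True` option changes, here is a small sketch on two made-up documents. The documents and variable names are hypothetical; the vectorizer calls are the same ones used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents, invented for illustration only
toy_docs = ["the rose the rose", "the moon"]

count_vec = CountVectorizer()              # raw word counts
binary_vec = CountVectorizer(binary=True)  # 1 if the word is present, else 0

toy_counts = count_vec.fit_transform(toy_docs).toarray()
toy_binary = binary_vec.fit_transform(toy_docs).toarray()

# Columns follow the alphabetically sorted vocabulary: moon, rose, the
toy_vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(toy_vocab)
print(toy_counts)   # one row per document, one column per word
print(toy_binary)
```

Each row is a document and each column a word. Because our poems are stored as bags of words with every repetition preserved, the binary option throws away frequency information; whether that helps or hurts is an empirical question worth testing.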

In [9]:
#define a container for our chosen algorithm, in this case multinomial naive bayes
nb = MultinomialNB()

#fit a model on our training set
nb.fit(training_dtm_tf, training_labels)

#predict the labels on the test set using the trained model
predictions_nb = nb.predict(test_dtm_tf) 
predictions_nb

array(['random', 'review', 'random', 'random', 'review', 'review',
       'review', 'review', 'random', 'review', 'review', 'random',
       'random', 'random', 'review', 'review', 'random', 'random',
       'review', 'review', 'random', 'review', 'review', 'review',
       'review', 'review', 'review', 'review', 'review', 'random',
       'random', 'random', 'review', 'review', 'review', 'review',
       'random', 'random', 'review', 'random', 'random', 'random',
       'review', 'random', 'random', 'random', 'random', 'review',
       'random', 'review', 'review', 'review', 'review', 'review',
       'review', 'review', 'review', 'review', 'review', 'random',
       'review', 'review', 'review', 'review', 'review', 'review',
       'random', 'review', 'review', 'review', 'review', 'review',
       'random', 'review', 'random', 'random', 'random', 'review',
       'review', 'review', 'random', 'review', 'random', 'random',
       'review', 'review', 'review', 'review', 'review', 'revi

We can use the built-in function "accuracy_score" to calculate the accuracy of our classifier. For binary and multiclass classification, which is our case, this function returns the fraction of test texts whose predicted label (the label the algorithm assigned) exactly matches the true label. A score of 1.0 means every prediction was correct; with two roughly balanced categories like ours, random guessing would score about 0.5.

In [10]:
accuracy_score(predictions_nb, test_labels)

0.75454545454545452
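
The accuracy calculation is simple enough to reproduce by hand. The sketch below uses five made-up labels (hypothetical, not taken from our poems) to show that the score is just the share of matching labels, and adds a confusion matrix to show which kind of mistake was made.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, invented for illustration only
true_labels = np.array(["review", "random", "review", "random", "review"])
predicted = np.array(["review", "random", "random", "random", "review"])

# Accuracy by hand: the share of positions where the labels agree
manual_accuracy = np.mean(true_labels == predicted)
sklearn_accuracy = accuracy_score(true_labels, predicted)
print(manual_accuracy, sklearn_accuracy)

# Rows are true labels, columns are predicted labels, in the order given
toy_confusion = confusion_matrix(true_labels, predicted, labels=["random", "review"])
print(toy_confusion)
```

Here four of five predictions match, and the single error is a 'review' text predicted as 'random'.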

In [None]:
#let's see if tf-idf weighting improves the accuracy
#we fit a separate classifier so that nb, which we reuse below, stays fitted on the countvec features

training_dtm_tfidf = tfidfvec.fit_transform(df_train.body)
test_dtm_tfidf = tfidfvec.transform(df_test.body)
nb_tfidf = MultinomialNB()
nb_tfidf.fit(training_dtm_tfidf, training_labels)
predictions_tfidf = nb_tfidf.predict(test_dtm_tfidf) 
accuracy_score(predictions_tfidf, test_labels)

In [None]:
#let's try a different classifier, LinearSVC
svc = LinearSVC()
svc.fit(training_dtm_tf, training_labels)
predictions_svc = svc.predict(test_dtm_tf) 
accuracy_score(predictions_svc, test_labels)

In [None]:
#k nearest neighbor
knn = KNeighborsClassifier()
knn.fit(training_dtm_tf, training_labels)
predictions_knn = knn.predict(test_dtm_tf) 
accuracy_score(predictions_knn, test_labels)
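
Since every scikit-learn classifier shares the same fit/predict interface, the comparison above can also be written as a loop. This is a minimal sketch on made-up data: the toy texts, the dictionary names, and the choice of `n_neighbors=3` are all assumptions for illustration, not settings tuned for our corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Hypothetical toy data, invented for illustration only
toy_train_texts = [
    "laurel muse ode", "muse ode lyre", "ode laurel lyre", "lyre muse laurel",
    "cottage hearth plough", "hearth plough barn",
    "plough cottage barn", "barn hearth cottage",
]
toy_train_labels = ["review"] * 4 + ["random"] * 4
toy_test_texts = ["muse lyre ode", "barn plough hearth"]
toy_test_labels = ["review", "random"]

toy_vec = CountVectorizer(binary=True)
toy_train_dtm = toy_vec.fit_transform(toy_train_texts)
toy_test_dtm = toy_vec.transform(toy_test_texts)

# One fresh, independent object per algorithm, so fitting one
# never overwrites another
classifiers = {
    "naive_bayes": MultinomialNB(),
    "linear_svc": LinearSVC(),
    "knn": KNeighborsClassifier(n_neighbors=3),
}

toy_accuracies = {}
for name, clf in classifiers.items():
    clf.fit(toy_train_dtm, toy_train_labels)
    toy_accuracies[name] = accuracy_score(toy_test_labels, clf.predict(toy_test_dtm))

print(toy_accuracies)
```

Keeping each model in its own variable matters later: calling `fit` again on an existing object replaces whatever it learned before.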

In [None]:
##EX: Try to improve the accuracy by changing some of the options 
##You can change options for either the vectorizer function, or one of the algorithms.
###How does it change the accuracy of the classifier? (Lowering the accuracy can tell you something too)

### 3. Cross Validation

We know the accuracy for the test set we created, but what if we have a bunch of poems that are not yet classified and we want to classify them? Can we be sure we'll get a similar accuracy?

In [None]:
##We need to define a new container for CountVectorizer. Note this, and we'll come back to it below
countvec_cv = CountVectorizer(stop_words = 'english', min_df = 1, binary=True)
dtm = countvec_cv.fit_transform(df.body)
scores = cross_val_score(nb, dtm, df.label, cv=5)
scores   

In [None]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
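
One subtlety in the cell above is that the vectorizer is fit on all 720 poems before the folds are made, so each fold's vocabulary includes words from its held-out poems. A common way around this, sketched below under the assumption that we want a stricter estimate, is to wrap the vectorizer and classifier in a scikit-learn pipeline so that both are refit inside every fold. The toy texts and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, invented for illustration only: 5 texts per label
toy_texts = [
    "laurel muse ode", "muse ode lyre", "ode laurel lyre",
    "lyre muse laurel", "laurel ode muse lyre",
    "cottage hearth plough", "hearth plough barn", "plough cottage barn",
    "barn hearth cottage", "cottage plough hearth barn",
]
toy_labels = ["review"] * 5 + ["random"] * 5

# The pipeline takes raw text; in each fold the vocabulary is learned
# from that fold's training portion only
pipeline = make_pipeline(CountVectorizer(binary=True), MultinomialNB())

# cv=5: train on four fifths of the data, score on the remaining fifth,
# and repeat so that every text is held out exactly once
toy_scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=5)
print(toy_scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (toy_scores.mean(), toy_scores.std() * 2))
```

The result is one accuracy per fold; the spread across folds gives a rough sense of how much the score depends on which poems happen to land in the test set.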

*NOTE: This assumes that any unseen data has features similar to those of your training and test data. If the features are different, you cannot assume you will get similar accuracy. When might the unseen data be different from the labeled data?*

Question: Why is the cross-validation accuracy different from the accuracy we measured on our single test set?

Turn to the person next to you and talk through the above code. What is happening in each line?

### 4. Prediction

Often what we want to do next is predict the label for unlabeled texts. Let's predict the label for two poems where we do not know the classification.

In [None]:
dickinson_canonic = """Because I could not stop for Death – 
He kindly stopped for me –  
The Carriage held but just Ourselves –  
And Immortality.

We slowly drove – He knew no haste
And I had put away
My labor and my leisure too,
For His Civility – 

We passed the School, where Children strove
At Recess – in the Ring –  
We passed the Fields of Gazing Grain –  
We passed the Setting Sun – 

Or rather – He passed us – 
The Dews drew quivering and chill – 
For only Gossamer, my Gown – 
My Tippet – only Tulle – 

We paused before a House that seemed
A Swelling of the Ground – 
The Roof was scarcely visible – 
The Cornice – in the Ground – 

Since then – ‘tis Centuries – and yet
Feels shorter than the Day
I first surmised the Horses’ Heads 
Were toward Eternity – """


anthem_patriotic = """O! say can you see, by the dawn's early light,
What so proudly we hailed at the twilight's last gleaming,
Whose broad stripes and bright stars through the perilous fight,
O'er the ramparts we watched, were so gallantly streaming?
And the rockets' red glare, the bombs bursting in air,
Gave proof through the night that our flag was still there;
O! say does that star-spangled banner yet wave
O'er the land of the free and the home of the brave?"""

In [None]:
# Transform these into DTMs with the same feature-columns as previously
# Notice which CountVectorizer container we use
# Question: Why do we use this one? Why did we need to define a new one for the cross validation step above?
unknown_dtm = countvec.transform([dickinson_canonic,anthem_patriotic]).toarray()


In [None]:
nb.predict(unknown_dtm)

In [None]:
# We can return a binary classification, but Naive Bayes also assigns
# a probability of membership in each category

# Just how confident is our classifier of its predictions?

nb.predict_proba(unknown_dtm)

In [None]:
#Let's zip this together with the name of the poems to make sense of the output
list(zip(['dickinson', 'anthem'], nb.predict(unknown_dtm), nb.predict_proba(unknown_dtm)))
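
The columns returned by `predict_proba` follow the order of the classifier's `classes_` attribute, which for string labels is alphabetical. The sketch below pairs each probability with its label on made-up data; the texts and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, invented for illustration only
toy_texts = ["laurel muse ode", "muse ode lyre",
             "cottage hearth barn", "hearth barn plough"]
toy_labels = ["review", "review", "random", "random"]

toy_vec = CountVectorizer()
toy_clf = MultinomialNB().fit(toy_vec.fit_transform(toy_texts), toy_labels)

new_texts = ["muse laurel", "barn cottage"]
toy_proba = toy_clf.predict_proba(toy_vec.transform(new_texts))

# classes_ tells us which column belongs to which label
print(toy_clf.classes_)
for text, row in zip(new_texts, toy_proba):
    print(text, dict(zip(toy_clf.classes_, np.round(row, 3))))
```

Each row sums to 1. Note that Naive Bayes probabilities on long texts are often pushed toward 0 or 1, so they are better read as a ranking than as calibrated confidence.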

In [None]:
## EX. Import and process the 'canonic' (albeit unreviewed) volumes of poetry.
##     Use the poetry classifier to predict whether they might have been reviewed.
##     Does the output make sense? 

canonic_path = '../data/poems/canonic/'
...

In [None]:
canonic_files = os.listdir(canonic_path)
print(canonic_files)

canonic_texts = [open(canonic_path+file_name).read() for file_name in canonic_files]

In [None]:
canonic_dtm = countvec.transform(canonic_texts).toarray()
nb.predict(canonic_dtm)

In [None]:
nb.predict_proba(canonic_dtm)

In [None]:
list(zip(canonic_files, nb.predict(canonic_dtm), nb.predict_proba(canonic_dtm)))

### 5. Identifying Features

We can use supervised machine learning to classify unseen documents using the above code. But we can also use it to learn more about the content of each category, by extracting the most defining features of this category. So even if we do not have unseen text, we can use this method to better understand given categories (for example, canonized and non-canonized text). 

In [None]:
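# Hand-waving the underlying statistics here...

def most_informative_features(text_class, vectorizer = countvec, classifier = nb, top_n = 20):

    # get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases no longer provide
    feature_names = vectorizer.get_feature_names_out()
    class_index = np.where(classifier.classes_==(text_class))[0][0]
    
    # per-class word probabilities; 1 - class_index picks the other class (binary case only)
    class_prob_distro = np.exp(classifier.feature_log_prob_[class_index])
    alt_class_prob_distro = np.exp(classifier.feature_log_prob_[1 - class_index])
    
    # ratio > 1 means the word is more probable in text_class than in the other class
    odds_ratios = class_prob_distro / alt_class_prob_distro
    odds_with_fns = sorted(zip(odds_ratios, feature_names), reverse = True)
    
    return odds_with_fns[:top_n]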

In [None]:
# Returns (ratio, feature name) pairs for a given class, largest ratio first
most_informative_features('review')

In [None]:
most_informative_features('random')

In [None]:
## Q.  What kinds of patterns do you notice among the 'most informative features'?
##     Try looking at the top fifty most informative words for each category.