The following notebook gives you an interactive look at using machine learning to classify texts into different categories. Although the activity was designed to classify plain-text novels by genre, in principle you can use it to classify any two groups of labeled texts.

Using this notebook is straightforward. For the most part, you will simply click "run" on each cell, in order, and look at the results it produces. Where a cell needs additional instructions, they appear just above it.

First, let's upload some text files. Run the first cell, then select a set of plain-text files saved on your computer. These should all belong to the same category (for example, a single genre). Depending on the speed of your internet connection and the number of text files you select, this could take anywhere from a few seconds to many minutes. For our class, it will probably take at least a few minutes.


In [1]:
from google.colab import files
classA = files.upload()

Saving 103.txt to 103.txt
Saving 119.txt to 119.txt
Saving 135.txt to 135.txt
Saving 138.txt to 138.txt
Saving 148.txt to 148.txt
Saving 153.txt to 153.txt
Saving 183.txt to 183.txt
Saving 185.txt to 185.txt
Saving 199.txt to 199.txt
Saving 243.txt to 243.txt
Saving 298.txt to 298.txt
Saving 352.txt to 352.txt
Saving 356.txt to 356.txt
Saving 364.txt to 364.txt
Saving 377.txt to 377.txt
Saving 387.txt to 387.txt
Saving 406.txt to 406.txt
Saving 436.txt to 436.txt
Saving 440.txt to 440.txt
Saving 471.txt to 471.txt
Saving 658.txt to 658.txt
Saving 688.txt to 688.txt
Saving 698.txt to 698.txt
Saving 713.txt to 713.txt
Saving 721.txt to 721.txt
Saving 736.txt to 736.txt
Saving 737.txt to 737.txt
Saving 738.txt to 738.txt
Saving 739.txt to 739.txt
Saving 740.txt to 740.txt
Saving 741.txt to 741.txt
Saving 742.txt to 742.txt
Saving 743.txt to 743.txt
Saving 744.txt to 744.txt
Saving 745.txt to 745.txt
Saving 746.txt to 746.txt
Saving 747.txt to 747.txt
Saving 748.txt to 748.txt
Saving 749.t

Next, do the same thing for the second category of text.

In [2]:
classB = files.upload()

Saving 113.txt to 113.txt
Saving 114.txt to 114.txt
Saving 121.txt to 121.txt
Saving 139.txt to 139.txt
Saving 158.txt to 158.txt
Saving 256.txt to 256.txt
Saving 265.txt to 265.txt
Saving 269.txt to 269.txt
Saving 291.txt to 291.txt
Saving 292.txt to 292.txt
Saving 300.txt to 300.txt
Saving 306.txt to 306.txt
Saving 312.txt to 312.txt
Saving 319.txt to 319.txt
Saving 323.txt to 323.txt
Saving 328.txt to 328.txt
Saving 335.txt to 335.txt
Saving 344.txt to 344.txt
Saving 346.txt to 346.txt
Saving 347.txt to 347.txt
Saving 348.txt to 348.txt
Saving 350.txt to 350.txt
Saving 354.txt to 354.txt
Saving 358.txt to 358.txt
Saving 363.txt to 363.txt
Saving 376.txt to 376.txt
Saving 380.txt to 380.txt
Saving 397.txt to 397.txt
Saving 475.txt to 475.txt
Saving 485.txt to 485.txt
Saving 729.txt to 729.txt
Saving 781.txt to 781.txt
Saving 782.txt to 782.txt
Saving 783.txt to 783.txt
Saving 784.txt to 784.txt
Saving 786.txt to 786.txt
Saving 787.txt to 787.txt
Saving 789.txt to 789.txt
Saving 790.t

The next cell loads the Python libraries we'll need for this activity. We'll be using scikit-learn, an open-source library for machine learning. Luckily, Google Colab has everything we need preinstalled, so we just have to import a few things.

In [3]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

When we uploaded our text files, Colab stored the contents of each one as raw bytes. Here, we decode them into text strings using the Latin-1 encoding. Then, we assign a numerical label to each text category: 0 for the first one, 1 for the second.

In [4]:
# Decode text byte files into strings
textsA = []
for text in classA.values():
  textsA.append(text.decode('Latin-1'))

textsB = []
for text in classB.values():
  textsB.append(text.decode('Latin-1'))

# Create an array of class labels: 0 for class A, 1 for class B
labelsA = np.zeros(len(textsA), dtype=int)
labelsB = np.ones(len(textsB), dtype=int)

# Final training data
texts = np.array(textsA + textsB)
labels = np.append(labelsA, labelsB)


In this cell, we calculate TF-IDF (term frequency-inverse document frequency) scores for each text document. A word scores high in a document when it appears often there but in few of the other documents.

In [5]:
tfidf = TfidfVectorizer(sublinear_tf=True, 
                        max_features=20000,
                        norm='l2',
                        encoding='latin-1', 
                        ngram_range=(1, 1),
                        stop_words='english')
features = tfidf.fit_transform(texts)
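
To make the weighting concrete, here is a small sketch on three made-up sentences. The names toy_docs, toy_tfidf, and toy_features are illustrative and not part of the notebook; the vectorizer settings mirror the ones above, minus max_features and encoding. In the first sentence, "knife" appears in only one document while "detective" appears in two, so "knife" should receive the higher score.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three tiny made-up "documents"
toy_docs = [
    "the detective found the knife",
    "the detective questioned the butler",
    "the ship sailed across the sea",
]

toy_tfidf = TfidfVectorizer(sublinear_tf=True, norm="l2", stop_words="english")
toy_features = toy_tfidf.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index in the feature matrix
vocab = toy_tfidf.vocabulary_
row0 = toy_features[0].toarray().ravel()

knife_score = row0[vocab["knife"]]
detective_score = row0[vocab["detective"]]
row0_length = np.linalg.norm(row0)

print("knife:", round(knife_score, 3))
print("detective:", round(detective_score, 3))
print("length of row 0:", round(row0_length, 3))
```

Because of norm='l2', every row is scaled to length 1, which keeps long novels from dominating short ones.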

Then, we randomly separate the text data into a training set (75% of the texts) and a held-out test set (25%). Checking predictions on the test set tells us how accurate the classifier is on texts it did not see during training.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    features, 
    labels,
    test_size=.25,
    random_state=99,
)
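
The split itself is easy to inspect on dummy data. The sketch below uses made-up arrays (toy_X, toy_y) and adds one option the cell above does not use, stratify, which asks train_test_split to keep the two classes in the same proportion in both sets. Whether you want that for your own texts is a judgment call; it is shown here only as an option.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 40 dummy samples with 2 features each, 20 per class
toy_X = np.arange(80).reshape(40, 2)
toy_y = np.array([0] * 20 + [1] * 20)

Xtr, Xte, ytr, yte = train_test_split(
    toy_X,
    toy_y,
    test_size=0.25,    # 25% of the samples go to the test set
    random_state=99,   # fixed seed so the split is repeatable
    stratify=toy_y,    # not in the notebook: keep class balance in both sets
)

train_counts = np.bincount(ytr)
test_counts = np.bincount(yte)

print(Xtr.shape, Xte.shape)
print("train class counts:", train_counts)
print("test class counts:", test_counts)
```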

Now it's time to fit the classifier to the training data.

In [7]:
clf = LogisticRegression(C=1, class_weight='balanced').fit(X_train, y_train)

All done! Let's see how accurate our algorithm's predictions are when we ask it to look at new data...

In [8]:
y_predict = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_predict)
print('Done! Accuracy = ', accuracy * 100, '%' )

Done! Accuracy =  97.67441860465115 %
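
A single accuracy number does not say which class the mistakes came from. One common way to see that is a confusion matrix. The sketch below uses made-up labels (y_true_demo, y_pred_demo) rather than the notebook's results; in the notebook you could pass y_test and y_predict instead.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for eight texts
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0, 0]

acc_demo = accuracy_score(y_true_demo, y_pred_demo)

# Rows are the true class, columns are the predicted class
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

print("accuracy:", acc_demo)
print(cm_demo)
```

Here one class A text was labeled B and one class B text was labeled A, so the off-diagonal entries are both 1.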


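One of the handy features of scikit-learn is that we can look "under the hood" and see how the classifier is making its classification decisions. The following cell prints the most distinctive words that the classifier associates with each class: words with the most negative coefficients point toward class A (label 0), and words with the most positive coefficients point toward class B (label 1). The cell is currently set to print 100 words per class. You can see more or fewer words by changing the value of "words" and running the cell again.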


In [11]:
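# Get the vocabulary words, in the same order as the columns of clf.coef_
# (get_feature_names() was removed in scikit-learn 1.2)
feature_names = tfidf.get_feature_names_out()
words = 100
# Sort (coefficient, word) pairs once: most negative first, most positive last
ranked = sorted(zip(clf.coef_[0], feature_names))
topn_classA = ranked[:words]
topn_classB = ranked[-words:]
topn_classB.reverse()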

# Print for each class
print("Top words for Class A")
for coef, feat in topn_classA:
    print(round(coef, 3), feat)

print("\nTop words for Class B")
for coef, feat in topn_classB:
    print(round(coef, 3), feat)

Top words for Class A
-0.388 murder
-0.339 police
-0.296 detective
-0.266 crime
-0.261 body
-0.25 case
-0.219 death
-0.207 ii
-0.204 dr
-0.203 mystery
-0.201 evidence
-0.193 prints
-0.192 jones
-0.186 killed
-0.184 marks
-0.184 attorney
-0.181 murderer
-0.18 chief
-0.179 coroner
-0.179 floor
-0.177 pocket
-0.174 wall
-0.173 arrest
-0.173 detectives
-0.168 physician
-0.167 question
-0.167 headquarters
-0.166 prison
-0.164 affair
-0.162 knife
-0.161 double
-0.16 beard
-0.158 finger
-0.158 papers
-0.157 entered
-0.156 policeman
-0.155 crook
-0.155 possible
-0.154 swear
-0.154 work
-0.154 inspector
-0.153 gun
-0.151 revolver
-0.148 sergeant
-0.146 prisoner
-0.146 removed
-0.146 office
-0.144 corridor
-0.143 statement
-0.143 hell
-0.142 district
-0.142 opinion
-0.14 paper
-0.14 locked
-0.138 criminal
-0.137 guy
-0.135 committed
-0.135 newspaper
-0.134 stated
-0.133 shop
-0.133 jury
-0.132 details
-0.132 alibi
-0.132 story
-0.132 directly
-0.131 murdered
-0.131 automobile
-0.13 explanation
-

What do you notice about the top 25 words for each class? Which words are expected and which words are surprising to you? What does this tell you about the stylistic or thematic differences of these "classes" of texts?

Expand the number of top words to 50, 100, or even 200. How does this change the picture of the two classes you see? Do you begin to see more surprising words further down the list?
