<h1 align='center'>It Starts with a Research Question...</h1>
<img src='Long, So 263, Fig 8.png' width="66%" height="66%">
<img src='Long, So 257, Fig 5.png' width="66%" height="66%">

# Literary Distinction (Probably)
<ul><li>Preview</li>
<li>Review</li>
<li>Pre-Processing</li>
<ul>
<li>Import Corpus</li>
<li>Stop Words</li>
<li>Feature Selection</li></ul>
<li>Classification</li>
<ul>
<li>Training, Feature Importance, & Prediction</li>
<li>Literary Distinction</li>
<li>Extra: Cross-Validation</li></ul>
</ul>

We will work through supervised machine learning via a classification task. This lesson is based on the paper by Hoyt Long and Richard So: ["Literary Pattern Recognition: Modernism between Close Reading and Machine Learning."](https://www.journals.uchicago.edu/doi/abs/10.1086/684353)

Although Long and So's study of modernist haiku motivates this lesson, a substantial portion of their corpus remains under copyright, so they have not made it publicly available. Instead, we will apply their methods to the corpus distributed by Ted Underwood and Jordan Sellers in support of their own literary-historical study of nineteenth- and early-twentieth-century volumes of poetry that were reviewed in prestigious magazines versus not at all. (The idea being that even a negative review indicates valuable, critical engagement.)

In essence, our task will be to learn the vocabulary of literary prestige, rather than that of haiku. We will however be deliberate in using Long and So's methods, since they reflect assumptions about language that are more appropriate to a general introduction.

Our task: build a machine learning classifier that will "learn" how to distinguish between prestigious poems and poems that have not been recognized as being prestigious. Our input will simply be roughly 360 examples of poems that have been reviewed in literary journals, a mark of prestige (labeled *reviewed*), and roughly 360 examples of poems that have not been reviewed (labeled *random*); the corpus we load below contains 357 and 352 files respectively. We will then use an algorithm, in our case the relatively simple *Naive Bayes classifier*, to determine which features, in our case words, distinguish reviewed poems from random poems.

We will use the large and powerful Python library [Scikit-Learn](https://scikit-learn.org/stable/) to implement our classifier.

# 0. Preview

In [1]:
import nltk
nltk.download('stopwords')

from sklearn.naive_bayes import MultinomialNB
import pandas

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jinyang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# Get texts of interest that belong to identifiably different categories

unladen_swallow = 'high air-speed velocity'
swallow_grasping_coconut = 'low air-speed velocity'

In [3]:
# Transform them into a format scikit-learn can use

columns = ['high','low','air-speed','velocity']
indices = ['unladen', 'coconut']
dtm = [[1,0,1,1],[0,1,1,1]]
dtm_df = pandas.DataFrame(dtm, columns = columns, index = indices)

dtm_df

Unnamed: 0,high,low,air-speed,velocity
unladen,1,0,1,1
coconut,0,1,1,1


In [4]:
# Train the Naive Bayes classifier

nb = MultinomialNB()
nb.fit(dtm,indices)

MultinomialNB()

In [5]:
# Make a prediction!

unknown_swallow = "high velocity"
unknown_features = [1,0,0,1]

nb.predict([unknown_features])

array(['unladen'], dtype='<U7')

# 1. Pre-Process

In their paper, Long and So describe their pre-processing as consisting of three major steps: stop word removal, lemmatization of nouns, and feature selection (based on document frequency). In this workshop, we will focus on the first and third steps, since they can be integrated seamlessly with our workflow and Underwood and Sellers use them as well.

Lemmatization -- the transformation of words into their dictionary forms; e.g. plural nouns become singular -- is particularly useful to Long and So, since they partly aim to study imagery. That is, they find it congenial to collapse the words <i>mountains</i> and <i>mountain</i> into the same token, since they express a similar image. For an introduction to Lemmatization (and a related technique, Stemming), see NLTK: http://www.nltk.org/book/ch03.html#sec-normalizing-text

### Import Corpus

Note that due to issues of copyright, volumes' word order has not been retained, although their total word counts have been. Fortunately, our methods do not require word-order information.

Underwood and Sellers's literary corpus has been divided into three folders: "reviewed", "random", "canonic". (The last of these contains canonic poets who did not have the opportunity to be reviewed, such as Emily Dickinson.)

In [6]:
import glob #allows us to access our file structure (danger! it has access to your computer!)

In [7]:
# Assign file paths to each set of poems

review_path = 'poems/reviewed/*.txt'
random_path = 'poems/random/*.txt'

In [8]:
# Get lists of text files in each directory

review_files = glob.glob(review_path)
random_files = glob.glob(random_path)

In [9]:
# Inspect

review_files[:10]

['poems/reviewed/689 Hood, Thomas, The plea of the midsummer fairies 1827.txt',
 "poems/reviewed/524 Mackay, Charles, A man's heart 1860.txt",
 'poems/reviewed/383 Colman, James F. The knightly heart, and other poems 1873.txt',
 'poems/reviewed/580 Browning, Elizabeth Barrett, Poems 1853.txt',
 'poems/reviewed/431 Myers, F. W. H. Poems 1870.txt',
 'poems/reviewed/229 Dorr, Julia C. R. Poems 1892.txt',
 'poems/reviewed/544 Tennyson, Alfred Tennyson, Idyls of the King 1859.txt',
 'poems/reviewed/675 Davidson, Lucretia Maria, Amir Khan, and other poems 1829.txt',
 "poems/reviewed/84 Sweeny, Mildred M' Neal. Men of no land, and other poems 1912.txt",
 'poems/reviewed/666 Lytton, Edward Bulwer Lytton, The Siamese twins 1831.txt']

In [10]:
# Read-in texts as strings from each location using list comprehension



review_texts = [open(file_name, encoding='utf-8').read() for file_name in review_files]
random_texts = [open(file_name, encoding='utf-8').read() for file_name in random_files]

In [11]:
# Inspect

review_texts[0][:100]

'the the the the the the the the the the the the the the the the the the the the the the the the the '

Whoa - WTH is that? It's the text, but sorted. Most classification tasks, and much of machine learning as it's used in text analysis, are done on what is called a "bag of words." It's just words and their counts, not words in context. So having it sorted this way is no issue, as we won't look at context.

We'll look at words in context more next week.

In [12]:
# Collect all texts in single list

all_texts = review_texts + random_texts

#let's make sure we understand what we're doing
print("Number of Reviewed Texts:")
print(len(review_texts))

print("Number of Random Texts:")
print(len(random_texts))

print("Number of Total Texts:")
print(len(all_texts))

Number of Reviewed Texts:
357
Number of Random Texts:
352
Number of Total Texts:
709


In [13]:
# Get all file names together

all_file_names = review_files + random_files

print(all_file_names[:2])
print(len(all_file_names))

['poems/reviewed/689 Hood, Thomas, The plea of the midsummer fairies 1827.txt', "poems/reviewed/524 Mackay, Charles, A man's heart 1860.txt"]
709


In [14]:
# Keep track of classes with labels

all_labels = ['reviewed'] * len(review_texts) + ['random'] * len(random_texts) 

print(all_labels[:5])
print(len(all_labels))

['reviewed', 'reviewed', 'reviewed', 'reviewed', 'reviewed']
709


We now have three lists of the same length:

* `all_file_names` = a list of the file names
* `all_texts` = a list where each element is the contents of one file, as a string
* `all_labels` = a list where each element is a label associated with the file

The indices should match: all_file_names[0] should be the filename associated with the text all_texts[0] and the label all_labels[0], and so on.

Take a second to really understand this data structure - it's important!
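
One quick way to check a parallel-list structure like this is to compare the lengths and then walk the lists together with `zip`. The sketch below uses three tiny made-up lists (the `toy_` names and their contents are placeholders, not files from the corpus).

```python
# Toy stand-ins for all_file_names, all_texts, and all_labels (hypothetical values)
toy_file_names = ['reviewed/a.txt', 'reviewed/b.txt', 'random/c.txt']
toy_texts = ['the the sea sea dusk', 'a a stair vague', 'the emblem mission']
toy_labels = ['reviewed', 'reviewed', 'random']

# The three lists must be the same length for the indices to line up
assert len(toy_file_names) == len(toy_texts) == len(toy_labels)

# zip() walks the lists in parallel, one (file name, text, label) triple at a time
records = list(zip(toy_file_names, toy_texts, toy_labels))

for file_name, text, label in records:
    print(label, '|', file_name, '|', text[:10])
```

If the lengths ever disagree, the `assert` fails immediately, which is usually easier to debug than a classifier trained on misaligned labels.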

### An aside: Stop Words

<i>Stop words</i>, sometimes referred to as <i>function words</i>, include articles, prepositions, pronouns, and conjunctions among others. Although their frequencies encode information about textual features like authorship, they do not convey semantic meanings and are often removed before analysis.

In [15]:
# scikit-learn uses this list of English stop words when stop_words = 'english'

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [16]:
# Inspect

ENGLISH_STOP_WORDS

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

In [17]:
# How many are here?

len(ENGLISH_STOP_WORDS)

318

In [18]:
# Reminder: NLTK has its own collection of stop words

from nltk.corpus import stopwords

In [19]:
# Pull up NLTK's list of English-language stop words

stopwords.words('english')[:20]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his']

In [20]:
# How many stop words are in the list?
# Big difference!

len(stopwords.words('english'))

179

### The Document-Term Matrix (DTM)

A DTM is a different way of representing text, called *vector representation.* The goal is to turn the text into a matrix (or an array) of numbers. Once we have the texts in an array, or vector, format, we can do matrix manipulation and linear algebra on it (similar to what we did with relational data). To create the DTM we transform each document into a vector, where each number in the vector represents the count of a particular word. With a DTM, each row is a document, each column is a unique word (each unique word in the entire corpus gets a column), and the cells are the number of times that word appears in the document.

To get a feel for this we'll jump right into an example.

We've learned how to count each word using Python's NLTK. We could, then, construct a DTM manually. Luckily, however, `scikit-learn` comes with a built-in class that does this for us, called `CountVectorizer()`.
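
For comparison, the manual route can be sketched in a few lines of plain Python. This is only an illustration on two made-up documents, with a deliberately crude tokenizer (lowercase, split on whitespace); the names `toy_docs`, `vocabulary`, and `manual_dtm` are our own.

```python
from collections import Counter

# Two toy documents (hypothetical, not from the corpus)
toy_docs = ['the sea the dusk', 'the emblem of the mission']

# Count the words in each document
doc_counts = [Counter(doc.lower().split()) for doc in toy_docs]

# The columns are every unique word in the whole collection, in sorted order
vocabulary = sorted(set(word for counts in doc_counts for word in counts))

# One row per document, one column per word; Counter returns 0 for absent words
manual_dtm = [[counts[word] for word in vocabulary] for counts in doc_counts]

print(vocabulary)
for row in manual_dtm:
    print(row)
```

Each row has one cell per word in the shared vocabulary, so most cells are zero once the vocabulary grows large.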

[Let's first look at the documentation for CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

We'll implement this by creating a `CountVectorizer()` object.

In [21]:
#import the function CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()

#use the fit_transform function to transform our text list (all_texts) into a DTM
sklearn_dtm = countvec.fit_transform(all_texts)
print(sklearn_dtm)

  (0, 127733)	1439
  (0, 4452)	1207
  (0, 90849)	667
  (0, 130798)	559
  (0, 64364)	507
  (0, 127641)	433
  (0, 146192)	391
  (0, 58162)	348
  (0, 59129)	277
  (0, 48348)	244
  (0, 17088)	241
  (0, 86910)	217
  (0, 91666)	191
  (0, 76426)	172
  (0, 3286)	170
  (0, 127841)	164
  (0, 67324)	157
  (0, 118589)	149
  (0, 49855)	146
  (0, 114724)	139
  (0, 6737)	138
  (0, 57296)	128
  (0, 17230)	127
  (0, 78471)	121
  (0, 67659)	116
  :	:
  (708, 112990)	1
  (708, 106449)	1
  (708, 105806)	1
  (708, 101706)	1
  (708, 100321)	1
  (708, 99364)	1
  (708, 98468)	1
  (708, 85654)	1
  (708, 80096)	1
  (708, 30364)	1
  (708, 70574)	1
  (708, 58852)	1
  (708, 58851)	1
  (708, 25937)	1
  (708, 24155)	1
  (708, 20554)	1
  (708, 19786)	1
  (708, 18197)	1
  (708, 13020)	1
  (708, 10633)	1
  (708, 5722)	1
  (708, 2681)	1
  (708, 2381)	1
  (708, 1247)	1
  (708, 1181)	1


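This sparse format (SciPy's Compressed Sparse Row format) lists only the non-zero cells: each line shows a `(row, column)` pair followed by the count. How do we know which word each column number indicates? We can access the words themselves through the CountVectorizer method `get_feature_names_out`.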

In [22]:
print(countvec.get_feature_names_out()[:10])

['00' '01' '02' '05' '0l' '0li' '0ltr' '0m' '0u' '0ug']


In [23]:
print(countvec.get_feature_names_out()[61364])

hybrid


### Feature Selection

In the above DTM I included all words.

Long and So did not use all of the words that appear in their respective corpora when constructing their matrices. The process of choosing which words will make up the columns is referred to as <i>feature selection</i>.

While there are several approaches one may take when selecting features, both of the literary studies under consideration use <i>document frequency</i> as the deciding criterion. The intuition is that a word that appears in a single text out of hundreds will not carry much weight when trying to determine the text's class membership.

In order to be selected as a feature, Long and So require that words appear in at least 2 texts, whereas Underwood and Sellers require that a word appear in about a quarter of all texts. Although this is quite a large difference (a minimum of 2 texts vs. ~180 texts), it perhaps makes sense since the texts are of very different lengths: individual haiku vs entire volumes of poetry. The latter will have much greater overlap in its vocabulary.

The process of feature selection is intimately tied to the object under study and the statistical model chosen.
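
Document-frequency filtering can be sketched by hand on a toy example. The four documents and the helper `select_features` below are hypothetical and meant only to show the idea of a minimum-document threshold.

```python
from collections import Counter

# Four toy documents (hypothetical)
toy_docs = [
    'sea dusk stair',
    'sea dusk emblem',
    'sea mission',
    'sea dusk dusk',
]

# Document frequency: in how many documents does each word appear?
# set() ensures a word is counted once per document, however often it repeats
doc_freq = Counter()
for doc in toy_docs:
    doc_freq.update(set(doc.split()))

def select_features(doc_freq, min_df):
    """Keep words that appear in at least min_df documents."""
    return sorted(word for word, df in doc_freq.items() if df >= min_df)

print(select_features(doc_freq, min_df=1))
print(select_features(doc_freq, min_df=3))
```

When given an integer, `CountVectorizer`'s `min_df` argument applies the same kind of threshold: a word must appear in at least that many documents to become a column.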

In [24]:
# Initialize the vectorizer that will transform our list of texts to a DTM
# 'min_df' and 'max_features' are arguments that enable flexible feature selection
# 'binary' tells CountVectorizer only to record whether a word appeared in a text or not

cv = CountVectorizer(stop_words = 'english', min_df=180, binary = True, max_features = None)

In [61]:
# Transform our texts to DTM

a=cv.fit_transform(all_texts)
print(a)

  (0, 652)	1
  (0, 692)	1
  (0, 1189)	1
  (0, 326)	1
  (0, 1179)	1
  (0, 1146)	1
  (0, 648)	1
  (0, 243)	1
  (0, 1173)	1
  (0, 1135)	1
  (0, 795)	1
  (0, 1192)	1
  (0, 1166)	1
  (0, 335)	1
  (0, 529)	1
  (0, 327)	1
  (0, 1191)	1
  (0, 981)	1
  (0, 644)	1
  (0, 128)	1
  (0, 677)	1
  (0, 537)	1
  (0, 782)	1
  (0, 497)	1
  (0, 640)	1
  :	:
  (708, 997)	1
  (708, 755)	1
  (708, 1245)	1
  (708, 1116)	1
  (708, 770)	1
  (708, 720)	1
  (708, 372)	1
  (708, 1255)	1
  (708, 887)	1
  (708, 413)	1
  (708, 168)	1
  (708, 117)	1
  (708, 55)	1
  (708, 1358)	1
  (708, 1345)	1
  (708, 1207)	1
  (708, 829)	1
  (708, 717)	1
  (708, 1127)	1
  (708, 220)	1
  (708, 1251)	1
  (708, 27)	1
  (708, 896)	1
  (708, 587)	1
  (708, 174)	1


In [26]:
# Transform our texts to a dense DTM

#This format should make more sense - think rows and columns
# Think back to social network analysis - this is the same data format we used to analyze social networks
# Data science is great because it all uses the same data formats!

cv.fit_transform(all_texts).toarray()

array([[0, 0, 1, ..., 0, 1, 0],
       [0, 1, 1, ..., 1, 0, 1],
       [1, 0, 1, ..., 1, 1, 1],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0],
       [0, 1, 0, ..., 1, 0, 0]])

In [37]:
# Assign this to a variable

dtm = cv.fit_transform(all_texts).toarray()
type(dtm)

numpy.ndarray

In [28]:
# Get the column headings

cv.get_feature_names_out()[:20]

array(['abide', 'abode', 'abroad', 'absence', 'absent', 'abyss',
       'accents', 'accept', 'accursed', 'ache', 'aching', 'act', 'action',
       'acts', 'adam', 'add', 'added', 'adieu', 'adore', 'adorn'],
      dtype=object)

In [29]:
# Assign to a variable

feature_list = cv.get_feature_names_out()

In [30]:
# Place this in a dataframe for readability

dtm_df = pandas.DataFrame(dtm, columns = feature_list, index = all_file_names)

In [31]:
# Check out the dataframe
# The DTM, in all its wonderful, full (memory-heavy) glory (don't use this format with large data)
dtm_df.head()

Unnamed: 0,abide,abode,abroad,absence,absent,abyss,accents,accept,accursed,ache,...,yon,yonder,yore,young,younger,youth,youthful,zeal,zephyr,zone
"poems/reviewed/689 Hood, Thomas, The plea of the midsummer fairies 1827.txt",0,0,1,1,1,1,1,0,0,1,...,0,0,0,1,0,1,0,0,1,0
"poems/reviewed/524 Mackay, Charles, A man's heart 1860.txt",0,1,1,1,0,1,0,1,0,0,...,0,0,0,1,1,1,1,1,0,1
"poems/reviewed/383 Colman, James F. The knightly heart, and other poems 1873.txt",1,0,1,1,1,1,1,1,1,0,...,1,1,1,1,0,1,1,1,1,1
"poems/reviewed/580 Browning, Elizabeth Barrett, Poems 1853.txt",0,1,1,1,1,0,1,1,1,1,...,0,1,1,1,1,1,1,0,0,1
"poems/reviewed/431 Myers, F. W. H. Poems 1870.txt",1,0,0,0,0,0,0,0,1,1,...,0,0,0,1,0,1,0,0,0,0


In [32]:
# Get the dataframe's dimensions (# texts, # features)

dtm_df.shape

(709, 3606)

# 2. Classification

### Training, Feature Importance, and Prediction

Long and So selected a classification algorithm that relies on <a href="https://en.wikipedia.org/wiki/Bayes%27_theorem">Bayes' Theorem</a> to model relationships between textual features and categories. We apply the same algorithm to our corpus of poetry volumes. (See link for more information about the method and its assumptions.)

Two ways that we learn about the model are its feature weights and predictions on new texts. The algorithm can explicitly report to us which direction each word leans category-wise and how strongly. Based on those weights, it makes further predictions about the valences of previously unseen poetry volumes.
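
To give a rough sense of where such weights come from, here is a hand computation on a toy binary DTM. It assumes add-one (Laplace) smoothing, which to our understanding matches scikit-learn's default `alpha = 1` for `MultinomialNB`; the data and the helper `smoothed_word_probs` are hypothetical.

```python
# Toy binary DTM: rows are documents, columns are the words below (hypothetical data)
words = ['dusk', 'emblem', 'sea']
toy_dtm = [
    [1, 0, 1],   # reviewed
    [1, 0, 1],   # reviewed
    [0, 1, 1],   # random
    [0, 1, 0],   # random
]
toy_labels = ['reviewed', 'reviewed', 'random', 'random']

def smoothed_word_probs(dtm, labels, target, alpha=1.0):
    """Estimate P(word | class) with add-alpha (Laplace) smoothing."""
    rows = [row for row, label in zip(dtm, labels) if label == target]
    column_totals = [sum(column) for column in zip(*rows)]
    denominator = sum(column_totals) + alpha * len(column_totals)
    return [(total + alpha) / denominator for total in column_totals]

p_reviewed = smoothed_word_probs(toy_dtm, toy_labels, 'reviewed')
p_random = smoothed_word_probs(toy_dtm, toy_labels, 'random')

# A ratio above 1 means the word leans toward 'reviewed'; below 1, toward 'random'
ratios = {word: r / q for word, r, q in zip(words, p_reviewed, p_random)}
print(ratios)
```

The smoothing keeps a word that never appears in one class from producing a zero probability, and therefore an infinite or undefined ratio.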

In [38]:
from sklearn.naive_bayes import MultinomialNB

In [39]:
# Train the classifier and assign it to a variable

#We use the fit function to fit the model to our data

nb = MultinomialNB()
nb.fit(dtm, all_labels)

MultinomialNB()

In [40]:
# Hand-waving the underlying statistics here...

def most_informative_features(text_class, vectorizer = cv, classifier = nb, top_n = 50):

    import numpy as np

    feature_names = vectorizer.get_feature_names_out()
    class_index = np.where(classifier.classes_==(text_class))[0][0]
    
    class_prob_distro = np.exp(classifier.feature_log_prob_[class_index])
    alt_class_prob_distro = np.exp(classifier.feature_log_prob_[1 - class_index])
    
    odds_ratios = class_prob_distro / alt_class_prob_distro
    odds_with_fns = sorted(zip(odds_ratios, feature_names), reverse = True)
    
    return odds_with_fns[:top_n]

In [41]:
# Returns feature name and odds ratio for a given class

most_informative_features('reviewed')

[(2.2574987525392105, 'dusk'),
 (2.2162169507398977, 'windy'),
 (2.170421139765935, 'utterly'),
 (2.164877923481417, 'vague'),
 (2.1293064820834307, 'roofs'),
 (2.0836042941752964, 'perilous'),
 (2.077372177642372, 'visible'),
 (2.0427493080149963, 'yearned'),
 (2.0269008113545772, 'stair'),
 (2.012973640135456, 'shrank'),
 (1.9989807747124708, 'remembering'),
 (1.9695086222647848, 'sleepy'),
 (1.9656644284672604, 'moods'),
 (1.9154075671821176, 'whirled'),
 (1.900651280585991, 'alien'),
 (1.8638644816069039, 'stark'),
 (1.843996505786981, 'curled'),
 (1.8299760364867796, 'muttered'),
 (1.8252865650745336, 'ghosts'),
 (1.8148154163014614, 'wherewith'),
 (1.8108114109704032, 'passionate'),
 (1.810469296590852, 'folk'),
 (1.8065148052497686, 'miracle'),
 (1.7812555295135246, 'haunting'),
 (1.751051631404383, 'eclipse'),
 (1.749713929623325, 'topmost'),
 (1.746779616039069, 'smote'),
 (1.739799198775482, 'porch'),
 (1.723217764225026, 'shuddering'),
 (1.723046197342888, 'hedge'),
 (1.7145

In [54]:
# Similarly, for words that indicate 'random' class membership

most_informative_features('random')

[(2.323890855405321, 'emblem'),
 (2.135859350418144, 'cheering'),
 (2.10956562735429, 'mission'),
 (2.109077414989705, 'caused'),
 (1.9534155016450552, 'united'),
 (1.9534155016450552, 'inspire'),
 (1.950679625592335, 'zephyr'),
 (1.9242916597934763, 'impart'),
 (1.908442196668769, 'display'),
 (1.8769673079432159, 'lasting'),
 (1.838894948755733, 'raging'),
 (1.834578293946445, 'unite'),
 (1.834578293946445, 'choicest'),
 (1.8296697232603005, 'fully'),
 (1.754989326392534, 'beaming'),
 (1.7461728906511993, 'varied'),
 (1.7145659923081469, 'saviour'),
 (1.7065300466472273, 'matchless'),
 (1.6989790287417097, 'peerless'),
 (1.6989790287417066, 'patriot'),
 (1.6989790287417066, 'dire'),
 (1.6932585606314674, 'jesus'),
 (1.6898447328882535, 'blessings'),
 (1.6757053434164795, 'firmly'),
 (1.6750497466467533, 'diamonds'),
 (1.6734688331149532, 'reigns'),
 (1.6612239392141168, 'climes'),
 (1.650436770777658, 'pleasing'),
 (1.6474948157495353, 'fetters'),
 (1.6423463944503165, 'views'),
 (1.

In [44]:
# Let's load up two poems that aren't in the training set and make predictions
# You could do the same thing by reading in files

dickinson_canonic = """Because I could not stop for Death – 
He kindly stopped for me –  
The Carriage held but just Ourselves –  
And Immortality.

We slowly drove – He knew no haste
And I had put away
My labor and my leisure too,
For His Civility – 

We passed the School, where Children strove
At Recess – in the Ring –  
We passed the Fields of Gazing Grain –  
We passed the Setting Sun – 

Or rather – He passed us – 
The Dews drew quivering and chill – 
For only Gossamer, my Gown – 
My Tippet – only Tulle – 

We paused before a House that seemed
A Swelling of the Ground – 
The Roof was scarcely visible – 
The Cornice – in the Ground – 

Since then – ‘tis Centuries – and yet
Feels shorter than the Day
I first surmised the Horses’ Heads 
Were toward Eternity – """


anthem_patriotic = """O! say can you see, by the dawn's early light,
What so proudly we hailed at the twilight's last gleaming,
Whose broad stripes and bright stars through the perilous fight,
O'er the ramparts we watched, were so gallantly streaming?
And the rockets' red glare, the bombs bursting in air,
Gave proof through the night that our flag was still there;
O! say does that star-spangled banner yet wave
O'er the land of the free and the home of the brave?"""

In [45]:
# Transform these into DTMs with the same feature-columns as previously
# Remind yourself what the cv object is (see above, where we created it)

unknown_dtm = cv.transform([dickinson_canonic,anthem_patriotic]).toarray()

In [46]:
# What does the classifier think?

nb.predict(unknown_dtm)

array(['reviewed', 'random'], dtype='<U8')

In [47]:
# Although our classification is binary, Bayes theorem assigns
# a probability of membership in either category

# Just how confident is our classifier of its predictions?

nb.predict_proba(unknown_dtm)

array([[0.26167891, 0.73832109],
       [0.79942107, 0.20057893]])
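
The columns of `predict_proba` follow the order of the classifier's `classes_` attribute, which holds the labels in sorted order. A small sketch, reusing the toy swallow data from the preview (the `toy_` names are our own), shows one way to attach labels to the probabilities.

```python
from sklearn.naive_bayes import MultinomialNB

# Same toy data as the preview: columns are high, low, air-speed, velocity
toy_dtm = [[1, 0, 1, 1], [0, 1, 1, 1]]
toy_labels = ['unladen', 'coconut']

toy_nb = MultinomialNB()
toy_nb.fit(toy_dtm, toy_labels)

# classes_ holds the labels in sorted order; predict_proba columns follow it
print(toy_nb.classes_)

toy_probs = toy_nb.predict_proba([[1, 0, 0, 1]])

# Pair each probability with its class label
labeled_probs = dict(zip(toy_nb.classes_, toy_probs[0]))
print(labeled_probs)
```

For our poetry classifier the sorted labels are `'random'` then `'reviewed'`, so the first column above is the probability of *random* and the second the probability of *reviewed*.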

### Extra: Cross-Validation

Just how good is our classifier? We can evaluate it by randomly selecting texts from each category and setting them aside before training. We then see how well the classifier predicts their (known) categories.

Remember that if the classifier is trying to predict membership for just two categories, we would expect it to be correct about 50% of the time based on random chance. As a rule of thumb, if this kind of classifier has 65% accuracy or better under cross-validation, it has often identified a meaningful pattern.
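
Setting aside one random 10% of the texts gives a single estimate of accuracy. A fuller cross-validation repeats the idea over several folds and averages the results. The sketch below shows one way to do that with scikit-learn's `cross_val_score` on a made-up toy corpus; it is not the procedure used in the cells that follow.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical): six tiny 'reviewed' texts and six tiny 'random' texts
toy_texts = ['dusk sea stair', 'vague dusk roofs', 'sea stair vague',
             'roofs dusk sea', 'stair vague sea', 'dusk roofs stair',
             'emblem mission zephyr', 'mission cheering emblem', 'zephyr emblem cheering',
             'cheering mission zephyr', 'emblem zephyr mission', 'mission emblem cheering']
toy_labels = ['reviewed'] * 6 + ['random'] * 6

# Bundling the vectorizer with the classifier means the vocabulary is
# re-learned from the training folds only, each time
toy_model = make_pipeline(CountVectorizer(binary=True), MultinomialNB())

# cv=3 splits the data into three folds; each fold is held out once
fold_scores = cross_val_score(toy_model, toy_texts, toy_labels, cv=3)

print(fold_scores)
print(fold_scores.mean())
```

Averaging over folds makes the accuracy estimate less dependent on which texts happened to land in the test set.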

In [33]:
# Randomize the order of our texts
import numpy
randomized_review = numpy.random.permutation(review_texts)
randomized_random = numpy.random.permutation(random_texts)

In [34]:
# We'll train our classifier on the first 90% of texts in the randomized list
# Then, we'll test it using the last 10%

per90_review = int(len(randomized_review)*.90)
per90_random = int(len(randomized_random)*.90)
print("90% index for random list:")
print(per90_random)
print()

training_set = list(randomized_review[:per90_review]) + list(randomized_random[:per90_random])
test_set = list(randomized_review[per90_review:]) + list(randomized_random[per90_random:])

training_labels = ['reviewed'] * per90_review + ['random'] * per90_random
test_labels = ['reviewed'] * (len(randomized_review) - per90_review) + ['random'] * (len(randomized_random) - per90_random)

print(len(training_set))
print(len(training_labels))

print(len(test_set))
print(len(test_labels))

90% index for random list:
316

637
637
72
72


In [35]:
# Transform training and test texts into DTMs
# Note that 'min_df' has been scaled down to 162 (90% of 180), in proportion to the training set

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words = 'english', min_df = 162, binary=True)
training_dtm = cv.fit_transform(training_set)
test_dtm = cv.transform(test_set)

In [36]:
# Train, Predict, Evaluate

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

nb = MultinomialNB()
nb.fit(training_dtm, training_labels)
predictions = nb.predict(test_dtm)
accuracy_score(test_labels, predictions)

0.6944444444444444

# Exercises!

* Re-initialize the CountVectorizer above with the argument min_df = 1. How many unique words are there in the total vocabulary of the corpus?

* Repeat the exercise above with min_df = 360. (That is, words are only included if they appear in at least half of all documents.) What is the size of the vocabulary now? Does the list of these very common words look as you would expect?

* What kinds of patterns do you notice among the 'most informative features' from our original model? Try looking at the top fifty most informative words for each category. Does this challenge what you think you know about literary prestige?

* CHALLENGE: Another way to do cross-validation is to set aside a single author's texts (one or more) from the training set and make a prediction for that author alone. After doing this for all authors, tally the number of texts that were correctly predicted to calculate the overall accuracy. Implement this.
    * Hint: look at the title of each text - they are all in a standardized format. Use your string slicing techniques to create a list, the same length as `all_texts` and `all_labels`, that contains the author name. Go from there to check how accurate the classifier is for each author, and the average accuracy over all authors.
    * Hint 2: You won't be able to actually complete this (it takes a long time!), but just work toward this. Write a few functions. Think through how to remove texts from lists, etc. It will help you get more comfortable with these data structures and these scikit-learn functions.
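
As a starting point for the first hint, here is one possible way to pull a grouping key out of the file names. It assumes every name follows the pattern "ID Surname, ..." seen in the listing earlier, and it keeps only the surname, so two authors who share a surname would be merged. The helper `author_surname` and the third file name are hypothetical.

```python
import os
from collections import defaultdict

# Two file names copied from the listing earlier, plus one made-up name
# (the third entry is hypothetical, added to show an author with two volumes)
toy_file_names = [
    'poems/reviewed/689 Hood, Thomas, The plea of the midsummer fairies 1827.txt',
    'poems/reviewed/580 Browning, Elizabeth Barrett, Poems 1853.txt',
    'poems/reviewed/999 Hood, Thomas, A second volume 1830.txt',
]

def author_surname(file_name):
    """Return the text between the leading ID number and the first comma."""
    base = os.path.basename(file_name)      # drop the directory
    without_id = base.split(' ', 1)[1]      # drop the leading ID number
    return without_id.split(',', 1)[0]      # keep everything before the first comma

toy_authors = [author_surname(name) for name in toy_file_names]

# Group text indices by author, so one author's texts can be held out together
indices_by_author = defaultdict(list)
for index, author in enumerate(toy_authors):
    indices_by_author[author].append(index)

print(toy_authors)
print(dict(indices_by_author))
```

With the indices grouped this way, each author's texts can be removed from the training lists together and used as that round's test set.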

In [55]:
# Re-initialize CountVectorizer (imported earlier) with min_df = 1
cv_1 = CountVectorizer(stop_words = 'english', min_df=1, binary = True, max_features = None)
dtm = cv_1.fit_transform(all_texts).toarray()
feature_list = cv_1.get_feature_names_out()
dtm_df = pandas.DataFrame(dtm, columns = feature_list, index = all_file_names)
dtm_df.shape

len(cv_1.get_feature_names_out())

149636

In [56]:
cv_360 = CountVectorizer(stop_words = 'english', min_df=360, binary = True, max_features = None)
dtm = cv_360.fit_transform(all_texts).toarray()
feature_list = cv_360.get_feature_names_out()
dtm_df = pandas.DataFrame(dtm, columns = feature_list, index = all_file_names)
dtm_df.shape
#1362 words

(709, 1362)

In [51]:
cv.get_feature_names_out()[:20]

array(['abroad', 'afar', 'age', 'ages', 'ago', 'agony', 'ah', 'aid',
       'air', 'alas', 'alike', 'altar', 'amid', 'ancient', 'angel',
       'angels', 'angry', 'anguish', 'answer', 'answered'], dtype=object)

In [57]:
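# Caution: the default classifier `nb` was fit on the DTM produced by `cv`, not `cv_360`.
# Pairing its weights with cv_360's vocabulary mismatches words and ratios, so fit a
# classifier on cv_360's DTM before reading anything into the lists below.
def most_informative_features(text_class, vectorizer = cv_360, classifier = nb, top_n = 50):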

    import numpy as np

    feature_names = vectorizer.get_feature_names_out()
    class_index = np.where(classifier.classes_==(text_class))[0][0]
    
    class_prob_distro = np.exp(classifier.feature_log_prob_[class_index])
    alt_class_prob_distro = np.exp(classifier.feature_log_prob_[1 - class_index])
    
    odds_ratios = class_prob_distro / alt_class_prob_distro
    odds_with_fns = sorted(zip(odds_ratios, feature_names), reverse = True)
    
    return odds_with_fns[:top_n]

In [58]:
most_informative_features('random')

[(2.323890855405321, 'shout'),
 (2.135859350418144, 'hands'),
 (2.109077414989705, 'gleams'),
 (1.908442196668769, 'race'),
 (1.834578293946445, 'hear'),
 (1.8296697232603005, 'wondering'),
 (1.754989326392534, 'chill'),
 (1.6989790287417066, 'proud'),
 (1.6898447328882535, 'doom'),
 (1.6757053434164795, 'thoughts'),
 (1.6750497466467533, 'pray'),
 (1.6612239392141168, 'humble'),
 (1.6474948157495353, 'swept'),
 (1.6327850406089144, 'pleasant'),
 (1.6263730873424889, 'growing'),
 (1.6180752654682917, 'mingled'),
 (1.6167703660606536, 'veil'),
 (1.6161020029494277, 'arch'),
 (1.5990390858745465, 'knows'),
 (1.5922798626854662, 'band'),
 (1.5922798626854633, 'better'),
 (1.5627739589269727, 'holds'),
 (1.557397443013232, 'best'),
 (1.548436583157001, 'short'),
 (1.5459178549811932, 'mourn'),
 (1.5423355721910534, 'answered'),
 (1.5392459576634272, 'close'),
 (1.5353736407888015, 'laughter'),
 (1.529565165476866, 'calling'),
 (1.5277640878607595, 'sweetly'),
 (1.527172160666705, 'stream')

In [59]:
most_informative_features('reviewed')

[(2.2574987525392105, 'sea'),
 (1.900651280585991, 'bare'),
 (1.843996505786981, 'march'),
 (1.810469296590852, 'used'),
 (1.751051631404383, 'shell'),
 (1.6644518882339494, 'dancing'),
 (1.6571038063193357, 'right'),
 (1.639640183067731, 'tune'),
 (1.6289816336653362, 'oft'),
 (1.6233658388189178, 'tears'),
 (1.616125134809911, 'ring'),
 (1.6041116850597574, 'felt'),
 (1.5975981270916337, 'fires'),
 (1.5918651194585287, 'vision'),
 (1.5910290558453675, 'know'),
 (1.5851766105532423, 'shook'),
 (1.5593780762042724, 'places'),
 (1.5585590751138287, 'dumb'),
 (1.5479885010398409, 'dust'),
 (1.5387392487250964, 'sweeter'),
 (1.5379255315126605, 'glowing'),
 (1.53433095547712, 'floating'),
 (1.5212756254734903, 'join'),
 (1.5185590618565723, 'looking'),
 (1.5060948287907157, 'mock'),
 (1.5022191344294444, 'wine'),
 (1.5022191344294444, 'foe'),
 (1.4992355810343545, 'saw'),
 (1.4954959911496868, 'rude'),
 (1.4869611376808562, 'sought'),
 (1.4772995708828047, 'maiden'),
 (1.4760703090357317,