# Assignment:

Using the collocation module in nltk, find the bigrams and trigrams that occur three times or more in one of Austen’s texts. Print their counts. Compare them to the collocations that are shown in Voyant for this text. Try to get yours as close to those in Voyant as possible by using stop words, window size, etc.

Extra points if you can print the ngrams in order of frequency.

In [1]:
with open("1815_Emma.txt", "r") as f: 
    emmaString = f.read()

Tokenization using nltk library and its word_tokenize() function:

In [2]:
import nltk

emmaTokens = nltk.word_tokenize(emmaString.lower()) #make the string lower case
emmaTokens[:10]


['emma',
 'by',
 'jane',
 'austen',
 'volume',
 'i',
 'chapter',
 'i',
 'emma',
 'woodhouse']

To compare my results fairly with Voyant's, I wanted to use Voyant's default stopword list rather than the nltk one. Instead of importing the nltk stopwords and comparing the two lists, I copied the Voyant stopwords into a text file, read it in, and split it into a (lowercase) list.

In [3]:
with open("Voyant_Stopwords.txt", "r") as f:
    stopwords = f.read()

stopwords = stopwords.lower().split()
stopwords[:10]

['!', '$', '%', '&', '-', '.', '0', '1', '10', '100']

I used a list comprehension to drop the stopwords, along with any leftover tokens that don't start with a letter (punctuation and numbers):

In [5]:
# keep a token only if it starts with a letter and is not a stopword
emmaTokensCleaner = [word for word in emmaTokens
                     if word[0].isalpha() and word not in stopwords]
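
Since `word not in stopwords` scans the whole list for every token, converting the list to a set first should make this filter faster on a novel-length text. Here is a minimal sketch on toy data; `tokens`, `stopword_list` and `stopword_set` are made-up names and values for illustration, not the real Emma tokens or the Voyant file:

```python
# Toy stand-ins for emmaTokens and the Voyant stopword list (hypothetical values).
tokens = ["emma", "woodhouse", ",", "handsome", "and", "clever", "."]
stopword_list = ["and", "the", ".", ","]

# A set gives constant-time membership tests; a list is scanned item by item.
stopword_set = set(stopword_list)

# Same filter as in the cell above: keep tokens that start with a letter
# and are not stopwords.
cleaned = [word for word in tokens
           if word[0].isalpha() and word not in stopword_set]
print(cleaned)
```

The result is the same either way; only the lookup speed changes.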

I imported the nltk collocations module and looked for bigrams. Based on the nltk documentation, it seems that the collocation window is counted differently than in Voyant. Voyant's window counts the words on either side of the keyword (so a window of 2 covers two words on each side), while the nltk window is the length of a span that starts at the first word and only looks forward (so a window of 5 pairs each word with the four words that follow it, in that order). The window is also measured over the filtered tokens, so removed stopwords do not count toward the distance.

In [6]:
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(emmaTokensCleaner, window_size=5) # window = 5 tokens
finder.apply_freq_filter(3) # ignore collocations that appear fewer than three times in the corpus

finder.nbest(bigram_measures.pmi, 20) # show the 20 bigrams with the highest PMI score

[('dr.', 'hughes'),
 ('mermaid', 'shark'),
 ('behold', 'monarch'),
 ('behold', 'seas'),
 ('caro', 'sposo'),
 ('kitty', 'frozen'),
 ('monarch', 'seas'),
 ('luxurious', 'selfish'),
 ('conjecture', 'conjectures'),
 ('gathering', 'strawberries'),
 ('husbands', 'wives'),
 ('friday', 'saturday'),
 ('proud', 'luxurious'),
 ('eating', 'drinking'),
 ('basin', 'gruel'),
 ('nicely', 'dressed'),
 ('frozen', 'maid'),
 ('kitty', 'maid'),
 ('lovely', 'reigns'),
 ('sore', 'throat')]
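
To make the window behaviour concrete, here is a small pure-Python sketch of how I understand nltk to count bigrams when a window size is given. The `windowed_pairs` function is my own illustration, not part of nltk, and the assumption that the window only looks forward from the first word comes from my reading of the nltk source:

```python
from collections import Counter

def windowed_pairs(words, window_size):
    """Count ordered pairs (w1, w2) where w2 follows w1 inside the window.

    Assumption: this mimics how BigramCollocationFinder.from_words treats
    window_size, i.e. the window starts at w1 and only looks forward.
    """
    counts = Counter()
    for i, w1 in enumerate(words):
        # Pair w1 with each of the next (window_size - 1) tokens.
        for w2 in words[i + 1:i + window_size]:
            counts[(w1, w2)] += 1
    return counts

words = ["a", "b", "c", "d"]
adjacent = windowed_pairs(words, 2)  # plain adjacent bigrams
wider = windowed_pairs(words, 3)     # also pairs words one token apart
print(sorted(adjacent))
print(sorted(wider))
```

With a window of 3, the pair `('a', 'c')` is counted even though the two words are not adjacent, but `('c', 'a')` never is, because the pairs are ordered.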

Then I wanted to sort the bigrams by how frequently they appeared. I used this solution found on the internet:

In [7]:
# sort by the second element of each (bigram, count) pair, i.e. the frequency, from highest to lowest
for bigram, frequency in sorted(finder.ngram_fd.items(), key=lambda f: f[1], reverse=True)[:20]:
    print(bigram, frequency)

('mr.', 'knightley') 315
('mrs.', 'weston') 253
('mr.', 'elton') 230
('mr.', 'weston') 190
('miss', 'woodhouse') 177
('mrs.', 'elton') 151
('frank', 'churchill') 142
('mr.', 'woodhouse') 141
('miss', 'fairfax') 131
('miss', 'bates') 119
('jane', 'fairfax') 106
('young', 'man') 90
('said', 'emma') 82
('mr.', 'mr.') 78
('mr.', 'churchill') 73
('miss', 'smith') 69
('great', 'deal') 65
('said', 'mr.') 64
('miss', 'miss') 63
('emma', 'mr.') 60


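These are pretty close to the counts found in Voyant, but I couldn't get them to match exactly, even after experimenting with the window size. My guess is that the two tools count collocations differently. Because the window is wider than two tokens, some pairs, such as ('mr.', 'mr.'), are most likely co-occurrences within the window rather than adjacent words.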

Next I did the same as above with trigrams:

In [8]:
Trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(emmaTokensCleaner, 7) # window = 7 words
finder.apply_freq_filter(3) 

finder.nbest(Trigram_measures.pmi, 20) 

[('fried', 'grease', 'roast'),
 ('cautious', 'advances', 'inch'),
 ('behold', 'monarch', 'seas'),
 ('approval', 'beam', 'soft'),
 ('shedding', 'nicely', 'dressed'),
 ('conundrum', 'reckon', 'low'),
 ('proud', 'luxurious', 'selfish'),
 ('agreed', 'conundrum', 'reckon'),
 ('kitty', 'frozen', 'maid'),
 ('modern', 'ease', 'disgusts'),
 ('angel', 'gesture', 'observe'),
 ('approval', 'beam', 'eye'),
 ('card-room', 'card-room', 'cards'),
 ('william', 'coxe', 'pert'),
 ('alphabet', 'express', 'perfection'),
 ('day', 'party.', 'professional'),
 ('fried', 'smallest', 'roast'),
 ('conjecture', 'aye', 'conjectures'),
 ('humours', 'house', 'large.'),
 ('sick', 'prosperity', 'indulgence')]

In [9]:
#sorting by frequency:

for trigram, frequency in sorted(finder.ngram_fd.items(), key=lambda f: f[1], reverse=True)[:20]:
    print(trigram, frequency)

('mr.', 'frank', 'churchill') 52
('mr.', 'john', 'knightley') 37
('mr.', 'elton', 'mr.') 36
('mr.', 'knightley', 'mr.') 36
('said', 'mr.', 'knightley') 31
('mr.', 'mrs.', 'weston') 28
('dear', 'miss', 'woodhouse') 27
('mrs.', 'weston', 'emma') 27
('mr.', 'mr.', 'knightley') 26
('emma', 'mr.', 'knightley') 25
('emma', 'mrs.', 'weston') 24
('mr.', 'knightley', 'emma') 24
('mr.', 'mr.', 'elton') 23
('mr.', 'elton', 'harriet') 22
('miss', 'woodhouse', 'miss') 21
('said', 'mrs.', 'weston') 21
('miss', 'smith', 'miss') 20
('mrs.', 'weston', 'said') 20
('emma', 'mr.', 'elton') 20
('mrs.', 'miss', 'bates') 19


That is the end of my assignment, but I wanted to play around with other things in nltk. I decided to try to filter out specific words that I didn't add to the stopwords list.

First I tried a function that showed the raw frequency of the trigrams:

In [10]:
finder = TrigramCollocationFinder.from_words(emmaTokensCleaner)
finder.score_ngrams(Trigram_measures.raw_freq)[:20] # raw_freq = number of times the trigram occurs divided by the total number of tokens

[(('mr.', 'frank', 'churchill'), 0.0007748487396343374),
 (('mr.', 'john', 'knightley'), 0.00046161201510130733),
 (('dear', 'miss', 'woodhouse'), 0.0003956674415154063),
 (('said', 'mr.', 'knightley'), 0.00032972286792950524),
 (('mr.', 'mrs.', 'weston'), 0.00028026443774007944),
 (('oh', 'miss', 'woodhouse'), 0.00023080600755065366),
 (('mrs.', 'john', 'knightley'), 0.00021431986415417842),
 (('said', 'mrs.', 'weston'), 0.00021431986415417842),
 (('said', 'mr.', 'woodhouse'), 0.0001813475773612279),
 (('poor', 'miss', 'taylor'), 0.00014837529056827735),
 (('said', 'frank', 'churchill'), 0.00014837529056827735),
 (('colonel', 'mrs.', 'campbell'), 0.0001318891471718021),
 (('mr.', 'knightley', 'mr.'), 0.0001318891471718021),
 (('mrs.', 'miss', 'bates'), 0.0001318891471718021),
 (('fine', 'young', 'man'), 0.00011540300377532683),
 (('miss', 'smith', 'miss'), 0.00011540300377532683),
 (('said', 'mr.', 'weston'), 0.00011540300377532683),
 (('frank', 'churchill', 'miss'), 9.891686037885158

Then I used the length function to see how many trigrams had been found:

In [11]:
len(finder.score_ngrams(Trigram_measures.raw_freq)) #showing how many trigrams there are

59455

Then I applied word filters as shown in the nltk documentation.

In [16]:
finder.apply_word_filter(lambda w: w in ("mr.", "mrs.", "miss")) # drop any trigram that contains one of these words
finder.score_ngrams(Trigram_measures.raw_freq)[0:20]

[(('said', 'frank', 'churchill'), 0.00014837529056827735),
 (('fine', 'young', 'man'), 0.00011540300377532683),
 (('amiable', 'young', 'man'), 8.243071698237631e-05),
 (('dare', 'say', 'shall'), 8.243071698237631e-05),
 (('great', 'deal', 'better'), 8.243071698237631e-05),
 (('know', 'dare', 'say'), 8.243071698237631e-05),
 (('said', 'emma', 'smiling'), 8.243071698237631e-05),
 (('said', 'john', 'knightley'), 8.243071698237631e-05),
 (('box', 'hill', 'party'), 6.594457358590106e-05),
 (('charming', 'young', 'man'), 6.594457358590106e-05),
 (('marrying', 'jane', 'fairfax'), 6.594457358590106e-05),
 (('said', 'emma', 'laughing'), 6.594457358590106e-05),
 (('saw', 'jane', 'fairfax'), 6.594457358590106e-05),
 (('talked', 'great', 'deal'), 6.594457358590106e-05),
 (('young', 'man', 'young'), 6.594457358590106e-05),
 (('bad', 'sore', 'throat'), 4.945843018942579e-05),
 (('behold', 'monarch', 'seas'), 4.945843018942579e-05),
 (('child', 'good', 'fortune'), 4.945843018942579e-05),
 (('day', 'b

Checking the length again shows that fewer trigrams are counted now:

In [13]:
len(finder.score_ngrams(Trigram_measures.raw_freq))

53432

Being able to remove words after seeing what the results are without having to go back and edit the stopword list could be useful.
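
As I understand the nltk documentation, `apply_word_filter` removes every n-gram in which at least one word makes the filter function return True. The same idea can be sketched with a plain `Counter`; the counts and the `unwanted` set below are made up for illustration:

```python
from collections import Counter

# Toy trigram counts standing in for the finder's table (hypothetical values).
trigram_counts = Counter({
    ("mr.", "frank", "churchill"): 4,
    ("fine", "young", "man"): 2,
    ("dear", "miss", "woodhouse"): 3,
})

unwanted = {"mr.", "mrs.", "miss"}

# Drop a trigram if any one of its three words is unwanted.
filtered = Counter({
    trigram: count
    for trigram, count in trigram_counts.items()
    if not any(word in unwanted for word in trigram)
})
print(filtered)
```

Only `('fine', 'young', 'man')` survives here, since the other two trigrams each contain a filtered word.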

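Then I decided to try the `above_score` function, which returns only the trigrams whose score is above a threshold (here 1 divided by the number of trigrams in the text, which in practice keeps trigrams that occur at least twice). The alphabetical order puzzled me at first, but it comes from wrapping the result in `sorted()`: Python compares tuples of strings element by element, so the trigrams are ordered by their first word, then their second, and so on.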

In [15]:
sorted(finder.above_score(Trigram_measures.raw_freq, 1.0 /
                         len(tuple(nltk.trigrams(emmaTokensCleaner)))))[:20]

[('affection', 'sixteen', 'years'),
 ('ah', 'dear', 'perry'),
 ('amiable', 'young', 'man'),
 ('approval', 'beam', 'soft'),
 ('ask', 'frank', 'churchill'),
 ('bad', 'sore', 'throat'),
 ('barouche-landau', 'jane', 'fairfax'),
 ('bates', 'said', 'emma'),
 ('bates', 'said', 'great'),
 ('beam', 'soft', 'eye'),
 ('beautiful', 'little', 'friend'),
 ('became', 'perfectly', 'satisfied'),
 ('behold', 'monarch', 'seas'),
 ('believe', 'half', 'hour'),
 ('bends', 'slave', 'woman'),
 ('best', 'blessings', 'existence'),
 ('better', 'home', 'directly'),
 ('blushed', 'smiled', 'said'),
 ('boasted', 'power', 'freedom'),
 ('body', 'come', 'sit')]
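
The two steps in that cell (thresholding on raw frequency, then sorting) can be sketched without nltk. Everything below is toy data: the counts, `total_tokens` and `threshold` are assumptions chosen for illustration, and the score is computed as count divided by total tokens, which is how I understand `raw_freq` to work:

```python
# Toy trigram counts and token total (hypothetical values).
trigram_counts = {
    ("young", "man", "young"): 4,
    ("bad", "sore", "throat"): 3,
    ("amiable", "young", "man"): 5,
    ("box", "hill", "party"): 1,
}
total_tokens = 100

# raw_freq-style score: count divided by the total number of tokens.
scores = {t: c / total_tokens for t, c in trigram_counts.items()}

# Keep trigrams scoring above the threshold, as above_score would.
threshold = 1.0 / 98  # a text of 100 tokens contains 98 trigrams
kept = [t for t, s in scores.items() if s > threshold]

# sorted() compares tuples element by element, so the result is alphabetical.
print(sorted(kept))
```

The trigram that occurs only once scores 1/100, which is just below 1/98, so it is dropped; the rest come back in alphabetical order regardless of their counts.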

The End.