## Part 3. N-grams

The NLTK library provides convenient functions for extracting n-grams and counting their frequencies. First, we install and import NLTK again and download the Gutenberg corpus:

In [None]:
import sys
!{sys.executable} -m pip install nltk
import nltk
nltk.download('gutenberg')

### 3.1 Unigram frequencies

We start with the simplest n-grams by computing unigram frequencies, that is, the frequency of every word type in a corpus (here, Jane Austen's _Emma_). We then print a few words and their frequencies:

In [None]:
tokenized = nltk.corpus.gutenberg.words('austen-emma.txt')
unigram_freqs = nltk.FreqDist(tokenized)

print('"Emma" occurs', unigram_freqs['Emma'], 'times.')
print('"handsome" occurs', unigram_freqs['handsome'], 'times.')
print('"rich" occurs', unigram_freqs['rich'], 'times.')

Often we do not care how a word is capitalized, because we want to count all variants as the same word (for instance, "same" vs. "Same" vs. "SAME").

To unify all capitalization variants, we need to convert every word to the same format, for instance lowercase, in which all letters are small letters. Let us repeat the example with all words in lowercase:

In [None]:
tokenized = [w.lower() for w in nltk.corpus.gutenberg.words('austen-emma.txt')]
unigram_freqs = nltk.FreqDist(tokenized)

print('"emma" occurs', unigram_freqs['emma'], 'times.')
print('"handsome" occurs', unigram_freqs['handsome'], 'times.')
print('"rich" occurs', unigram_freqs['rich'], 'times.')
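
The effect of lowercasing can also be seen on a tiny example. The sketch below uses `collections.Counter` from the standard library instead of `nltk.FreqDist` (which behaves much like a `Counter`), and the token list is invented for illustration, not taken from the corpus:

```python
from collections import Counter

# Toy token list, invented for illustration (not from the corpus).
tokens = ["The", "same", "word", ",", "the", "Same", "word", ",", "THE", "SAME", "word"]

# Counting the raw tokens keeps every capitalization variant separate.
raw_counts = Counter(tokens)

# Lowercasing first merges the variants into a single word type.
folded_counts = Counter(t.lower() for t in tokens)

print("raw:   ", raw_counts["the"], raw_counts["The"], raw_counts["THE"])
print("folded:", folded_counts["the"])

# A word that is always written in lowercase keeps the same count.
print("word:  ", raw_counts["word"], folded_counts["word"])
```

Counts can only stay the same or grow after lowercasing, since variants are merged and never split.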

Which of the unigram frequencies changed and which ones did not? What might be the explanation?

Next, let us find out which words are the most frequent ones in the corpus. Here we retrieve the 40 most frequent word forms and how many times they occur in the corpus:

In [None]:
print(unigram_freqs.most_common(40))

Let us plot this information graphically as a bar chart:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

n = 40  # number of most frequent words to plot

# Retrieve the top n (word, frequency) pairs once and split them into two lists
top = unigram_freqs.most_common(n)
words = [w for w, _ in top]
freqs = [f for _, f in top]

plt.figure()
plt.bar(range(n), freqs)
plt.xticks(range(n), words, rotation=90)
plt.title("Unigrams")
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.show()

What happens if you change the number of words to plot?

### 3.2 Bigram and trigram frequencies and beyond

As with unigrams (that is, single words), we can compute the frequencies of bigrams, trigrams, and higher-order n-grams. We start with bigrams:

In [None]:
bigrams = nltk.bigrams(tokenized)
bigram_freqs = nltk.FreqDist(bigrams)

print('"emma thought" occurs', bigram_freqs['emma', 'thought'], 'times.')
print('"very handsome" occurs', bigram_freqs['very', 'handsome'], 'times.')
print('"not rich" occurs', bigram_freqs['not', 'rich'], 'times.')
print()

print("The 20 most common bigrams are:", bigram_freqs.most_common(20))
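
To make clear what a bigram is, here is a minimal sketch that builds bigrams by hand with `zip` and counts them with `collections.Counter`. It is meant only as an illustration of the idea, not as NLTK's implementation, and the token list is invented:

```python
from collections import Counter

# Toy token list, invented for illustration.
tokens = ["she", "was", "very", "handsome", "and", "she", "was", "rich"]

# Pair every token with its right-hand neighbour.
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

print(bigrams[:3])
print('"she was" occurs', bigram_counts["she", "was"], "times.")
print('"was she" occurs', bigram_counts["was", "she"], "times.")
```

A sequence of N tokens yields N - 1 bigrams, and word order matters: "she was" and "was she" are different bigrams.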

Similarly for trigrams:

In [None]:
trigrams = nltk.trigrams(tokenized)
trigram_freqs = nltk.FreqDist(trigrams)

print('"emma thought infinitely" occurs', trigram_freqs['emma', 'thought', 'infinitely'], 'times.')
print('"rich and handsome" occurs', trigram_freqs['rich', 'and', 'handsome'], 'times.')
print()

print("The 20 most common trigrams are:", trigram_freqs.most_common(20))

The same approach works for n-grams of any order. Here we use fourgrams (n = 4):

In [None]:
fourgrams = nltk.ngrams(tokenized, 4)
fourgram_freqs = nltk.FreqDist(fourgrams)

print('"emma was particularly pleased" occurs', fourgram_freqs['emma', 'was', 'particularly', 'pleased'], 'times.')
print('"she had often been" occurs', fourgram_freqs['she', 'had', 'often', 'been'], 'times.')
print()

print("The 20 most common fourgrams are:", fourgram_freqs.most_common(20))
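
The general case can be sketched in a few lines of plain Python. The helper `simple_ngrams` below is a hypothetical name introduced for this illustration (it is not part of NLTK), and the token list is invented:

```python
from collections import Counter

def simple_ngrams(tokens, n):
    """Return all contiguous windows of n tokens as a list of tuples."""
    # zip stops at the shortest slice, so only complete windows are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

# Toy token list, invented for illustration.
tokens = ["she", "had", "often", "been", "told", "she", "had", "often", "been", "wrong"]

for n in (1, 2, 3, 4):
    grams = simple_ngrams(tokens, n)
    print(n, len(grams), Counter(grams).most_common(1))
```

With N tokens there are N - n + 1 n-grams of order n, so higher orders give fewer and typically rarer n-grams.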

### 3.3 Conditional frequencies

When we work with n-grams, we are usually interested not only in overall n-gram frequencies in the corpus but also in _conditional_ frequencies. For bigrams, for instance, we want to know which words follow a particular word and how many times each of them does so:

In [None]:
# nltk.bigrams returns a generator that was used up above, so we create it again
bigrams = nltk.bigrams(tokenized)

conditional_bigram_dist = nltk.ConditionalFreqDist(bigrams)

print('The 20 most common followers of "emma" are:', conditional_bigram_dist['emma'].most_common(20))
print()

print('The 20 most common followers of "particularly" are:', conditional_bigram_dist['particularly'].most_common(20))
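
Conceptually, a conditional frequency distribution is one frequency table per first word. The following sketch reproduces that idea with a `defaultdict` of `Counter` objects from the standard library; the variable name `followers` and the token list are assumptions made for this illustration:

```python
from collections import Counter, defaultdict

# Toy token list, invented for illustration.
tokens = ["emma", "was", "not", "fond", "of", "it", ",",
          "emma", "was", "sure", ",", "emma", "thought"]

# One Counter per first word, counting the words that directly follow it.
followers = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    followers[first][second] += 1

print('Followers of "emma":', followers["emma"].most_common())

# Dividing by the total number of followers gives a relative frequency.
total = sum(followers["emma"].values())
print('Share of "was" after "emma":', followers["emma"]["was"] / total)
```

The relative frequency in the last line is a simple way to compare followers of words that differ a lot in overall frequency.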


We can also display this information as a table. Each row is labeled with a first word and each column with a second word; a cell shows how many times the column word directly follows the row word:

In [None]:
first_words = [ 'emma', 'harriet', 'always', 'particularly', 'was', 'not' ]
second_words = [ 'said', 'thought', 'admired', 'fond', 'you', 'good', 'handsome', 'horrible' ]

# tabulate is called on the distribution itself; only the keyword arguments are needed
conditional_bigram_dist.tabulate(conditions=first_words, samples=second_words)

For instance, the table shows that _"not fond"_ occurs twice in the corpus, whereas _"particularly horrible"_ does not occur at all.
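
To see how such a table is read, here is a small hand-made version on invented data. It is only a sketch of the layout (rows are first words, columns are second words) and does not reproduce NLTK's exact formatting:

```python
from collections import Counter, defaultdict

# Toy token list, invented for illustration.
tokens = ["not", "fond", "of", "her", ",", "not", "fond", "of", "him", ",",
          "was", "fond", "of", "both"]

followers = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    followers[first][second] += 1

first_words = ["not", "was", "of"]          # row labels
second_words = ["fond", "her", "horrible"]  # column labels

# Header line with the second words.
print(" " * 6 + "".join(f"{w:>10}" for w in second_words))

# One line per first word; a missing combination simply counts as 0.
rows = []
for first in first_words:
    row = [followers[first][second] for second in second_words]
    rows.append(row)
    print(f"{first:>6}" + "".join(f"{count:>10}" for count in row))
```

In this toy table, the cell in row "not" and column "fond" is 2, and the whole "horrible" column is 0 because that word never occurs.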

Now it is your turn: change the words in `first_words` and `second_words` to analyze a vocabulary of your own choice.

After this, you can continue to Part 4.