# More Corpus fun with NLTK

16 Feb 2022

## Let's load one of NLTK's corpora

NLTK can read in an entire corpus from a directory (the “root” directory).

As it reads in a corpus, it applies word tokenization (available through `.words()`) and sentence tokenization (available through `.sents()`).

I downloaded the inaugural addresses corpus, which is what we'll be working on today, from http://www.nltk.org/nltk_data/. I'm also putting it on our GitHub.

Load these files into your files directory (on the tab at the left).

In [None]:
import nltk 
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/content/inagural/'  
inaug = PlaintextCorpusReader(corpus_root, '.*txt')  # all files ending in 'txt' 

And then let's print out the list of the files

In [None]:
# Turn off pretty printing, which prints too many lines
%pprint
# .txt file names as file IDs
inaug.fileids()

NLTK treats all of these files as one corpus, and we can look at the first 50 words of it. Since the files are stored in order, these are the first 50 words of Washington's inaugural address.

In [None]:
# NLTK automatically tokenizes the corpus for us. First 50 words: 
inaug.words()[:50]

If you know which file you're looking for, you can specify the ID, and look into files that way too:

In [None]:
# First 50 words from Obama 2013:
inaug.words('2013-Obama.txt')[:50]

NLTK automatically segments sentences too; they are accessed through `.sents()`. This requires the punkt package, which we already downloaded in the first cell.

In [None]:

print(inaug.sents('2013-Obama.txt')[0])   # first sentence
print(inaug.sents('2013-Obama.txt')[1])   # 2nd sentence

So let's do some comparison across these presidents.

In [None]:
# How long are these speeches in terms of word and sentence count?
print('Washington -- Word count:', len(inaug.words('1789-Washington.txt')), 'Sentence count:', len(inaug.sents('1789-Washington.txt')))
print('Obama -- Word count:', len(inaug.words('2013-Obama.txt')), 'Sentence count:', len(inaug.sents('2013-Obama.txt')))


We can write a loop that will give us these stats for all the presidents and their speeches:

In [None]:
# Loop through the file IDs and print out various stats.
# While looping, populate speech_stats, which holds (avg. sentence length, file ID) pairs.

speech_stats = []    # initialize an empty list

for file in inaug.fileids():
    wcount = len(inaug.words(file))
    scount = len(inaug.sents(file))
    # Pass separate arguments so that sep='\t' actually puts tabs between them
    print(f'Words: {wcount}', f'Sents: {scount}', f'Avg. w/s: {round(wcount/scount)}', file, sep='\t')
    speech_stats.append( (wcount/scount, file) )      # append a pair (x, y) to list


### OH NO! AN ERROR!

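Look at the error output -- the 2005-Bush.txt file produces a **Unicode encoding error**. By default the corpus reader expects UTF-8, so this usually means the file was saved in a different encoding.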

WHAT CAN WE DO??

http://www.presidency.ucsb.edu/inaugurals.php

Ideas: 
- save a new copy of the file, making sure it's in utf-8 format
- use regex to find & replace all of the errors (if we can understand them)
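
The first idea can be sketched with the standard library alone. This is a minimal sketch, not the official fix for this corpus: the helper name `reencode_to_utf8` is hypothetical, and the fallback encoding `latin-1` is an assumption about how the problem file was saved, so check the actual error message before relying on it.

```python
from pathlib import Path

def reencode_to_utf8(path, fallback="latin-1"):
    """Rewrite `path` in UTF-8; return True if the file had to be converted."""
    raw = Path(path).read_bytes()
    try:
        raw.decode("utf-8")          # already valid UTF-8: leave the file alone
        return False
    except UnicodeDecodeError:
        text = raw.decode(fallback)  # ASSUMPTION: the file is really latin-1
        Path(path).write_bytes(text.encode("utf-8"))
        return True

# Hypothetical usage, assuming the corpus folder from the first cell:
# reencode_to_utf8('/content/inagural/2005-Bush.txt')
```

If you already know the file's real encoding, another option is to pass it to `PlaintextCorpusReader` through its `encoding` argument instead of rewriting the file.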



In [None]:
# Turn pretty print back on 
%pprint
# sorted() returns an alphabetically/numerically sorted list
sorted(speech_stats)

How can we write the same thing, but with a list comprehension, you ask??

In [None]:
speech_stats2 = [(len(inaug.words(f))/len(inaug.sents(f)), f) for f in inaug.fileids()]
sorted(speech_stats2)
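
Because each entry is an `(average, file ID)` pair, `sorted()` compares the averages first and only looks at the file ID when two averages tie. Here is a small sketch of that behavior; the numbers and file names are made up for illustration and are not real corpus results.

```python
# Made-up (avg_sentence_length, file_id) pairs, for illustration only
toy_stats = [(21.5, 'c.txt'), (30.0, 'a.txt'), (21.5, 'b.txt')]

ascending = sorted(toy_stats)                           # by average, then by file ID
descending = sorted(toy_stats, reverse=True)            # longest sentences first
by_name = sorted(toy_stats, key=lambda pair: pair[1])   # sort on the file ID instead

print(ascending)
print(descending)
print(by_name)
```

The `key=` version is handy when you want the speeches back in file order after computing the averages.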

## Doing more things with NLTK

Find the size of the corpus (aka, the length of the words list):

In [None]:
# Corpus size in number of words
len(inaug.words())

We can build a word frequency distribution of the words in the corpus, too -- these are all things we did on Monday.

In [None]:
# Build word frequency distribution for the entire corpus
inaug_freq = nltk.FreqDist(inaug.words())
inaug_freq.most_common(30)
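
`nltk.FreqDist` is a subclass of Python's `collections.Counter`, so you can try the same idea on a tiny example without loading a corpus. The sketch below uses a made-up token list; only the counting logic carries over to `inaug_freq`.

```python
from collections import Counter

# Made-up tokens, for illustration only
tokens = "the people and the government of the people".split()
toy_freq = Counter(tokens)

print(toy_freq.most_common(2))         # the two most frequent words with their counts
print(toy_freq['the'])                 # raw count of one word
print(toy_freq['the'] / len(tokens))   # relative frequency: count / total tokens
```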

### Your turn

- Explore the corpus! 
- Are the following words getting more or less frequent: 'we', 'the', 'America', 'people'?
- Are _words_ themselves getting longer or shorter? Hint: use `sum([1, 2, 3, 4])`
- Can you print plots of these summaries?
- What do our n-gram statistics look like?
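
For the n-gram question, here is one possible starting point: a minimal sketch on a made-up token list. The helper name `ngrams` is hypothetical and is written with plain Python so you can see how it works.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # Line up the list with shifted copies of itself, then zip them together
    return list(zip(*(tokens[i:] for i in range(n))))

# Made-up tokens, for illustration only
tokens = "we the people of the united states we the people".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(2))
```

NLTK also ships its own `nltk.bigrams()` and `nltk.ngrams()` helpers, which you can feed straight into `nltk.FreqDist`.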

In [None]:
shake = "shall i compare thee to a summer's day".split()
shake

Make a list of the word lengths

In [None]:
[len(word) for word in shake]

What's this doing? Make sure you can explain it!

In [None]:
sum([len(word) for word in shake])

In [None]:
len(shake)

Can you describe what this line of code is doing?

In [None]:
sum([len(word) for word in shake]) / len(shake)

Here's an example of how you might find the average word length...

Look at the last line here -- what is this sep="\t" doing in the print statement? What happens if you change it?

In [None]:
# Average word length. 
# Trending DOWN! 
for file in inaug.fileids():
    wcount = len(inaug.words(file))
    wlen_sum = sum([len(x) for x in inaug.words(file)])
    wlen_avg = wlen_sum/wcount
    print(wlen_avg, file, sep="\t")

Look at what happens to the relative frequency of the word 'we' across the speeches -- can you draw any conclusions about this?

What about any of the other words mentioned, or other words you try?

In [None]:
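# 'we' and related forms.
# fd.freq() returns a relative frequency (count / total tokens), so speeches of different lengths are comparable.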
for file in inaug.fileids():
    fd = nltk.FreqDist(inaug.words(file))  # frequency distribution for each speech
    wecount = fd.freq("We") + fd.freq("we") + fd.freq("us") + fd.freq("Our") + fd.freq("our")
    print(wecount, file, sep="\t")
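
Listing capitalized and lowercase forms by hand is easy to get wrong (the loop above has no entry for "Us", for example). One alternative, sketched here on a toy token list rather than the real corpus, is to lowercase every token before counting. The helper name `relative_freq` and the sample sentence are made up for illustration.

```python
from collections import Counter

def relative_freq(tokens, targets):
    """Share of tokens (ignoring case) that belong to the set `targets`."""
    counts = Counter(tok.lower() for tok in tokens)
    return sum(counts[t] for t in targets) / len(tokens)

# Made-up tokens, for illustration only
toy_tokens = "We know that our people trust us and we trust them".split()
print(relative_freq(toy_tokens, {"we", "us", "our"}))
```

To try it on a speech, you could pass `inaug.words(file)` as `tokens` inside the loop above.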