# Day 3: Corpus processing with NLTK

Na-Rae Han (`naraehan@pitt.edu`) and David J. Birnbaum (`djbpitt@pitt.edu`) 

June 25-29, [NASSLLI 2018 at CMU](https://www.cmu.edu/nasslli2018/) 

This tutorial is available at https://github.com/naraehan/NASSLLI2018-Corpus-Linguistics. 
- Jump to: [Day 1](day1.ipynb), [Day 2](day2.ipynb), [Day 3](day3.ipynb), [Day 4](day4.ipynb), [Day 5](day5.ipynb)

## Preparation

#### Data

- Download and unzip the “C-Span Inaugural Address Corpus”, available on NLTK’s corpora page: http://www.nltk.org/nltk_data/
- Place the unzipped `inaugural` folder **on your desktop** 

## Processing a corpus

- NLTK can read in an entire corpus from a directory (the “root” directory).
- As it reads in a corpus, it tokenizes the text: words are available through `.words()` and sentences through `.sents()`. 
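
To get a feel for what word tokenization does here, the following is a minimal standard-library sketch. It assumes the reader's default word tokenizer is NLTK's `WordPunctTokenizer`, which to our understanding splits text with the regular expression `\w+|[^\w\s]+`. The name `wordpunct_like` is made up for illustration and is not part of NLTK.

```python
import re

# Assumed pattern: a run of word characters, or a run of non-space punctuation.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_like(text):
    """Split text into alphanumeric tokens and punctuation tokens."""
    return WORDPUNCT_PATTERN.findall(text)

tokens = wordpunct_like("Fellow-Citizens of the Senate, we don't stop.")
print(tokens)
```

Punctuation marks come out as separate tokens, and a contraction such as "don't" is split into three pieces, so word counts based on this kind of tokenization include commas, periods, and apostrophes.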

In [None]:
import nltk 
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Users/narae/Desktop/inaugural'  # Use your own userid; Mac users should omit C:
inaug = PlaintextCorpusReader(corpus_root, '.*txt')  # all files ending in 'txt' 
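
If you would rather not type your user ID by hand, the path can probably be built in a way that works on both Windows and Mac. This is a sketch using the standard library's `pathlib`; it assumes your desktop folder is named `Desktop` and sits directly under your home directory, which may not hold on machines with a redirected or localized desktop.

```python
from pathlib import Path

# Path.home() resolves to the current user's home directory on Windows, macOS, and Linux.
corpus_root = Path.home() / 'Desktop' / 'inaugural'

# Check that the folder is where we expect before handing it to NLTK.
print(corpus_root)
print('Folder found:', corpus_root.is_dir())

# The string form can then be passed to the reader, for example:
# inaug = PlaintextCorpusReader(str(corpus_root), r'.*\.txt')
```

If `Folder found:` prints `False`, double-check where the `inaugural` folder was unzipped before going further.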

In [None]:
%pprint  # turn off pretty printing, which prints too many lines
# .txt file names as file IDs
inaug.fileids()

In [None]:
# NLTK automatically tokenizes the corpus. First 50 words: 
inaug.words()[:50]

In [None]:
# You can also specify an individual file ID. First 50 words from Obama 2009:
inaug.words('2009-Obama.txt')[:50]

In [None]:
# NLTK automatically segments sentences too, which are accessed through .sents()
print(inaug.sents('2009-Obama.txt')[0])   # first sentence
print(inaug.sents('2009-Obama.txt')[1])   # 2nd sentence

In [None]:
# How long are these speeches in terms of word and sentence count?
print('Washington:', len(inaug.words('1789-Washington.txt')), len(inaug.sents('1789-Washington.txt')))
print('Obama:', len(inaug.words('2009-Obama.txt')), len(inaug.sents('2009-Obama.txt')))

In [None]:
# for-loop through file IDs and print out various stats. 
# While looping, populate fid_avsent which holds avg sent lengths.

fid_avsent = []    # initialize an empty list

for f in inaug.fileids():
    wcount = len(inaug.words(f))
    scount = len(inaug.sents(f))
    print(wcount, scount, wcount/scount, f, sep='\t')  # separate by tab for readability
    fid_avsent.append( (wcount/scount, f) )      # append a pair (x, y) to list

### Troubleshooting 

- Unfortunately, the 2005 Bush file produces a **Unicode encoding error**. 
- Let's make a new text file using the text from [http://www.presidency.ucsb.edu/inaugurals.php](http://www.presidency.ucsb.edu/inaugurals.php).
- The text files are locked, so we will need to save and halt this notebook first. 

**Mac**:
1. Launch TextEdit. It is Mac's default text editor.  
1. Visit the web page and copy the text: highlight and `Command+C`. 
1. Come back to the TextEdit window and paste (`Command+V`). 
1. **Convert to plain text**: `Shift+Command+T`
1. Save. Choose the "inaugural" directory and give the appropriate file name. Make sure to choose "**Unicode (UTF-8)**" as the Encoding. Overwrite the existing file. 

**Windows**: 
1. First, delete the offending file. 
1. Then, right-click empty space in the folder, create a new text file with the same name. 
1. Double-click the new file to open it in your default text editor (Notepad).
1. Visit the web page and copy the text: highlight and `Control+C`. 
1. Come back to Notepad, paste in (`Control+V`). 
1. Save: make sure to choose **UTF-8** encoding and **not ANSI**.  
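
Before or after fixing the file by hand, it can help to check which files in the folder fail to decode as UTF-8. The sketch below uses only the standard library. The function names `find_undecodable` and `resave_as_utf8` are made up for illustration, and the `cp1252` default is an assumption: the original encoding of the problem file is not stated here, so check the result before overwriting anything you cannot re-download.

```python
import tempfile
from pathlib import Path

def find_undecodable(root, pattern='*.txt', encoding='utf-8'):
    """Return sorted names of files under root that cannot be decoded with encoding."""
    bad = []
    for path in sorted(Path(root).glob(pattern)):
        try:
            path.read_bytes().decode(encoding)
        except UnicodeDecodeError:
            bad.append(path.name)
    return bad

def resave_as_utf8(path, source_encoding='cp1252'):
    """Rewrite a file as UTF-8, assuming it was saved in source_encoding."""
    text = Path(path).read_bytes().decode(source_encoding)
    Path(path).write_bytes(text.encode('utf-8'))

# Small demonstration on throwaway files rather than the real corpus.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, 'good.txt').write_bytes('fine'.encode('utf-8'))
    Path(demo_dir, 'bad.txt').write_bytes(b'\x93freedom\x94')  # curly quotes in Windows-1252
    print(find_undecodable(demo_dir))   # bad.txt is reported
    resave_as_utf8(Path(demo_dir, 'bad.txt'))
    print(find_undecodable(demo_dir))   # nothing left to report
```

On the real corpus, `find_undecodable` would be called on the `inaugural` folder path. `PlaintextCorpusReader` also accepts an `encoding` argument, which is another route if all files share one non-UTF-8 encoding.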

In [None]:
# Turn pretty print back on 
%pprint
# sorted() compares tuples by their first element, so this lists the files
# from shortest to longest average sentence length
sorted(fid_avsent)

In [None]:
# Same thing, with list comprehension! 
fid_avsent2 = [(len(inaug.words(f))/len(inaug.sents(f)), f) for f in inaug.fileids()]
sorted(fid_avsent2)
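
Because each item is a `(number, file ID)` pair, `sorted()` orders by the number first and falls back to the file ID only on ties. The sketch below shows this along with two common variations. The pairs are invented for illustration and are not real corpus figures.

```python
# Hypothetical (average sentence length, file ID) pairs, not real corpus figures.
pairs = [(30.5, '1900-A.txt'), (18.2, '2000-C.txt'),
         (30.5, '1850-B.txt'), (44.0, '1800-D.txt')]

ascending = sorted(pairs)                          # by average, then by file ID on ties
descending = sorted(pairs, reverse=True)           # longest average sentences first
by_file = sorted(pairs, key=lambda pair: pair[1])  # by file ID, i.e. by year

# Print the longest-sentence files first, with the average rounded to 2 decimals.
for avg, fid in descending:
    print(f'{avg:6.2f}\t{fid}')
```

Sorting by file ID works as a chronological sort here only because the file names begin with a four-digit year.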

In [None]:
# Corpus size in number of words
len(inaug.words())

In [None]:
# Building word frequency distribution for the entire corpus
inaug_fd = nltk.FreqDist(inaug.words())
inaug_fd.most_common(30)
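
`nltk.FreqDist` is built on Python's `collections.Counter`, so the same counting logic can be tried on a small token list without loading the corpus. The sketch below suggests why the top of a raw frequency list tends to be punctuation and function words, and shows one possible way to restrict the count to lowercased alphabetic tokens. The token list is invented for illustration.

```python
from collections import Counter

# A made-up token list standing in for inaug.words().
tokens = ['We', 'the', 'people', ',', 'we', 'the', 'citizens', ',', 'the', 'nation', '.']

# Raw counts: punctuation and differently-cased forms are all separate entries.
raw_fd = Counter(tokens)
print(raw_fd.most_common(3))

# Keep alphabetic tokens only and fold case so 'We' and 'we' are counted together.
clean_fd = Counter(t.lower() for t in tokens if t.isalpha())
print(clean_fd.most_common(2))
```

The same filtering expression should work when passed to `nltk.FreqDist` in place of `Counter`.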

### Your turn

- Explore the corpus! 
- Are the following words getting more or less frequent: 'we', 'the'?
- Are _words_ getting longer or shorter? Hint: use `sum([1, 2, 3, 4])`
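
As a possible starting point, here are two small helper functions tried out on a toy token list. Both names (`relative_freq`, `avg_word_length`) are made up for illustration, and the choices to ignore case and to skip punctuation are assumptions you may want to change.

```python
def relative_freq(word, tokens):
    """Share of tokens equal to word, ignoring case."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t.lower() == word) / len(tokens)

def avg_word_length(tokens):
    """Mean length of alphabetic tokens; punctuation is skipped."""
    lengths = [len(t) for t in tokens if t.isalpha()]
    return sum(lengths) / len(lengths) if lengths else 0.0

# A toy token list standing in for one speech.
toy = ['We', 'the', 'people', ',', 'we', 'hope', '.']
print(relative_freq('we', toy))
print(avg_word_length(toy))
```

Relative frequency is used instead of a raw count because the speeches differ in length. To compare across time, you could loop through `inaug.fileids()` and call these on `inaug.words(f)` for each file.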

## More tomorrow

- NLTK's other corpus tools: Text, concordancer, ngrams

We will cover these on [Day 4 (Thursday)](day4.ipynb).