# Day 3: Corpus processing with NLTK

Na-Rae Han (`naraehan@pitt.edu`) and David J. Birnbaum (`djbpitt@pitt.edu`) 

June 25-29, [NASSLLI 2018 at CMU](https://www.cmu.edu/nasslli2018/) 

This tutorial is available at https://github.com/naraehan/NASSLLI2018-Corpus-Linguistics. 
- Jump to: [Day 1](day1.ipynb), [Day 2](day2.ipynb), [Day 3](day3.ipynb), [Day 4](day4.ipynb), [Day 5](day5.ipynb)

## Preparation

#### Data

- Download and unzip the “C-Span Inaugural Address Corpus”, available on NLTK’s corpora page: http://www.nltk.org/nltk_data/
- Place the unzipped `inaugural` folder **on your desktop** 

## Processing a corpus

- NLTK can read in an entire corpus from a directory (the “root” directory).
- The resulting corpus object provides the text as word tokens through `.words()` and as sentences (each a list of word tokens) through `.sents()`. 

In [1]:
import nltk 
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Users/narae/Desktop/inaugural'  # Use your own userid; Mac users should omit C:
inaug = PlaintextCorpusReader(corpus_root, '.*txt')  # all files ending in 'txt' 
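
The hard-coded path in the cell above has to be edited for each user and operating system. One possible alternative, sketched here with the standard `pathlib` module, builds the same Desktop path from the current user's home directory. It assumes the `inaugural` folder was placed on the desktop as described under Preparation; some systems (for example Windows with OneDrive) may keep the Desktop folder elsewhere.

```python
from pathlib import Path

# Assumption: the unzipped 'inaugural' folder sits directly on the Desktop
# of the current user, as in the Preparation step.
corpus_root = str(Path.home() / 'Desktop' / 'inaugural')

print(corpus_root)   # the directory string to hand to PlaintextCorpusReader
```

The resulting string can be passed to `PlaintextCorpusReader` in place of the literal path.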

In [2]:
%pprint  # turn off pretty printing, which prints too many lines
# .txt file names as file IDs
inaug.fileids()

Pretty printing has been turned OFF


['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reaga

In [3]:
# NLTK automatically tokenizes the corpus. First 50 words: 
inaug.words()[:50]

['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':', 'Among', 'the', 'vicissitudes', 'incident', 'to', 'life', 'no', 'event', 'could', 'have', 'filled', 'me', 'with', 'greater', 'anxieties', 'than', 'that', 'of', 'which', 'the', 'notification', 'was', 'transmitted', 'by', 'your', 'order', ',', 'and', 'received', 'on', 'the', '14th', 'day', 'of', 'the', 'present', 'month']
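
The hyphen in "Fellow-Citizens" comes out as its own token because, by default, `PlaintextCorpusReader` tokenizes words with NLTK's `WordPunctTokenizer`, which splits text into runs of alphanumeric characters and runs of punctuation. The sketch below imitates that behavior with the standard `re` module. The pattern is assumed to be the one the tokenizer is built on, and `word_punct_tokenize` is a hypothetical helper name, not an NLTK function.

```python
import re

# Pattern assumed to match the one behind NLTK's WordPunctTokenizer:
# a run of word characters, or a run of non-word, non-space characters.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Split text into alphanumeric tokens and punctuation tokens."""
    return WORD_PUNCT.findall(text)

sample = "Fellow-Citizens of the Senate and of the House of Representatives:"
tokens = word_punct_tokenize(sample)
print(tokens)
```

Because punctuation marks become tokens, they are included in every word count that follows.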

In [4]:
# You can also specify an individual file ID. First 50 words from Obama 2009:
inaug.words('2009-Obama.txt')[:50]

['My', 'fellow', 'citizens', ':', 'I', 'stand', 'here', 'today', 'humbled', 'by', 'the', 'task', 'before', 'us', ',', 'grateful', 'for', 'the', 'trust', 'you', 'have', 'bestowed', ',', 'mindful', 'of', 'the', 'sacrifices', 'borne', 'by', 'our', 'ancestors', '.', 'I', 'thank', 'President', 'Bush', 'for', 'his', 'service', 'to', 'our', 'nation', ',', 'as', 'well', 'as', 'the', 'generosity', 'and', 'cooperation']

In [5]:
# NLTK automatically segments sentences too, which are accessed through .sents()
print(inaug.sents('2009-Obama.txt')[0])   # first sentence
print(inaug.sents('2009-Obama.txt')[1])   # 2nd sentence

['My', 'fellow', 'citizens', ':']
['I', 'stand', 'here', 'today', 'humbled', 'by', 'the', 'task', 'before', 'us', ',', 'grateful', 'for', 'the', 'trust', 'you', 'have', 'bestowed', ',', 'mindful', 'of', 'the', 'sacrifices', 'borne', 'by', 'our', 'ancestors', '.']


In [6]:
# How long are these speeches in terms of word and sentence count?
print('Washington:', len(inaug.words('1789-Washington.txt')), len(inaug.sents('1789-Washington.txt')))
print('Obama:', len(inaug.words('2009-Obama.txt')), len(inaug.sents('2009-Obama.txt')))

Washington: 1538 24
Obama: 2726 112


In [7]:
# for-loop through file IDs and print out various stats. 
# While looping, populate fid_avsent which holds avg sent lengths.

fid_avsent = []    # initialize an empty list

for f in inaug.fileids():
    wcount = len(inaug.words(f))
    scount = len(inaug.sents(f))
    print(wcount, scount, wcount/scount, f, sep='\t')  # separate by tab for readability
    fid_avsent.append( (wcount/scount, f) )      # append a pair (x, y) to list

1538	24	64.08333333333333	1789-Washington.txt
147	4	36.75	1793-Washington.txt
2585	37	69.86486486486487	1797-Adams.txt
1935	42	46.07142857142857	1801-Jefferson.txt
2384	45	52.977777777777774	1805-Jefferson.txt
1265	21	60.23809523809524	1809-Madison.txt
1304	33	39.515151515151516	1813-Madison.txt
3693	122	30.270491803278688	1817-Monroe.txt
4909	129	38.054263565891475	1821-Monroe.txt
3150	74	42.567567567567565	1825-Adams.txt
1208	25	48.32	1829-Jackson.txt
1267	30	42.233333333333334	1833-Jackson.txt
4171	95	43.90526315789474	1837-VanBuren.txt
9165	210	43.642857142857146	1841-Harrison.txt
5196	153	33.96078431372549	1845-Polk.txt
1182	22	53.72727272727273	1849-Taylor.txt
3657	104	35.16346153846154	1853-Pierce.txt
3098	89	34.80898876404494	1857-Buchanan.txt
4005	138	29.02173913043478	1861-Lincoln.txt
785	27	29.074074074074073	1865-Lincoln.txt
1239	41	30.21951219512195	1869-Grant.txt
1478	44	33.59090909090909	1873-Grant.txt
2724	59	46.16949152542373	1877-Hayes.txt
3239	112	28.919642857142858	

### Troubleshooting 

- Unfortunately, the 2005 Bush file produces a **Unicode encoding error**. 
- Let's make a new text file using the text from [http://www.presidency.ucsb.edu/inaugurals.php](http://www.presidency.ucsb.edu/inaugurals.php).
- The text files are locked, so we will need to save and halt this notebook first. 

**Mac**:
1. Launch TextEdit. It is Mac's default text editor.  
1. Visit the web page and copy the text: highlight and `Command+C`. 
1. Come back to the TextEdit window, paste in (`Command+V`). 
1. **Convert to plain text**: `Shift+Command+T`
1. Save. Choose the "inaugural" directory and give the appropriate file name. Make sure to choose "**Unicode (UTF-8)**" as the Encoding. Overwrite the existing file. 

**Windows**: 
1. First, delete the offending file. 
1. Then, right-click empty space in the folder, create a new text file with the same name. 
1. Double-clicking it will open it in your default text editor (Notepad).
1. Visit the web page and copy the text: highlight and `Control+C`. 
1. Come back to Notepad, paste in (`Control+V`). 
1. Save: make sure to choose **UTF-8** encoding and **not ANSI**.  
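
If the file's original encoding is known, re-saving it as UTF-8 can also be done from Python instead of a text editor. The sketch below is one possible approach, not part of the original tutorial. The helper `reencode_to_utf8` is hypothetical, and its default `source_encoding='cp1252'` is only a guess about the offending file. The demonstration runs on a throwaway file rather than on the corpus.

```python
import os
import tempfile

def reencode_to_utf8(path, source_encoding='cp1252'):
    """Rewrite a text file in place as UTF-8.

    source_encoding is a guess (an assumption, not a known fact about
    the corpus file); try another value if the text comes out garbled.
    """
    with open(path, encoding=source_encoding) as f:
        text = f.read()
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

# Demonstration on a temporary file, not on the real corpus
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'wb') as f:
    # byte 0x92 is a curly apostrophe in cp1252 and is not valid UTF-8 here
    f.write(b'America\x92s ideal of freedom')

reencode_to_utf8(demo_path)

with open(demo_path, encoding='utf-8') as f:
    fixed = f.read()
print(fixed.encode('utf-8'))   # show the bytes now stored in the file
```

Keeping a backup copy of the original file before rewriting it in place is a sensible precaution.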

In [8]:
# Turn pretty print back on 
%pprint
# sorted() returns a new list in ascending order; the (x, y) pairs are
# compared by average sentence length first, then by file ID on a tie
sorted(fid_avsent)

Pretty printing has been turned ON


[(18.24468085106383, '1965-Johnson.txt'),
 (18.71034482758621, '1989-Bush.txt'),
 (18.814432989690722, '2001-Bush.txt'),
 (20.83695652173913, '1957-Eisenhower.txt'),
 (21.03125, '1937-Roosevelt.txt'),
 (21.79310344827586, '1949-Truman.txt'),
 (21.982142857142858, '1997-Clinton.txt'),
 (22.055118110236222, '1981-Reagan.txt'),
 (22.548223350253807, '1925-Coolidge.txt'),
 (22.5609756097561, '1953-Eisenhower.txt'),
 (22.58823529411765, '1941-Roosevelt.txt'),
 (22.87735849056604, '1969-Nixon.txt'),
 (22.901234567901234, '1993-Clinton.txt'),
 (23.38095238095238, '1985-Reagan.txt'),
 (23.575757575757574, '2005-Bush.txt'),
 (24.270588235294117, '1933-Roosevelt.txt'),
 (24.339285714285715, '2009-Obama.txt'),
 (24.5, '1901-McKinley.txt'),
 (24.5, '1945-Roosevelt.txt'),
 (24.620253164556964, '1929-Hoover.txt'),
 (25.20805369127517, '1921-Harding.txt'),
 (26.037735849056602, '1977-Carter.txt'),
 (27.6, '1917-Wilson.txt'),
 (28.014705882352942, '1913-Wilson.txt'),
 (28.919642857142858, '1881-Garfie
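
The order of this output follows from how Python compares tuples: element by element, so the pairs are ranked by the number first and by the file ID only when two numbers are equal (as with the two 24.5 entries). A small sketch on three pairs copied from the output above, also showing `reverse=True` for longest-first order:

```python
# A few (average sentence length, file ID) pairs taken from the output above
pairs = [
    (24.5, '1945-Roosevelt.txt'),
    (18.24468085106383, '1965-Johnson.txt'),
    (24.5, '1901-McKinley.txt'),
]

# Tuples are compared element by element: first by the number,
# and only on a tie by the file ID string.
shortest_first = sorted(pairs)

# reverse=True flips the order, putting the longest average first
longest_first = sorted(pairs, reverse=True)

print(shortest_first)
print(longest_first)
```

This is why the average was placed first in each pair when the list was built: putting the file ID first would sort the list by file name instead.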

In [9]:
# Same thing, with list comprehension! 
fid_avsent2 = [(len(inaug.words(f))/len(inaug.sents(f)), f) for f in inaug.fileids()]
sorted(fid_avsent2)

[(18.24468085106383, '1965-Johnson.txt'),
 (18.71034482758621, '1989-Bush.txt'),
 (18.814432989690722, '2001-Bush.txt'),
 (20.83695652173913, '1957-Eisenhower.txt'),
 (21.03125, '1937-Roosevelt.txt'),
 (21.79310344827586, '1949-Truman.txt'),
 (21.982142857142858, '1997-Clinton.txt'),
 (22.055118110236222, '1981-Reagan.txt'),
 (22.548223350253807, '1925-Coolidge.txt'),
 (22.5609756097561, '1953-Eisenhower.txt'),
 (22.58823529411765, '1941-Roosevelt.txt'),
 (22.87735849056604, '1969-Nixon.txt'),
 (22.901234567901234, '1993-Clinton.txt'),
 (23.38095238095238, '1985-Reagan.txt'),
 (23.575757575757574, '2005-Bush.txt'),
 (24.270588235294117, '1933-Roosevelt.txt'),
 (24.339285714285715, '2009-Obama.txt'),
 (24.5, '1901-McKinley.txt'),
 (24.5, '1945-Roosevelt.txt'),
 (24.620253164556964, '1929-Hoover.txt'),
 (25.20805369127517, '1921-Harding.txt'),
 (26.037735849056602, '1977-Carter.txt'),
 (27.6, '1917-Wilson.txt'),
 (28.014705882352942, '1913-Wilson.txt'),
 (28.919642857142858, '1881-Garfie

In [10]:
# Corpus size in number of words
len(inaug.words())

145693

In [11]:
# Building word frequency distribution for the entire corpus
inaug_fd = nltk.FreqDist(inaug.words())
inaug_fd.most_common(30)

[('the', 9282),
 ('of', 6970),
 (',', 6822),
 ('and', 4995),
 ('.', 4677),
 ('to', 4311),
 ('in', 2528),
 ('a', 2135),
 ('our', 1905),
 ('that', 1688),
 ('be', 1460),
 ('is', 1403),
 ('we', 1141),
 ('for', 1075),
 ('by', 1036),
 ('it', 1010),
 ('which', 1002),
 ('have', 994),
 ('not', 918),
 ('as', 888),
 ('with', 886),
 ('will', 846),
 ('I', 830),
 ('are', 774),
 ('all', 758),
 ('their', 719),
 ('this', 700),
 ('The', 619),
 ('has', 611),
 ('people', 559)]
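
Note that this list counts `'the'` and `'The'` as separate entries and includes punctuation tokens. One possible refinement is to lowercase the tokens and keep only alphabetic ones before counting. The sketch below uses `collections.Counter` on a short hand-typed token list so that it runs without the corpus; since `nltk.FreqDist` is a `Counter` subclass, the same list comprehension should carry over to `inaug.words()`.

```python
from collections import Counter

# Hand-typed sample: the opening tokens of the 1789 address shown earlier
tokens = ['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of',
          'the', 'House', 'of', 'Representatives', ':', 'Among', 'the',
          'vicissitudes']

# Lowercase every token and drop anything that is not purely alphabetic
clean = [t.lower() for t in tokens if t.isalpha()]

fd = Counter(clean)
print(fd.most_common(2))
```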

### Your turn

- Explore the corpus! 
- Are the following words getting more or less frequent over time: 'we', 'the'?
- Are _words_ getting longer or shorter? Hint: use `sum([1, 2, 3, 4])`
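
As a starting point for the word-length question, here is a minimal sketch of how `sum()` can be used to compute an average. The function name `average_word_length` is made up for illustration, and skipping punctuation tokens with `isalpha()` is a design choice rather than a requirement.

```python
def average_word_length(tokens):
    """Mean length in characters of the alphabetic tokens in a list.

    Raises ZeroDivisionError if the list has no alphabetic tokens.
    """
    lengths = [len(t) for t in tokens if t.isalpha()]
    return sum(lengths) / len(lengths)

sample = ['My', 'fellow', 'citizens', ':']
print(average_word_length(sample))
```

Applying the function to `inaug.words(f)` for each file ID, as in the sentence-length loop earlier, would give one value per speech to compare across years.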

## More tomorrow

- NLTK's other corpus tools: Text, concordancer, ngrams

We will cover these on [Day 4 (Thursday)](day4.ipynb).