# Day 4: More NLTK and Corpus Tools

Na-Rae Han (`naraehan@pitt.edu`) and David J. Birnbaum (`djbpitt@pitt.edu`) 

June 25-29, [NASSLLI 2018 at CMU](https://www.cmu.edu/nasslli2018/) 

This tutorial is available at https://github.com/naraehan/NASSLLI2018-Corpus-Linguistics. 
- Jump to: [Day 1](day1.ipynb), [Day 2](day2.ipynb), [Day 3](day3.ipynb), [Day 4](day4.ipynb), [Day 5](day5.ipynb)

## Preparation

- Import NLTK
- Load up the Inaugural corpus. 

In [1]:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Users/narae/Desktop/inaugural'  # Use your own userid; Mac users should omit C:
inaug = PlaintextCorpusReader(corpus_root, '.*txt')  # all files ending in 'txt' 

In [2]:
%pprint
inaug.fileids()

Pretty printing has been turned OFF


['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reaga

In [3]:
print(inaug.words()[:50])

['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':', 'Among', 'the', 'vicissitudes', 'incident', 'to', 'life', 'no', 'event', 'could', 'have', 'filled', 'me', 'with', 'greater', 'anxieties', 'than', 'that', 'of', 'which', 'the', 'notification', 'was', 'transmitted', 'by', 'your', 'order', ',', 'and', 'received', 'on', 'the', '14th', 'day', 'of', 'the', 'present', 'month']


## n-grams

In [4]:
chom = 'colorless green ideas sleep furiously'.split()
chom

['colorless', 'green', 'ideas', 'sleep', 'furiously']

In [5]:
nltk.bigrams(chom)
# function returns a "generator" object: it is memory-efficient but won't let us take a peek

<generator object bigrams at 0x0000025345E0EF10>

In [6]:
# a generator object works well in a loop
for x in nltk.bigrams(chom):
    print(x)

('colorless', 'green')
('green', 'ideas')
('ideas', 'sleep')
('sleep', 'furiously')


In [7]:
# Force it into a list type
list(nltk.bigrams(chom))

[('colorless', 'green'), ('green', 'ideas'), ('ideas', 'sleep'), ('sleep', 'furiously')]
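
One practical consequence of the generator behavior noted above is that a generator can be walked through only once. The sketch below uses a hypothetical pure-Python stand-in, `bigram_gen`, rather than `nltk.bigrams` itself, so it runs without NLTK; it is meant only to illustrate the exhaustion behavior.

```python
def bigram_gen(tokens):
    """Yield adjacent token pairs lazily (a stand-in for nltk.bigrams, not NLTK's code)."""
    for pair in zip(tokens, tokens[1:]):
        yield pair

chom = 'colorless green ideas sleep furiously'.split()
gen = bigram_gen(chom)

first_pass = list(gen)    # consumes every pair from the generator
second_pass = list(gen)   # the generator is now exhausted

print(first_pass)
print(second_pass)
```

This is why storing the result with `list(...)` is convenient when the bigrams are needed more than once.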

In [8]:
# trigram function also available
list(nltk.trigrams(chom))

[('colorless', 'green', 'ideas'), ('green', 'ideas', 'sleep'), ('ideas', 'sleep', 'furiously')]
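
Bigrams and trigrams are two cases of the same sliding-window idea. The following is a minimal pure-Python sketch of that idea for any window size `n`; `ngrams_sketch` is a hypothetical helper written for illustration, not NLTK's implementation. NLTK also provides `nltk.ngrams(tokens, n)` for this purpose.

```python
def ngrams_sketch(tokens, n):
    """Return every window of n consecutive tokens as a list of tuples."""
    # Zip together n copies of the token list, each shifted by one more position.
    return list(zip(*(tokens[i:] for i in range(n))))

chom = 'colorless green ideas sleep furiously'.split()

print(ngrams_sketch(chom, 2))   # same pairs as the bigram output above
print(ngrams_sketch(chom, 4))   # 4-grams
```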

In [9]:
# let's build a bigram list of the entire inaugural corpus
inaug_bigrams = list(nltk.bigrams(inaug.words()))
inaug_bigrams[:10]  

[('Fellow', '-'), ('-', 'Citizens'), ('Citizens', 'of'), ('of', 'the'), ('the', 'Senate'), ('Senate', 'and'), ('and', 'of'), ('of', 'the'), ('the', 'House'), ('House', 'of')]

In [10]:
# last 10 bigrams
inaug_bigrams[-10:]

[('you', '.'), ('.', 'And'), ('And', 'God'), ('God', 'bless'), ('bless', 'the'), ('the', 'United'), ('United', 'States'), ('States', 'of'), ('of', 'America'), ('America', '.')]

In [11]:
# What are the most frequent bigrams? 
inaug_bigrams_fd = nltk.FreqDist(inaug_bigrams)
inaug_bigrams_fd.most_common(30)

[(('of', 'the'), 1754), ((',', 'and'), 1278), (('in', 'the'), 746), (('to', 'the'), 689), (('of', 'our'), 596), (('.', 'The'), 581), (('.', 'We'), 468), (('and', 'the'), 452), ((',', 'the'), 365), (('.', 'It'), 343), ((',', 'but'), 314), (('to', 'be'), 305), (('by', 'the'), 305), (('for', 'the'), 290), ((',', 'we'), 263), (('of', 'a'), 252), (('the', 'people'), 243), (('.', 'I'), 226), (('with', 'the'), 220), (('the', 'world'), 215), ((',', 'to'), 212), (('that', 'the'), 210), (("'", 's'), 202), (('.', 'In'), 202), (('have', 'been'), 201), ((',', 'in'), 200), ((',', 'I'), 188), (('will', 'be'), 182), (('has', 'been'), 178), (('is', 'the'), 176)]

In [12]:
inaug_bigrams_fd[('of', 'the')]

1754

In [13]:
# What functions are available with this object? 
dir(inaug_bigrams_fd)

['B', 'N', 'Nr', '_N', '__add__', '__and__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__iand__', '__init__', '__ior__', '__isub__', '__iter__', '__le__', '__len__', '__lt__', '__missing__', '__module__', '__ne__', '__neg__', '__new__', '__or__', '__pos__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__unicode__', '__weakref__', '_cumulative_frequencies', '_keep_positive', 'clear', 'copy', 'elements', 'freq', 'fromkeys', 'get', 'hapaxes', 'items', 'keys', 'max', 'most_common', 'pformat', 'plot', 'pop', 'popitem', 'pprint', 'r_Nr', 'setdefault', 'subtract', 'tabulate', 'unicode_repr', 'update', 'values']

In [14]:
# over 1% of all bigrams are 'of the'! 
inaug_bigrams_fd.freq(('of', 'the'))

0.012039096175493506
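
The value returned by `.freq()` is a relative frequency: the count of one item divided by the total count of all items. The sketch below reproduces that arithmetic with `collections.Counter` (the class that `FreqDist` extends) on a made-up token list, so the numbers are illustrative and not drawn from the Inaugural corpus.

```python
from collections import Counter

# Toy token list, invented for this example
tokens = 'of the people by the people for the people'.split()
bigram_counts = Counter(zip(tokens, tokens[1:]))

total = sum(bigram_counts.values())          # plays the role of FreqDist.N()
count = bigram_counts[('the', 'people')]     # raw count of one bigram
rel_freq = count / total                     # plays the role of FreqDist.freq(...)

print(count, total, rel_freq)
```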

## Conditional frequency distribution: by preceding word
- What are the most common words following 'shall'? 
  - 'shall' becomes the condition for the next word: conditional frequency distribution. 
  - Stats can be compiled from a list of bigrams (w1, w2). 

In [15]:
# cfd is built from bigrams: a list of (w1, w2) 
inaug_bigrams_cfd = nltk.ConditionalFreqDist(inaug_bigrams)

In [16]:
# 'shall' as the w1 condition. Value is a FreqDist! 
inaug_bigrams_cfd['shall']

FreqDist({'be': 64, 'not': 19, 'have': 16, 'endeavor': 8, 'strive': 5, 'make': 5, 'do': 5, 'never': 5, 'continue': 4, 'we': 4, ...})

In [17]:
inaug_bigrams_cfd['shall']['not']

19

In [18]:
# total count of bigrams whose first word is 'shall'
inaug_bigrams_cfd['shall'].N()

310

In [19]:
# relative frequency of 'not' following 'shall' (19 out of 310)
inaug_bigrams_cfd['shall'].freq('not')

0.06129032258064516

In [20]:
inaug_bigrams_cfd['shall'].most_common(10)

[('be', 64), ('not', 19), ('have', 16), ('endeavor', 8), ('strive', 5), ('make', 5), ('do', 5), ('never', 5), ('continue', 4), ('we', 4)]
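
Conceptually, a conditional frequency distribution is a dictionary that maps each condition (here, the first word) to its own frequency distribution over the following words. The sketch below builds that structure from the standard library on an invented sentence; it is an approximation of the idea, not NLTK's `ConditionalFreqDist` class.

```python
from collections import Counter, defaultdict

# Toy token list, invented for this example
tokens = 'we shall be free and we shall not fail and we shall be strong'.split()

cfd_sketch = defaultdict(Counter)
for w1, w2 in zip(tokens, tokens[1:]):
    cfd_sketch[w1][w2] += 1          # w1 is the condition, w2 is the counted sample

shall = cfd_sketch['shall']          # a Counter of the words that follow 'shall'
n_shall = sum(shall.values())        # number of bigrams starting with 'shall'

print(shall.most_common())
print(shall['not'] / n_shall)        # relative frequency of 'not' after 'shall'
```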

## Conditional frequency distribution: count per year
- Do words such as 'freedom', 'liberty', 'god' become more or less frequent over time? 
- We will work through the Inaugural corpus section of the NLTK book: http://www.nltk.org/book/ch02.html#inaugural-address-corpus
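
The NLTK book section linked above builds its distribution from (target word, year) pairs, taking the year from the first four characters of each file ID. The sketch below follows the same logic with the standard library only. The two short "speeches" are invented stand-ins for `inaug.fileids()` and `inaug.words(fileid)`, and `counts_by_target` is a hypothetical name; in the notebook the same pairs would go into `nltk.ConditionalFreqDist`.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: file ID -> token list (not real corpus text)
toy_speeches = {
    '1789-Washington.txt': 'the citizens of America and every citizen'.split(),
    '1861-Lincoln.txt': 'American citizens and the American people'.split(),
}
targets = ['america', 'citizen']

counts_by_target = defaultdict(Counter)
for fileid, words in toy_speeches.items():
    year = fileid[:4]                            # '1789-Washington.txt' -> '1789'
    for w in words:
        for target in targets:
            if w.lower().startswith(target):     # also matches 'American', 'citizens'
                counts_by_target[target][year] += 1

for target in targets:
    print(target, sorted(counts_by_target[target].items()))
```

Matching on a lowercase prefix groups inflected and derived forms together, which is a crude but convenient substitute for lemmatization.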

**Plotting/visualization**
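- If plotting fails, matplotlib may not be installed. Install it via `!pip install matplotlib`. 
- If plot graphs are too small, you can enlarge the figure before plotting:
```
import matplotlib.pyplot as plt 
plt.figure(figsize=(20,10))   # width and height in inches
cfd.plot()                    # cfd: the ConditionalFreqDist you built from the corpus
```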

## nltk.Text object and other corpus tools
- NLTK's Text object class provides a concordancer and other classic corpus tools
- A Text object can be built from a token list

In [21]:
inaug_Text = nltk.Text(inaug.words())
inaug_Text.concordance("shall")
# try "women", "men"

Displaying 25 of 314 matches:
ur consideration such measures as he shall judge necessary and expedient ." The
ived from official opportunities , I shall again give way to my entire confiden
ccasion which brings us together , I shall take my present leave ; but not with
te . When the occasion proper for it shall arrive , I shall endeavor to express
asion proper for it shall arrive , I shall endeavor to express the high sense I
 , and in your presence : That if it shall be found during my administration of
determination to support it until it shall be altered by the judgments and wish
es and the public opinion , until it shall be otherwise ordained by Congress ; 
gree to comply with your wishes , it shall be my strenuous endeavor that this s
gacious injunction of the two Houses shall not be without effect . With this gr
ities provided by our Constitution I shall find resources of wisdom , of virtue
a wise and frugal Government , which shall restrain men from injuring one anoth
rain men f

In [22]:
help(inaug_Text.concordance)

Help on method concordance in module nltk.text:

concordance(word, width=79, lines=25) method of nltk.text.Text instance
    Print a concordance for ``word`` with the specified context window.
    Word matching is not case-sensitive.
    :seealso: ``ConcordanceIndex``



In [23]:
# What other handy functions are available? 
dir(inaug_Text)

['_CONTEXT_RE', '_COPY_TOKENS', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_concordance_index', '_context', 'collocations', 'common_contexts', 'concordance', 'count', 'dispersion_plot', 'findall', 'generate', 'index', 'name', 'plot', 'readability', 'similar', 'tokens', 'unicode_repr', 'vocab']

In [24]:
# Collocations found in this corpus. Try window size of 3. 
inaug_Text.collocations()

United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties


In [25]:
# More info on the method. Doesn't say what stats are used...
help(inaug_Text.collocations)

Help on method collocations in module nltk.text:

collocations(num=20, window_size=2) method of nltk.text.Text instance
    Print collocations derived from the text, ignoring stopwords.
    
    :seealso: find_collocations
    :param num: The maximum number of collocations to print.
    :type num: int
    :param window_size: The number of tokens spanned by a collocation (default=2)
    :type window_size: int



In [26]:
# common context (surrounding words) shared by a list of words
inaug_Text.common_contexts(['shall', 'will'])

we_always ,_we i_not i_to which_secure ?_we we_have as_prevent
we_stand i_ask this_be ,_make they_not government_be we_be there_be
we_not we_give future_be ,_always
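
Each item in the output above is a context written as `previous_next`, where both words share that surrounding pair. The sketch below shows the underlying idea with sets: collect the (previous word, next word) pairs for each word, then intersect them. `contexts` is a hypothetical helper on invented data; it lowercases everything and skips the first and last tokens, which may differ from how NLTK handles text boundaries.

```python
from collections import defaultdict

def contexts(tokens):
    """Map each lowercased word to the set of (previous, next) words around it."""
    ctx = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word.lower()].add((prev.lower(), nxt.lower()))
    return ctx

# Toy token list, invented for this example
tokens = 'we shall be free . we will be strong . I shall not fail . I will not rest'.split()
ctx = contexts(tokens)

shared = sorted(ctx['shall'] & ctx['will'])    # contexts common to both words
print(' '.join(f'{p}_{n}' for p, n in shared))
```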


## More tomorrow

- Advanced processing: lemmatization, POS tagging
- Bring your own corpus: We will try these tools on 1-2 corpora from your suggestions

Last meeting on [Day 5 (Friday)](day5.ipynb)