# Day 4: More NLTK and Corpus Tools

Na-Rae Han (`naraehan@pitt.edu`) and David J. Birnbaum (`djbpitt@pitt.edu`) 

June 25-29, [NASSLLI 2018 at CMU](https://www.cmu.edu/nasslli2018/) 

This tutorial is available at https://github.com/naraehan/NASSLLI2018-Corpus-Linguistics. 
- Jump to: [Day 1](day1.ipynb), [Day 2](day2.ipynb), [Day 3](day3.ipynb), [Day 4](day4.ipynb), [Day 5](day5.ipynb)

## Preparation

- Import NLTK
- Load up the Inaugural corpus. 

In [None]:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Users/narae/Desktop/inaugural'  # Use your own userid; Mac users should omit C:
inaug = PlaintextCorpusReader(corpus_root, '.*txt')  # all files ending in 'txt' 

In [None]:
%pprint
inaug.fileids()

In [None]:
print(inaug.words()[:50])

## n-grams

In [None]:
chom = 'colorless green ideas sleep furiously'.split()
chom

In [None]:
nltk.bigrams(chom)
# The function returns a "generator" object: it is memory-efficient, but we can't peek at its contents directly

In [None]:
# A generator object works well in a loop
for x in nltk.bigrams(chom):
    print(x)

In [None]:
# Force it into a list type
list(nltk.bigrams(chom))

In [None]:
# trigram function also available
list(nltk.trigrams(chom))
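
To make clear what `nltk.bigrams` and `nltk.trigrams` produce, here is a minimal plain-Python sketch that builds the same kind of n-gram tuples by zipping shifted copies of the token list. The helper name `ngrams_sketch` is hypothetical, and this is not NLTK's own implementation; it only illustrates the idea.

```python
def ngrams_sketch(tokens, n):
    """Return a list of n-grams (tuples of n consecutive tokens)."""
    # zip stops at the shortest slice, so we get len(tokens) - n + 1 tuples
    return list(zip(*(tokens[i:] for i in range(n))))

chom = 'colorless green ideas sleep furiously'.split()
print(ngrams_sketch(chom, 2))   # bigrams
print(ngrams_sketch(chom, 3))   # trigrams
```

A list of 5 tokens yields 4 bigrams and 3 trigrams; asking for n-grams longer than the list yields an empty list.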

In [None]:
# let's build a bigram list of the entire inaugural corpus
inaug_bigrams = list(nltk.bigrams(inaug.words()))
inaug_bigrams[:10]  

In [None]:
# last 10 bigrams
inaug_bigrams[-10:]

In [None]:
# What are the most frequent bigrams? 
inaug_bigrams_fd = nltk.FreqDist(inaug_bigrams)
inaug_bigrams_fd.most_common(30)

In [None]:
inaug_bigrams_fd[('of', 'the')]

In [None]:
# What functions are available with this object? 
dir(inaug_bigrams_fd)

In [None]:
# over 1% of all bigrams are 'of the'! 
inaug_bigrams_fd.freq(('of', 'the'))
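
The `.freq()` method returns a relative frequency: the count of one sample divided by the total number of samples counted. The sketch below reproduces that arithmetic with `collections.Counter` on a toy sentence. The toy tokens and the helper `rel_freq` are made up for illustration and are not part of NLTK.

```python
from collections import Counter

toy = 'the cat and the dog and the bird'.split()
toy_bigrams = list(zip(toy, toy[1:]))     # 8 tokens give 7 bigrams
counts = Counter(toy_bigrams)

def rel_freq(counter, sample):
    """Count of `sample` divided by the total of all counts."""
    total = sum(counter.values())
    return counter[sample] / total if total else 0.0

print(counts[('and', 'the')])             # raw count of the bigram
print(rel_freq(counts, ('and', 'the')))   # its share of all bigrams
```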

## Conditional frequency distribution: by preceding word
- What are the most common words following 'shall'? 
  - 'shall' becomes the condition for the next word: conditional frequency distribution. 
  - Stats can be compiled from a list of bigrams (w1, w2). 

In [None]:
# cfd is built from bigrams: a list of (w1, w2) 
inaug_bigrams_cfd = nltk.ConditionalFreqDist(inaug_bigrams)

In [None]:
# 'shall' as the w1 condition. Value is a FreqDist! 
inaug_bigrams_cfd['shall']

In [None]:
inaug_bigrams_cfd['shall']['not']

In [None]:
# total count of 'shall' as w1, i.e. the number of bigrams that start with 'shall'
inaug_bigrams_cfd['shall'].N()

In [None]:
# likelihood of 'not' following 'shall' 
inaug_bigrams_cfd['shall'].freq('not')

In [None]:
inaug_bigrams_cfd['shall'].most_common(10)
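
Conceptually, a conditional frequency distribution over bigrams is a mapping from each first word (the condition) to a frequency table of the words that follow it. Here is a minimal sketch of that structure using `defaultdict` and `Counter` on a made-up sentence. The names `toy` and `cfd_sketch` are hypothetical, and this is an illustration rather than NLTK's implementation.

```python
from collections import Counter, defaultdict

toy = 'we shall not fail and we shall not falter but we shall overcome'.split()

# Map each first word (the condition) to a Counter of the words that follow it
cfd_sketch = defaultdict(Counter)
for w1, w2 in zip(toy, toy[1:]):
    cfd_sketch[w1][w2] += 1

after_shall = cfd_sketch['shall']
total = sum(after_shall.values())     # plays the role of .N()
print(after_shall.most_common())      # followers of 'shall', most frequent first
print(after_shall['not'] / total)     # plays the role of .freq('not')
```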

## Conditional frequency distribution: count per year
- Do words such as 'freedom', 'liberty', 'god' become more or less frequent over time? 
- We will follow the NLTK book's section on the Inaugural Address Corpus: http://www.nltk.org/book/ch02.html#inaugural-address-corpus

**Plotting/visualization**
- If plotting fails, matplotlib may not be installed. Install it via `!pip install matplotlib`. 
- If the plots are too small, set a larger figure size before plotting:
```
import matplotlib.pyplot as plt 
plt.figure(figsize=(20,10))
cfd.plot()
```
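
The NLTK book section linked above builds its distribution from (target word, year) pairs, taking the year from the first four characters of each file ID. Below is a minimal plain-Python sketch of that pairing on made-up data. The `toy_corpus` dictionary, its file IDs, and its tokens are hypothetical stand-ins, not the real corpus.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the corpus: file ID -> token list
toy_corpus = {
    '1789-Alpha.txt': 'Freedom and liberty for the free people'.split(),
    '1793-Beta.txt': 'Liberty , liberty and more liberty'.split(),
}
targets = ['free', 'libert']   # word prefixes to track

# condition = target prefix, sample = year (first 4 characters of the file ID)
counts_by_year = defaultdict(Counter)
for fileid, tokens in toy_corpus.items():
    year = fileid[:4]
    for w in tokens:
        for target in targets:
            if w.lower().startswith(target):
                counts_by_year[target][year] += 1

for target in targets:
    print(target, sorted(counts_by_year[target].items()))
```

With NLTK, the same (target, year) pairs can be passed to `nltk.ConditionalFreqDist`, and its `plot()` method then draws one line per target.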

## nltk.Text object and other corpus tools
- NLTK's Text object class provides a concordancer and other classic corpus tools
- A Text object can be built from a token list

In [None]:
inaug_Text = nltk.Text(inaug.words())
inaug_Text.concordance("shall")

In [None]:
help(inaug_Text.concordance)
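
A concordancer lists every occurrence of a keyword together with its surrounding context (keyword in context, or KWIC). Here is a minimal sketch of the idea, assuming a window measured in tokens for simplicity. The helper `kwic_sketch` and the toy sentence are hypothetical, and NLTK's own output is formatted differently.

```python
def kwic_sketch(tokens, keyword, width=3):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():   # case-insensitive match
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append(f'{left} [{tok}] {right}'.strip())
    return lines

toy = 'We shall not fail . They shall see that we shall prevail'.split()
for line in kwic_sketch(toy, 'shall'):
    print(line)
```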

In [None]:
# What other handy functions are available? 
dir(inaug_Text)

In [None]:
# Collocations found in this corpus
inaug_Text.collocations()

In [None]:
# More info on the method. The docstring doesn't say which statistic is used; NLTK's source ranks bigrams by likelihood ratio.
help(inaug_Text.collocations)

In [None]:
# common context (surrounding words) shared by a list of words
inaug_Text.common_contexts(['shall', 'will'])
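
The idea behind shared contexts can be sketched in plain Python: record the (previous word, next word) pair around every token, then intersect the sets collected for the words of interest. The helper `contexts_sketch` and the toy sentence are hypothetical. For simplicity the sketch skips the first and last tokens, and it does not rank contexts by frequency.

```python
from collections import defaultdict

def contexts_sketch(tokens):
    """Map each lowercased word to the set of (left, right) neighbor pairs."""
    ctx = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word.lower()].add((left.lower(), right.lower()))
    return ctx

toy = 'we shall be free and we will be strong and you shall go'.split()
ctx = contexts_sketch(toy)
shared = ctx['shall'] & ctx['will']   # contexts in which both words occur
print(sorted(shared))
```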

## More tomorrow

- Advanced processing: lemmatization, POS tagging
- Bring your own corpus: We will try these tools on 1-2 corpora from your suggestions

Last meeting on [Day 5 (Friday)](day5.ipynb)