# Day 2: NLTK, processing a text file

Na-Rae Han (`naraehan@pitt.edu`) and David J. Birnbaum (`djbpitt@pitt.edu`) 

June 25-29, [NASSLLI 2018 at CMU](https://www.cmu.edu/nasslli2018/) 

This tutorial is available at https://github.com/naraehan/NASSLLI2018-Corpus-Linguistics. 
- Jump to: [Day 1](day1.ipynb), [Day 2](day2.ipynb), [Day 3](day3.ipynb), [Day 4](day4.ipynb), [Day 5](day5.ipynb)

## Preparation

#### Data

- Download and unzip the “C-Span Inaugural Address Corpus”, available on NLTK’s corpora page: http://www.nltk.org/nltk_data/
- Place the unzipped `inaugural` folder **on your desktop** 

#### Jupyter tips
- Click `+` to create a new cell, ► to run (Also: Ctrl+ENTER)
- `Alt+ENTER` to run cell, create a new cell below
- `Shift+ENTER` to run cell, go to next cell
- More on [this page](https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/)    

## Using NLTK

* NLTK ([Natural Language Toolkit](http://www.nltk.org/)) is an external library; you must import it first. 

In [1]:
import nltk

* Let's first download some data files: 

In [2]:
#nltk.download('popular')

In [3]:
# Tokenizing function: turns a text (a single string) into a list of word & symbol tokens
greet = "Hello, world!"
nltk.word_tokenize(greet)

['Hello', ',', 'world', '!']

In [4]:
help(nltk.word_tokenize)

Help on function word_tokenize in module nltk.tokenize:

word_tokenize(text, language='english', preserve_line=False)
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).
    
    :param text: text to split into words
    :param text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: An option to keep the preserve the sentence and not sentence tokenize it.
    :type preserver_line: bool



In [5]:
sent = "You haven't seen Star Wars...?"
nltk.word_tokenize(sent)

['You', 'have', "n't", 'seen', 'Star', 'Wars', '...', '?']
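
To see what `word_tokenize` adds over simple pattern matching, here is a rough regular-expression splitter for comparison. It is only a naive stand-in written for illustration: `naive_tokenize` is a hypothetical helper, not part of NLTK, and it is not how NLTK tokenizes.

```python
import re

def naive_tokenize(text):
    # Grab runs of word characters, or any single non-space symbol.
    return re.findall(r"\w+|[^\w\s]", text)

simple = naive_tokenize("Hello, world!")
tricky = naive_tokenize("You haven't seen Star Wars...?")
print(simple)
print(tricky)
```

On the simple greeting the result matches the NLTK output above. On the second sentence the naive version breaks `haven't` into `haven`, `'`, `t` and the ellipsis into three separate dots, whereas `word_tokenize` returned `have`, `n't` and a single `...`.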

* `nltk.FreqDist()` is another useful NLTK function. 
* It builds a frequency-count dictionary from a list of tokens. 

In [6]:
# First "Rose" is capitalized. How to lowercase? 
sent = 'Rose is a rose is a rose is a rose.'
toks = nltk.word_tokenize(sent)
print(toks)

['Rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']


In [7]:
freq = nltk.FreqDist(toks)
freq

FreqDist({'.': 1, 'Rose': 1, 'a': 3, 'is': 3, 'rose': 3})

In [8]:
freq.most_common()

[('a', 3), ('rose', 3), ('is', 3), ('.', 1), ('Rose', 1)]

In [9]:
freq['rose']

3

In [10]:
len(freq)

5

In [11]:
freq.keys()

dict_keys(['.', 'a', 'rose', 'Rose', 'is'])
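
One possible answer to the lowercasing question: lowercase each token before counting, so that `Rose` and `rose` fall under the same key. The sketch below hard-codes the token list from the output above rather than calling the tokenizer. As an assumption for portability, it falls back to `collections.Counter` (which `FreqDist` subclasses) when NLTK is not installed.

```python
try:
    from nltk import FreqDist
except ImportError:
    # Fallback only: FreqDist is a subclass of collections.Counter.
    from collections import Counter as FreqDist

toks = ['Rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']

# Lowercase every token so counts are case-insensitive.
toks_lower = [t.lower() for t in toks]
freq_lower = FreqDist(toks_lower)

print(freq_lower['rose'])   # 'Rose' and 'rose' are now counted together
print(len(freq_lower))      # number of distinct keys
```

Another option is to lowercase the whole string before tokenizing it, as in `nltk.word_tokenize(sent.lower())`.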

## Processing a single text file

### Reading in a text file
* `open(filename).read()` opens a text file and reads in the content as a *single continuous string*. 

In [12]:
myfile = 'C:/Users/narae/Desktop/inaugural/1789-Washington.txt'  # Use your own userid; Mac users should omit C:
wtxt = open(myfile).read()
print(wtxt)

Fellow-Citizens of the Senate and of the House of Representatives:

Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years -- a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not bu

In [13]:
len(wtxt)     # Number of characters in text

8619

In [14]:
'American' in wtxt  # phrase as a substring. try "Americans"

True

In [15]:
'th' in wtxt

True
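
A slightly more defensive way to read a file is sketched below. It uses a `with` block so the file is closed automatically, and builds the path with `pathlib` so the same code works on Windows and Mac. The helper name `read_text`, the desktop location, and the `utf-8` default are all assumptions made for illustration; adjust the encoding to match your own files.

```python
import tempfile
from pathlib import Path

def read_text(path, encoding="utf-8"):
    """Return the whole file as one string; the file is closed on exit."""
    with open(path, encoding=encoding) as f:
        return f.read()

# Assumed location: the 'inaugural' folder on the current user's desktop.
corpus_dir = Path.home() / "Desktop" / "inaugural"
washington = corpus_dir / "1789-Washington.txt"

# Demo on a throwaway file so this sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.txt"
    demo.write_text("Fellow-Citizens of the Senate", encoding="utf-8")
    demo_text = read_text(demo)

print(len(demo_text))          # number of characters
print("Senate" in demo_text)   # substring test, as with wtxt
```

With the corpus in place, `read_text(washington)` would play the role of `open(myfile).read()`.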

### Tokenize text, compile frequency count

In [None]:
# Turn off/on pretty printing (prints too many lines)
%pprint    

In [None]:
# Tokenize text
nltk.word_tokenize(wtxt)

In [None]:
wtokens = nltk.word_tokenize(wtxt.lower())
len(wtokens)     # Number of tokens in text (words plus punctuation and symbols)

In [None]:
# Build a dictionary of frequency count
wfreq = nltk.FreqDist(wtokens)
wfreq['the']

In [None]:
'Fellow-Citizens' in wfreq
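
Because the tokens were lowercased before counting, a lookup with the original capitalization will not match. The toy example below illustrates this with a hand-made token list and `collections.Counter` as a stand-in for `FreqDist`; `count_of` is a hypothetical helper, not an NLTK function.

```python
from collections import Counter

raw_tokens = ["Fellow-Citizens", "of", "the", "Senate", "and", "of", "the", "House"]

# Count lowercased tokens, mirroring wtxt.lower() before tokenizing.
lowered = Counter(t.lower() for t in raw_tokens)

found_original = "Fellow-Citizens" in lowered   # capitalized key is absent
found_lowered = "fellow-citizens" in lowered    # lowercased key is present

def count_of(word, counts):
    # Lowercase the query so it matches the lowercased keys.
    return counts[word.lower()]

print(found_original, found_lowered, count_of("The", lowered))
```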

In [None]:
len(wfreq)      # Number of unique tokens (word types) in text

In [None]:
wfreq.most_common(30)     # 30 most common words

In [None]:
# dir() lists all attributes and methods available on an object. 
dir(wfreq)

In [None]:
# Hmm. Wonder what .freq does... let's find out. 
help(wfreq.freq)

In [None]:
wfreq.freq('the')

In [None]:
len(wfreq.hapaxes())
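
As a rough guide to what these two methods compute: `.freq()` gives a token's count divided by the total number of tokens, and `.hapaxes()` gives the tokens that occur exactly once. The sketch below reproduces both by hand on the toy "rose" tokens, using `collections.Counter` as a stand-in rather than NLTK's own implementation.

```python
from collections import Counter

tokens = ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
counts = Counter(tokens)

# Total number of tokens counted.
total = sum(counts.values())

# Relative frequency: count of the token divided by the total.
rel_freq_rose = counts['rose'] / total

# Hapaxes: tokens that occur exactly once.
hapaxes = [w for w, c in counts.items() if c == 1]

print(total, rel_freq_rose, hapaxes)
```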

### Average sentence length, frequency of long words

In [None]:
sentcount = wfreq['.'] + wfreq['?'] + wfreq['!']  # Assuming every sentence ends with ., ! or ?
print(sentcount)

In [None]:
# Tokens include symbols and punctuation. First 50 tokens:
wtokens[:50]

In [None]:
wtokens_nosym = [t for t in wtokens if t.isalnum()]    # alpha-numeric tokens only
len(wtokens_nosym)

In [None]:
# Try "n't", "20th", "."
"n't".isalnum()

In [None]:
# First 50 tokens, alpha-numeric tokens only: 
wtokens_nosym[:50]

In [None]:
len(wtokens_nosym)/sentcount     # Average sentence length in number of words
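
The same calculation on a tiny hand-made token list makes the steps, and one side effect of `.isalnum()`, easier to see. The token list below is invented for illustration.

```python
tokens = ["i", "came", ".", "i", "saw", ".", "did", "n't", "i", "conquer", "?"]

# Sentence count: number of sentence-final punctuation tokens.
sentcount = sum(tokens.count(p) for p in (".", "?", "!"))

# Keep alpha-numeric tokens only.
words = [t for t in tokens if t.isalnum()]
dropped = [t for t in tokens if not t.isalnum()]

avg_len = len(words) / sentcount
print(sentcount, len(words), avg_len)
print(dropped)

# Tokens containing an apostrophe or hyphen are not alpha-numeric.
hyphenated_ok = "fellow-citizens".isalnum()
ordinal_ok = "20th".isalnum()
```

Note that `n't` is filtered out along with the punctuation, and so is any hyphenated token such as `fellow-citizens`, so the word count is a slight undercount.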

In [None]:
[w for w in wfreq if len(w) >= 13]       # all 13+ character words

In [None]:
long = [w for w in wfreq if len(w) >= 13]
# sort long alphabetically using sorted()
for w in sorted(long):
    print(w, len(w), wfreq[w])               # long words tend to be less frequent
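
`sorted()` also accepts a `key` function, which allows other orderings. As one illustration, the sketch below ranks long words by length (longest first) and breaks ties alphabetically. The token list is invented and `collections.Counter` stands in for `FreqDist`.

```python
from collections import Counter

tokens = ["representatives", "the", "qualifications", "the", "interruptions",
          "fellow-citizens", "the", "of", "of", "qualifications"]
counts = Counter(tokens)

# All word types with 13 or more characters.
long_words = [w for w in counts if len(w) >= 13]

# Sort by descending length, then alphabetically.
ranked = sorted(long_words, key=lambda w: (-len(w), w))

for w in ranked:
    print(w, len(w), counts[w])
```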

## More tomorrow

- Processing the entire Inaugural Address corpus
    - Which inaugural speech was the longest? The shortest?
    - Which presidents favored long sentences?

All answered on [Day 3 (Wednesday)](day3.ipynb)

## Bring your own corpus
Is there any particular corpus you are looking to work with? Please suggest it for our very last class, when we will take a look at a couple of them together. Ideal candidates are: 
- Sharable with class (you should either have ownership or the corpus should be publicly available)
- Moderate in size (100MB or less)

__Please email both Na-Rae and David with your suggestions by NOON TODAY__. Please include a web link or attach a zipped archive (if you own the rights) along with a brief description of your end goals. 