Text Processing with NLTK3 Cookbook

## Chapter 1: Tokenizing Text and WordNet Basics
***

## Before you start executing the programs on this page...

You need to install the Natural Language Toolkit (NLTK). Installing the NLTK itself is relatively quick, but downloading the NLTK corpora may take longer (10-15 minutes).

### Installing the NLTK 

1. Open Terminal, and change your current folder to the course folder (e.g., ```Desktop/cfh/```)
2. Activate the virtual environment
3. Type the following line, and then press the return key.

```
pip install nltk
```

pip should now download and install the NLTK. (If something goes wrong here, please contact me.)

Once you have successfully installed the NLTK, you can download some NLTK corpora.

### Downloading NLTK Corpora

1. Open Terminal and change your current folder to the course folder (e.g., ```Desktop/cfh/```)
2. Activate the virtual environment
3. Start Python by typing 'python' in the command line
4. Import the NLTK by typing: **```import nltk```**
5. Launch the NLTK download app by typing: **```nltk.download()```**

The following example shows Steps 3-5:

```
(ven) <your_username>/Desktop/cfh % python
Python 3.5.1 |Anaconda 4.0.0 (x86_64)| (default, Dec  7 2015, 11:24:55) 
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
```

Select the **Corpora** tab, and download the following items:

- webtext
- wordnet
- stopwords

**CAUTION!!** Double-clicking an item starts downloading it immediately, so do not double-click anything you do not want. In particular, DO NOT double-click "all", "all-corpora", or "book" under the Collections tab. You will start downloading MANY things, and it will take a LONG time.

Reference: http://www.nltk.org/data.html
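
If you would rather not use the download app at all, the same three corpora can be fetched by name from Python. This is only a sketch: it assumes the NLTK is already installed and that you are online, and `COURSE_CORPORA` and `download_course_corpora` are names made up for this page.

```python
# Identifiers of the three corpora used on this page.
COURSE_CORPORA = ["webtext", "wordnet", "stopwords"]


def download_course_corpora():
    """Download each corpus by name instead of clicking through the download app."""
    import nltk  # imported here so the list above can be read without the NLTK

    for name in COURSE_CORPORA:
        nltk.download(name)   # fetches only this one package


# Call download_course_corpora() once, inside the virtual environment.
```

Because each package is requested by name, there is no risk of starting one of the large collections by accident.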

***

If you decide to download everything, do it when you do not need to use your computer for a while. It will take a lot of time.

If you are interested in the complete listing of corpora included in the NLTK, visit http://www.nltk.org/nltk_data/

***

When you are done, quit the download app, then quit Python by typing **```quit()```**.

### Notes on importing...

What are the differences between the following 2 lines?
```
from nltk.tokenize import word_tokenize
from nltk.tokenize import WordPunctTokenizer
```

These lines use the following syntax (template) for importing 2 kinds of 'things' into your program.
```
from <module> import <something>
```

```<module>``` is the name of the module that contains the ```<something>``` you are interested in.

When this ```<something>``` is a function, it is usually typed in lower case letters (e.g., word_tokenize). These are tools you can use. It's sort of like you are renting a tool (like a chainsaw). **Usually, tools are good at completing a single task.**

When this ```<something>``` is a class, it is usually typed in CamelCase letters. **A class** (such as 'WordPunctTokenizer') is sort of like a company or group of experts. So, when you *call them* using the following syntax:

```
tokenizer = WordPunctTokenizer()
```

you will get to use an expert sent from the company. Typically, these experts (created by **classes**) can do multiple things (unlike functions).
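
To make the tool-versus-expert picture concrete, here is a small sketch in plain Python. It does not use the NLTK; `split_on_spaces` and `SimpleTokenizer` are hypothetical names invented for this illustration.

```python
# A function: one tool, one job.
def split_on_spaces(text):
    return text.split(" ")


# A class: a blueprint for "experts" that can do several related jobs.
class SimpleTokenizer:
    def __init__(self, separator=" "):
        self.separator = separator          # each expert remembers its own settings

    def tokenize(self, text):
        return text.split(self.separator)

    def count_tokens(self, text):
        return len(self.tokenize(text))


print(split_on_spaces("a chainsaw cuts wood"))

tokenizer = SimpleTokenizer()               # "call the company" to get an expert
print(tokenizer.tokenize("a chainsaw cuts wood"))
print(tokenizer.count_tokens("a chainsaw cuts wood"))
```

The function is used directly, while the class must first be called to create an object, and that object then offers more than one method.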

In [1]:
# Tokenizing Text into Sentences (p.8)

from nltk.tokenize import sent_tokenize        # import 'sent_tokenize' function from the 'nltk.tokenize' module

s = "In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters."

res = sent_tokenize(s)
print(res)
for s in res:
    print(s)

['In the beginning God created the heaven and the earth.', 'And the earth was without form, and void; and darkness was upon the face of the deep.', 'And the Spirit of God moved upon the face of the waters.']
In the beginning God created the heaven and the earth.
And the earth was without form, and void; and darkness was upon the face of the deep.
And the Spirit of God moved upon the face of the waters.
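
Why not simply split on periods? The sketch below is a deliberately naive splitter written with the standard `re` module (`naive_sent_split` is a made-up name, not part of the NLTK). It handles the simple case but breaks on abbreviations, which is the kind of problem `sent_tokenize` is designed to handle.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


print(naive_sent_split("And the earth was without form. And darkness was upon the deep."))
print(naive_sent_split("Dr. Smith arrived. He sat down."))   # wrongly splits after "Dr."
```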


In [21]:
# Tokenizing Sentences into Words (p.10)

from nltk.tokenize import word_tokenize        # import 'word_tokenize' function from the 'nltk.tokenize' module

s = "In the beginning God created the heaven and the earth."
res = word_tokenize(s)
print(res)


['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']


In [194]:
# Tokenizing Sentences into Words (Separating Contractions, p.10)

from nltk.tokenize import word_tokenize        # import 'word_tokenize' function from the 'nltk.tokenize' module
s = "You don't know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain't no matter."
res = word_tokenize(s)
print(res)

from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
res = tokenizer.tokenize(s)
print(res)

['You', 'do', "n't", 'know', 'about', 'me', 'without', 'you', 'have', 'read', 'a', 'book', 'by', 'the', 'name', 'of', 'The', 'Adventures', 'of', 'Tom', 'Sawyer', ';', 'but', 'that', 'ai', "n't", 'no', 'matter', '.']
['You', 'don', "'", 't', 'know', 'about', 'me', 'without', 'you', 'have', 'read', 'a', 'book', 'by', 'the', 'name', 'of', 'The', 'Adventures', 'of', 'Tom', 'Sawyer', ';', 'but', 'that', 'ain', "'", 't', 'no', 'matter', '.']
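
The second list above separates every run of punctuation from the letters around it, which is why "don't" becomes three tokens. The NLTK documentation describes `WordPunctTokenizer` as using the pattern `\w+|[^\w\s]+`. The sketch below applies that pattern with the standard `re` module; `word_punct_split` is a made-up name, and this is an illustration rather than the NLTK's own code.

```python
import re


def word_punct_split(text):
    # \w+       -> a run of letters, digits or underscores
    # [^\w\s]+  -> a run of anything that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]+", text)


print(word_punct_split("that ain't no matter."))
```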


In [33]:
# Tokenizing Sentences using Regular Expressions (p.13)

from nltk.tokenize import RegexpTokenizer      

tokenizer = RegexpTokenizer(r"[\w']+")         # raw string: one or more word characters or apostrophes
s = "You don't know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain't no matter."
res = tokenizer.tokenize(s)
print(res)

['You', "don't", 'know', 'about', 'me', 'without', 'you', 'have', 'read', 'a', 'book', 'by', 'the', 'name', 'of', 'The', 'Adventures', 'of', 'Tom', 'Sawyer', 'but', 'that', "ain't", 'no', 'matter']
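
To see what the pattern itself matches, you can try it with Python's standard `re` module. This sketch only mirrors the idea of the cell above on a shortened sentence; it is not how you would normally call the NLTK.

```python
import re

pattern = re.compile(r"[\w']+")   # one or more word characters or apostrophes

s = "You don't know about me; but that ain't no matter."
tokens = pattern.findall(s)
print(tokens)
```

Because the apostrophe is inside the character class, contractions stay in one piece, and because other punctuation is not, the semicolon and the period are dropped.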


In [43]:
# Training a Sentence Tokenizer (p.14)

from nltk.tokenize import PunktSentenceTokenizer  
from nltk.corpus import webtext
text = webtext.raw('overheard.txt')
sent_tokenizer = PunktSentenceTokenizer(text)     # teach the tokenizer

sents1 = sent_tokenizer.tokenize(text)
print(sents1[678])

from nltk.tokenize import sent_tokenize           # default tokenizer
sents2 = sent_tokenize(text)
print(sents2[678])


Girl: But you already have a Big Mac...
Girl: But you already have a Big Mac...
Hobo: Oh, this is all theatrical.


In [3]:
# Filtering Stopwords in a Tokenized Sentence (p.17)
from nltk.tokenize import RegexpTokenizer      
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))    # 'set' is a data type 
                                                   # A set is an unordered collection with no duplicate elements
tokenizer = RegexpTokenizer(r"[\w']+")
s = "You don't know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain't no matter."
words = tokenizer.tokenize(s)
print(words)
print("---")

# Let's remove all the stop words.
lst = []
for word in words:         
    if word not in english_stops:
        lst.append(word)
print(lst)

# !!! [word for word in words if word not in english_stops] does almost exactly what the code above does

res = [word for word in words if word not in english_stops]   
print(res)

['You', "don't", 'know', 'about', 'me', 'without', 'you', 'have', 'read', 'a', 'book', 'by', 'the', 'name', 'of', 'The', 'Adventures', 'of', 'Tom', 'Sawyer', 'but', 'that', "ain't", 'no', 'matter']
---
['You', "don't", 'know', 'without', 'read', 'book', 'name', 'The', 'Adventures', 'Tom', 'Sawyer', "ain't", 'matter']
['You', "don't", 'know', 'without', 'read', 'book', 'name', 'The', 'Adventures', 'Tom', 'Sawyer', "ain't", 'matter']
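
Notice that 'You' and 'The' survive the filter above while 'you' and 'the' do not: the stopword list is lowercase, and the comparison is case-sensitive. Here is a sketch of a case-insensitive version. `toy_stops` is a tiny made-up stand-in for the real stopword list, so the example runs without the NLTK.

```python
toy_stops = {"you", "the", "a", "of", "by", "about", "me", "have"}   # hypothetical mini stopword set

words = ["You", "know", "about", "me", "The", "Adventures", "of", "Tom", "Sawyer"]

case_sensitive   = [w for w in words if w not in toy_stops]
case_insensitive = [w for w in words if w.lower() not in toy_stops]

print(case_sensitive)     # 'You' and 'The' are kept
print(case_insensitive)   # 'You' and 'The' are removed
```

Whether to lowercase first depends on the task: for a title such as "The Adventures of Tom Sawyer" you may want to keep the capitalized 'The'.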


In [86]:
# Looking up Synsets for a Word in Wordnet (p.18)

from nltk.corpus import wordnet

print(wordnet.synsets("lunch"))

print("---")
def word_and_definition(w):
    syns = wordnet.synsets(w)              # ask 'wordnet' for the list of synsets (synonym sets) of the word
    for syn in syns:                       # for each synset in the list
        print("{} — {}".format(syn.name(), syn.definition()))   # print its name and definition
    
word_and_definition("lunch")    
print("---")
word_and_definition("food")

[Synset('lunch.n.01'), Synset('lunch.v.01'), Synset('lunch.v.02')]
---
lunch.n.01 — a midday meal
lunch.v.01 — take the midday meal
lunch.v.02 — provide a midday meal for
---
food.n.01 — any substance that can be metabolized by an animal to give energy and build tissue
food.n.02 — any solid substance (as opposed to liquid) that is used as a source of nourishment
food.n.03 — anything that provides mental stimulus for thinking


In [125]:
# Looking up Synsets for a Word in Wordnet (Part of Speech, p20)

from nltk.corpus import wordnet

print("---")
def word_and_pos(w):
    syns = wordnet.synsets(w)              # ask 'wordnet' for the list of synsets (synonym sets) of the word
    for syn in syns:                       # for each synset in the list
        print("{} — {}".format(syn.name(), syn.pos()))   # print its name and part of speech
    
word_and_pos("fan")

---
fan.n.01 — n
sports_fan.n.01 — n
fan.n.03 — n
fan.v.01 — v
fan.v.02 — v
fan.v.03 — v
winnow.v.01 — v


In [2]:
# Looking up Lemmas and Synonyms in WordNet (p.21)

# !!! -- A lemma is the form of a word you would look up in a dictionary (its headword)

from nltk.corpus import wordnet

syn = wordnet.synsets("lunch")[0]                           # first element in the synsets
print("syn.name() = {}\n".format(syn.name()))
print("syn.lemmas() = {}\n".format(syn.lemmas()))             # get the synonyms as Lemma objects
print("syn.lemma_names() = {}\n".format(syn.lemma_names()))   # get the synonyms as plain names (strings)

lemmas = syn.lemmas()

for w in lemmas:
    print(w.name())



syn.name() = lunch.n.01

syn.lemmas() = [Lemma('lunch.n.01.lunch'), Lemma('lunch.n.01.luncheon'), Lemma('lunch.n.01.tiffin'), Lemma('lunch.n.01.dejeuner')]

syn.lemma_names() = ['lunch', 'luncheon', 'tiffin', 'dejeuner']

lunch
luncheon
tiffin
dejeuner


In [163]:
# Calculating WordNet Synset Similarity (p.)

from nltk.corpus import wordnet

lunch  = wordnet.synsets("lunch")[0]
dinner = wordnet.synsets("dinner")[0]
frog   = wordnet.synsets("frog")[0]
acad   = wordnet.synsets("academic")[0]

print("lunch vs. dinner:    {:.3f}".format(lunch.wup_similarity(dinner)))
print("lunch vs. frog:      {:.3f}".format(lunch.wup_similarity(frog)))
print("lunch vs. academic:  {:.3f}".format(lunch.wup_similarity(acad)))

lunch vs. dinner:    0.875
lunch vs. frog:      0.211
lunch vs. academic:  0.250
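
`wup_similarity` is the Wu-Palmer measure. Roughly, it scores two synsets by how deep their most specific shared ancestor sits in the hypernym tree: 2 * depth(ancestor) / (depth(a) + depth(b)). The sketch below applies that formula to a tiny made-up taxonomy. `PARENT` and all function names are hypothetical, and WordNet's real tree is far larger, so these numbers will not match the output above.

```python
# Hypothetical mini taxonomy: child -> parent (the root, "entity", has no parent).
PARENT = {
    "food": "entity", "meal": "food", "lunch": "meal", "dinner": "meal",
    "animal": "entity", "frog": "animal",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    return len(path_to_root(node))            # the root has depth 1


def wup(a, b):
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node that is also above a is the deepest shared ancestor.
    shared = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(shared) / (depth(a) + depth(b))


print("lunch vs. dinner: {:.3f}".format(wup("lunch", "dinner")))
print("lunch vs. frog:   {:.3f}".format(wup("lunch", "frog")))
```

"lunch" and "dinner" meet at "meal", deep in the tree, so they score high. "lunch" and "frog" only meet at the root, so they score low.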


In [1]:
# Discovering Word Collocations (p.25)
from nltk.tokenize import RegexpTokenizer      
from nltk.corpus import webtext
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

tokenizer = RegexpTokenizer(r"[\w']+")

with open("text/mlk_i_have_a_dream_1963.txt") as fhand:   # 'with' closes the file automatically
    s = fhand.read()
words = tokenizer.tokenize(s)   # tokenize the text
# print(words)

bcf = BigramCollocationFinder.from_words(words)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 5)


[('will', 'be'),
 ('freedom', 'ring'),
 ('ring', 'from'),
 ('one', 'day'),
 ('a', 'dream')]
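
A bigram is simply a pair of neighboring words. The sketch below builds and counts bigrams using only the standard library, with a short made-up sentence in place of the speech. Note that it ranks by raw frequency, which is not the same as the `likelihood_ratio` score used above: that score also takes into account how common each word is on its own.

```python
from collections import Counter

words = "let freedom ring from the hills let freedom ring from the mountains".split()

bigrams = list(zip(words, words[1:]))     # pair every word with the word that follows it
counts = Counter(bigrams)                 # how many times does each pair occur?

print(counts.most_common(3))
```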

In [5]:
# Discovering Word Collocations (p.25)
import pprint                                  # pretty printing for debugging
pp = pprint.PrettyPrinter(indent=4)            # create a pretty printing object used for debugging

from nltk.tokenize import RegexpTokenizer      
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

stopset = set(stopwords.words('english'))    # 'set' is a data type 
tokenizer = RegexpTokenizer(r"[\w']+")

fhand = open("text/clinton_dnc_speech_2016.txt")
s = fhand.read()
words = tokenizer.tokenize(s.lower())   # tokenize the text

bcf = BigramCollocationFinder.from_words(words)
res = bcf.nbest(BigramAssocMeasures.likelihood_ratio, 10)
print("---- Clinton ----")
pp.pprint(res)

print("")
# fhand = open("text/clinton_dnc_speech_2016.txt")
fhand.seek(0)
s = fhand.read()
words = tokenizer.tokenize(s.lower())   # tokenize the text

bcf = BigramCollocationFinder.from_words(words)
filter_stops = lambda w: len(w) < 3 or w in stopset     # define a procedure that identifies a stop word
bcf.apply_word_filter(filter_stops)                     # ask 'bcf' to apply the filtering procedure
res = bcf.nbest(BigramAssocMeasures.likelihood_ratio, 10)
print("---- Clinton (without stop words) ----")
pp.pprint(res)

print("")
fhand = open("text/trump_rnc_speech_2016.txt")
s = fhand.read()
words = tokenizer.tokenize(s.lower())   # tokenize the text

bcf = BigramCollocationFinder.from_words(words)
res = bcf.nbest(BigramAssocMeasures.likelihood_ratio, 10)
print("---- Trump ----")
pp.pprint(res)

print("")
fhand.seek(0)
s = fhand.read()
words = tokenizer.tokenize(s.lower())   # tokenize the text

bcf = BigramCollocationFinder.from_words(words)
filter_stops = lambda w: len(w) < 3 or w in stopset
bcf.apply_word_filter(filter_stops)
res = bcf.nbest(BigramAssocMeasures.likelihood_ratio, 10)
print("---- Trump (without stop words) ----")
pp.pprint(res)


---- Clinton ----
[   ('we', 'will'),
    ('donald', 'trump'),
    ('each', 'other'),
    ('our', 'country'),
    ('if', 'you'),
    ('tell', 'you'),
    ('years', 'ago'),
    ('thank', 'you'),
    ('police', 'officers'),
    ('millions', 'of')]

---- Clinton (without stop words) ----
[   ('donald', 'trump'),
    ('years', 'ago'),
    ('police', 'officers'),
    ('donald', "trump's"),
    ('united', 'states'),
    ('small', 'businesses'),
    ('young', 'people'),
    ('years', 'later'),
    ('middle', 'class'),
    ('sales', 'pitch')]

---- Trump ----
[   ('going', 'to'),
    ('our', 'country'),
    ('hillary', 'clinton'),
    ('i', 'am'),
    ('we', 'will'),
    ('will', 'be'),
    ('my', 'opponent'),
    ('i', 'have'),
    ('believe', 'me'),
    ('united', 'states')]

---- Trump (without stop words) ----
[   ('hillary', 'clinton'),
    ('united', 'states'),
    ('years', 'ago'),
    ('president', 'obama'),
    ('white', 'house'),
    ("we're", 'going'),
    ('never', 'ever'),
    ('l

In [1]:
# TrigramCollocation Finder (p.26)

import pprint                                  # pretty printing for debugging
pp = pprint.PrettyPrinter(indent=4)            # create a pretty printing object used for debugging

from nltk.tokenize import RegexpTokenizer      
from nltk.corpus import stopwords
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures

stopset = set(stopwords.words('english'))    # 'set' is a data type 
tokenizer = RegexpTokenizer(r"[\w']+")

with open("text/clinton_dnc_speech_2016.txt") as fhand:   # 'with' closes the file automatically
    s = fhand.read()
words = tokenizer.tokenize(s.lower())   # tokenize the text

tcf = TrigramCollocationFinder.from_words(words)
filter_stops = lambda w: len(w) < 3 or w in stopset
tcf.apply_word_filter(filter_stops)
res = tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 10)
print("---- Clinton ----")
pp.pprint(res)

print("")
with open("text/trump_rnc_speech_2016.txt") as fhand:     # 'with' closes the file automatically
    s = fhand.read()
words = tokenizer.tokenize(s.lower())   # tokenize the text

tcf = TrigramCollocationFinder.from_words(words)
filter_stops = lambda w: len(w) < 3 or w in stopset
tcf.apply_word_filter(filter_stops)
res = tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 10)
print("---- Trump ----")
pp.pprint(res)

---- Clinton ----
[   ('donald', 'trump', 'says'),
    ('donald', 'trump', 'refused'),
    ('wisconsin', 'donald', 'trump'),
    ('think', 'donald', 'trump'),
    ('donald', 'trump', "doesn't"),
    ('chief', 'donald', 'trump'),
    ('donald', 'trump', "can't"),
    ('donald', 'trump', 'donald'),
    ('trump', 'donald', 'trump'),
    ('240', 'years', 'ago')]

---- Trump ----
[   ('hillary', 'clinton', 'death'),
    ('hillary', 'clinton', 'plans'),
    ('hillary', 'clinton', 'pushed'),
    ('hillary', 'clinton', 'remember'),
    ('yet', 'hillary', 'clinton'),
    ('put', 'hillary', 'clinton'),
    ('hillary', 'clinton', 'americans'),
    ('united', 'states', 'gov'),
    ('united', 'states', 'supreme'),
    ('eight', 'years', 'ago')]
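
The word filter is what keeps stop words out of the lists above. According to the NLTK documentation, `apply_word_filter` removes every candidate n-gram in which any word matches the filter. The sketch below imitates that idea in plain Python for n-grams of any length. `ngrams`, `filtered_ngram_counts`, and `toy_stops` are made-up names, and it counts raw frequencies rather than computing a likelihood ratio.

```python
from collections import Counter


def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return list(zip(*(words[i:] for i in range(n))))


def filtered_ngram_counts(words, n, drop_word):
    """Count n-grams, skipping any n-gram that contains a word rejected by drop_word."""
    kept = (gram for gram in ngrams(words, n) if not any(drop_word(w) for w in gram))
    return Counter(kept)


toy_stops = {"the", "of", "is", "a"}                      # hypothetical mini stopword set
drop = lambda w: len(w) < 3 or w in toy_stops             # same shape as filter_stops above

words = "the united states supreme court is the court of the united states".split()
trigram_counts = filtered_ngram_counts(words, 3, drop)
print(trigram_counts)
```

Only trigrams made up entirely of longer, non-stop words survive, which is why the filtered lists above are dominated by names and content words.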
