In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
# Make the graphs a bit prettier, and bigger
plt.rcParams['figure.figsize'] = (15, 5)

### Installing NLTK toolkit

Before starting, let's install the NLTK library (http://www.nltk.org/) by typing the following commands in the Vagrant terminal.

* Install Numpy: `sudo -H pip3 install -U numpy`
* Install NLTK: `sudo -H pip3 install -U nltk`
* Install Tkinter: `sudo -H apt-get -y install python3-tk`

Test that the library is installed properly by running `import nltk` in a notebook cell.

Once the NLTK toolkit is installed, we need to install the NLTK data: 

`sudo python3 -m nltk.downloader -d /usr/share/nltk_data all`

In [None]:
import nltk

In [None]:
!sudo -H apt-get -y install python3-tk

In [None]:
!sudo python3 -m nltk.downloader -d /usr/share/nltk_data all

#### Extra NLTK resources

NLTK also comes with some of the files from Project Gutenberg already included:

In [None]:
from nltk.book import *

In [None]:
len(text4)

In [None]:
list(text4)

It is one thing to automatically detect that a particular word occurs in a text, and to display some words that appear in the same context. However, we can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe represents an instance of a word, and each row represents the entire text. You can produce this plot as shown below. You might like to try more words (e.g., liberty, constitution), and different texts. Can you predict the dispersion of a word before you view it? 

In [None]:
# text4 is the Inaugural Address Corpus
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America", "world"])
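
The positional information behind a dispersion plot is simply the list of offsets at which a word occurs. The sketch below is plain Python on a made-up token list (the `tokens` data and the `word_offsets` helper are illustrative assumptions, not part of NLTK); it only shows what each stripe in the plot stands for.

```python
# Toy token list standing in for a real text (hypothetical data).
tokens = "we the people hold that the people govern and the people decide".split()

def word_offsets(tokens, target):
    """Return the 0-based positions at which `target` occurs in `tokens`."""
    return [i for i, tok in enumerate(tokens) if tok == target]

# One list of offsets per word: each offset would be one stripe in the plot.
offsets = {w: word_offsets(tokens, w) for w in ["people", "the", "liberty"]}
for word, positions in offsets.items():
    print(word, positions)
```

A word that never occurs, like `liberty` here, simply gets an empty row.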

#### Exercise

Pick your own text and create a dispersion plot for your keywords of choice. As a reminder:
* text1: Moby Dick by Herman Melville 1851
* text2: Sense and Sensibility by Jane Austen 1811
* text3: The Book of Genesis
* text4: Inaugural Address Corpus
* text5: Chat Corpus
* text6: Monty Python and the Holy Grail
* text7: Wall Street Journal
* text8: Personals Corpus
* text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [None]:
# Your code here

### Normalization and Tokenization

For proper analysis, we need to remove the punctuation from the document. However, keeping only alphanumeric characters will break things like `B.Sc.`, `N.Y.U.`, and so on. The process of properly splitting the document into appropriate basic elements is called `tokenization`.
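
To see the problem concretely, here is a minimal sketch of the naive approach, using only the standard library `re` module. The sample sentence is made up for illustration.

```python
import re

text = "She got her B.Sc. at N.Y.U. in 2010."

# Naive approach: turn every run of non-alphanumeric characters into a space,
# then split on whitespace.
naive = re.sub(r"[^A-Za-z0-9]+", " ", text).split()
print(naive)
```

The abbreviations are shattered into fragments such as `B`, `Sc`, `N`, `Y`, `U`, which is why a dedicated tokenizer is needed.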

NLTK provides a set of functions for tokenization (see also http://www.nltk.org/_modules/nltk/tokenize.html):

#### Sentence splitting

In [None]:
example = '''Good bagels cost $2.88 in N.Y.C. Hey Prof. Ipeirotis, please buy me two of them.
    
    Thanks.
    
    PS: You have a Ph.D. you can handle this, right?'''

print(nltk.sent_tokenize(example))
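
For comparison, a naive splitter that cuts after every `.`, `!` or `?` followed by whitespace gets confused by abbreviations. This is a standard-library sketch on a shortened, made-up variant of the example above, not what `nltk.sent_tokenize` does internally.

```python
import re

text = "Hey Prof. Ipeirotis, please buy me two of them. Thanks."

# Split wherever sentence-ending punctuation is followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
print(naive_sentences)
```

The split after `Prof.` is wrong; avoiding that kind of break is the reason to use a trained sentence tokenizer.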

#### Word splitting

In [None]:
import string 

example = '''Good bagels cost $2.88 in New York.  
    Hey Prof. Ipeirotis, please buy me two of them.
    
    Thanks.
    
    PS: You have a Ph.D., you can handle this, right?'''

for sentence in nltk.sent_tokenize(example):
    print("Sentence:", sentence)
    tokens = nltk.word_tokenize(sentence)
    print("All tokens:", tokens)
    # Lowercase the tokens and drop standalone punctuation marks
    words = [w.lower() for w in tokens if w not in string.punctuation]
    print("Only words:", words)
    print('-------------------')
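
One caveat about the filter in the loop above: `w not in string.punctuation` is a substring test against a single string, so multi-character punctuation tokens can slip through. The sketch below uses a hand-written token list (an assumption about the kind of tokens a tokenizer may emit) and shows one possible alternative, which is not the notebook's original method.

```python
import string

tokens = ["Wait", "...", "really", "?", "``", "yes", "''", "2.88", "--"]

# Substring test, as in the loop above: a token is dropped only if it is
# a contiguous piece of string.punctuation.
substring_filter = [t.lower() for t in tokens if t not in string.punctuation]

# Alternative (an assumption): keep a token only if it contains
# at least one letter or digit.
alnum_filter = [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]

print(substring_filter)
print(alnum_filter)
```

Both filters keep numbers such as `2.88`; add a `t.isalpha()` check if only alphabetic words are wanted.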

### Frequency distributions, Zipf's law

### Processing Text: Introduction 

Let's start by fetching a piece of text. We will go to [Project Gutenberg](https://www.gutenberg.org/) and fetch the text of "The Origin of Species".

In [None]:
# The with-block closes the file automatically
with open('/data/origin-of-species.txt', 'r') as f:
    content = f.read()

In [None]:
# Number of characters in the text
print(len(content))

Now, we have our first text ready to be analyzed. Let's first do some analysis of the words that appear in this classic text:

In [None]:
tokens = nltk.word_tokenize(content)

# Frequency analysis for words of interest
fdist = nltk.FreqDist(tokens)

# Number of unique and total words in the text
print(fdist)

Let's take a look at the frequencies of some words in the text:

In [None]:
fdist

In [None]:
print(fdist["species"])
print(fdist["sexual"])
print(fdist["origin"])
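
A frequency distribution behaves much like a dictionary from token to count. As a rough, NLTK-free sketch, the standard library's `collections.Counter` shows the same ideas on a toy sentence (the sentence is made up for illustration).

```python
from collections import Counter

toy_tokens = "the cat sat on the mat and the dog sat too".split()
counts = Counter(toy_tokens)

unique_tokens = len(counts)          # distinct tokens
total_tokens = sum(counts.values())  # all tokens, repeats included
print(unique_tokens, total_tokens)

# Looking up a token that never occurred gives 0 rather than an error.
print(counts["the"], counts["sat"], counts["unicorn"])
```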

Now let's see which tokens are the most frequent in the text:

In [None]:
print(fdist.most_common(50))

Hm, that is not very useful. These are all words that are needed by every single English text. Only the word "species" seems to have some meaning. The rest of the words tell us nothing about the text; they're just English "plumbing."

What proportion of the text is taken up with such words? We can generate a cumulative frequency plot for these words:


In [None]:
fdist.plot(100, cumulative=True)

These 100 words account for more than half the book! (If you remember, we had 176250 tokens in the book.)
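
The last point of the cumulative plot is the share of all tokens covered by the top-k most frequent ones. A minimal sketch of that calculation, using `collections.Counter` and a hypothetical helper name `top_k_coverage` on toy data:

```python
from collections import Counter

def top_k_coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(c for _, c in counts.most_common(k))
    return covered / total

toy = "the cat sat on the mat and the dog sat too".split()
print(top_k_coverage(toy, 2))  # share covered by the two most frequent tokens
```

With the real data, the same function could be called as `top_k_coverage(tokens, 100)`.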

Let's take a look at the actual words of the text. If the frequent words don't help us, how about the words that occur only once, the so-called hapaxes? View them by calling `fdist.hapaxes()`:

In [None]:
fdist.hapaxes()

In [None]:
print(len(fdist.hapaxes()))

So out of the 7687 unique words, 2666 appear only once in the text. However, these 2666 occurrences are a tiny part of the 175682 total words: about 1.5% of the text.
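
The arithmetic can be checked directly. The sketch below reuses the three figures quoted in the text and adds a small, NLTK-free `hapaxes` helper (a hypothetical name) to show what is being counted.

```python
from collections import Counter

def hapaxes(tokens):
    """Tokens that occur exactly once, in order of first appearance."""
    counts = Counter(tokens)
    return [tok for tok, c in counts.items() if c == 1]

# Figures quoted in the text above.
n_hapaxes, n_unique, n_total = 2666, 7687, 175682

share_of_vocabulary = n_hapaxes / n_unique
share_of_text = n_hapaxes / n_total
print(f"{share_of_vocabulary:.1%} of the vocabulary, {share_of_text:.1%} of the text")
```

So the hapaxes are roughly a third of the distinct words, yet only a sliver of the running text.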

### Zipf's Law

Zipf's law says that the frequencies of words in text follow a power law: a few words account for a big fraction of the text (the very frequent ones, usually just the "plumbing" of English), while a large fraction of the unique vocabulary (the "hapaxes") appears very infrequently.

In [None]:
fdist.plot(100, cumulative=False)

In [None]:
fdist.plot(100, cumulative=True)
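
One common way to state Zipf's law is that the frequency of the word at rank r is roughly proportional to 1/r^s, with s close to 1. On a log-log scale that is a straight line with slope -s. The sketch below estimates that slope by ordinary least squares; the `zipf_slope` helper and the synthetic corpus are assumptions made for illustration, and it needs at least two distinct tokens.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic corpus built so that the word of rank r occurs 1200 // r times.
synthetic = [f"w{r}" for r in range(1, 21) for _ in range(1200 // r)]
print(round(zipf_slope(synthetic), 2))
```

On real text, `zipf_slope(tokens)` would typically give a value near -1, though the fit is rarely exact at the extremes of the ranking.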

### Stopwords

NLTK contains a corpus of stopwords, that is, high-frequency words like `the`, `to` and `also` that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.

In [None]:
from nltk.corpus import stopwords

print(stopwords.words('english'))

Let's define a function that removes the words in a text that are in the stopwords list:

In [None]:
# Use a separate name so the imported `stopwords` corpus reader is not overwritten
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.extend(['one', 'may', 'would', 'many']) # add a few more stopwords

def get_most_frequent_words(text, top):
    content = [
        w.lower() for w in text
        if w.lower() not in stopword_list # the word should not be a stopword
        and w.isalpha() # and should consist of letters only (no numbers or punctuation)
    ]
    return nltk.FreqDist(content).most_common(top)

# get the top-10 most frequent, non-stopwords in the text
text_nostopwords = get_most_frequent_words(tokens, 10)

print(text_nostopwords)
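
To make the filtering step easy to follow, here is the same idea on a toy token list, without NLTK. The stopword set is a tiny hand-written stand-in (NLTK's English list is much longer), and `most_frequent_content_words` is a hypothetical name.

```python
from collections import Counter

# Tiny hand-written stopword set, for illustration only.
toy_stopwords = {"the", "of", "and", "in", "to", "a", "is"}

def most_frequent_content_words(tokens, top, stopword_set):
    """Lowercase, drop stopwords and non-alphabetic tokens, then count."""
    kept = [
        tok.lower() for tok in tokens
        if tok.isalpha() and tok.lower() not in stopword_set
    ]
    return Counter(kept).most_common(top)

toy_tokens = ["The", "origin", "of", "species", ",", "and", "the", "Species", "in", "1859", "."]
result = most_frequent_content_words(toy_tokens, 2, toy_stopwords)
print(result)
```

Using a `set` for the stopwords keeps each membership test cheap, which matters on a book-length token list.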

In [None]:
# Dispersion plot
text = nltk.Text(tokens)
text.dispersion_plot([token for token, frequency in text_nostopwords])

### Summary

* A frequency distribution is a collection of items along with their frequency counts (e.g., the words of a text and their frequency of appearance).
* Tokenization is the segmentation of a text into basic units, or tokens, such as words and punctuation. Tokenization based on whitespace is inadequate for many applications because it bundles punctuation together with words. NLTK provides an off-the-shelf tokenizer, `nltk.word_tokenize()`.
* Zipf's law indicates that there are a few words that appear very often but there is also a large number of words that appear infrequently.
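
As a quick illustration of the tokenization point in the summary, the sketch below compares whitespace splitting with a very simple regular-expression tokenizer. The regex is a hypothetical minimal alternative, not NLTK's tokenizer, and it would still break abbreviations such as `Ph.D.`.

```python
import re

sentence = "Hello, world! Is this tokenized well?"

# Whitespace splitting leaves punctuation glued to the words.
whitespace_tokens = sentence.split()

# Minimal alternative: runs of word characters, or single non-space symbols.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```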