# Class 16 - Strings II - Text processing with NLTK

We are going to discuss how to process large text documents using the [Natural Language Toolkit](http://www.nltk.org) (NLTK) library. 

We first have to download some data corpora and libraries to use NLTK. Running this block of code *should* pop up a new window with four blue tabs: Collections, Corpora, Models, All Packages. Under Collections, select the entry with "book" in the Identifier column and click Download. Once the status "Finished downloading collection 'book'." prints in the grey bar at the bottom, you can close this pop-up.

![](http://www.nltk.org/images/nltk-downloader.png)

## NLTK

In [None]:
import nltk
nltk.download()

You should only need to do the download step once. In the future, you can start from the cell below.

In [None]:
from collections import Counter
import re
import string
import nltk
import requests
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from nltk.book import text1, text2, text3, text4, text5, text6, text7, text8, text9
text_list = [text1, text2, text3, text4, text5, text6, text7, text8, text9]

We can also download text from the web and load it into an NLTK Text object. Let's get something from [Project Gutenberg's Top 100 list](https://www.gutenberg.org/browse/scores/top), like Charles Dickens's "A Tale of Two Cities."

In [None]:
# From recitation (assumes atotc_raw has already been fetched; see the cell that calls requests.get below)
# re.S makes "." match newline characters too, so ".*" can span many lines of text
# until the regex reaches the Project Gutenberg footer line
re.search(r"(I. The Period.*)End of the Project Gutenberg EBook of A Tale of Two Cities, by Charles Dickens.*$",
          atotc_raw, re.S)

In [None]:
# From recitation
# .groups() returns a tuple of the captured (parenthesized) groups; this pattern has one
# Indexing with [0] and calling .strip() pulls out that string and trims surrounding whitespace
re.search(r"(I. The Period.*)End of the Project Gutenberg EBook of A Tale of Two Cities, by Charles Dickens.*$",
          atotc_raw, re.S).groups()

In [None]:
# Get the text file from the web
atotc_raw = requests.get('https://www.gutenberg.org/files/98/98-0.txt').text

# Write a regular expression to get everything between these two lines of text
atotc_core = re.search(r"(I. The Period.*)End of the Project Gutenberg EBook of A Tale of Two Cities, by Charles Dickens.*$",
                     atotc_raw, re.S).groups()[0].strip()

# Tokenize the raw text
atotc_tokenized = nltk.word_tokenize(atotc_core)

# Convert to a nltk Text object
text10 = nltk.Text(atotc_tokenized)

# Give the text a formal name, like the other text objects
text10.name = 'A Tale of Two Cities by Charles Dickens 1859'
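
To see what the `re.S` flag and the capture group are doing, here is a minimal sketch on a small made-up string. The markers `START` and `END OF BOOK` and the `toy_*` names are hypothetical stand-ins for the Gutenberg header and footer lines, not anything in the real file.

```python
import re

# Hypothetical miniature "ebook": a header, the body we want, and a footer
toy_raw = ("License header\nSTART\nIt was the best of times,\n"
           "it was the worst of times.\nEND OF BOOK\nLicense footer")

# Without re.S, "." does not match newlines, so the pattern cannot span lines
no_flag = re.search(r"(START.*)END OF BOOK.*$", toy_raw)

# With re.S, ".*" matches across newlines; the parentheses capture the body
with_flag = re.search(r"(START.*)END OF BOOK.*$", toy_raw, re.S)
toy_core = with_flag.groups()[0].strip()

print(no_flag)   # None: the match fails without the flag
print(toy_core)
```

Note that `re.search` returns `None` when nothing matches, so calling `.groups()` on a failed search raises an `AttributeError`. If the extraction cell fails that way, the marker text probably does not appear in the downloaded file exactly as written.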

In [None]:
# Add this text object to the text_list
text_list.append(text10)

How long is each body of text? We can use `len` on a `Text` object.

In [None]:
for t in text_list:
    print("{0} has {1:,} words\n".format(t.name,len(t)))

How many unique words are in each text document? We can call `set` on a Text object.

In [None]:
for t in text_list:
    print("{0} has {1:,} unique words\n".format(t.name,len(set(t))))

We can define a function that measures the [lexical diversity](https://en.wikipedia.org/wiki/Lexical_diversity) of a text as the number of unique words divided by the total number of words. If every word were used only once, the diversity would be 100%; if the same word were repeated for the entire length of the document, the diversity would approach 0%.

In [None]:
def lexical_diversity(text):
    return len(set(text))/len(text)

for t in text_list:
    print("{0} has a lexical diversity of {1:.2}\n".format(t.name,lexical_diversity(t)))
    # {1:.2} shows two significant digits; use {1:.1%} to print a percentage instead
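
As a quick sanity check of the two extremes described above, here is a small sketch on made-up token lists. The lists are hypothetical, and the function repeats the same formula so that the block runs on its own.

```python
def lexical_diversity(tokens):
    """Fraction of tokens that are distinct (same formula as above)."""
    return len(set(tokens)) / len(tokens)

# Every word used exactly once
all_different = ["call", "me", "ishmael"]

# One word repeated for the whole "document"
all_same = ["whale"] * 1000

print("{0:.1%}".format(lexical_diversity(all_different)))  # 100.0%
print("{0:.1%}".format(lexical_diversity(all_same)))       # 0.1%
```

The second value is small but not exactly zero, because a non-empty text always contains at least one unique word.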

What is the longest word in each text?

In [None]:
def find_longest_word(text):
    longest = ''
    for word in set(text):
        if len(word) > len(longest):
            longest = word
    return longest

for t in text_list:
    print("The longest word in {0} is: {1}\n".format(t.name,find_longest_word(t)))

We can also measure the frequency distribution of how often a word is used in a corpus.

In [None]:
_t = text1 #moby dick 
fdist_text1 = nltk.FreqDist(_t)
fdist_text1.most_common(25)

We can also plot the distribution of how often words occur in the corpus. We get an interesting pattern called the [Zipf distribution](https://en.wikipedia.org/wiki/Zipf%27s_law). There are many words that occur only once (upper left) and a few words that occur thousands of times (lower right), and the points in between fall along a roughly straight line when both axes are logarithmic.

In [None]:
counter_text1 = Counter(fdist_text1.values())

f,ax = plt.subplots(1,1)
ax.scatter(list(counter_text1.keys()),list(counter_text1.values()))
ax.set_ylim((1e-1,1e5))
ax.set_xscale('log') # Use powers-of-10 (log) scales on both axes; on linear scales the pattern is hard to see
ax.set_yscale('log')
ax.set_title(_t.name)
ax.set_xlabel('Number of occurrences')
ax.set_ylabel('Number of words')
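
The plot relies on counting twice: first how often each word occurs, then how many words share each occurrence count. A minimal sketch on a made-up sentence (the `toy_tokens` list is hypothetical) makes the two passes explicit.

```python
from collections import Counter

toy_tokens = "the whale and the sea and the ship".split()

# First pass: how many times does each word occur?
word_counts = Counter(toy_tokens)

# Second pass: how many words share each occurrence count?
count_of_counts = Counter(word_counts.values())

print(word_counts.most_common())
print(sorted(count_of_counts.items()))  # [(1, 3), (2, 1), (3, 1)]
```

Here three words occur once, one word occurs twice, and one word occurs three times. Those `(occurrences, number of words)` pairs are the x and y values in the scatter plot.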

An important part of processing natural language data is normalizing this data by removing variations in the text that the computer naively thinks are different entities but humans recognize as being the same. There are several steps to this including case adjustment and stemming/lemmatization.

In the case of case adjustment, it turns out several of the different "words" in the corpus are actually the same, but because they have different capitalizations, they're counted as different unique words. Explore how many five-letter words are the same, just with different capitalizations.

![](http://www.nltk.org/images/pipeline1.png)

## Sentence segmenting

A novel can be represented as a single large string, but this huge string isn't very helpful for analyzing features of the text until it is segmented into sentences or "tokens", which include words but also hyphenated phrases and contractions ("aren't", "doesn't", *etc.*).

There are a variety of different segmentation/tokenization strategies (with different tradeoffs) and corresponding methods implemented in NLTK.

If we wanted to get all the sentences in a string, we could naively split the string on a period followed by a space using the string `split` method.

In [None]:
atotc_core.split('. ')[2:5]
# "Mrs. Southcott" gets split after "Mrs", so this isn't the best way to do it

This method splits only on a period followed by a space, not on a period followed by newline characters (`\r\n`), so it misses several sentence boundaries. We could use a regular expression to split on a period followed by any whitespace instead.

In [None]:
re.split(r'\.\s+',atotc_core)[2:10] # try other slices too, such as [3:8]
# Split on a period followed by one or more whitespace characters

Notice this sentence tokenizing fails for a phrase like "Mrs. Southcott had recently attained her five and twentieth blessed birthday...": it is split into two sentences when it should be one.
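
Both failure modes can be reproduced on a short made-up passage. This is only a sketch: `toy_text` is a hypothetical string written for the example, not a quote from the novel.

```python
import re

toy_text = ("Mrs. Southcott had a birthday. It was the year 1775.\n"
            "Spiritual revelations were conceded to England.")

# Splitting on ". " ignores the newline after "1775." and breaks after "Mrs."
space_split = toy_text.split(". ")

# Splitting on a period plus any whitespace catches the newline,
# but still breaks after the abbreviation "Mrs."
regex_split = re.split(r"\.\s+", toy_text)

print(len(space_split), space_split)
print(len(regex_split), regex_split)
```

The passage has three sentences. The plain split also returns three pieces, but the wrong ones: it cuts after "Mrs" and misses the boundary at the newline. The regular expression returns four pieces, because it finds the newline boundary but still cuts after "Mrs".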

NLTK has more specialized sentence tokenizers that deal with these kinds of cases. You should probably use these instead of trying to make your own.

In [None]:
nltk.sent_tokenize(atotc_core)[2:10]

## Word tokenizing
We may care less about sentences and more about individual words. Again, we could employ a naive approach of splitting on spaces.

In [None]:
space_tokens = atotc_core.split(' ')
space_tokens[0:50]

Again, this misses newline separators, so we might think we could use regular expressions.

In [None]:
re_tokens = re.split(r'\s+',atotc_core)
re_tokens[0:50]
# "times," keeps its trailing comma, so Python treats it as a different token from "times"

It's clear we want to separate words based on other punctuation as well so that "Darkness," and "Darkness" aren't treated like separate words. Again, NLTK has a variety of methods for doing word tokenization more intelligently.

`word_tokenize` is probably the easiest to recommend.

In [None]:
wt_tokens = nltk.word_tokenize(atotc_core)
wt_tokens[0:50]

But there are others, like `wordpunct_tokenize`, that make different assumptions about the language.

In [None]:
wpt_tokens = nltk.wordpunct_tokenize(atotc_core)
wpt_tokens[0:50]

`Toktok` is yet another word tokenizer.

In [None]:
toktok = nltk.ToktokTokenizer()
ttt_tokens = toktok.tokenize(atotc_core)
ttt_tokens[0:50]

Each of these methods returns a different word count because each makes different assumptions about word boundaries.

In [None]:
for name,tokenlist in zip(['space_split','re_tokenizer','word_tokenizer','wordpunct_tokenizer','toktok_tokenizer'],[space_tokens,re_tokens,wt_tokens,wpt_tokens,ttt_tokens]):
    print("{0:>20}: {1:,} words".format(name,len(tokenlist)))
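
To see why the counts differ, it helps to compare whitespace splitting with a rough punctuation-aware pattern on one short sentence. The pattern below is an illustrative assumption, not how `word_tokenize` works internally, and `toy_text` is a made-up example.

```python
import re

toy_text = "Darkness, it was the season of Darkness."

# Whitespace splitting leaves punctuation glued to the words
whitespace_tokens = toy_text.split()

# A rough punctuation-aware pattern: runs of word characters,
# or any single character that is neither a word character nor whitespace
punct_tokens = re.findall(r"\w+|[^\w\s]", toy_text)

print(len(whitespace_tokens), whitespace_tokens)
print(len(punct_tokens), punct_tokens)
```

Whitespace splitting yields 7 tokens and never produces a bare "Darkness". The punctuation-aware pattern yields 9 tokens and counts "Darkness" twice, with the comma and period as tokens of their own.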

### Mixed cases

Remember that strings of different cases (capitalizations) are treated as different words: "young" and "Young" are not the same. An important part of text processing is to remove un-needed variation, and mixed cases are variation we generally don't care about.

In [None]:
five_letter_words = [word for word in set(text1) if len(word) == 5]

print("There are {0:,} five-letter words in the corpus.".format(len(five_letter_words)))

In [None]:
mixed_case_tokens = []

for word1 in five_letter_words:
    for word2 in five_letter_words:
        if word1.lower() == word2.lower() and word1 != word2:
            mixed_case_tokens.append((word1,word2))

print("There are {0:,} five-letter words in the corpus that are the same but have different cases.".format(len(mixed_case_tokens)))
mixed_case_tokens[:10]
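
The nested loop above compares every pair of words and records each matching pair twice, once in each order. As an alternative sketch, the words can be grouped by their lower-cased form in a single pass. The `toy_words` list is a hypothetical sample; in the notebook it would be `five_letter_words`.

```python
from collections import defaultdict

# Hypothetical sample of tokens standing in for five_letter_words
toy_words = ["young", "Young", "WHALE", "whale", "Whale", "ships"]

# Group every spelling under its lower-cased form
variants = defaultdict(set)
for word in toy_words:
    variants[word.lower()].add(word)

# Keep only the lower-cased forms that have more than one spelling
mixed_case = {key: sorted(forms) for key, forms in variants.items() if len(forms) > 1}
print(mixed_case)
```

This reports each lower-cased word once, together with all of its capitalization variants, and it avoids comparing every word against every other word.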

How does the number of words in the document change after applying `.lower()` to everything?

In [None]:
text1_lowered = [i.lower() for i in text1.tokens]
print("There are {0:,} unique words in text1 before lowering and {1:,} after lowering".format(len(set(text1)),len(set(text1_lowered))))

### Removing stopwords

English, like many languages, repeats many words in typical language that don't always convey a lot of information by themselves. When we do text processing, we should make sure to remove these "stop words".

In [None]:
fdist_text10 = nltk.FreqDist(text10)
fdist_text10.most_common(25)

NLTK helpfully has a list of stopwords in different languages.

In [None]:
english_stopwords = nltk.corpus.stopwords.words('english')
english_stopwords[:10]

In [None]:
english_stopwords = nltk.corpus.stopwords.words('english')
len(english_stopwords)

We can also use the `string` module's `punctuation` attribute.

In [None]:
list(string.punctuation)[:10]

Let's combine them to get a list of `all_stopwords` that we can ignore.

In [None]:
all_stopwords = english_stopwords + list(string.punctuation)

We can use a list comprehension to exclude the words in this stopword list from analysis while also lower-casing each word. This is not perfect, but it is an improvement over what we had before.

In [None]:
text10_no_stopwords = [word.lower() for word in text10 if word.lower() not in all_stopwords]
fdist_text10_no_stopwords = nltk.FreqDist(text10_no_stopwords)
fdist_text10_no_stopwords.most_common(25)
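
The same filtering idea can be tried without downloading anything, using a tiny hand-written stopword list. Everything named `toy_*` below is a hypothetical stand-in; NLTK's English list is much longer. Storing the stopwords in a `set` rather than a list is a design choice made here: membership tests on a set do not slow down as the collection grows.

```python
import string

# Hypothetical, tiny stopword collection for illustration only
toy_stopwords = {"it", "was", "the", "of"} | set(string.punctuation)

toy_tokens = ["It", "was", "the", "best", "of", "times", ",",
              "it", "was", "the", "worst", "of", "times", "."]

# Lower-case each token once, then keep it only if it is not a stopword
kept = [lowered for lowered in (t.lower() for t in toy_tokens)
        if lowered not in toy_stopwords]

print(kept)  # ['best', 'times', 'worst', 'times']
```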

### Stemming and lemmatization
Another problem with natural language text is that plural (dogs vs. dog) and possessive (dog's vs. dog) forms, verb conjugations (walk, walks, walked, walking), and contractions (they're) are counted as unique words even if the underlying concepts are similar. Extracting [word stems](https://en.wikipedia.org/wiki/Word_stem) means removing prefixes and suffixes that result in a new token, but not a significantly new meaning.

We can use a variety of [stemming](https://en.wikipedia.org/wiki/Stemming) and [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) tools in NLTK to try to recover unique words stripped of any prefixes or suffixes.

In [None]:
[t.lower() for t in text1.tokens[10:50] if len(t) > 2]

In [None]:
porter = nltk.PorterStemmer()

[porter.stem(t.lower()) for t in text1.tokens[10:50] if len(t) > 2]
# Stemming lets us clean up the data by removing some of this variation
# It is fast, but the results can look odd because it strips word endings by rule, so a stem is not always a real word

After stemming the words in `text1`, how many unique words remain? 

Nearly half of the words in *Moby Dick* that were initially counted as unique were actually duplicates of other words!

In [None]:
text1_lowered_stemmed = set()

for t in set(text1):
    t_lower = t.lower()
    t_stemmed = porter.stem(t_lower)
    text1_lowered_stemmed.add(t_stemmed)
    
print("There are {0:,} unique words in text1 before and {1:,} after lowering and stemming".format(len(set(text1)),len(set(text1_lowered_stemmed))))

Lemmatization is a bit smarter about removing letters: it checks whether the word is a plural, conjugation, etc. of another word and then reduces it to the root word only if that root is in the dictionary. These lookups are expensive in comparison to stemming, which basically slices characters off the end of a string, but they give better-quality (though far from perfect) results. For example, "supplied" should have been reduced to "supply" and "dusting" to "dust".

In [None]:
wnl = nltk.WordNetLemmatizer()

[wnl.lemmatize(t.lower()) for t in text1.tokens[10:50] if len(t) > 2]
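
The difference between rule-based stripping and dictionary lookup can be sketched with two toy functions. To be clear about the assumptions: `toy_stem` is NOT the Porter algorithm and `toy_lemmas` is NOT WordNet. Both are hypothetical, deliberately tiny stand-ins that only illustrate the two strategies.

```python
def toy_stem(word):
    """Strip one common suffix, with no check that the result is a real word."""
    for suffix in ("ing", "ied", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real dictionary
toy_lemmas = {"supplied": "supply", "dusting": "dust", "whales": "whale"}

def toy_lemmatize(word):
    """Return the dictionary form if we know it, otherwise leave the word alone."""
    return toy_lemmas.get(word, word)

for word in ["supplied", "dusting", "whales", "this"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The rule-based function turns "supplied" into "suppl" and "this" into "thi", neither of which is a word. The lookup leaves "this" alone and maps "supplied" to "supply", but only because that entry happens to be in the table. That is the trade-off in miniature: rules are cheap and over-eager, while lookups are only as good as the dictionary behind them.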

After lemmatizing the words in `text1`, how many unique words remain? Lemmatizing isn't as aggressive as stemming, but there's still a 25% reduction in the total number of unique words!

In [None]:
text1_lowered_lemmatized = set()

for t in set(text1):
    t_lower = t.lower()
    t_lemmatized = wnl.lemmatize(t_lower)
    text1_lowered_lemmatized.add(t_lemmatized)
    
print("There are {0:,} unique words in text1 before and {1:,} after lowering and lemmatizing".format(len(set(text1)),len(set(text1_lowered_lemmatized))))

## Exercise

Download the raw text for "The Importance of Being Earnest" from [Project Gutenberg](http://www.gutenberg.org/cache/epub/844/pg844.txt) and save it as `tiobe_raw`.

In [None]:
tiobe_raw = requests.get('http://www.gutenberg.org/cache/epub/844/pg844.txt').text 

Adapt the regular expression used previously to get all the text between "FIRST ACT" and "END OF THE PROJECT GUTENBERG EBOOK THE IMPORTANCE OF BEING EARNEST" and save it as `tiobe_core`.

In [None]:
# End marker matches the one named in the instructions above
tiobe_core = re.search(r"(FIRST ACT.*)END OF THE PROJECT GUTENBERG EBOOK THE IMPORTANCE OF BEING EARNEST.*$",
                    tiobe_raw, re.S).groups()[0].strip()
tiobe_core[-100:]

Combine all the text processing steps into one function that accepts a string like `tiobe_core` and returns a list of lower-cased, stopword-removed, and lemmatized tokens.

In [None]:
all_stopwords = nltk.corpus.stopwords.words('english') + list(string.punctuation) + ['“','”','’','--']

def text_cleaner(s):
    # Lower-case everything in the string
    lower_s = s.lower() 
    
    # Tokenize the string
    tokenized_s = nltk.word_tokenize(lower_s)
    
    # Remove the stopwords from the list of tokens
    no_stopwords = [token for token in tokenized_s if token not in all_stopwords]
    #we're going token by token
    
    # lemmatize each token
    cleaned_tokens = [wnl.lemmatize(token) for token in no_stopwords] 
    
    # Return the cleaned_tokens
    return cleaned_tokens

Run the function on `tiobe_core`.

In [None]:
tiobe_tokens = text_cleaner(tiobe_core)

In [None]:
print("There are {0:,} tokens and {1:,} unique tokens in the corpus".format(len(tiobe_tokens),len(set(tiobe_tokens))))

Use `Counter` on `tiobe_tokens` to count how often each cleaned token occurs.

In [None]:
tiobe_token_frequencies = Counter(tiobe_tokens)
most_frequent_tokens = sorted(tiobe_token_frequencies.items(),key=lambda x:x[1],reverse=True)
most_frequent_tokens[:25]
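
Sorting the items by count works, and `Counter` also provides a `most_common` method that returns the same ordering. A small sketch on made-up tokens (the `toy_counts` data is hypothetical) shows the two side by side.

```python
from collections import Counter

toy_counts = Counter(["jack", "algernon", "jack", "cecily", "jack", "algernon"])

# Manual approach: sort the (token, count) pairs by count, largest first
by_sorting = sorted(toy_counts.items(), key=lambda x: x[1], reverse=True)

# Built-in approach: most_common() with no argument returns every pair, largest first
by_method = toy_counts.most_common()

print(by_method[:2])  # [('jack', 3), ('algernon', 2)]
```

Passing a number, as in `most_common(25)`, returns only that many of the most frequent pairs.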

Plot the word frequency distribution for the 50 most frequent words.

In [None]:
x = [word for (word,count) in most_frequent_tokens[:50]]
y = [count for (word,count) in most_frequent_tokens[:50]]

f,ax = plt.subplots(1,1,figsize=(12,6))
ax.plot(y,lw=3)
ax.set_xticks(range(len(x)))
ax.set_xticklabels(x,rotation=90)
ax.set_xlabel('Tokens')
ax.set_ylabel('Count')

Plot a distribution of the frequencies themselves. This is a classic distribution found throughout information science.

In [None]:
frequency_frequencies = Counter(tiobe_token_frequencies.values())

f,ax = plt.subplots(1,1,figsize=(12,6))
ax.scatter(list(frequency_frequencies.keys()),list(frequency_frequencies.values()))
#ax.scatter(frequency_frequencies.values(),frequency_frequencies.keys())
ax.set_xlabel('Number of occurrences')
ax.set_ylabel('Number of words')
ax.set_xscale('symlog')
ax.set_yscale('symlog')
ax.grid(True)
ax.set_xlim((0,1e3))
ax.set_ylim((0,1e4))