# Tokenization

In this notebook, you'll see how to use the NLTK library to tokenize and normalize text data.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter

from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize
from nltk.corpus import stopwords

from nltk.stem import PorterStemmer, WordNetLemmatizer

In this notebook, we'll be working with the full text of Moby Dick.

In [None]:
with open('../data/moby_dick.txt', encoding = 'utf-8') as fi:
    moby = fi.read()

In [None]:
moby[:1000]

First, let's split into sentences. For this, we can use the `sent_tokenize` function.

In [None]:
sentences = sent_tokenize(moby)

In [None]:
print(sentences[1000])

If we want to split a sentence into tokens, we can use one of NLTK's word tokenizers.

In [None]:
i = 1000
print(sentences[i])
word_tokenize(sentences[i])

Notice how `word_tokenize` counts punctuation marks as tokens. If we want more control over what counts as a token, we can use the `regexp_tokenize` function, which lets us specify a regular expression pattern that defines a token.

For example, we can look for word characters using `\w`, which matches a single letter, digit, or underscore. Adding the `+` quantifier, as in `\w+`, matches a run of one or more such characters.

In [None]:
regexp_tokenize(sentences[1000], r'\w+')
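
To make the difference between `\w` and `\w+` concrete, here is a minimal sketch using Python's standard `re` module on an illustrative sentence. It assumes that `regexp_tokenize` with a simple pattern like this returns the same matches as `re.findall`, which is my understanding of its default behavior rather than something shown in this notebook.

```python
import re

# Illustrative sentence, not read from the data file
sample = "Call me Ishmael, won't you?"

# \w on its own matches exactly one word character at a time
single_chars = re.findall(r'\w', sample)

# \w+ matches runs of one or more word characters
words = re.findall(r'\w+', sample)

print(single_chars[:4])
print(words)
```

Note that `\w+` splits "won't" into two tokens at the apostrophe, because an apostrophe is not a word character.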

Let's extend the pattern so that hyphens and apostrophes are also kept as part of a token, and then count up the most frequent tokens using the `Counter` class. This will create a dictionary-like object whose keys are the tokens and whose values are the frequency counts.

In [None]:
moby_counter = Counter(regexp_tokenize(moby, r'[-\'\w]+'))

In [None]:
moby_counter['whale']

In [None]:
moby_counter['Ishmael']
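
A few properties of `Counter` lookups are worth keeping in mind when reading these counts. The sketch below uses a short illustrative string and the standard `re` module in place of `regexp_tokenize` (assumed to give the same matches for this pattern).

```python
import re
from collections import Counter

# Illustrative text, not read from the data file
sample = "The whale saw the Whale. The whale's tail rose."

# Same character class as above: hyphens, apostrophes, and word characters
counts = Counter(re.findall(r"[-'\w]+", sample))

print(counts['whale'])    # lookups are case-sensitive
print(counts['Whale'])    # counted separately from 'whale'
print(counts["whale's"])  # the apostrophe keeps the possessive as its own token
print(counts['harpoon'])  # a missing key gives 0 instead of raising KeyError
```

Because the counts are case-sensitive, a lookup for a single spelling can understate how often a word appears.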

Let's see how large a vocabulary we have.

In [None]:
len(moby_counter)

We can also count the total number of words.

In [None]:
sum(moby_counter.values())
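
The two numbers above measure different things: the length of the counter is the number of distinct tokens, while the sum of its values is the number of token occurrences. A minimal sketch with a hand-written token list (hypothetical, for illustration only):

```python
from collections import Counter

# Hypothetical token list standing in for the tokenized novel
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']
counts = Counter(tokens)

vocabulary_size = len(counts)        # distinct tokens
total_tokens = sum(counts.values())  # all occurrences, repeats included

print(vocabulary_size, total_tokens)
```

On Python 3.10 or later, `counts.total()` should give the same result as `sum(counts.values())`.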

We can also see the most common words.

In [None]:
moby_counter.most_common()
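
Called with no argument, `most_common` returns every entry in the counter, which for a full novel is a very long list. Passing an integer limits the output to that many entries, as in this small sketch with a hypothetical token list:

```python
from collections import Counter

# Hypothetical token list, for illustration only
counts = Counter(['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship'])

# Only the two most frequent tokens, as (token, count) pairs
top_two = counts.most_common(2)
print(top_two)
```

In the notebook, something like `moby_counter.most_common(20)` would keep the cell output short.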

You'll notice that the most common words include a large number of words like "the" and "of". These can be considered "stop words", and in certain applications are less interesting and can be removed.

NLTK includes lists of stop words.

In [None]:
stop_words = set(stopwords.words('english'))

stop_words

Let's make two modifications to our counter above. First, we'll convert all text to lowercase, and then we'll remove stop words.

In [None]:
moby_counter = Counter([x.lower() for x in regexp_tokenize(moby, r'[-\'\w]+') if x.lower() not in stop_words])
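
To see what these two steps do to the counts, here is a small sketch on an illustrative string. The stop word set is a tiny hypothetical stand-in for NLTK's English list, and `re.findall` is used in place of `regexp_tokenize` (assumed equivalent for this pattern).

```python
import re
from collections import Counter

# Illustrative text and a tiny, hypothetical stop word set
sample = "The whale and the sea. The Whale is of the sea."
toy_stop_words = {'the', 'and', 'is', 'of'}

tokens = re.findall(r"[-'\w]+", sample)

# Counts before any normalization
raw_counts = Counter(tokens)

# Lowercase each token once, then drop stop words
clean_counts = Counter(
    lowered
    for lowered in (t.lower() for t in tokens)
    if lowered not in toy_stop_words
)

print(raw_counts)
print(clean_counts)
```

Lowercasing merges "whale" and "Whale" into one entry, and filtering removes the function words that otherwise dominate the counts.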

In [None]:
len(moby_counter)

In [None]:
moby_counter.most_common()

We might also want to do further preprocessing of our text. Let's try stemming and lemmatization. First, let's look at stemming. We'll try out the PorterStemmer from NLTK.

In [None]:
porter = PorterStemmer()

To use it, you just need to call the `.stem` method and pass in the token to be stemmed.

In [None]:
porter.stem('whales')

In [None]:
porter.stem('numerical')

In [None]:
moby_counter = Counter([porter.stem(x.lower()) for x in regexp_tokenize(moby, r'[-\'\w]+') if x.lower() not in stop_words])

In [None]:
len(moby_counter)

In [None]:
moby_counter.most_common()

One disadvantage of stemming is that you can end up with stems that are not real words.

In [None]:
porter.stem('remove')

We might instead try a lemmatizer, like the WordNetLemmatizer from NLTK.

In [None]:
wnl = WordNetLemmatizer()

In [None]:
wnl.lemmatize('whales')

In [None]:
moby_counter = Counter([wnl.lemmatize(x.lower()) for x in regexp_tokenize(moby, r'[-\'\w]+') if x.lower() not in stop_words])

In [None]:
len(moby_counter)

In [None]:
moby_counter.most_common()

Finally, let's look at the distribution of word frequencies, that is, how many words appear a given number of times.

In [None]:
plt.figure(figsize = (10,6))
plt.hist(list(moby_counter.values()), bins = 50, edgecolor = 'black')
plt.xlabel('Number of Appearances')
plt.ylabel('Number of Words');

In [None]:
pd.Series(moby_counter.values()).describe()
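
Another way to summarize this distribution is to count how many words share each frequency, for example how many words appear exactly once. A minimal sketch with a hypothetical token list:

```python
from collections import Counter

# Hypothetical token list, for illustration only
counts = Counter(['whale', 'whale', 'whale', 'sea', 'sea',
                  'ship', 'harpoon', 'mast'])

# Maps "number of appearances" to "number of words with that many appearances"
freq_of_freqs = Counter(counts.values())

singletons = freq_of_freqs[1]          # words that appear exactly once
share_once = singletons / len(counts)  # fraction of the vocabulary

print(freq_of_freqs)
print(singletons, share_once)
```

Applied to `moby_counter`, the same two lines would show how much of the vocabulary sits in the leftmost bar of the histogram.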

If we want a fun way to visualize the most frequent words in the text, we can use a word cloud.

In [None]:
# %conda install wordcloud

In [None]:
from wordcloud import WordCloud

In [None]:
wc = WordCloud()

wc.generate_from_frequencies(moby_counter)
plt.figure(figsize = (10,6))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off");