# 2. Doing Things with Words

## Table of Contents <a name="toc"></a>

**[Loading a File & Understanding What It Is](#file)**  
**[Tokenizing](#tokenizing)**  
[A Quick Note about Normalization](#norm)  
[Using regex to Tokenize](#REtoke)   
[Using NLTK Tokenizers](#nltktoken)  

The first line of code run _here_ is something internal to Jupyter Notebooks that allows us to place any graphical output into the page itself and not in a separate window or file. (We can still save output to a file, if we want.)

In [None]:
%pylab inline
figsize(12, 6)

## Loading a File & Understanding What It Is <a name="file"></a>

In [4]:
# The work of the two lines above can also be achieved in one line.
# See if your growing Pythonista abilities can't tell how this is done!
mdg = open('texts/mdg.txt', 'r').read()

In [None]:
# We need the regular expression library
import re

mdg_words = re.sub("[^a-zA-Z']"," ", mdg).lower().split()
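
To see what that one-liner does, here is a minimal sketch run on a made-up sentence (the `sample` string is an assumption for illustration, not a line from the story). Every character that is not a letter or an apostrophe becomes a space, and then the string is lowercased and split on whitespace:

```python
import re

# Hypothetical sample text, used only to illustrate the substitution.
sample = "Don't stop: it's 3 o'clock, Mr. Smith!"

# Replace anything that is not a letter or an apostrophe with a space,
# then lowercase the result and split it on runs of whitespace.
sample_words = re.sub("[^a-zA-Z']", " ", sample).lower().split()
print(sample_words)
```

Notice that the digit `3` disappears entirely: this pattern keeps letters and apostrophes only, so numbers are dropped along with punctuation.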

In [5]:
# In this instance, we are going to tell Python that we only want 
# one particular tool from the larger toolkit:

from nltk.tokenize import WhitespaceTokenizer

In [6]:
mdg_tokens = WhitespaceTokenizer().tokenize(mdg.lower())
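
Splitting on whitespace alone leaves punctuation attached to the words, which changes what counts as "the same word" later on. The sketch below compares the two approaches on a made-up sentence. It uses plain `str.split()` as a stand-in for `WhitespaceTokenizer` (an assumption: for ordinary text the two should produce the same tokens), so that it runs without NLTK:

```python
import re

# Hypothetical sample text, used only for the comparison.
sample = "Hello, world! Hello again."

# Whitespace splitting keeps punctuation glued to the words.
whitespace_tokens = sample.lower().split()

# The regex approach from earlier strips punctuation first.
regex_tokens = re.sub("[^a-zA-Z']", " ", sample).lower().split()

print(whitespace_tokens)
print(regex_tokens)
print(len(set(whitespace_tokens)), len(set(regex_tokens)))
```

With whitespace splitting, `hello,` and `hello` are treated as two different words, so the count of unique words comes out higher.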

What would happen if we were to read the text differently, if we were to read all the words, but this time with the words in alphabetical order?

In [None]:
print(sorted(mdg_tokens))

What's this? That's a lot of *a*s and *about*s. What happens if we look at just the words without repetition?

In [None]:
print(sorted(set(mdg_tokens)))

That looks like a lot of words. If we ask how many by counting how long the set of words is, we get: 

In [None]:
# Notice how I enclose the text I want to print in single quotes
# so that I can use double quotes in the text itself:
print('There are {} unique words in "The Most Dangerous Game."'.format(len(set(mdg_tokens))))
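
The difference between sorting, de-duplicating, and counting is easier to see on a tiny example. This is a minimal sketch with a made-up list of tokens (`toy_tokens` is hypothetical, not taken from the story):

```python
# Hypothetical toy text, split on whitespace.
toy_tokens = "the cat saw the dog and the dog saw the cat".split()

# sorted() keeps every repetition; set() keeps one copy of each word.
print(sorted(toy_tokens))
toy_types = sorted(set(toy_tokens))
print(toy_types)

print('{} tokens, {} unique words'.format(len(toy_tokens), len(toy_types)))
```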

### Stacked Tokenizers


In [None]:
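# sent_tokenize needs NLTK's Punkt sentence-tokenizer data to be installed
import nltk

# Split the text into sentences, then each sentence into words, counting as we go
fdist = nltk.FreqDist()
for sentence in nltk.tokenize.sent_tokenize(mdg):
    for word in nltk.tokenize.word_tokenize(sentence):
        fdist[word] += 1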

In [None]:
print(fdist.most_common(10))

In [None]:
sentences = nltk.tokenize.sent_tokenize(mdg)
# print(sentences[0:10])

In [None]:
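# word_tokenize expects a single string, so tokenize one sentence at a time
tokens = [nltk.tokenize.word_tokenize(sentence) for sentence in sentences]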

## Counting Words

From our experiments above, we learn that approximately two thousand unique words are spread out over about 8,000 places. Averaged over the entire text, each word appears 4 times, but looking over the sorted word list above, we can see that the counts are far from even: the word **and** alone appears more than 160 times. And it's not even in the top 5 most-used words!

In order, the six most frequent words are:

    the, 512
    a, 258
    he, 248
    i, 177
    of, 172
    and, 164

The list above is the start of a word frequency list. There are a number of ways to do this, and I will include those in separate files for your reference, but since we have started with the NLTK, I thought we would stay with it. We have a choice to make, however: we can either continue to import one tool at a time from the NLTK library, or we can just say to ourselves that we're going to be playing with a lot of the tools, so why not just bring them all into our workspace?

Please note that once I've imported all of the `nltk` library, I need to tell Python that a particular tool, or function, comes from that library. Sometimes functions from two different, and large, libraries can have the same name. Prepending the library name is one way that Python has of avoiding what are called "namespace conflicts," which is a fancy way of saying you can't call two things by the same name. You've seen both ways of doing things now.
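
Here is a small sketch of both import styles and of why the prefix matters. It uses two modules from Python's standard library, `math` and `cmath`, which were chosen for illustration because each has a function called `sqrt`; they are not part of this notebook's workflow:

```python
import math
import cmath
from math import sqrt  # import one tool by name, no prefix needed afterwards

# Imported by name: we call it directly.
print(sqrt(4))

# Imported as whole libraries: the prefix says which sqrt we mean.
real_root = math.sqrt(4)
complex_root = cmath.sqrt(-4)
print(real_root, complex_root)
```

Without the prefixes, Python could not tell the two `sqrt` functions apart: whichever name was imported last would win.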

In [None]:
import nltk

In [None]:
freq_dist = nltk.FreqDist(mdg_words)

In [None]:
type(freq_dist)

In [None]:
for word, frequency in freq_dist.most_common(10):
    print('{}:  {}'.format(word, frequency))
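
`FreqDist` builds on Python's own `collections.Counter`, so the same counting idea can be tried without NLTK. The following is a minimal sketch on a made-up list of words (`toy_words` is hypothetical), not a replacement for the cell above:

```python
from collections import Counter

# Hypothetical toy text, split on whitespace.
toy_words = "the cat saw the dog and the dog saw the cat".split()

# Counter maps each word to the number of times it occurs.
toy_counts = Counter(toy_words)

# most_common(n) returns (word, count) pairs, highest counts first.
for word, frequency in toy_counts.most_common(3):
    print('{}:  {}'.format(word, frequency))
```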

In [None]:
# Iterating over a FreqDist yields only the words; .items() yields (word, count) pairs
for word, frequency in freq_dist.items():
    print(word, frequency)

In [None]:
mdg_dict = {word: frequency for word, frequency in freq_dist.items()}

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Let's graph the 50 most frequent words:
# =-=-=-=-=-=-=-=-=-=-= 

# Passing a number to plot() limits the graph to that many of the top words
freq_dist.plot(50)

In [None]:
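# =-=-=-=-=-=-=-=-=-=-=
# Save these results to a CSV file (makes it easier for the Excel-impaired)
# =-=-=-=-=-=-=-=-=-=-= 

import pandas as pd

# `mdg_counts` is assumed here to be a table of the word counts in `freq_dist`
mdg_counts = pd.DataFrame(freq_dist.most_common(), columns=['Word', 'Frequency'])
mdg_counts.to_csv('../data/mdg_word_freq.csv', index=False)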

### ngrams

In [None]:
all_counts = dict()
for size in 2, 3, 4, 5:
    all_counts[size] = nltk.FreqDist(nltk.ngrams(mdg_words, size))


In [None]:
all_counts[5].most_common(5)
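
An n-gram is simply a run of *n* consecutive words. As a rough sketch of the idea, the hypothetical helper below (`simple_ngrams`, not part of NLTK) builds them by zipping the word list against shifted copies of itself, then counts them with `collections.Counter`:

```python
from collections import Counter

def simple_ngrams(words, size):
    """Return an iterator over consecutive tuples of `size` words (hypothetical helper)."""
    # zip stops at the shortest slice, so only complete n-grams are produced.
    return zip(*(words[i:] for i in range(size)))

# Hypothetical toy text, split on whitespace.
toy_words = "to be or not to be".split()

bigram_counts = Counter(simple_ngrams(toy_words, 2))
print(bigram_counts.most_common(1))
```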

In [None]:
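import matplotlib.pyplot as plt
import pandas as pd

# `df` is assumed here to hold the 50 most frequent words and their counts
df = pd.DataFrame(freq_dist.most_common(50), columns=['Word', 'Frequency'])

plt.style.use('ggplot')
ax = df.plot(kind='bar', x='Word', y='Frequency',
             title="Frequency of Words in MDG",
             figsize=(20, 10),
             legend=True)
ax.set_xlabel("Word")
ax.set_ylabel("Occurrences")
plt.show()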

In [None]:
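# concordance() is a method of nltk.Text, not of a plain list,
# and it prints its results directly rather than returning them.
mdg_text = nltk.Text(mdg_words)
mdg_text.concordance("dangerous")

In [None]:
# Words used in similar contexts, contexts shared by two words,
# and frequent word pairs (the example words here are chosen from this story)
mdg_text.similar("game")
mdg_text.common_contexts(["danger", "fear"])
mdg_text.collocations()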

In [None]:
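# Average occurrences per unique word in MDG, using the regex-tokenized words
# (this is the inverse of the type-token ratio often called lexical diversity):
len(mdg_words) / len(set(mdg_words))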

In [None]:
len(mdg_tokens) / len(set(mdg_tokens))

On average, a word occurs four times in "The Most Dangerous Game."
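
The ratio above divides the total number of tokens by the number of unique words. A minimal sketch on a made-up list (`toy_tokens` is hypothetical) shows the arithmetic, along with its inverse, the type-token ratio:

```python
# Hypothetical toy text, split on whitespace.
toy_tokens = "the cat saw the dog and the dog saw the cat".split()

# 11 tokens spread over 5 unique words.
average_occurrences = len(toy_tokens) / len(set(toy_tokens))

# The inverse: unique words as a share of all tokens.
type_token_ratio = len(set(toy_tokens)) / len(toy_tokens)

print(average_occurrences, type_token_ratio)
```

An average hides a lot: in this toy text, "the" occurs four times while "and" occurs only once.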

Out of curiosity, how many words occur four times?

In [None]:
wordfrequency = nltk.FreqDist(mdg_tokens)
four_times = [word for word in wordfrequency.keys() if wordfrequency[word] == 4]
print(four_times)

In [None]:
mdg_text.count("dangerous")

In [None]:
mdg_text.concordance("dangerous")
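
A concordance shows each occurrence of a word together with the words around it. The sketch below is a simplified stand-in, not NLTK's implementation: `simple_concordance` is a hypothetical helper that works on a plain list of tokens and counts context in words rather than characters.

```python
def simple_concordance(tokens, target, width=2):
    """Return each occurrence of `target` with `width` tokens of context (hypothetical helper)."""
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            # Uppercase the keyword so it stands out in its context.
            lines.append(' '.join(left + [token.upper()] + right))
    return lines

# Hypothetical toy text, split on whitespace.
toy_tokens = "the game is the most dangerous game of all".split()

concordance_lines = simple_concordance(toy_tokens, "game")
for line in concordance_lines:
    print(line)
```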

Where does "dangerous" occur within the larger text?

In [None]:
mdg_text.dispersion_plot(["dangerous", "danger", "game", "fear"])
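
A dispersion plot marks where each occurrence of a word falls along the length of the text. The underlying data is just a list of positions, which the hypothetical helper `word_offsets` below collects from a plain list of tokens (a sketch of the idea, not NLTK's code):

```python
def word_offsets(tokens, target):
    """Return the positions at which `target` occurs in `tokens` (hypothetical helper)."""
    return [i for i, token in enumerate(tokens) if token == target]

# Hypothetical toy text, split on whitespace.
toy_tokens = "the game is the most dangerous game of all".split()

offsets = {word: word_offsets(toy_tokens, word) for word in ["game", "dangerous"]}
print(offsets)
```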

In [None]:
wordfrequency.plot()