# Workbook 2: Counting Words

Determining the frequency of words is about as fundamental as it gets. And it really is useful. When I teach "The Most Dangerous Game" to undergraduates, I like to throw up a spreadsheet that simply lists all the words of the story, from most frequent to least, paired with their number of occurrences. I then ask students to go through the list and draw a line where the words become significant. A lot of the most common words, as we will see in a moment, do not carry much semantic weight -- they function, as linguists observe, as syntactic glue. The same can be said for words that occur infrequently, and I ask students to draw another line. Scanning through a frequency distribution, as such a list is called, is one place to start asking questions that you can then operationalize in your code.

In the first workbook, we learned how to load a single file and to tokenize its contents such that we could work with it. In this workbook we are going to learn how to count words in a text. Like last time, we will see how to do this using basic Python tools and then explore how to do this with the NLTK. We will also explore some of the graphing possibilities available in Python. 

## Table of Contents <a name="toc"></a>

2.1. [The Built-In Way](#builtin)  
2.2. [The PANDAS Way](#pandas)   
2.3. [The NLTK Way](#nltk)  

Like last time, we start by loading our text. Here we use the one-liner we discussed, and then, because our first two methods of creating a term frequency distribution (another way of describing a list of words and their counts) both start from a list, we go ahead and turn our long string of text into a list of words. That is where almost all our work will begin, so get used to doing it. Please note that, like last time, our list contains only letters and apostrophes, with all the words in lower case.

Even though we aren't going to use the imported regular expression module until after we load the text, it is customary in Python, as I believe it is in other languages, to list your imports at the beginning of a block of code.

In [None]:
import re

# First we load our file into a string
mdg = open('texts/mdg.txt', 'r').read()

# Then we turn that string into a list of words
mdg_words = re.sub("[^a-zA-Z']"," ", mdg).lower().split()
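
To see what the substitute-and-split line does, here is a small sketch run on a made-up sentence rather than on the story itself (the `sample` string is an assumption for illustration only):

```python
import re

# A made-up sentence, standing in for the text of the story
sample = "Rainsford's eyes -- wide, alert -- saw nothing; it was 10 o'clock."

# Replace anything that is not a letter or an apostrophe with a space,
# lowercase the result, and split on whitespace
sample_words = re.sub("[^a-zA-Z']", " ", sample).lower().split()

print(sample_words)
```

Note that the digits disappear along with the punctuation, while the apostrophes inside words survive.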

## The Built-In Way <a name="builtin"></a>

> Wanna get Capone? Here's how you get him. He pulls a knife, you pull a gun, he sends one of yours to the hospital, you send one of his to the morgue. That's the Chicago way, and that's how you get Capone. -- _The Untouchables_

The first thing to know about Python, like a lot of languages (both natural and programming), is that there is usually more than one way to do something. While it's often the case that some of the powerful libraries, like PANDAS and NLTK, can do a lot for you without you having to write a lot of code, they are themselves built, in many instances, using either basic Python or some of the foundational libraries that are commonly installed.

Our first way to compile a list of words and their frequencies uses one of the standard Python data types, a dictionary. A dictionary is always in the form of a **key** paired with a **value**, and you can recognize a dictionary by the use of curly braces, {}, whereas lists use square brackets, []. (There is yet another data type, a tuple, that uses parentheses. More on that in a moment.)
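
As a quick illustration of the three bracket styles, here is a sketch with made-up values (none of these names are used elsewhere in the workbook):

```python
# A list: an ordered collection in square brackets
word_list = ["the", "hunter", "the"]

# A dictionary: key-value pairs in curly braces
word_counts = {"the": 2, "hunter": 1}

# A tuple: a fixed group of values in parentheses
pair = ("the", 2)

print(word_list[1])        # look up a list item by position
print(word_counts["the"])  # look up a dictionary value by key
print(pair[0], pair[1])    # tuples are also indexed by position
```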

Let's load the code below, and then we can describe what it does:

In [None]:
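mdg_dict = {}
for word in mdg_words:
    try:
        # We have seen this word before: add 1 to its count
        mdg_dict[word] += 1
    except KeyError:
        # First encounter: the key does not exist yet, so start at 1
        mdg_dict[word] = 1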

Here's a breakdown:

* **`mdg_dict = {}`** creates an empty dictionary into which we are going to place our key-value pairs, which will be the words along with the number of times they occur.
* in the **`for`** loop that follows, the **`try`** block attempts to add **`1`** to the word's count. If this is the first time we are encountering the word, it is not yet in the dictionary, the attempt fails, and the **`except`** block gives it a value of **`1`** instead.
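
The same counting can be written without `try`/`except`. The sketch below runs on a short made-up list rather than on `mdg_words`. It first uses the dictionary's `get` method, which returns a default (here `0`) when the key is missing, and then `collections.Counter` from the standard library, which does the whole job in one step:

```python
from collections import Counter

# A short made-up word list standing in for mdg_words
words = ["the", "general", "smiled", "the", "general", "the"]

# Version 1: dict.get supplies 0 the first time a word is seen
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Version 2: Counter builds the same word-to-count mapping directly
counter = Counter(words)

print(counts)
print(dict(counter) == counts)
```

Which version to prefer is largely a matter of taste; the loop makes the logic visible, while `Counter` is shorter.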

Having created a dictionary of word-count pairs, we can query it on various words to see how many times they occur. In the code below, I have "hunter" but you could also try hunted, night, Zaroff, Rainsford, and jungle. (Try it, but remember to lowercase Zaroff and Rainsford!)

In [None]:
mdg_dict["hunter"]

There's more work we can do here, especially if we wanted to sort our dictionary from most frequent to least frequent, but, for the time being, let's move on to another way to count words.
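
For readers who would like to see the sorting now, one possible approach is sketched below on a small made-up dictionary. It uses the built-in `sorted` on the dictionary's `items()`, which are (word, count) tuples, and sorts on the count:

```python
# A small made-up dictionary standing in for mdg_dict
toy_dict = {"hunter": 2, "the": 5, "jungle": 3}

# items() yields (word, count) tuples; sort on the count, largest first
by_frequency = sorted(toy_dict.items(), key=lambda pair: pair[1], reverse=True)

for word, count in by_frequency:
    print(word, count)
```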

## The PANDAS Way <a name="pandas"></a>

The **`pandas`** library is useful when you find yourself wanting to work with data that looks a lot like a spreadsheet: it allows you to create rows and columns and then navigate them as part of its dataframe functionality. Less-mentioned than the dataframe is the pandas series, which still packs a lot of punch. 

In the code below, we take our list of words and create a pandas series with it -- imagine it like a single-columned spreadsheet. We can then work through the spreadsheet, compiling data as we go.

In [None]:
import pandas as pd

mdg_series = pd.Series(mdg_words)

print(mdg_series[0:5])

In [None]:
mdg_counts = mdg_series.value_counts()
print(mdg_counts[0:5])

In [None]:
%pylab inline
figsize(12, 6)

Below we graph the 50 most frequent words, but feel free to put in your own values and then press the RUN button above or simply CTRL-RETURN. (One word of warning: 50 words is about as many as will fit horizontally and still be readable.)

In [None]:
# The slice end is exclusive, so 0:50 selects the first 50 rows
mdg_counts.iloc[0:50].plot(kind='bar')

`pandas` makes it easy to save our results to a CSV file:

In [None]:
mdg_counts.to_csv('mdg_word_freq.csv')

## The NLTK Way <a name="nltk"></a>

In [None]:
import nltk

# Build the frequency distribution one sentence, then one word, at a time
fdist = nltk.FreqDist()
for sentence in nltk.tokenize.sent_tokenize(mdg):
    for word in nltk.tokenize.word_tokenize(sentence):
        fdist[word] += 1

In [None]:
print(fdist.most_common(10))

In [None]:
sentences = nltk.tokenize.sent_tokenize(mdg)
# print(sentence for sentence in sentences[0:10])

In [None]:
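# word_tokenize expects a single string, so we tokenize one sentence at a time
tokens = [word for sentence in sentences for word in nltk.tokenize.word_tokenize(sentence)]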

## Counting Words

From our experiments above we learn that approximately two thousand distinct words are spread out over 8000 places. If averaged over the entire text, each word appears 4 times, but looking over our sorted counts above, we can see that the word **and** alone appears more than 160 times. And it's not even in the top 5 of most used words!

In order they are:

    the, 512
    a, 258
    he, 248
    i, 177
    of, 172
    and, 164

The list above is the start of a word frequency list. There are a number of ways to do this, and I will include those in separate files for your reference, but since we have started with the NLTK, I thought we would stay with it. We have a choice to make, however: we can either continue to import one tool at a time from the NLTK library, or we can just say to ourselves that we're going to be playing with a lot of the tools, so why not just bring them all into our workspace?

Please note that once I've imported all of the `nltk` library, I need to tell Python that a particular tool, or function, comes from that library. Sometimes functions from two different, and large, libraries can have the same name; prepending the library name is one way that Python has of avoiding what are called "namespace conflicts," which is a fancy way of saying you can't call two things by the same name. You've seen both ways of doing things now.
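
The two import styles can be compared side by side. To keep the sketch runnable without extra installations, it uses the standard library's `math` module as a stand-in for a large library such as `nltk`; the local `sqrt` function is a made-up name clash:

```python
import math

# A made-up function of our own that happens to share a library name
def sqrt(number):
    return "my own sqrt of {}".format(number)

# With the library name prepended there is no confusion
print(math.sqrt(16))  # the library's function
print(sqrt(16))       # our function

# Importing the name directly replaces ours under the same name
from math import sqrt
print(sqrt(16))       # now the library's function
```

After the `from math import sqrt` line, our own function is no longer reachable by that name, which is exactly the kind of conflict the prefixed style avoids.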

In [None]:
%pylab inline
figsize(12, 6)

In [None]:
import nltk

In [None]:
freq_dist = nltk.FreqDist(mdg_words)

In [None]:
type(freq_dist)

In [None]:
for word, frequency in freq_dist.most_common(10):
    print('{}:  {}'.format(word, frequency))

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Let's graph the 50 most frequent words:
# =-=-=-=-=-=-=-=-=-=-= 

# Passing a number to plot() limits the graph to that many of the most frequent words
freq_dist.plot(50)

### ngrams

In [None]:
all_counts = dict()
for size in 2, 3, 4, 5:
    all_counts[size] = nltk.FreqDist(nltk.ngrams(mdg_words, size))


In [None]:
all_counts[5].most_common(5)
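
For readers wondering what an n-gram looks like, here is a minimal sketch in plain Python on a made-up list. It is not how `nltk.ngrams` is implemented, only an illustration of the idea for pairs of adjacent words (bigrams):

```python
from collections import Counter

# A made-up stand-in for mdg_words
words = ["the", "most", "dangerous", "game", "the", "most"]

# Pair each word with the one that follows it
bigrams = list(zip(words, words[1:]))

# Count how often each pair occurs
bigram_counts = Counter(bigrams)

print(bigram_counts.most_common(1))
```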

On average, a word occurs four times in "The Most Dangerous Game."

Out of curiosity, how many words occur four times?

In [None]:
# mdg_words is the list of lowercased words we built at the top of the workbook
wordfrequency = nltk.FreqDist(mdg_words)
four_times = [word for word in wordfrequency if wordfrequency[word] == 4]
print(four_times)

In [None]:
# Count occurrences of a single word in our list of words
mdg_words.count("dangerous")

### Lexical Diversity

In [None]:
def lexical_diversity(text):
    return len(text) / len(set(text))
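
A short usage sketch, on a made-up list rather than on `mdg_words`: as defined above, the function divides the total number of words by the number of distinct words, so it reports the average number of times each distinct word is used. (The function is repeated here only so the cell runs on its own.)

```python
def lexical_diversity(text):
    # Total words divided by distinct words
    return len(text) / len(set(text))

# A made-up stand-in for mdg_words
toy_words = ["the", "sea", "was", "as", "flat", "as", "the", "sea"]

print(len(toy_words), len(set(toy_words)))
print(lexical_diversity(toy_words))
```

Run on the real `mdg_words`, this should land near the figure of about four uses per word mentioned earlier.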