# Counting Words

Let's do some standard imports.

In [2]:
%matplotlib inline
import matplotlib
import nltk

## Unique words in a corpus

Again, let's use nltk to get the words in genesis as a list.

In [3]:
mygenesis = nltk.corpus.PlaintextCorpusReader("corpora", 'genesis.txt')
genesis_words = mygenesis.words()

Here's how long it is, in words. (nltk treats punctuation marks as words too, so they are included in the count.)

In [None]:
len(genesis_words)

**sets**

We've learned about strings and lists as types of objects in Python.

Another type is a `set`. Sets are like lists, except that each item can only appear once and the items are not kept in any particular order.

You can make a set by converting a list:

In [None]:
my_list = ['this', 'that', 'bruce', 'this']
my_set = set(my_list)
my_set
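Here is a small sketch, using the same toy list, of what the conversion changes. The duplicate is dropped, and because a set has no fixed order it is safer to sort it than to rely on the order in which it prints.

```python
my_list = ['this', 'that', 'bruce', 'this']
my_set = set(my_list)

# The duplicate 'this' is dropped, so the set is shorter than the list.
print(len(my_list), len(my_set))

# Sets are unordered; sort them when you need a predictable order.
print(sorted(my_set))

# Membership tests work the same way as with lists.
print('bruce' in my_set)
```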

So, we can use `set()` to figure out how many unique words there are in genesis.

In [None]:
genesis_set = set(genesis_words)
len(genesis_set)

Let's invent something called *lexical diversity* and define a function to compute it. (This is just for some practice with defining a function.)

In [None]:
def lexical_diversity(text):
    # Fraction of the words in text that are distinct: unique words / total words.
    set_text = set(text)
    return len(set_text) / len(text)

In [None]:
lexical_diversity(genesis_words)
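To check that the function does what we expect, it can help to try it on a list small enough to count by hand. The sketch below repeats the definition so it runs on its own; the word lists `toy_words` and `toy_case` are made up for illustration. It also shows that words are compared exactly, so `'The'` and `'the'` count as different words unless we lowercase them first.

```python
def lexical_diversity(text):
    # Fraction of the words in text that are distinct.
    return len(set(text)) / len(text)

# 4 unique words out of 5 total.
toy_words = ['the', 'cat', 'saw', 'the', 'dog']
print(lexical_diversity(toy_words))

# Capitalization matters: these four items are all different strings.
toy_case = ['The', 'the', 'THE', 'end']
print(lexical_diversity(toy_case))

# After lowercasing, only 'the' and 'end' remain as unique words.
toy_lower = [word.lower() for word in toy_case]
print(lexical_diversity(toy_lower))
```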

## Frequency distributions

A *frequency distribution* is a tabulation of how many times each unique word appears in a corpus. 

Let's compute the frequency distribution for genesis. To do that we will use nltk's `FreqDist`, which is designed for this purpose. These two lines do all of the work for us.

In [4]:
from nltk import FreqDist
fdist = FreqDist(genesis_words)

We can ask a `FreqDist` object to list the most common words for us, along with how many times each one appears.

In [None]:
fdist.most_common(25)

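We can also look up how many times a single word appears by indexing the `FreqDist` with that word.

In [5]: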
fdist["Adam"]

18
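To see what this tabulation involves, here is a sketch on a made-up word list using `collections.Counter` from Python's standard library. In nltk 3, `FreqDist` is built on top of `Counter`, so `most_common` and lookup by word should behave the same way there; the list `toy_words` is only an illustration.

```python
from collections import Counter

# A made-up word list, small enough to count by hand.
toy_words = ['in', 'the', 'beginning', 'the', 'earth',
             'and', 'the', 'heaven', 'and']

counts = Counter(toy_words)

# The two most frequent words, with their counts.
print(counts.most_common(2))

# Look up one word's count by indexing.
print(counts['the'])

# A word that never appears has a count of 0 rather than raising an error.
print(counts['missing'])
```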

A `FreqDist` object can also draw a plot of the distribution for us.

In [None]:
fdist.plot(25)

This plot shows the cumulative counts for the top 25 words: each point adds that word's count to the total of all the words before it.

In [None]:
fdist.plot(25, cumulative=True)
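A cumulative count is just a running total over the words, taken from most to least frequent. The sketch below works it out by hand on a made-up list, using `collections.Counter` and `itertools.accumulate` from the standard library rather than nltk; the names `toy_words`, `top`, and `running` are only for illustration.

```python
from collections import Counter
from itertools import accumulate

# A made-up word list: 'the' appears 3 times, 'and' twice, 'of' once.
toy_words = ['the', 'and', 'the', 'of', 'the', 'and']

# The top 3 words, most frequent first.
top = Counter(toy_words).most_common(3)
words = [word for word, count in top]

# Running total of the counts, in the same order as the words.
running = list(accumulate(count for word, count in top))

for word, total in zip(words, running):
    print(word, total)
```

The last running total equals the number of words covered by the top entries, which is why a cumulative plot always rises from left to right.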