## Explore word frequencies for a curated dataset

This notebook shows how to explore the word frequencies in your dataset. The following processes are described:

* Importing your dataset
* Discovering the size and contents of your dataset
* Getting raw frequency counts of the document using `Counter` from `collections`
* Cleaning up the corpus using `stopwords` from `nltk.corpus`, a part of the Natural Language Toolkit
* Creating a new list of the most common words by frequency

A familiarity with the `Counter` datatype is helpful for understanding how this notebook sums word frequencies.
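
As a quick refresher, the sketch below shows the two `Counter` behaviors this notebook relies on: counts for the same key add together, and a missing key reads as 0. The token counts are made-up values for illustration, not data from the dataset.

```python
from collections import Counter

# Hypothetical token counts for two short passages (made-up values).
page_one = Counter({"the": 3, "sword": 1})
page_two = Counter({"the": 2, "hero": 1})

# Adding two Counters sums the counts key by key.
combined = page_one + page_two

# A missing key reads as 0 instead of raising KeyError,
# so a count can be incremented without checking for the key first.
combined["othello"] += 4

print(combined.most_common(2))
```

Here `most_common(2)` lists "the" (5) and "othello" (4), the two highest counts.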
____
First, we import `Dataset` from the `tdm_client` library. The `tdm_client` library contains functions for connecting to the JSTOR server containing our [corpus](./key-terms.ipynb#corpus) [dataset](./key-terms.ipynb#dataset).

In [1]:
from tdm_client import Dataset

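Lastly, we import `Counter` from the `collections` library and `stopwords` from the `nltk` library. If NLTK reports that the stopwords corpus is missing, it can be downloaded with `nltk.download('stopwords')`.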

In [2]:
from collections import Counter
from nltk.corpus import stopwords

To analyze your dataset, use the [dataset ID](./key-terms.ipynb/#dataset-ID) provided when you created your [dataset](./key-terms.ipynb/#dataset). A copy of your [dataset ID](./key-terms.ipynb/#dataset-ID) was sent to your email when you created your [corpus](./key-terms.ipynb#corpus). It should look like a long series of characters separated by dashes.

We create a new variable **dset** and initialize its value using the **Dataset** function. A sample **dataset ID** featuring Shakespeare Quarterly (1950-2014) is provided here ('59c090b6-3851-3c65-e016-9181833b4a2c'). Pasting your unique **dataset ID** here will import your dataset from the JSTOR server.

**Note**: If you are curious what is in your dataset, there is a download link in the email you received. The format and content of the files are described in the notebook [Building a Dataset](./1-building-a-dataset.ipynb).

In [3]:
dset = Dataset('59c090b6-3851-3c65-e016-9181833b4a2c')

Find the total number of documents in the dataset using the `len()` function.

In [4]:
len(dset)

6687

Let's select a single document from the dataset and examine its word frequencies. We create a new variable `my_doc` and initialize it to an item in our dataset. We have arbitrarily chosen index 2278 here, but any index from 0 to 6686 would be suitable for this analysis.
We also evaluate `my_doc` to display the stable JSTOR link that identifies the item.

In [5]:
my_doc = dset.items[2278]
my_doc

'http://www.jstor.org/stable/2867774'

If we paste this URL into a browser's address bar, we can see the article is "Shakespeare and the Middling Sort" by Theodore B. Leinwand.

___
Now, let's inspect the individual words from the article. First, we create a new variable `article_features` that will contain the extracted features from the dataset object. We can accomplish this using the `get_feature` method. This will copy the dictionary of terms and their frequencies for the article.

In [28]:
article_features = dset.get_feature(my_doc)

Next, we will use the `Counter` class from `collections`. This turns our dictionary into a `Counter` object, which makes the counts easier to sum and rank.

In [34]:
word_freq = Counter(article_features)

Next, we loop through the most common tokens in our new `Counter` object `word_freq`. The `most_common` method returns a list of the n most common elements and their counts. Here n=25, but you could set this to any value you like. If n is omitted, `most_common` returns all of the tokens, ordered from most to least common.

We print each token using the `ljust()` method, which left-justifies it in a fixed-width field (30 characters here), so the tokens and the counts line up in neat columns.

In [41]:
for token, count in word_freq.most_common(25):
    print(token.ljust(30), count)

the                            208
of                             122
to                             85
in                             65
and                            62
is                             39
that                           38
The                            32
be                             28
for                            28
a                              27
as                             25
by                             24
this                           24
his                            20
with                           19
he                             18
Othello                        17
reading                        17
it                             16
have                           15
which                          15
has                            13
on                             13
we                             13


There are a couple of immediate problems with these preliminary results:
1. There are many [function words](./key-terms.ipynb#function-words) (the, in, of) that we may not be interested in.
2. The words represented here are case-sensitive [strings](./key-terms.ipynb#string), so "the" and "The" are counted separately.

The most frequent word is "the", which occurs 208 times. We can also see that "The", with a capital "T", is listed as occurring 32 times. We need to find a way to remove common [function words](./key-terms.ipynb#function-words) and to combine [strings](./key-terms.ipynb#string) that differ only in capitalization.

We can solve these issues by:

1. Using a [stopwords](./key-terms.ipynb#stop-words) list to remove common [function words](./key-terms.ipynb#function-words)
2. Lowercasing all the characters in the text and combining the counts
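
To make the second step concrete, here is a minimal sketch that merges the two counts for "the" and "The" reported in the output above. The names `raw_counts` and `merged` are illustrative only and are not used elsewhere in this notebook.

```python
from collections import Counter

# Two counts taken from the output above; "the" and "The" are separate keys.
raw_counts = Counter({"the": 208, "The": 32})

merged = Counter()
for token, count in raw_counts.items():
    # Lowercasing maps both spellings to the same key, so their counts add up.
    merged[token.lower()] += count

print(merged["the"])  # 240
```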

We will use NLTK's [stopwords](./key-terms.ipynb#stop-words) to get started. We create a new list variable `stop_words` and initialize it with the common English [stopwords](./key-terms.ipynb#stop-words) from the [Natural Language Toolkit](./key-terms.ipynb#nltk) library. We'll print a slice of the first ten words in `stop_words` to get a preview.

In [45]:
stop_words = stopwords.words('english')
stop_words[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

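We may want to add additional words to our stoplist. For example, we may want to remove character names. We can add an item to the list by using the `append()` method. Because tokens are lowercased before they are compared with the stoplist, any word we add should be lowercase.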

In [70]:
stop_words.append("hamlet")
stop_words[-10:] #evaluate and show me a slice of the last 10 items in the `stop_words` list variable

["shouldn't",
 'wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't",
 'hamlet']

We can also add multiple words to our stoplist by using the `extend()` method. Notice that the words are wrapped in a set of brackets `[]` so that "gertrude" and "horatio" are added as two list items.

In [71]:
stop_words.extend(["gertrude", "horatio"])
stop_words[-10:] #evaluate and show me a slice of the last 10 items in the `stop_words` list variable

["wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't",
 'hamlet',
 'gertrude',
 'horatio']
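
The brackets matter because `extend()` adds every element of whatever iterable it receives, and a string is an iterable of characters. The sketch below uses two small hypothetical lists (`names` and `letters`) in place of `stop_words` to show the difference.

```python
names = ["hamlet"]  # hypothetical list standing in for `stop_words`

# With brackets: the two strings are added as two list items.
names.extend(["gertrude", "horatio"])

# Without brackets: the string is iterated, so each character is added separately.
letters = ["hamlet"]
letters.extend("ab")

print(names)    # ['hamlet', 'gertrude', 'horatio']
print(letters)  # ['hamlet', 'a', 'b']
```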

We can also remove words from our list with the `remove()` method.

In [72]:
stop_words.remove("hamlet")
stop_words.remove("gertrude")
stop_words.remove("horatio")
stop_words[-10:] #evaluate and show me a slice of the last 10 items in the `stop_words` list variable
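
One thing to watch for: `remove()` raises a `ValueError` when the word is not in the list, for example if the cell above is run twice. The sketch below, using a small hypothetical list named `words`, shows the error and one way to guard against it.

```python
words = ["hamlet", "gertrude"]  # hypothetical list standing in for `stop_words`

words.remove("hamlet")  # succeeds: "hamlet" is in the list

try:
    words.remove("hamlet")  # fails: it was already removed
except ValueError:
    print("'hamlet' is not in the list")

# Checking membership first avoids the exception.
if "horatio" in words:
    words.remove("horatio")

print(words)  # ['gertrude']
```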

Now that we have a good stopwords list, let's use it to standardize and [clean](./key-terms.ipynb#clean-data) up the [tokens](./key-terms.ipynb#token) in our [dataset](./key-terms.ipynb#dataset). The loop in the next cell will:
* discard [tokens](./key-terms.ipynb#token) shorter than 4 characters
* discard [tokens](./key-terms.ipynb#token) with non-alphabetical characters
* lowercase all characters in each [token](./key-terms.ipynb#token)
* remove [stopwords](./key-terms.ipynb#stop-words) based on the list we created in `stop_words`

In [73]:
clean_word_freq = Counter()  # new, empty Counter that will hold the cleaned token counts
for token, count in word_freq.items():
    # require tokens to be 4+ characters
    if len(token) < 4:
        continue
    # require tokens to be alphabetical
    if not token.isalpha():
        continue
    # lower case
    t = token.lower()
    # don't include stopwords
    if t in stop_words:
        continue
    clean_word_freq[t] += count
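
If the same cleaning needs to be applied to several documents, the loop can be wrapped in a function. The following is a minimal sketch, not part of the original notebook: the name `clean_counts`, its `min_length` parameter, and the sample data are all assumptions made for illustration. It also converts the stoplist to a set, since membership tests on a set are faster than on a list.

```python
from collections import Counter

def clean_counts(raw_counts, stop_words, min_length=4):
    """Return a new Counter of lowercased, alphabetic, non-stopword tokens."""
    stop_set = set(stop_words)  # set lookups are faster than list lookups
    cleaned = Counter()
    for token, count in raw_counts.items():
        # Skip short tokens and tokens with non-alphabetic characters.
        if len(token) < min_length or not token.isalpha():
            continue
        lowered = token.lower()
        if lowered in stop_set:
            continue
        # Case variants share one lowercase key, so their counts combine.
        cleaned[lowered] += count
    return cleaned

# Small made-up sample rather than real data from the dataset.
sample = Counter({"Othello": 2, "othello": 3, "the": 10, "which": 4, "1604": 1, "folio": 5})
result = clean_counts(sample, stop_words=["which"])
print(result)
```

In this sample only "othello" (2 + 3 = 5) and "folio" (5) survive: "the" is too short, "1604" is not alphabetic, and "which" is on the stoplist.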


In [20]:
# `htrc_corrections` is not defined anywhere in this notebook. It is assumed to be
# a dictionary mapping common OCR errors to corrected spellings; an empty
# placeholder is used here so that the function can run.
htrc_corrections = {}

def process_token(token):
    """Return a cleaned, lowercased token, or None if it should be discarded."""
    token = token.lower()
    # look up a correction for a common OCR error, if one exists
    corrected = htrc_corrections.get(token)
    if corrected is not None:
        token = corrected
    # discard tokens shorter than four characters
    if len(token) < 4:
        return None
    # discard tokens containing non-alphabetic characters
    if not token.isalpha():
        return None
    return token

In [21]:
for token, count in clean_word_freq.most_common(25):
    print(token.ljust(30), count)

othello                        19
reading                        17
shakespeare                    12
folio                          11
first                          8
iago                           8
also                           8
dramatic                       8
quarto                         7
text                           7
sword                          6
even                           5
context                        5
word                           5
desdemona                      4
john                           4
readings                       4
spirit                         4
upon                           4
walker                         4
bloody                         4
considered                     4
editors                        4
expression                     4
hero                           4
