## Explore word frequencies for a curated dataset

This notebook shows how to explore the word frequencies in your dataset. The following processes are described:

* Importing your dataset
* Discovering the size and contents of your dataset
* Turning your dataset into a pandas dataframe
* Visualizing the contents of your dataset as a graph with pandas

A familiarity with pandas is helpful but not required.
____
We import `Dataset` from the `tdm_client` library. The `tdm_client` library contains functions for connecting to the JSTOR server that hosts our [corpus](./key-terms.ipynb#corpus) and [dataset](./key-terms.ipynb#dataset). To analyze your dataset, use the [dataset ID](./key-terms.ipynb/#dataset-ID) provided when you created your [dataset](./key-terms.ipynb/#dataset). A copy of your [dataset ID](./key-terms.ipynb/#dataset-ID) was emailed to you when you created your [corpus](./key-terms.ipynb#corpus). It should look like a long series of characters separated by dashes.

In [17]:
from tdm_client import Dataset

We import the [pandas](./key-terms.ipynb#pandas) module to help visualize and manipulate our data. Importing `as pd` allows us to call pandas' functions using the short phrase `pd` instead of typing out `pandas` each time. 

In [18]:
import pandas as pd

Lastly, we import `Counter` from the `collections` module and `stopwords` from the `nltk` library.

In [19]:
from collections import Counter
from nltk.corpus import stopwords

We create a new variable **dset** and initialize its value using the **Dataset** function. A sample **dataset ID** featuring Shakespeare Quarterly (1950-2014) is provided here ('59c090b6-3851-3c65-e016-9181833b4a2c'). Pasting your unique **dataset ID** here will import your dataset from the JSTOR server.

**Note**: If you are curious about what is in your dataset, there is a download link in the email you received. The format and content of the files are described in the notebook [Building a Dataset](./1-building-a-dataset.ipynb).

In [4]:
dset = Dataset('59c090b6-3851-3c65-e016-9181833b4a2c')

Find the total number of documents in the dataset using the `len()` function.

In [5]:
len(dset)

6687

Let's select a single document from the dataset and examine its word frequencies.

We create a new variable `my_doc` and initialize it to item 2278 of our dataset. Any valid index works; use 0 to select the first item.
Here, we also return `my_doc` to view the stable JSTOR link that identifies the item.

In [21]:
my_doc = dset.items[2278]
my_doc

'http://www.jstor.org/stable/2871420'

If we paste this URL into a browser's address bar, we can see the article is "Shakespeare and the Middling Sort" by Theodore B. Leinwand.

___
Now, let's inspect the individual words from the article. First, we create a new variable `article_features` that will contain the extracted features from the dataset object. We can accomplish this using the `get_feature` method. This will copy the dictionary of terms and their frequencies for the article.

In [27]:
article_features = dset.get_feature(my_doc)

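Next, we use `Counter` from `collections` to turn the dictionary of terms into a counter object. We can then print the 25 most frequent tokens alongside their counts.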

In [26]:
word_freq = Counter(article_features)
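To see what `Counter` adds on top of a plain dictionary, here is a minimal sketch using a small hand-made dictionary. The name `sample_features` and its contents are illustrative stand-ins, assuming the extracted features behave like an ordinary dictionary of tokens and counts.

```python
from collections import Counter

# Hypothetical stand-in for the dictionary of tokens and counts
# extracted from an article (values are for illustration only)
sample_features = {"the": 617, "of": 504, "middling": 59, "The": 73}

sample_freq = Counter(sample_features)

# most_common(n) returns (token, count) pairs, highest count first
top_two = sample_freq.most_common(2)
print(top_two)

# Unlike a plain dict, a Counter returns 0 for a token it has never seen
missing = sample_freq["hamlet"]
print(missing)
```

Note that `the` and `The` are counted as separate tokens, because `Counter` keys are case-sensitive.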

In [13]:
for token, count in word_freq.most_common(25):
    print(token.ljust(30), count)

the                            617
of                             504
and                            426
in                             311
to                             276
a                              191
that                           135
with                           75
The                            73
for                            72
as                             69
his                            67
pp.                            65
is                             62
middling                       59
not                            58
or                             56
their                          55
was                            54
have                           53
we                             48
were                           48
on                             45
p.                             45
In                             43


There is a lot of noise in these unigrams: mixed case, punctuation, and very common words. Let's use NLTK's stopwords and a couple of simple transformations to make a cleaner word frequency list.

In [32]:
stop_words = set(stopwords.words('english'))
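NLTK's English stop-word list is stored in lowercase, so a token needs to be lowercased before it is checked against the set. The sketch below illustrates this with a tiny hand-written set. `sample_stop_words` is a stand-in for the real list, not its actual contents. If the call to `stopwords.words('english')` raises a `LookupError`, the corpus may first need to be fetched with `nltk.download('stopwords')`.

```python
# Small hand-written stand-in for NLTK's English stop-word list (assumption)
sample_stop_words = {"the", "of", "and", "in"}

tokens = ["The", "the", "middling", "In"]

# A case-sensitive lookup misses capitalized stop words
naive_hits = [t for t in tokens if t in sample_stop_words]

# Lowercasing each token first catches them
lowered_hits = [t for t in tokens if t.lower() in sample_stop_words]

print(naive_hits)
print(lowered_hits)
```

Storing the stop words in a `set` rather than a list also keeps each membership check fast, which matters when looping over every token in an article.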

In [29]:
clean_word_freq = Counter()
for token, count in word_freq.items():
    # require tokens to be 4+ characters
    if len(token) < 4:
        continue
    # require tokens to be alphabetical
    if not token.isalpha():
        continue
    # lower case
    t = token.lower()
    # don't include stopwords
    if t in stop_words:
        continue
    clean_word_freq[t] += count

In [30]:
for token, count in clean_word_freq.most_common(25):
    print(token.ljust(30), count)

middling                       71
sort                           48
shakespeare                    47
london                         36
poor                           29
social                         29
would                          29
among                          25
citizens                       25
also                           24
early                          23
might                          22
english                        20
even                           20
coriolanus                     19
modern                         16
middle                         15
people                         15
much                           15
england                        14
terms                          14
citizen                        13
city                           13
urban                          13
percent                        13
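
The cleaning steps above can also be wrapped in a reusable function so the same filters can be applied to other documents. This is only a sketch: the function name `clean_counts`, the `min_length` parameter, and the sample data are hypothetical and are not part of `tdm_client`.

```python
from collections import Counter


def clean_counts(raw_counts, stop_words, min_length=4):
    """Return a Counter of lowercased, alphabetic, non-stop-word tokens."""
    cleaned = Counter()
    for token, count in raw_counts.items():
        # Skip short tokens and anything containing digits or punctuation
        if len(token) < min_length or not token.isalpha():
            continue
        lowered = token.lower()
        # Skip stop words (the stop-word set is assumed to be lowercase)
        if lowered in stop_words:
            continue
        # Counts for different capitalizations are merged under one key
        cleaned[lowered] += count
    return cleaned


# Illustrative data only; these counts are made up for the example
sample_counts = {"middling": 59, "Middling": 12, "pp.": 65,
                 "the": 617, "with": 75, "sort": 48}
sample_stops = {"the", "with"}

result = clean_counts(sample_counts, sample_stops)
for token, count in result.most_common():
    print(token.ljust(30), count)
```

Because tokens are lowercased before counting, differently capitalized forms of a word are merged, which is why a word's cleaned count can be higher than the count of any single raw token.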
