## Explore word frequencies for a curated dataset

This notebook shows how to explore the word frequencies in your dataset. The following processes are described:

* Importing your dataset
* Discovering the size and contents of your dataset
* Turning your dataset into a pandas dataframe
* Visualizing the contents of your dataset as a graph with pandas

A familiarity with pandas is helpful but not required.
____
We import `Dataset` from the `tdm_client` library. The `tdm_client` library contains functions for connecting to the JSTOR server containing our [corpus](./key-terms.ipynb#corpus) [dataset](./key-terms.ipynb#dataset). To analyze your dataset, use the [dataset ID](./key-terms.ipynb/#dataset-ID) provided when you created your [dataset](./key-terms.ipynb/#dataset). A copy of your [dataset ID](./key-terms.ipynb/#dataset-ID) was sent to your email when you created your [corpus](./key-terms.ipynb#corpus). It should look like a long series of characters separated by dashes.

In [3]:
from tdm_client import Dataset

We import the [pandas](./key-terms.ipynb#pandas) module to help visualize and manipulate our data. Importing `as pd` allows us to call pandas' functions using the short phrase `pd` instead of typing out `pandas` each time. 

In [4]:
import pandas as pd

Lastly, we import `Counter` from the `collections` module and `stopwords` from the `nltk` library.

In [5]:
from collections import Counter
from nltk.corpus import stopwords
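
For readers new to `Counter`, here is a minimal sketch of how it tallies items. The sample token list is invented for illustration and is not drawn from the dataset.

```python
from collections import Counter

# Invented sample tokens, for illustration only
sample_tokens = ["the", "play", "the", "king", "play", "the"]

sample_counts = Counter()
for token in sample_tokens:
    # Missing keys default to 0, so no initialization is needed
    sample_counts[token] += 1

# most_common(n) returns the n highest counts as (item, count) pairs
top_two = sample_counts.most_common(2)
print(top_two)
```

The same pattern, incrementing a `Counter` inside a loop and then calling `most_common()`, is what the later cells rely on.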

We create a new variable **dset** and initialize its value using the **Dataset** function. A sample **dataset ID** for a dataset of journals focused on Shakespeare is provided here ('a517ef1f-0794-48e4-bea1-ac4fb8b312b4'). Replacing it with your unique **dataset ID** will import your dataset from the JSTOR server.

**Note**: If you are curious what is in your dataset, there is a download link in the email you received. The format and content of the files are described in the notebook [Building a Dataset](./1-building-a-dataset.ipynb).

In [6]:
dset = Dataset('a517ef1f-0794-48e4-bea1-ac4fb8b312b4')
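
Before pasting your own ID, it can help to check that it was copied completely. The sketch below assumes that dataset IDs follow the standard UUID layout, as the sample ID does. That is an inference from the sample, not something the `tdm_client` library is known to guarantee, and the helper name `looks_like_dataset_id` is hypothetical.

```python
import uuid

def looks_like_dataset_id(value):
    """Return True if value parses as a UUID (the assumed ID format)."""
    try:
        uuid.UUID(value)
    except ValueError:
        # Raised when the string is not a well-formed UUID
        return False
    return True

sample_ok = looks_like_dataset_id("a517ef1f-0794-48e4-bea1-ac4fb8b312b4")
sample_bad = looks_like_dataset_id("not-a-dataset-id")
print(sample_ok, sample_bad)
```

A check like this only catches copy-and-paste mistakes; it cannot tell whether the ID exists on the server.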

Find the total number of documents in the dataset using the `len()` function.

In [7]:
len(dset)

1000

Let's select a single volume from the dataset and examine its word frequencies. We'll start with the first item in our dataset.

We create a new variable `my_doc` and initialize it to item 0 of our dataset.
We can also display `my_doc` to see the stable JSTOR link that identifies the item.

In [10]:
my_doc = dset.items[0]
my_doc

'http://www.jstor.org/stable/i40075057'

Let's inspect the individual words in the volume. First, we create a new variable `volume_features` that will contain the extracted features for the volume. We can accomplish this using the dataset object's `get_feature` method, passing in `my_doc`.

In [11]:
volume_features = dset.get_feature(my_doc)

In [12]:
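# Tally how often each token appears across all pages of the volume
word_freq = Counter()
for page in volume_features['features']['pages']:
    body = page['body']
    # skip pages that have no body text
    if body is None:
        continue
    for token, pos_count in body['tokenPosCount'].items():
        # add the counts for every part-of-speech tag of this token
        for pos, count in pos_count.items():
            word_freq[token] += count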
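
The loop above implies a nested layout: each page has a `body`, each body has a `tokenPosCount` mapping, and each token maps to counts keyed by part-of-speech tag. The sketch below runs the same aggregation over a small hypothetical stand-in. The key names are taken from the loop, while the tokens, tags, and counts are invented.

```python
from collections import Counter

# Hypothetical stand-in for the object returned by dset.get_feature();
# only the keys used by the loop above are included, and the values are invented.
mock_features = {
    "features": {
        "pages": [
            {"body": {"tokenPosCount": {"play": {"NN": 2, "VB": 1}, "the": {"DT": 4}}}},
            {"body": None},  # a page with no body text is skipped
            {"body": {"tokenPosCount": {"the": {"DT": 3}}}},
        ]
    }
}

mock_freq = Counter()
for page in mock_features["features"]["pages"]:
    body = page["body"]
    if body is None:
        continue
    for token, pos_count in body["tokenPosCount"].items():
        # Sum over all part-of-speech tags to get one total per token
        mock_freq[token] += sum(pos_count.values())

print(mock_freq.most_common())
```

Using `sum(pos_count.values())` is equivalent to the inner `for pos, count` loop; it collapses the per-tag counts into a single total for each token.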

In [13]:
for token, count in word_freq.most_common(25):
    print(token.ljust(30), count)

the                            3661
,                              3553
of                             2528
.                              2286
and                            1709
to                             1515
'                              1294
in                             1217
a                              1208
is                             1078
that                           716
's                             642
as                             512
for                            490
it                             468
his                            443
)                              436
(                              436
with                           396
The                            370
which                          354
be                             344
by                             333
I                              325
are                            315


There is a lot of noise in these unigrams: mixed case, punctuation, and very common words. Let's use NLTK's stop word list and a few simple filters (minimum length, alphabetic characters only, lowercasing) to make a cleaner word frequency list.

In [None]:
stop_words = set(stopwords.words('english'))

In [None]:
word_freq = Counter()
for page in volume_features['features']['pages']:
    body = page['body']
    if body is None:
        continue
    for token, pos_count in body['tokenPosCount'].items():
        # require tokens to be 4+ characters
        if len(token) < 4:
            continue
        # require tokens to be alphabetical
        if not token.isalpha():
            continue
        # lower case
        t = token.lower()
        if t in stop_words:
            continue
        for pos, count in pos_count.items():
            word_freq[t] += count

In [None]:
for token, count in word_freq.most_common(25):
    print(token.ljust(30), count)
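
The filtering rules can also be gathered into one small function, which makes them easier to test and adjust. This is a sketch only: the helper name `clean_token`, the `min_length` parameter, and the tiny stop word set are assumptions made for illustration, standing in for NLTK's full English list.

```python
from collections import Counter

# Small stand-in for NLTK's English stop word list (assumption, for illustration)
demo_stop_words = {"that", "with", "which", "this", "from"}

def clean_token(token, stop_words, min_length=4):
    """Return the lowercased token, or None if it should be discarded."""
    # Drop short tokens and anything containing non-alphabetic characters
    if len(token) < min_length or not token.isalpha():
        return None
    lowered = token.lower()
    # Drop very common words
    return None if lowered in stop_words else lowered

# Invented token counts, for illustration only
demo_counts = {"Hamlet": 5, "hamlet": 2, "that": 9, "'s": 4, "the": 7, "1603": 1}

demo_freq = Counter()
for token, count in demo_counts.items():
    cleaned = clean_token(token, demo_stop_words)
    if cleaned is not None:
        demo_freq[cleaned] += count

print(demo_freq)
```

Because tokens are lowercased before counting, `Hamlet` and `hamlet` are merged into a single entry.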