## Explore word frequencies for a curated dataset

This notebook shows how to connect to a curated TDM dataset, look up the metadata for a volume, and explore that volume's word frequencies.

In [62]:
from collections import Counter

import pandas as pd
from nltk.corpus import stopwords

from tdm_client import Dataset

Initialize a TDM dataset object with the collection ID provided in the email you received after curating your collection in the Digital Scholars Workbench.

In [63]:
dset = Dataset('a517ef1f-0794-48e4-bea1-ac4fb8b312b4')

Find the total number of documents in the dataset.

In [64]:
len(dset)

1000

Let's examine frequent words for a volume in the collection.

In [65]:
my_doc = dset.items[2]
my_doc

'http://www.jstor.org/stable/i40075051'

In [66]:
# named `volume` because the cells below read volume['features']['pages']
volume = dset.get_feature(my_doc)

In [67]:
volume_metadata = [m for m in dset.get_metadata() if m['id'] == my_doc][0]
volume_metadata

{'id': 'http://www.jstor.org/stable/i40075051',
 'journalTitle': 'Critical Survey',
 'pageCount': 127,
 'provider': 'jstor',
 'title': 'Shakespeare and the Cultures of Commemoration',
 'wordCount': 57179,
 'yearPublished': 2010}

In [68]:
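word_freq = Counter()
for page in volume['features']['pages']:
    body = page['body']
    # skip pages that have no body text
    if body is None:
        continue
    for token, pos in body['tokenPosCount'].items():
        # adds 1 per page, so each count is the number of pages containing the token
        word_freq[token] += 1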

In [69]:
for token, count in word_freq.most_common(25):
    print(token.ljust(30), count)

,                              123
.                              122
the                            120
a                              119
and                            118
in                             118
of                             118
to                             115
is                             111
that                           111
The                            107
as                             106
by                             102
with                           100
it                             99
which                          98
"                              96
not                            94
this                           93
from                           91
for                            90
's                             90
or                             87
was                            86
his                            85
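
Note that the loop above adds one per page for each token, so these numbers are page counts rather than total occurrences, which is why the top values sit just under the volume's 127 pages. Below is a minimal sketch of summing occurrences instead. It assumes each `tokenPosCount` entry maps a token to a dictionary of part-of-speech tags and their counts; that layout is an assumption suggested by the `token, pos` loop variables, not something confirmed here. The `sample_pages` data and the `count_occurrences` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical sample shaped like the pages used above; the assumed layout
# is {token: {part_of_speech_tag: count}}.
sample_pages = [
    {'body': {'tokenPosCount': {'the': {'DT': 3}, 'play': {'NN': 1, 'VB': 2}}}},
    {'body': None},  # pages without body text are skipped
    {'body': {'tokenPosCount': {'the': {'DT': 2}}}},
]


def count_occurrences(pages):
    """Sum every part-of-speech count for each token across all pages."""
    totals = Counter()
    for page in pages:
        body = page['body']
        if body is None:
            continue
        for token, pos_counts in body['tokenPosCount'].items():
            totals[token] += sum(pos_counts.values())
    return totals


occurrences = count_occurrences(sample_pages)
print(occurrences.most_common())
```

If the assumption about the layout holds, the same function could be called on `volume['features']['pages']` to rank tokens by how often they occur rather than by how many pages they appear on.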


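There is a lot of noise in these unigrams: mixed case, punctuation, and very common words. Let's use NLTK's stop word list and a couple of simple transformations to make a cleaner word frequency list. If the stop word corpus is not installed yet, run `import nltk; nltk.download('stopwords')` once beforehand.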

In [70]:
stop_words = set(stopwords.words('english'))

In [71]:
word_freq = Counter()
for page in volume['features']['pages']:
    body = page['body']
    if body is None:
        continue
    for token, pos in body['tokenPosCount'].items():
        # require tokens to be 4+ characters
        if len(token) < 4:
            continue
        # require tokens to be alphabetical
        if not token.isalpha():
            continue
        # lower case
        t = token.lower()
        if t in stop_words:
            continue
        word_freq[t] += 1

In [72]:
for token, count in word_freq.most_common(25):
    print(token.ljust(30), count)

life                           52
time                           49
also                           47
would                          45
many                           41
dream                          39
unconscious                    37
like                           36
mother                         35
first                          34
dreams                         33
must                           32
could                          31
woman                          30
freud                          29
thus                           28
even                           28
mind                           27
child                          27
story                          27
father                         27
little                         27
symbolic                       27
early                          26
death                          26
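
The filtering steps in the loop above can also be pulled into a small helper so the same rules are easy to reuse or adjust. This is only a sketch: `normalize_token` and `SAMPLE_STOP_WORDS` are hypothetical names, and the tiny stop word set stands in for NLTK's English list so that the example runs on its own.

```python
from collections import Counter

# Stand-in stop word set for illustration; the notebook uses NLTK's English list.
SAMPLE_STOP_WORDS = {'that', 'which', 'with', 'from', 'this'}


def normalize_token(token, stop_words, min_length=4):
    """Return the lowercased token, or None if it should be dropped."""
    # require tokens to be at least min_length characters
    if len(token) < min_length:
        return None
    # require tokens to be alphabetical
    if not token.isalpha():
        return None
    lowered = token.lower()
    if lowered in stop_words:
        return None
    return lowered


tokens = ['The', 'Dream', 'dream', "'s", 'which', 'Freud', '1900', 'dreams']
cleaned = Counter()
for token in tokens:
    normalized = normalize_token(token, SAMPLE_STOP_WORDS)
    if normalized is not None:
        cleaned[normalized] += 1

print(cleaned.most_common())
```

In this sketch `dream` and `dreams` remain separate entries, just as they do in the output above; merging them would need stemming or lemmatization, which this notebook does not apply.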
