## Explore word frequencies for a curated collection

This notebook shows how to connect to a curated TDM dataset, inspect the metadata for one volume, and compute word frequencies for that volume.

In [1]:
from collections import Counter

import pandas as pd
from nltk.corpus import stopwords

from tdm_core.client import Dataset
from tdm_core.text import extracted_features_to_counter, filtered_token

Initialize a TDM dataset object with the collection ID provided in the email you received after curating your collection in the Digital Scholars Workbench.

In [2]:
dset = Dataset('bb3d938b-bc61-4c2c-a21c-9a4f102035c8')

Find the total number of documents in the dataset.

In [3]:
len(dset)

61

Let's examine frequent words for a volume in the collection.

In [4]:
volume = dset.get(dset.items[40])

In [5]:
meta = [e for e in volume['metadata'] if e['@type'] == 'PublicationIssue'][0]

In [6]:
print("\n".join([
    meta['@id'], 
    meta.get('name') or meta.get('title'), 
    str(meta['datePublished']['@value'])
]))

http://www.jstor.org/stable/i25091409
The Massachusetts Review volume 39 issue 2
1998-07-01
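Cell 5 indexes the filtered list with `[0]`, which raises `IndexError` when no entry of the requested type exists. A more defensive lookup might look like the sketch below. The helper `first_of_type` and the sample records are hypothetical, shaped only to resemble the `volume['metadata']` entries used above.

```python
def first_of_type(entries, type_name):
    """Return the first metadata entry whose '@type' matches, or None."""
    return next((e for e in entries if e.get('@type') == type_name), None)


# Hypothetical sample shaped like volume['metadata'] in the cells above.
sample_metadata = [
    {'@type': 'Periodical', 'name': 'Example Journal'},
    {
        '@type': 'PublicationIssue',
        '@id': 'http://example.org/issue/1',
        'name': 'Example Journal volume 1 issue 1',
        'datePublished': {'@value': '2000-01-01'},
    },
]

issue = first_of_type(sample_metadata, 'PublicationIssue')
if issue is not None:
    # Fall back to a placeholder when neither 'name' nor 'title' is present.
    label = issue.get('name') or issue.get('title') or '(untitled)'
    print(issue['@id'], label, issue['datePublished']['@value'], sep='\n')
```

Returning `None` instead of raising lets the caller decide how to handle a volume without issue-level metadata.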


In [7]:
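word_freq = Counter()
for page in volume['features']['pages']:
    body = page['body']
    # skip pages with no body text
    if body is None:
        continue
    # each token adds 1 per page it appears on, so the totals are
    # page counts rather than total occurrences
    for token in body['tokenPosCount']:
        word_freq[token] += 1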

In [8]:
for token, count in word_freq.most_common(25):
    print(token.ljust(30), count)

the                            153
and                            149
,                              148
of                             146
.                              145
a                              144
to                             141
in                             141
I                              114
on                             113
for                            113
it                             108
's                             108
that                           108
with                           107
The                            106
at                             105
?                              105
as                             103
"                              103
by                             99
his                            99
was                            98
:                              97
from                           92
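The loop in cell 7 adds 1 for every page on which a token appears, so the numbers above are page counts. If total occurrences are wanted instead, the per-page counts need to be summed. The sketch below assumes that each `tokenPosCount` value is a mapping from part-of-speech tag to count (for example `{'NN': 3}`), as in the HathiTrust Extracted Features layout; that structure is an assumption, not something shown in this notebook, and the sample pages are hypothetical.

```python
from collections import Counter


def total_token_counts(pages):
    """Sum occurrences of each token across pages, over all POS tags."""
    totals = Counter()
    for page in pages:
        body = page.get('body')
        if body is None:
            continue
        for token, pos_counts in body['tokenPosCount'].items():
            # Assumption: pos_counts looks like {'NN': 3, 'VB': 1}.
            totals[token] += sum(pos_counts.values())
    return totals


# Hypothetical pages shaped like volume['features']['pages'].
sample_pages = [
    {'body': {'tokenPosCount': {'the': {'DT': 4}, 'poem': {'NN': 2}}}},
    {'body': None},
    {'body': {'tokenPosCount': {'the': {'DT': 3}, 'poem': {'NN': 1, 'VB': 1}}}},
]

totals = total_token_counts(sample_pages)
for token, count in totals.most_common():
    print(token.ljust(30), count)
```

It is worth inspecting one page's `tokenPosCount` in the real data before relying on this, to confirm the values have the assumed shape.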


The top-25 list printed by cell 8 is noisy: it mixes upper and lower case, includes punctuation, and is dominated by very common words. Let's use NLTK's stop word list and a few simple transformations to produce a cleaner word frequency list.

In [9]:
stop_words = set(stopwords.words('english'))

In [10]:
word_freq = Counter()
for page in volume['features']['pages']:
    body = page['body']
    if body is None:
        continue
    for token, pos in body['tokenPosCount'].items():
        # require tokens to be 4+ characters
        if len(token) < 4:
            continue
        # require tokens to be alphabetical
        if not token.isalpha():
            continue
        # lowercase, then drop stop words
        t = token.lower()
        if t in stop_words:
            continue
        # as in cell 7, each token adds 1 per page it appears on
        word_freq[t] += 1

In [11]:
for token, count in word_freq.most_common(25):
    print(token.ljust(30), count)

like                           69
would                          62
time                           54
years                          50
first                          46
could                          46
ginsberg                       45
poetry                         44
poet                           44
people                         44
poem                           43
said                           43
back                           42
later                          40
poems                          38
allen                          37
life                           37
long                           37
much                           36
around                         36
good                           34
frost                          33
little                         33
never                          33
know                           32
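The filtering steps in cell 10 can be gathered into a small function so the same rules are easy to reuse on other volumes. This is a minimal sketch: `normalize`, `clean_page_counts`, and `SAMPLE_STOP_WORDS` are hypothetical names, and the tiny stop word set stands in for the NLTK-based `stop_words` so the example runs without downloading corpus data.

```python
from collections import Counter

# Tiny illustrative stop word set; in the notebook, pass stop_words instead.
SAMPLE_STOP_WORDS = {'that', 'with', 'from', 'this'}


def normalize(token, stop_words, min_length=4):
    """Return the lowercased token, or None if it should be dropped."""
    if len(token) < min_length or not token.isalpha():
        return None
    lowered = token.lower()
    return None if lowered in stop_words else lowered


def clean_page_counts(pages, stop_words):
    """Count, per cleaned token, the pages on which it appears (as in cell 10)."""
    counts = Counter()
    for page in pages:
        body = page.get('body')
        if body is None:
            continue
        for token in body['tokenPosCount']:
            cleaned = normalize(token, stop_words)
            if cleaned is not None:
                counts[cleaned] += 1
    return counts


# Hypothetical pages shaped like volume['features']['pages'].
sample_pages = [
    {'body': {'tokenPosCount': {
        'Poem': {'NN': 1}, 'the': {'DT': 5}, 'that': {'IN': 2},
        "don't": {'VB': 1}, 'Ginsberg': {'NNP': 1},
    }}},
    {'body': {'tokenPosCount': {'poem': {'NN': 2}, '1998': {'CD': 1}}}},
]

cleaned_counts = clean_page_counts(sample_pages, SAMPLE_STOP_WORDS)
for token, count in cleaned_counts.most_common():
    print(token.ljust(30), count)
```

Keeping the rules in `normalize` makes it straightforward to change the minimum length or swap in a different stop word list without touching the counting loop.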
