## Explore word frequencies for a curated dataset

This notebook shows how to connect to a curated TDM dataset and explore word frequencies for one of its volumes.

In [None]:
from collections import Counter

import pandas as pd
from nltk.corpus import stopwords

from tdm_client import Dataset

Initialize a TDM dataset object with the collection ID provided in the email you received after curating your collection in the Digital Scholars Workbench.

In [None]:
dset = Dataset('a517ef1f-0794-48e4-bea1-ac4fb8b312b4')

Find the total number of documents in the dataset.

In [None]:
len(dset)

Let's examine the frequent words for a volume in the collection, starting with the first item.

In [None]:
my_doc = dset.items[0]
my_doc

To inspect the individual words in the volume, load the extracted features from the dataset object. 

In [None]:
volume_features = dset.get_feature(my_doc)

In [None]:
word_freq = Counter()
for page in volume_features['features']['pages']:
    body = page['body']
    if body is None:
        continue
    for token, pos_count in body['tokenPosCount'].items():
        # sum counts across all part-of-speech tags for this token
        word_freq[token] += sum(pos_count.values())

In [None]:
for token, count in word_freq.most_common(25):
    print(token.ljust(30), count)
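
To make the counting step concrete, here is a minimal sketch that runs the same kind of loop on a tiny hand-made structure. The `sample_features` dictionary is hypothetical: it only mimics the `features`, `pages`, `body`, `tokenPosCount` nesting that the loop above relies on, and real extracted features will likely carry more fields.

```python
from collections import Counter

# Hypothetical stand-in for the extracted features of one volume.
# Only the keys used by the counting loop are included.
sample_features = {
    'features': {
        'pages': [
            {'body': {'tokenPosCount': {'The': {'DT': 2}, 'whale': {'NN': 1}}}},
            {'body': None},  # a page with no body text is skipped
            {'body': {'tokenPosCount': {'whale': {'NN': 3, 'NNP': 1}, ',': {',': 4}}}},
        ]
    }
}

sample_freq = Counter()
for page in sample_features['features']['pages']:
    body = page['body']
    if body is None:
        continue
    for token, pos_count in body['tokenPosCount'].items():
        # a token can carry several part-of-speech tags; add them all up
        sample_freq[token] += sum(pos_count.values())

print(sample_freq.most_common(3))
```

In this toy data `whale` totals 5, because its counts are summed across both pages and both part-of-speech tags, while the capitalized `The` and the comma are counted as-is.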

There is a lot of noise in these unigrams: mixed case, punctuation, and very common words. Let's use NLTK's stop word list and a few simple transformations to make a cleaner word frequency list.

In [None]:
# needs the NLTK stopwords corpus; run nltk.download('stopwords') once if it is missing
stop_words = set(stopwords.words('english'))

In [None]:
word_freq = Counter()
for page in volume_features['features']['pages']:
    body = page['body']
    if body is None:
        continue
    for token, pos_count in body['tokenPosCount'].items():
        # require tokens to be 4+ characters
        if len(token) < 4:
            continue
        # require tokens to be alphabetical
        if not token.isalpha():
            continue
        # lower case
        t = token.lower()
        if t in stop_words:
            continue
        # sum counts across all part-of-speech tags
        word_freq[t] += sum(pos_count.values())

In [None]:
for token, count in word_freq.most_common(25):
    print(token.ljust(30), count)
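
The filtering steps above could also be factored into a small helper so the same rules are easy to reuse or adjust. The sketch below is one possible arrangement and is not part of `tdm_client`: the `clean_token` name, its `min_length` parameter, and the short `demo_stop_words` set are all made up for illustration. In the notebook you would pass the NLTK-based `stop_words` set instead.

```python
def clean_token(token, stop_words, min_length=4):
    """Return a normalized token, or None if the token should be dropped.

    Hypothetical helper mirroring the notebook's filters:
    minimum length, alphabetic only, lowercased, not a stop word.
    """
    if len(token) < min_length:
        return None
    if not token.isalpha():
        return None
    t = token.lower()
    if t in stop_words:
        return None
    return t


# Small made-up stop word list so the example runs without NLTK data.
demo_stop_words = {'that', 'with', 'this'}

demo_tokens = ['Whale', 'whale', 'That', 'sea', 'ship-wreck', '1851', 'Captain']
cleaned = [clean_token(tok, demo_stop_words) for tok in demo_tokens]
kept = [t for t in cleaned if t is not None]
print(kept)
```

Returning `None` for rejected tokens keeps the filtering rules in one place, so changing the minimum length or swapping the stop word list does not require touching the counting loop.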