# Finding significant words within a curated dataset

This example uses the [`gensim`](https://radimrehurek.com/gensim/index.html) library for calculating [TFIDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

In [1]:
import warnings
warnings.filterwarnings('ignore')

from tdm_core.client import Dataset
from tdm_core.text import htrc_corrections, htrc_stopwords

import gensim

Initialize a dataset object. 

In [2]:
dset = Dataset('bb3d938b-bc61-4c2c-a21c-9a4f102035c8')

In [3]:
len(dset)

61

In [4]:
dset.query()

'q=%22walt%20whitman%22%20brooklyn&fq=yearPublished%3A%5B1700%20TO%202019%5D&fq=outputFormat%3Aunigrams'

Create a helper function for cleaning the individual tokens in the dataset. This function:
* lower cases all tokens
* uses an HTRC dictionary to correct common OCR problems
* discards tokens less than 4 characters in length
* discards tokens with non-alphabetical tokens
* removes stopwords from the HTC stopword list

In [5]:
def process_token(token):
    token = token.lower()
    corrected = htrc_corrections.get(token)
    if corrected is not None:
        token = corrected
    if len(token) < 4:
        return
    if not(token.isalpha()):
        return
    return token

In [9]:
documents = []

for doc_n, volume in enumerate(dset):
    this_doc = []
    try:
        pages = volume['features']['pages']
    except KeyError:
        continue
    for pn, page in enumerate(pages):
        body = page.get('body')
        if body is not None:
            for token, pos_count in body.get('tokenPosCount', {}).items():
                clean_token = process_token(token)
                if clean_token is None:
                    continue
                for pos, n in pos_count.items():
                    this_doc += [clean_token] * n
    documents.append(this_doc)
                    

In [10]:
dictionary = gensim.corpora.Dictionary(documents)

In [11]:
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

In [12]:
model = gensim.models.TfidfModel(bow_corpus)

In [13]:
corpus_tfidf = model[bow_corpus]

Find the most significant terms, by TFIDF, in the curated dataset. 

In [14]:
td = {
        dictionary.get(_id): value for doc in corpus_tfidf
        for _id, value in doc
    }
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)

In [15]:
for term, weight in sorted_td[:10]:
    print(term, weight)

mcadie 0.7787276861573577
cantrell 0.7549777890095891
pudney 0.63605417970789
sahitya 0.5273050613229457
lanyer 0.4967422716248303
titelbaum 0.49005434929015723
delgado 0.464262015116991
marlin 0.4617810095938151
wilhams 0.46074722820884134
rosi 0.4384985587853527


Print the most significant word, by TFIDF, for each document in the corpus. 

In [16]:
for n, doc in enumerate(corpus_tfidf):
    word_id, score = max(doc, key=lambda x: x[1])
    print(dset.items[n], dictionary.get(word_id), score)

http://hdl.handle.net/2027/uc1.b4880650 sahitya 0.5273050613229457
http://hdl.handle.net/2027/mdp.39076006733898 lamartine 0.3900976840324696
http://hdl.handle.net/2027/uc1.32106002083555 superintendent 0.4646221961996292
http://hdl.handle.net/2027/uva.x000926626 prietos 0.27052455371978007
http://hdl.handle.net/2027/nyp.33433082396866 balzac 0.17697208056045913
http://hdl.handle.net/2027/mdp.39015030718087 feisal 0.18675094639439566
http://hdl.handle.net/2027/uc2.ark:/13960/t1xd0rw0b powys 0.9090784780355294
http://hdl.handle.net/2027/mdp.39015019763518 conservatory 0.25557880084144774
http://hdl.handle.net/2027/mdp.49015000556002 murine 0.2606232188631184
http://hdl.handle.net/2027/mdp.39015025045207 lanyer 0.4967422716248303
http://hdl.handle.net/2027/uc2.ark:/13960/fk2q52ff8n congreve 0.32575462864142546
http://hdl.handle.net/2027/mdp.39015059999972 pudney 0.63605417970789
http://hdl.handle.net/2027/mdp.39015030718079 phrenology 0.4328614462032251
http://hdl.handle.net/2027/mdp.390