# Finding significant words within a curated dataset

This example uses the [`gensim`](https://radimrehurek.com/gensim/index.html) library for calculating [TFIDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

In [54]:
import warnings
warnings.filterwarnings('ignore')

from tdm_client import Dataset
from tdm_client import htrc_corrections, htrc_stopwords

import gensim

Initialize a dataset object. 

In [55]:
dset = Dataset('a517ef1f-0794-48e4-bea1-ac4fb8b312b4')

In [56]:
len(dset)

1000

In [57]:
dset.query()

'q=shakespeare&start=0&rows=20&fq=yearPublished%3A%5B1900%20TO%202019%5D&fq=category%3A(%22Literature%20(General)%22%20OR%20%22English%20literature%22)'

Create a helper function for cleaning the individual tokens in the dataset. This function:
* lower cases all tokens
* uses an HTRC dictionary to correct common OCR problems
* discards tokens less than 4 characters in length
* discards tokens with non-alphabetical tokens
* removes stopwords from the HTC stopword list

In [58]:
def process_token(token):
    token = token.lower()
    corrected = htrc_corrections.get(token)
    if corrected is not None:
        token = corrected
    if len(token) < 4:
        return
    if not(token.isalpha()):
        return
    return token

In [59]:
documents = []

for doc_n, volume in enumerate(dset.get_features()):
    this_doc = []
    try:
        pages = volume['features']['pages']
    except KeyError:
        continue
    for pn, page in enumerate(pages):
        body = page.get('body')
        if body is not None:
            for token, pos_count in body.get('tokenPosCount', {}).items():
                clean_token = process_token(token)
                if clean_token is None:
                    continue
                for pos, n in pos_count.items():
                    this_doc += [clean_token] * n
    documents.append(this_doc)
                    

In [60]:
dictionary = gensim.corpora.Dictionary(documents)

In [61]:
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

In [62]:
model = gensim.models.TfidfModel(bow_corpus)

In [63]:
corpus_tfidf = model[bow_corpus]

Find the most significant terms, by TFIDF, in the curated dataset. 

In [64]:
td = {
        dictionary.get(_id): value for doc in corpus_tfidf
        for _id, value in doc
    }
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)

In [65]:
for term, weight in sorted_td[:25]:
    print(term, weight)

rihll 0.8789730103989494
ffarington 0.8687586142454262
lācis 0.8243941724639476
vennar 0.8163770278745967
springthorpe 0.7756033880521054
dowson 0.7319115494098776
ansori 0.7275629612915525
haggai 0.6938212666496392
mordred 0.6741468905424344
gambuh 0.6694600905272786
hynde 0.6476993218928895
mbti 0.6373307886940024
naatsilanei 0.612118444994542
pgends 0.6074158461930421
pgvar 0.6074158461930421
teena 0.6066638524325301
bunting 0.6061238433470649
londesbr 0.605012939081335
ramlila 0.5986739623074735
crimewatch 0.5864962542111738
bulworth 0.5685071371122152
mccrea 0.559153117887106
ncox 0.5589374086582615
edib 0.5561714547394812
mayella 0.5523679319537927


Print the most significant word, by TFIDF, for the first 50 documents in the corpus. 

In [66]:
for n, doc in enumerate(corpus_tfidf):
    if len(doc) < 1:
        continue
    word_id, score = max(doc, key=lambda x: x[1])
    print(dset.items[n], dictionary.get(word_id), score)
    if n >= 50:
        break

http://www.jstor.org/stable/i40075057 sophie 0.31074312791735936
http://www.jstor.org/stable/i40103856 hodgetts 0.3584969280241884
http://www.jstor.org/stable/i40075051 tercentenary 0.25790641235753525
http://www.jstor.org/stable/i40075048 harvey 0.2823835999735381
http://www.jstor.org/stable/i40075029 siddons 0.4804214701488567
http://www.jstor.org/stable/i40075043 mutran 0.4860344553470682
http://www.jstor.org/stable/i40075049 hathaway 0.37880283509018037
http://www.jstor.org/stable/i40103854 faucit 0.5626181311237378
http://www.jstor.org/stable/i24712320 hawkes 0.5600912328304873
http://www.jstor.org/stable/i40180516 péguy 0.8972513573836641
http://www.jstor.org/stable/i40075016 blackmore 0.18185351011005418
http://www.jstor.org/stable/i23917916 zarathustra 0.6152358652346409
http://www.jstor.org/stable/i338528 designer 0.1496299927861339
http://www.jstor.org/stable/i24712323 prufrock 0.3850736252470401
http://www.jstor.org/stable/i40074765 shak 0.17726733854453525
http://www.jstor.