# Finding significant words within a curated dataset

This notebook demonstrates how to find the significant words in your dataset using [tf-idf](./key-terms.ipynb#tf-idf). The following processes are described:

* Importing your dataset
* Discovering the size and contents of your dataset
* Cleaning the tokens in your dataset
* Calculating tf-idf scores with gensim
* Listing the most significant terms in the dataset and in each document

A familiarity with gensim is helpful but not required.
____
We import the `Dataset` class from the `tdm_client` library, which contains functions for connecting to the JSTOR server that hosts our [corpus](./key-terms.ipynb#corpus) [dataset](./key-terms.ipynb#dataset). To analyze your dataset, use the [dataset ID](./key-terms.ipynb#dataset-ID) provided when you created it. A copy of your [dataset ID](./key-terms.ipynb#dataset-ID) was also sent to your email at that time. It should look like a long series of characters separated by dashes.



This example uses the [`gensim`](https://radimrehurek.com/gensim/index.html) library to calculate [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) scores.
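
To make the idea behind tf-idf concrete before handing the work to `gensim`, here is a minimal sketch of a common textbook formulation, term frequency multiplied by `log2(N / document frequency)`. The toy corpus and the `tfidf_scores` helper are hypothetical and exist only for illustration. `gensim` may differ in details such as normalizing each document's scores, so the numbers it reports will not necessarily match these.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of already-cleaned tokens.
toy_docs = [
    ["whale", "ship", "whale", "ocean"],
    ["ship", "harbor", "ship"],
    ["ocean", "storm", "ship"],
]

def tfidf_scores(docs):
    """Return one {term: weight} dict per document using tf * log2(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        scores.append({
            term: count * math.log2(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return scores

toy_scores = tfidf_scores(toy_docs)
for doc_scores in toy_scores:
    # Print each document's terms from highest to lowest weight.
    print(sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True))
```

In this toy corpus "ship" occurs in every document, so its weight is zero everywhere, while "whale" occurs twice in a single document and receives the highest weight there. That is the sense in which tf-idf favors words that are frequent in one document but rare across the dataset.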

In [None]:
import warnings
warnings.filterwarnings('ignore')

from tdm_client import Dataset
from tdm_client import htrc_corrections, htrc_stopwords

import gensim

Initialize a dataset object. 

In [None]:
dset = Dataset('59c090b6-3851-3c65-e016-9181833b4a2c')

In [None]:
len(dset)

In [None]:
dset.query()

Create a helper function for cleaning the individual tokens in the dataset. This function:
* lowercases all tokens
* uses an HTRC dictionary to correct common OCR problems
* discards tokens less than 4 characters in length
* discards tokens containing non-alphabetical characters
* removes stopwords from the HTRC stopword list

In [None]:
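def process_token(token):
    """Return a cleaned token, or None if the token should be discarded."""
    token = token.lower()
    # Replace common OCR errors with their corrected forms.
    corrected = htrc_corrections.get(token)
    if corrected is not None:
        token = corrected
    # Discard short tokens and tokens containing non-alphabetical characters.
    if len(token) < 4:
        return None
    if not token.isalpha():
        return None
    # Discard stopwords (assumes htrc_stopwords supports membership tests).
    if token in htrc_stopwords:
        return None
    return token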
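
To see how these cleaning rules behave without connecting to the server, the sketch below applies the same steps to a handful of sample tokens. The `sample_corrections` dictionary, the `sample_stopwords` set, and the `demo_process_token` function are hypothetical stand-ins for the `tdm_client` resources, not the real HTRC lists.

```python
# Hypothetical stand-ins for the tdm_client resources, for illustration only.
sample_corrections = {"tbe": "the", "shal": "shall"}
sample_stopwords = {"shall", "which", "there"}

def demo_process_token(token, corrections, stopwords, min_length=4):
    """Return the cleaned token, or None if it should be discarded."""
    token = token.lower()
    # Apply an OCR correction when one is available.
    token = corrections.get(token, token)
    # Discard short tokens and tokens with non-alphabetical characters.
    if len(token) < min_length or not token.isalpha():
        return None
    # Discard stopwords.
    if token in stopwords:
        return None
    return token

samples = ["Whale", "tbe", "shal", "1851", "sea-faring", "Harpooner"]
cleaned = [
    demo_process_token(t, sample_corrections, sample_stopwords) for t in samples
]
print(cleaned)
```

Only "whale" and "harpooner" survive here: "tbe" is corrected to "the" and then dropped for length, "shal" is corrected to a stopword, and the remaining two contain non-alphabetical characters.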

In [None]:
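documents = []

# Rebuild each document as a list of cleaned tokens from its unigram counts.
for unigram_count in dset.get_features():
    this_doc = []
    for token, count in unigram_count.items():
        clean_token = process_token(token)
        if clean_token is None:
            continue
        # Repeat the token once per occurrence so term frequencies are preserved.
        this_doc += [clean_token] * count
    documents.append(this_doc)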

In [None]:
dictionary = gensim.corpora.Dictionary(documents)

In [None]:
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]
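
The two cells above turn each document into a bag of words: a list of `(token_id, count)` pairs. The following sketch mimics that idea in plain Python so the data structure is easier to picture. The `build_vocabulary` and `to_bow` helpers are hypothetical and are not `gensim`'s implementation; in particular, the ids `gensim` assigns may differ.

```python
from collections import Counter

def build_vocabulary(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Hypothetical cleaned documents.
toy_documents = [["whale", "ship", "whale"], ["ship", "harbor"]]
toy_vocab = build_vocabulary(toy_documents)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_documents]
print(toy_vocab)
print(toy_bow)
```

Each document shrinks to the ids of the words it contains and how often each occurs. Word order is discarded, which is all that tf-idf needs.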

In [None]:
model = gensim.models.TfidfModel(bow_corpus)

In [None]:
corpus_tfidf = model[bow_corpus]

Find the most significant terms, by TFIDF, in the curated dataset. 

In [None]:
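# Map each term to a tf-idf score. A term that appears in several documents
# keeps the score from the last document in which it occurs.
td = {
    dictionary.get(_id): value
    for doc in corpus_tfidf
    for _id, value in doc
}
# Sort terms from highest to lowest score.
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)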

In [None]:
for term, weight in sorted_td[:25]:
    print(term, weight)
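
Because a dictionary holds one value per key, the ranking above uses whichever score was seen last for a term that occurs in more than one document. If you would rather rank each term by its highest score in any document, one possible variation is sketched below. The `toy_tfidf` data is hypothetical, and it uses terms directly rather than `gensim` ids to keep the example self-contained.

```python
# Hypothetical tf-idf output: one list of (term, score) pairs per document.
toy_tfidf = [
    [("whale", 0.9), ("ocean", 0.3)],
    [("whale", 0.2), ("harbor", 0.7)],
]

best_score = {}
for doc in toy_tfidf:
    for term, score in doc:
        # Keep the largest score seen so far for this term.
        if score > best_score.get(term, float("-inf")):
            best_score[term] = score

# Sort terms from highest to lowest best score.
ranked = sorted(best_score.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

With a last-value-wins dictionary, "whale" would be ranked by 0.2. Here it is ranked by 0.9, its strongest showing in any single document.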

Print the most significant word, by TFIDF, for the first 50 documents in the corpus. 

In [None]:
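for n, doc in enumerate(corpus_tfidf):
    # Stop after the first 50 documents (indices 0 through 49).
    if n >= 50:
        break
    # Skip documents with no remaining tokens after cleaning.
    if len(doc) < 1:
        continue
    word_id, score = max(doc, key=lambda x: x[1])
    print(dset.items[n], dictionary.get(word_id), score)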
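
A single top word can be a thin summary of a document. As a possible extension, the sketch below uses `heapq.nlargest` from the standard library to pull the few highest-scoring terms from one document's `(word_id, score)` pairs. The `top_terms` helper, the `toy_doc` scores, and the `toy_id_to_term` mapping are hypothetical.

```python
import heapq

def top_terms(doc, n=3):
    """Return the n highest-scoring (word_id, score) pairs, best first."""
    return heapq.nlargest(n, doc, key=lambda pair: pair[1])

# Hypothetical tf-idf scores for one document, plus an id-to-term mapping.
toy_doc = [(0, 0.1), (1, 0.8), (2, 0.5), (3, 0.3)]
toy_id_to_term = {0: "ship", 1: "whale", 2: "harpooner", 3: "ocean"}

top_two = top_terms(toy_doc, n=2)
for word_id, score in top_two:
    print(toy_id_to_term[word_id], score)
```

In the notebook itself, the same helper could be applied to each `doc` in `corpus_tfidf`, with `dictionary.get(word_id)` supplying the term.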