# Finding significant words within a curated dataset

This notebook demonstrates how to find the significant words in your dataset using [tf-idf](./key-terms.ipynb#tf-idf). The following processes are described:

* Importing your [dataset](./key-terms.ipynb#dataset)
* Finding your initial query within your [dataset's](./key-terms.ipynb#dataset) metadata
* Writing a helper function to help clean up a single [token](./key-terms.ipynb#token)
* Cleaning each document of your dataset, one [token](./key-terms.ipynb#token) at a time
* Using a dictionary of common corrections to fix words with poor [OCR](./key-terms.ipynb#ocr)
* Computing the most significant words in your [corpus](./key-terms.ipynb#corpus) using [TF-IDF](./key-terms.ipynb#tf-idf) with the [gensim](./key-terms.ipynb#gensim) library

A familiarity with gensim is helpful but not required.
____
We import `Dataset` and `htrc_corrections` from the `tdm_client` library, along with `gensim`, which we use later to compute TF-IDF scores. The `tdm_client` library contains functions for connecting to the JSTOR server containing our [corpus](./key-terms.ipynb#corpus) [dataset](./key-terms.ipynb#dataset).

In [None]:
import warnings
warnings.filterwarnings('ignore')

import gensim

from tdm_client import Dataset
from tdm_client import htrc_corrections

To analyze your dataset, use the [dataset ID](./key-terms.ipynb/#dataset-ID) provided when you created your [dataset](./key-terms.ipynb/#dataset). A copy of your [dataset ID](./key-terms.ipynb/#dataset-ID) was sent to your email when you created your [corpus](./key-terms.ipynb#corpus). It should look like a long series of characters separated by dashes.

We create a new variable **dset** and initialize its value by calling **Dataset** with a **dataset ID**. A sample **dataset ID** for a dataset of journals focused on Shakespeare is provided here ('a517ef1f-0794-48e4-bea1-ac4fb8b312b4'). Replacing the ID in the cell below with your own **dataset ID** will import your dataset from the JSTOR server.

**Note**: If you are curious what is in your dataset, there is a download link in the email you received. The format and content of the files are described in the notebook [Building a Dataset](./1-building-a-dataset.ipynb).

In [None]:
dset = Dataset('59c090b6-3851-3c65-e016-9181833b4a2c')

Find the total number of documents in the dataset using the `len()` function.

In [None]:
len(dset)

To check that this is the correct dataset, we can look at the original query using the `query_text` method.

In [None]:
dset.query_text()

Now that we've verified that we have the correct [corpus](./key-terms.ipynb#corpus), let's create a helper function that can standardize and [clean](./key-terms.ipynb#clean-data) up the [tokens](./key-terms.ipynb#token) in our [dataset](./key-terms.ipynb#dataset). The function will:
* lowercase all [tokens](./key-terms.ipynb#token)
* use a dictionary from [The HathiTrust Research Center](./key-terms.ipynb#htrc) to correct common [Optical Character Recognition](./key-terms.ipynb#ocr) problems
* discard [tokens](./key-terms.ipynb#token) less than 4 characters in length
* discard [tokens](./key-terms.ipynb#token) with non-alphabetical characters
* remove [stopwords](./key-terms.ipynb#stop-words) based on a [stopword](./key-terms.ipynb#stop-words) list from [The HathiTrust Research Center](./key-terms.ipynb#htrc)

In [None]:
def process_token(token):  # Clean a single token; returns None if the token should be discarded
    token = token.lower()  # Lowercase the token
    corrected = htrc_corrections.get(token)  # Look up the token in the HTRC dictionary of common OCR corrections
    if corrected is not None:  # If a correction exists, use it in place of the original token
        token = corrected
    if len(token) < 4:  # Discard tokens shorter than four characters
        return None
    if not token.isalpha():  # Discard tokens containing non-alphabetic characters
        return None
    return token  # Return the cleaned (and possibly corrected) token
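
To see how these rules behave on individual tokens, here is a minimal, self-contained sketch. `SAMPLE_CORRECTIONS` and `SAMPLE_STOPWORDS` are small hypothetical stand-ins for the HTRC data, not the real lists, and `process_token_sketch` is an illustrative name. The stopword check is one possible way to implement the last rule in the list above; it is an assumption, since `process_token` as written does not include that check.

```python
# Hypothetical stand-ins for the HTRC data; the real lists are much larger.
SAMPLE_CORRECTIONS = {"tbe": "the", "shakefpeare": "shakespeare"}
SAMPLE_STOPWORDS = {"that", "with", "from", "this"}


def process_token_sketch(token, corrections=SAMPLE_CORRECTIONS, stopwords=SAMPLE_STOPWORDS):
    """Return a cleaned token, or None if the token should be discarded."""
    token = token.lower()
    corrected = corrections.get(token)
    if corrected is not None:
        token = corrected
    if len(token) < 4:          # too short
        return None
    if not token.isalpha():     # digits, punctuation, etc.
        return None
    if token in stopwords:      # assumed stopword rule, not in process_token
        return None
    return token


for raw in ["Shakefpeare", "tbe", "1623", "folio's", "WITH", "Hamlet"]:
    print(repr(raw), "->", process_token_sketch(raw))
```

Only `Shakefpeare` (corrected to `shakespeare`) and `Hamlet` (lowercased to `hamlet`) survive; the other tokens are dropped by the length, alphabet, or stopword checks.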

Now let's cycle through each document in the [corpus](./key-terms.ipynb#corpus) with our helper function.

In [None]:
documents = []  # One list of cleaned tokens per document

for unigram_count in dset.get_features():  # Each item maps tokens to their counts in one document
    this_doc = []
    for token, count in unigram_count.items():
        clean_token = process_token(token)
        if clean_token is None:  # Skip tokens that were discarded during cleaning
            continue
        this_doc += [clean_token] * count  # Repeat the token once per occurrence
    documents.append(this_doc)
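
The loop above turns per-document token counts back into flat lists of tokens. The sketch below runs the same logic on toy data so the result is easy to inspect. `sample_features` is a hypothetical stand-in for what `dset.get_features()` yields, and `keep` is a simplified stand-in for `process_token` that skips the OCR correction step.

```python
# Hypothetical per-document unigram counts standing in for dset.get_features().
sample_features = [
    {"hamlet": 3, "ghost": 1, "the": 10},
    {"ghost": 2, "1603": 1},
]


def keep(token):
    """Simplified stand-in for process_token: lowercase, length, and alphabet checks only."""
    token = token.lower()
    if len(token) < 4 or not token.isalpha():
        return None
    return token


sample_documents = []
for unigram_count in sample_features:
    this_doc = []
    for token, count in unigram_count.items():
        clean_token = keep(token)
        if clean_token is None:
            continue
        this_doc += [clean_token] * count  # one copy of the token per occurrence
    sample_documents.append(this_doc)

print(sample_documents)
# [['hamlet', 'hamlet', 'hamlet', 'ghost'], ['ghost', 'ghost']]
```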
                    

In [None]:
# Map each unique token in the cleaned documents to an integer ID
dictionary = gensim.corpora.Dictionary(documents)

In [None]:
# Convert each document to a bag of words: a list of (token ID, count) pairs
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]
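
To make the bag-of-words representation concrete, here is a plain-Python sketch of the idea: assign each distinct token an integer ID, then describe each document as a list of (ID, count) pairs. This is an illustration only, not gensim's implementation. The names `token2id`, `to_bow`, and the toy documents are hypothetical, and the IDs gensim assigns to the same tokens may differ.

```python
from collections import Counter

# Toy cleaned documents (hypothetical data).
sample_documents = [
    ["hamlet", "hamlet", "ghost"],
    ["ghost", "ghost", "stage"],
]

# Assign each distinct token an integer ID in order of first appearance.
token2id = {}
for doc in sample_documents:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return the document as a sorted list of (token ID, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


sample_bow = [to_bow(doc) for doc in sample_documents]
print(token2id)    # {'hamlet': 0, 'ghost': 1, 'stage': 2}
print(sample_bow)  # [[(0, 2), (1, 1)], [(1, 2), (2, 1)]]
```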

In [None]:
# Fit a TF-IDF model to the bag-of-words corpus
model = gensim.models.TfidfModel(bow_corpus)

In [None]:
# Apply the model to get a TF-IDF score for each token in each document
corpus_tfidf = model[bow_corpus]
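
For intuition about what the scores mean, the sketch below computes TF-IDF by hand on a toy bag-of-words corpus. It follows what we understand to be the default weighting in `TfidfModel` (raw count times the base-2 log of the number of documents divided by the document frequency, with each document then scaled to unit length). Treat it as an illustration under that assumption rather than a reimplementation; `sample_bow`, `doc_freq`, and `tfidf` are hypothetical names.

```python
import math

# Toy bag-of-words corpus: one list of (term ID, count) pairs per document.
sample_bow = [
    [(0, 2), (1, 1)],
    [(1, 2), (2, 1)],
    [(2, 3)],
]

num_docs = len(sample_bow)

# Document frequency: the number of documents each term appears in.
doc_freq = {}
for doc in sample_bow:
    for term_id, _ in doc:
        doc_freq[term_id] = doc_freq.get(term_id, 0) + 1


def tfidf(doc):
    """Weight each term by count * log2(N / df), then scale the document to unit length."""
    weights = [
        (term_id, count * math.log2(num_docs / doc_freq[term_id]))
        for term_id, count in doc
    ]
    weights = [(t, w) for t, w in weights if w > 0]  # terms in every document score 0
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(t, w / norm) for t, w in weights] if norm else []


sample_tfidf = [tfidf(doc) for doc in sample_bow]
for doc in sample_tfidf:
    print(doc)
```

In the first toy document, term 0 appears in only one document and so scores higher than term 1, which appears in two. Rarity across the corpus is what pushes a term's score up.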

Find the most significant terms in the curated dataset, ranked by TF-IDF score.

In [None]:
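# Map each term to a TF-IDF score. A term that appears in several documents
# keeps the score from the last document in which it occurs.
td = {
    dictionary.get(_id): value
    for doc in corpus_tfidf
    for _id, value in doc
}
# Sort the terms from highest to lowest score
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)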

In [None]:
# Print the 25 terms with the highest scores
for term, weight in sorted_td[:25]:
    print(term, weight)
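
The dictionary comprehension above stores one score per term, so a term that occurs in several documents ends up with the score from the last of them. If you would rather rank terms by their highest score in any document, one possible alternative is sketched below on toy data. `sample_tfidf`, `last_seen`, and `highest` are hypothetical names, and the terms are written out as strings for readability.

```python
# Hypothetical TF-IDF output: one list of (term, score) pairs per document.
sample_tfidf = [
    [("ghost", 0.9), ("stage", 0.4)],
    [("ghost", 0.2), ("crown", 0.7)],
]

# Same approach as the comprehension above: later documents overwrite earlier ones.
last_seen = {term: score for doc in sample_tfidf for term, score in doc}

# Alternative: keep the highest score seen for each term.
highest = {}
for doc in sample_tfidf:
    for term, score in doc:
        if score > highest.get(term, 0.0):
            highest[term] = score

print(sorted(last_seen.items(), key=lambda kv: kv[1], reverse=True))
# [('crown', 0.7), ('stage', 0.4), ('ghost', 0.2)]
print(sorted(highest.items(), key=lambda kv: kv[1], reverse=True))
# [('ghost', 0.9), ('crown', 0.7), ('stage', 0.4)]
```

Which ranking is more useful depends on the question: the maximum highlights terms that are highly distinctive for at least one document.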

Print the most significant word, by TF-IDF, for each of the first 50 documents in the corpus. Documents with no tokens left after cleaning are skipped.

In [None]:
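for n, doc in enumerate(corpus_tfidf):
    if n >= 50:  # Stop after the first 50 documents
        break
    if len(doc) < 1:  # Skip documents with no remaining tokens
        continue
    word_id, score = max(doc, key=lambda x: x[1])  # Token with the highest TF-IDF score
    print(dset.items[n], dictionary.get(word_id), score)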