<font color ='gray'> [Start here](https://hub.tdm-pilot.org/user/jfergusonnortheasternedu/notebooks/jasfTest/About%20the%20Corpus%20and%20About%20Jupyter.ipynb) with the 'About the Corpus/About Jupyter'</font>

## Finding significant words within a curated dataset

This notebook demonstrates how to find the significant words in your dataset using a model called TF-IDF. As you work through this notebook, you'll take the following steps:

* Import your dataset
* Find your initial query within your dataset's metadata
* Write a helper function to help clean up a single token
* Clean each document of your dataset, one token at a time
* Use a dictionary of English words to remove words with poor OCR
* Compute the most significant words in your corpus using TF-IDF and a library called gensim


**What's a token?** It's a string of text. For our purposes, think of a token as a single word.
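
To make this concrete, here is a minimal sketch (our own illustration, not part of the tdm_client library) that splits a sentence into tokens on whitespace and counts them. Real tokenizers treat punctuation and other edge cases more carefully; the sentence and variable names are made up for this example.

```python
from collections import Counter

sentence = "the cat sat on the mat"

# Split on whitespace: each resulting string is one token
tokens = sentence.split()
print(tokens)

# Count how many times each token occurs in the sentence
token_counts = Counter(tokens)
print(token_counts["the"])
```

Token counts like these are the raw material for everything that follows in this notebook.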

First we'll import gensim, along with `Dataset` and `htrc_corrections` from the tdm_client library. The tdm_client library contains functions for connecting to the JSTOR server containing our corpus dataset.

In [None]:
import warnings
warnings.filterwarnings('ignore')

import gensim

from tdm_client import Dataset
from tdm_client import htrc_corrections

To analyze your dataset, use the [dataset ID](./key-terms.ipynb/#dataset-ID) provided when you created your [dataset](./key-terms.ipynb/#dataset). A copy of your [dataset ID](./key-terms.ipynb/#dataset-ID) was sent to your email when you created your [corpus](./key-terms.ipynb#corpus). It should look like a long series of characters separated by dashes.

We create a new variable **dset** and initialize its value using the **Dataset** function. A sample **dataset ID** of data derived from searching JSTOR for 'antibiotic' and 'resistance' and 'coli' is provided here ('730b508b-5152-618a-2856-aa1a2900a0b2'). Pasting your unique **dataset ID** here will import your dataset from the JSTOR server.

**Note**: If you are curious what is in your dataset, there is a download link in the email you received. The format and content of the files are described in the notebook [Building a Dataset](./1-building-a-dataset.ipynb).

In [None]:
dset = Dataset('730b508b-5152-618a-2856-aa1a2900a0b2')

Find the total number of documents in the dataset using the `len()` function. 

In [None]:
len(dset)

Let's double-check to make sure we have the correct dataset. 
We can look at the original query by using the `query_text` method.

In [None]:
dset.query_text()

Now that we've verified that we have the correct corpus/dataset, let's create a helper function that can standardize and clean up the tokens in it. The function will:

* Change all tokens (aka words) to lower case.  This will make 'Cats' and 'cats' be counted as the same token.
* Use a dictionary from The HathiTrust Research Center to correct common OCR (Optical Character Recognition) problems
* Remove stopwords based on a HathiTrust Research Center stopword list (note: the function as written below does not perform this step)
* Discard tokens with non-alphabetical characters
* Discard tokens shorter than 4 characters

*Why do you think we want to discard tokens that are less than 4 characters long?*

In [None]:
def process_token(token):  # takes one token (a string); returns the cleaned token, or None to discard it
    token = token.lower()  # lower-case the token so 'Cats' and 'cats' match
    corrected = htrc_corrections.get(token)  # look the token up in the dictionary of common OCR errors
    if corrected is not None:  # if a correction exists, use it in place of the original token
        token = corrected
    if len(token) < 4:  # discard tokens shorter than 4 characters
        return None
    if not token.isalpha():  # discard tokens containing non-alphabetic characters
        return None
    return token  # return the cleaned (and possibly corrected) token
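
To see what the helper does to individual tokens, here is a small self-contained sketch. The `sample_corrections` dictionary is a made-up stand-in for `htrc_corrections` (its two entries are invented for illustration), and the function takes it as a hypothetical extra argument so the example runs without tdm_client.

```python
# Made-up stand-in for htrc_corrections: maps a misread token to its fix.
# The real dictionary is imported from tdm_client.
sample_corrections = {"tbe": "the", "bacteiia": "bacteria"}

def process_token(token, corrections=sample_corrections):
    token = token.lower()                  # normalize case
    corrected = corrections.get(token)     # look up a known OCR error
    if corrected is not None:
        token = corrected
    if len(token) < 4:                     # too short: discard
        return None
    if not token.isalpha():                # digits or punctuation: discard
        return None
    return token

for raw in ["Resistance", "bacteiia", "tbe", "E.coli", "1998"]:
    print(repr(raw), "->", repr(process_token(raw)))
```

Here 'Resistance' is lower-cased, 'bacteiia' is corrected to 'bacteria', and the remaining three tokens are discarded: 'tbe' is corrected to 'the' but is then too short, while 'E.coli' and '1998' contain non-alphabetic characters.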

Now let's run each document in the [corpus](./key-terms.ipynb#corpus) through our helper function. This may take a while to run; recall that while a cell is still running, you'll see an * in the In [ ] next to it.

In [None]:
documents = []  # a list that will hold one list of cleaned tokens per document

for unigram_count in dset.get_features():  # each item maps tokens to their counts in one document
    this_doc = []
    for token, count in unigram_count.items():
        clean_token = process_token(token)
        if clean_token is None:  # skip tokens that process_token discarded
            continue
        this_doc += [clean_token] * count  # add the token once per occurrence
    documents.append(this_doc)
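
The loop above treats each item from `dset.get_features()` as a mapping from a token to its count in one document. Here is a sketch of the same logic on made-up data. Both `toy_features` and the simplified `clean` function are stand-ins for illustration; `clean` skips the OCR correction step.

```python
# Made-up unigram counts for two documents (stand-in for dset.get_features())
toy_features = [
    {"Resistance": 2, "of": 5, "coli": 1},
    {"antibiotic": 1, "p.12": 3},
]

def clean(token):
    # Simplified stand-in for process_token: lower-case the token, then keep
    # it only if it is alphabetic and at least 4 characters long
    token = token.lower()
    if len(token) < 4 or not token.isalpha():
        return None
    return token

toy_documents = []
for unigram_count in toy_features:
    this_doc = []
    for token, count in unigram_count.items():
        clean_token = clean(token)
        if clean_token is None:
            continue
        this_doc += [clean_token] * count  # repeat the token once per occurrence
    toy_documents.append(this_doc)

print(toy_documents)
```

Each document ends up as a flat list of tokens in which a word that occurred twice appears twice. This is the input format that the next steps expect.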
                        

In [None]:
dictionary = gensim.corpora.Dictionary(documents)

In [None]:
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]
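
The two cells above build a vocabulary and then convert each document into a "bag of words": a list of `(token_id, count)` pairs. The following is a plain-Python sketch of that idea, not gensim's actual implementation; the toy documents, the `to_bow` helper, and the way IDs are assigned here are our own choices for illustration.

```python
from collections import Counter

toy_documents = [
    ["resistance", "resistance", "coli"],
    ["antibiotic", "coli"],
]

# Assign each distinct token an integer ID, in order of first appearance
token2id = {}
for doc in toy_documents:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

def to_bow(doc):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_bow = [to_bow(doc) for doc in toy_documents]
print(token2id)
print(toy_bow)
```

Word order is lost in this representation; only which tokens occur, and how often, is kept.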

In [None]:
model = gensim.models.TfidfModel(bow_corpus)

In [None]:
corpus_tfidf = model[bow_corpus]
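
The last two cells fit a TF-IDF model and apply it to every document. As a rough sketch of what such a model computes, the example below assumes one common formulation: a token's weight in a document is its count multiplied by log2(N / df), where N is the number of documents and df is the number of documents containing the token, and each document's weights are then scaled to unit length. This formulation is an assumption made for illustration; gensim's `TfidfModel` has configurable weighting, so consult its documentation for the exact behavior.

```python
import math

# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs
toy_bow = [
    [(0, 2), (1, 1)],
    [(1, 1), (2, 1)],
    [(2, 3), (3, 1)],
]
num_docs = len(toy_bow)

# Document frequency: in how many documents does each token appear?
doc_freq = {}
for doc in toy_bow:
    for token_id, _count in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def tfidf(doc):
    """Weight each token by count * log2(N / df), then scale to unit length."""
    weights = [
        (token_id, count * math.log2(num_docs / doc_freq[token_id]))
        for token_id, count in doc
    ]
    # Drop tokens whose weight is zero (tokens that occur in every document)
    weights = [(token_id, w) for token_id, w in weights if w > 0]
    norm = math.sqrt(sum(w * w for _id, w in weights))
    return [(token_id, w / norm) for token_id, w in weights]

toy_tfidf = [tfidf(doc) for doc in toy_bow]
for doc in toy_tfidf:
    print(doc)
```

In the first toy document, token 0 occurs twice and appears in no other document, so it outweighs token 1, which occurs once and is shared with the second document. Frequent in this document but rare elsewhere is what "significant" means under TF-IDF.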

Now that we have those pieces in place, we can run the following code cells to find the most significant terms, by TF-IDF, in our dataset.

In [None]:
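td = {}  # maps each term to its highest TF-IDF score in any document
for doc in corpus_tfidf:
    for _id, value in doc:
        term = dictionary.get(_id)
        td[term] = max(value, td.get(term, 0.0))  # keep the best score seen so far
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)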

In [None]:
for term, weight in sorted_td[:20]:
    print(term, weight)

Print the most significant word, by TF-IDF, for the first 20 documents in the corpus.

In [None]:
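for n, doc in enumerate(corpus_tfidf):
    if n >= 20:  # stop after the first 20 documents (indices 0 through 19)
        break
    if len(doc) < 1:  # skip documents with no tokens left after cleaning
        continue
    word_id, score = max(doc, key=lambda x: x[1])  # the pair with the highest score
    print(dset.items[n], dictionary.get(word_id), score)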

*Optional (easy):  How would you print the most significant word for the first 8 documents? Modify the code block above and paste your modified code in the code block below.*

Want to learn more and/or try setting up your own Jupyter Notebook?   [This is a great tutorial.](https://www.dataquest.io/blog/jupyter-notebook-tutorial/)