In [1]:
# Remaining to do on this: tweak some of the model language, dumb it down/re-explain e.g. what's a token?

*Teaching note*
* *Before you begin, choose Cell > All Output > Clear from the menu to reset every cell's output to a blank state.*

[Start here](https://hub.tdm-pilot.org/user/jfergusonnortheasternedu/notebooks/jasfTest/About%20the%20Corpus%20and%20About%20Jupyter.ipynb) with the 'About the Corpus/About Jupyter' notebook.

## Finding significant words within a curated dataset

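This notebook demonstrates how to find the significant words in your dataset using a model called TF-IDF (term frequency-inverse document frequency), which scores a word highly when it is frequent in one document but rare across the rest of the dataset. As you work through this notebook you'll take the following steps: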

* Import your dataset
* Find your initial query within your dataset's metadata
* Write a helper function to help clean up a single token
* Clean each document of your dataset, one token at a time
* Use a dictionary of English words to remove words with poor OCR
* Compute the most significant words in your corpus using TF-IDF and a library called gensim

**What's a token?** It's a string of text. For our purposes, think of a token as a single word.
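
To make this concrete, here is a minimal sketch of turning a sentence into tokens and counting them. It uses only the Python standard library. The sample sentence and the simple whitespace split are illustrative assumptions, not a description of how the tokens in your dataset were actually produced.

```python
from collections import Counter

# A made-up sentence used only for illustration.
sentence = "Resistance to the antibiotic spread as the bacteria spread"

# Split on whitespace: each resulting piece is one token.
tokens = sentence.split()

# Count how many times each token appears (a "unigram count").
toy_counts = Counter(tokens)

print(tokens)
print(toy_counts["spread"])  # 'spread' appears twice
print(toy_counts["the"])     # 'the' appears twice
```

Counts like these, one set per document, are the kind of data the rest of this notebook works with.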

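First, we import the Dataset module from the tdm_client library. The tdm_client library contains functions for connecting to the JSTOR server containing our corpus dataset. We also import htrc_corrections, which we'll use later to fix common OCR errors, and gensim, which we'll use to compute TF-IDF.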

In [22]:
import warnings
warnings.filterwarnings('ignore')

import gensim

from tdm_client import Dataset
from tdm_client import htrc_corrections

To analyze your dataset, use the [dataset ID](./key-terms.ipynb/#dataset-ID) provided when you created your [dataset](./key-terms.ipynb/#dataset). A copy of your [dataset ID](./key-terms.ipynb/#dataset-ID) was sent to your email when you created your [corpus](./key-terms.ipynb#corpus). It should look like a long series of characters separated by dashes.

We create a new variable **dset** and initialize its value using the **Dataset** function. A sample **dataset ID** of data derived from searching JSTOR for 'antibiotic' and 'resistance' and 'coli' is provided here ('730b508b-5152-618a-2856-aa1a2900a0b2'). Pasting your unique **dataset ID** here will import your dataset from the JSTOR server.

**Note**: If you are curious what is in your dataset, there is a download link in the email you received. The format and content of the files is described in the notebook [Building a Dataset](./1-building-a-dataset.ipynb). 

In [23]:
dset = Dataset('730b508b-5152-618a-2856-aa1a2900a0b2')

Find the total number of documents in the dataset using the `len()` function. 

In [24]:
len(dset)

8393

Let's double-check this to make sure we have the correct dataset. We can look at the original query by using the `query_text()` method.

In [11]:
dset.query_text()

'"antibiotic coli resistance" from JSTOR from 1985 - 2020'

Now that we've verified that we have the correct corpus, let's create a helper function that can standardize and clean up the tokens in our dataset. The function will:

* Change all tokens (aka words) to lower case. This will make 'Meat' and 'meat' be counted as the same token.
* Use a dictionary from the HathiTrust Research Center to correct common OCR (Optical Character Recognition) problems
* Discard tokens fewer than 4 characters in length. *(Why?)*
* Discard tokens with non-alphabetical characters
* Remove stopwords based on a HathiTrust Research Center stopword list

In [27]:
def process_token(token): # defines a function `process_token` that takes the argument `token`
    token = token.lower() # changes the token to lower case
    corrected = htrc_corrections.get(token) # looks up a correction for the token among common OCR errors
    if corrected is not None: # if a correction exists, use it in place of the original token
        token = corrected
    if len(token) < 4: # discards any tokens that are fewer than 4 characters long
        return None
    if not token.isalpha(): # discards any tokens with non-alphabetic characters
        return None
    return token # returns the cleaned (and possibly corrected) token
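
To see what the helper keeps and what it discards, here is a self-contained sketch. Because `htrc_corrections` comes from `tdm_client`, the sketch substitutes a tiny hypothetical dictionary, `toy_corrections`, with one invented entry; the real corrections list will differ. The function name `process_token_demo` is also hypothetical.

```python
# Hypothetical stand-in for htrc_corrections; the entry is invented.
toy_corrections = {"bacteiia": "bacteria"}


def process_token_demo(token, corrections=toy_corrections):
    """Apply the same steps as process_token, with a swappable corrections dictionary."""
    token = token.lower()
    corrected = corrections.get(token)
    if corrected is not None:
        token = corrected
    if len(token) < 4:
        return None
    if not token.isalpha():
        return None
    return token


# Try a few sample tokens and show what comes back for each.
for raw in ["Meat", "meat", "Bacteiia", "the", "E.coli", "1985"]:
    print(repr(raw), "->", process_token_demo(raw))
```

In this sketch 'Meat' and 'meat' both come back as 'meat', the misspelled token is corrected, and the short, punctuated, and numeric tokens come back as `None`.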

Now let's cycle through each document in the [corpus](./key-terms.ipynb#corpus) with our helper function. This may take a while to run; recall that while a cell is running, you'll see an asterisk in place of its number: `In [*]`.

In [29]:
documents = [] # a list that will hold one cleaned list of tokens per document

for unigram_count in dset.get_features(): # each item maps a document's tokens to their counts
    this_doc = []
    for token, count in unigram_count.items():
        clean_token = process_token(token)
        if clean_token is None: # skip tokens that the helper function discarded
            continue
        this_doc += [clean_token] * count # add the token once for each time it occurs
    documents.append(this_doc)
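
The expression `[clean_token] * count` may look unfamiliar. The sketch below shows the same idea on invented counts for a single document; the names `toy_unigram_count` and `toy_doc` are hypothetical, and a simple length check stands in for the full helper function.

```python
# Invented unigram counts for one document (token -> number of occurrences).
toy_unigram_count = {"resistance": 3, "coli": 2, "of": 5}

toy_doc = []
for token, count in toy_unigram_count.items():
    if len(token) < 4:  # stand-in for the process_token filtering step
        continue
    # [token] * count makes a list containing `count` copies of the token.
    toy_doc += [token] * count

print(toy_doc)
```

Each surviving token is repeated as many times as it occurred, so the resulting list behaves like the document's text with the word order removed.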
                        

In [30]:
# Build a dictionary that maps each unique token to an integer ID.
dictionary = gensim.corpora.Dictionary(documents)

In [31]:
# Convert each document to a bag of words: a list of (token ID, count) pairs.
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]
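
The two cells above first give every distinct token an integer ID and then describe each document as pairs of (token ID, count). Here is a plain-Python sketch of that idea on two invented documents. It is an illustration only: the names `toy_documents`, `toy_token2id`, and `toy_bow` are hypothetical, and gensim may assign IDs in a different order.

```python
from collections import Counter

# Two tiny invented documents, already cleaned into token lists.
toy_documents = [
    ["resistance", "coli", "resistance"],
    ["coli", "plasmid"],
]

# Give each distinct token an integer ID, in order of first appearance.
toy_token2id = {}
for doc in toy_documents:
    for token in doc:
        toy_token2id.setdefault(token, len(toy_token2id))

# Represent each document as sorted (token_id, count) pairs.
toy_bow = [
    sorted(Counter(toy_token2id[token] for token in doc).items())
    for doc in toy_documents
]

print(toy_token2id)
print(toy_bow)
```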

In [32]:
# Fit a TF-IDF model to the bag-of-words corpus.
model = gensim.models.TfidfModel(bow_corpus)

In [33]:
# Apply the model to get a TF-IDF score for each token in each document.
corpus_tfidf = model[bow_corpus]
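
For readers curious about the arithmetic, here is a plain-Python sketch of one common TF-IDF recipe: multiply each token's count by the base-2 logarithm of (number of documents / number of documents containing the token), then scale each document's scores so the vector has length 1. This is assumed to match gensim's default settings, but treat it as an illustration rather than a specification of the library. The data and the names `toy_bow`, `doc_freq`, and `tfidf` are invented.

```python
import math

# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs.
toy_bow = [
    [(0, 2), (1, 1)],
    [(1, 1), (2, 1)],
    [(2, 4)],
]
num_docs = len(toy_bow)

# Document frequency: in how many documents does each token appear?
doc_freq = {}
for doc in toy_bow:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1


def tfidf(doc):
    """Weight one document's tokens and scale the result to unit length."""
    weights = [
        (token_id, count * math.log2(num_docs / doc_freq[token_id]))
        for token_id, count in doc
    ]
    # Drop tokens whose weight is zero (they occur in every document).
    weights = [(token_id, w) for token_id, w in weights if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(token_id, w / norm) for token_id, w in weights]


toy_tfidf = [tfidf(doc) for doc in toy_bow]
for doc in toy_tfidf:
    print(doc)
```

Under this recipe every score falls between 0 and 1, and a token that dominates one document while appearing in few others scores close to 1.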

Find the most significant terms, by TF-IDF, in the curated dataset.

In [34]:
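# Map each term to a TF-IDF score. If a term occurs in several documents,
# the score from the last document processed is the one that is kept.
td = {
    dictionary.get(_id): value
    for doc in corpus_tfidf
    for _id, value in doc
}
# Sort the terms from highest to lowest score.
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)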
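
One detail worth knowing: when a term appears in more than one document, a dictionary comprehension like the one above keeps only the score from the last document it sees. The sketch below uses invented scores to show that behavior next to a hypothetical alternative that keeps each term's highest score. It is an illustration, not a change to this notebook's results; the names `toy_scores`, `last_seen`, and `highest` are made up.

```python
# Toy TF-IDF output: each document is a list of (term, score) pairs.
toy_scores = [
    [("coli", 0.9), ("plasmid", 0.4)],
    [("coli", 0.2)],
]

# A dictionary comprehension keeps the LAST score seen for each term.
last_seen = {term: score for doc in toy_scores for term, score in doc}

# Alternative: keep the HIGHEST score seen for each term.
highest = {}
for doc in toy_scores:
    for term, score in doc:
        highest[term] = max(score, highest.get(term, 0.0))

print(last_seen)
print(highest)
```

Which behavior you want depends on your question: the highest score tells you how distinctive a term is in the document where it matters most.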

In [35]:
for term, weight in sorted_td[:20]:
    print(term, weight)

creg 0.9841662944870286
mdrgnb 0.975557848630102
biba 0.9715613436948968
orndcase 0.9631345518081613
gapr 0.9529489230155735
annatl 0.945784372964842
whmd 0.9408637418840726
npma 0.9357097608681134
mprl 0.9350557470086153
samp 0.9334536614930616
mcjd 0.9311354394764306
lemir 0.9301171834163956
tsll 0.9286491029905937
cpkp 0.9279376596706661
geomorphus 0.924834020485281
becs 0.9233882319798145
bmta 0.9209127411481993
myotubularin 0.9203318325054199
hvraf 0.9194049619562112
crdof 0.9182367837830574


Print the most significant word, by TF-IDF, for the first 20 documents in the corpus.

In [36]:
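for n, doc in enumerate(corpus_tfidf):
    if n >= 20: # stop after the first 20 documents (index 0 through 19)
        break
    if len(doc) < 1: # skip documents that have no scored words
        continue
    word_id, score = max(doc, key=lambda x: x[1]) # the highest-scoring word in this document
    print(dset.items[n], dictionary.get(word_id), score)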

http://www.jstor.org/stable/30142230 mycophagy 0.4446299264467046
http://www.jstor.org/stable/30224786 cimfl 0.6369482204030475
http://www.jstor.org/stable/4091495 caerulescens 0.5779491895296922
http://www.jstor.org/stable/1514055 biocontrol 0.32200863535477997
http://www.jstor.org/stable/1514056 elsas 0.38096349437582494
http://www.jstor.org/stable/newphytologist.196.2.561 lotus 0.5725634239462112
http://www.jstor.org/stable/newphytologist.202.4.1142 phytologist 0.46315267805695803
http://www.jstor.org/stable/20869142 arsenic 0.6492155880829461
http://www.jstor.org/stable/2558634 rhizobium 0.32451365751263717
http://www.jstor.org/stable/2558646 leguminosarum 0.3690918330439759
http://www.jstor.org/stable/2558647 biocontrol 0.4795072654959623
http://www.jstor.org/stable/newphytologist.200.3.847 flic 0.44618239623871403
http://www.jstor.org/stable/40864570 climate 0.7830237862013699
http://www.jstor.org/stable/44614859 flight 0.6111188059818026
http://www.jstor.org/stable/44471701 vcds

*Optional (easy):  How would you print the most significant word for the first 8 documents? Modify the code block above and paste your modified code in the code block below.*

In [37]:
for n, doc in enumerate(corpus_tfidf):
    if n >= 8: # stop after the first 8 documents (index 0 through 7)
        break
    if len(doc) < 1: # skip documents that have no scored words
        continue
    word_id, score = max(doc, key=lambda x: x[1]) # the highest-scoring word in this document
    print(dset.items[n], dictionary.get(word_id), score)

http://www.jstor.org/stable/30142230 mycophagy 0.4446299264467046
http://www.jstor.org/stable/30224786 cimfl 0.6369482204030475
http://www.jstor.org/stable/4091495 caerulescens 0.5779491895296922
http://www.jstor.org/stable/1514055 biocontrol 0.32200863535477997
http://www.jstor.org/stable/1514056 elsas 0.38096349437582494
http://www.jstor.org/stable/newphytologist.196.2.561 lotus 0.5725634239462112
http://www.jstor.org/stable/newphytologist.202.4.1142 phytologist 0.46315267805695803
http://www.jstor.org/stable/20869142 arsenic 0.6492155880829461
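
Because documents with no scored words are skipped, the loop index `n` and the number of lines printed can drift apart. One possible alternative is to stop once a chosen number of results has been collected. The sketch below uses invented data and hypothetical names (`toy_corpus`, `limit`, `top_words`) to show the pattern.

```python
# Toy stand-in for corpus_tfidf: lists of (word, score) pairs; one is empty.
toy_corpus = [
    [("lotus", 0.57), ("root", 0.10)],
    [],
    [("arsenic", 0.65)],
    [("climate", 0.78), ("flight", 0.61)],
]

limit = 2       # how many documents to report
top_words = []  # collected (word, score) results

for doc in toy_corpus:
    if not doc:  # skip documents with no scored words
        continue
    word, score = max(doc, key=lambda pair: pair[1])
    top_words.append((word, score))
    if len(top_words) >= limit:  # stop once enough results are collected
        break

print(top_words)
```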


Want to learn more and/or try setting up your own Jupyter Notebook? [This is a great tutorial.](https://www.dataquest.io/blog/jupyter-notebook-tutorial/)