## Finding significant words within a curated dataset

This notebook demonstrates how to find the significant words in your dataset using a weighting scheme called TF-IDF (term frequency-inverse document frequency).

*Fun fact: TF-IDF was used in early search engines as a way to do relevance ranking, until clever folks figured out a way to break it with keyword stuffing.* 
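
For reference, one common textbook formulation of TF-IDF is shown below. The symbols are introduced here for illustration only. Libraries differ in the logarithm base, smoothing, and normalization they apply, so the scores computed later in this notebook need not match this formula exactly.

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}
$$

Here $\mathrm{tf}(t, d)$ is the number of times term $t$ appears in document $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents that contain $t$. A word that appears in every document has $\log(N/N) = 0$ and so receives no weight, while a word concentrated in a few documents scores highly in those documents.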

As you work through this notebook, you'll take the following steps:

* Import your dataset
* Find your initial query within your dataset's metadata
* Write a helper function to help clean up a single token
* Clean each document of your dataset, one token at a time
* Use a dictionary of known corrections to fix common OCR errors
* Compute the most significant words in your corpus using TF-IDF and a library called gensim

**What's a token?**  It's a string of text. For our purposes, think of a token as a single word.
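
To make this concrete, here is a minimal sketch that splits a sentence into tokens and counts them. The whitespace split and the sample sentence are assumptions for illustration; they are not how the JSTOR dataset itself was tokenized.

```python
from collections import Counter

# A made-up sentence, used only for illustration.
text = "Cats chase cats and dogs chase cats"

# Splitting on whitespace is the simplest possible tokenizer.
tokens = text.split()

# Count how many times each token occurs.
counts = Counter(tokens)

print(tokens)
print(counts)
```

Notice that 'Cats' and 'cats' are counted separately here, which is one reason the cleaning step later in this notebook lowercases every token.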

A quick note before we get started. As you work through this notebook you'll see cells marked ***'optional'***. These are opportunities for you to try modifying and applying Python code to see what happens. I encourage you to try them, but you can also just run the notebook as written.

First we'll import gensim, along with `Dataset` and `htrc_corrections` from the tdm_client library.  The tdm_client library contains functions for connecting to the JSTOR server that hosts our dataset.

In [1]:
import warnings
warnings.filterwarnings('ignore')

import gensim

from tdm_client import Dataset
from tdm_client import htrc_corrections

To analyze your dataset, use the dataset ID provided when you created your dataset.
We create a new variable **dset** and initialize it by passing the ID to **Dataset**. A sample **dataset ID**, for data derived from searching JSTOR for 'antibiotic', 'resistance', and 'coli', is provided here ('730b508b-5152-618a-2856-aa1a2900a0b2'). Pasting your own **dataset ID** in its place will import your dataset from the JSTOR server. (No output will show.)

In [2]:
dset = Dataset('730b508b-5152-618a-2856-aa1a2900a0b2')

Find the total number of documents in the dataset using the `len()` function. 

In [3]:
len(dset)

11111

Let's double-check to make sure we have the correct dataset. 
We can look at the original query by using the query_text method.

In [4]:
dset.query_text()

'antibiotic coli resistance from JSTOR from 1985 - 2020'

Now that we've verified that we have the correct corpus/dataset, let's create a helper function that can standardize and clean up the tokens in it. The function will:

* Change all tokens (aka words) to lower case.  This will make 'Cats' and 'cats' be counted as the same token.
* Use a dictionary from the HathiTrust Research Center to correct common OCR (Optical Character Recognition) problems
* Remove stopwords based on a HathiTrust Research Center stopword list
* Discard tokens with non-alphabetical characters
* Discard tokens less than 4 characters in length

*Question to ponder:* Why do you think we want to discard tokens that are less than 4 characters long?



In [5]:
def process_token(token): #defines a function `process_token` that takes the argument `token`
    token = token.lower() #changes the token to lower case
    corrected = htrc_corrections.get(token) #looks up a known OCR correction for this token; the result is None if there isn't one
    if corrected is not None: #if a correction was found, use it in place of the original token
        token = corrected
    if len(token) < 4: #discards any tokens that are less than 4 characters long
        return None
    if not token.isalpha(): #discards any tokens with non-alphabetic characters
        return None
    return token #returns the cleaned (and possibly corrected) token
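
To see the cleaning rules in action without connecting to the server, here is a small self-contained sketch. The names `demo_corrections`, `demo_stopwords`, and `process_token_demo` are hypothetical stand-ins invented for this example; the real correction table comes from tdm_client. Note that the function above has no stopword check, so the stopword step in this sketch is one possible way to add it, not the notebook's own method.

```python
# Hypothetical stand-ins for the HathiTrust tables (illustration only).
demo_corrections = {"tbe": "the", "bacteriurn": "bacterium"}
demo_stopwords = {"this", "that", "with", "from"}

def process_token_demo(token, corrections=demo_corrections, stopwords=demo_stopwords):
    token = token.lower()                   # 'Cats' and 'cats' become the same token
    token = corrections.get(token, token)   # apply an OCR correction if one is known
    if len(token) < 4 or not token.isalpha():
        return None                         # too short, or contains digits/punctuation
    if token in stopwords:
        return None                         # common word that carries little meaning
    return token

for raw in ["Resistance", "Bacteriurn", "tbe", "E.coli", "With", "1985"]:
    print(repr(raw), "->", process_token_demo(raw))
```

Only the first two sample tokens survive: 'Resistance' is lowercased and 'Bacteriurn' is corrected, while the rest are discarded for being too short, non-alphabetic, or a stopword.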

Now let's cycle through each document in the corpus with our helper function.  This may take a while to run; recall that if it's in process, you'll see this: In [ * ]. (No output will show.)

In [6]:
documents = [] #Creates a new variable `documents` that is a list that will contain all of our documents.

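for unigram_count in dset.get_features(): #loops over the token counts for each document
    this_doc = [] #holds the cleaned tokens for the current document
    for token, count in unigram_count.items():
        clean_token = process_token(token)
        if clean_token is None: #skips tokens that `process_token` discarded
            continue
        this_doc += [clean_token] * count #repeats the token once for each time it occurs
    documents.append(this_doc)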
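
The loop above turns per-document token counts back into a flat list of tokens. Here is a toy sketch of that step, assuming (as the loop suggests) that each document's features behave like a dictionary mapping a token to its count. The names `toy_counts` and `toy_clean` are made up for this example, and `toy_clean` is a simplified stand-in for `process_token`.

```python
# Made-up counts for a single document (illustration only).
toy_counts = {"resistance": 3, "Coli": 2, "of": 5, "1985": 1}

def toy_clean(token):
    # Simplified stand-in for process_token: lowercase, then drop
    # tokens that are short or contain non-alphabetic characters.
    token = token.lower()
    return token if len(token) >= 4 and token.isalpha() else None

toy_doc = []
for token, count in toy_counts.items():
    clean = toy_clean(token)
    if clean is None:
        continue
    toy_doc += [clean] * count  # repeat the token once per occurrence

print(toy_doc)
```

The resulting list contains 'resistance' three times and 'coli' twice; 'of' and '1985' are dropped by the cleaning rules.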
                        

In [7]:
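dictionary = gensim.corpora.Dictionary(documents) #maps each unique token to an integer ID

In [8]:
bow_corpus = [dictionary.doc2bow(doc) for doc in documents] #converts each document to a list of (token ID, count) pairs

In [9]:
model = gensim.models.TfidfModel(bow_corpus) #fits a TF-IDF model to the bag-of-words corpus

In [10]:
corpus_tfidf = model[bow_corpus] #applies the model, giving (token ID, TF-IDF score) pairs for each document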
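
The four cells above can be hard to picture, so here is a plain-Python sketch of roughly what they do, run on a tiny made-up corpus. It is an approximation rather than gensim's actual code: it assumes gensim's default settings as I understand them (raw counts multiplied by a base-2 log of N/df, zero weights dropped, and each document scaled to unit length). All names (`toy_docs`, `token2id`, `bows`, `tfidf`) are invented for this example.

```python
import math
from collections import Counter

# A tiny made-up corpus of already-cleaned documents.
toy_docs = [
    ["coli", "coli", "plasmid", "resistance"],
    ["resistance", "plasmid"],
    ["resistance", "imipenem", "imipenem"],
]

# Step 1 (like Dictionary): give every distinct token an integer ID.
token2id = {}
for doc in toy_docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))
id2token = {i: t for t, i in token2id.items()}

# Step 2 (like doc2bow): each document becomes a list of (ID, count) pairs.
bows = [sorted(Counter(token2id[t] for t in doc).items()) for doc in toy_docs]

# Step 3 (like TfidfModel): count how many documents contain each token.
n_docs = len(bows)
df = Counter(tid for bow in bows for tid, _ in bow)

def tfidf(bow):
    # Weight each token by count * log2(N / df), drop zero weights,
    # then scale the document so its weights have unit length.
    weights = [(tid, cnt * math.log2(n_docs / df[tid])) for tid, cnt in bow]
    weights = [(tid, w) for tid, w in weights if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(tid, w / norm) for tid, w in weights] if norm else []

for bow in bows:
    print([(id2token[tid], round(w, 3)) for tid, w in tfidf(bow)])
```

In this toy corpus 'resistance' appears in every document, so it receives no weight at all, while 'coli' and 'imipenem' dominate the documents they appear in. The unit-length scaling is also why every score in the output later in this notebook falls between 0 and 1.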

Now that we have those pieces in place, we can run the following code cells to find the most significant terms, by TF-IDF, in our dataset.

In [11]:
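#Builds a dictionary of term -> TF-IDF score across all documents.
#If a term appears in several documents, later scores overwrite earlier ones,
#so each term keeps the score from the last document that contains it.
td = {
    dictionary.get(_id): value
    for doc in corpus_tfidf
    for _id, value in doc
}
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True) #sorts terms from highest to lowest score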

In [12]:
for term, weight in sorted_td[:20]:
    print(term, weight)

creg 0.985472072383967
mdrgnb 0.9752274211736038
biba 0.9710149484549401
orndcase 0.9649982520576849
letx 0.9643944773788852
cidl 0.9586607209180106
gapr 0.957340331085051
rpar 0.9554485770666841
cjfur 0.9521815564827408
annatl 0.9506783080042027
xylitol 0.9495181584343583
whmd 0.9449462729206862
rrfhcp 0.9431778937694659
rifaximin 0.9410790102054745
lemir 0.9378245131200877
mcjd 0.9372176832162501
squalamine 0.9361024495260135
repa 0.9351415405166057
sqrr 0.9319239029829915
npma 0.9308989171767714
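
One detail worth knowing about cell [11]: a dictionary comprehension keeps only the last value assigned to each key, so a term that occurs in many documents ends up with the score from whichever document was processed last. If you would rather rank terms by their highest score in any document, the following sketch shows one alternative. The data and names (`toy_tfidf`, `last_seen`, `best`) are made up, and the terms are written as strings instead of token IDs to keep the example short.

```python
# Made-up TF-IDF output for two documents (illustration only).
toy_tfidf = [
    [("coli", 0.9), ("plasmid", 0.2)],
    [("coli", 0.1), ("imipenem", 0.7)],
]

# Same pattern as the notebook: later documents overwrite earlier ones.
last_seen = {term: score for doc in toy_tfidf for term, score in doc}

# Alternative: remember the highest score each term reaches in any document.
best = {}
for doc in toy_tfidf:
    for term, score in doc:
        if score > best.get(term, 0.0):
            best[term] = score

print("last seen:", last_seen["coli"])
print("best:", best["coli"])
```

Which of the two is more useful depends on your question; neither is the single correct way to summarize scores across a corpus.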


Print the most significant word, by TF-IDF, for the first 20 documents in the corpus.

In [13]:
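for n, doc in enumerate(corpus_tfidf):
    if n >= 20: #stops once the first 20 documents have been examined
        break
    if len(doc) < 1: #skips documents with no tokens left after cleaning
        continue
    word_id, score = max(doc, key=lambda x: x[1]) #finds the (token ID, score) pair with the highest score
    print(dset.items[n], dictionary.get(word_id), score)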

http://www.jstor.org/stable/30085290 imipenem 0.51903961751567
http://www.jstor.org/stable/23361671 piperacillin 0.42524608618659304
http://www.jstor.org/stable/26159053 dysenteriae 0.6768243695571163
http://www.jstor.org/stable/41435597 empirical 0.3587476010614525
http://www.jstor.org/stable/23049339 avleal 0.5022447328372467
http://www.jstor.org/stable/2354239 multiacetylation 0.3279667356249203
http://www.jstor.org/stable/43872707 eaggec 0.32786877806876613
http://www.jstor.org/stable/41328436 trgnb 0.47431510044765346
http://www.jstor.org/stable/j.ctvb4bssr.20 fructose 0.856928157119235
http://www.jstor.org/stable/24590958 eaea 0.5129162634554189
http://www.jstor.org/stable/24809490 cpka 0.8303579361928378
http://www.jstor.org/stable/3372748 pneumoniae 0.4657366194728223
http://www.jstor.org/stable/34198 gsls 0.6185192757750724
http://www.jstor.org/stable/10.1086/664768 meat 0.4116906562360587
http://www.jstor.org/stable/24077776 waldvogel 0.2974469445257273
http://www.jstor.org/s

*Optional:  How would you print the most significant word for the **first 8 documents**? Modify the code block above and paste your modified code in the code block below.*

In [14]:
for n, doc in enumerate(corpus_tfidf):
    if n >= 8: #stops once the first 8 documents have been examined
        break
    if len(doc) < 1: #skips documents with no tokens left after cleaning
        continue
    word_id, score = max(doc, key=lambda x: x[1]) #finds the (token ID, score) pair with the highest score
    print(dset.items[n], dictionary.get(word_id), score)

http://www.jstor.org/stable/30085290 imipenem 0.51903961751567
http://www.jstor.org/stable/23361671 piperacillin 0.42524608618659304
http://www.jstor.org/stable/26159053 dysenteriae 0.6768243695571163
http://www.jstor.org/stable/41435597 empirical 0.3587476010614525
http://www.jstor.org/stable/23049339 avleal 0.5022447328372467
http://www.jstor.org/stable/2354239 multiacetylation 0.3279667356249203
http://www.jstor.org/stable/43872707 eaggec 0.32786877806876613
http://www.jstor.org/stable/41328436 trgnb 0.47431510044765346
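
When limiting a loop to the first N documents, it is easy to be off by one. A possible alternative is to let `itertools.islice` do the counting, and to use `heapq.nlargest` if you want more than one top word per document. The sketch below runs on made-up data; `toy_tfidf` and `toy_words` are hypothetical stand-ins for `corpus_tfidf` and `dictionary`.

```python
import heapq
from itertools import islice

# Made-up TF-IDF output: one list of (token ID, score) pairs per document.
toy_tfidf = [
    [(0, 0.2), (1, 0.9), (2, 0.4)],
    [],                      # a document with nothing left after cleaning
    [(0, 0.7), (3, 0.6)],
    [(2, 0.5)],
]
toy_words = {0: "coli", 1: "imipenem", 2: "plasmid", 3: "fructose"}

results = []
# islice yields exactly the first 3 documents, so no manual break is needed.
for n, doc in enumerate(islice(toy_tfidf, 3)):
    if not doc:
        continue  # skip empty documents
    # Take the 2 highest-scoring (token ID, score) pairs in this document.
    top = heapq.nlargest(2, doc, key=lambda pair: pair[1])
    results.append((n, [toy_words[word_id] for word_id, _ in top]))

print(results)
```

The fourth document is never visited, and the empty second document is skipped, so only two of the first three documents produce output.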


Want to learn more and/or try setting up your own Jupyter Notebook?   [This is a great tutorial.](https://www.dataquest.io/blog/jupyter-notebook-tutorial/)