# Finding significant words within a curated dataset

This notebook demonstrates how to find the significant words in your dataset using [tf-idf](./key-terms.ipynb#tf-idf). The following processes are described:

* Importing your [dataset](./key-terms.ipynb#dataset)
* Finding your initial query within your [dataset's](./key-terms.ipynb#dataset) metadata
* Writing a helper function to help clean up a single [token](./key-terms.ipynb#token)
* Cleaning each document of your dataset, one [token](./key-terms.ipynb#token) at a time
* Using a dictionary of English words to remove words with poor [OCR](./key-terms.ipynb#ocr)
* Computing the most significant words in your [corpus](./key-terms.ipynb#corpus) using [TFIDF](./key-terms.ipynb#tf-idf) with the [gensim](./key-terms.ipynb#gensim) library

A familiarity with gensim is helpful but not required.
____
We import `gensim`, along with `Dataset` and `htrc_corrections` from the `tdm_client` library. The `tdm_client` library contains functions for connecting to the JSTOR server containing our [corpus](./key-terms.ipynb#corpus) [dataset](./key-terms.ipynb#dataset).

In [12]:
#import warnings
#warnings.filterwarnings('ignore')

import gensim

from tdm_client import Dataset
from tdm_client import htrc_corrections

To analyze your dataset, use the [dataset ID](./key-terms.ipynb/#dataset-ID) provided when you created your [dataset](./key-terms.ipynb/#dataset). A copy of your [dataset ID](./key-terms.ipynb/#dataset-ID) was sent to your email when you created your [corpus](./key-terms.ipynb#corpus). It should look like a long series of characters separated by dashes.

We create a new variable **dset** and initialize it by calling **Dataset** with a **dataset ID**. A sample **dataset ID** of journals focused on Shakespeare is provided in the code cell that follows. Replacing it with your unique **dataset ID** will import your dataset from the JSTOR server.

**Note**: If you are curious what is in your dataset, there is a download link in the email you received. The format and content of the files are described in the notebook [Building a Dataset](./1-building-a-dataset.ipynb).

In [13]:
dset = Dataset('59c090b6-3851-3c65-e016-9181833b4a2c')

Find the total number of documents in the dataset using the `len()` function.

In [14]:
len(dset)

6687

To check that this is the correct dataset, we can look at the original query using the `query_text()` method.

In [15]:
dset.query_text()

'All documents from JSTOR published in Shakespeare Quarterly from 1700 - 2019'

Now that we've verified that we have the correct [corpus](./key-terms.ipynb#corpus), let's create a helper function that can standardize and [clean](./key-terms.ipynb#clean-data) up the [tokens](./key-terms.ipynb#token) in our [dataset](./key-terms.ipynb#dataset). The function will:
* lowercase all [tokens](./key-terms.ipynb#token)
* use a dictionary from [The HathiTrust Research Center](./key-terms.ipynb#htrc) to correct common [Optical Character Recognition](./key-terms.ipynb#ocr) problems
* discard [tokens](./key-terms.ipynb#token) shorter than 4 characters
* discard [tokens](./key-terms.ipynb#token) containing non-alphabetic characters
* remove [stopwords](./key-terms.ipynb#stop-words) based on a [The HathiTrust Research Center](./key-terms.ipynb#htrc) [stopword](./key-terms.ipynb#stop-words) list (this step is not implemented in the function shown here)

In [16]:
def process_token(token):
    """Return a cleaned version of `token`, or None if it should be discarded."""
    # Lowercase the token
    token = token.lower()
    # Look up a correction for common OCR errors; None means no correction was found
    corrected = htrc_corrections.get(token)
    if corrected is not None:
        token = corrected
    # Discard tokens shorter than four characters
    if len(token) < 4:
        return None
    # Discard tokens containing non-alphabetic characters
    if not token.isalpha():
        return None
    # Return the cleaned (and possibly corrected) token
    return token
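
To see how the helper behaves, here is a minimal, self-contained sketch. It assumes `htrc_corrections` acts like a dictionary that maps a misread token to its correction. The two entries in `sample_corrections` are invented for illustration and are not taken from the HathiTrust list, and `process_token_demo` is a hypothetical copy of the helper that takes the mapping as an argument.

```python
# Hypothetical stand-in for `htrc_corrections`; these entries are invented.
sample_corrections = {"fhall": "shall", "moft": "most"}

def process_token_demo(token, corrections=sample_corrections):
    """Same steps as `process_token`, but with an explicit corrections mapping."""
    token = token.lower()
    corrected = corrections.get(token)
    if corrected is not None:
        token = corrected
    # Discard short tokens and tokens with non-alphabetic characters
    if len(token) < 4 or not token.isalpha():
        return None
    return token

cleaned = [process_token_demo(t) for t in ["Fhall", "the", "1599", "Hamlet", "moft"]]
print(cleaned)
```

Tokens that come back as `None` are the ones the cleaning loop will skip.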

Now let's cycle through each document in the [corpus](./key-terms.ipynb#corpus) with our helper function.

In [33]:
documents = []  # one list of cleaned tokens per document

# Each item from `dset.get_features()` maps tokens to their counts in one document
for unigram_count in dset.get_features():
    this_doc = []
    for token, count in unigram_count.items():
        clean_token = process_token(token)
        if clean_token is None:
            continue  # token was discarded by `process_token`
        # Repeat the cleaned token once per occurrence in the document
        this_doc += [clean_token] * count
    documents.append(this_doc)

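
The loop above turns per-document token counts back into flat token lists, which is the form passed to gensim in the next cells. A small sketch with made-up counts follows; the `sample_counts` dictionary and the simplified `keep` filter are illustrative assumptions, not output from `dset.get_features()`.

```python
# Invented unigram counts for a single document.
sample_counts = {"Hamlet": 3, "of": 12, "ghost": 2, "1603": 1}

def keep(token):
    """Simplified stand-in for `process_token`: lowercase, drop short or non-alphabetic tokens."""
    token = token.lower()
    return token if len(token) >= 4 and token.isalpha() else None

doc_tokens = []
for token, count in sample_counts.items():
    clean = keep(token)
    if clean is None:
        continue
    doc_tokens += [clean] * count  # one copy per occurrence

print(doc_tokens)
```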

In [18]:
dictionary = gensim.corpora.Dictionary(documents)

In [19]:
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]
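
`Dictionary` assigns an integer id to each distinct token, and `doc2bow` converts a token list into `(id, count)` pairs. The plain-Python sketch below mimics that idea on toy data without calling gensim. The names `toy_documents`, `token2id`, and `to_bow` are hypothetical, and the ids assigned here are arbitrary and need not match the ids gensim would choose.

```python
from collections import Counter

toy_documents = [
    ["hamlet", "ghost", "hamlet"],
    ["ghost", "crown"],
]

# Map each distinct token to an integer id, in order of first appearance.
token2id = {}
for doc in toy_documents:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_bow = [to_bow(doc) for doc in toy_documents]
print(token2id)
print(toy_bow)
```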

In [20]:
model = gensim.models.TfidfModel(bow_corpus)

In [21]:
corpus_tfidf = model[bow_corpus]
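
To make the scores easier to interpret, here is a rough plain-Python sketch of the weighting. It assumes gensim's default settings as we understand them: raw term frequency multiplied by log2(N / document frequency), followed by scaling each document vector to unit length. The toy corpus and the `tfidf` helper are invented for illustration and do not call gensim.

```python
import math

# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs.
toy_bow = [
    [(0, 2), (1, 1)],
    [(1, 1), (2, 1), (3, 1)],
    [(1, 3)],
]

n_docs = len(toy_bow)

# Count how many documents contain each token id.
doc_freq = {}
for doc in toy_bow:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def tfidf(doc):
    """Weight one document and scale it to unit Euclidean length."""
    weights = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in doc]
    weights = [(tid, w) for tid, w in weights if w > 0]  # drop zero-weight terms
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(tid, w / norm) for tid, w in weights]

toy_tfidf = [tfidf(doc) for doc in toy_bow]
for doc in toy_tfidf:
    print(doc)
```

Under this scheme, a token that occurs in every document gets weight 0, and a document with only one remaining non-zero term gives that term a score of 1.0. A document whose tokens all occur everywhere ends up empty.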

Find the most significant terms, by TFIDF, in the curated dataset. 

In [22]:
td = {
    dictionary.get(_id): value
    for doc in corpus_tfidf
    for _id, value in doc
}
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)
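
Because a dictionary comprehension overwrites repeated keys, `td` records each term's score from the last document in which it appears, which is not necessarily its highest score. If the highest score per term is what you want, one possible variant keeps the maximum instead. This is a sketch on invented data: `toy_tfidf` and `id2token` are placeholders for `corpus_tfidf` and `dictionary`.

```python
# Invented tf-idf vectors: lists of (token_id, score) pairs per document.
toy_tfidf = [
    [(0, 0.9), (1, 0.4)],
    [(0, 0.2), (2, 0.7)],
]
id2token = {0: "henslowe", 1: "ghost", 2: "crown"}  # placeholder for `dictionary`

best = {}
for doc in toy_tfidf:
    for token_id, score in doc:
        term = id2token[token_id]
        # Keep the highest score seen so far for this term.
        best[term] = max(score, best.get(term, 0.0))

ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```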

In [23]:
for term, weight in sorted_td[:25]:
    print(term, weight)

ofamiem 1.0
cwtrnca 0.9863483873087012
chartres 0.9273378589673646
worken 0.9207300498619446
sobran 0.9075508762980733
nuimber 0.8775001624677011
weingust 0.8755826466137229
rudanko 0.86000135238765
enbiemata 0.8563716462679598
weils 0.8472587947507879
sliv 0.8394611060327918
snuggs 0.8381178515481901
ouderdom 0.8316113874601273
habib 0.8308118578625007
buzacott 0.8303061621922353
gaiicanus 0.8294403338236659
holmer 0.8201937618215803
spectogram 0.817057687336797
reproducedfrmtefgr 0.8139670501922074
womersley 0.8080655409128259
dulcitius 0.8048781892048817
margolies 0.7961508372216022
dugas 0.7868223742710501
willeford 0.7859219459401604
tcad 0.7819493652557465


Print the most significant word, by TFIDF, for the first 50 documents in the corpus. 

In [24]:
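for n, doc in enumerate(corpus_tfidf):
    if n >= 50:  # stop after the first 50 documents
        break
    if len(doc) < 1:  # skip documents with no remaining tokens
        continue
    # Pick the (word_id, score) pair with the highest tf-idf score
    word_id, score = max(doc, key=lambda x: x[1])
    print(dset.items[n], dictionary.get(word_id), score)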

http://www.jstor.org/stable/2869980 henslowe 0.32498040467805084
http://www.jstor.org/stable/2870198 nimrod 0.1616432176708553
http://www.jstor.org/stable/2870199 beatrice 0.22776869095345914
http://www.jstor.org/stable/2870209 donaldson 0.47661799586847836
http://www.jstor.org/stable/2870208 cheng 0.5913759877754204
http://www.jstor.org/stable/2870189 antonio 0.44365366900689435
http://www.jstor.org/stable/2870193 painting 0.4091485612863863
http://www.jstor.org/stable/2870188 edgar 0.3258325027350309
http://www.jstor.org/stable/2870203 hartwig 0.6147359515246972
http://www.jstor.org/stable/2870194 hall 0.5121228500237642
http://www.jstor.org/stable/2870206 novy 0.5473522309642914
http://www.jstor.org/stable/2870202 booth 0.3863680373528867
http://www.jstor.org/stable/2870313 vizcaya 0.601154198547381
http://www.jstor.org/stable/2870307 dennis 0.3266904513150024
http://www.jstor.org/stable/2870327 hollar 0.6487395808866938
http://www.jstor.org/stable/2870308 longleat 0.502334217047471