# Finding significant words within a curated dataset

This notebook demonstrates how to find the significant words in your dataset using [tf-idf](./key-terms.ipynb#tf-idf). The following processes are described:

* Importing your [dataset](./key-terms.ipynb#dataset)
* Finding your initial query within your [dataset's](./key-terms.ipynb#dataset) metadata
* Writing a helper function to help clean up a single [token](./key-terms.ipynb#token)
* Cleaning each document of your dataset, one [token](./key-terms.ipynb#token) at a time
* Using a dictionary of corrections to fix words with poor [OCR](./key-terms.ipynb#ocr)
* Computing the most significant words in your [corpus](./key-terms.ipynb#corpus) using [TFIDF](./key-terms.ipynb#tf-idf) with the [gensim](./key-terms.ipynb#gensim) library

A familiarity with gensim is helpful but not required.
____
We import `Dataset` and `htrc_corrections` from the `tdm_client` library, along with `gensim`. The `tdm_client` library contains functions for connecting to the JSTOR server containing our [corpus](./key-terms.ipynb#corpus) [dataset](./key-terms.ipynb#dataset).

In [5]:
import warnings
warnings.filterwarnings('ignore')

import gensim

from tdm_client import Dataset
from tdm_client import htrc_corrections

To analyze your dataset, use the [dataset ID](./key-terms.ipynb/#dataset-ID) provided when you created your [dataset](./key-terms.ipynb/#dataset). A copy of your [dataset ID](./key-terms.ipynb/#dataset-ID) was sent to your email when you created your [corpus](./key-terms.ipynb#corpus). It should look like a long series of characters separated by dashes.

We create a new variable **dset** and initialize it by calling **Dataset** with a dataset ID. A sample **dataset ID** for a dataset of journals focused on Shakespeare is used in the cell below. Pasting your unique **dataset ID** in its place will import your dataset from the JSTOR server.

**Note**: If you are curious about what is in your dataset, there is a download link in the email you received. The format and content of the files are described in the notebook [Building a Dataset](./1-building-a-dataset.ipynb).

In [6]:
dset = Dataset('59c090b6-3851-3c65-e016-9181833b4a2c')

Find the total number of documents in the dataset using the `len()` function.

In [7]:
len(dset)

6687

To check that this is the correct dataset, we can look at the original query by calling the `query()` method.

In [8]:
dset.query()

'q=*%3A*&fq=yearPublished%3A%5B1700%20TO%202019%5D&fq=-provider%3Aportico&fq=isPartOf%3A(%22Shakespeare%20Quarterly%22)'

This string is part of the URL used for your initial search. It is normally interpreted by the computer, but we can parse it if we keep in mind a few rules:

* Each part of the query is separated by an `&`
* It uses URL encoding to represent characters. Where there is a `%`, a special character is being encoded:
    * %20 is a single space ` `
    * %3A is a `:`
    * %5B is a `[`
    * %5D is a `]`

Alternatively, we could decode the URL using the `urllib` library.

In [9]:
import urllib.parse
encodedStr = 'q=shakespeare&start=0&rows=20&fq=yearPublished%3A%5B1900%20TO%202019%5D&fq=category%3A(%22Literature%20(General)%22%20OR%20%22English%20literature%22)'
urllib.parse.unquote(encodedStr)

'q=shakespeare&start=0&rows=20&fq=yearPublished:[1900 TO 2019]&fq=category:("Literature (General)" OR "English literature")'

In the example:
* the original query was `shakespeare`
* published from `1900 to 2019`
* found within the categories `Literature (General)` or `English literature`
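The two rules above (split on `&`, then decode the `%` escapes) can also be applied in one step. As a minimal sketch, the standard library's `urllib.parse.parse_qs` does both and groups repeated keys such as `fq` into lists. The query string is the one returned by `dset.query()` earlier; the variable names are only illustrative.

```python
from urllib.parse import parse_qs

# Query string copied from the dset.query() output shown earlier
raw_query = (
    "q=*%3A*&fq=yearPublished%3A%5B1700%20TO%202019%5D"
    "&fq=-provider%3Aportico&fq=isPartOf%3A(%22Shakespeare%20Quarterly%22)"
)

# parse_qs splits on "&", decodes %-escapes, and collects repeated keys
parsed_query = parse_qs(raw_query)

for key, values in parsed_query.items():
    for value in values:
        print(key, "=", value)
```

Read this way, the query for this dataset has three `fq` entries: a publication-year range, a `provider` entry prefixed with `-`, and an `isPartOf` entry naming Shakespeare Quarterly.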
___

Now that we've verified that we have the correct corpus, let's create a helper function that can standardize and [clean](./key-terms.ipynb#clean-data) up the [tokens](./key-terms.ipynb#token) in our [dataset](./key-terms.ipynb#dataset). The function will:
* lowercase all tokens
* use an HTRC dictionary to correct common OCR problems
* discard tokens less than 4 characters in length
* discard tokens with non-alphabetical characters

In [10]:
def process_token(token):
    """Clean a single token; return None if it should be discarded."""
    token = token.lower()  # normalize to lowercase
    corrected = htrc_corrections.get(token)  # look up a fix for a common OCR error
    if corrected is not None:  # use the correction when one exists
        token = corrected
    if len(token) < 4:  # discard tokens shorter than four characters
        return None
    if not token.isalpha():  # discard tokens containing non-alphabetic characters
        return None
    return token  # the cleaned (and possibly corrected) token
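To see what the helper keeps and discards, here is a self-contained sketch of the same logic. It assumes `htrc_corrections` behaves like a dictionary mapping misread tokens to corrected ones (the code above only uses its `.get()` method). The name `toy_corrections` and its single entry are hypothetical stand-ins, not real HTRC data.

```python
# Hypothetical stand-in for htrc_corrections (assumed to be dict-like)
toy_corrections = {"fhall": "shall"}

def process_token_sketch(token, corrections=toy_corrections):
    """Same steps as process_token, with the corrections table passed in."""
    token = token.lower()
    token = corrections.get(token, token)  # apply a correction if one exists
    if len(token) < 4 or not token.isalpha():
        return None  # too short or contains non-letters: discard
    return token

samples = ["Hamlet", "fhall", "the", "1623", "well-known"]
cleaned = [process_token_sketch(t) for t in samples]
print(cleaned)
```

Only the first two samples survive: `Hamlet` is lowercased and `fhall` is corrected, while the short word, the number, and the hyphenated token are dropped.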

Now let's cycle through each document in the [corpus](./key-terms.ipynb#corpus) with our helper function.

In [11]:
documents = []  # one list of cleaned tokens per document

for unigram_count in dset.get_features():  # token counts for each document
    this_doc = []
    for token, count in unigram_count.items():
        clean_token = process_token(token)
        if clean_token is None:  # token was discarded by the helper
            continue
        this_doc += [clean_token] * count  # repeat the token once per occurrence
    documents.append(this_doc)
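The loop above turns each document's token counts back into a flat list of tokens, repeating each token as many times as it was counted. A toy illustration follows, assuming each item yielded by `dset.get_features()` behaves like a dictionary of token counts (as the `.items()` call suggests). The counts are made up, and a simple length check stands in for `process_token`.

```python
# Made-up unigram counts for one document
toy_counts = {"hamlet": 3, "ghost": 1, "of": 12}

toy_doc = []
for token, count in toy_counts.items():
    if len(token) < 4:  # stand-in for the process_token filter
        continue
    toy_doc += [token] * count  # repeat the token `count` times

print(toy_doc)
```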
                    

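In [12]:
# Map each unique token to an integer ID
dictionary = gensim.corpora.Dictionary(documents)

In [13]:
# Convert each document to a bag of words: a list of (token ID, count) pairs
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

In [14]:
# Fit a TF-IDF model on the bag-of-words corpus
model = gensim.models.TfidfModel(bow_corpus)

In [15]:
# Apply the model to get (token ID, TF-IDF weight) pairs for each document
corpus_tfidf = model[bow_corpus]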
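The sketch below approximates what `TfidfModel` computes, assuming gensim's default weighting: the raw term count multiplied by log2(N / df), with each document vector then scaled to unit length. It is plain Python on made-up documents, not gensim's implementation, and the names `tfidf_sketch` and `toy_docs` are hypothetical.

```python
import math
from collections import Counter

def tfidf_sketch(documents):
    """Return one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        # Raw term count times log2(N / df)
        raw = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        # Terms that occur in every document get weight 0 and are dropped
        raw = {t: w for t, w in raw.items() if w > 0}
        # Scale the document vector to unit (L2) length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()} if norm else {})
    return weighted

toy_docs = [
    ["hamlet", "ghost", "hamlet", "play"],
    ["othello", "handkerchief", "play"],
    ["play", "folger"],
]
toy_weights = tfidf_sketch(toy_docs)
for weights in toy_weights:
    print(weights)
```

Under these assumptions, a term that appears in every document (`play` here) contributes nothing, and a document left with a single distinctive term gives that term a weight of 1.0 after scaling.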

Find the most significant terms, by TFIDF, in the curated dataset. 

In [16]:
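# Map each term to a TF-IDF score. A term that occurs in several documents
# keeps the score from the last document processed.
td = {
    dictionary.get(_id): value
    for doc in corpus_tfidf
    for _id, value in doc
}
# Sort terms from highest to lowest score
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)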

In [17]:
for term, weight in sorted_td[:25]:
    print(term, weight)

ofamiem 1.0
cwtrnca 0.9816048327406713
zamir 0.909664366407156
reprint 0.8912794505341458
ouderdom 0.8907547566168921
worken 0.889452318558063
falocco 0.8735400172020873
werken 0.8619900346951035
weingust 0.8366952585903492
witmore 0.8101754277391725
falco 0.7687425271759438
womersley 0.7673648412212968
wynkyn 0.7478190668986658
honan 0.747047406836647
unton 0.7389050311436763
willeford 0.7313129489069663
mebane 0.7205366793489603
demea 0.717697038586963
matz 0.7110635758552036
alceste 0.7093984204685807
whitefriars 0.7043184902613787
athelstan 0.7040717360868357
hrotsvits 0.70280612009222
foulkes 0.7017590805228023
playbookes 0.6942220407478248
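The dictionary comprehension used for this ranking stores one score per term, so when a term occurs in several documents, the score from the last document wins. If you would rather rank terms by their highest score in any document, a sketch along these lines could be used. The `toy_tfidf` and `toy_id2term` values are hypothetical stand-ins shaped like `corpus_tfidf` (lists of `(id, weight)` pairs) and `dictionary`.

```python
# Hypothetical stand-ins for dictionary and corpus_tfidf
toy_id2term = {0: "hamlet", 1: "ghost", 2: "folger"}
toy_tfidf = [
    [(0, 0.9), (1, 0.4)],
    [(0, 0.2), (2, 1.0)],
]

best_score = {}
for doc in toy_tfidf:
    for term_id, weight in doc:
        term = toy_id2term[term_id]
        # Keep the largest weight seen so far for this term
        best_score[term] = max(weight, best_score.get(term, 0.0))

ranked = sorted(best_score.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Here `hamlet` keeps its 0.9 from the first document, whereas the overwrite approach would have recorded 0.2.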


Print the most significant word, by TFIDF, for the first 50 documents in the corpus. 

In [18]:
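for n, doc in enumerate(corpus_tfidf):
    if n >= 50:  # stop after the first 50 documents
        break
    if len(doc) < 1:  # skip documents with no remaining terms
        continue
    word_id, score = max(doc, key=lambda x: x[1])  # highest-scoring term
    print(dset.items[n], dictionary.get(word_id), score)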

http://www.jstor.org/stable/24778431 handkerchief 0.5104825155260231
http://www.jstor.org/stable/24778442 altman 0.5338518939832136
http://www.jstor.org/stable/24778441 shakespeareans 0.26752472146914286
http://www.jstor.org/stable/44990760 zeeb 0.19243734202627377
http://www.jstor.org/stable/44990258 folger 0.20695590614013684
http://www.jstor.org/stable/44991483 item 0.6937914194847521
http://www.jstor.org/stable/44990809 institute 0.2708117216896589
http://www.jstor.org/stable/44990755 item 0.6338561247748161
http://www.jstor.org/stable/44990251 jackson 0.20811283511274925
http://www.jstor.org/stable/44990806 facsimile 0.15689657515431293
http://www.jstor.org/stable/44990805 item 0.5122766995683712
http://www.jstor.org/stable/2866476 entrance 0.5893884586749324
http://www.jstor.org/stable/2866484 ducis 0.446803200335144
http://www.jstor.org/stable/2866493 professor 0.35995179197163296
http://www.jstor.org/stable/2868727 survey 0.4219999222625147
http://www.jstor.org/stable/2868724 d