# Finding significant words within a curated dataset

This notebook demonstrates how to find the significant words in your dataset using [tf-idf](./key-terms.ipynb#tf-idf). The following processes are described:

* Importing your [dataset](./key-terms.ipynb#dataset)
* Finding your initial query within your [dataset's](./key-terms.ipynb#dataset) metadata
* Writing a helper function to help clean up a single [token](./key-terms.ipynb#token)
* Cleaning each document of your dataset, one [token](./key-terms.ipynb#token) at a time
* Using a dictionary of English words to remove words with poor [OCR](./key-terms.ipynb#ocr)
* Computing the most significant words in your [corpus](./key-terms.ipynb#corpus) using [TFIDF](./key-terms.ipynb#tf-idf) with the [gensim](./key-terms.ipynb#gensim) library

A familiarity with gensim is helpful but not required.
____
We import the `Dataset` module from the `tdm_client` library. The tdm_client library contains functions for connecting to the JSTOR server containing our [corpus](./key-terms.ipynb#corpus) [dataset](./key-terms.ipynb#dataset). 

In [2]:
from tdm_client import Dataset
from tdm_client import htrc_corrections, htrc_stopwords

We will also import `warnings` to XXXXX and [gensim](https://radimrehurek.com/gensim/index.html), a Python library to help us calculate the significant words in our text using using the [TFIDF](./key-terms.ipynb#tf-idf) method.

In [3]:
import warnings
warnings.filterwarnings('ignore')

import gensim

To analyze your dataset, use the [dataset ID](./key-terms.ipynb/#dataset-ID) provided when you created your [dataset](./key-terms.ipynb/#dataset). A copy of your [dataset ID](./key-terms.ipynb/#dataset-ID) was sent to your email when you created your [corpus](./key-terms.ipynb#corpus). It should look like a long series of characters surrounded by dashes.  

We create a new variable **dset** and initialize its value using the **Dataset** function. A sample **dataset ID** of journals focused on Shakespeare is provided here ('a517ef1f-0794-48e4-bea1-ac4fb8b312b4'). Pasting your unique **dataset ID** here will import your dataset from the JSTOR server.

**Note**: If you are curious what is in your dataset, there is a download link in the email you received. The format and content of the files is described in the notebook [Building a Dataset](./1-building-a-dataset.ipynb). 

In [4]:
dset = Dataset('a517ef1f-0794-48e4-bea1-ac4fb8b312b4')

Find total number of documents in the dataset using the `len()` function. 

In [5]:
len(dset)

1000

To check if this is the correct dataset, we can look at the original query by using the query attribute.

In [6]:
dset.query()

'q=shakespeare&start=0&rows=20&fq=yearPublished%3A%5B1900%20TO%202019%5D&fq=category%3A(%22Literature%20(General)%22%20OR%20%22English%20literature%22)'

This string is part of the URL used for your initial search. It is written in  It is normally interpreted by the computer, but we can parse it if we keep in mind a few rules:

* Each part of the query is separated by an `&`
* It uses URL enconding to represent characters. Where there is a `%`, a special character is being encoded:
    * %20 is a single space ` `
    * %3A is a `:`
    * %5B is a ``[``
    * %5D is a `]`

Alternatively, we could decode the URL using `urllib` library. 

In [7]:
import urllib.parse
encodedStr = 'q=shakespeare&start=0&rows=20&fq=yearPublished%3A%5B1900%20TO%202019%5D&fq=category%3A(%22Literature%20(General)%22%20OR%20%22English%20literature%22)'
urllib.parse.unquote(encodedStr)

'q=shakespeare&start=0&rows=20&fq=yearPublished:[1900 TO 2019]&fq=category:("Literature (General)" OR "English literature")'

In the example:
* the original query was `shakespeare`
* published from `1900 to 2019`
* found within the categories `Literature (General)` or `English literature`
___

Now that we've verified that we have the correct corpus, let's create a helper function that can standardize and [clean](./key-terms.ipynb#clean-data) up the [tokens](./key-terms.ipynb#token) in our [dataset](./key-terms.ipynb#dataset). The function will:
* lower cases all tokens
* use an HTRC dictionary to correct common OCR problems
* discard tokens less than 4 characters in length
* discard tokens with non-alphabetical characters
* remove stopwords based on an HTRC stopword list

In [8]:
def process_token(token): #define a function `process_token` that takes the argument `token`
    token = token.lower() #set the string in token to a new string with all lowercase letters
    corrected = htrc_corrections.get(token) #initialize a new variable `corrected` that runs toke through the `htrc_corrections.get()` function to fix common OCR errors
    if corrected is not None: #if corrected has a value, set the `token` variable to the same value as `corrected`
        token = corrected
    if len(token) < 4: #if token is less than four characters, return nothing for process_function (no output here essentially erases this token)
        return
    if not(token.isalpha()): #if token contains non-alphabetic characters, return nothing for process_function (no output here essentially erases this token)
        return
    return token #return the `token` variable which has been set equal to the `corrected` variable

Now let's cycle through each document in the [corpus](./key-terms.ipynb#corpus) with our helper function.

In [59]:
documents = [] #Create a new variable `documents` that is a list

for doc_n, volume in enumerate(dset.get_features()): #for each 
    this_doc = [] #create a new variable `this_doc` that is a list
    try:
        pages = volume['features']['pages']
    except KeyError:
        continue
    for pn, page in enumerate(pages):
        body = page.get('body')
        if body is not None:
            for token, pos_count in body.get('tokenPosCount', {}).items():
                clean_token = process_token(token)
                if clean_token is None:
                    continue
                for pos, n in pos_count.items():
                    this_doc += [clean_token] * n
    documents.append(this_doc)
                    

In [60]:
dictionary = gensim.corpora.Dictionary(documents)

In [61]:
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

In [62]:
model = gensim.models.TfidfModel(bow_corpus)

In [63]:
corpus_tfidf = model[bow_corpus]

Find the most significant terms, by TFIDF, in the curated dataset. 

In [64]:
td = {
        dictionary.get(_id): value for doc in corpus_tfidf
        for _id, value in doc
    }
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)

In [65]:
for term, weight in sorted_td[:25]:
    print(term, weight)

rihll 0.8789730103989494
ffarington 0.8687586142454262
lācis 0.8243941724639476
vennar 0.8163770278745967
springthorpe 0.7756033880521054
dowson 0.7319115494098776
ansori 0.7275629612915525
haggai 0.6938212666496392
mordred 0.6741468905424344
gambuh 0.6694600905272786
hynde 0.6476993218928895
mbti 0.6373307886940024
naatsilanei 0.612118444994542
pgends 0.6074158461930421
pgvar 0.6074158461930421
teena 0.6066638524325301
bunting 0.6061238433470649
londesbr 0.605012939081335
ramlila 0.5986739623074735
crimewatch 0.5864962542111738
bulworth 0.5685071371122152
mccrea 0.559153117887106
ncox 0.5589374086582615
edib 0.5561714547394812
mayella 0.5523679319537927


Print the most significant word, by TFIDF, for the first 50 documents in the corpus. 

In [66]:
for n, doc in enumerate(corpus_tfidf):
    if len(doc) < 1:
        continue
    word_id, score = max(doc, key=lambda x: x[1])
    print(dset.items[n], dictionary.get(word_id), score)
    if n >= 50:
        break

http://www.jstor.org/stable/i40075057 sophie 0.31074312791735936
http://www.jstor.org/stable/i40103856 hodgetts 0.3584969280241884
http://www.jstor.org/stable/i40075051 tercentenary 0.25790641235753525
http://www.jstor.org/stable/i40075048 harvey 0.2823835999735381
http://www.jstor.org/stable/i40075029 siddons 0.4804214701488567
http://www.jstor.org/stable/i40075043 mutran 0.4860344553470682
http://www.jstor.org/stable/i40075049 hathaway 0.37880283509018037
http://www.jstor.org/stable/i40103854 faucit 0.5626181311237378
http://www.jstor.org/stable/i24712320 hawkes 0.5600912328304873
http://www.jstor.org/stable/i40180516 péguy 0.8972513573836641
http://www.jstor.org/stable/i40075016 blackmore 0.18185351011005418
http://www.jstor.org/stable/i23917916 zarathustra 0.6152358652346409
http://www.jstor.org/stable/i338528 designer 0.1496299927861339
http://www.jstor.org/stable/i24712323 prufrock 0.3850736252470401
http://www.jstor.org/stable/i40074765 shak 0.17726733854453525
http://www.jstor.