By <a href="https://nkelber.com">Nathan Kelber</a> and Ted Lawless <br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.
____

## Finding Significant Words within a Dataset Using TF/IDF
**Difficulty:** Intermediate

**Programming Knowledge Required:** 
This notebook can be run on a JSTOR/Portico [non-consumptive](./key-terms.ipynb#non-consumptive) [JSON Lines (.jsonl)](./key-terms.ipynb#jsonl) [dataset](./key-terms.ipynb#dataset) with little to no knowledge of [Python](./key-terms.ipynb#python). To have a full understanding of the code used in this [notebook](./key-terms.ipynb#jupyter-notebook), we recommend learning:
* [Python Basics](https://automatetheboringstuff.com/2e/chapter1/)
* [Flow Control](https://automatetheboringstuff.com/2e/chapter2/)
* [Functions](https://automatetheboringstuff.com/2e/chapter3/)
* [Lists](https://automatetheboringstuff.com/2e/chapter4/)
* [Dictionaries](https://automatetheboringstuff.com/2e/chapter5/)

**Completion time:** 35 minutes

**Data Format:** [JSTOR](./key-terms.ipynb#jstor) and/or [Portico](./key-terms.ipynb#portico) [non-consumptive](./key-terms.ipynb#non-consumptive) [JSON Lines (.jsonl)](./key-terms.ipynb#jsonl)

**Libraries Used:**
* **[json](./key-terms.ipynb#json-python-library)** to convert our dataset from json lines format to a Python list
* **[gensim](./key-terms.ipynb#gensim)** to help compute the [tf-idf](./key-terms.ipynb#tf-idf) calculation

**Description of methods in this notebook:**
This [notebook](./key-terms.ipynb#jupyter-notebook) shows how to discover significant words in your [JSTOR](./key-terms.ipynb#jstor) and/or [Portico](./key-terms.ipynb#portico) [dataset](./key-terms.ipynb#dataset) using [Python](./key-terms.ipynb#python). The method for finding significant terms is [tf-idf](./key-terms.ipynb#tf-idf).  The following processes are described:

* Converting your [JSTOR](./key-terms.ipynb#jstor) and/or [Portico](./key-terms.ipynb#portico)[dataset](./key-terms.ipynb#dataset) into a Python list
* Writing a helper function to help clean up a single [token](./key-terms.ipynb#token)
* Cleaning each document of your dataset, one [token](./key-terms.ipynb#token) at a time
* Using a dictionary of English words to remove words with poor [OCR](./key-terms.ipynb#ocr)
* Computing the most significant words in your [corpus](./key-terms.ipynb#corpus) using [TFIDF](./key-terms.ipynb#tf-idf) with the [gensim](./key-terms.ipynb#gensim) library

A familiarity with [gensim](./key-terms.ipynb#gensim) is helpful but not required.
____

## Importing your dataset

You have two options for bringing your dataset into the local environment:

1. Manually download and upload your dataset
2. Use a dataset id to automatically upload a dataset

### Option one: Manually download and upload your dataset

You can download your dataset from the corpus builder in the link shown below. (You may also have a link to your dataset in your email.) If you wish, you can modify your dataset on your local machine before the next upload phase. This gives you some more flexibility than automatically pulling in your dataset using a dataset ID using option 2 below.

![The link for downloading your dataset](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/downloadDataset.png)

Once you have your dataset ready on your local machine, you can then upload your dataset into JupyterLab by clicking the upload button in the file pane on the left.

![The upload button in the file pane](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/uploadDataset.png)

Make sure to upload your dataset to the "datasets" folder. 

### Option Two: Use a Dataset ID to automatically upload a dataset

You'll use the tdm_client library to automatically upload your dataset. We import the `Dataset` module from the `tdm_client` library. The tdm_client library contains functions for connecting to the JSTOR server containing our [corpus](./key-terms.ipynb#corpus) [dataset](./key-terms.ipynb#dataset). To analyze your dataset, use the [dataset ID](./key-terms.ipynb/#dataset-ID) provided when you created your [dataset](./key-terms.ipynb/#dataset). A copy of your [dataset ID](./key-terms.ipynb/#dataset-ID) was sent to your email when you created your [corpus](./key-terms.ipynb#corpus). It should look like a long series of characters surrounded by dashes. 

In [482]:
#Importing your dataset with a dataset ID
import tdm_client
tdm_client.get_dataset("f6ae29d4-3a70-36ee-d601-20a8c0311273", "sampleJournalAnalysis") #Load the sample dataset, the full run of Shakespeare Quarterly from 1950-2013.

# Other humanities datasets:

#English
# Negro American Literature Forum (1967-1976) + Black American Literature Forum (1976-1991) + African American Review (1992-2016) (b4668c50-a970-c4d7-eb2c-bb6d04313542)
# Shakespeare Quarterly (1950-2013) (f6ae29d4-3a70-36ee-d601-20a8c0311273)
# ELH (1934-2014) (4999901a-fa17-31da-cfe5-2abf3a429df7)
# College English (1939-2016) (a161f384-720b-b6bf-a0cc-4d7d3b857e1c)
# PMLA (1889-2014) (1aea53b9-26d5-fe54-e35c-8259156ce6cd)

#History

#Philosophy

#Anthropology

#Law

#Art

#Classics
#Classical Quarterly (1907-2014) (82014740-8ed9-3c34-5716-d0879b8317f6)

'datasets/sampleJournalAnalysis.jsonl'

Before we can begin working with our [dataset](./key-terms.ipynb#dataset), we need to convert the [JSON lines](./key-terms.ipynb#jsonl) file written in [JavaScript](./key-terms.ipynb#javascript) into [Python](./key-terms.ipynb#python) so we can work with it. Remember that each line of our [JSON lines](./key-terms.ipynb#jsonl) file represents a single text, whether that is a journal article, book, or something else. We will create a [Python](./key-terms.ipynb#python) list that contains every document. Within each list item for each document, we will use a [Python dictionary](./key-terms.ipynb#python-dictionary) of [key/value pairs](./key-terms.ipynb#key-value-pair) to store information related to that document. 

Essentially we will have a [list](./key-terms.ipynb#python-list) of documents numbered, from zero to the last document. Each [list](./key-terms.ipynb#python-list) item then will be composed of a [dictionary](./key-terms.ipynb#python-dictionary) of [key/value pairs](./key-terms.ipynb#key-value-pair) that allows us to retrieve information from that particular document by number. The structure will look something like this:

![Structure of the corpus, a list of dictionaries](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CorpusView.png)

For each item in our list we will be able to use [key/value pairs](./key-terms.ipynb#key-value-pair) to get a **value** if we supply a **key**. We will call our [Python list](./key-terms.ipynb#python-list) variable `all_documents` since it will contain all of the documents in our [corpus](./key-terms.ipynb#corpus).

In [483]:
# Replace with your filename and be sure your file is in your datasets folder
file_name = 'sampleJournalAnalysis.jsonl' 

# Import the json module
import json
# Create an empty new list variable named `all_documents`
all_documents = [] 
# Temporarily open the file `filename` in the datasets/ folder
with open('./datasets/' + file_name) as dataset_file: 
    #for each line in the dataset file
    for line in dataset_file: 
        # Read each line into a Python dictionary.
        # Create a variable document that contains the line using json.loads to convert the json key/value pairs to a python dictionary
        document = json.loads(line) 
        # Append a new list item to `all_documents` containing the dictionary we created.
        all_documents.append(document) 

Now all of our documents have been converted from our original [JSON lines](./key-terms.ipynb#jsonl) file format (.jsonl) into a [python List](./key-terms.ipynb#python-list) variable named `all_documents`. Let's see what we can discover about our [corpus](./key-terms.ipynb#corpus) with a few simple methods.

First, we can determine how many texts are in our [dataset](./key-terms.ipynb#dataset) by using the `len()` function to get the size of `all_documents`. 

In [484]:
len(all_documents)

6687

In [485]:
print('Original number of documents: ' + str(len(all_documents)))
reduced_list = [all_documents[x] for x in range(len(all_documents)) if all_documents[x].get('title') != 'Review Article']
print('After removing "Review Articles": ' + str(len(reduced_list)))
reduced_list = [all_documents[x] for x in range(len(reduced_list)) if reduced_list[x].get('title') != 'Front Matter']
print('After removing articles labeled "Front Matter": ' + str(len(reduced_list)))
reduced_list = [all_documents[x] for x in range(len(reduced_list)) if reduced_list[x].get('title') != 'Back Matter']
print('After removing articles labeled "Back Matter": ' + str(len(reduced_list)))
reduced_list = [all_documents[x] for x in range(len(reduced_list)) if reduced_list[x].get('wordCount') < 3000]
print('After removing short articles: ' + str(len(reduced_list)))

Original number of documents: 6687
After removing "Review Articles": 4573
After removing articles labeled "Front Matter": 4324
After removing articles labeled "Back Matter": 4152
After removing short articles: 3008


In [499]:
def remove_non_articles(test_doc):
    print('Article ' + str(i) + ':')
    print('Title: ' + test_doc.get('title'))
    print('URL: ' + test_doc.get('id'))
    print('Status: ', end='')
    if test_doc.get('creators') == None:
        print('Removed--No author')
    elif test_doc.get('title') == 'Review Article':
        print('Removed--Review Article')
    elif test_doc.get('title') == 'Front Matter':
        print('Removed--Front Matter')
    elif test_doc.get('title') == 'Back Matter':
        print('Removed--Back Matter')  
    elif test_doc.get('wordCount') < 3000:
        print('Removed--Too short at ' + str(test_doc.get('wordCount')) + ' words')
    else:
        print('GOOD ARTICLE')      

articles_to_show = 5
#articles_to_show = len(all_documents)
for i in range(articles_to_show):
    remove_non_articles(all_documents[i])  

Article 0:
Title: Review Article
URL: http://www.jstor.org/stable/2869980
Status: Removed--Review Article
Article 1:
Title: Shakespeare in Sydney
URL: http://www.jstor.org/stable/2870198
Status: Removed--Too short at 2032 words
Article 2:
Title: Shakespeare in the Berkshires, 1985
URL: http://www.jstor.org/stable/2870199
Status: Removed--Too short at 1805 words
Article 3:
Title: Review Article
URL: http://www.jstor.org/stable/2870209
Status: Removed--Review Article
Article 4:
Title: Review Article
URL: http://www.jstor.org/stable/2870208
Status: Removed--Review Article


Let's create a helper function that can standardize and [clean](./key-terms.ipynb#clean-data) up the [tokens](./key-terms.ipynb#token) in our [dataset](./key-terms.ipynb#dataset). The function will:
* lower cases all [tokens](./key-terms.ipynb#token)
* use a dictionary from [The HathiTrust Research Center](./key-terms.ipynb#htrc) to correct common [Optical Character Recognition](./key-terms.ipynb#ocr) problems
* discard [tokens](./key-terms.ipynb#token) less than 4 characters in length
* discard [tokens](./key-terms.ipynb#token) with non-alphabetical characters
* remove [stopwords](./key-terms.ipynb#stop-words) based on [The HathiTrust Research Center](./key-terms.ipynb#htrc) [stopword](./key-terms.ipynb#stop-words) list

In [487]:
from tdm_client import htrc_corrections

def process_token(token): #define a function `process_token` that takes the argument `token`
    token = token.lower() #set the string in token to a new string with all lowercase letters
    corrected = htrc_corrections.get(token) #initialize a new variable `corrected` that runs token through the `htrc_corrections.get()` function to fix common OCR errors
    if corrected is not None: #if corrected has a value, set the `token` variable to the same value as `corrected`
        token = corrected
    if len(token) < 4: #if token is less than four characters, return nothing for process_function (no output here essentially erases this token)
        return
    if not(token.isalpha()): #if token contains non-alphabetic characters, return nothing for process_function (no output here essentially erases this token)
        return
    return token #return the `token` variable which has been set equal to the `corrected` variable

def process_document(chosen_document):
    this_doc = []
    singleDoc = chosen_document.get('unigramCount')
    for token, count in singleDoc.items():
        clean_token = process_token(token)
        if clean_token is None:
            continue
        this_doc += [clean_token] * count
    documents.append(this_doc)

In [489]:
documents = []
for i in range(len(reduced_list)):
    process_document(reduced_list[i])

Now let's cycle through each document in the [corpus](./key-terms.ipynb#corpus) with our helper function.

In [490]:
import gensim
dictionary = gensim.corpora.Dictionary(documents)

In [491]:
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

In [492]:
model = gensim.models.TfidfModel(bow_corpus)

In [493]:
corpus_tfidf = model[bow_corpus]

Find the most significant terms, by TFIDF, in the curated dataset. 

In [494]:
td = {
        dictionary.get(_id): value for doc in corpus_tfidf
        for _id, value in doc
    }
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)

In [495]:
for term, weight in sorted_td[:25]:
    print(term, weight)

ofamiem 1.0
ouderdom 0.9127008148928485
sturgess 0.9086024303923905
zamir 0.8776651519018113
santayana 0.8665736199609847
falocco 0.8562928124271615
weingust 0.8547899776401692
chinese 0.8462001652815331
weils 0.8445705864452131
rudanko 0.8390830868389877
enbiemata 0.8280102498481464
daileader 0.8171301955562135
nodier 0.8168018882816901
usury 0.8005782510346803
menas 0.7909230276348473
beaurline 0.7905965479055058
spectogram 0.7879121817261375
franciscus 0.7771865284620213
soellner 0.77327831058108
bastarde 0.7712490973605648
unton 0.7682807524450844
cohens 0.7677867862204198
falco 0.7635813973078707
callimachus 0.7604638066069844
wynkyn 0.7583409894849417


Print the most significant word, by TFIDF, for the first 50 documents in the corpus. 

In [498]:
for n, doc in enumerate(corpus_tfidf):
    if len(doc) < 1:
        continue
    word_id, score = max(doc, key=lambda x: x[1])
    print(reduced_list[n].get('id'), dictionary.get(word_id), score)
    if n >= 50:
        break

http://www.jstor.org/stable/2869980 henslowe 0.33564303350040253
http://www.jstor.org/stable/2870198 stairs 0.15472663369841613
http://www.jstor.org/stable/2870199 beatrice 0.23846910220032727
http://www.jstor.org/stable/2870209 donaldson 0.4977445938192701
http://www.jstor.org/stable/2870208 cheng 0.6067690431054176
http://www.jstor.org/stable/2870203 hartwig 0.6244428220115
http://www.jstor.org/stable/2870194 hall 0.5136214288878345
http://www.jstor.org/stable/2870206 novy 0.5645876593477805
http://www.jstor.org/stable/2870202 booth 0.3864621355873546
http://www.jstor.org/stable/2870313 vizcaya 0.6046299393365276
http://www.jstor.org/stable/2870327 hollar 0.639842687006522
http://www.jstor.org/stable/2870308 longleat 0.5039894018479222
http://www.jstor.org/stable/2869730 andidentifies 0.1635030038209591
http://www.jstor.org/stable/2869726 rubinstein 0.5497022223906934
http://www.jstor.org/stable/2871196 carded 0.2820633806904763
http://www.jstor.org/stable/2871206 jaggard 0.270489292