By <a href="https://nkelber.com">Nathan Kelber</a> and Ted Lawless <br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.
____

# Finding Significant Words Using TF/IDF
**Difficulty:** Intermediate

**Knowledge Required:** 
* [Python Basics I](./0-python-basics-1.ipynb)
* [Python Basics II](./0-python-basics-2.ipynb)
* [Python Basics III](./0-python-basics-3.ipynb)

**Knowledge Recommended:**
* [Exploring Metadata](./2-metadata-localcopy.ipynb)

**Completion time:** 90 minutes

**Data Format:** [JSTOR](./key-terms.ipynb#jstor)/[Portico](./key-terms.ipynb#portico) [JSON Lines (.jsonl)](./key-terms.ipynb#jsonl)

**Libraries Used:**
* **[json](./key-terms.ipynb#json-python-library)** to convert our dataset from json lines format to a Python list
* **[gensim](./key-terms.ipynb#gensim)** to help compute the [tf-idf](./key-terms.ipynb#tf-idf) calculation

**Description of methods in this notebook:**
This [notebook](./key-terms.ipynb#jupyter-notebook) shows how to discover significant words in your [JSTOR](./key-terms.ipynb#jstor) and/or [Portico](./key-terms.ipynb#portico) [dataset](./key-terms.ipynb#dataset) using [Python](./key-terms.ipynb#python). The method for finding significant terms is [tf-idf](./key-terms.ipynb#tf-idf).  The following processes are described:

* Converting your [JSTOR](./key-terms.ipynb#jstor)/[Portico](./key-terms.ipynb#portico) [dataset](./key-terms.ipynb#dataset) into a Python list
* Filtering out articles from your [dataset](./key-terms.ipynb#dataset) depending on a flexible set of rules
* Writing a helper function to help clean up a single [token](./key-terms.ipynb#token)
* Cleaning each document of your dataset, one [token](./key-terms.ipynb#token) at a time
* Using htrc_dictionary to remove words with poor [OCR](./key-terms.ipynb#ocr)
* Creating a [gensim dictionary](./key-terms.ipynb#gensim-dictionary)
* Creating a [gensim](./key-terms.ipynb#gensim) [bag of words](./key-terms.ipynb#bag-of-words) [corpus](./key-terms.ipynb#corpus)
* Computing the most significant words in your [corpus](./key-terms.ipynb#corpus) using the [gensim](./key-terms.ipynb#gensim) implementation of [TF-IDF](./key-terms.ipynb#tf-idf)

A familiarity with [gensim](./key-terms.ipynb#gensim) is helpful but not required.
____

## What is "Term Frequency- Inverse Document Frequency" (TF-IDF)?

[TF-IDF](./key-terms.ipynb#tf-idf) is used in [machine learning](./key-terms.ipynb#machine-learning) and [natural language processing](./key-terms.ipynb/#nlp) for measuring the significance of particular terms for a given document. It consists of two parts that are multiplied together:

1. Term Frequency- A measure of how many times a given word appears in a document
2. Inverse Document Frequency- A measure of how rare the same word is across the documents in the corpus (the fewer documents that contain the word, the higher the value)

If we were to merely consider [word frequency](./key-terms.ipynb#word-frequency), the most frequent words would be common [function words](./key-terms.ipynb#function-words) like: "the", "and", "of". We could use a [stopwords list](./key-terms.ipynb#stop-words) to remove the common [function words](./key-terms.ipynb#function-words), but that still may not give us results that describe the unique terms in the document since the uniqueness of terms depends on the context of a larger body of documents. In other words, the same term could be significant or insignificant depending on the context. Consider these examples:

* Given a set of scientific journal articles in biology, the term "lab" may not be significant since biologists often rely on and mention labs in their research. However, if the term "lab" were to occur frequently in a history or English article, then it is likely to be significant since humanities articles rarely discuss labs. 
* If we were to look at thousands of articles in literary studies, then the term "postcolonial" may be significant for any given article. However, if we were to look at a few hundred articles on the topic of "the global south," then the term "postcolonial" may occur so frequently that it is not a significant way to differentiate between the articles.

The [TF-IDF](./key-terms.ipynb#tf-idf) calculation reveals the words that are frequent in this document **yet rare in other documents**. The goal is to find out what is unique or remarkable about a document given the context (and *the given context* can change the results of the analysis). 

Here is how the calculation is mathematically written:

$$tfidf_{t,d} = tf_{t,d} \cdot idf_{t,D}$$

In plain English, this means: **The value of [TF-IDF](./key-terms.ipynb#tf-idf) is the product of a given term's frequency and its inverse document frequency.** Let's unpack these terms one at a time.

### Term Frequency Function

$$tf_{t,d}$$
The number of times a term (t) occurs in a given document (d)

### Inverse Document Frequency Function

$$idf_{t,D} = \mbox{log} \frac{N}{|\{d \in D : t \in d\}|}$$
The inverse document frequency is computed as shown in the formula above. In plain English, this means: **The log of the total number of documents (N) divided by the number of documents that contain the term**

### TF-IDF Calculation in Plain English

$$ tf-idf = (Number-of-times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Total-number-of-documents-containing-the-word)}$$

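There are variations on the [TF-IDF](./key-terms.ipynb#tf-idf) formula (for example, in the base of the logarithm), but this is one of the most widely used versions. The worked examples below use a base-10 logarithm.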
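The plain English formula can be written as a small Python function. This is a minimal sketch, not the implementation used later in the notebook: the function name `tf_idf` and its parameter names are made up for illustration, and it assumes a base-10 logarithm, which matches the `.3010` values in the worked examples.

```python
import math

def tf_idf(term_count, total_docs, docs_with_term):
    """Return term_count * log10(total_docs / docs_with_term)."""
    return term_count * math.log10(total_docs / docs_with_term)

# A word that occurs once in a document and appears in 2 of 4 documents
print(round(tf_idf(1, 4, 2), 4))

# A word that occurs 3 times in a document but appears in all 4 documents
print(tf_idf(3, 4, 4))
```

The second call shows why a word found in every document always scores zero: the fraction inside the logarithm is 1, and the log of 1 is 0.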

### An Example Calculation of TF-IDF

Let's take a look at an example to illustrate the fundamentals of [TF-IDF](./key-terms.ipynb#tf-idf). First, we need several texts to compare. Our texts will be very simple.

* text1 = 'The grass was green and spread out the distance like the sea.'
* text2 = 'Green eggs and ham were spread out like the book.'
* text3 = 'Green sailors were met like the sea met troubles.'
* text4 = 'The grass was green.'

The first step is to find the unique words in each text.

|text1|text2|text3|text4|
| --- | --- | --- | --- |
|the|green|green|the|
|grass|eggs|sailors|grass|
|was|and|were|was|
|green|ham|met|green|
|and|were|like| |
|spread|spread|the| |
|out|out|sea| |
|distance|like|troubles| |
|like|the| | |
|sea|book| | |


Our four texts share some similar words. Next, we create a single list of unique words that occur across all four texts. (When we use the [gensim](./key-terms.ipynb#gensim) library later, we will call this list a [gensim dictionary](./key-terms.ipynb#gensim-dictionary).) 

|Unique Words|
| --- |
|and|
|book|
|distance|
|eggs|
|grass|
|green|
|ham|
|like|
|met|
|out|
|sailors|
|sea|
|spread|
|the|
|troubles|
|was|
|were|
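The list of unique words above can be reproduced in a few lines of Python. This is a sketch under simple assumptions: the helper `tokenize` is a hypothetical name, and it only lowercases the text and strips punctuation, which is enough for these four sentences but not for a real corpus.

```python
import string

texts = [
    'The grass was green and spread out the distance like the sea.',
    'Green eggs and ham were spread out like the book.',
    'Green sailors were met like the sea met troubles.',
    'The grass was green.',
]

def tokenize(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()

# A set keeps one copy of each word; sorting puts the words in alphabetical order
vocabulary = sorted({word for text in texts for word in tokenize(text)})

print(len(vocabulary))
print(vocabulary)
```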

Now let's count the occurrences of each unique word in each sentence.

|word|text1|text2|text3|text4|
|---|---|---|---|---|
|and|1|1|0|0|
|book|0|1|0|0|
|distance|1|0|0|0|
|eggs|0|1|0|0|
|grass|1|0|0|1|
|green|1|1|1|1|
|ham|0|1|0|0|
|like|1|1|1|0|
|met|0|0|2|0|
|out|1|1|0|0|
|sailors|0|0|1|0|
|sea|1|0|1|0|
|spread|1|1|0|0|
|the|3|1|1|1|
|troubles|0|0|1|0|
|was|1|0|0|1|
|were|0|1|1|0|
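The counts in the table above can be checked with `collections.Counter` from the Python standard library. The sketch below is illustrative only; `tokenize` is a hypothetical helper that lowercases each word and trims trailing punctuation.

```python
from collections import Counter

texts = [
    'The grass was green and spread out the distance like the sea.',
    'Green eggs and ham were spread out like the book.',
    'Green sailors were met like the sea met troubles.',
    'The grass was green.',
]

def tokenize(text):
    """Split on whitespace, trim punctuation, and lowercase each word."""
    return [word.strip('.,').lower() for word in text.split()]

# One Counter per text, mapping each word to the number of times it occurs
counts = [Counter(tokenize(text)) for text in texts]

for word in ['the', 'met', 'out']:
    # A Counter returns 0 for a word it has never seen
    print(word, [count[word] for count in counts])
```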

### Computing TF-IDF (Example 1)

We have enough information now to compute [TF-IDF](./key-terms.ipynb#tf-idf) for every word in our corpus. Recall the plain English formula.

$$ tf-idf = (Number-of-times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Total-number-of-documents-containing-the-word)}$$

We can use the formula to compute [TF-IDF](./key-terms.ipynb#tf-idf) for the most common word in our corpus: 'the'. In total, we will compute [TF-IDF](./key-terms.ipynb#tf-idf) four times (once for each of our texts). 

|word|text1|text2|text3|text4|
|---|---|---|---|---|
|the|3|1|1|1|

text1: $$ tf-idf = 3 \cdot \mbox{log} \frac{4}{(4)} = 3 \cdot \mbox{log} 1 = 3 \cdot 0 = 0$$
text2: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(4)} = 1 \cdot \mbox{log} 1 = 1 \cdot 0 = 0$$
text3: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(4)} = 1 \cdot \mbox{log} 1 = 1 \cdot 0 = 0$$
text4: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(4)} = 1 \cdot \mbox{log} 1 = 1 \cdot 0 = 0$$

The results of our analysis suggest 'the' has a weight of 0 in every document. The word 'the' exists in all of our documents, and therefore it is not a significant term to differentiate one document from another.

Given that idf is

$$\mbox{log} \frac{(Total-number-of-documents)}{(Total-number-of-documents-containing-the-word)}$$

and 

$$\mbox{log} 1 = 0$$
we can see that [TF-IDF](./key-terms.ipynb#tf-idf) will be 0 for any word that occurs in every document. That is, if a word occurs in every document, then it is not a significant term for any individual document.



### Computing TF-IDF (Example 2)

Let's try a second example with the word 'out'. Recall the plain English formula.

$$ tf-idf = (Number-of-times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Total-number-of-documents-containing-the-word)}$$

We will compute [TF-IDF](./key-terms.ipynb#tf-idf) four times, once for each of our texts. 

|word|text1|text2|text3|text4|
|---|---|---|---|---|
|out|1|1|0|0|

text1: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(2)} = 1 \cdot \mbox{log} 2 = 1 \cdot .3010 = .3010$$
text2: $$ tf-idf = 1 \cdot \mbox{log} \frac{4}{(2)} = 1 \cdot \mbox{log} 2 = 1 \cdot .3010 = .3010$$
text3: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(2)} = 0 \cdot \mbox{log} 2 = 0 \cdot .3010 = 0$$
text4: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(2)} = 0 \cdot \mbox{log} 2 = 0 \cdot .3010 = 0$$

The results of our analysis suggest 'out' has some significance in text1 and text2, but no significance for text3 and text4 where the word does not occur.

### Computing TF-IDF (Example 3)

Let's try one last example with the word 'met'. Here's the [TF-IDF](./key-terms.ipynb#tf-idf) formula again:

$$ tf-idf = (Number-of-times-the-word-occurs-in-given-document) \cdot \mbox{log} \frac{(Total-number-of-documents)}{(Total-number-of-documents-containing-the-word)}$$

And here's how many times the word 'met' occurs in each text.

|word|text1|text2|text3|text4|
|---|---|---|---|---|
|met|0|0|2|0|

text1: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(1)} = 0 \cdot \mbox{log} 4 = 0 \cdot .6021 = 0$$
text2: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(1)} = 0 \cdot \mbox{log} 4 = 0 \cdot .6021 = 0$$
text3: $$ tf-idf = 2 \cdot \mbox{log} \frac{4}{(1)} = 2 \cdot \mbox{log} 4 = 2 \cdot .6021 = 1.2042$$
text4: $$ tf-idf = 0 \cdot \mbox{log} \frac{4}{(1)} = 0 \cdot \mbox{log} 4 = 0 \cdot .6021 = 0$$

As should be expected, we can see that the word 'met' is very significant in text3 but not significant in any other text since it does not occur in any other text. 

## The Full TF-IDF Example Table

Here are the original sentences for each text:

* text1 = 'The grass was green and spread out the distance like the sea.'
* text2 = 'Green eggs and ham were spread out like the book.'
* text3 = 'Green sailors were met like the sea met troubles.'
* text4 = 'The grass was green.'

And here are the corresponding [TF-IDF](./key-terms.ipynb#tf-idf) scores for each word in each text:

|word|text1|text2|text3|text4|
|---|---|---|---|---|
|and|.3010|.3010|0|0|
|book|0|.6021|0|0|
|distance|.6021|0|0|0|
|eggs|0|.6021|0|0|
|grass|.3010|0|0|.3010|
|green|0|0|0|0|
|ham|0|.6021|0|0|
|like|.1249|.1249|.1249|0|
|met|0|0|1.2042|0|
|out|.3010|.3010|0|0|
|sailors|0|0|.6021|0|
|sea|.3010|0|.3010|0|
|spread|.3010|.3010|0|0|
|the|0|0|0|0|
|troubles|0|0|.6021|0|
|was|.3010|0|0|.3010|
|were|0|.3010|.3010|0|

There are a few noteworthy things in this data. 

* The [TF-IDF](./key-terms.ipynb#tf-idf) score for any word that does not occur in a text is 0.
* The scores in text4 are low since it is a shorter version of text1. There are no unique words in text4 since text1 contains all the same words. It is also a short text, which means that there are only four words to consider. The words 'the' and 'green' occur in every text and score 0, leaving only 'was' and 'grass', which are also found in text1.
* The words 'book', 'eggs', and 'ham' are significant in text2 since they only occur in that text.
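The whole table can be recomputed from the four sentences. The following is a rough sketch rather than the method used later with gensim: it assumes the base-10 formula from the worked examples, and `tokenize`, `counts`, and `scores` are names chosen for illustration. Because the table multiplies rounded values, the last digit of a score can differ slightly from what this code prints.

```python
import math
from collections import Counter

texts = [
    'The grass was green and spread out the distance like the sea.',
    'Green eggs and ham were spread out like the book.',
    'Green sailors were met like the sea met troubles.',
    'The grass was green.',
]

def tokenize(text):
    """Split on whitespace, trim punctuation, and lowercase each word."""
    return [word.strip('.,').lower() for word in text.split()]

counts = [Counter(tokenize(text)) for text in texts]
vocabulary = sorted(set().union(*counts))

scores = {}
for word in vocabulary:
    # Document frequency: how many texts contain the word at least once
    docs_with_word = sum(1 for count in counts if word in count)
    idf = math.log10(len(texts) / docs_with_word)
    # One score per text: the word's count in that text times its idf
    scores[word] = [count[word] * idf for count in counts]

for word in vocabulary:
    print(word, [round(score, 4) for score in scores[word]])
```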

Now that you have a basic understanding of how [TF-IDF](./key-terms.ipynb#tf-idf) is computed at a small scale, let's try computing [TF-IDF](./key-terms.ipynb#tf-idf) on a [corpus](./key-terms.ipynb#corpus) which could contain millions of words.

---

# Computing TF-IDF with your JSTOR/Portico Dataset

## Importing your dataset

You have two options for bringing your dataset into the local environment:

1. Manually download and upload your dataset
2. Use a dataset id to automatically upload a dataset

### Option One: Manually download and upload your dataset

You can download your dataset from the corpus builder in the link shown below. (You may also have a link to your dataset in your email.) If you wish, you can modify your dataset on your local machine before the next upload phase. This gives you more flexibility than automatically pulling in your dataset with a dataset ID (option two below).

![The link for downloading your dataset](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/downloadDataset.png)

Once you have your dataset ready on your local machine, you can then upload your dataset into JupyterLab by clicking the upload button in the file pane on the left.

![The upload button in the file pane](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/uploadDataset.png)

Make sure to upload your dataset to the "datasets" folder. 

### Option Two: Use a Dataset ID to automatically upload a dataset

You'll use the `tdm_client` library to automatically upload your dataset. We import `tdm_client` and call its `get_dataset` function. The `tdm_client` library contains functions for connecting to the JSTOR server containing our [corpus](./key-terms.ipynb#corpus) [dataset](./key-terms.ipynb#dataset). To analyze your dataset, use the [dataset ID](./key-terms.ipynb/#dataset-ID) provided when you created your [dataset](./key-terms.ipynb/#dataset). A copy of your [dataset ID](./key-terms.ipynb/#dataset-ID) was sent to your email when you created your [corpus](./key-terms.ipynb#corpus). It should look like a long series of characters separated by dashes. 

In [48]:
#Importing your dataset with a dataset ID
import tdm_client
tdm_client.get_dataset("cord-19", "sampleJournalAnalysis") # Load the dataset with the ID "cord-19"; the output below shows it saved as datasets/sampleJournalAnalysis.jsonl

# Other humanities datasets:

#English
# Negro American Literature Forum (1967-1976) + Black American Literature Forum (1976-1991) + African American Review (1992-2016) (b4668c50-a970-c4d7-eb2c-bb6d04313542)
# Shakespeare Quarterly (1950-2013) (f6ae29d4-3a70-36ee-d601-20a8c0311273)
# ELH (1934-2014) (4999901a-fa17-31da-cfe5-2abf3a429df7)
# College English (1939-2016) (a161f384-720b-b6bf-a0cc-4d7d3b857e1c)
# PMLA (1889-2014) (1aea53b9-26d5-fe54-e35c-8259156ce6cd)

#History

#Philosophy

#Anthropology

#Law

#Art

#Classics
#Classical Quarterly (1907-2014) (82014740-8ed9-3c34-5716-d0879b8317f6)

'datasets/sampleJournalAnalysis.jsonl'

Before we can begin working with our [dataset](./key-terms.ipynb#dataset), we need to convert the [JSON lines](./key-terms.ipynb#jsonl) file, a text format whose syntax comes from [JavaScript](./key-terms.ipynb#javascript), into [Python](./key-terms.ipynb#python) data structures so we can work with it. Remember that each line of our [JSON lines](./key-terms.ipynb#jsonl) file represents a single text, whether that is a journal article, book, or something else. We will create a [Python](./key-terms.ipynb#python) list that contains every document. Within each list item for each document, we will use a [Python dictionary](./key-terms.ipynb#python-dictionary) of [key/value pairs](./key-terms.ipynb#key-value-pair) to store information related to that document. 

Essentially we will have a [list](./key-terms.ipynb#python-list) of documents, numbered from zero to the last document. Each [list](./key-terms.ipynb#python-list) item will be a [dictionary](./key-terms.ipynb#python-dictionary) of [key/value pairs](./key-terms.ipynb#key-value-pair) that allows us to retrieve information from that particular document by number. The structure will look something like this:

![Structure of the corpus, a list of dictionaries](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CorpusView.png)

For each item in our list we will be able to use [key/value pairs](./key-terms.ipynb#key-value-pair) to get a **value** if we supply a **key**. We will call our [Python list](./key-terms.ipynb#python-list) variable `all_documents` since it will contain all of the documents in our [corpus](./key-terms.ipynb#corpus).

In [56]:
# Replace with your filename and be sure your file is in your datasets folder
file_name = 'sampleJournalAnalysis.jsonl' 

# Import the json module
import json
# Create an empty new list variable named `all_documents`
all_documents = [] 
# Temporarily open the file `filename` in the datasets/ folder
with open('./datasets/' + file_name) as dataset_file: 
    #for each line in the dataset file
    for line in dataset_file: 
        # Read each line into a Python dictionary.
        # Create a variable document that contains the line using json.loads to convert the json key/value pairs to a python dictionary
        document = json.loads(line) 
        # Append a new list item to `all_documents` containing the dictionary we created.
        all_documents.append(document) 

Now all of our documents have been converted from our original [JSON lines](./key-terms.ipynb#jsonl) file format (.jsonl) into a [Python list](./key-terms.ipynb#python-list) variable named `all_documents`. Let's see what we can discover about our [corpus](./key-terms.ipynb#corpus) with a few simple methods.

First, we can determine how many texts are in our [dataset](./key-terms.ipynb#dataset) by using the `len()` function to get the size of `all_documents`. 

In [50]:
len(all_documents)

682
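To see the list-of-dictionaries structure on a small scale, the same loading pattern can be run on an in-memory stand-in for a `.jsonl` file. The two records below are invented for illustration; a real dataset has many more keys per document.

```python
import io
import json

# Two hypothetical records standing in for the lines of a .jsonl file
sample_jsonl = io.StringIO(
    '{"id": "doc-1", "title": "First Example", "wordCount": 3500}\n'
    '{"id": "doc-2", "title": "Second Example", "wordCount": 1200}\n'
)

sample_documents = []
for line in sample_jsonl:
    # json.loads turns one line of JSON into one Python dictionary
    sample_documents.append(json.loads(line))

# Pick a document by its list index, then a value by its key
print(len(sample_documents))
print(sample_documents[0]['title'])

# .get returns None instead of raising an error when a key is missing
print(sample_documents[1].get('creators'))
```

Using `.get` rather than square brackets matters later in this notebook, since some documents may lack a metadata field entirely.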

---
## Removing Articles That Are Not Full-Length

When journal issues are added to JSTOR, they are broken up into chunks called articles. If we want to analyze every word in every issue, this approach works well. However, if we only want to analyze the full-length articles, we may want to remove some articles. These articles could be things like:

* Tables of Contents
* Indices
* Unauthored Materials
* Short Notes
* Book Reviews

We can design a set of assessments that will remove these materials. Depending on the journal and the field, shorter articles may still be relevant. The example code below demonstrates a set of rules for filtering using the article metadata.

### Articles To Be Removed
* Articles not in English
* Articles with no authors
* Articles with title: "Review Article"
* Articles with title: "Front Matter"
* Articles with title: "Back Matter"
* Articles with a word count less than 3000 words

The function below outputs a list of the first several articles (the number is set by ``articles_to_show``), each followed by the reason it would or would not be kept. This can be a useful exploratory tool for getting the best starting corpus. You may need to adjust the word count number up or down, particularly for distinguishing between, say, short notes and full-length articles. You might also consider writing other [metadata](./key-terms.ipynb#metadata) field tests to narrow your [corpus](./key-terms.ipynb#corpus). **Note: The Article Assessment Exploration code below does not change your corpus found within ``all_documents``. It merely serves as a convenient way to adjust the logic of the actual filtering that happens in the next step.**

In [55]:
#Article Assessment Exploration

## Define a function ``remove_non_articles`` that will test a single document
def remove_non_articles(test_doc, index): # ``index`` is the article's position in ``all_documents``
    print('Article ' + str(index) + ':') # Print the list index for each article so they can be easily referenced
    print('Title: ' + str(test_doc.get('title'))) # Print the title for the article in question
    print('URL: ' + str(test_doc.get('id'))) # Print the URL for the article so it can be quickly reviewed
    print('Status: ', end='') # Print the phrase 'Status' without a following line-break
    #if test_doc.get('language') != ['eng']:
        #print('Removed--Not in English')
    if test_doc.get('creators') is None: # Get the value for the key 'creators' in the test document and check if it is None
        print('Removed--No author') # If the value for 'creators' is None, print 'Removed--No author'
    elif test_doc.get('title') == 'Review Article': # Get the value for the key 'title' in the test document and check if it is equal to 'Review Article'
        print('Removed--Review Article') # If the value for 'title' is 'Review Article', print 'Removed--Review Article'
    elif test_doc.get('title') == 'Front Matter': # Get the value for the key 'title' in the test document and check if it is equal to 'Front Matter'
        print('Removed--Front Matter') # If the value for 'title' is 'Front Matter', print 'Removed--Front Matter'
    elif test_doc.get('title') == 'Back Matter': # Get the value for the key 'title' in the test document and check if it is equal to 'Back Matter'
        print('Removed--Back Matter')  # If the value for 'title' is 'Back Matter', print 'Removed--Back Matter'
    elif (test_doc.get('wordCount') or 0) < 3000: # Get the value for 'wordCount' and check if it is less than 3000 words; a missing word count is treated as 0 (Change this number if you want more or fewer words)
        print('Removed--Too short at ' + str(test_doc.get('wordCount')) + ' words') # If the value for wordCount is less than 3000, print 'Removed--Too short at' followed by the actual word count
    else:
        print('GOOD ARTICLE at '+ str(test_doc.get('wordCount')) + ' words') # If the article passes all the above tests, print 'GOOD ARTICLE at ' with the article word count

articles_to_show = 20 # Show the first 20 articles (Change this number to show more or fewer articles)
#articles_to_show = len(all_documents) # Uncomment to show all articles
for i in range(articles_to_show): # Repeat this process ``articles_to_show`` times
    remove_non_articles(all_documents[i], i)  # Run the remove_non_articles function on a single document with list index value of ``i``

Article 0:
Title: Residue analysis of a CTL epitope of SARS-CoV spike protein by IFN-gamma production and bioinformatics prediction
URL: cord19-4152ae12ac49d157a290281842bce9e9d16096a7
Status: Removed--No author
Article 1:
Title: Send Orders of Reprints at bspsaif@emirates.net.ae Synthetic Genomics and Synthetic Biology Applications Between Hopes and Concerns
URL: cord19-0414c8763c6ec295be60d86698ec3a76e981e54f
Status: Removed--No author
Article 2:
Title: Identification, Characterization and Application of a G- Quadruplex Structured DNA Aptamer against Cancer Biomarker Protein Anterior Gradient Homolog 2
URL: cord19-a015401a1cdf151bd0bca060bd8eb96d6683a147
Status: Removed--No author
Article 3:
Title: Evasion of Antiviral Innate Immunity by Theiler's Virus L* Protein through Direct Inhibition of RNase L
URL: cord19-5908982f4dbbfde0faabcfadc0988b8a19223411
Status: Removed--No author
Article 4:
Title: Building core capacities at the designated points of entry according to the Internationa

Now that we've done some exploratory analysis to figure out the right parameters for filtering our corpus, we can put them into actual practice. The following set of [list comprehensions](./key-terms.ipynb#list-comprehensions) creates a new list called ``reduced_list`` from our original [corpus](./key-terms.ipynb#corpus) ``all_documents``. These [list comprehensions](./key-terms.ipynb#list-comprehensions) can be adjusted to your [corpus](./key-terms.ipynb#corpus) to get the best results. For example, you may want to change the ``wordCount`` to be larger or smaller than the default below.

After each list reduction, the number of remaining articles is printed for reference.

In [52]:
print('Original number of documents: ' + str(len(all_documents))) # Print the original number of documents in ``all_documents``

reduced_list = [doc for doc in all_documents if doc.get('language') == ['eng']] # Copy each document from ``all_documents`` to ``reduced_list`` if the ``language`` key has the value ['eng']
print('After removing articles not in English: ' + str(len(reduced_list))) # Print the current size of ``reduced_list``
#Note that this first filter works on ``all_documents`` but the following filters must work on ``reduced_list`` 

reduced_list = [doc for doc in reduced_list if doc.get('creators') is not None] # Keep each document in ``reduced_list`` whose ``creators`` key does not have the value None
print('After removing articles with no authors: ' + str(len(reduced_list))) # Print the current size of ``reduced_list``

reduced_list = [doc for doc in reduced_list if doc.get('title') != 'Review Article'] # Keep each document in ``reduced_list`` whose ``title`` is not 'Review Article'
print('After removing "Review Articles": ' + str(len(reduced_list))) # Print the current size of ``reduced_list``

reduced_list = [doc for doc in reduced_list if doc.get('title') != 'Front Matter'] # Keep each document in ``reduced_list`` whose ``title`` is not 'Front Matter'
print('After removing articles labeled "Front Matter": ' + str(len(reduced_list))) # Print the current size of ``reduced_list``

reduced_list = [doc for doc in reduced_list if doc.get('title') != 'Back Matter'] # Keep each document in ``reduced_list`` whose ``title`` is not 'Back Matter'
print('After removing articles labeled "Back Matter": ' + str(len(reduced_list))) # Print the current size of ``reduced_list``

reduced_list = [doc for doc in reduced_list if (doc.get('wordCount') or 0) >= 3000] # Keep each document in ``reduced_list`` whose ``wordCount`` is at least 3000 (a missing word count is treated as 0)
print('After removing short articles: ' + str(len(reduced_list))) # Print the current size of ``reduced_list``

Original number of documents: 682
After removing articles not in English: 0
After removing articles with no authors: 0
After removing "Review Articles": 0
After removing articles labeled "Front Matter": 0
After removing articles labeled "Back Matter": 0
After removing short articles: 0
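When a filter removes far more documents than expected, it helps to count why each document was removed. The sketch below is one possible approach and not part of the original workflow: `removal_reason` is a hypothetical helper that applies the same rules in a single function, and the three sample records are invented for illustration.

```python
from collections import Counter

def removal_reason(doc, min_words=3000):
    """Return why a document would be removed, or None if it should be kept."""
    if doc.get('language') != ['eng']:
        return 'not in English'
    if doc.get('creators') is None:
        return 'no author'
    if doc.get('title') in ('Review Article', 'Front Matter', 'Back Matter'):
        return 'excluded title'
    word_count = doc.get('wordCount')
    if word_count is None or word_count < min_words:
        return 'too short or no word count'
    return None

# Hypothetical records for illustration only
sample_docs = [
    {'title': 'A Long Essay', 'language': ['eng'], 'creators': ['A. Author'], 'wordCount': 8000},
    {'title': 'Front Matter', 'language': ['eng'], 'creators': ['A. Author'], 'wordCount': 400},
    {'title': 'No Metadata'},
]

kept = [doc for doc in sample_docs if removal_reason(doc) is None]
reasons = Counter(removal_reason(doc) for doc in sample_docs)

print(len(kept))
print(reasons)
```

A record with no `language` key fails the first test, which is one way a whole corpus can be filtered down to zero documents.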


If you'd like to see what's in your ``reduced_list``, you can run the code below to supply some of the metadata for the first 10 items.

In [57]:
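reduced_list = all_documents # The sample dataset has no 'language', 'creators', or 'wordCount' metadata (see the output below), so the filters above removed every document; reset ``reduced_list`` to the full corpus to continue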

In [58]:
#Print information for the first 10 items in the ``reduced_list`` 
for r in range (10):
    print('Title ' + str(r) + ': ' + reduced_list[r].get('title'))
    print('Language: ' + str(reduced_list[r].get('language')))
    print('Authors: ' + str(reduced_list[r].get('creators')))
    print('Number of words: ' + str(reduced_list[r].get('wordCount')), end='\n\n')

Title 0: Residue analysis of a CTL epitope of SARS-CoV spike protein by IFN-gamma production and bioinformatics prediction
Language: None
Authors: None
Number of words: None

Title 1: Send Orders of Reprints at bspsaif@emirates.net.ae Synthetic Genomics and Synthetic Biology Applications Between Hopes and Concerns
Language: None
Authors: None
Number of words: None

Title 2: Identification, Characterization and Application of a G- Quadruplex Structured DNA Aptamer against Cancer Biomarker Protein Anterior Gradient Homolog 2
Language: None
Authors: None
Number of words: None

Title 3: Evasion of Antiviral Innate Immunity by Theiler's Virus L* Protein through Direct Inhibition of RNase L
Language: None
Authors: None
Number of words: None

Title 4: Building core capacities at the designated points of entry according to the International Health Regulations 2005: a review of the progress and prospects in Taiwan
Language: None
Authors: None
Number of words: None

Title 5: Excessive production

## Cleaning Up the Tokens in the Corpus

Let's create a helper function that can standardize and [clean](./key-terms.ipynb#clean-data) up the [tokens](./key-terms.ipynb#token) in our [dataset](./key-terms.ipynb#dataset). The function will:
* lower case all [tokens](./key-terms.ipynb#token)
* use a dictionary from [The HathiTrust Research Center](./key-terms.ipynb#htrc) to correct common [Optical Character Recognition](./key-terms.ipynb#ocr) problems
* discard [tokens](./key-terms.ipynb#token) less than 4 characters in length
* discard [tokens](./key-terms.ipynb#token) with non-alphabetical characters
* remove [stopwords](./key-terms.ipynb#stop-words) based on [The HathiTrust Research Center](./key-terms.ipynb#htrc) [stopword](./key-terms.ipynb#stop-words) list

In [59]:
from tdm_client import htrc_corrections # Import the htrc_corrections that helps correct common OCR problems

def process_token(token): #define a function `process_token` that takes the argument `token`
    token = token.lower() #set the string in token to a new string with all lowercase letters
    corrected = htrc_corrections.get(token) #look up `token` in `htrc_corrections` to see whether it is a known OCR error; the result is None if there is no correction
    if corrected is not None: #if corrected has a value, set the `token` variable to the same value as `corrected`
        token = corrected
    if len(token) < 4: #if token is less than four characters, return nothing from process_token (no output here essentially erases this token)
        return
    if not(token.isalpha()): #if token contains non-alphabetic characters, return nothing from process_token (no output here essentially erases this token)
        return
    return token #return the cleaned `token` (lowercased and, where a correction exists, corrected)

def process_document(chosen_document): # Create a new function ``process_document`` that takes the argument chosen_document
    this_doc = [] # Create a new list ``this_doc`` that will hold the contents of the current document
    singleDoc = chosen_document.get('unigramCount') # Create a dictionary variable ``singleDoc`` that maps each token in the current document to its count
    for token, count in singleDoc.items(): # For each token and its count in the document, 
        clean_token = process_token(token) # Use the ``process_token`` function above to clean that token
        if clean_token is None: # If no token is returned, skip to the next token
            continue
        this_doc += [clean_token] * count # Add the cleaned token to ``this_doc`` once for each occurrence
    documents.append(this_doc) # Add the cleaned tokens in ``this_doc`` to the ``documents`` list (``documents`` must be created before this function is called)

Now let's cycle through each document in the [corpus](./key-terms.ipynb#corpus) with our helper function.

In [60]:
documents = [] # An empty variable ``documents`` that will contain all our documents with cleaned tokens
for i in range(len(reduced_list)): # Repeat this process once for every document in ``reduced_list``
    process_document(reduced_list[i]) # Run the ``process_document`` function on the single article by reference to its index number of i
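The cleaning rules are easier to follow on a tiny example that does not need `tdm_client`. In this sketch, `sample_corrections` is a made-up stand-in for `htrc_corrections` (assumed to behave like a dictionary, since the code above calls `.get` on it), and `sample_unigrams` is an invented word count in the same shape as a document's `unigramCount`.

```python
# Hypothetical stand-in for htrc_corrections: maps a misread token to its fix
sample_corrections = {'tliat': 'that'}

def clean_token(token, corrections):
    """Lowercase and correct a token; return None if it should be discarded."""
    token = token.lower()
    token = corrections.get(token, token)  # use the correction if there is one
    if len(token) < 4 or not token.isalpha():
        return None  # discard short tokens and tokens with non-letters
    return token

# Hypothetical unigram counts for a single document
sample_unigrams = {'Grass': 2, 'the': 3, 'tliat': 1, '1950': 1, 'sea.': 1}

cleaned = []
for token, count in sample_unigrams.items():
    result = clean_token(token, sample_corrections)
    if result is not None:
        cleaned += [result] * count  # repeat the token once per occurrence

print(cleaned)
```

Only 'Grass' and 'tliat' survive: 'the' is too short, '1950' has digits, and 'sea.' carries punctuation.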

---
# Using Gensim to Compute "Term Frequency- Inverse Document Frequency"

It will be helpful to remember the basic steps we did in the explanatory [TF-IDF](./key-terms.ipynb#tf-idf) example:

1. Create a list of the frequency of every word in every document
2. Create a list of every word in the [corpus](./key-terms.ipynb#corpus)
3. Compute [TF-IDF](./key-terms.ipynb#tf-idf) based on that data

So far, we have completed the first item by creating a list of the frequency of every word in every document. Now we need to create a list of every word in the corpus. In [gensim](./key-terms.ipynb#gensim), this is called a "dictionary". A [gensim dictionary](./key-terms.ipynb#gensim-dictionary) is similar to a [Python dictionary](./key-terms.ipynb#python-dictionary), but here it is called a [gensim dictionary](./key-terms.ipynb#gensim-dictionary) to show it is a specialized kind of dictionary.

## Creating a Gensim Dictionary

Let's create our [gensim dictionary](./key-terms.ipynb#gensim-dictionary). A [gensim dictionary](./key-terms.ipynb#gensim-dictionary) is a kind of masterlist of all the words across all the documents in our corpus. Each unique word is assigned an ID in the gensim dictionary. The result is a set of key/value pairs of unique tokens and their unique IDs.

In [61]:
import gensim
dictionary = gensim.corpora.Dictionary(documents) # Create the Gensim dictionary based on our ``documents`` variable

Now that we have a [gensim dictionary](./key-terms.ipynb#gensim-dictionary), we can get a preview that displays the number of unique tokens across all of our texts.

In [62]:
print(dictionary)

Dictionary(49058 unique tokens: ['ability', 'about', 'absence', 'absences', 'according']...)


The [gensim dictionary](./key-terms.ipynb#gensim-dictionary) stores a unique identifier (starting with 0) for every unique token in the corpus. The [gensim dictionary](./key-terms.ipynb#gensim-dictionary) does not contain information on word frequencies; it only catalogs all the words in the corpus. You can see the unique ID for each token in the text using the ``.token2id`` attribute, which maps each token to its ID. Your corpus may have hundreds of thousands of unique words, so here we just give a preview of the first ten.

In [63]:
dict(list(dictionary.token2id.items())[0:10]) # Print the first ten tokens and their associated IDs.


{'ability': 0,
 'about': 1,
 'absence': 2,
 'absences': 3,
 'according': 4,
 'accuracy': 5,
 'acetonitrile': 6,
 'achieved': 7,
 'acid': 8,
 'acids': 9}

We can also look up the corresponding ID for a token using the ``.get`` method.

In [None]:
dictionary.token2id.get('people', 0) # Get the value for the key 'people': the gensim dictionary ID for that token. The default of 0 is returned if no token matches 'people'. Note that 0 is also a valid ID (here, 'ability'), so a default such as None would be less ambiguous.
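
The idea behind this token-to-ID mapping can be sketched in plain Python, without gensim. The `toy_documents` data and the rule of assigning IDs in sorted order within each document are assumptions made for illustration; gensim's own ordering may differ.

```python
# Toy corpus: each document is a list of cleaned tokens (illustrative data)
toy_documents = [
    ['acid', 'about', 'acid'],
    ['about', 'binding', 'peptide'],
]

toy_token2id = {}
for doc in toy_documents:
    for token in sorted(set(doc)):  # Sorted only to make the toy IDs predictable
        if token not in toy_token2id:
            toy_token2id[token] = len(toy_token2id)  # Assign the next unused ID

print(toy_token2id)                 # {'about': 0, 'acid': 1, 'binding': 2, 'peptide': 3}
print(toy_token2id.get('peptide'))  # 3
print(toy_token2id.get('people'))   # None, because 'people' is not in the toy corpus
```

Like the gensim dictionary, this mapping records which words exist but says nothing about how often they occur.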

## Creating a Bag of Words Corpus


### A Single Document Example

The next step is to combine the word frequency data found within ``documents`` with our [gensim dictionary](./key-terms.ipynb#gensim-dictionary) token IDs. For every document, we want to know how many times a word (notated by its ID) occurs. We can do a single document first to show how this works. We will create a [Python list](./key-terms.ipynb#python-list) called ``example_bow_corpus`` that will turn our word counts into a series of [tuples](./key-terms.ipynb#tuple) where the first number is the [gensim dictionary](./key-terms.ipynb#gensim-dictionary) token ID and the second number is the word frequency.

In [64]:
example_bow_corpus = [dictionary.doc2bow(documents[31])] # Create an example bag of words corpus. The document at index 31 is an arbitrary choice to use as our sample.
list(example_bow_corpus[0][:10]) # List out the first ten tuples in ``example_bow_corpus``

[(1, 2),
 (8, 5),
 (10, 1),
 (14, 8),
 (15, 5),
 (21, 11),
 (22, 2),
 (31, 1),
 (32, 2),
 (33, 1)]
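
As a rough illustration of what a bag of words conversion does, the following plain-Python sketch counts tokens and pairs each count with a token ID. The `toy_token2id` mapping, the `toy_doc` list, and the `toy_doc2bow` function are hypothetical; this is not gensim's implementation.

```python
from collections import Counter

# Hypothetical token-to-ID mapping and one cleaned document
toy_token2id = {'about': 0, 'acid': 1, 'binding': 2, 'peptide': 3}
toy_doc = ['acid', 'peptide', 'acid', 'about', 'unknown']

def toy_doc2bow(doc, token2id):
    """Count each known token and return sorted (token ID, frequency) tuples.
    Tokens missing from the mapping are ignored in this sketch."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

print(toy_doc2bow(toy_doc, toy_token2id))  # [(0, 1), (1, 2), (3, 1)]
```

IDs with a count of zero (here, ID 2 for 'binding') are simply left out, which keeps the representation compact for large vocabularies.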

Using IDs can seem a little abstract, but we can discover the word associated with a particular ID. For demonstration purposes, the following code will replace the token IDs in the last example with the actual tokens.

In [65]:
word_counts = [[(dictionary[token_id], count) for token_id, count in line] for line in example_bow_corpus] # Replace each token ID with its token
list(word_counts[0][:10]) # List out the first ten (token, count) tuples

[('about', 2),
 ('acid', 5),
 ('acquired', 1),
 ('added', 8),
 ('addition', 5),
 ('after', 11),
 ('against', 2),
 ('almost', 1),
 ('also', 2),
 ('although', 1)]

We saw before that you could discover the [gensim dictionary](./key-terms.ipynb#gensim-dictionary) ID number by running:

> dictionary.token2id.get('people', 0)

If you wanted to discover the token given only the ID number, you can index the [gensim dictionary](./key-terms.ipynb#gensim-dictionary) with the ID directly, as the comprehension in the previous cell does. To see how the mapping works underneath, you could also use a [list comprehension](./key-terms.ipynb#list-comprehensions) to find the **key** token based on the **value** ID. Normally, [Python dictionaries](./key-terms.ipynb#python-dictionary) only map from keys to values (not from values to keys). However, we can write a quick [list comprehension](./key-terms.ipynb#list-comprehensions) to go the other direction. (It is unlikely one would ever do this in practice, but it is shown here to demonstrate how the [gensim dictionary](./key-terms.ipynb#gensim-dictionary) is connected to the list entries in the [gensim](./key-terms.ipynb#gensim) ``bow_corpus``.)

In [66]:
[token for dict_id, token in dictionary.items() if dict_id == 155] # Find the corresponding token in our gensim dictionary for the gensim dictionary ID 155

['contained']
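
If many reverse lookups were needed, scanning every entry each time would be slow. One common alternative, sketched here on a hypothetical `toy_token2id` mapping rather than the real gensim dictionary, is to invert the mapping once and then look up IDs directly.

```python
# Hypothetical token-to-ID mapping
toy_token2id = {'about': 0, 'acid': 1, 'binding': 2, 'peptide': 3}

# Swap keys and values; this is safe because each ID belongs to exactly one token
toy_id2token = {token_id: token for token, token_id in toy_token2id.items()}

print(toy_id2token[2])  # binding
```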

## Creating a Bag of Words Corpus Using Every Document

We have seen an example that demonstrates how the [gensim](./key-terms.ipynb#gensim) [bag of words](./key-terms.ipynb#bag-of-words) [corpus](./key-terms.ipynb#corpus) works on a single document. Let's apply it now to all of our documents. 

In [67]:
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]
#print(bow_corpus[:3]) #Show the bag of words corpus for the first 3 documents

The next step is to create the [TF-IDF](./key-terms.ipynb#tf-idf) model which will set the parameters for our implementation of [TF-IDF](./key-terms.ipynb#tf-idf). In our [TF-IDF](./key-terms.ipynb#tf-idf) example, the formula for [TF-IDF](./key-terms.ipynb#tf-idf) was:

$$ \text{tf-idf} = (\text{number of times the word occurs in the given document}) \cdot \log \frac{\text{total number of documents}}{\text{total number of documents containing the word}} $$

In [gensim](./key-terms.ipynb#gensim), the default formula for measuring [TF-IDF](./key-terms.ipynb#tf-idf) uses log base 2 instead of log base 10, as shown:

$$ \text{tf-idf} = (\text{number of times the word occurs in the given document}) \cdot \log_{2} \frac{\text{total number of documents}}{\text{total number of documents containing the word}} $$

If you would like to use a different formula for your [TF-IDF](./key-terms.ipynb#tf-idf) calculation, there is a description of [parameters you can pass](https://radimrehurek.com/gensim/models/tfidfmodel.html).
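
To make the base-2 formula concrete, here is a small worked sketch in plain Python. The `toy_corpus` data and the `toy_tfidf` function are illustrative assumptions, not gensim's code. The sketch also rescales each document's scores to unit length, because gensim's `TfidfModel` normalizes scores by default (its `normalize` parameter), which is why the scores shown in this notebook fall between 0 and 1.

```python
import math

# Toy bag-of-words corpus: each document maps a token to its raw count
toy_corpus = [
    {'virus': 3, 'cell': 1},
    {'cell': 2, 'protein': 2},
    {'cell': 1, 'cough': 4},
]
total_docs = len(toy_corpus)

# Document frequency: in how many documents does each token appear?
doc_freq = {}
for doc in toy_corpus:
    for token in doc:
        doc_freq[token] = doc_freq.get(token, 0) + 1

def toy_tfidf(doc):
    """Raw count times log2(total docs / docs containing the token),
    then scaled so the document's scores have unit (L2) length."""
    raw = {t: c * math.log2(total_docs / doc_freq[t]) for t, c in doc.items()}
    raw = {t: w for t, w in raw.items() if w > 0}  # Drop zero-weight tokens
    length = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / length for t, w in raw.items()} if length else {}

print(toy_tfidf(toy_corpus[0]))
```

In this toy corpus, 'cell' appears in all three documents, so its weight is log2(3/3) = 0 and it drops out. 'virus' is the only remaining term in the first document, so after normalization its score is 1 (up to floating-point rounding).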

In [68]:
model = gensim.models.TfidfModel(bow_corpus) # Create our gensim TF-IDF model

Now, we apply our model to the ``bow_corpus`` to create our results in ``corpus_tfidf``. Like ``bow_corpus``, ``corpus_tfidf`` contains one entry per document. Instead of listing the frequency next to the [gensim dictionary](./key-terms.ipynb#gensim-dictionary) ID, however, each entry contains the [TF-IDF](./key-terms.ipynb#tf-idf) score for the associated token. Below, we display the first ten scores for the first document in ``corpus_tfidf``.

In [69]:
corpus_tfidf = model[bow_corpus] # Create TF-IDF scores for the ``bow_corpus`` using our model
list(corpus_tfidf[0][:10]) # List out the TF-IDF scores for the first 10 tokens of the first text in the corpus

[(0, 0.011799202765245332),
 (1, 0.008036048923438305),
 (2, 0.004369120139099937),
 (3, 0.028736792523255245),
 (4, 0.011478847004727148),
 (5, 0.010378315746537695),
 (6, 0.01880829168645958),
 (7, 0.00711351106333274),
 (8, 0.0460031750787105),
 (9, 0.016458702901873113)]

Let's display the tokens instead of the [gensim dictionary](./key-terms.ipynb#gensim-dictionary) IDs.

In [70]:
example_tfidf_scores = [[(dictionary[token_id], score) for token_id, score in line] for line in corpus_tfidf] # Replace each token ID with its token
list(example_tfidf_scores[0][:10]) # List out the TF-IDF scores for the first 10 tokens of the first text in the corpus

[('ability', 0.011799202765245332),
 ('about', 0.008036048923438305),
 ('absence', 0.004369120139099937),
 ('absences', 0.028736792523255245),
 ('according', 0.011478847004727148),
 ('accuracy', 0.010378315746537695),
 ('acetonitrile', 0.01880829168645958),
 ('achieved', 0.00711351106333274),
 ('acid', 0.0460031750787105),
 ('acids', 0.016458702901873113)]

Finally, let's sort the terms by their [TF-IDF](./key-terms.ipynb#tf-idf) weights to find the most significant terms in the document.

In [71]:
# Sort the (token, score) tuples in our TF-IDF scores list from highest to lowest score

def sort_by_score(tfidf_tuples):
    return sorted(tfidf_tuples, key=lambda x: x[1], reverse=True) # Return a new sorted list, leaving the original unchanged

list(sort_by_score(example_tfidf_scores[0])[:10]) # List the top ten tokens in our example document by their TF-IDF scores

[('peptides', 0.35035070865018264),
 ('fmoc', 0.32086619856965903),
 ('epitope', 0.2994320168244199),
 ('peptide', 0.25308577163577206),
 ('anchor', 0.18865345053653582),
 ('elispot', 0.1706032138529056),
 ('residue', 0.15415285428730355),
 ('splenocytes', 0.1522897630164384),
 ('binding', 0.1364804307169562),
 ('prediction', 0.13213101796054588)]
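
When only the top few terms are needed, a full sort is not strictly required. As an alternative sketch (not the method used in this notebook), Python's standard-library `heapq.nlargest` selects the highest-scoring pairs directly. The `toy_scores` values are illustrative, not real results.

```python
import heapq

# Illustrative (token, score) pairs
toy_scores = [('about', 0.008), ('peptides', 0.350), ('acid', 0.046), ('fmoc', 0.321)]

# Select the two pairs with the highest score (the second item of each tuple)
top_two = heapq.nlargest(2, toy_scores, key=lambda pair: pair[1])
print(top_two)  # [('peptides', 0.35), ('fmoc', 0.321)]
```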

We could also analyze across the entire corpus to find the most unique terms. These are terms that appear frequently in a single text, but rarely or never appear in other texts. (Often, these will be proper names since a particular article may mention a name often but the name may rarely appear in other articles.)

In [72]:
td = { # Define a dictionary ``td`` that gathers every token and its TF-IDF score across all documents (a token found in several documents keeps the score from the last one)
        dictionary.get(_id): value for doc in corpus_tfidf
        for _id, value in doc
    }
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True) # Sort the items of ``td`` by score into a new variable ``sorted_td``; ``reverse=True`` orders them from highest to lowest

In [73]:
for term, weight in sorted_td[:25]: # Print the top 25 terms in the entire corpus
    print(term, weight)

cypa 0.988862841780877
binase 0.9678905911290004
gpnmb 0.9581098254231177
dgfw 0.9491123402701133
dips 0.9433876272345001
grft 0.9401874199169196
irak 0.9400469955770853
siba 0.9307091673998441
omvs 0.9251525551071448
ctsl 0.9185892839935239
atgus 0.9168596131763339
dmgf 0.9138803944742877
iirt 0.9100936512137063
participatory 0.9069623494362896
apmv 0.9056392141699534
ibrv 0.897379132318612
alkb 0.8903262409041488
oroa 0.882674190936685
ptal 0.8772153350787546
glycosites 0.876041287641947
niclosamide 0.8756255977878135
ahsv 0.8714955907167496
hbov 0.8513340580910022
whcv 0.8468878566890287
rndv 0.8455603806775437
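
One detail of the ``td`` dictionary comprehension is worth noting: when a token appears in several documents, later documents overwrite earlier ones, so ``td`` keeps the score from the last document that contains the token. If you would rather keep each token's highest score, a variation like the following sketch could be used. The `toy_tfidf_docs` data is illustrative and stands in for ``corpus_tfidf`` with tokens already substituted for IDs.

```python
# Toy TF-IDF output: one list of (token, score) pairs per document (illustrative values)
toy_tfidf_docs = [
    [('cypa', 0.98), ('cell', 0.10)],
    [('cell', 0.05), ('binase', 0.96)],
]

best_score = {}
for doc in toy_tfidf_docs:
    for token, score in doc:
        # Keep the larger of the stored score and the new one
        best_score[token] = max(score, best_score.get(token, 0.0))

ranked = sorted(best_score.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('cypa', 0.98), ('binase', 0.96), ('cell', 0.1)]
```

Here 'cell' keeps its higher score of 0.10 from the first document, whereas simple overwriting would have left it at 0.05.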


And, finally, we can see the most significant term in each document. To keep the output short, the loop below stops after the first eleven documents.

In [74]:
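for n, doc in enumerate(corpus_tfidf): # Loop over every document's TF-IDF scores, tracking its index ``n``
    if len(doc) < 1: # Skip documents that have no scored tokens
        continue
    word_id, score = max(doc, key=lambda x: x[1]) # Find the (ID, score) pair with the highest score
    print(reduced_list[n].get('id'), dictionary.get(word_id), score) # Print the article ID, the token, and its score
    if n >= 10: # Stop after the first eleven documents
        break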

cord19-4152ae12ac49d157a290281842bce9e9d16096a7 peptides 0.35035070865018264
cord19-0414c8763c6ec295be60d86698ec3a76e981e54f synthetic 0.46517993608859876
cord19-a015401a1cdf151bd0bca060bd8eb96d6683a147 beads 0.46297277468832837
cord19-5908982f4dbbfde0faabcfadc0988b8a19223411 rnase 0.7299772411749696
cord19-dba0cdf3fbecdba11207ba0d7da322fc2a83b798 poes 0.6693689845298985
cord19-4a17cedf8c94b518c91dfa6719832894b5dd0417 adar 0.6256730732678308
cord19-ff77d5ebc39274bb7729f2b5261bca574bdfd6a8 hcov 0.8519882259225342
cord19-b8aeb68acc940bb4f4d68bb6e7bb89da81c5d12f dmvs 0.5032917516742966
cord19-b147a9e7654ef4257a7c45cae6606543a7864c80 jhmv 0.4654463133451279
cord19-ed02cdbaaf52f191ad3bebb5a3bc117524718c8e cough 0.4723303110988772
cord19-bdfce208ef62424bc68fdb610364dab6416365e1 higg 0.5866380711271436
