By <a href="https://nkelber.com">Nathan Kelber</a> and Ted Lawless <br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.
____
# Exploring Word Frequencies

**Difficulty:** Beginner

**Programming Knowledge Required:** 
* [Python Basics I](https://docs.tdm-pilot.org/python-basics-i/)
* [Python Basics II](https://docs.tdm-pilot.org/python-basics-ii/)
* Python Basics III

**Programming Knowledge Recommended**
* [Exploring Metadata](https://docs.tdm-pilot.org/exploring-metadata/)

A familiarity with [The Natural Language Toolkit](https://docs.tdm-pilot.org/key-terms/#nltk) and [Counter objects](https://docs.tdm-pilot.org/key-terms/#python-counter) is helpful but not required.

**Completion time:** 35 minutes

**Data Format:** [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor) and/or [Portico](https://docs.tdm-pilot.org/key-terms/#portico) [non-consumptive](https://docs.tdm-pilot.org/key-terms/#non-consumptive) [JSON Lines (.jsonl)](https://docs.tdm-pilot.org/key-terms/#jsonl)

**Libraries Used:**
* **[json](https://docs.tdm-pilot.org/key-terms/#json-python-library)** to convert our dataset from json lines format to a Python list
* **[NLTK](https://docs.tdm-pilot.org/key-terms/#nltk)** to help [clean](https://docs.tdm-pilot.org/key-terms/#clean-data) up our dataset
* **Counter** from the **Collections Module** to help sum up our word frequencies

**Description of methods in this notebook:**
This [notebook](https://docs.tdm-pilot.org/key-terms/#jupyter-notebook) shows how to explore the [word frequencies](https://docs.tdm-pilot.org/key-terms/#word-frequency) of your [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor) and/or [Portico](https://docs.tdm-pilot.org/key-terms/#portico) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) using [Python](https://docs.tdm-pilot.org/key-terms/#python). The following processes are described:

* Converting your [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor) and/or [Portico](https://docs.tdm-pilot.org/key-terms/#portico) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) into a Python list
* Creating a raw word frequency count
* Creating and modifying a [stop words list](https://docs.tdm-pilot.org/key-terms/#stop-words)
* Cleaning up the [corpus](https://docs.tdm-pilot.org/key-terms/#corpus)
* Creating a new word frequency list focused on [content words](https://docs.tdm-pilot.org/key-terms/#content-words)


___

## Importing your dataset

You have two options for bringing your dataset into the local environment:

1. Manually download and upload your dataset
2. Use a dataset id to automatically upload a dataset

### Option one: Manually download and upload your dataset

You can download your dataset from the corpus builder in the link shown below. (You may also have a link to your dataset in your email.) If you wish, you can modify your dataset on your local machine before the next upload phase. This gives you more flexibility than automatically pulling in your dataset with a dataset ID (option two below).

![The link for downloading your dataset](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/downloadDataset.png)

Once you have your dataset ready on your local machine, you can then upload your dataset into JupyterLab by clicking the upload button in the file pane on the left.

![The upload button in the file pane](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/uploadDataset.png)

Make sure to upload your dataset to the "datasets" folder. 

### Option two: Use a dataset ID to automatically upload a dataset

You'll use the `tdm_client` library to automatically upload your dataset. We import the `tdm_client` library, which contains functions for connecting to the JSTOR server containing our [corpus](https://docs.tdm-pilot.org/key-terms/#corpus) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset). To analyze your dataset, use the [dataset ID](https://docs.tdm-pilot.org/key-terms//#dataset-ID) provided when you created your [dataset](https://docs.tdm-pilot.org/key-terms//#dataset). A copy of your [dataset ID](https://docs.tdm-pilot.org/key-terms//#dataset-ID) was sent to your email when you created your [corpus](https://docs.tdm-pilot.org/key-terms/#corpus). It should look like a long series of characters separated by dashes.

In [None]:
#Importing your dataset with a dataset ID
import tdm_client
tdm_client.get_dataset("f6ae29d4-3a70-36ee-d601-20a8c0311273", "sampleJournalAnalysis") # Load the sample dataset, the full run of Shakespeare Quarterly from 1950-2013.

# Other humanities datasets:

#English
# Negro American Literature Forum (1967-1976) + Black American Literature Forum (1976-1991) + African American Review (1992-2016) (b4668c50-a970-c4d7-eb2c-bb6d04313542)
# Shakespeare Quarterly (1950-2013) (f6ae29d4-3a70-36ee-d601-20a8c0311273)
# ELH (1934-2014) (4999901a-fa17-31da-cfe5-2abf3a429df7)
# College English (1939-2016) (a161f384-720b-b6bf-a0cc-4d7d3b857e1c)
# PMLA (1889-2014) (1aea53b9-26d5-fe54-e35c-8259156ce6cd)

#History

#Philosophy

#Anthropology

#Law

#Art

#Classics
#Classical Quarterly (1907-2014) (82014740-8ed9-3c34-5716-d0879b8317f6)

Before we can begin working with our [dataset](https://docs.tdm-pilot.org/key-terms/#dataset), we need to convert the [JSON lines](https://docs.tdm-pilot.org/key-terms/#jsonl) file, a plain-text data format whose syntax is borrowed from [JavaScript](https://docs.tdm-pilot.org/key-terms/#javascript), into [Python](https://docs.tdm-pilot.org/key-terms/#python) data structures so we can work with it. Remember that each line of our [JSON lines](https://docs.tdm-pilot.org/key-terms/#jsonl) file represents a single text, whether that is a journal article, book, or something else. We will create a [Python](https://docs.tdm-pilot.org/key-terms/#python) list that contains every document. Within each list item for each document, we will use a [Python dictionary](https://docs.tdm-pilot.org/key-terms/#python-dictionary) of [key/value pairs](https://docs.tdm-pilot.org/key-terms/#key-value-pair) to store information related to that document.

Essentially, we will have a [list](https://docs.tdm-pilot.org/key-terms/#python-list) of documents, numbered from zero to the last document. Each [list](https://docs.tdm-pilot.org/key-terms/#python-list) item will be a [dictionary](https://docs.tdm-pilot.org/key-terms/#python-dictionary) of [key/value pairs](https://docs.tdm-pilot.org/key-terms/#key-value-pair) that allows us to retrieve information from that particular document by number. The structure will look something like this:

![Structure of the corpus, a list of dictionaries](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CorpusView.png)

For each item in our list we will be able to use [key/value pairs](https://docs.tdm-pilot.org/key-terms/#key-value-pair) to get a **value** if we supply a **key**. We will call our [Python list](https://docs.tdm-pilot.org/key-terms/#python-list) variable `all_documents` since it will contain all of the documents in our [corpus](https://docs.tdm-pilot.org/key-terms/#corpus).
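
As a minimal sketch of this structure, the following builds a tiny list of dictionaries by hand. The two documents and their values are invented for illustration; only the key names `title` and `unigramCount` come from the dataset format used later in this notebook.

```python
# Hypothetical mini-corpus: two invented documents, each stored as a dictionary
demo_documents = [
    {'title': 'First Example Article', 'unigramCount': {'the': 12, 'play': 3}},
    {'title': 'Second Example Article', 'unigramCount': {'the': 20, 'stage': 5}},
]

# Use a list index to pick a document, then a key to retrieve a value
first_document = demo_documents[0]
print(first_document['title'])

# Both steps can be combined on one line
print(demo_documents[1]['unigramCount']['stage'])
```

The real `all_documents` list has the same shape, only with many more documents and many more keys per document.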

In [None]:
# Replace with your filename and be sure your file is in your datasets folder
file_name = 'sampleJournalAnalysis.jsonl' 

# Import the json module
import json
# Create an empty new list variable named `all_documents`
all_documents = [] 
# Temporarily open the file named in `file_name` from the datasets/ folder
with open('./datasets/' + file_name) as dataset_file: 
    #for each line in the dataset file
    for line in dataset_file: 
        # Read each line into a Python dictionary.
        # Create a variable document that contains the line using json.loads to convert the json key/value pairs to a python dictionary
        document = json.loads(line) 
        # Append a new list item to `all_documents` containing the dictionary we created.
        all_documents.append(document) 

Now all of our documents have been converted from our original [JSON lines](https://docs.tdm-pilot.org/key-terms/#jsonl) file format (.jsonl) into a [Python list](https://docs.tdm-pilot.org/key-terms/#python-list) variable named `all_documents`. Let's see what we can discover about our [corpus](https://docs.tdm-pilot.org/key-terms/#corpus) with a few simple methods.

First, we can determine how many texts are in our [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) by using the `len()` function to get the size of `all_documents`. 

In [None]:
len(all_documents)

---
## Choosing a Document to Explore Word Frequencies
We will create a new variable called `chosenDocument` and set it equal to the first [list](https://docs.tdm-pilot.org/key-terms/#python-list) item in `all_documents`. (Remember, in computer code, 0 is the first item, 1 is the second item, 2 is the third item, etc.)

We'll also use the `.get()` method to retrieve some information about the item and print it here. Check to make sure this is a suitable article. If it is front matter or back matter, for example, you may want to select another article. You can achieve that by changing the index number in the first line of code. For example, you might change `all_documents[0]` (the first article in the list) to `all_documents[5]` (the sixth article in the list). Remember, in Python, counting starts with 0.

If you're not sure if the JSTOR document you selected is a good example, follow the URL to preview it.
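
The sketch below shows why `.get()` is a convenient choice here: it returns `None` (or a default you supply) when a key is missing, whereas square brackets would raise a `KeyError`. The `demo_document` dictionary is invented for illustration and is not taken from the dataset.

```python
# Hypothetical document that has a title but no volume number
demo_document = {'title': 'An Invented Title', 'publicationYear': 1950}

# .get() returns the value when the key exists
print(demo_document.get('title'))

# .get() returns None when the key is missing, instead of raising an error
print(demo_document.get('volumeNumber'))

# A second argument supplies a fallback value for missing keys
volume = demo_document.get('volumeNumber', 'unknown')
print('Volume: ' + str(volume))
```

This matters for real datasets because not every document is guaranteed to carry every metadata field.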

In [None]:
chosenDocument = all_documents[0] # Create a dictionary variable that contains the first document from all_documents. Change 0 to select a different document
print(chosenDocument.get('title')) # Get the value for the key title for `chosenDocument` and print it
print('written in ' + str(chosenDocument.get('isPartOf'))) # Print 'written in' and the journal value stored in the key 'isPartOf'
#print(str(chosenDocument.get('publicationYear')) + ', Volume ' + chosenDocument.get('volumeNumber')) # Print the value of the key `publicationYear` and `volumeNumber`
print('URL is: ' + chosenDocument.get('url')) # Print 'URL is: ' and the value for the key 'url' in `chosenDocument`

Now, let's examine the word counts from the `chosenDocument`. First, we create a new variable `wordCounts` that will contain the word counts from our `chosenDocument`. These are stored as [Python dictionary variables](https://docs.tdm-pilot.org/key-terms/#python-dictionary).

In [None]:
wordCounts = chosenDocument.get('unigramCount')
#dict(list(wordCounts.items())[:10]) #This code previews the first 10 items in your dictionary
#It does this by turning the `wordCounts` dictionary into a list and then shows 10 items (and then turns it back into a dictionary)

---
## An Explanation of the Counter Container Datatype

To help analyze our dictionary, we are going to use a special container datatype called a Counter. A Counter is like a dictionary. In fact, it uses curly braces `{}` like a dictionary. Here's an example where we turn a dictionary (`dictionaryDemo`) into a Counter (`counterDemo`) in order to explore the difference between the two:

In [None]:
from collections import Counter # Import Counter datatype
dictionaryDemo = {"Othello's": 23,
                  'Black': 3,
                  'Handkerchief': 4,
                  'cover': 4,
                  'of': 553} # Create example dictionary with key/value pairs of words and numbers
counterDemo = Counter(dictionaryDemo) # Turn the dictionary into a counter
counterDemo

As you can see, the Counter type looks identical to a dictionary with key/value pairs within {} that is surrounded by `Counter()`. Both dictionaries and counters can return a **value** from a **key**.

In [None]:
print(dictionaryDemo['cover']) # Using the Python dictionary `dictionaryDemo`, return the value for the key 'cover'
print(counterDemo['cover']) #Using the Python counter `counterDemo`, return the value for the key 'cover'

However, the Counter has some differences that will help us sum up our word tallies. One key difference is that a Counter returns a 0 for items that are not keys.

In [None]:
print(counterDemo['nosuchkeyexists']) # With a Counter, the value of the made-up key `nosuchkeyexists` is 0. 

Compare this with a dictionary. If a key is not in a dictionary, looking it up with square brackets raises a `KeyError`.

In [None]:
print(dictionaryDemo['nosuchkeyexists']) # With a dictionary, the value of the made-up key `nosuchkeyexists` causes a KeyError in Python

For our purposes, the most useful aspect of the counter datatype is that it lets us easily return the most common items through the `most_common()` method. We can specify an argument with this method to receive a specified number of results. Let's try it on our example `counterDemo`. 

In [None]:
counterDemo.most_common(3) # List the top 3 most common items in `counterDemo`

---
## Using Counter to Sum the Words in a Single Article

Let's return then to the dictionary we created to hold all the words in our article. We called that variable `wordCounts`. We can get a preview of the first 10 words in our dictionary using the code below.

In [None]:
dict(list(wordCounts.items())[:10])

Note, these are not in order of which words are most frequent. We can discover the most frequent words by turning our dictionary `wordCounts` into a Counter. We will call this Counter `word_freq` and then print out the top 25 most common words. 

In [None]:
word_freq = Counter(wordCounts) # Create `word_freq` that will be Counter datatype version of our original `wordCounts` dictionary
for key, value in word_freq.most_common(25): # For each key/value pair in word_freq Counter's top 25 most common words
    print(key.ljust(15), value) # print the `key` padded to 15 characters (left-justified), followed by the `value`

We have successfully created a word frequency list. There are a couple small issues, however, that we still need to address:
1. There are many [function words](https://docs.tdm-pilot.org/key-terms/#function-words), words like "the", "in", and "of" that are grammatically important but do not carry as much semantic meaning as [content words](https://docs.tdm-pilot.org/key-terms/#content-words), such as nouns and verbs. 
2. The words represented here are actually case-sensitive [strings](https://docs.tdm-pilot.org/key-terms/#string). That means that the string "the" is different from the string "The". You may notice this in your results above.

To solve these issues, we need to find a way to remove common [function words](https://docs.tdm-pilot.org/key-terms/#function-words) and combine [strings](https://docs.tdm-pilot.org/key-terms/#string) that may have capital letters in them. We can solve these issues by:

1. Using a [stopwords](https://docs.tdm-pilot.org/key-terms/#stop-words) list to remove common [function words](https://docs.tdm-pilot.org/key-terms/#function-words)
2. Lowercasing all the characters in each string to combine our counts
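
The two steps can be sketched on a small invented example before applying them to real data. The counts and the two-word stop list below are made up for illustration.

```python
from collections import Counter

# Invented raw counts in which 'The' and 'the' are tallied separately
demo_counts = {'The': 4, 'the': 10, 'Ghost': 2, 'ghost': 3, 'of': 7}
demo_stop_words = ['the', 'of']  # hypothetical stop list

demo_clean = Counter()
for word, count in demo_counts.items():
    lowered = word.lower()          # lowercase so 'Ghost' and 'ghost' match
    if lowered in demo_stop_words:  # skip function words on the stop list
        continue
    demo_clean[lowered] += count    # a Counter treats missing keys as 0

print(demo_clean)
```

Because a Counter returns 0 for a key it has not seen, `+=` works without first checking whether the word is already present.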

We could create our own stopwords list, but luckily there are many examples out there already. We'll use NLTK's [stopwords](https://docs.tdm-pilot.org/key-terms/#stop-words) list to get started.

First, we create a new list variable `stop_words` and initialize it with the common English [stopwords](https://docs.tdm-pilot.org/key-terms/#stop-words) from the [Natural Language Toolkit](https://docs.tdm-pilot.org/key-terms/#nltk) library. 

In [None]:
from nltk.corpus import stopwords #import stopwords from nltk.corpus
stop_words = stopwords.words('english') #create a list `stop_words` that contains the English stop words list

If you're curious what is in our stopwords list, we can print a slice of the first ten words in our list to get a preview.

In [None]:
stop_words[:10] #print the first 10 stop words in the list
#list(stop_words) #show the whole stopwords list

It may be that we want to add additional words to our stoplist. For example, we may want to remove character names. We can add items to the list by using the append method.

In [None]:
stop_words.append("hamlet")
stop_words[-10:] #evaluate and show me a slice of the last 10 items in the `stop_words` list variable

We can also add multiple words to our stoplist by using the extend() method. Notice that this method requires using a set of brackets `[]` to clarify that we are adding "gertrude" and "horatio" as list items.
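
To see why the brackets matter, the sketch below compares `append()` and `extend()` on a short invented list rather than on `stop_words` itself.

```python
# Invented starting list, copied so each method starts from the same place
base_words = ['a', 'the']

appended = list(base_words)
appended.append(['gertrude', 'horatio'])   # adds ONE item: the whole list
print(appended)

extended = list(base_words)
extended.extend(['gertrude', 'horatio'])   # adds each word as its own item
print(extended)
```

With `append()`, the new list ends up nested inside the old one, so a test such as `'horatio' in appended` would be `False`. With `extend()`, each name becomes a separate list item.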

In [None]:
stop_words.extend(["gertrude", "horatio"])
stop_words[-10:] #evaluate and show me a slice of the last 10 items in the `stop_words` list variable

We can also remove words from our list with the remove() method.

In [None]:
stop_words.remove("hamlet")
stop_words.remove("gertrude")
stop_words.remove("horatio")
stop_words[-10:] #evaluate and show me a slice of the last 10 items in the `stop_words` list variable

## Storing Stopwords in a CSV File
We could also store our stop words in a CSV file. A CSV, or "Comma-Separated Values" file, is a plain-text file with commas separating each entry. The file could be opened and modified with a text editor or spreadsheet software such as Excel or Google Sheets. Here's what our NLTK stopwords list will look like as a CSV file opened in a plain text editor.

![The csv file as an image](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/stopwordsCSV.png)

Let's create an example CSV.

In [None]:
import csv #import the csv library to work with csv files
outputFile = open('stopWords.csv', 'w', newline='') #create a variable `outputFile` that will be linked to a new csv file called stopWords.csv
outputWriter = csv.writer(outputFile) #create a writer object to add to our `outputFile`
outputWriter.writerow(stop_words) #add our list `stop_words` to the CSV file
outputFile.close() #close the CSV file

We have created a new file called stopWords.csv that you can open to modify. Go ahead and make a change to your stopWords.csv (either adding or subtracting words). Remember, there are no spaces between words in the CSV file. If you want to edit the CSV right inside Jupyter Lab, right-click on the file and select "Open With > Editor." 

![Selecting "Open With > Editor" in Jupyter Lab](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/editCSV.png)

Now go ahead and add in a new word. Remember a few things:

* Each word is separated from the next word by a comma.
* There are no spaces between the words.
* You must save changes to the file if you're using a text editor, Excel, or the Jupyter Lab editor.
* You can reopen the file to make sure your changes were saved.

Now let's read our CSV file back and overwrite our original `stop_words` list variable. 
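
Because `csv.reader` yields one list per row, and our file holds all of its words on a single row, we take the first row as the stop words list. The sketch below illustrates this with an in-memory file (`io.StringIO`) and an invented three-word list, so it does not touch `stopWords.csv`.

```python
import csv
import io

demo_words = ['i', 'me', 'my']  # invented stand-in for a stop words list

# Write the list as a single CSV row into an in-memory text buffer
buffer = io.StringIO()
csv.writer(buffer).writerow(demo_words)
print(repr(buffer.getvalue()))

# Read it back: the reader produces a list of rows, each row a list of strings
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows)       # one row containing three words
print(rows[0])    # the first (and only) row is our word list
```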

In [None]:
with open('stopWords.csv', newline='') as newStopwordsFile: # Open `stopWords.csv` for reading; the file closes automatically after this block
    newStopwordsReader = csv.reader(newStopwordsFile) # Create a reader object that parses each row of the file into a list
    stop_words = list(newStopwordsReader)[0] # Overwrite `stop_words` with the first (and only) row of the file
stop_words # Return the contents of the list variable stop_words



Refining a stopwords list for your analysis can take time. It depends on:

* What you are hoping to discover (for example, are function words important?)
* The material you are analyzing (a play text, for example, will repeat the speakers' names many times)

If your results are not satisfactory, you can always come back and adjust the stopwords. You may need to run your analysis many times to create a good stopword list.
___
## Cleaning and Standardizing Tokens

We can standardize and [clean](https://docs.tdm-pilot.org/key-terms/#clean-data) up the [tokens](https://docs.tdm-pilot.org/key-terms/#token) in our [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) by creating a function with several steps. The function will:
* discard [tokens](https://docs.tdm-pilot.org/key-terms/#token) fewer than 4 characters in length
* discard [tokens](https://docs.tdm-pilot.org/key-terms/#token) with non-alphabetical characters
* lowercase all characters in each [token](https://docs.tdm-pilot.org/key-terms/#token)
* remove [stopwords](https://docs.tdm-pilot.org/key-terms/#stop-words) based on the list we created in `stop_words`
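
One way to package these steps as a reusable function is sketched below. The name `clean_counts` and its parameters are hypothetical, chosen for illustration; the notebook's own cell applies the same tests inside a loop.

```python
from collections import Counter

def clean_counts(counts, stop_list, min_length=4):
    """Return a Counter of lowercased tokens that pass every cleaning test."""
    stop_set = set(stop_list)  # a set makes membership tests fast
    cleaned = Counter()
    for token, count in counts.items():
        if len(token) < min_length:   # discard short tokens
            continue
        if not token.isalpha():       # discard tokens with digits or punctuation
            continue
        lowered = token.lower()
        if lowered in stop_set:       # discard stop words
            continue
        cleaned[lowered] += count     # merge counts for case variants
    return cleaned

# Invented example counts
demo = clean_counts({'Crown': 2, 'crown': 3, 'the': 9, 'act3': 1, 'which': 4},
                    stop_list=['which'])
print(demo)
```

Converting the stop list to a set is an optional design choice in this sketch; checking membership in a list gives the same result, only more slowly for long lists.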

In [None]:
clean_word_freq = Counter() # define a new variable `clean_word_freq` that is an empty counter type
for token, count in word_freq.items(): # for each token(`key`), value(`count`) pair in our word_freq Counter variable
    if len(token) < 4: # require tokens to be 4+ characters
        continue
    if not token.isalpha(): # require tokens to only contain alphabetical characters
        continue
    t = token.lower() # lowercase the token
    if t in stop_words: # don't include stopwords
        continue
    clean_word_freq[t] += count # add the token to `clean_word_freq` if it passes each test

The resulting Counter `clean_word_freq` contains only lowercased, alphabetic tokens of four or more characters that are not in our stop words list, which should leave mostly content words. We can print the top 25 most common words.

In [None]:
for key, value in clean_word_freq.most_common(25): # For the top 25 most common key/value pairs in `clean_word_freq`
    print(key.ljust(15), value) # print the key (left-justified by 15 characters) followed by the value
    # Remember that the key above corresponds to the word and the value corresponds to the number of times that word occurs