By <a href="https://nkelber.com">Nathan Kelber</a> and Ted Lawless <br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.
____
## Explore the word frequencies for a JSTOR and/or Portico dataset

**Difficulty:** Beginner

**Programming Knowledge Required:** 
This notebook can be run on a JSTOR/Portico [non-consumptive](./key-terms.ipynb#non-consumptive) [JSON Lines (.jsonl)](./key-terms.ipynb#jsonl) [dataset](./key-terms.ipynb#dataset) with little to no knowledge of [Python](./key-terms.ipynb#python). To have a full understanding of the code used in this [notebook](./key-terms.ipynb#jupyter-notebook), we recommend learning:
* [Python Basics](https://automatetheboringstuff.com/2e/chapter1/)
* [Flow Control](https://automatetheboringstuff.com/2e/chapter2/)
* [Functions](https://automatetheboringstuff.com/2e/chapter3/)
* [Lists](https://automatetheboringstuff.com/2e/chapter4/)
* [Dictionaries](https://automatetheboringstuff.com/2e/chapter5/)

A familiarity with [The Natural Language Toolkit](./key-terms.ipynb#nltk) and [Counter objects](./key-terms.ipynb#python-counter) is helpful but not required.

**Completion time:** 35 minutes

**Data Format:** [JSTOR](./key-terms.ipynb#jstor) and/or [Portico](./key-terms.ipynb#portico) [non-consumptive](./key-terms.ipynb#non-consumptive) [JSON Lines (.jsonl)](./key-terms.ipynb#jsonl)

**Libraries Used:**
* [json](./key-terms.ipynb#json-python-library) to convert our dataset from json lines format to a Python list
* [NLTK](./key-terms.ipynb#nltk) to help [clean](./key-terms.ipynb#clean-data) up our dataset

**Description of methods in this notebook:**
This [notebook](./key-terms.ipynb#jupyter-notebook) shows how to explore the [word frequencies](./key-terms.ipynb#word-frequency) of your [JSTOR](./key-terms.ipynb#jstor) and/or [Portico](./key-terms.ipynb#portico) [dataset](./key-terms.ipynb#dataset) using [Python](./key-terms.ipynb#python). The following processes are described:

* Converting your [JSTOR](./key-terms.ipynb#jstor) and/or [Portico](./key-terms.ipynb#portico) [dataset](./key-terms.ipynb#dataset) into a Python list
* Creating a raw word frequency count
* Creating and modifying a [stop words list](./key-terms.ipynb#stop-words)
* Cleaning up the [corpus](./key-terms.ipynb#corpus)
* Creating a new word frequency list focused on [content words](./key-terms.ipynb#content-words)

A familiarity with the `Counter` datatype is helpful for understanding how this notebook sums word frequencies.
___

Before we can begin working with our [dataset](./key-terms.ipynb#dataset), we need to convert the [JSON lines](./key-terms.ipynb#jsonl) file, a plain-text format based on [JavaScript](./key-terms.ipynb#javascript) object notation, into [Python](./key-terms.ipynb#python) data structures so we can work with it. Remember that each line of our [JSON lines](./key-terms.ipynb#jsonl) file represents a single text, whether that is a journal article, book, or something else. We will create a [Python](./key-terms.ipynb#python) list that contains every document. Each list item will be a [Python dictionary](./key-terms.ipynb#python-dictionary) of [key/value pairs](./key-terms.ipynb#key-value-pair) that stores the information related to one document.

Essentially, we will have a [list](./key-terms.ipynb#python-list) of documents numbered from zero to the last document. Each [list](./key-terms.ipynb#python-list) item will be a [dictionary](./key-terms.ipynb#python-dictionary) of [key/value pairs](./key-terms.ipynb#key-value-pair), so we can select a document by its number and then retrieve information about that particular document. The structure will look something like this:

![Structure of the corpus, a list of dictionaries](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CorpusView.png)

For each item in our list we will be able to use [key/value pairs](./key-terms.ipynb#key-value-pair) to get a **value** if we supply a **key**. We will call our [Python list](./key-terms.ipynb#python-list) variable `all_documents` since it will contain all of the documents in our [corpus](./key-terms.ipynb#corpus).
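As a rough sketch of the list-of-dictionaries structure described above, the example below converts two made-up JSON Lines records held in memory. The names `sample_lines` and `sample_documents` and the record contents are illustrative assumptions, not part of a real dataset.

```python
import json

# Two made-up JSON Lines records (hypothetical data, not from a real dataset)
sample_lines = [
    '{"title": "Example Article One", "publicationYear": 1990}',
    '{"title": "Example Article Two", "publicationYear": 2005}',
]

# Convert each line into a dictionary and collect the dictionaries in a list
sample_documents = [json.loads(line) for line in sample_lines]

# Select a document by its position, then look up a value by its key
second_title = sample_documents[1]['title']
print(second_title)
```

The first index (`[1]`) picks a document from the list; the second (`['title']`) picks a value from that document's dictionary.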

In [None]:
fileName = 'shakespeareQuarterly.jsonl' #Replace with your filename and be sure your file is in your datasets folder

import json #import the json module
all_documents = [] #create an empty new list variable named `all_documents`
with open('./datasets/' + fileName) as dataset_file: #temporarily open the file named in `fileName` from the datasets/ folder
    for line in dataset_file: #for each line in the dataset file
        # Read each line into a Python dictionary.
        document = json.loads(line) #create a variable document that contains the line using json.loads to convert the json key/value pairs to a python dictionary
        all_documents.append(document) #append a new list item to `all_documents` containing the dictionary we created 

Now all of our documents have been converted from our original [JSON lines](./key-terms.ipynb#jsonl) file format (.jsonl) into a [Python list](./key-terms.ipynb#python-list) variable named `all_documents`. Let's see what we can discover about our [corpus](./key-terms.ipynb#corpus) with a few simple methods.

In [None]:
len(all_documents)

Above, we can determine how many texts are in our [dataset](./key-terms.ipynb#dataset) by using the `len()` function to get the size of `all_documents`. 

Next, let's select the first document to explore its word frequencies. We will create a new variable called `chosenDocument` and set it equal to the first [list](./key-terms.ipynb#python-list) item in `all_documents`. (Remember, in computer code, 0 is the first item, 1 is the second item, 2 is the third item, etc.)

We'll also use the .get method to retrieve some information about the item and print it here.

In [None]:
chosenDocument = all_documents[0] # Create a dictionary variable that contains the first document from all_documents
print(chosenDocument.get('title')) # Get the value for the key title for `chosenDocument` and print it
print('written in ' + str(chosenDocument.get('isPartOf'))) # Print 'written in' and the journal value stored in the key 'isPartOf'
print(str(chosenDocument.get('publicationYear')) + ', Volume ' + str(chosenDocument.get('volumeNumber'))) # Print the values of the keys `publicationYear` and `volumeNumber`; str() avoids an error if a value is missing or not a string
print('URL is: ' + str(chosenDocument.get('url'))) # Print 'URL is: ' and the value for the key 'url' in `chosenDocument`
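The `.get()` method returns `None` when a key is missing instead of stopping with an error, and it accepts an optional second argument to use as a fallback. A minimal sketch, assuming a made-up document that has no `volumeNumber` key (`sample_document` is a hypothetical name):

```python
# A made-up document dictionary with no 'volumeNumber' key (hypothetical data)
sample_document = {'title': 'Example Article', 'publicationYear': 1990}

title = sample_document.get('title')                       # key exists
missing = sample_document.get('volumeNumber')              # key absent, so None
fallback = sample_document.get('volumeNumber', 'unknown')  # key absent, so the default

print(title, missing, fallback)
```

Because `None` cannot be joined to a string with `+`, wrapping the result in `str()` or supplying a default keeps a print statement from failing on documents that lack a field.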

Now, let's examine the word counts from the `chosenDocument`. First, we create a new variable `wordCounts` that will contain the word counts from our `chosenDocument`. The counts are stored as a Python dictionary in which each key is a word and each value is the number of times that word appears.

In [None]:
wordCounts = chosenDocument.get('unigramCount')
#dict(list(wordCounts.items())[:10]) #This code turns the wordCounts dictionary into a list and then shows 10 items (and then turns it back into a dictionary)

To help analyze our dictionary, we are going to use a special container datatype called a Counter. A Counter is like a dictionary. In fact, it uses curly braces `{}` like a dictionary. Here's an example of a Counter:

In [None]:
from collections import Counter # Import Counter datatype
dictionaryDemo = {"Othello's": 23,
                  'Black': 3,
                  'Handkerchief': 4,
                  'cover': 4,
                  'of': 553} # Create example dictionary with key/value pairs of words and numbers
counterDemo = Counter(dictionaryDemo) # Turn the dictionary into a counter
counterDemo

As you can see, the Counter looks like a dictionary: its key/value pairs sit within `{}`, wrapped in `Counter()`. There are some important differences, though. First, a similarity: both can return a **value** from a **key**.

In [None]:
print(dictionaryDemo['cover']) # Look up the value for the key 'cover' in the dictionary
print(counterDemo['cover'])

However, the Counter also returns a 0 for items that are not keys.

In [None]:
print(counterDemo['nosuchkeyexists'])
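Returning 0 for a missing key is what makes a Counter convenient for tallying: a count can be incremented without first checking whether the key exists. A minimal sketch with a made-up token list (`sample_tokens` and `tally` are illustrative names):

```python
from collections import Counter

sample_tokens = ['cover', 'of', 'cover', 'black']  # made-up tokens

tally = Counter()            # start with an empty Counter
for token in sample_tokens:
    tally[token] += 1        # a missing key counts as 0, so this never fails

print(tally['cover'], tally['of'], tally['handkerchief'])
```

Looking up a missing key such as `'handkerchief'` reports 0 but does not add that key to the Counter.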

If a key is not in a dictionary, looking it up raises a `KeyError`.

In [None]:
print(dictionaryDemo['nosuchkeyexists'])
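When working with a plain dictionary, there are two common ways to guard against a missing key: test for the key first, or attempt the lookup and catch the `KeyError`. The sketch below uses made-up counts (`sample_counts` is a hypothetical name) and falls back to 0 in both cases.

```python
sample_counts = {'cover': 4, 'of': 553}  # made-up counts for illustration

# Option 1: test for the key before looking it up
if 'nosuchkeyexists' in sample_counts:
    checked = sample_counts['nosuchkeyexists']
else:
    checked = 0

# Option 2: attempt the lookup and catch the KeyError
try:
    caught = sample_counts['nosuchkeyexists']
except KeyError:
    caught = 0

print(checked, caught)
```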

For our purposes, the most useful feature of the Counter datatype is that it lets us easily return the most common items through the `most_common()` method. We can pass an argument to this method to receive a specified number of results. Let's try it on our example `counterDemo`.

In [None]:
counterDemo.most_common(3) # List the top 3 most common items in `counterDemo`

Let's return then to the dictionary we created called `wordCounts`. We'll turn that into a Counter datatype called `word_freq` and then print out the top 25 items.

In [None]:
word_freq = Counter(wordCounts) # Create word_freq that will be Counter datatype version of our wordCounts dictionary
for key, value in word_freq.most_common(25): # For each key/value pair in word_freq Counter's top 25 most common words
    print(key.ljust(15), value) #print the `key` padded with spaces to 15 characters, followed by the `value`

We have successfully created a word frequency list. There are a couple small issues, however, that we still need to address:
1. There are many [function words](./key-terms.ipynb#function-words), words like "the", "in", and "of" that are grammatically important but do not carry as much semantic meaning as [content words](./key-terms.ipynb#content-words), such as nouns and verbs.
2. The words represented here are actually case-sensitive [strings](./key-terms.ipynb#string). That means that the string "the" is different from the string "The". You may notice this in your results above.

To solve these issues, we need to find a way to remove common [function words](./key-terms.ipynb#function-words) and combine [strings](./key-terms.ipynb#string) that differ only in capitalization. We can solve these issues by:

1. Using a [stopwords](./key-terms.ipynb#stop-words) list to remove common [function words](./key-terms.ipynb#function-words)
2. Lowercasing all the characters in each string to combine our counts
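The second step can be sketched on its own with a few made-up counts. Here `sample_counts` and `merged` are illustrative names; the point is that lowercasing each key lets counts for "The" and "the" accumulate under a single entry.

```python
from collections import Counter

# Made-up counts in which the same word appears with different capitalization
sample_counts = Counter({'The': 12, 'the': 340, 'Othello': 7, 'othello': 2})

merged = Counter()
for token, count in sample_counts.items():
    merged[token.lower()] += count   # 'The' and 'the' now share one key

print(merged['the'], merged['othello'])
```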

We could create our own stopwords list, but luckily there are many examples out there already. We'll use NLTK's [stopwords](./key-terms.ipynb#stop-words) list to get started.

First, we create a new list variable `stop_words` and initialize it with the common English [stopwords](./key-terms.ipynb#stop-words) from the [Natural Language Toolkit](./key-terms.ipynb#nltk) library. 

In [None]:
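from nltk.corpus import stopwords #import stopwords from nltk.corpus
# If the next line raises a LookupError, download the list first with: import nltk; nltk.download('stopwords')
stop_words = stopwords.words('english') #create a list `stop_words` that contains the English stop words list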

We can print a slice of the first ten words in our list to get a preview.

In [None]:
stop_words[:10] #print the first 10 stop words in the list
#list(stop_words) #show the whole stopwords list

It may be that we want to add additional words to our stoplist. For example, we may want to remove character names. We can add items to the list by using the append method.

In [None]:
stop_words.append("hamlet")
stop_words[-10:] #evaluate and show me a slice of the last 10 items in the `stop_words` list variable

We can also add multiple words to our stoplist by using the extend() method. Notice that this method requires using a set of brackets `[]` to clarify that we are adding "gertrude" and "horatio" as list items.

In [None]:
stop_words.extend(["gertrude", "horatio"])
stop_words[-10:] #evaluate and show me a slice of the last 10 items in the `stop_words` list variable
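To see why the brackets matter, the sketch below compares `append()` and `extend()` on a small made-up list (`sample_stop_words` and `letters` are illustrative names). Passing a bare string to `extend()` adds each character as a separate item, which is rarely what we want.

```python
sample_stop_words = ['the', 'of']        # made-up starting list

sample_stop_words.append('hamlet')       # adds one item
sample_stop_words.extend(['gertrude'])   # adds each item of the given list

letters = []
letters.extend('horatio')                # a bare string is split into characters

print(sample_stop_words)
print(letters)
```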

We can also remove words from our list with the remove() method.

In [None]:
stop_words.remove("hamlet")
stop_words.remove("gertrude")
stop_words.remove("horatio")
stop_words[-10:] #evaluate and show me a slice of the last 10 items in the `stop_words` list variable

We could also store our stop words in a plaintext file and then modify that file with a text editor to add or subtract words. Let's assume, however, that we are happy with the current stopwords list.
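One possible way to handle the plaintext approach is sketched below, assuming a file with one stop word per line. The file name `stop_words.txt`, the temporary folder, and the variable names are all hypothetical; the sketch writes a tiny file first only so that it can run on its own.

```python
import os
import tempfile

# Write a small made-up stop words file, one word per line (hypothetical file)
folder = tempfile.mkdtemp()
stop_words_path = os.path.join(folder, 'stop_words.txt')
with open(stop_words_path, 'w', encoding='utf-8') as f:
    f.write('the\nof\nhamlet\n')

# Read the file back into a list, skipping blank lines
with open(stop_words_path, encoding='utf-8') as f:
    file_stop_words = [line.strip() for line in f if line.strip()]

print(file_stop_words)
```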
___

We can standardize and [clean](./key-terms.ipynb#clean-data) up the [tokens](./key-terms.ipynb#token) in our [dataset](./key-terms.ipynb#dataset) by creating a function with several steps. The function will:
* discard [tokens](./key-terms.ipynb#token) fewer than 4 characters in length
* discard [tokens](./key-terms.ipynb#token) with non-alphabetical characters
* lowercase all characters in each [token](./key-terms.ipynb#token)
* remove [stopwords](./key-terms.ipynb#stop-words) based on the list we created in `stop_words`

In [None]:
clean_word_freq = Counter() # define a new variable `clean_word_freq` that is an empty counter type
for token, count in word_freq.items(): # for each token(`key`), value(`count`) pair in our word_freq Counter variable
    if len(token) < 4: # require tokens to be 4+ characters
        continue
    if not token.isalpha(): # require tokens to only contain alphabetical characters
        continue
    t = token.lower() # lowercase the token
    if t in stop_words: # don't include stopwords
        continue
    clean_word_freq[t] += count # add the count to `clean_word_freq` if the token passes each test

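The resulting Counter `clean_word_freq` contains only tokens that are alphabetical, lowercased, at least four characters long, and not in our stop words list. This leaves mostly [content words](./key-terms.ipynb#content-words). We can print the top 25 most common words.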

In [None]:
for token, count in clean_word_freq.most_common(25): # for the top 25 most common token/count pairs in `clean_word_freq`
    print(token.ljust(15), count) # print the token (left-justified by 15 characters) followed by the count
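The cleaning steps in this section can also be wrapped in a function so they are easy to reuse on another document. This is a sketch rather than part of the original notebook: the function name `clean_counts`, the `min_length` parameter, and the sample data are assumptions made for illustration.

```python
from collections import Counter

def clean_counts(raw_counts, stop_words, min_length=4):
    """Return a Counter of lowercased, alphabetic, non-stop-word tokens."""
    stop_set = set(stop_words)   # set membership tests are faster than list tests
    cleaned = Counter()
    for token, count in raw_counts.items():
        if len(token) < min_length:   # discard short tokens
            continue
        if not token.isalpha():       # discard tokens with non-alphabetical characters
            continue
        t = token.lower()             # lowercase the token
        if t in stop_set:             # discard stop words
            continue
        cleaned[t] += count           # merge counts that differ only in capitalization
    return cleaned

# Made-up counts and stop words for illustration
sample_counts = {'Hamlet': 5, 'hamlet': 3, 'the': 90, 'These': 4, '1603': 2, 'ghost': 6}
result = clean_counts(sample_counts, ['these', 'the'])
print(result.most_common(2))
```

Converting the stop words to a set inside the function is a design choice: it does not change the results, but it speeds up the membership test when the stop list is long.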
    