Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
**For questions/comments/improvements, email nathan.kelber@ithaka.org.**<br />
![CC BY License Logo](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png)
____
# Exploring Word Frequencies

**Description of methods in this notebook:**
This [notebook](https://docs.tdm-pilot.org/key-terms/#jupyter-notebook) shows how to explore the [word frequencies](https://docs.tdm-pilot.org/key-terms/#word-frequency) of your [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor) and/or [Portico](https://docs.tdm-pilot.org/key-terms/#portico) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) using [Python](https://docs.tdm-pilot.org/key-terms/#python). The following processes are described:

* Converting your [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor) and/or [Portico](https://docs.tdm-pilot.org/key-terms/#portico) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) into a Python list
* Creating a raw word frequency count
* Creating and modifying a [stop words list](https://docs.tdm-pilot.org/key-terms/#stop-words)
* Cleaning up the [corpus](https://docs.tdm-pilot.org/key-terms/#corpus)
* Creating a new word frequency list focused on [content words](https://docs.tdm-pilot.org/key-terms/#content-words)

**Difficulty:** Intermediate

**Knowledge Required:** 
* [Python Basics I](./0-python-basics-1.ipynb)
* [Python Basics II](./0-python-basics-2.ipynb)
* [Python Basics III](./0-python-basics-3.ipynb)

**Knowledge Recommended:**
* [Exploring Metadata](https://docs.tdm-pilot.org/exploring-metadata/)
* A familiarity with [The Natural Language Toolkit](https://docs.tdm-pilot.org/key-terms/#nltk) and [Counter objects](https://docs.tdm-pilot.org/key-terms/#python-counter) is helpful

**Completion time:** 60 minutes

**Data Format:** [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor)/[Portico](https://docs.tdm-pilot.org/key-terms/#portico) [JSON Lines (.jsonl)](https://docs.tdm-pilot.org/key-terms/#jsonl)

**Libraries Used:**
* **[tdm_client](https://docs.tdm-pilot.org/key-terms/#tdm-client)** to collect, unzip, and read our dataset
* **[NLTK](https://docs.tdm-pilot.org/key-terms/#nltk)** to help [clean](https://docs.tdm-pilot.org/key-terms/#clean-data) up our dataset
* **Counter** from the **collections** module to help sum up our word frequencies
___

## Import your dataset

We'll use the `tdm_client` library to automatically retrieve the dataset as a gzipped JSON Lines (.jsonl.gz) file.

Enter a [dataset ID](https://docs.tdm-pilot.org/key-terms/#dataset-ID) in the next code cell. 

If you don't have a dataset ID, you can:
* Use the sample dataset ID already in the code cell
* [Create a new dataset](https://tdm-pilot.org/builder)
* [Use a dataset ID from other pre-built sample datasets](https://tdm-pilot.org/dataset/dashboard)

In [1]:
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"

Next, import `tdm_client` and pass `dataset_id` as an argument to its `get_dataset` method, which downloads the dataset.

In [2]:
# Importing your dataset with a dataset ID
from tdm_package import tdm_client
# Pull in the dataset that matches `dataset_id`
# in the form of a gzipped JSON lines file.
compressed_json_lines_file = tdm_client.get_dataset(dataset_id)

INFO:root:Downloading 7e41317e-740f-e86a-4729-20dab492e925 to 7e41317e-740f-e86a-4729-20dab492e925.jsonl.gz


100% |########################################################################|


## Apply Pre-Processing Filters (if available)
If you completed pre-processing with the "Exploring Metadata and Pre-processing" notebook, you can use your CSV file of document IDs to automatically filter the dataset. Your pre-processed CSV file must be in the root folder and be named `pre-processed_` followed by your dataset ID, with a `.csv` extension.

In [6]:
# Import a pre-processed CSV file of filtered document IDs.
# If you do not have a pre-processed CSV file, the analysis
# will run on the full dataset and may take longer to complete.
import pandas as pd
try:
    df = pd.read_csv(f'pre-processed_{dataset_id}.csv')
    filtered_id_list = df["id"].tolist()
    print('Pre-processed CSV file read successfully: ' + str(len(df)) + ' documents.')
except FileNotFoundError:
    # No CSV file exists, so no filtering will be applied
    filtered_id_list = None
    print('No pre-processed CSV file found. Full dataset will be used.')

No pre-processed CSV file found. Full dataset will be used.
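
The filtering cells later in this notebook test each document ID against `filtered_id_list`. As a minimal sketch (the IDs below are made-up placeholders, not real document IDs, and `to_id_filter` is a hypothetical helper), converting the list to a set keeps that membership test fast on large datasets, while leaving `None` untouched so the "no filter" case still works:

```python
# Hypothetical IDs for illustration only; real IDs come from the pre-processed CSV.
example_id_list = ["doc-001", "doc-002", "doc-003"]

def to_id_filter(id_list):
    """Return a set of IDs for fast lookups, or None when no filter is in use."""
    if id_list is None:
        return None
    return set(id_list)

id_filter = to_id_filter(example_id_list)
print("doc-002" in id_filter)   # True
print("doc-999" in id_filter)   # False
print(to_id_filter(None))       # None
```

Checking membership in a list scans the list item by item, while a set lookup takes roughly constant time, so the difference grows with the number of IDs.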


## Extract the Unigram Counts from the dataset JSON file

We pulled in our dataset using a `dataset_id`. The file, which resides in the datasets/ folder, is a compressed JSON Lines file (jsonl.gz) that contains all the metadata information found in the metadata CSV *plus* the textual data necessary for analysis including:

* Unigram Counts
* Bigram Counts
* Trigram Counts
* Full-text (if available)
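
To make the file format concrete, here is a minimal sketch of reading a gzipped JSON Lines file with only the Python standard library. It writes a tiny made-up sample file first so that it can run on its own. The field names `id` and `unigramCount` match those used in this notebook; the sample values and the `read_jsonl_gz` helper are assumptions for illustration and are not part of `tdm_client`.

```python
import gzip
import json
import os
import tempfile

# Two made-up documents shaped like the records described above.
sample_docs = [
    {"id": "doc-001", "unigramCount": {"the": 3, "play": 1}},
    {"id": "doc-002", "unigramCount": {"the": 2, "stage": 4}},
]

# Write the sample as gzipped JSON Lines: one JSON object per line.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for doc in sample_docs:
        f.write(json.dumps(doc) + "\n")

def read_jsonl_gz(path):
    """Yield one dictionary per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

for document in read_jsonl_gz(sample_path):
    print(document["id"], document["unigramCount"])
```

Because each line is a complete JSON object, the file can be processed one document at a time without loading the whole dataset into memory.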

To complete our analysis, we are going to pull out the unigram counts for each document and store them in a Counter() object. We will import `Counter` which will allow us to use Counter() objects for counting unigrams. Then we will initialize an empty Counter() object `word_frequency` to hold all of our unigram counts.

In [7]:
# Import Counter()
from collections import Counter

# Create an empty Counter object called `word_frequency`
word_frequency = Counter()
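
As a quick illustration of why a Counter suits this task, the sketch below (using made-up counts) shows that a missing key starts at zero, so counts from several documents can be added without checking whether a word has been seen before. `Counter.update` is a standard-library shortcut that adds a whole dictionary of counts at once.

```python
from collections import Counter

example_frequency = Counter()

# Made-up unigram counts for two documents.
doc_one = {"the": 3, "play": 1}
doc_two = {"the": 2, "stage": 4}

# Adding counts one word at a time, as the next cell does.
for gram, count in doc_one.items():
    example_frequency[gram] += count

# Equivalent shortcut: add every count from a dictionary in one call.
example_frequency.update(doc_two)

print(example_frequency["the"])      # 5
print(example_frequency["missing"])  # 0, and no KeyError is raised
```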

In [8]:
# Gather unigramCounts from every document, or only from the
# documents in `filtered_id_list` when a pre-processed CSV was loaded.

for document in tdm_client.dataset_reader(compressed_json_lines_file):
    # Skip documents that are not in the filter list (if one exists)
    if filtered_id_list is not None and document['id'] not in filtered_id_list:
        continue
    # Default to an empty dictionary when a document has no unigram counts
    unigrams = document.get("unigramCount", {})
    for gram, count in unigrams.items():
        word_frequency[gram] += count

# Success message
if filtered_id_list is not None:
    print('Unigrams have been collected for documents in filtered_id_list')
else:
    print('Unigrams have been collected for all documents without filtering')

Unigrams have been collected for all documents without filtering


## Find Most Common Unigrams
Now that we have the frequency of every unigram in our corpus, we can sort them to find which are most common.

In [9]:
# Print the 25 most common unigrams, padding each word to 20 characters
for gram, count in word_frequency.most_common(25):
    print(gram.ljust(20), count)

the                  1160276
of                   906898
and                  682419
in                   461328
to                   418017
a                    334082
is                   214663
that                 204277
by                   181605
as                   177774
for                  161860
The                  153807
his                  132182
with                 124553
on                   113208
Shakespeare          98653
at                   88655
was                  80537
from                 79501
not                  79131
he                   78080
it                   77153
be                   72885
an                   70793
this                 67783
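
Raw counts are hard to compare across datasets of different sizes. One common follow-up, sketched here with made-up counts rather than the real corpus, is to divide each count by the total number of tokens to get a relative frequency.

```python
from collections import Counter

# Made-up counts for illustration; substitute `word_frequency` in the notebook.
example_frequency = Counter({"the": 50, "of": 30, "Shakespeare": 20})

# The total number of tokens is the sum of all counts.
total_tokens = sum(example_frequency.values())

for gram, count in example_frequency.most_common(3):
    share = count / total_tokens
    # Show the word, its raw count, and its share of all tokens as a percentage.
    print(gram.ljust(20), str(count).rjust(6), f"{share:.1%}")
```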


In [23]:
# Load a custom stop_words.csv if available
# Otherwise, load the nltk stopwords list in English
import csv

# Create an empty Python list to hold the stopwords
stop_words = []

# The filename of the custom stop_words.csv file
stopwords_list_filename = 'stop_words.csv'

# Load a custom stopwords list (all words on the first row of the CSV)
try:
    with open(stopwords_list_filename, 'r') as f:
        stop_words = list(csv.reader(f))[0]
    print('Custom stopwords list loaded from CSV')

# Load the NLTK stopwords list if the CSV file is missing or empty
except (FileNotFoundError, IndexError):
    from nltk.corpus import stopwords
    stop_words = stopwords.words('english')
    print('NLTK stopwords list loaded')


Custom stopwords list loaded from CSV
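
The cell above reads only the first row of `stop_words.csv`, so it expects all stop words on a single comma-separated line. A minimal sketch of writing and re-reading such a file follows; the word list and the temporary file location are placeholders for illustration.

```python
import csv
import os
import tempfile

# Placeholder stop words; a real list would be much longer.
example_stop_words = ["the", "of", "and", "item"]

# Write every word into a single CSV row.
csv_path = os.path.join(tempfile.mkdtemp(), "stop_words.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerow(example_stop_words)

# Read the first (and only) row back into a Python list.
with open(csv_path, "r", newline="", encoding="utf-8") as f:
    loaded_stop_words = next(csv.reader(f))

print(loaded_stop_words)
```

Adding a corpus-specific word such as `item` to the list is one way to remove terms that are frequent in a dataset but carry little meaning for the analysis.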


In [24]:
# Create an empty Counter object to hold the cleaned unigram counts
transformed_word_frequency = Counter()

In [26]:
# Gather cleaned unigramCounts from every document, or only from the
# documents in `filtered_id_list` when a pre-processed CSV was loaded.

for document in tdm_client.dataset_reader(compressed_json_lines_file):
    # Skip documents that are not in the filter list (if one exists)
    if filtered_id_list is not None and document['id'] not in filtered_id_list:
        continue
    # Default to an empty dictionary when a document has no unigram counts
    unigrams = document.get("unigramCount", {})
    for gram, count in unigrams.items():
        clean_gram = gram.lower()
        # Skip stop words
        if clean_gram in stop_words:
            continue
        # Skip anything that is not purely alphabetic
        if not clean_gram.isalpha():
            continue
        # Counts are stored under the word's original casing
        transformed_word_frequency[gram] += count

# Success message
if filtered_id_list is not None:
    print('Unigrams have been collected for documents in filtered_id_list')
else:
    print('Unigrams have been collected for all documents without filtering')

Unigrams have been collected for all documents without filtering
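
The filtering rules used above can be summarized in one small function. This is a sketch with a hypothetical helper name, `is_content_word`, and a placeholder stop words set; it applies the same two tests: the lowercased word must not be a stop word, and it must consist only of alphabetic characters.

```python
# Placeholder stop words for illustration.
example_stop_words = {"the", "of", "and"}

def is_content_word(gram, stop_words):
    """Return True if `gram` passes both cleaning tests."""
    clean_gram = gram.lower()
    if clean_gram in stop_words:
        return False          # Common function word
    if not clean_gram.isalpha():
        return False          # Contains digits, punctuation, or hyphens
    return True

for gram in ["The", "Hamlet", "1599", "well-known", "stage"]:
    print(gram.ljust(12), is_content_word(gram, example_stop_words))
```

Note that `str.isalpha` rejects hyphenated words such as `well-known` along with numbers and punctuation, so this cleaning step is deliberately strict.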



In [28]:
transformed_word_frequency

Counter({'Hidden': 86,
         'Study': 3195,
         'HANNES': 1,
         'Reviewed': 1697,
         'COURSEN': 39,
         'dilemma': 338,
         'Freudian': 308,
         'articulated': 261,
         'recently': 1362,
         'Madelon': 53,
         'Gohlke': 8,
         'difficult': 2348,
         'thesis': 687,
         'move': 1436,
         'beyond': 2838,
         'delivered': 924,
         'annual': 883,
         'meeting': 1048,
         'Shakespeare': 98653,
         'Association': 1960,
         'BOOK': 2020,
         'REVIEWS': 3363,
         'Richard': 21150,
         'useful': 1620,
         'extension': 341,
         'insights': 659,
         'bodily': 494,
         'continues': 794,
         'make': 9759,
         'ground': 1204,
         'inquiry': 286,
         'merely': 3603,
         'biological': 193,
         'produce': 1258,
         'contents': 320,
         'Carl': 636,
         'Jung': 57,
         'called': 3720,
         'personal': 2754,
         'c

In [27]:
# Print the 25 most common unigrams after cleaning
for gram, count in transformed_word_frequency.most_common(25):
    print(gram.ljust(20), count)

Shakespeare          98653
one                  39471
play                 31278
English              26848
New                  26409
also                 25410
John                 23831
would                23621
Item                 23381
SHAKESPEARE          23112
Henry                22230
two                  21773
Richard              21150
first                20830
may                  20745
University           19941
QUARTERLY            19865
plays                17815
see                  17418
King                 16522
Hamlet               16474
stage                16243
like                 15427
even                 14821
de                   14588
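
Because the counts above are keyed by each word's original casing, `Shakespeare` and `SHAKESPEARE` appear as separate entries. If case should be ignored, one option is to fold the keys to lowercase. The sketch below uses only four entries copied from the output above, so its totals do not include any other case variants present in the full dataset.

```python
from collections import Counter

# Four entries copied from the output above.
example_frequency = Counter({
    "Shakespeare": 98653,
    "SHAKESPEARE": 23112,
    "play": 31278,
    "QUARTERLY": 19865,
})

# Fold every key to lowercase and add together counts that collide.
case_folded = Counter()
for gram, count in example_frequency.items():
    case_folded[gram.lower()] += count

for gram, count in case_folded.most_common():
    print(gram.ljust(20), count)
```

Whether to fold case is a judgment call: it merges headings such as `SHAKESPEARE` with ordinary mentions, but it also erases the distinction between proper nouns and common words.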


# Lesson Complete
---


* Exploring Metadata and Pre-Processing
* Creating a Custom Stopwords List