By <a href="https://nkelber.com">Nathan Kelber</a> and Ted Lawless <br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.
____

## Explore the metadata for a JSTOR and/or Portico dataset

**Difficulty:** Beginner

**Programming Knowledge Required:** 
This notebook can be run on a JSTOR/Portico non-consumptive JSON Lines (.jsonl) dataset with little to no knowledge of Python. To have a full understanding of the code used in this notebook, we recommend learning:
* [Python Basics](https://automatetheboringstuff.com/2e/chapter1/)
* [Flow Control](https://automatetheboringstuff.com/2e/chapter2/)
* [Functions](https://automatetheboringstuff.com/2e/chapter3/)
* [Lists](https://automatetheboringstuff.com/2e/chapter4/)
* [Reading and Writing Files](https://automatetheboringstuff.com/2e/chapter9/)

A familiarity with Pandas is helpful but not required.

**Completion time:** 20 minutes

**Data Format:** JSTOR/Portico non-consumptive JSON Lines (.jsonl)

**Libraries Used:**
* json to convert our dataset from JSON Lines format to Python
* [Pandas](./key-terms.ipynb#pandas) to help visualize the metadata

**Description of methods in this notebook:**
This notebook shows how to explore the metadata of your JSTOR and/or Portico dataset using Python. The following processes are described:

* Importing your dataset
* Discovering the size and contents of your dataset
* Turning your dataset into a pandas dataframe
* Visualizing the contents of your dataset as a graph with pandas
____

In [1]:
import pandas as pd #imports pandas and allows us to call it with the phrase pd

The `as pd` just lets us use the shorthand `pd` when we want to call pandas instead of writing out the entire word `pandas`. 

In [8]:
fileName = 'shakespeareQuarterly.jsonl' #Replace with your filename and be sure your file is in your datasets folder

import json #import the json module
all_documents = [] #create an empty new list variable named `all_documents`
with open('./datasets/' + fileName) as dataset_file: #temporarily open the file named in `fileName` from the datasets/ folder
    for line in dataset_file: #for each line in the dataset file
        # Read each line into a Python dictionary.
        document = json.loads(line) #create a variable document that contains the line using json.loads to convert the json key/value pairs to a python dictionary
        all_documents.append(document) #append a new list item to `all_documents` containing the dictionary we created 

Before we can begin working with our dataset, we need to convert the JSON Lines file into Python objects. (JSON is a plain-text data format whose syntax is borrowed from JavaScript; it is not JavaScript code.) Remember that each line of our JSON Lines file represents a single text, whether that is a journal article, book, or something else. We can create a Python list where every item in the list represents a single text. 

![Structure of the corpus, a list of dictionaries](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CorpusView.png)

Each item of our list can then contain a Python dictionary of key/value pairs that allows us to get a value if we supply a key. Our whole corpus is stored in a list variable called `all_documents`.
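
As a small illustration of this list-of-dictionaries structure, the sketch below parses two made-up JSON Lines records from an in-memory string instead of a real dataset file. The records, their field values, and the `sample_documents` name are placeholders for illustration only.

```python
import io
import json

# Hypothetical two-line stand-in for a .jsonl file; ids, titles, and years are invented.
sample_file = io.StringIO(
    '{"id": "doc-1", "title": "First Sample Article", "publicationYear": 1951}\n'
    '{"id": "doc-2", "title": "Second Sample Article", "publicationYear": 1987}\n'
)

sample_documents = []
for line in sample_file:
    # Each line is one JSON object, which json.loads turns into a Python dictionary.
    sample_documents.append(json.loads(line))

print(len(sample_documents))         # how many documents were read
print(type(sample_documents[0]))     # each list item is a dictionary
print(sample_documents[1]["title"])  # look up a value by its key
```

The same loop works on a real file object opened with `open()`, since both can be read one line at a time.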

In [3]:
len(all_documents)

6687

We can determine how many texts are in our dataset by using the `len()` function to get the size of `all_documents`. 

In [7]:
chosenDocument = all_documents[0] #define a new dictionary variable `chosenDocument` that is equal to the first item in our `all_documents` list
chosenDocument.get('title') #get the corresponding value for the key 'title'

"Othello's Black Handkerchief"

We can choose a single document and get metadata for that item. Let's try the first document. (In Python, list positions start at 0, so 0 is the first item, 1 is the second item, 2 is the third item, etc.) In this case, we use the `.get` method to retrieve the title for the first item in our list, `all_documents[0]`. We can use this method to discover many kinds of metadata in our original file by supplying a different key. For example:
* `.get('title')` returns the title
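
One useful property of `.get` is that it does not raise an error when a key is absent. The sketch below uses a hypothetical one-key dictionary (`sample_document` is not part of the dataset) to show what comes back for a present key, a missing key, and a missing key with a fallback value.

```python
# Hypothetical document with a single key; the 'subtitle' key is deliberately absent.
sample_document = {"title": "Othello's Black Handkerchief"}

present = sample_document.get("title")               # key exists: returns its value
missing = sample_document.get("subtitle")            # key absent: returns None
fallback = sample_document.get("subtitle", "n/a")    # key absent: returns the fallback

print(present)
print(missing)
print(fallback)
print(list(sample_document.keys()))  # list every key the dictionary actually has
```

Calling `.keys()` on a real document is a quick way to find out which keys are available to pass to `.get`.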


In [9]:
print("'title' returns " + chosenDocument.get('title'))
chosenDocument.get('title')




'title' returns Othello's Black Handkerchief


"Othello's Black Handkerchief"

We can also check whether a particular item is in our dataset, if we know its stable URL, by using the `in` or `not in` operators. Let's check to see if Shakespearean scholar Theodore B. Leinwand's article "Shakespeare and the Middling Sort" is in our dataset. From a JSTOR search, I know that the stable URL for the article is: https://www.jstor.org/stable/2871420 . The cell below assumes that each document's `id` value holds its stable URL in the `http://` form, so we collect every `id` into a list and then test our URL string against it with `in`. If the article is in our dataset, we will receive `True`. If it is not in our dataset, we will receive `False`. 

In [None]:
'http://www.jstor.org/stable/2871420' in [document.get('id') for document in all_documents]
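
The `in` operator compares strings exactly, so the scheme (`http://` versus `https://`) matters. Here is a minimal sketch with invented identifiers; the `id` key, the placeholder URLs, and the `sample_ids` name are assumptions made for illustration.

```python
# Hypothetical documents; the ids below are placeholders, not real stable URLs.
sample_documents = [
    {"id": "http://www.jstor.org/stable/0000001", "title": "Placeholder A"},
    {"id": "http://www.jstor.org/stable/0000002", "title": "Placeholder B"},
]

# Collect the ids into a set so each membership test is fast.
sample_ids = {document.get("id") for document in sample_documents}

found = "http://www.jstor.org/stable/0000001" in sample_ids
wrong_scheme = "https://www.jstor.org/stable/0000001" in sample_ids
absent = "http://www.jstor.org/stable/9999999" not in sample_ids

print(found)         # exact match
print(wrong_scheme)  # same article number, but 'https' does not match 'http'
print(absent)        # 'not in' is True when the value is missing
```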

The document metadata is already available in `all_documents`, which is a list of Python dictionaries containing attributes for each document. We give this list the name `metadata` for the cells that follow. 

In [None]:
metadata = all_documents #`metadata` refers to the same list of dictionaries as `all_documents`

Print the contents of **metadata** for the first document in the dataset. The data is displayed as a dictionary of key/value pairs. 

In [None]:
print(metadata[0])

We can convert `metadata` to a Pandas dataframe to take advantage of its plotting and manipulation functions. This will help us learn more about what's in our metadata. We define this new dataframe as `df`.

In [None]:
df = pd.DataFrame(metadata)

Display the first 5 rows of the dataframe `df` with the `head()` method.

In [None]:
df.head()
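
To see what the conversion does on a small scale, here is a sketch with three made-up records. The `sample_metadata` values are invented; only the column names echo ones used in this notebook. Each dictionary becomes a row, each key becomes a column, and a key missing from one record shows up as `NaN` in that row.

```python
import pandas as pd

# Hypothetical records; the third one has no 'pageCount' on purpose.
sample_metadata = [
    {"id": "doc-1", "title": "Sample A", "publicationYear": 1951, "pageCount": 12},
    {"id": "doc-2", "title": "Sample B", "publicationYear": 1987, "pageCount": 20},
    {"id": "doc-3", "title": "Sample C", "publicationYear": 1987},
]

sample_df = pd.DataFrame(sample_metadata)

print(sample_df.shape)          # (rows, columns)
print(list(sample_df.columns))  # column names come from the dictionary keys
print(sample_df.head(2))        # head() accepts the number of rows to show
```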

We can find the year range in our pandas dataframe by finding the minimum and maximum of the `publicationYear` column.

In [None]:
minYear = df['publicationYear'].min()
maxYear = df['publicationYear'].max()

print(str(minYear) + ' to ' + str(maxYear))

Now let's do some preliminary analysis. Let's say we want to plot the number of documents by decade in the sample set. 

Since `decade` isn't a value in our dataset, we need to add it to the dataframe. We can do this by defining a new dataframe column `decade`. To translate a year (1925) to a decade (1920), we need to subtract the final digit so it becomes a zero. We can find the value for the final digit in any particular case by using modulo (which provides the remainder of a division). Here's an example using the date 1925.

In [None]:
1925 - (1925 % 10)

We can translate this example to the whole dataframe using the following code.

In [None]:
def add_decade(value): 
    yr = int(value) 
    decade = yr - ( yr % 10 )
    return decade

df['decade'] = df['publicationYear'].apply(add_decade)
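
As a quick check of the arithmetic, the sketch below applies the same modulo rule to a few sample years and compares it with floor division, which gives the same result. `to_decade` is a hypothetical stand-alone helper and the sample years are arbitrary; one is a string to show why the `int()` conversion is there.

```python
def to_decade(value):
    """Round a year down to the start of its decade using modulo."""
    year = int(value)          # accepts 1925 as well as "1925"
    return year - (year % 10)  # subtract the final digit

sample_years = [1925, 1930, 1999, "2004"]

decades = [to_decade(year) for year in sample_years]
print(decades)

# Floor division is an equivalent way to drop the final digit.
same_result = all(to_decade(year) == (int(year) // 10) * 10 for year in sample_years)
print(same_result)
```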

To see the new decade column in our data, let's print the first 5 rows of the dataframe again.

In [None]:
df.head()

Now we can use the built-in plotting tools of Pandas to plot the number of documents from each provider by decade.

In [None]:
df.groupby(['decade', 'provider'])['id'].agg('count').unstack()\
    .plot.bar(title='Documents by decade', figsize=(20, 5), fontsize=12, stacked=True); ##There is a weird bug where this cell needs to be run twice.
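
It can help to look at the table that the `groupby(...).agg('count').unstack()` chain builds before it is plotted. Here is a sketch on a tiny invented dataframe (the ids and provider labels are placeholders): `groupby` counts documents per decade and provider, and `unstack` turns the providers into columns, which is what lets the bar chart stack them.

```python
import pandas as pd

# Hypothetical data: four documents across two decades and two providers.
sample_df = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "decade": [1950, 1950, 1980, 1980],
    "provider": ["jstor", "portico", "jstor", "jstor"],
})

# Count ids in each (decade, provider) group, then pivot providers into columns.
counts = sample_df.groupby(["decade", "provider"])["id"].agg("count").unstack()
print(counts)
```

A decade with no documents from a given provider appears as `NaN` in the table, and the stacked bar chart simply draws no segment for it.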

We can do the same for the total number of pages, this time summing `pageCount` instead of counting documents.

In [None]:
df.groupby(['decade', 'provider'])['pageCount'].agg('sum').unstack()\
    .plot.bar(title='Pages by decade', figsize=(20, 5), fontsize=12, stacked=True);