By <a href="https://nkelber.com">Nathan Kelber</a> and Ted Lawless <br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.
____

## Explore the metadata for a JSTOR and/or Portico dataset

**Difficulty:** Beginner

**Programming Knowledge Required:** 
This notebook can be run on a JSTOR/Portico non-consumptive JSON Lines (.jsonl) dataset with little to no knowledge of Python. To have a full understanding of the code used in this notebook, we recommend learning:
* [Python Basics](https://automatetheboringstuff.com/2e/chapter1/)
* [Flow Control](https://automatetheboringstuff.com/2e/chapter2/)
* [Functions](https://automatetheboringstuff.com/2e/chapter3/)
* [Lists](https://automatetheboringstuff.com/2e/chapter4/)
* [Reading and Writing Files](https://automatetheboringstuff.com/2e/chapter9/)

A familiarity with Pandas is helpful but not required.

**Completion time:** 20 minutes

**Data Format:** JSTOR/Portico non-consumptive JSON Lines (.jsonl)

**Libraries Used:**
* `json` to convert our dataset from JSON Lines format to Python
* [Pandas](./key-terms.ipynb#pandas) to help visualize the metadata

**Description of methods in this notebook:**
This notebook shows how to explore the metadata of your JSTOR and/or Portico dataset using Python. The following processes are described:

* Importing your dataset
* Discovering the size and contents of your dataset
* Turning your dataset into a pandas dataframe
* Visualizing the contents of your dataset as a graph with pandas
____

In [1]:
import pandas as pd #imports pandas and allows us to call it with the phrase pd

The `as pd` just lets us use the shorthand `pd` when we want to call pandas instead of writing out the entire word `pandas`. 

In [2]:
fileName = 'shakespeareQuarterly.jsonl' #Replace with your filename and be sure your file is in your datasets folder

import json #import the json module
all_documents = [] #create an empty new list variable named `all_documents`
with open('./datasets/' + fileName) as dataset_file: #temporarily open the file named in `fileName` from the datasets/ folder
    for line in dataset_file: #for each line in the dataset file
        # Read each line into a Python dictionary.
        document = json.loads(line) #create a variable document that contains the line using json.loads to convert the json key/value pairs to a python dictionary
        all_documents.append(document) #append a new list item to `all_documents` containing the dictionary we created 

Before we can begin working with our dataset, we need to convert the JSON Lines file into Python objects. (JSON stands for JavaScript Object Notation; it is a plain-text data format, not JavaScript code.) Remember that each line of our JSON Lines file represents a single text, whether that is a journal article, book, or something else. We can create a Python list where every item in the list represents a single text. 

![Structure of the corpus, a list of dictionaries](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CorpusView.png)

Each item of our list can then contain a Python dictionary of key/value pairs that allows us to get a value if we supply a key. Our whole corpus is stored in a list variable called `all_documents`.
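
To make the list-of-dictionaries structure concrete, here is a minimal sketch that parses two made-up JSON Lines records held in memory. The titles and word counts are invented for illustration and are not taken from any JSTOR dataset; `io.StringIO` simply stands in for an open file.

```python
import io
import json

# Two hypothetical records standing in for the lines of a .jsonl file.
sample_file = io.StringIO(
    '{"title": "First Article", "wordCount": 1200}\n'
    '{"title": "Second Article", "wordCount": 3400}\n'
)

sample_documents = []  # will hold one dictionary per line
for line in sample_file:
    line = line.strip()
    if not line:  # skip blank lines, which json.loads cannot parse
        continue
    sample_documents.append(json.loads(line))

print(len(sample_documents))         # number of records read
print(sample_documents[0]["title"])  # value stored under the key 'title'
```

Skipping blank lines is a defensive choice in this sketch; the loading cell above assumes every line of the file holds a complete JSON record.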

In [3]:
len(all_documents)

6687

We can determine how many texts are in our dataset by using the `len()` function to get the size of `all_documents`. 

In [4]:
chosenDocument = all_documents[0] #define a new dictionary variable `chosenDocument` that is equal to the first item in our `all_documents` list
chosenDocument.get('title') #get the corresponding value for the key 'title'

"Othello's Black Handkerchief"

We can choose a single document and get metadata for that item. Let's try the first document. (In computer code, 0 is the first item, 1 is the second item, 2 is the third item, etc.) In this case, we use the `.get` method to retrieve the title for the first item in our list, `all_documents[0]`. 
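
One reason to prefer `.get` over square brackets is how each behaves when a key is missing. The sketch below uses a made-up dictionary (not a real JSTOR record) to show the difference.

```python
# A hypothetical document with only two keys.
sample_document = {"title": "A Hypothetical Article", "creators": ["A. Author"]}

print(sample_document.get("title"))  # the key exists, so we get its value

# .get returns None for a missing key instead of raising an error
missing = sample_document.get("publisher")
print(missing)

# A second argument supplies a fallback value for missing keys
fallback = sample_document.get("publisher", "unknown")
print(fallback)

# Square brackets raise a KeyError when the key is missing
try:
    sample_document["publisher"]
except KeyError as error:
    print("KeyError for missing key:", error)
```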


We can also use the `.get` method to discover many kinds of bibliographic metadata. Here are the most significant bibliographic metadata items found with a JSTOR item:
* `title` returns the title
* `creators` returns the authors in a Python list
* `isPartOf` returns the journal title
* `datePublished` returns the publication date
* `publisher` returns the publisher
* `id` returns the stable URL for a JSTOR item
* `identifier` returns a Python list of dictionaries containing the ISSN #, OCLC #, and DOI #. 
* `volumeNumber` returns the journal volume number
* `pageCount` returns the number of pages in the print article
* `pagination` returns the page number range of the print article
* `pageStart` returns the first print page
* `pageEnd` returns the last print page
* `wordCount` returns the number of words in the article
* `docType` returns the type of document, usually `article` for journal article
* `url` returns the stable url for the document
* `provider` returns the source of the data, for JSTOR articles usually `jstor`

Let's try all these on our `chosenDocument`. 

In [19]:
print("Title: " + chosenDocument.get('title'))
print(chosenDocument.get('creators'))
print(chosenDocument.get('isPartOf'))
print(chosenDocument.get('datePublished'))
print(chosenDocument.get('publisher'))
print(chosenDocument.get('id'))
print(chosenDocument.get('identifier'))
print(chosenDocument.get('volumeNumber'))
print(chosenDocument.get('pageCount'))
print(chosenDocument.get('pagination'))
print(chosenDocument.get('pageStart'))
print(chosenDocument.get('pageEnd'))
print(chosenDocument.get('wordCount'))
print(chosenDocument.get('docType'))
print(chosenDocument.get('url'))
print(chosenDocument.get('provider'))

Title: Othello's Black Handkerchief
['Ian Smith']
Shakespeare Quarterly
2013-04-01T00:00:00Z
Folger Shakespeare Library
http://www.jstor.org/stable/24778431
[{'name': 'issn', 'value': '00373222'}, {'name': 'oclc', 'value': '39852252'}, {'name': 'local_doi', 'value': '10.2307/24778431'}]
1
25
pp. 1-25
1
25
11500
article
http://www.jstor.org/stable/24778431
jstor
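
The `identifier` value printed above is a list of small dictionaries, each with a `name` and a `value`. If we want a single identifier such as the ISSN, one possible approach is to turn that list into a lookup dictionary. This is only a sketch; the variable names `identifiers` and `identifier_lookup` are our own, and the values are copied from the output above.

```python
# The identifier list as printed for our chosen document.
identifiers = [
    {"name": "issn", "value": "00373222"},
    {"name": "oclc", "value": "39852252"},
    {"name": "local_doi", "value": "10.2307/24778431"},
]

# Build a dictionary that maps each identifier name to its value.
identifier_lookup = {entry["name"]: entry["value"] for entry in identifiers}

print(identifier_lookup.get("issn"))       # look up the ISSN
print(identifier_lookup.get("local_doi"))  # look up the DOI
```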



We can see every Python dictionary key in the metadata by using the `.keys` method. We can use this in conjunction with the `print()` function, but the `list()` function will make it a little neater to read.

In [21]:
#print(chosenDocument.keys())# Uncomment the # in front of print to run this line of code
list(chosenDocument.keys()) # Create a list of every Python dictionary key within `chosenDocument`

['identifier',
 'volumeNumber',
 'pageCount',
 'pagination',
 'sourceCategory',
 'wordCount',
 'docType',
 'creators',
 'language',
 'tdmCategory',
 'title',
 'isPartOf',
 'unigramCount',
 'url',
 'datePublished',
 'provider',
 'pageStart',
 'pageEnd',
 'publicationYear',
 'publisher',
 'id',
 'outputFormat']

Of course, we could also list all the Python dictionary values, but the output will be quite long since it includes the word counts for every word that is in the article. (In fact, it includes the count for every unique *string* in the article. We'll address the distinction in the word frequencies notebooks.)

In [None]:
#list(chosenDocument.values()) # Uncomment the # in front of list to run this line of code

In addition to bibliographic metadata, we can take a look at the word counts by using `unigramCount`. We'll do this in an upcoming notebook looking at word frequencies. For now, let's return to our larger corpus `all_documents`.

What if we wanted to check if a particular item was in the corpus?

If we search out any journal article on jstor.org, the article description page will feature a stable url.

![A JSTOR description page](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/jstorDescription.png)

Since we know the stable URL is stored under both the `id` and `url` keys, we can check our whole corpus for a particular article if we know the stable URL. The image above shows a stable URL of: https://www.jstor.org/stable/2871420

**Notice that the stable URL above uses a secure address starting with "https://". Our dictionary values, however, use standard "http://" without the "s". You'll need to remove the "s" to run this test.**

We can accomplish this task with the `in` or `not in` operators. Let's check to see if Shakespearean scholar Theodore B. Leinwand's article "Shakespeare and the Middling Sort" is in our dataset using the stable URL. We can put this stable URL in a string between single quotes and test whether it matches the `url` value of any document in `all_documents`. If the article is in our dataset, we will receive `True`. If it is not in our dataset, we will receive `False`. 
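
Rather than editing the address by hand, we could convert an "https://" stable URL to the "http://" form before testing it. The sketch below assumes a helper function of our own, `to_dataset_url`, and a two-item sample list that we made up for illustration; it does not tell us whether the article is in the real dataset.

```python
def to_dataset_url(stable_url):
    """Rewrite an https:// stable URL to the http:// form used in the dataset."""
    if stable_url.startswith("https://"):
        return "http://" + stable_url[len("https://"):]
    return stable_url

# A made-up sample standing in for `all_documents`.
sample_documents = [
    {"url": "http://www.jstor.org/stable/24778431"},
    {"url": "http://www.jstor.org/stable/2871420"},
]

# Collect every URL in a set so that each membership test is fast.
known_urls = {document.get("url") for document in sample_documents}

is_present = to_dataset_url("https://www.jstor.org/stable/2871420") in known_urls
print(is_present)
```

Building the set once is a design choice that pays off when we want to check many URLs against a large corpus.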






In [15]:
stable_url = 'http://www.jstor.org/stable/2871420' #the stable URL to look for, using "http://" to match the dataset
stable_url in [document.get('url') for document in all_documents] #True if any document has this URL, otherwise False

The document metadata is a list of Python dictionaries containing attributes for each document. Since `all_documents` already has this form, we create a new list variable `metadata` that refers to it. 

In [None]:
metadata = all_documents #reuse the list of dictionaries we loaded from the .jsonl file

Print the contents of **metadata** for the first document in the dataset. The data is displayed as a dictionary of key/value pairs. 

In [None]:
print(metadata[0])

We can convert `metadata` to a Pandas dataframe to take advantage of its plotting and manipulation functions. This will help us learn more about what's in our metadata. We define this new dataframe as `df`.

In [None]:
df = pd.DataFrame(metadata)
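
To see what this conversion does on a small scale, here is a sketch with three made-up records. Each dictionary becomes one row and each key becomes one column. The titles, years, and providers below are invented for illustration.

```python
import pandas as pd

# Three hypothetical records with the same keys.
sample_metadata = [
    {"title": "Article A", "publicationYear": 1925, "provider": "jstor"},
    {"title": "Article B", "publicationYear": 1987, "provider": "jstor"},
    {"title": "Article C", "publicationYear": 2013, "provider": "portico"},
]

sample_df = pd.DataFrame(sample_metadata)  # one row per dictionary

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # column names come from the dictionary keys
```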

Print the first 5 rows of the dataframe `df` with the `head()` method.

In [None]:
df.head()

We can find the year range in our pandas dataframe by finding the minimum and maximum of `publicationYear`.

In [None]:
minYear = df['publicationYear'].min()
maxYear = df['publicationYear'].max()

print(str(minYear) + ' to ' + str(maxYear))

Now let's do some preliminary analysis. Let's say we want to plot the number of documents by decade in the sample set. 

Since `decade` isn't a value in our dataset, we need to add it to the dataframe. We can do this by defining a new dataframe column `decade`. To translate a year (1925) to a decade (1920), we need to subtract the final digit so it becomes a zero. We can find the value for the final digit in any particular case by using modulo (which provides the remainder of a division). Here's an example using the date 1925.

In [None]:
1925 - (1925 % 10)

We can translate this example to the whole dataframe using the following code.

In [None]:
def add_decade(value):
    yr = int(value) #convert the year to an integer
    decade = yr - ( yr % 10 ) #subtract the final digit so the year ends in zero
    return decade

df['decade'] = df['publicationYear'].apply(add_decade)
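
Before trusting the new column, it can help to check the decade arithmetic on a few years by hand. The sketch below uses a stand-alone copy of the same logic under the hypothetical name `to_decade`, and compares it with floor division, which is another common way to write the calculation.

```python
def to_decade(year):
    """Drop the final digit of a year so that it ends in zero."""
    return year - (year % 10)

# Try a few years, including one that is already the start of a decade.
for year in (1925, 1930, 1999, 2013):
    print(year, "->", to_decade(year))

# Floor division by 10, then multiplying by 10, gives the same answer.
same_result = all(to_decade(y) == (y // 10) * 10 for y in range(1600, 2100))
print(same_result)
```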

To see the new decade column in our data, let's print the first 5 rows of the dataframe again.

In [None]:
df.head()

Now we can use the built-in plotting tools of Pandas to plot the number of documents from each provider by decade.

In [None]:
df.groupby(['decade', 'provider'])['id'].agg('count').unstack()\
    .plot.bar(title='Documents by decade', figsize=(20, 5), fontsize=12, stacked=True); ##There is a weird bug where this cell needs to be run twice.

And do the same for the total number of pages.

In [None]:
df.groupby(['decade', 'provider'])['pageCount'].agg('sum').unstack()\
    .plot.bar(title='Pages by decade', figsize=(20, 5), fontsize=12, stacked=True);
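
Both plots above rely on the same chain of `groupby`, an aggregation, and `unstack`. To see the table this chain builds before anything is drawn, here is a sketch on a four-row dataframe that we made up for illustration. It stops before the plotting step.

```python
import pandas as pd

# A hypothetical dataframe with the three columns the chain needs.
sample_df = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "decade": [1920, 1920, 1930, 1930],
    "provider": ["jstor", "jstor", "jstor", "portico"],
})

# Count documents for each (decade, provider) pair, then move
# the providers from the row index into columns.
counts = sample_df.groupby(["decade", "provider"])["id"].agg("count").unstack()

print(counts)
```

Each row of `counts` is a decade and each column is a provider. A decade with no documents from a given provider shows up as `NaN` in the table.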