## Explore metadata for a curated dataset

This notebook shows how to explore the metadata of your JSTOR and/or Portico dataset using Python. The following processes are described:

* Importing your dataset
* Discovering the size and contents of your dataset
* Turning your dataset into a pandas dataframe
* Visualizing the contents of your dataset as a graph with pandas

A familiarity with pandas is helpful but not required.
____
We import the [pandas](./key-terms.ipynb#pandas) module to help visualize and manipulate our data. Importing `as pd` allows us to call pandas' functions using the short phrase `pd` instead of typing out `pandas` each time. 

In [None]:
import pandas as pd

We import the `Dataset` class from the `tdm_client` library. The tdm_client library contains functions for connecting to the JSTOR server containing our [corpus](./key-terms.ipynb#corpus) [dataset](./key-terms.ipynb#dataset). To analyze your dataset, use the [dataset ID](./key-terms.ipynb/#dataset-ID) provided when you created your [dataset](./key-terms.ipynb/#dataset). A copy of your [dataset ID](./key-terms.ipynb/#dataset-ID) was sent to your email when you created your [corpus](./key-terms.ipynb#corpus). It should look like a long series of characters separated by dashes.

In [None]:
from tdm_client import Dataset

We create a new variable **dset** and initialize its value by calling **Dataset** with a dataset ID. A sample **dataset ID** featuring Shakespeare Quarterly (1950-2014) is provided here ('59c090b6-3851-3c65-e016-9181833b4a2c'). Pasting your unique **dataset ID** in its place will import your dataset from the JSTOR server.

**Note**: If you are curious what is in your dataset, there is a download link in the email you received. The format and content of the files are described in the notebook [Building a Dataset](./1-building-a-dataset.ipynb).

In [None]:
dset = Dataset('59c090b6-3851-3c65-e016-9181833b4a2c')
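Before pasting your own ID, it can help to check that it was copied in full. The sketch below is only an illustration: the pattern (five dash-separated groups of hexadecimal characters, 8-4-4-4-12) is inferred from the sample ID above and is an assumption, not a documented rule, and `looks_like_dataset_id` is a hypothetical helper that is not part of `tdm_client`.

```python
import re

# Assumed shape, inferred only from the sample ID in this notebook:
# five groups of hexadecimal characters (8-4-4-4-12) joined by dashes.
ID_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)

def looks_like_dataset_id(text):
    """Return True if text matches the assumed dataset ID shape."""
    return bool(ID_PATTERN.match(text.strip().lower()))

# The sample ID passes; a truncated copy does not.
sample_ok = looks_like_dataset_id('59c090b6-3851-3c65-e016-9181833b4a2c')
sample_bad = looks_like_dataset_id('59c090b6-3851')
print(sample_ok, sample_bad)
```

A failed check most likely means part of the ID was lost when copying it from the email.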

Find the total number of documents in the dataset using the `len()` function.

In [None]:
len(dset)

The dset variable now contains all the documents in our corpus. We can take a peek at the documents in our dataset by taking a slice of the first five items.

In [None]:
dset.items[0:5]

If we know an item's stable URL, we can also check whether it is in our list using the `in` or `not in` operators. Let's check to see if volume 5.1 of the journal *Mosaic* is `in` the dataset by using its stable URL (https://www.jstor.org/stable/i24775424).

In [None]:
'http://www.jstor.org/stable/44990760' in dset.items
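Membership tests on a list of strings compare the whole string exactly, so a URL that differs only in its scheme (`http` versus `https`) will not match. The sketch below uses a hypothetical list of placeholder URLs as a stand-in for `dset.items`, assuming it behaves like a plain Python list of strings.

```python
# Hypothetical stand-in for dset.items: a plain list of placeholder URLs.
sample_items = [
    'http://example.org/stable/1',
    'http://example.org/stable/2',
    'http://example.org/stable/3',
]

# An exact match is found.
found_exact = 'http://example.org/stable/2' in sample_items

# The same address with a different scheme is a different string.
found_https = 'https://example.org/stable/2' in sample_items

# `not in` is True when the string is absent from the list.
missing = 'http://example.org/stable/9' not in sample_items

print(found_exact, found_https, missing)
```

If a check unexpectedly returns `False`, comparing the URL character by character against an entry from the list is a quick way to spot the difference.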

The document metadata can be retrieved by calling the `get_metadata` method. The metadata is a list of Python dictionaries containing attributes for each document. We create a new list variable `metadata` by using the `get_metadata` method on dset. 

In [None]:
metadata = dset.get_metadata()

Print the contents of **metadata** for the first document in the dataset. The data is displayed as a dictionary of key/value pairs. 

In [None]:
print(metadata[0])
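Because the metadata is a list of dictionaries, ordinary list and dictionary operations apply to it. The sketch below uses two made-up records as a stand-in for `metadata`; the field names are the ones used elsewhere in this notebook, while the values are hypothetical.

```python
# Hypothetical records standing in for the real metadata list.
sample_metadata = [
    {'id': 'doc-1', 'datePublished': '1950-01-01',
     'provider': 'jstor', 'pageCount': 12},
    {'id': 'doc-2', 'datePublished': '1987-04-01',
     'provider': 'portico', 'pageCount': 30},
]

first = sample_metadata[0]

# List the field names available for a document.
field_names = sorted(first.keys())
print(field_names)

# Print one key/value pair per line, which is easier to read than a raw dict.
for key, value in first.items():
    print(f'{key}: {value}')

# Collect a single field across all documents; .get() returns None
# instead of raising an error if a document lacks that field.
page_counts = [doc.get('pageCount') for doc in sample_metadata]
print(page_counts)
```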

We can convert `metadata` to a pandas dataframe to take advantage of its plotting and manipulation functions. This will help us learn more about what's in our metadata. We define this new dataframe as `df`.

In [None]:
df = pd.DataFrame(metadata)

Print the first 5 rows of the dataframe `df` with the `head()` method.

In [None]:
df.head()
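Beyond `head()`, a few other dataframe attributes give a quick overview. The sketch below builds a small dataframe from hypothetical records (the values are made up) to show how each dictionary becomes a row and each key becomes a column.

```python
import pandas as pd

# Hypothetical records standing in for the real metadata list.
sample_metadata = [
    {'id': 'doc-1', 'datePublished': '1950-01-01',
     'provider': 'jstor', 'pageCount': 12},
    {'id': 'doc-2', 'datePublished': '1987-04-01',
     'provider': 'portico', 'pageCount': 30},
    {'id': 'doc-3', 'datePublished': '2014-12-01',
     'provider': 'jstor', 'pageCount': 8},
]

sample_df = pd.DataFrame(sample_metadata)

# (number of rows, number of columns)
print(sample_df.shape)

# Column names, taken from the dictionary keys.
print(list(sample_df.columns))

# Data type pandas inferred for each column.
print(sample_df.dtypes)
```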

We can find the year range in our pandas dataframe by finding the minimum and maximum of `datePublished`.

In [None]:
minYear = df['datePublished'].min()
maxYear = df['datePublished'].max()

print(str(minYear) + ' to ' + str(maxYear))
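If `datePublished` holds date strings that begin with a four-digit year (an assumption about the data; check your own output above), `min()` and `max()` compare them as text. For year-first strings, text order matches chronological order, and slicing the first four characters gives just the year. A minimal sketch with made-up dates:

```python
import pandas as pd

# Hypothetical date strings in year-month-day order.
sample_dates = pd.Series(['1987-04-01', '1950-01-01', '2014-12-01'])

earliest = sample_dates.min()
latest = sample_dates.max()

# Keep only the four-digit year from each date string.
year_range = f'{earliest[:4]} to {latest[:4]}'
print(year_range)
```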

Now let's do some preliminary analysis. Let's say we want to plot the number of documents by decade in the sample set. 

Since `decade` isn't a value in our dataset, we need to add it to the dataframe. We can do this by defining a new dataframe column `decade`. To translate a year (1925) to a decade (1920), we need to subtract the final digit so it becomes a zero. We can find the value for the final digit in any particular case by using modulo (which provides the remainder of a division). Here's an example using the date 1925.

In [None]:
1925 - (1925 % 10)
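To see that the same arithmetic works for other years, including years that already end in zero, here is a short sketch over a few sample values. It also shows an equivalent form using floor division.

```python
sample_years = [1925, 1930, 1999, 2014]

# Subtract the remainder after dividing by 10 (the final digit).
decades = [year - (year % 10) for year in sample_years]

# Equivalent: floor-divide by 10, then multiply back by 10.
decades_floor = [(year // 10) * 10 for year in sample_years]

print(decades)
print(decades == decades_floor)
```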

We can translate this example to the whole dataframe using the following code.

In [None]:
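def add_decade(value):
    """Convert a date string that begins with a four-digit year to its decade."""
    yr = int(value[:4])  # the first four characters are the year
    decade = yr - (yr % 10)  # subtract the final digit of the year
    return decade

# Apply the function to every row to build the new `decade` column.
df['decade'] = df['datePublished'].apply(add_decade)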
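As an alternative sketch, the same column can be computed without a helper function by using pandas string and arithmetic operations on the whole column at once. This assumes every `datePublished` value is a string starting with a four-digit year; the sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with year-first date strings.
sample_df = pd.DataFrame(
    {'datePublished': ['1950-01-01', '1987-04-01', '2014-12-01']}
)

# Slice the first four characters of every value and convert to integers.
years = sample_df['datePublished'].str[:4].astype(int)

# Subtract the final digit of each year, for the whole column at once.
sample_df['decade'] = years - (years % 10)

print(sample_df)
```

Both approaches give the same result; the `apply` version is easier to extend if the conversion needs extra logic, such as handling missing dates.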

To see the new decade column in our data, let's print the first 5 rows of the dataframe again.

In [None]:
df.head()

Now we can use the built-in plotting tools of pandas to plot the number of documents from each provider by decade.

In [None]:
df.groupby(['decade', 'provider'])['id'].agg('count').unstack()\
    .plot.bar(title='Documents by decade', figsize=(20, 5), fontsize=12, stacked=True);
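The chained call above groups, counts, and reshapes the data before plotting. To make the intermediate table visible, the sketch below runs the same steps on a small hypothetical dataframe and prints the result instead of plotting it. Passing `fill_value=0` to `unstack` is an optional addition here so that decade and provider combinations with no documents show 0 rather than a missing value.

```python
import pandas as pd

# Hypothetical documents; the values are made up for illustration.
sample_df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'decade': [1950, 1950, 1960, 1960],
    'provider': ['jstor', 'jstor', 'jstor', 'portico'],
})

# Count documents for each (decade, provider) pair, then move the
# provider level into columns: one row per decade, one column per provider.
counts = (
    sample_df.groupby(['decade', 'provider'])['id']
    .agg('count')
    .unstack(fill_value=0)
)

print(counts)
```

Each row of this table becomes one bar in the chart, and each column becomes one stacked segment of that bar.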

We can do the same for the total number of pages by summing `pageCount` instead of counting documents.

In [None]:
df.groupby(['decade', 'provider'])['pageCount'].agg('sum').unstack()\
    .plot.bar(title='Pages by decade', figsize=(20, 5), fontsize=12, stacked=True);