## Explore metadata for a curated dataset

This notebook shows how to explore the metadata of your JSTOR and/or Portico dataset using Python. The following processes are described:

* Importing your dataset
* Discovering the size and contents of your dataset
* Turning your dataset into a pandas dataframe
* Visualizing the contents of your dataset as a graph with pandas

A familiarity with pandas is helpful but not required.
____
We import the [pandas](./key-terms.ipynb#pandas) module to help visualize and manipulate our data. Importing `as pd` allows us to call pandas' functions using the short phrase `pd` instead of typing out `pandas` each time. 

In [3]:
import pandas as pd

We import the `Dataset` class from the `tdm_client` library. The `tdm_client` library contains functions for connecting to the JSTOR server that hosts our [corpus](./key-terms.ipynb#corpus) [dataset](./key-terms.ipynb#dataset). To analyze your dataset, use the [dataset ID](./key-terms.ipynb#dataset-ID) provided when you created your [dataset](./key-terms.ipynb#dataset). A copy of your [dataset ID](./key-terms.ipynb#dataset-ID) was sent to your email when you created your [corpus](./key-terms.ipynb#corpus). It should look like a long series of characters separated by dashes.

In [None]:
from tdm_client import Dataset

We create a new variable **dset** and initialize it by calling **Dataset** with a dataset ID. A sample **dataset ID** for a dataset of journals focused on Shakespeare is provided here ('a517ef1f-0794-48e4-bea1-ac4fb8b312b4'). Pasting your unique **dataset ID** here will import your dataset from the JSTOR server.

**Note**: If you are curious what is in your dataset, there is a download link in the email you received. The format and content of the files are described in the notebook [Building a Dataset](./1-building-a-dataset.ipynb).

In [None]:
dset = Dataset('a517ef1f-0794-48e4-bea1-ac4fb8b312b4')
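The sample ID above has the shape of a UUID: groups of hexadecimal characters separated by dashes. As a minimal sketch, assuming your own ID follows the same format, the standard-library `uuid` module can catch a copy-and-paste error before the ID is passed to `Dataset`. The helper name `looks_like_dataset_id` is hypothetical and is not part of `tdm_client`.

```python
import uuid


def looks_like_dataset_id(candidate: str) -> bool:
    """Return True if candidate is a dashed UUID string (assumed ID format)."""
    text = candidate.strip()
    try:
        parsed = uuid.UUID(text)
    except ValueError:
        # Wrong length or non-hexadecimal characters
        return False
    # uuid.UUID also accepts undashed input, so compare with the dashed form
    return str(parsed) == text.lower()


print(looks_like_dataset_id('a517ef1f-0794-48e4-bea1-ac4fb8b312b4'))
print(looks_like_dataset_id('a517ef1f-0794'))
```

A check like this only confirms the shape of the ID. It cannot tell whether a dataset with that ID exists on the server.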

Find the total number of documents in the dataset.

In [None]:
len(dset)

1000

Once the `Dataset` is initialized, its documents are downloaded in the background.

A list of all the items in the dataset is available through the `items` attribute. The first five are shown here.

In [None]:
dset.items[:5]

['http://www.jstor.org/stable/i40075057',
 'http://www.jstor.org/stable/i40103856',
 'http://www.jstor.org/stable/i40075051',
 'http://www.jstor.org/stable/i40075048',
 'http://www.jstor.org/stable/i40075029']

The document metadata can be retrieved by calling the `get_metadata` method. The metadata is a list of Python dictionaries containing attributes for each document.

In [None]:
metadata = dset.get_metadata()

Print the metadata for the first document in the dataset.

In [None]:
print(metadata[0])
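Because the metadata is a list of dictionaries, ordinary Python is enough to see which attributes are present. The sketch below uses two made-up records in place of real `get_metadata()` output. The field names are the ones used later in this notebook; the values, including the provider names, are invented for illustration.

```python
# Hypothetical stand-in for the list returned by dset.get_metadata()
sample_metadata = [
    {"id": "doc-1", "yearPublished": 1951, "provider": "jstor", "pageCount": 120},
    {"id": "doc-2", "yearPublished": 1987, "provider": "portico", "pageCount": 96},
]

# Collect every attribute name that appears in at least one record
field_names = sorted({key for record in sample_metadata for key in record})
print(field_names)

# Look up a single attribute of the first record
first_year = sample_metadata[0]["yearPublished"]
print(first_year)
```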

Convert the metadata to a pandas DataFrame to take advantage of pandas' plotting and manipulation functions.

In [None]:
df = pd.DataFrame(metadata)

Display the first five rows of the dataframe (the default for `head()`).

In [None]:
df.head()
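When pandas builds a DataFrame from a list of dictionaries, each dictionary becomes a row and each key becomes a column. A key missing from a record shows up as `NaN`. The sketch below illustrates this with made-up records, and also shows that `head()` returns five rows unless a different number is passed. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical records; the second one lacks a pageCount on purpose
records = [
    {"id": "doc-1", "yearPublished": 1951, "pageCount": 120},
    {"id": "doc-2", "yearPublished": 1987},
]
sample_df = pd.DataFrame(records)
print(sample_df)

# Count the rows where pageCount was missing and became NaN
missing_page_counts = int(sample_df["pageCount"].isna().sum())

# head() returns five rows by default; head(10) returns ten
many_rows = pd.DataFrame({"n": range(12)})
default_rows = len(many_rows.head())
ten_rows = len(many_rows.head(10))
print(default_rows, ten_rows)
```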

Find the range of publication years covered by the documents in this dataset.

In [None]:
print('{} to {}'.format(df['yearPublished'].min(), df['yearPublished'].max()))

Now do some preliminary analysis. Let's say you want to plot the number of issues by decade in the sample set.

Since `decade` isn't a value in our dataset, we need to add it to the dataframe. We compute it by subtracting from each year its remainder after division by 10, which rounds the year down to the start of its decade.

In [None]:
df['decade'] = df['yearPublished'] - ( df['yearPublished'] % 10)
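The arithmetic behind the `decade` column can be checked on single years. In this sketch the function name `decade_of` is hypothetical; pandas applies the same expression to every value in the column at once.

```python
def decade_of(year: int) -> int:
    """Round a year down to the first year of its decade."""
    return year - (year % 10)


for year in (1951, 1987, 2000):
    print(year, decade_of(year))
```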

Display the first five rows of the dataframe again to see the new `decade` column.

In [None]:
df.head()

Now use the built-in plotting tools of pandas to plot the number of issues from each provider by decade.

In [None]:
df.groupby(['decade', 'provider'])['id'].agg('count').unstack()\
    .plot.bar(title='Issues by decade', figsize=(20, 5), fontsize=12, stacked=True);
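It can help to look at the table that the plot is drawn from. The sketch below runs the same `groupby`, `count`, and `unstack` steps on a small made-up dataframe, without plotting. The provider names and IDs are invented for illustration.

```python
import pandas as pd

# Hypothetical data with the three columns the plot relies on
sample_df = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "decade": [1950, 1950, 1980, 1980],
    "provider": ["jstor", "portico", "jstor", "jstor"],
})

# One row per decade, one column per provider, values are document counts
counts = sample_df.groupby(["decade", "provider"])["id"].agg("count").unstack()
print(counts)

# Combinations with no documents are NaN; fill_value=0 shows them as zero
filled = sample_df.groupby(["decade", "provider"])["id"].agg("count").unstack(fill_value=0)
print(filled)
```

Each row of this table becomes one bar in the chart, and each column becomes one colored segment of the stacked bar.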

And do the same for the total number of pages.

In [None]:
df.groupby(['decade', 'provider'])['pageCount'].agg('sum').unstack()\
    .plot.bar(title='Pages by decade', figsize=(20, 5), fontsize=12, stacked=True);
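To see what the grouped sum computes, the same totals can be tallied with plain Python. This is a sketch on made-up records, not a replacement for the pandas version. It adds up `pageCount` for every (decade, provider) pair.

```python
from collections import defaultdict

# Hypothetical records with invented values
records = [
    {"decade": 1950, "provider": "jstor", "pageCount": 120},
    {"decade": 1950, "provider": "jstor", "pageCount": 80},
    {"decade": 1980, "provider": "portico", "pageCount": 96},
]

# Accumulate page counts under a (decade, provider) key
page_totals = defaultdict(int)
for record in records:
    page_totals[(record["decade"], record["provider"])] += record["pageCount"]

for (decade, provider), pages in sorted(page_totals.items()):
    print(decade, provider, pages)
```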