<a href="https://colab.research.google.com/github/ithaka/tdm-notebooks/blob/master/2-metadata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Explore metadata for a curated dataset

This notebook shows how to connect to a curated TDM dataset and explore the metadata.

In [0]:
import pandas as pd

from tdm_client import Dataset

Initialize a TDM dataset object with the dataset ID provided by the Digital Scholars Workbench or in the email you received after curating your dataset.

In [0]:
dset = Dataset('a517ef1f-0794-48e4-bea1-ac4fb8b312b4')

Find the total number of documents in the dataset.

In [5]:
len(dset)

1000

After the `Dataset` is initialized, its documents are downloaded in the background.

The `items` attribute lists every item in the dataset. The cell below shows the first five.

In [6]:
dset.items[:5]

['http://www.jstor.org/stable/i40075057',
 'http://www.jstor.org/stable/i40103856',
 'http://www.jstor.org/stable/i40075051',
 'http://www.jstor.org/stable/i40075048',
 'http://www.jstor.org/stable/i40075029']

The document metadata can be retrieved by calling the `get_metadata` method. The metadata is a list of Python dictionaries containing attributes for each document.

In [0]:
metadata = dset.get_metadata()

Print the metadata for the first document in the dataset.

In [0]:
print(metadata[0])
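
Because each metadata entry is a plain Python dictionary, you can check which fields are available before building anything on top of them. The sketch below uses made-up records in place of the real `metadata` list: the IDs, providers, and values are hypothetical, and only the field names `id`, `yearPublished`, `provider`, and `pageCount` are taken from this notebook. It also assumes, purely for illustration, that a record may lack a field.

```python
# Hypothetical records standing in for the list returned by dset.get_metadata().
sample_metadata = [
    {'id': 'example-1', 'yearPublished': 1923, 'provider': 'provider-a', 'pageCount': 96},
    {'id': 'example-2', 'yearPublished': 1987, 'provider': 'provider-b', 'pageCount': 140},
    {'id': 'example-3', 'yearPublished': 1989, 'provider': 'provider-a'},  # no pageCount
]

# Fields that appear in at least one record.
all_keys = sorted(set().union(*(record.keys() for record in sample_metadata)))

# Fields that appear in every record.
common_keys = sorted(set.intersection(*(set(record) for record in sample_metadata)))

print('All fields:   ', all_keys)
print('Common fields:', common_keys)
```

Running the same two expressions on the real `metadata` list would show which attributes are safe to rely on for every document.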

Convert the metadata to a Pandas dataframe to take advantage of its plotting and manipulation functions.

In [0]:
df = pd.DataFrame(metadata)
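
To make the conversion concrete, here is a minimal sketch on made-up records. The IDs, providers, and values are hypothetical, and only the field names come from this notebook. Each dictionary becomes one row, each key becomes a column, and a key missing from a record shows up as `NaN`.

```python
import pandas as pd

# Hypothetical records standing in for the real metadata list.
sample_metadata = [
    {'id': 'example-1', 'yearPublished': 1923, 'provider': 'provider-a', 'pageCount': 96},
    {'id': 'example-2', 'yearPublished': 1987, 'provider': 'provider-b', 'pageCount': 140},
    {'id': 'example-3', 'yearPublished': 1989, 'provider': 'provider-a'},  # no pageCount
]

# One row per dictionary, one column per key.
sample_df = pd.DataFrame(sample_metadata)

print(sample_df)
print(sample_df.shape)  # (rows, columns)
```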

Display the first five rows of the dataframe (`head()` returns five rows by default).

In [0]:
df.head()

Find the year range of documents in this dataset.

In [0]:
print('{} to {}'.format(df['yearPublished'].min(), df['yearPublished'].max()))

Now do some preliminary analysis. Let's say you want to plot the number of issues by decade in the dataset.

Since `decade` isn't a column in the metadata, we need to add it to the dataframe. Subtracting the remainder of `yearPublished` divided by 10 rounds each year down to the first year of its decade.

In [0]:
df['decade'] = df['yearPublished'] - (df['yearPublished'] % 10)
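
The arithmetic behind the `decade` column can be checked on a few individual years. This is a small illustrative sketch: the helper name `decade_of` and the sample years are not part of the notebook.

```python
def decade_of(year: int) -> int:
    """Return the first year of the decade that contains `year`."""
    # year % 10 is the last digit of the year; removing it rounds down to the decade.
    return year - (year % 10)


for year in (1900, 1987, 1999, 2000):
    print(year, '->', decade_of(year))
```

The dataframe expression applies the same calculation to every value in `yearPublished` at once.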

Display the first five rows of the dataframe again to see the new `decade` column.

In [0]:
df.head()

Now use the built-in plotting tools of Pandas to plot the number of issues from each provider by decade.

In [0]:
df.groupby(['decade', 'provider'])['id'].agg('count').unstack()\
    .plot.bar(title='Issues by decade', figsize=(20, 5), fontsize=12, stacked=True);
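
It can help to look at the table that `groupby(...).agg('count').unstack()` produces before it is handed to `plot.bar`. The sketch below uses a small made-up dataframe, with hypothetical IDs and provider names, and skips the plotting step.

```python
import pandas as pd

# Hypothetical data: five issues from two providers across two decades.
sample = pd.DataFrame({
    'id': ['example-1', 'example-2', 'example-3', 'example-4', 'example-5'],
    'provider': ['provider-a', 'provider-a', 'provider-b', 'provider-a', 'provider-a'],
    'decade': [1980, 1980, 1980, 1990, 1990],
})

# Count issues per (decade, provider) pair; the result has a two-level index.
per_pair = sample.groupby(['decade', 'provider'])['id'].agg('count')

# unstack() moves the inner index level (provider) into columns:
# one row per decade, one column per provider.
counts = per_pair.unstack()

# Pairs with no issues become NaN; fill_value=0 replaces them with zero.
counts_filled = per_pair.unstack(fill_value=0)

print(counts)
print(counts_filled)
```

Each row of the resulting table becomes one bar in the chart, and each column becomes one stacked segment of that bar.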

Do the same for the total number of pages, summing `pageCount` instead of counting issues.

In [0]:
df.groupby(['decade', 'provider'])['pageCount'].agg('sum').unstack()\
    .plot.bar(title='Pages by decade', figsize=(20, 5), fontsize=12, stacked=True);