# Quantifying characteristics of documents in your dataset

This notebook demonstrates how to quantify and ultimately visualize some characteristics of your dataset.  It includes steps to:

* Import software that will let us visualize our data
* Examine metadata from our documents
* Look at quantitative characteristics of our dataset to answer and visualize our ultimate question:
    * **Are the documents containing my keywords getting longer over time?**

A quick note before we get started. As you work through this notebook you'll see cells marked ***'optional'***. These are opportunities for you to try modifying and applying Python code to see what happens. I encourage you to try them, but you can also just run the notebook as written.

Let's do this! First we'll bring in pandas, a Python library for working with tabular data that will also help us visualize it.

In [None]:
import pandas as pd

Next we import the tdm_client library. The tdm_client library contains functions for connecting to the JSTOR server containing our corpus dataset.

In [None]:
import tdm_client

Next we'll pull in our datasets. To analyze your dataset, you'll use the dataset ID provided when you created your dataset.

**Didn't create a dataset?**  Here are a few to choose from:


* Documents published in African American Review, Black American Literature Forum, and Negro American Literature Forum (from JSTOR): <font color=red> b4668c50-a970-c4d7-eb2c-bb6d04313542 </font>


* Articles containing keywords 'plague', 'pandemic', or 'pestilence', subject areas biological sciences or health sciences, 1915-2000 in JSTOR: <font color=red> 6ef4b79b-73a2-7590-afcd-0b22e64a2a46</font>


* 'Civilian Conservation Corps' from Chronicling America: <font color=red> 9fa82dbc-9269-6deb-9720-179b4ba5e451</font>

The dataset ID looks like a long series of characters separated by dashes. Create a new variable called **dataset_id** to hold this value.

A sample **dataset ID** for data derived from searching JSTOR for 'Civilian Conservation Corps' is provided here ('e2a07be0-39f4-4b9f-b3d1-680bb04dc580'). Replace it with your own **dataset ID** to work with your dataset. The second cell then calls the **get_description** function to look up the dataset's description and stores the result in a new variable, **dset_info**. (No output will show.)

In [None]:
dataset_id = 'e2a07be0-39f4-4b9f-b3d1-680bb04dc580'

In [None]:
dset_info = tdm_client.get_description(dataset_id)

Since running that code cell doesn't yield any output, let's double-check to make sure we have the correct dataset before going further. We can look at the original query by looking at the `search_description`.

In [None]:
dset_info["search_description"]

Let's take a look at what's in there!  We can find the total number of documents in the dataset in the `num_documents` key.

In [None]:
dset_info["num_documents"]

The next command pulls the metadata elements out of our dataset. (No output will show.)

In [None]:
metadata = tdm_client.get_metadata(dataset_id)

We grabbed the metadata successfully, but at the moment it's just a CSV file.

We can convert that metadata to a pandas DataFrame to take advantage of its plotting and manipulation functions. This will present our metadata in a table-like format, and help us learn more about what's in our dataset. 

First we define this new dataframe as 'df'.

In [None]:
df = pd.read_csv(metadata)
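If you'd like to see what `pd.read_csv` does without touching your real dataset, here is a minimal sketch using a tiny, made-up CSV. The `publicationYear` and `wordCount` column names match the ones used later in this notebook; the `id` column and all of the values are invented for illustration, and your actual metadata will likely contain more columns.

```python
import io
import pandas as pd

# Toy CSV text standing in for a metadata file; every row here is invented.
csv_text = """id,publicationYear,wordCount
doc-1,1935,1200
doc-2,1936,3400
doc-3,1936,800
"""

# pd.read_csv accepts a file path or a file-like object, so StringIO works here.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)   # (number of rows, number of columns)
print(toy_df.dtypes)  # shows which columns were parsed as numbers
```

Each line of the CSV after the header becomes one row of the DataFrame, and each comma-separated field becomes a column.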

Now we'll organize the first 4 rows (aka documents) of the DataFrame we named 'df' into a table with a header, and take a look at their metadata.


In [None]:
df.head(4)

That's a little easier on the eyes!

*Optional:  How would you look at the **last 5** documents in the dataframe?  Try it in the code block below. (Hint: the opposite of 'head' is 'tail').*

Ultimately we're trying to figure out whether the documents in our dataset are getting longer over time.  We'll take a few steps to do that.

First, let's take a look at the years these documents were published.  The following code will show us the publication year range.

In [None]:
minYear = df['publicationYear'].min()
maxYear = df['publicationYear'].max()

print(str(minYear) + ' to ' + str(maxYear))

The code to find the word count range in our documents looks very similar.

In [None]:
minWords = df['wordCount'].min()
maxWords = df['wordCount'].max()

print(str(minWords) + ' to ' + str(maxWords))

*Optional:  How would you modify the code above to find the **range of page counts** in your set of documents?  Give it a try in the code cell below.*

*Optional:  Run the code cell below to find the mean number of words in the documents in your dataset.*

In [None]:
meanWords = df['wordCount'].mean()
print(str(meanWords))
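The minimum, maximum, and mean can also be gathered in one go. The sketch below uses a small made-up DataFrame (the values are invented, not drawn from any real dataset) to show the same calculations written with f-strings, plus the pandas `describe()` method, which reports several summary statistics at once.

```python
import pandas as pd

# Invented values for illustration only.
toy_df = pd.DataFrame({
    "publicationYear": [1935, 1936, 1936, 1940],
    "wordCount": [1200, 3400, 800, 2600],
})

min_words = toy_df["wordCount"].min()
max_words = toy_df["wordCount"].max()
mean_words = toy_df["wordCount"].mean()

# f-strings format the numbers directly, so no str() calls are needed.
print(f"{min_words} to {max_words}")   # prints "800 to 3400"
print(f"mean: {mean_words:.1f}")       # prints "mean: 2000.0"

# describe() returns count, mean, std, min, quartiles, and max as one Series.
summary = toy_df["wordCount"].describe()
print(summary)
```

To try this on your own data, swap `toy_df` for `df`.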

**Grand finale time!** 

Now that we know a bit more about the documents in our dataset, we'll visualize the dataset to see if documents containing our keywords are getting longer over time.


This might take a while to run, especially if your dataset is large.  
Recall that if the kernel is working you'll see this **In [ * ]** to the left of the cell.



**Note**: There is an odd bug in this code that I've been unable to squash. If no output appears, run the code cell below again and you should then see the graph.

In [None]:
df.groupby(['publicationYear'])['wordCount'].agg('sum').plot.bar(title='Word count by year', figsize=(20, 5), fontsize=12);
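One thing to keep in mind when reading this chart: the total word count for a year also goes up when more documents were published that year, even if each document is no longer than before. A per-document average may speak more directly to the "getting longer" question. Here is a minimal sketch on invented data comparing the two aggregations; it is an alternative view, not a replacement for the cell above.

```python
import pandas as pd

# Invented values: 1936 has three short documents, the other years one longer one.
toy_df = pd.DataFrame({
    "publicationYear": [1935, 1936, 1936, 1936, 1940],
    "wordCount": [2000, 1000, 1000, 1000, 2500],
})

by_year = toy_df.groupby("publicationYear")["wordCount"]

total_per_year = by_year.sum()    # total words published each year
mean_per_year = by_year.mean()    # average words per document each year
docs_per_year = by_year.count()   # number of documents each year

# Put the three measures side by side for comparison.
comparison = pd.DataFrame({
    "total": total_per_year,
    "mean": mean_per_year,
    "documents": docs_per_year,
})
print(comparison)
```

In this toy example 1936 has the highest total but the lowest average, because it simply has more documents. Chaining `.plot.bar(...)` onto `mean_per_year` with the same arguments as the cell above would chart the averages instead of the totals.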

What do you think -- are documents with your search terms getting longer over time?

Want to learn more and/or try setting up your own Jupyter Notebook?   [This is a great tutorial.](https://www.dataquest.io/blog/jupyter-notebook-tutorial/)