<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

This notebook is adapted by Zhuo Chen from the notebooks created by [Nathan Kelber](http://nkelber.com), [William Mattingly](https://datascience.si.edu/people/dr-william-mattingly) and [Melanie Walsh](https://melaniewalsh.org) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org<br />
___

# Pandas 3 

**Description:** This notebook describes how to:
* Build a dataset from Constellate
* Make a dataframe from the dataset
* Group and aggregate data
* Plot using Pandas

This is the third notebook in a series on learning to use Pandas. 

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Knowledge Required:** 
* [Pandas 1](./pandas-1.ipynb)
* [Pandas 2](./pandas-2.ipynb)
* Python Basics ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:** 
* [Python Intermediate 1](./python-intermediate-1.ipynb)
* [Python Intermediate 2](./python-intermediate-2.ipynb)
* [Python Intermediate 4](./python-intermediate-4.ipynb)

**Completion Time:** 90 minutes

**Data Format:** JSONL 

**Libraries Used:** Pandas

**Research Pipeline:** None
___


# Build a dataset from Constellate

In [None]:
# install and import constellate
!pip install constellate-client
import constellate

The dataset for today's lesson consists of JSTOR documents about economics, limited to English-language chapters published from 2007 to 2012 with full text available.

In [None]:
# Creating a variable `dataset_id` to hold our dataset ID
# The dataset is full-text chapters in English
# from JSTOR about economics, published between 2007 and 2012
dataset_id = 'f7390385-7fc6-5dde-bcdf-79724bb916e5'

In [None]:
# Download the dataset
dataset_file = constellate.get_dataset(dataset_id, 'jsonl')

## Read in the data
After we download the dataset, we can use the `dataset_reader()` function from the `constellate` module to read in the data.

In [None]:
# Use the .dataset_reader() method to read in the documents
docs = constellate.dataset_reader(dataset_file)

In [None]:
# Check the type of docs
type(docs)

Recall from [Python Intermediate 1](./python-intermediate-1.ipynb) that the difference between a list and a generator is that the latter yields only one element at a time. To return the elements one by one, we use the `next()` function.
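
As a minimal sketch of this behavior, the example below uses a small hypothetical generator function, `read_docs()`, in place of `constellate.dataset_reader()`. The function name and the sample records are invented for illustration.

```python
def read_docs():
    """Hypothetical stand-in for a dataset reader: yields one dict at a time."""
    for i in range(3):
        yield {'id': f'doc-{i}', 'wordCount': 100 * (i + 1)}

docs_demo = read_docs()      # nothing has been produced yet
first = next(docs_demo)      # yields the first element
second = next(docs_demo)     # yields the second element

# A generator can only be consumed once: this loop sees only what is left
remaining = [doc['id'] for doc in docs_demo]

# Once the generator is exhausted, next() returns the default if one is given
after_end = next(docs_demo, None)
```

Because calling `next()` consumes an element, a generator that has been partly read needs to be created again before looping over all of its documents.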

In [None]:
# Take a look at the first element of the generator docs
doc1 = next(docs)
doc1

We can see that the doc data are stored in a dictionary. 

In [None]:
# Get all keys from the dict
doc1.keys()

## Create a dataframe

In [None]:
# import the Pandas library
import pandas as pd

In [None]:
# Data of interest
data_of_interest = ['id', 'fullText', 'title', 'publicationYear', 'wordCount']

In [None]:
# Create a dataframe
df = pd.DataFrame(columns=data_of_interest)
df

In [None]:
# Get the docs again
docs = constellate.dataset_reader(dataset_file)

From each doc in docs, we want to grab the values corresponding to the keys in the list of data_of_interest and put those data under the relevant header in the dataframe.  

In [None]:
# Add one row per document; enumerate() supplies the row index, starting at 0
for index, doc in enumerate(docs):
    df.loc[index] = [doc[column] for column in data_of_interest] # use a list comprehension to build each row
df
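
Adding rows one at a time with `.loc` is easy to follow but can be slow for large datasets. As an alternative sketch, the rows can be collected in a list first and passed to `pd.DataFrame` in one step. The `sample_docs` list below is made-up data standing in for the Constellate documents, and `.get()` is used on the assumption that some documents may lack a key.

```python
import pandas as pd

columns_demo = ['id', 'title', 'publicationYear', 'wordCount']

# Made-up records standing in for the dicts yielded by the dataset reader
sample_docs = [
    {'id': 'http://www.jstor.org/stable/a1', 'title': 'A', 'publicationYear': 2007, 'wordCount': 1200, 'extra': 1},
    {'id': 'http://www.jstor.org/stable/b2', 'title': 'B', 'publicationYear': 2008, 'wordCount': 800},
    {'id': 'http://www.jstor.org/stable/c3', 'title': 'C', 'publicationYear': 2008},  # no wordCount
]

# Keep only the columns of interest; dict.get() returns None for a missing key
rows = [{col: doc.get(col) for col in columns_demo} for doc in sample_docs]
df_demo = pd.DataFrame(rows, columns=columns_demo)
```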

All the document ids start with "http://www.jstor.org/stable/". We can get rid of this part of the string and use the rest as the ids. 

In [None]:
# Shorten the ids
df['id'] = df['id'].apply(lambda r: r.split('stable/')[1])
df
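
The same shortening can be sketched with Pandas' vectorized string methods, which avoid writing a lambda. The ids below are invented examples, not values from the dataset.

```python
import pandas as pd

ids = pd.Series([
    'http://www.jstor.org/stable/j.ctt1abc.5',
    'http://www.jstor.org/stable/10.2307/j.ctt9xyz.12',
])

# .str.split() returns a list for each row (n=1 splits only at the first 'stable/');
# .str[-1] then picks the last piece of each list
short_ids = ids.str.split('stable/', n=1).str[-1]
```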

In [None]:
# Explore the dataframe
df.info()

## Group and aggregate data

Our dataset contains the full-text chapters in English from JSTOR about economics published between 2007 and 2012. Now, suppose we would like to know how many documents there are from each year. We can use the `.groupby()` method to group the data by the publication year and then use the `.count()` method to get the number of documents from each year. 

In [None]:
# Group the data by year
df.groupby('publicationYear').count()

In [None]:
# Group the data by year and count the rows in the 'id' column for each year
doc_by_year = df.groupby('publicationYear')['id'].count()
doc_by_year
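
One detail worth knowing: `.count()` counts the non-missing values in a column, while `.size()` counts rows. The toy dataframe below (invented values) has two missing titles to show the difference.

```python
import pandas as pd

toy = pd.DataFrame({
    'publicationYear': [2007, 2007, 2008, 2008, 2008],
    'id': ['a', 'b', 'c', 'd', 'e'],
    'title': ['T1', None, 'T3', 'T4', None],
})

# Number of rows in each group
rows_per_year = toy.groupby('publicationYear').size()

# Number of non-missing titles in each group
titles_per_year = toy.groupby('publicationYear')['title'].count()
```

Counting the `id` column works for the number of documents as long as every document has an id.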

We can plot a bar chart to show the number of documents from each year in the dataset visually. To do that, we first need to install the `matplotlib` library and then import its submodule `pyplot`. By convention, `pyplot` is imported under the shorter name `plt`, just as Pandas is imported as `pd`. 

Depending on the environment, matplotlib may show a chart in a separate window. In a Jupyter notebook, the magic command `%matplotlib inline` makes the chart appear immediately below the code cell that produces it. 
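
As a rough sketch of that setup, the cell below imports `pyplot` as `plt` and draws a bar chart from a small made-up Series. It assumes matplotlib is already installed. The `%matplotlib inline` line is shown as a comment because it is a notebook magic command rather than plain Python.

```python
import matplotlib.pyplot as plt
import pandas as pd

# %matplotlib inline   # notebook-only magic command; uncomment inside Jupyter

# Invented counts, indexed by year
counts = pd.Series([3, 5, 2], index=[2007, 2008, 2009], name='num_of_doc')

# Create a figure and an Axes, then let Pandas draw the bars onto that Axes
fig, ax = plt.subplots()
counts.plot(kind='bar', color='blue', ax=ax)  # the index goes on the x-axis
ax.set_xlabel('publicationYear')
ax.set_ylabel('num_of_doc')
```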

In [None]:
# Plot a bar chart to show number of docs from each year in the dataset
# (a Series plots its index, here the publication year, on the x-axis)
doc_by_year.plot(kind='bar', color='blue', ylabel='num_of_doc')

We have seen that we can group data in a dataframe by a certain label and then count how many data points we have in each subgroup. Another operation we could do is to calculate the sum of all the numerical values in a certain column after data grouping.  

For example, let's say we would like to know the sum of the word count of all the documents from each year in our dataset. To achieve this goal, we can group the data by `publicationYear`, and then aggregate the data by summing the numerical values in the column of `wordCount` for each subgroup.  

In [None]:
# Get the sum of word count for each year in the dataset, sort the result by word count
sum_word_count = df.groupby('publicationYear')['wordCount'].agg('sum').sort_values()
sum_word_count
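
`.agg()` also accepts several aggregations at once. The sketch below uses invented word counts and Pandas' named aggregation to compute the total and the mean per year. The output column names `total_words` and `mean_words` are chosen for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'publicationYear': [2007, 2007, 2008, 2008, 2008],
    'wordCount': [1000, 3000, 500, 1500, 1000],
})

# Each keyword becomes a column in the result; each value names the aggregation
summary = toy.groupby('publicationYear')['wordCount'].agg(
    total_words='sum',
    mean_words='mean',
)
```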

In [None]:
# Plot the sum of word count of the docs from each year
# (with kind='barh', the index, here the publication year, runs along the vertical axis)
sum_word_count.plot(kind='barh', color='purple')

We know that between 2007 and 2009, there was a global recession known as the Great Recession. Suppose we would like to know what percentage of the documents in our dataset mention the word 'recession'. 

In [None]:
# Take a look at our original df
df

To check whether a document mentioned 'recession' or not, we will search the full text of each document for the word 'recession'. Let's first grab the full text from one document and take a look.

In [None]:
# grab the full text of the first document
df.loc[0, 'fullText']

In [None]:
# Join the strings in each list in the column 'fullText' into a big string
df['fullText'] = df['fullText'].apply(lambda r: ''.join(r))

In [None]:
# Create a new column storing whether a document mentioned 'recession'
df['recession'] = df['fullText'].str.contains('recession')
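
Note that `.str.contains()` is case-sensitive by default, so a text containing only 'Recession' would not match. A hedged variation on made-up sentences, using the `case` and `na` parameters:

```python
import pandas as pd

texts = pd.Series([
    'The Great Recession reshaped labor markets.',
    'A mild recession followed.',
    'Growth remained steady.',
    None,  # a document with no text
])

# Case-sensitive search: misses the capitalized 'Recession' in the first text
strict = texts.str.contains('recession')

# Ignore case, and treat a missing text as "does not mention the word"
relaxed = texts.str.contains('recession', case=False, na=False)
```

Whether to ignore case depends on the research question; the notebook's search as written counts only the lowercase form.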

In [None]:
df

In [None]:
# Get the rows where the value in the 'recession' column is true
recession_docs = df[df['recession']]
recession_docs
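
Because `True` counts as 1 and `False` as 0, the mean of a boolean column gives the share of rows that are `True`. That is one way to get the overall percentage of documents that mention the word. A small sketch with invented values:

```python
import pandas as pd

toy = pd.DataFrame({
    'publicationYear': [2007, 2008, 2008, 2009],
    'recession': [False, True, True, False],
})

share = toy['recession'].mean()      # fraction of rows that are True
percent = round(share * 100, 1)      # the same value expressed as a percentage

# A boolean column can be used directly as a filter
matching = toy[toy['recession']]
```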

In [None]:
# Get how many docs from each year mentioned 'recession'
recession_docs_by_year = recession_docs.groupby('publicationYear')['recession'].count()
recession_docs_by_year

In [None]:
# Plot a pie chart showing, of all the docs that mentioned 'recession',
# what percentage is from 2007, what percentage is from 2008, and so on
recession_docs_by_year.plot(kind='pie', label="")

We can also plot a line graph to track the trend of the percentage of docs that mentioned 'recession' over the years. 

In [None]:
# Merge doc_by_year and recession_docs_by_year
recession_doc_count_by_year = pd.concat([doc_by_year, recession_docs_by_year], axis=1)
recession_doc_count_by_year
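
`pd.concat` with `axis=1` lines the two Series up by their index labels (the years). If a year had no matching documents, it would be absent from the second Series and show up as `NaN` after the merge. A sketch with invented counts, filling such gaps with 0:

```python
import pandas as pd

all_docs = pd.Series({2007: 4, 2008: 6, 2009: 5}, name='id')
matching_docs = pd.Series({2008: 3, 2009: 2}, name='recession')  # no 2007 entry

# Rows are aligned on the year index; 2007 has no match, so it becomes NaN
merged = pd.concat([all_docs, matching_docs], axis=1)

# Replace the missing count with 0 and restore an integer column
merged['recession'] = merged['recession'].fillna(0).astype(int)
```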

In [None]:
# Change the headers to make them more descriptive
recession_doc_count_by_year.rename(columns={'id':'num_of_doc', 'recession':'num_recession_doc'}, inplace=True)
recession_doc_count_by_year

In [None]:
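# Make a new column storing the percentage of docs that mentioned 'recession' for each year
# (multiply the ratio by 100 so the value is a percentage rather than a fraction)
recession_doc_count_by_year['perc_recession_doc'] = recession_doc_count_by_year['num_recession_doc'] / recession_doc_count_by_year['num_of_doc'] * 100
# Turn the 'publicationYear' index back into a regular column so it can be used as the x-axis
recession_doc_count_by_year.reset_index(inplace=True)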
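
The same per-year table can also be sketched in a single step by aggregating the boolean column within each group. The data below are invented, and the column name `share_recession_doc` is an illustrative choice.

```python
import pandas as pd

toy = pd.DataFrame({
    'publicationYear': [2007, 2007, 2008, 2008, 2008, 2008],
    'recession': [False, True, True, True, False, True],
})

by_year = toy.groupby('publicationYear')['recession'].agg(
    num_of_doc='size',            # rows per year
    num_recession_doc='sum',      # True values per year
    share_recession_doc='mean',   # fraction of True values per year
).reset_index()

# Convert the fraction to a percentage
by_year['perc_recession_doc'] = by_year['share_recession_doc'] * 100
```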

Now we are ready to plot a line graph that shows the trend of the percentage of docs that mentioned 'recession' over the years.

In [None]:
recession_doc_count_by_year.plot(x='publicationYear', y='perc_recession_doc', kind='line')

We can also make a bar chart showing the number of docs per year and the number of docs that mentioned 'recession' for each year.

In [None]:
recession_doc_count_by_year.plot(x='publicationYear', y=['num_of_doc', 'num_recession_doc'], kind='bar')

<h1 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h1>

Build a dataset on Constellate; make a dataframe from your dataset; manipulate the data; get some useful information from your dataset; plot the information you get.