<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

This notebook is adapted by Zhuo Chen from the notebooks created by [Nathan Kelber](http://nkelber.com), [William Mattingly](https://datascience.si.edu/people/dr-william-mattingly) and [Melanie Walsh](https://melaniewalsh.org) under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).<br />
For questions, comments, or improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org.<br />
___

# Pandas 3 

**Description:** This notebook describes how to:
* Build a dataset from Constellate
* Make a dataframe from the dataset
* Group and aggregate data
* Plot using Pandas

This is the third notebook in a series on learning to use Pandas. 

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Knowledge Required:** 
* [Pandas 1](./pandas-1.ipynb)
* [Pandas 2](./pandas-2.ipynb)
* Python Basics ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:** 
* [Python Intermediate 1](./python-intermediate-1.ipynb)
* [Python Intermediate 2](./python-intermediate-2.ipynb)
* [Python Intermediate 4](./python-intermediate-4.ipynb)

**Completion Time:** 60 minutes

**Data Format:** JSONL 

**Libraries Used:** Pandas

**Research Pipeline:** None
___


# Build a dataset from Constellate

The dataset for today's lesson consists of JSTOR documents about economics, limited to English-language chapters published from 2007 to 2012 with full text available.

In [None]:
# install and import constellate
!pip3 install constellate-client
import constellate

In [None]:
# Create a variable `dataset_id` to hold our dataset ID
# The dataset contains full-text chapters in English
# from JSTOR about economics published between 2007 and 2012
dataset_id = 'f7390385-7fc6-5dde-bcdf-79724bb916e5'

In [None]:
# use .get_dataset to download the dataset (sampled to 1500 documents)
# in the Constellate Document Format (jsonl) and give the file a name
dataset_file = constellate.get_dataset(dataset_id, 'economics')

If you would like to download the full dataset (up to a limit of 25,000 documents),
request it first in the builder environment. See the Constellate Client
documentation at: https://constellate.org/docs/constellate-client
Then use the `constellate.download()` method to download the dataset.

## Read in the data
After we download the dataset, we can use the `dataset_reader()` method to read in the data. 

In [None]:
# Use the .dataset_reader() method to read in the documents
docs = constellate.dataset_reader(dataset_file)

In [None]:
# Check the type of docs
type(docs)

Recall from [Python Intermediate 1](./python-intermediate-1.ipynb) that the difference between a list and a generator is that the latter yields only one element at a time. As a result, generators are more memory-efficient than lists. 

To return the elements in a generator one by one, we use the `next()` function.

In [None]:
# Take a look at the first element of the generator docs
doc1 = next(docs)
doc1

We can see that the document is loaded as a Python dictionary. 

In [None]:
# Get all keys from the dict
doc1.keys()
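To make the generator behavior described above concrete, here is a minimal sketch using a toy generator in place of the document reader. The function `toy_reader` and its data are hypothetical and not part of the Constellate client. The sketch shows that `next()` pulls one element at a time and that a generator can be consumed only once, which is why the reader has to be created again before looping over all documents later in this notebook.

```python
# A toy generator standing in for the document reader (hypothetical data)
def toy_reader():
    for i in range(3):
        yield {'id': i, 'title': f'Document {i}'}

docs_demo = toy_reader()

first = next(docs_demo)        # pulls the first element only
remaining = list(docs_demo)    # consumes the two elements that are left
leftover = list(docs_demo)     # the generator is exhausted, so nothing is left

print(first)
print(len(remaining))
print(leftover)

# Passing a default to next() avoids a StopIteration error on an exhausted generator
print(next(docs_demo, None))
```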

## Create a dataframe

In [None]:
# import the Pandas library
import pandas as pd

Suppose not all data in the documents are of interest to us. Let's select the data we are interested in.  

In [None]:
# Data of interest
data_of_interest = ['id', 'fullText', 'title', 'publicationYear', 'wordCount']

We can create an empty dataframe with the strings in this list as the column headers.

In [None]:
# Create a dataframe
df = pd.DataFrame(columns=data_of_interest)
df

In [None]:
# Get the docs again
docs = constellate.dataset_reader(dataset_file)

From each doc in docs, we want to grab the values corresponding to the keys in the list of `data_of_interest` and put those data under the relevant header in the dataframe.  

In [None]:
index = 0 # initialize a variable 'index' and give it a value of 0
for doc in docs:
    df.loc[index] = [doc[column] for column in data_of_interest] # use a list comprehension to add rows
    index = index + 1
df
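As an alternative to adding rows one at a time, the rows can be collected first and the dataframe built in a single step. The following is a minimal sketch on hypothetical data (`toy_docs` stands in for the documents from the reader); it is one possible approach, not the only one.

```python
import pandas as pd

# Hypothetical documents standing in for the output of the dataset reader
toy_docs = [
    {'id': 'a1', 'title': 'First', 'publicationYear': 2007, 'wordCount': 1200, 'extra': 'x'},
    {'id': 'b2', 'title': 'Second', 'publicationYear': 2008, 'wordCount': 3400},
]
columns = ['id', 'title', 'publicationYear', 'wordCount']

# Collect one dict per document, keeping only the keys of interest;
# .get() returns None instead of raising KeyError when a key is missing
rows = [{column: doc.get(column) for column in columns} for doc in toy_docs]

# Build the dataframe in a single step
toy_df = pd.DataFrame(rows, columns=columns)
print(toy_df)
```

Assigning with `.loc` inside a loop enlarges the dataframe on every iteration, which tends to get slow as the number of documents grows. Collecting the rows in a list first usually scales better.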

If we want, we can do some data cleaning or pre-processing after we create a dataframe. For example, when we look at the 'id' column, we can see that all document ids start with "http://www.jstor.org/stable/". We can get rid of this prefix and use the rest of the string as the ids. 

In [None]:
# Shorten the ids
df['id'] = df['id'].apply(lambda r: r.split('stable/')[1])
df
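The `split('stable/')[1]` approach assumes every id contains the prefix; an id without it would raise an `IndexError`. A more forgiving sketch, using hypothetical ids, removes the prefix with the vectorized `.str.replace()` method and leaves other ids unchanged.

```python
import pandas as pd

prefix = 'http://www.jstor.org/stable/'

# Hypothetical ids: one with the prefix, one without
ids = pd.Series([
    'http://www.jstor.org/stable/example.1',
    'no-prefix-example',
])

# regex=False treats the prefix as literal text rather than a regular expression
short_ids = ids.str.replace(prefix, '', regex=False)
print(short_ids)
```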

In [None]:
# Explore the dataframe
df.info()

# Group and aggregate data

In total, there are 1217 documents in our dataset. 

Suppose we would like to know the number of documents from each year in this dataset.

We can use the `.groupby()` method to group the documents by publication year and then use the `.size()` method to count how many rows there are in each group.  

In [None]:
# Group the data by year and count number of rows in each group
df.groupby('publicationYear').size()

In [None]:
# Create a Series storing the number of documents by year
doc_by_year = df.groupby('publicationYear').size()
doc_by_year
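To see what `.groupby().size()` returns, here is a small sketch on a hypothetical five-row dataframe. It shows that the result is a Series whose index holds the group keys, and that `.value_counts()` is another way to get the same counts.

```python
import pandas as pd

# Hypothetical data: five documents across three years
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'publicationYear': [2007, 2008, 2008, 2009, 2008],
})

# Group by year and count the rows in each group
counts = toy_df.groupby('publicationYear').size()
print(type(counts).__name__)   # the result is a Series, indexed by the group key
print(counts)

# value_counts() gives the same counts; sort_index() puts the years in order
same_counts = toy_df['publicationYear'].value_counts().sort_index()
print(same_counts)
```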

We can plot a bar chart to show the number of documents from each year in the dataset visually.

In [None]:
# Give a command to show the charts in the notebook
%matplotlib inline

In [None]:
# Plot a bar chart to show number of docs from each year in the dataset
doc_by_year.plot(kind='bar', title='Doc_by_year')

There are other calculations we can do after grouping data.

For example, let's say we would like to know the total word count of all the documents from each year in our dataset. To do this, we can group the data by `publicationYear` and then aggregate by summing the values in the `wordCount` column for each group.  

In [None]:
# Get the sum of word count for each year in the dataset, sort the result by word count
sum_word_count = df.groupby('publicationYear')['wordCount'].agg('sum').sort_values()
sum_word_count.to_frame().reset_index()
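The grouping-then-aggregating pattern can be sketched on a small hypothetical dataframe. The example below computes a single aggregate and then several aggregates at once by passing a list of function names to `.agg()`.

```python
import pandas as pd

# Hypothetical data: four documents with their word counts
toy_df = pd.DataFrame({
    'publicationYear': [2007, 2008, 2008, 2009],
    'wordCount': [1000, 2000, 3000, 500],
})

# One aggregation: total words per year (returns a Series)
total = toy_df.groupby('publicationYear')['wordCount'].sum()
print(total)

# Several aggregations at once: agg() with a list returns a dataframe
summary = toy_df.groupby('publicationYear')['wordCount'].agg(['sum', 'mean', 'count'])
print(summary)
```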

In [None]:
# Plot the sum of word count of the docs from each year
# sum_word_count is a Series, so the index (year) and the values (word count) are plotted directly
sum_word_count.plot(kind='barh', xlabel='Sum_word_count')

Between 2007 and 2009, there was a global recession known as the Great Recession. Suppose we would like to know what percentage of the documents in our dataset mention the word 'recession'. 

To check whether a document mentioned 'recession' or not, we will search the full text of each document for the word 'recession'. Let's first grab the full text from one document and take a look.

In [None]:
# grab the full text of the first document
df.loc[0, 'fullText']

In [None]:
len(df.loc[0, 'fullText'])

In [None]:
# Join the strings in each list in the column 'fullText' into a big string
df['fullText'] = df['fullText'].str.join('')

In [None]:
# A quick refresher of join
'-'.join(['a', 'b', 'c'])

In [None]:
# Create a new column storing whether a document mentioned 'recession'
df['recession'] = df['fullText'].str.contains('recession', case=False)
df
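The two string operations used above can be tried on a small hypothetical dataframe in which each `fullText` value is a list of strings. The `na=False` argument is an addition not used in the lesson's cell; it marks rows with missing text as `False` instead of leaving a missing value.

```python
import pandas as pd

# Hypothetical data: each fullText value is a list of strings
toy_df = pd.DataFrame({
    'fullText': [
        ['The Great ', 'Recession began in 2007.'],
        ['Trade policy ', 'and tariffs.'],
    ]
})

# .str.join() concatenates the strings inside each list into one string
toy_df['fullText'] = toy_df['fullText'].str.join('')

# case=False makes the match case-insensitive;
# na=False treats missing text as "no match"
toy_df['recession'] = toy_df['fullText'].str.contains('recession', case=False, na=False)
print(toy_df)
```

Note that `.str.contains()` looks for the pattern anywhere in the text, so it also matches longer words such as 'recessionary'. By default the pattern is interpreted as a regular expression; passing `regex=False` treats it as literal text.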

In [None]:
# Get the rows where the value in the 'recession' column is true
rec_docs = df[df['recession']==True]
rec_docs

In [None]:
# Get how many docs from each year mentioned 'recession'
rec_docs_by_year = rec_docs.groupby('publicationYear').size()
rec_docs_by_year

In [None]:
# Plot a pie chart showing, of all docs that mentioned 'recession',
# what percentage comes from each year (2007, 2008, and so on)
rec_docs_by_year.plot(kind='pie', autopct="%.2f", figsize=(6,6), ylabel='pct_rec_doc')

We can also plot a line graph to track the trend of the percentage of docs that mentioned 'recession' over the years. 

In [None]:
# Merge doc_by_year and rec_docs_by_year
new_df = pd.concat([doc_by_year, rec_docs_by_year], axis=1)
new_df

In [None]:
# Change the headers to make them more descriptive
new_df.rename(columns={0:'num_of_doc', 1:'num_rec_doc'}, inplace=True)
new_df

In [None]:
# Make a new column storing the share (a value between 0 and 1) of docs that mentioned 'recession' for each year
new_df['pct_rec_doc'] = round(new_df['num_rec_doc']/new_df['num_of_doc'],2)
new_df
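The combine-and-divide steps can be sketched on two small hypothetical Series. The sketch shows that `pd.concat(..., axis=1)` aligns the Series on their index, so a year missing from one of them produces a missing value. The `keys` argument, the `.fillna(0)` call, and the scaling to a 0 to 100 percentage are choices made for this example; they are not steps taken in the cells above.

```python
import pandas as pd

# Hypothetical counts; rec_counts has no entry for 2007
doc_counts = pd.Series({2007: 10, 2008: 20, 2009: 15})
rec_counts = pd.Series({2008: 5, 2009: 9})

# keys= names the resulting columns, so no separate rename step is needed
combined = pd.concat([doc_counts, rec_counts], axis=1, keys=['num_of_doc', 'num_rec_doc'])

# A year with no matching docs shows up as NaN after alignment; treat it as zero
combined['num_rec_doc'] = combined['num_rec_doc'].fillna(0)

# Express the share as a percentage between 0 and 100
combined['pct_rec_doc'] = (combined['num_rec_doc'] / combined['num_of_doc'] * 100).round(1)
print(combined)
```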

Now we are ready to plot a line graph that shows the trend of the percentage of docs that mentioned 'recession' over the years.

In [None]:
# Plot a line graph showing
# the percentage of recession docs over the years
new_df.plot(use_index=True, y='pct_rec_doc', kind='line')
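The yearly share of documents mentioning 'recession' can also be computed in one step, because the mean of a boolean column is the fraction of `True` values. The sketch below uses a hypothetical dataframe with the same column names as the lesson.

```python
import pandas as pd

# Hypothetical data: six documents flagged by whether they mention 'recession'
toy_df = pd.DataFrame({
    'publicationYear': [2007, 2007, 2008, 2008, 2008, 2008],
    'recession': [True, False, True, True, True, False],
})

# The mean of True/False values within each group is the share of True values
share = toy_df.groupby('publicationYear')['recession'].mean()
print(share)
```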

We can also make a bar chart where the number of docs in a year stands side by side with the number of docs that mentioned 'recession' in the same year.

In [None]:
# Plot a bar chart with
# number of doc and number of rec doc side by side
new_df.plot(use_index=True, y=['num_of_doc', 'num_rec_doc'], kind='bar')

<h1 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h1>

Build a dataset on Constellate; make a dataframe from your dataset; manipulate the data; get some useful information from your dataset; plot the information you get.

___
# Lesson Complete
Congratulations! You've completed the *Pandas* series. 

Considering the amount of material in *Pandas 1-3* there's a good chance you won't retain it all. That's okay. Programmers often need to look up things to accomplish a task they haven't done in a while, particularly if it is in a language they don't often use. When you're working on a project, you can always come back to these lessons as reference materials. In other words, you've learned an incredible amount, so don't be surprised if it doesn't all stick at first.

If you want to help yourself retain what you've learned, the best way is to start putting it into practice. Try your hand at creating some small Pandas projects and recognize that the things you've learned here will cement with time and practice. When you do forget a particular thing&mdash;as we all do&mdash;a quick web search often turns up some useful examples.

## Start a Text Analysis Lesson:
* [Exploring Metadata](./exploring-metadata.ipynb) 