<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

This notebook is created by Zhuo Chen under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org<br />
___

# Pandas 3 

**Description:** This notebook describes how to:
* Build a dataset from Constellate
* Make a dataframe from the dataset
* Summarize data in a dataframe
* Group and aggregate data
* Make pivot tables in Pandas

This is the third notebook in a series on learning to use Pandas. 

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Knowledge Required:** 
* [Pandas 1](./pandas-1.ipynb)
* [Pandas 2](./pandas-2.ipynb)
* Python Basics ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:** 
* [Python Intermediate 1](./python-intermediate-1.ipynb)
* [Python Intermediate 2](./python-intermediate-2.ipynb)
* [Python Intermediate 4](./python-intermediate-4.ipynb)

**Completion Time:** 60 minutes

**Data Format:** JSONL 

**Libraries Used:** Pandas

**Research Pipeline:** None
___


# Build a dataset from Constellate

The dataset for today's lesson consists of documents from JSTOR or Portico that contain the keyword "machine learning" or "artificial intelligence", fall under the subjects Arts, History, Philosophy, or Religion, are of the document type article, chapter, or book, and were published from 2011 to 2020. The full dataset contains 12,286 documents, but in class we will use only the sample of 1500 documents drawn from it. 

After we build a dataset, we will use the [Constellate Client](https://pypi.org/project/constellate-client/) to download the dataset and read in the documents. 

If you are running the notebook locally and have never installed the Constellate Client before, you may need to install the Constellate Client using `!pip install constellate-client`.

In [None]:
# Import constellate
import constellate

After you build a dataset using the Constellate builder, it will show up in the section of `All datasets` in your dashboard. A dataset id will be assigned to your dataset. To download the dataset, you will need to use the dataset id. 

In [None]:
# Creating a variable `dataset_id` to hold our dataset ID
dataset_id = 'd6232206-93bf-f6b8-9ad2-b2add01cf231'

The [Constellate Client](https://constellate.org/docs/constellate-client) has several methods we can use to download a dataset. You can choose the type of data you would like to download. The data available for download include the metadata, full data, unigrams, bigrams and trigrams. 

The `get_metadata()` method downloads the dataset metadata (sampled to 1500 documents) for a dataset in csv format.

The `get_dataset()` method downloads the dataset full data  (sampled to 1500 documents) in the Constellate Document Format (jsonl).

The `download()` method can download the non-sampled metadata, full data and ngram counts. 

See the Constellate Client documentation at: https://constellate.org/docs/constellate-client for more details. 

Here in the example, we download the sampled 1500 documents using the `get_dataset()` method. 

In [None]:
# use get_dataset() to download the sampled dataset
# and give the file a name 
dataset_file = constellate.get_dataset(dataset_id, 'ML_AI')

In [None]:
# Take a look at where the dataset is downloaded to
print(dataset_file)

If you would like to see where your downloaded file is located, follow these steps. 

1. Go to `File -> Open`. 
2. In the upper right-hand corner, go to `New -> Terminal`. 
3. In the Terminal, type the command `cd ../root/data`. Then type `ls` to see the downloaded file. 

<center><img src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/Pandas3_Terminal.png' width=1000></center>


## Read in the data
After we download the dataset, we can use the `dataset_reader()` method to read in the data. 

In [None]:
# Use the .dataset_reader() method to read in the documents
docs = constellate.dataset_reader(dataset_file)

In [None]:
# Check the type of docs
type(docs)

Recall from [Python Intermediate 5](./python-intermediate-5.ipynb) that the difference between a list and a generator is that the latter yields only one element at a time. As a result, generators are more memory-efficient than lists. 

To return the elements in a generator one by one, we use the `next()` function.

In [None]:
### Take a look at the first element of the generator docs
doc1 = next(docs)
doc1
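
The behavior of a generator can be sketched with a toy example. The function `read_toy_docs` and its data are made up for illustration and are not part of the Constellate Client. The point is that each element is produced once, so a generator that has been iterated to the end yields nothing more.

```python
# A toy stand-in for a document reader (hypothetical data, not Constellate)
def read_toy_docs():
    for i in range(3):
        yield {'id': f'doc-{i}', 'title': f'Title {i}'}

toy_docs = read_toy_docs()

first = next(toy_docs)      # consumes the first document
remaining = list(toy_docs)  # consumes the documents that are left
leftover = list(toy_docs)   # the generator is now exhausted

print(first['id'])
print(len(remaining))
print(leftover)
```

This is why the cells later in this notebook call `dataset_reader()` again before looping over the documents a second time.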

We can see that the document is loaded as a Python dictionary. You can confirm that it is a dictionary by checking the type of `doc1` using the `type()` function.

In [None]:
# use type() to check the type of doc1
type(doc1)

In [None]:
# Get all keys from the dict
doc1.keys()

<h2 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h2>

Now you know that the full data of a Constellate dataset is read into a generator of dictionaries. Here, let's get the sample dataset we used in [Pandas 2](./pandas-2.ipynb) again. The sample dataset contains all documents from JSTOR published in Shakespeare Quarterly from 1950 - 2020. 

First of all, we read in the data from the dataset and inspect the first document from the dataset. 

In [None]:
# The ID of the Shakespeare Quarterly dataset
dataset_id = 'f6ae29d4-3a70-36ee-d601-20a8c0311273'

# Use constellate.get_dataset() to download the dataset (sampled to 1500 documents)
path = constellate.get_dataset(dataset_id, 'Shake')

# Read in the data from the dataset using .dataset_reader()
docs_shake = constellate.dataset_reader(path)

# Grab the first document from the dataset
doc1_shake = next(docs_shake)

# Get the keys from the first document doc1_shake
doc1_shake.keys()

Can you follow the prompts below to create a dataframe out of the Shakespeare dataset? Use what you have learned from [Python Basics](./python-basics-1.ipynb), [Pandas 1](./pandas-1.ipynb) and [Pandas 2](./pandas-2.ipynb) to do this exercise.

From the keys of the first document, you get an idea of what kind of data are provided for each document in the dataset. In the next code cell, select the keys that are of interest to you and put them in a list. 

In [None]:
# Select the keys that are of interest to you and
# put them in a list 
keys_of_interest_shake = 

Create a dataframe storing the data of interest to you. The headers of the dataframe are the keys of interest you have just selected. Each row of the dataframe contains the relevant data from one document of the dataset. For example, if you choose 'id' and 'publicationYear' as the keys of interest, the first row of the dataframe will have the id of the first document in the 'id' column and the publication year of the first document in the 'publicationYear' column. 

To help you figure out how to create the dataframe, I'll illustrate with a very simple example. Suppose the only information you are interested in is the title of the documents in the Shakespeare dataset. How would we create a one-column dataframe storing the titles of the documents? 

In [None]:
# An example 

# import pandas
import pandas as pd

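# Re-create the generator, because next() above already consumed the first document
docs_shake = constellate.dataset_reader(path)

# Create a list storing all document titles
titles = [] 
for doc in docs_shake: 
    titles.append(doc['title'])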
    
# Create a dataframe with only one column called 'title'
# with each row storing the title of one document
pd.DataFrame({'title':titles})

Now, create a dataframe storing the data of interest to you from the Shakespeare dataset.

In [None]:
# Get the docs again
docs_shake = constellate.dataset_reader(path)

# Create a dataframe 


## Create a dataframe

In [None]:
# Creating a variable `dataset_id` to hold our dataset ID
dataset_id = 'd6232206-93bf-f6b8-9ad2-b2add01cf231'
dataset_file = constellate.get_dataset(dataset_id, 'jsonl', 'ML_AI')

Suppose not all data in the documents are of interest to us. Let's select the data we are interested in.  

In [None]:
# Data of interest
data_of_interest = ['id', 'title', 'docType', 'publicationYear', 'bigramCount']

In [None]:
# Get the docs again
docs = constellate.dataset_reader(dataset_file)

From each doc in docs, we want to grab the values corresponding to the keys in the list of `data_of_interest` and create a dataframe from the data. For a quick review of list comprehensions, take a look at [Python Intermediate 1](./python-intermediate-1.ipynb).

In [None]:
# Get all the data we need for creating a dataframe
data = [
            [doc['id'], 
             doc['title'], 
             doc['docType'], 
             doc['publicationYear'], 
             doc['bigramCount']
            ] 
            for doc in docs
       ]

In [None]:
# Create a dataframe
df = pd.DataFrame(data, columns=data_of_interest)
df
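
As an alternative sketch, each row can be built as a dictionary instead of a list, so that the column names travel with the values. The documents below are toy data invented for illustration, and `toy_keys` is a hypothetical selection. Using `dict.get()` is an assumption about how you might want to handle a key that is missing from some documents: it returns `None` instead of raising a `KeyError`.

```python
import pandas as pd

# Toy documents standing in for the dictionaries yielded by the reader
toy_docs = [
    {'id': 'a1', 'title': 'First', 'docType': 'article',
     'publicationYear': 2011, 'pageCount': 10},
    {'id': 'b2', 'title': 'Second', 'docType': 'book',
     'publicationYear': 2012},  # no 'pageCount' in this document
]

# Hypothetical keys of interest
toy_keys = ['id', 'title', 'publicationYear', 'pageCount']

# One dictionary per document; missing keys become None
rows = [{key: doc.get(key) for key in toy_keys} for doc in toy_docs]

toy_df = pd.DataFrame(rows, columns=toy_keys)
print(toy_df)
```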

## Data cleaning and pre-processing

We will often need to do some data cleaning and pre-processing after we create a dataframe. What kind of data cleaning and pre-processing you need to do depends on the specific task at hand. Here, we only give some examples. 

When we look at the 'id' column, we can see that all document ids start with "ark://27972/". We can get rid of this prefix and use the rest of the string as the ids. 

In [None]:
# Shorten the ids
prefix_len = len("ark://27972/")
def shorten_id(r): 
    r['id'] = r['id'][prefix_len:]
    return r
df = df.apply(shorten_id, axis=1)
df
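
The same result can be reached without a row-by-row `apply`. The sketch below uses the `.str` accessor to slice every string in a column at once. The ids in `toy_df` are made up for illustration.

```python
import pandas as pd

# Toy dataframe with made-up ids that share the same prefix
toy_df = pd.DataFrame({'id': ['ark://27972/abc123', 'ark://27972/xyz789']})

# Slice off the prefix from every id in one vectorized step
prefix_len = len('ark://27972/')
toy_df['id'] = toy_df['id'].str[prefix_len:]

print(toy_df)
```

On a dataframe of 1500 rows either approach is fast enough. The vectorized form tends to matter more as the number of rows grows.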

The bigramCount column gives the number of occurrences of every bigram string in a document. As you can see, punctuation marks do not count as grams. This is why 'a Mistake?' is seen as a bigram, not a trigram. With this in mind, let's make a new column storing the count of the bigram 'machine learning' and a new column storing the count of the bigram 'artificial intelligence'. 

In [None]:
# Example solution 
len_ML = len('machine learning')

# Define a function to return the count of 'machine learning' and 'artificial intelligence'
def count(r):
    count_ML = 0
    count_AI = 0
    for key in r['bigramCount'].keys():
        key_lower = key.lower()
        if len(key)<len_ML:
            continue
        elif 'machine learning' in key_lower:
            count_ML += r['bigramCount'][key]
        elif 'artificial intelligence' in key_lower:
            count_AI += r['bigramCount'][key]
    return [count_ML, count_AI]

# Create a column with the count of 'machine learning'
# and a column with the count of 'artificial intelligence'
df[['ML_count', 'AI_count']] = df.apply(count, axis=1, result_type='expand')

In [None]:
# Take a look at the updated df
df
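
To see why the function lowercases each key and tests for a substring rather than an exact match, consider the toy bigram dictionary below. The counts are invented for illustration, and `count_phrase` is a hypothetical helper, a simplified version of the idea used in `count()` above.

```python
# A toy bigramCount dictionary with made-up counts
toy_bigrams = {
    'machine learning': 3,
    'Machine Learning': 1,      # different capitalization
    'machine learning.': 2,     # trailing punctuation is part of the key
    'artificial intelligence': 4,
    'learning algorithms': 5,   # unrelated bigram
}

def count_phrase(bigram_counts, phrase):
    """Sum the counts of every key that contains the phrase, ignoring case."""
    return sum(n for bigram, n in bigram_counts.items()
               if phrase in bigram.lower())

ml_total = count_phrase(toy_bigrams, 'machine learning')
ai_total = count_phrase(toy_bigrams, 'artificial intelligence')

print(ml_total, ai_total)
```

An exact match on `'machine learning'` would have found only 3 of the 6 occurrences in this toy dictionary.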

In [None]:
# Drop the bigramCount column
df = df.drop('bigramCount', axis=1)

## Group and aggregate data

After data cleaning, filtering and preprocessing, the next step is to summarize the data to extract useful information.

Pandas makes summarising a dataframe very easy. For example, we can count how many non-null values there are in each column using the `.count()` method. 

In [None]:
# Get the number of non-null values in each column
df.count()

We can also get the max value or the min value of a column using the `.max()` and `.min()` methods. 

In [None]:
# Get the max value from the year column
df['publicationYear'].max()

In [None]:
# Get the min value from the year column
df['publicationYear'].min()

You can refer to the Pandas documentation for more methods that you can use to query the data. 

When you summarize a dataframe, a very useful method is `.describe()`. It can quickly display the statistics for any group of data it is applied to. 

In [None]:
# Use the .describe() method to explore the title column
df['title'].describe()
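
What `.describe()` reports depends on the kind of data in the column. The sketch below uses a toy dataframe with made-up values to contrast a numeric column with a text column.

```python
import pandas as pd

# Toy dataframe with made-up titles and years
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'A', 'C'],
    'publicationYear': [2011, 2012, 2012, 2015],
})

# Numeric column: count, mean, std, min, quartiles, max
year_stats = toy_df['publicationYear'].describe()

# Text column: count, number of unique values, most frequent value, its frequency
title_stats = toy_df['title'].describe()

print(year_stats)
print(title_stats)
```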

### Groupby()

Groupby is a powerful function built into Pandas that you can use to summarize your data. Groupby splits the data into different groups on a variable of your choice. 

In [None]:
# Group the data by docType
df.groupby('docType')

The groupby() method returns a GroupBy object which describes how the rows of the original dataset have been split by the selected variable. You can actually see how the rows of the original dataframe have been grouped using the `groups` attribute after applying `groupby()`.

In [None]:
# See how the rows have been grouped
df.groupby('docType').groups

As you can see, a dictionary is returned whose keys are the unique values in docType. The value for each key holds the indexes of the rows that belong to that group.

You can group the data using multiple variables. For example, you may want to group the documents first by their publication year and then by the document type. 

In [None]:
# Group by multiple variables 
# Take a look at the composite keys
df.groupby(['publicationYear', 'docType']).groups

If you take a look at the groups in the groupby object, you will see that essentially we have a composite key for each group. The first key, for example, is (2011, 'article'). The value associated with this key is a list of indexes, all of which are the rows storing the documents that were published in 2011 and are of the docType 'article'.

Of course, we don't stop at grouping data. Grouping is only a step towards querying the data. After we apply the `.groupby()` method, we can use different Pandas methods to query each group. For example, how do we get the number of documents in each docType by publicationYear?

In [None]:
# Create a series storing the number of documents in each doc type by year
df.groupby(['publicationYear', 'docType']).size()
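
The series returned by `.size()` has one entry per composite key. If a table is easier to read, one option is to move the second key into the columns with `.unstack()`. The sketch below uses a toy dataframe with made-up rows. The `fill_value=0` argument is a choice made here so that combinations with no documents show 0 rather than a missing value.

```python
import pandas as pd

# Toy dataframe with made-up rows
toy_df = pd.DataFrame({
    'publicationYear': [2011, 2011, 2011, 2012],
    'docType': ['article', 'article', 'book', 'article'],
})

# One count per (year, docType) combination
sizes = toy_df.groupby(['publicationYear', 'docType']).size()

# Years as rows, document types as columns
table = sizes.unstack(fill_value=0)

print(sizes)
print(table)
```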

### Agg() 

After we group the data in a dataframe, we can apply the `agg()` method to calculate multiple statistics per group in one calculation. 

For example, let's say we would like to know the sum of the occurrences of the bigram 'machine learning' in all the documents from each year. To achieve this goal, we can group the data by `publicationYear`, and then aggregate the data by summing the numerical values in the column of `ML_count` for each subgroup.  

In [None]:
# Get how many times 'machine learning' is
# mentioned in the docs each year
df.groupby('publicationYear').agg({'ML_count':'sum'})

Of course, you can choose other ways to aggregate the data in each subgroup. For example, suppose you are interested in the largest number of times a single document mentions 'artificial intelligence' in each year.

In [None]:
# the largest number of mentions of 
# 'artificial intelligence' in a single document, by year
df.groupby('publicationYear').agg({'AI_count':'max'})

We can specify multiple columns and apply a different function to each of them. 

In [None]:
# apply one function per selected column in each subgroup
df.groupby('publicationYear').agg({'AI_count':'sum', 'ML_count': 'max'})

We can also apply multiple functions to each of the selected columns.

In [None]:
# apply multiple functions to selected columns in each subgroup
df.groupby('publicationYear').agg({'AI_count':['sum', 'max'], 'ML_count':['max', 'count']})
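
Applying several functions per column produces two-level column headers. If flat, descriptive column names are preferred, Pandas also supports named aggregation, where each keyword argument to `agg()` is an output column name paired with a `(column, function)` tuple. The sketch below uses a toy dataframe with made-up counts, and the output names `AI_total`, `AI_max` and `ML_max` are arbitrary choices.

```python
import pandas as pd

# Toy dataframe with made-up counts
toy_df = pd.DataFrame({
    'publicationYear': [2011, 2011, 2012],
    'AI_count': [1, 4, 2],
    'ML_count': [0, 3, 5],
})

# Named aggregation: output_name=(input_column, function)
summary = toy_df.groupby('publicationYear').agg(
    AI_total=('AI_count', 'sum'),
    AI_max=('AI_count', 'max'),
    ML_max=('ML_count', 'max'),
)

print(summary)
```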

<h2 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h2>

Take the following dataframe containing the information on the Covid19 cases in the state of Massachusetts. Can you work with the dataframe to find out which month of which year has the most positive new cases?

In [None]:
### Get the data of covid19 cases in MA and create a dataframe 

# Download the data file (an Excel workbook with several sheets)
from pathlib import Path
import urllib.request

# Check if a data folder exists. If not, create it.
data_folder = Path('./data/')
data_folder.mkdir(exist_ok=True)

# Download the file
url = 'https://www.mass.gov/doc/covid-19-raw-data-march-9-2023/download'
urllib.request.urlretrieve(url, './data/covid_MA.xlsx')

# Success message
print('Sample file ready.')

# Install the openpyxl library, which pandas uses to read .xlsx files
!pip install openpyxl

# Read in the sheet containing the info about positive cases
covid_ma = pd.read_excel('./data/covid_MA.xlsx', sheet_name='Cases (Report Date)')

# Change the dtype of the Date column for later use
covid_ma['Date'] = covid_ma['Date'].astype(str)

# Take a look at the dataframe
covid_ma

In [None]:
# Find out which month of which year has the most positive new cases
# Note that the dtype for the values in Date column is str


## Make pivot tables in Pandas

Pandas has a `.pivot_table()` method we can use to summarize data. We call it on a dataframe and pass parameters such as `index`, `values` and `aggfunc` to specify the rows of the pivot table, the columns to aggregate, and the function used to aggregate them. 

In the previous section, we used the `.groupby()` and `.agg()` methods to summarize data. For example, we grouped the documents in the dataframe df by their year of publication and calculated the sum of the mentions of the bigram 'artificial intelligence'. We can do the same thing using the `.pivot_table()` method. 

In [None]:
# Create a pivot table giving the sum of 
# the mentions of 'artificial intelligence' by year
df.pivot_table(index='publicationYear',
               values='AI_count',
               aggfunc='sum')

Again, when aggregating the data, you can apply a single function to multiple columns. 

In [None]:
# Create a pivot table giving the sum of 
# the mentions of 'machine learning' and 'artificial intelligence' by year
df.pivot_table(index='publicationYear',
               values=['AI_count', 'ML_count'],
               aggfunc='sum')

You can also apply multiple functions to a single column. 

In [None]:
# Create a pivot table giving the sum and the max value of
# the mentions of 'artificial intelligence' by year
df.pivot_table(index='publicationYear',
               values='AI_count',
               aggfunc=['sum', 'max'])

Or, you can apply different functions to different columns. 

In [None]:
# Create a pivot table giving the sum of
# the mentions of 'artificial intelligence' by year
# and the max value of the mentions of 'machine learning' by year
df.pivot_table(index='publicationYear',
               values=['AI_count', 'ML_count'],
               aggfunc={'AI_count':'sum', 'ML_count':'max'})
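
The pivot tables so far have only used `index`, which makes them close to a `.groupby()` followed by `.agg()`. The `.pivot_table()` method also accepts a `columns` parameter, which spreads a second grouping variable across the columns. The sketch below uses a toy dataframe with made-up counts. The `fill_value=0` argument is a choice made here so that year and docType combinations with no documents show 0.

```python
import pandas as pd

# Toy dataframe with made-up counts
toy_df = pd.DataFrame({
    'publicationYear': [2011, 2011, 2011, 2012],
    'docType': ['article', 'book', 'article', 'article'],
    'AI_count': [1, 4, 2, 3],
})

# Years as rows, document types as columns, summed AI_count in the cells
pivot = toy_df.pivot_table(index='publicationYear',
                           columns='docType',
                           values='AI_count',
                           aggfunc='sum',
                           fill_value=0)

print(pivot)
```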

<h2 style="color:red; display:inline">Coding Challenge! &lt; / &gt; </h2>

Get the dataframe stored in the variable `covid_ma`. Can you make a pivot table showing the sum of the positive cases from 2020 - 2023 in that table? 

## A teaser for the Data Visualization class

We have learned how to create a dataset from Constellate, how to preprocess the data and how to summarize the data. With the information we get from summarizing the data, we can go ahead and plot it!

For example, let's plot the number of docs that mentioned 'artificial intelligence' or 'machine learning' from 2011 - 2020. 

In [None]:
# To show the graph inside the notebook
%matplotlib inline

In [None]:
# Count the documents per year and plot the counts as a bar chart
df.groupby('publicationYear').size().plot(kind='bar', ylabel='num_doc')

___
# Lesson Complete
Congratulations! You've completed the *Pandas* series. 

Considering the amount of material in *Pandas 1-3* there's a good chance you won't retain it all. That's okay. Programmers often need to look up things to accomplish a task they haven't done in a while, particularly if it is in a language they don't often use. When you're working on a project, you can always come back to these lessons as reference materials. In other words, you've learned an incredible amount, so don't be surprised if it doesn't all stick at first.

If you want to help yourself retain what you've learned, the best way is to start putting it into practice. Try your hand at creating some small Pandas projects and recognize that the things you've learned here will cement with time and practice. When you do forget a particular thing&mdash;as we all do&mdash;a quick web search often turns up some useful examples.

## Start a Text Analysis Lesson:
* [Exploring Metadata](./exploring-metadata.ipynb) 

## Solutions to exercises

Here are the solutions to some of the exercises in this notebook.

In [None]:
### Find out which month of which year has the most positive cases of covid19 in MA

# Download the data file (an Excel workbook with several sheets)
from pathlib import Path
import urllib.request

# Check if a data folder exists. If not, create it.
data_folder = Path('./data/')
data_folder.mkdir(exist_ok=True)

# Download the file
url = 'https://www.mass.gov/doc/covid-19-raw-data-march-9-2023/download'
urllib.request.urlretrieve(url, './data/covid_MA.xlsx')

# Success message
print('Sample file ready.')

# Read in the sheet containing the info about positive cases
covid_ma = pd.read_excel('./data/covid_MA.xlsx', sheet_name='Cases (Report Date)')

# Change the dtype of the Date column for later use
covid_ma['Date'] = covid_ma['Date'].astype(str)

# Extract the year and month (e.g. '2022-01') from the date column
covid_ma['group_var'] = covid_ma['Date'].apply(lambda r: r.rsplit('-', 1)[0])

# Use the new column as a grouping variable and divide the data into subgroups
# Aggregate the data using 'sum' function and sort the results in descending order
covid_ma.groupby('group_var').agg({'Positive New':'sum'}).sort_values('Positive New', ascending=False)
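
As an alternative sketch, the year and month can be extracted from real dates rather than from strings. The dataframe below is toy data with made-up numbers that only borrows the column names `Date` and `Positive New`. The column name `year_month` is a hypothetical choice. It relies on `pd.to_datetime()` and the `.dt.to_period('M')` accessor, which turns each date into a monthly period.

```python
import pandas as pd

# Toy data with made-up case numbers
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2022-01-05', '2022-01-20',
                            '2022-02-01', '2021-12-31']),
    'Positive New': [100, 250, 80, 300],
})

# Convert each date to its year-month period, e.g. 2022-01
toy['year_month'] = toy['Date'].dt.to_period('M')

# Sum the new cases per month and find the month with the largest sum
monthly = toy.groupby('year_month')['Positive New'].sum()
busiest = monthly.idxmax()

print(monthly)
print(busiest)
```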

In [None]:
### Make a pivot table showing the sum of positive covid19 cases by year in covid_ma

# Extract the year from the date strings
covid_ma['Year'] = covid_ma['Date'].str.slice(0,4)

# Sum the daily new cases ('Positive New') within each year
covid_ma.pivot_table(index='Year',
                     values='Positive New',
                     aggfunc='sum')