# Understanding PyPlot, Matplotlib's Implicit Interface
__________________________________________________

## About:

This tutorial provides an overview of using Matplotlib's implicit plotting interface, PyPlot. It is not intended as a data visualization tutorial. 

This tutorial was developed by Margaret Gratian and is adapted from Matplotlib's official documentation: https://matplotlib.org/stable/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py. It illustrates how Matplotlib's implicit plotting interface can be used with public award data from NIH RePORTER. The goal is to plot, for each fiscal year, the number of unique NCI application IDs and base projects.

## Inputs:
- Input Filepath 1: "../data/public_nih_reporter_data.csv"
    - Public NCI R01 awards in FY 2022 - 2024 from NIH RePORTER. Data is as of 3/13/2025. 

## Import Packages 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Read in Data 

In [None]:
raw_reporter_df = pd.read_csv("../data/public_nih_reporter_data.csv", skiprows=1, index_col=0)

# See the shape
print(raw_reporter_df.shape)

# Preview the data
raw_reporter_df.head()

## Dataset Development

### Prep Data for Analysis

In [None]:
# Check for duplicates and drop any
# Note we make a copy and do not modify our original input data
reporter_df = raw_reporter_df.copy().drop_duplicates()

reporter_df.shape

In [None]:
# Check whether appl_id, a unique identifier in the NIH data, is unique across rows
# The count below should equal the number of rows in reporter_df.shape
reporter_df["appl_id"].nunique()
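
The uniqueness check above can also be written as an explicit comparison. The following is a minimal sketch on hypothetical toy data (`toy_df` and its values are made up for illustration and are not the RePORTER data):

```python
import pandas as pd

# Hypothetical toy data standing in for the RePORTER extract
toy_df = pd.DataFrame({
    "appl_id": [101, 102, 103, 103],
    "fiscal_year": [2022, 2022, 2023, 2023],
})

# Drop exact duplicate rows, as in the cell above
deduped = toy_df.drop_duplicates()

# True when every appl_id appears exactly once
ids_are_unique = deduped["appl_id"].is_unique

# Equivalent check: the unique count equals the number of rows
counts_match = deduped["appl_id"].nunique() == deduped.shape[0]

print(ids_are_unique, counts_match)
```

Either check returns a single boolean, which is easier to scan than comparing two printed numbers by eye.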

### Group Data for Plotting

In [None]:
# Group by fiscal year and count unique appl ids and unique project serial numbers (also known as base project numbers in the NIH data)
grouped_df = reporter_df.groupby(["fiscal_year"], as_index=False).agg({"appl_id":  "nunique", "project_serial_num":  "nunique"})

# Reset column names
grouped_df.columns = ["fiscal_year", "appl_id_count", "project_serial_num_count"]

In [None]:
grouped_df.head()
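
As an alternative to renaming the columns afterwards, pandas named aggregation sets the output column names inside `agg` itself. This is a sketch on hypothetical toy data (`toy_df` and its values are assumptions for illustration), not a replacement for the cells above:

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the tutorial's dataset
toy_df = pd.DataFrame({
    "appl_id": [1, 2, 3, 4, 5],
    "fiscal_year": [2022, 2022, 2023, 2023, 2023],
    "project_serial_num": ["A", "A", "B", "C", "C"],
})

# Named aggregation: output_column=(input_column, aggregation function)
toy_grouped = toy_df.groupby("fiscal_year", as_index=False).agg(
    appl_id_count=("appl_id", "nunique"),
    project_serial_num_count=("project_serial_num", "nunique"),
)

print(toy_grouped)
```

Because the names are assigned next to the aggregation they describe, the result does not depend on the column order when renaming.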

## Analyze and Extract Insights from Data

### Plot Unique Appl IDs

In [None]:
# Plot the data

plt.plot("fiscal_year", "appl_id_count", data=grouped_df[["fiscal_year", "appl_id_count"]])

# Show the plot
plt.show()
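
The interface is called "implicit" because pyplot tracks a current figure and current axes, and each `plt.*` call acts on them. The sketch below, using made-up values, inspects that hidden state with `plt.gcf()` and `plt.gca()`:

```python
import matplotlib.pyplot as plt

# plt.plot creates a figure and axes if none exist, then draws on them
lines = plt.plot([2022, 2023, 2024], [10, 12, 11])  # made-up values

# pyplot remembers the "current" figure and axes for later calls
fig = plt.gcf()
ax = plt.gca()

# The line returned by plt.plot lives on the current axes
line_on_current_axes = lines[0] in ax.get_lines()

# plt.xlabel acts on the same current axes
plt.xlabel("Fiscal Year")
label_text = ax.get_xlabel()

plt.close(fig)
```

This is why the cells in this tutorial can call `plt.xlabel`, `plt.title`, and similar functions without naming the plot they apply to.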

In [None]:
# Plot again, this time adding some labels 
plt.plot("fiscal_year", "appl_id_count", data=grouped_df[["fiscal_year", "appl_id_count"]])

# Add the x and y labels
plt.ylabel('Number of Applications')
plt.xlabel('Fiscal Year')

# Add a title
plt.title('NCI R01 Applications per Fiscal Year, 2022-2024')

# Show the plot again
plt.show()

In [None]:
# Plot again, this time changing the color

# A third positional argument, the format string, is now added; 'r' draws the line in red
plt.plot("fiscal_year", "appl_id_count", 'r', data=grouped_df[["fiscal_year", "appl_id_count"]])

# Add the x and y labels
plt.ylabel('Number of Applications')
plt.xlabel('Fiscal Year')

# Add a title
plt.title('NCI R01 Applications per Fiscal Year, 2022-2024')

# Show the plot again
plt.show()
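
The format string can encode more than a color: it may combine a color, a marker, and a line style. A small sketch with made-up values (the `'go--'` string is chosen here for illustration):

```python
import matplotlib.pyplot as plt
from matplotlib.colors import same_color

years = [2022, 2023, 2024]
counts = [10, 12, 11]  # made-up values for illustration

# 'r' alone: red, with the default solid line style and no marker
(solid_line,) = plt.plot(years, counts, "r")

# 'go--': green ('g'), circle markers ('o'), dashed line ('--')
(dashed_line,) = plt.plot(years, counts, "go--")

solid_is_red = same_color(solid_line.get_color(), "red")
dashed_is_green = same_color(dashed_line.get_color(), "g")
dashed_marker = dashed_line.get_marker()
dashed_style = dashed_line.get_linestyle()

plt.close()
```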

In [None]:
# Plot again, this time formatting the axis
plt.plot("fiscal_year", "appl_id_count", 'r', data=grouped_df[["fiscal_year", "appl_id_count"]])

# Add the x and y labels
plt.ylabel('Number of Applications')
plt.xlabel('Fiscal Year')

# Add a title
plt.title('NCI R01 Applications per Fiscal Year, 2022-2024')

# Adjust the y axis - we should start at 0 
plt.ylim(0, 5000)

# Adjust the x tick marks 
# Here, we pass an array of numbers built with np.arange, giving the start (inclusive),
# the stop (exclusive), and the step size
# This is equivalent to plt.xticks([2022, 2023, 2024])

plt.xticks(np.arange(2022, 2025, 1))

# Show the plot again
plt.show()
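
To confirm what `np.arange` produces and what the axis functions set, the sketch below uses made-up values. Called with no arguments, `plt.ylim()` and `plt.xticks()` return the current settings instead of changing them:

```python
import matplotlib.pyplot as plt
import numpy as np

# The stop value 2025 is excluded, so this yields 2022, 2023, 2024
ticks = np.arange(2022, 2025, 1)

plt.plot([2022, 2023, 2024], [10, 12, 11])  # made-up values
plt.ylim(0, 5000)
plt.xticks(ticks)

# With no arguments, these functions act as getters
y_limits = plt.ylim()
tick_locations, tick_labels = plt.xticks()

plt.close()
```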

In [None]:
# Now, let's adjust the figure size and the font sizes

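# Setting the figure size MUST happen first: plt.figure creates a new, empty figure,
# so calling it after plt.plot would leave the plotted data on a different figure
# figsize=(x, y) (x controls the width, y controls the height, both in inches)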
plt.figure(figsize=(15, 7))

# Plot the data on the newly created figure
plt.plot("fiscal_year", "appl_id_count", 'r', data=grouped_df[["fiscal_year", "appl_id_count"]])

# Add the x and y labels
plt.ylabel('Number of Applications', size=15)
plt.xlabel('Fiscal Year', size=15)

# Add a title
plt.title('NCI R01 Applications per Fiscal Year, 2022-2024')

# Adjust the y axis
plt.ylim(0, 5000)
plt.yticks(fontsize=13)

# Adjust the x axis
plt.xticks(np.arange(2022, 2025, 1))

# Show the plot again
plt.show()
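
The ordering requirement follows from what `plt.figure` does: it creates a new figure and makes it current. The sketch below, with made-up values, shows what happens when it is called after plotting:

```python
import matplotlib.pyplot as plt

# Wrong order: plt.plot creates a default-sized figure first
plt.plot([2022, 2023, 2024], [10, 12, 11])  # made-up values
early_fig = plt.gcf()

# plt.figure now creates a second, empty figure of the requested size
late_fig = plt.figure(figsize=(15, 7))

early_axes_count = len(early_fig.axes)  # the line lives on the first figure
late_axes_count = len(late_fig.axes)    # the resized figure has nothing on it
late_size = tuple(late_fig.get_size_inches())

plt.close("all")
```

The plotted line stays on the first figure, while the figure with the requested size remains empty.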

### Plot Unique Appl Ids and Base Project Numbers on One Plot

In [None]:
# Setting the figure size MUST happen first, because plt.figure creates a new, empty figure
# figsize=(x, y) (x controls the width, y controls the height, both in inches)
plt.figure(figsize=(15, 7))

# Plotting on the same axes
plt.plot('fiscal_year', 'appl_id_count', 'r', data=grouped_df)
plt.plot('fiscal_year', 'project_serial_num_count', 'b', data=grouped_df)

# Add the x and y labels
plt.ylabel('Count', size=15)
plt.xlabel('Fiscal Year', size=15)

# Add a title
plt.title('NCI R01 Applications and Base Projects per Fiscal Year, 2022-2024', size=20)

# Adjust the x tick marks and add fontsize and a rotation
plt.xticks(np.arange(2022, 2025, 1), fontsize=13, rotation=45)

# Adjust the y axis
plt.ylim(0, 5000)
plt.yticks(fontsize=13)

# Add a legend and specify where we want it 
# Options here are: 'best', 'upper right', 'upper left', 'lower left', 'lower right', 
# 'right', 'center left', 'center right', 'lower center', 'upper center', 'center'
# We'll also specify the fontsize
plt.legend(loc='best', fontsize=15)

plt.show()
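
When `data=` is used, the legend entries default to the column names. Passing `label=` gives each line a more readable entry. A sketch on hypothetical toy data (the values and the label texts are assumptions for illustration):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data with the same column names as grouped_df
toy_df = pd.DataFrame({
    "fiscal_year": [2022, 2023, 2024],
    "appl_id_count": [10, 12, 11],
    "project_serial_num_count": [8, 9, 9],
})

# label= overrides the default legend entry (the y column name)
plt.plot("fiscal_year", "appl_id_count", "r", data=toy_df, label="Applications")
plt.plot("fiscal_year", "project_serial_num_count", "b", data=toy_df, label="Base Projects")

legend = plt.legend(loc="best", fontsize=15)
legend_labels = [text.get_text() for text in legend.get_texts()]

plt.close()
```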

### Plot Unique Appl Ids and Base Project Numbers on Separate Plots

In [None]:
# figsize=(x, y) (x controls width, y controls the height)
plt.figure(figsize=(15, 10))

## SWITCHING TO SUBPLOTS ##
# plt.subplot(nrows, ncols, index) selects one cell of a grid of plots
# Here, 2 rows and 1 column give two vertically stacked plots; index 1 is the top one
plt.subplot(2, 1, 1)
plt.plot('fiscal_year', 'appl_id_count', 'r', data=grouped_df)

# Add a title that serves for both subplots
# Note that this title is technically attached to subplot(2, 1, 1)
plt.title('NCI R01 Applications and Base Projects per Fiscal Year, 2022-2024', size=20)

# Add the y label
# We'll rely on subplot(2, 1, 2) for the x label
plt.ylabel('Count', size=15)

# Adjust the y axis start
plt.ylim(0, 5000)

# Adjust the x tick marks and add fontsize and a rotation, again for subplot(2, 1, 1)
plt.xticks(np.arange(2022, 2025, 1), fontsize=13, rotation=45)

# Add a legend for subplot(2, 1, 1)
plt.legend(fontsize=15)

## NEXT SUBPLOT ##
# The third argument is the subplot index; 2 selects the bottom plot
plt.subplot(2, 1, 2)
plt.plot('fiscal_year', 'project_serial_num_count', 'b', data=grouped_df)

# Add the x and y labels for subplot(2, 1, 2)
plt.ylabel('Count', size=15)
plt.xlabel('Fiscal Year', size=15)

# Adjust the x tick marks and add fontsize and a rotation for subplot(2, 1, 2)
plt.xticks(np.arange(2022, 2025, 1), fontsize=13, rotation=45)

# Adjust the y axis start
plt.ylim(0, 5000)

# For the y axis, we will just adjust the fontsize and leave other things as is for subplot(2, 1, 2)
plt.yticks(fontsize=13)

# Add a legend for subplot(2, 1, 2)
plt.legend(fontsize=15)

plt.show()
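
Each call to `plt.subplot` changes the current axes, so every following `plt.*` call applies to that subplot only. The sketch below illustrates this, and also shows `plt.suptitle`, which attaches a title to the whole figure rather than to the first subplot (the title texts are placeholders):

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15, 10))

# Arguments are (rows, columns, index); index 1 is the top subplot
top = plt.subplot(2, 1, 1)
plt.title("Top")  # applies to the top subplot only

# Index 2 is the bottom subplot, which now becomes the current axes
bottom = plt.subplot(2, 1, 2)
plt.title("Bottom")
current_is_bottom = plt.gca() is bottom

# A figure-level title, independent of either subplot
figure_title = plt.suptitle("Shared title", size=20)

subplot_titles = [ax.get_title() for ax in fig.axes]
shared_title_text = figure_title.get_text()

plt.close(fig)
```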

In [None]:
# figsize=(x, y) (x controls width, y controls the height)
plt.figure(figsize=(15, 10))

## SWITCHING TO SUBPLOTS ##
# As before, plt.subplot(2, 1, 1) selects the top of two vertically stacked plots
# This time we draw bar charts with plt.bar instead of lines
plt.subplot(2, 1, 1)
plt.bar('fiscal_year', 'appl_id_count', color='g', data=grouped_df)

# Add a title that serves for both subplots
# Note that this title is technically attached to subplot(2, 1, 1)
plt.title('NCI R01 Applications and Base Projects per Fiscal Year, 2022-2024', size=20)

# Add the y label
# We'll rely on subplot(2, 1, 2) for the x label
plt.ylabel('Count', size=15)

# Adjust the y axis start
plt.ylim(0, 5000)

# Adjust the x tick marks and add fontsize and a rotation, again for subplot(2, 1, 1)
plt.xticks(np.arange(2022, 2025, 1), fontsize=13, rotation=45)

## NEXT SUBPLOT ##
# The third argument is the subplot index; 2 selects the bottom plot
plt.subplot(2, 1, 2)
plt.bar('fiscal_year', 'project_serial_num_count', color='b', data=grouped_df)

# Add the x and y labels for subplot(2, 1, 2)
plt.ylabel('Count', size=15)
plt.xlabel('Fiscal Year', size=15)

# Adjust the x tick marks and add fontsize and a rotation for subplot(2, 1, 2)
plt.xticks(np.arange(2022, 2025, 1), fontsize=13, rotation=45)

# Adjust the y axis start
plt.ylim(0, 5000)

# For the y axis, we will just adjust the fontsize and leave other things as is for subplot(2, 1, 2)
plt.yticks(fontsize=13)

plt.show()
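
With only three bars, printing the count above each bar can make the chart easier to read. This is a sketch on hypothetical toy data, assuming Matplotlib 3.4 or later, where `plt.bar_label` is available:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data with the same column names as grouped_df
toy_df = pd.DataFrame({
    "fiscal_year": [2022, 2023, 2024],
    "appl_id_count": [10, 12, 11],
})

# plt.bar returns a container holding one rectangle per bar
bars = plt.bar("fiscal_year", "appl_id_count", color="g", data=toy_df)
heights = [bar.get_height() for bar in bars]

# Write each bar's value just above it
labels = plt.bar_label(bars, fontsize=13)

plt.close()
```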

### Annotating Text on a Plot

In [None]:
# Let's say we want to point out the maximum on a graph with an arrow
# First, find the x and y coordinates of the maximum

print(grouped_df["appl_id_count"].max())
print(grouped_df["project_serial_num_count"].max())

# What year does this max occur?
print(grouped_df[grouped_df["appl_id_count"] == grouped_df["appl_id_count"].max()])
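
The year and value of the maximum can also be pulled out as plain numbers with `idxmax`, which returns the index label of the first row holding the maximum. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data with the same column names as grouped_df
toy_df = pd.DataFrame({
    "fiscal_year": [2022, 2023, 2024],
    "appl_id_count": [10, 12, 11],
})

# Index label of the row with the largest count
max_idx = toy_df["appl_id_count"].idxmax()

# Look up the coordinates of that row as plain Python ints
max_year = int(toy_df.loc[max_idx, "fiscal_year"])
max_count = int(toy_df.loc[max_idx, "appl_id_count"])

print(max_year, max_count)
```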

In [None]:
# figsize=(x, y) (x controls width, y controls the height)
plt.figure(figsize=(15, 7))

# Plotting on the same axes
plt.plot('fiscal_year', 'appl_id_count', 'r', data=grouped_df)
plt.plot('fiscal_year', 'project_serial_num_count', 'b', data=grouped_df)

# Add the x and y labels
plt.ylabel('Count', size=15)
plt.xlabel('Fiscal Year', size=15)

# Add a title
plt.title('NCI R01 Applications and Base Projects per Fiscal Year, 2022-2024', size=20)

# Adjust the x tick marks and add fontsize and a rotation
plt.xticks(np.arange(2022, 2025, 1), fontsize=13, rotation=45)

# For the y axis, we will just adjust the fontsize and leave other things as is
plt.yticks(fontsize=13)

# Adjust the y axis start
plt.ylim(0, 5000)

# Add a legend and specify where we want it 
# Options here are: 'best', 'upper right', 'upper left', 'lower left', 'lower right', 
# 'right', 'center left', 'center right', 'lower center', 'upper center', 'center'
# We'll also specify the fontsize
plt.legend(loc='best', fontsize=15)

### ADDING AN ARROW ###
# xy is the point the arrow points to: the maximum found in the previous cell
# xytext is where the label text is placed
plt.annotate('max', xy=(2023, 4067), xytext=(2023, 4500), fontsize=12,
             arrowprops=dict(facecolor='black'),
             )

plt.show()
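
Hard-coded coordinates have to be updated by hand whenever the data changes. One option is to compute them from the data, as in this sketch on hypothetical toy data (the vertical offset of 2 is an arbitrary choice for these made-up values):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data with the same column names as grouped_df
toy_df = pd.DataFrame({
    "fiscal_year": [2022, 2023, 2024],
    "appl_id_count": [10, 12, 11],
})

plt.plot("fiscal_year", "appl_id_count", "r", data=toy_df)

# Locate the maximum instead of typing its coordinates by hand
max_idx = toy_df["appl_id_count"].idxmax()
max_year = int(toy_df.loc[max_idx, "fiscal_year"])
max_count = int(toy_df.loc[max_idx, "appl_id_count"])

# Place the text slightly above the point and draw an arrow down to it
annotation = plt.annotate(
    "max",
    xy=(max_year, max_count),
    xytext=(max_year, max_count + 2),
    fontsize=12,
    arrowprops=dict(facecolor="black"),
)
annotated_point = tuple(annotation.xy)

plt.close()
```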