# Impact of the 21st Century Cures Act on Stimulating Collaboration
________________________________________________________________________

## 1. Building a Network of 21st Century Cures Act Funded PIs

**About:** 

This notebook is provided to meet the data availability requirements for a scientific publication.

The data used in this notebook is provided for the same purpose. It is derived from NIH's internal datasets and from Digital Science's Dimensions platform (https://app.dimensions.ai/), using Dimensions metadata as of March 2024. Access was granted under a license agreement with the National Cancer Institute. Researchers interested in exploring the data further should visit the Dimensions platform website.

**Notebook Goals:**

This notebook introduces two tables that together form a network of NCI supported extramural investigators, which we call the "NCI PI Network". This notebook demonstrates how fields in the tables can be used to produce a subset of the NCI PI Network that represents the PIs funded through the 21st Century Cures Act, referred to as the "Cures Act PI Network" in this notebook and subsequent ones. 

This notebook is part of a series of notebooks on evaluating the impact of the 21st Century Cures Act on stimulating collaboration in the cancer research community.

**Key Definitions:**
- NCI supported: indicates NCI extramural funding (grants).
- 21st Century Cures Act supported: indicates extramural funding through the 21st Century Cures Act.
- NCI PI Network: a network of investigators who received qualifying funding from NCI in FY 2017-2023.
- Cures Act PI Network: a network of investigators who were funded through the 21st Century Cures Act in FY 2017-2023.
- PI: an NCI supported extramural investigator. These form the nodes of the network. 
- Collaboration: A pairwise collaboration between two NCI supported PIs in the network. These form the edges of the network. Various edge count columns are available that describe the type and number of collaborations.
- Collaborative Event: A measurable collaboration between NCI supported extramural investigators. Here, these are defined as NIH base projects OR publications (peer-reviewed articles OR conference proceedings/abstracts).

**Required Packages:**
- Pandas

**Notebook Input Files:**
- Input Filepath 1: "../data/collaboration_network/nodes.csv"
    - The nodes table for the NCI PI Network
- Input Filepath 2: "../data/collaboration_network/agg_edges.csv"
    - The aggregated edge table for the NCI PI Network

**Notebook Output Files:**

We recommend saving the tables produced in this notebook with the following paths and filenames. Subsequent notebooks will use these tables for analysis.

- Output Filepath 1: "../data/cures_nodes.csv"
    - A nodes table containing NIH PIs who received 21st Century Cures Act funding in FY 2017-2023.
- Output Filepath 2: "../data/cures_agg_edges.csv"
    - An aggregated edge table containing publication and project collaborations among NIH PIs who received 21st Century Cures Act funding in FY 2017-2023.

## Import Packages

In [None]:
import pandas as pd

## Read in Data

In [None]:
# Read in the NCI PI Network node data
# We skip the first row, as the data availability statement is found there.
nodes_df = pd.read_csv("../data/collaboration_network/nodes.csv", skiprows=1)
print(nodes_df.shape)

nodes_df.head()

In [None]:
# Read in the NCI PI Network aggregated edge data 
# We skip the first row, as the data availability statement is found there.
agg_edges_df = pd.read_csv("../data/collaboration_network/agg_edges.csv", skiprows=1)
print(agg_edges_df.shape)

agg_edges_df.head()
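
To illustrate why `skiprows=1` is used when reading both files, here is a minimal sketch that parses a small in-memory CSV whose first line is free text and whose second line is the real header. The statement text, the `ppid` column name, and the values are made up for illustration and are not taken from the actual data files.

```python
import io

import pandas as pd

# Hypothetical file contents: line 1 is a statement, line 2 is the real header.
raw = (
    "Data availability statement (illustrative text only)\n"
    "ppid,moonshot_pi\n"
    "1,y\n"
    "2,n\n"
)

# skiprows=1 drops the statement line so that the header row is parsed correctly.
demo_df = pd.read_csv(io.StringIO(raw), skiprows=1)

print(list(demo_df.columns))
print(demo_df.shape)
```

Without `skiprows=1`, pandas would treat the statement line as the header, and the column names would be wrong.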

## Dataset Development

### Understanding the Available Columns

Each row of the nodes table is unique for PPID, an identifier used by NIH. The columns of the nodes table describe the projects that qualified the PI for inclusion in the network, the publications and NIH projects the PI is associated with, and some attributes specific to PIs who received 21st Century Cures Act funding in FY 2017-2023.

The aggregated edge table summarizes pairwise collaborations between PIs in the network. Each row is unique for a combination of source PPID and target PPID and represents one edge of the network. Various count columns can be used to understand these edges.

For more information, please see the accompanying data dictionary. 
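
The uniqueness properties described above can be checked with a small helper. The following is a sketch on toy data, not part of the original analysis. The column names `ppid`, `source_ppid`, and `target_ppid` are assumptions, so consult the data dictionary for the actual names before applying it to `nodes_df` or `agg_edges_df`.

```python
import pandas as pd


def is_unique_key(df: pd.DataFrame, key_cols: list) -> bool:
    """Return True if no two rows share the same values in key_cols."""
    return not df.duplicated(subset=key_cols).any()


# Toy tables; the column names below are hypothetical.
toy_nodes = pd.DataFrame({"ppid": [1, 2, 3]})
toy_edges = pd.DataFrame(
    {"source_ppid": [1, 1, 2], "target_ppid": [2, 3, 3]}
)

# One row per PI in the nodes table, one row per PI pair in the edge table.
nodes_ok = is_unique_key(toy_nodes, ["ppid"])
edges_ok = is_unique_key(toy_edges, ["source_ppid", "target_ppid"])

print(nodes_ok, edges_ok)
```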

In [None]:
# See the columns in the nodes table.
nodes_df.info()

In [None]:
# See the columns in the agg_edges table.
agg_edges_df.info()

### Subset the Nodes Table to PIs Who Received 21st Century Cures Act Funding

In [None]:
# 21st Century Cures Act Funded PIs are identified using the moonshot_pi column
nodes_df["moonshot_pi"].value_counts(dropna=False)

In [None]:
# Make a copy and subset to where moonshot_pi == "y"
cures_nodes_df = nodes_df[nodes_df["moonshot_pi"] == "y"].copy()

print(cures_nodes_df.shape)

# Preview
cures_nodes_df.head()

### Subset Aggregated Edges to Collaborations Between PIs Who Received 21st Century Cures Act Funding

In [None]:
# Edges between 21st Century Cures Act Funded PIs are identified using the ms_collab column
agg_edges_df["ms_collab"].value_counts(dropna=False)

In [None]:
# Make a copy and subset to where ms_collab == "y"
cures_agg_edges_df = agg_edges_df[agg_edges_df["ms_collab"] == "y"].copy()

print(cures_agg_edges_df.shape)

# Preview
cures_agg_edges_df.head()

In [None]:
# Both the source and target PIs are 21st Century Cures Act Funded PIs in the resulting table
print(cures_agg_edges_df["source_moonshot_pi"].value_counts(dropna=False))
print(cures_agg_edges_df["target_moonshot_pi"].value_counts(dropna=False))
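
The visual check above can also be written as a programmatic assertion. This is a sketch on toy data. It uses the `ms_collab`, `source_moonshot_pi`, and `target_moonshot_pi` columns from this notebook, and it assumes that `"n"` is the value used for rows that are not flagged, which should be confirmed against the `value_counts` output.

```python
import pandas as pd

# Toy edge table; the "n" values are an assumption about how non-flagged rows are coded.
toy_edges = pd.DataFrame(
    {
        "ms_collab": ["y", "n", "n"],
        "source_moonshot_pi": ["y", "y", "n"],
        "target_moonshot_pi": ["y", "n", "n"],
    }
)

# Keep only edges flagged as collaborations between Cures Act funded PIs.
subset = toy_edges[toy_edges["ms_collab"] == "y"].copy()

# Every remaining edge should have both endpoints flagged as Cures Act funded.
both_flagged = (subset["source_moonshot_pi"] == "y") & (
    subset["target_moonshot_pi"] == "y"
)
all_cures = bool(both_flagged.all())

print(all_cures)
```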

## Save Outputs

We recommend saving the two tables with the following filepaths, because subsequent notebooks read their inputs from them. If you save to a different path or with a different file name, be sure to update the corresponding input filepaths in the other notebooks.
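
If you do choose a different output location, note that `to_csv` does not create missing directories. The following sketch shows one way to make sure the parent directory exists before saving. The helper name `ensure_parent_dir` is hypothetical, and the example runs in a temporary directory so that nothing is written to `../data`.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(filepath: str) -> Path:
    """Create the parent directory of filepath if it does not exist yet."""
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstrated in a temporary directory; substitute your own output path in practice.
with tempfile.TemporaryDirectory() as tmp:
    out_path = ensure_parent_dir(str(Path(tmp) / "data" / "cures_nodes.csv"))
    parent_exists = out_path.parent.is_dir()

print(parent_exists)
```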

In [None]:
# Nodes
cures_nodes_df.to_csv("../data/cures_nodes.csv", index=False)

In [None]:
# Aggregated Edges
cures_agg_edges_df.to_csv("../data/cures_agg_edges.csv", index=False)