# Impact of the 21st Century Cures Act on Stimulating Collaboration
________________________________________________________________________

## 3. Analyzing the Frequency of Collaborations

**About:** 

This notebook and the data it uses are provided to meet the data availability requirements for a scientific publication.

Data is derived from NIH's internal datasets and from Digital Science's Dimensions platform, available at https://app.dimensions.ai/. The Dimensions data is based on metadata as of March 2024. Access was granted under a license agreement with the National Cancer Institute. Researchers interested in exploring the data further should visit the Dimensions platform website.

**Notebook Goals:**

This notebook evaluates whether the frequency of collaborative events differs between NCI-funded PIs and 21st Century Cures Act PIs. It provides one form of evidence for answering the following research question: Did the number of collaborations among 21st Century Cures Act-supported investigators increase?

This notebook is part of a series of notebooks on evaluating the impact of the 21st Century Cures Act on stimulating collaboration in the cancer research community.

**Key Definitions:**
- NCI PI Network: a network of investigators who received qualifying funding from NCI in FY 2017-2023.
- Cures Act PI Network: a network of investigators who were funded through the 21st Century Cures Act in FY 2017-2023.
- Collaboration: A pairwise collaboration between two NCI-supported PIs in the network. These form the edges of the network. Various edge count columns are available that describe the type and number of collaborations.
- Collaborative Event: A measurable collaboration between NCI-supported extramural investigators. Here, these are defined as NIH base projects OR publications (peer-reviewed articles OR conference proceedings/abstracts).

**Required Packages:**
- Pandas

**Notebook Input Files:**

This notebook assumes you used the filepaths recommended in Notebook 1. If you did not, be sure to change the path to the cures_agg_edges.csv file to the filename and location where you saved the Cures Act PI Network aggregated edge table.

- Input Filepath 1: "../data/collaboration_network/agg_edges.csv"
    - The aggregated edge table for the NCI PI Network
- Input Filepath 2: "../data/cures_agg_edges.csv"
    - The aggregated edge table for the Cures Act PI Network

**Notebook Output Files:**

No outputs are generated in this notebook.

## Import Packages

In [None]:
import pandas as pd

## Functions

In [None]:
def add_volume_bin(row):
    """
    Take a row of a DataFrame containing the total number of collaborations
    between a PI pair and return a collaboration frequency category of "1", "2", or "3+".
    
    The function assumes it is applied to an aggregated edge table. A pair only appears
    in that table if it has at least one collaborative event, so totals below 1 are not expected.
    
    Parameters:
    -----------
    row: Pandas DataFrame row (or dictionary)
        A row from the aggregated edge table that has a "n_tot_collabs" field
    
    Returns:
    --------
    bin_cat: string
        A string indicating a categorical label of 1, 2, or 3+
    
    """
    total = row["n_tot_collabs"]
    
    # Assign bin category
    if total == 1:
        bin_cat = "1"
    elif total == 2:
        bin_cat = "2"
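    else:
        # Every other value lands here, so "3+" is only accurate when total >= 3.
        # This relies on the aggregated edge table never holding totals below 1.
        bin_cat = "3+"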

    return bin_cat

In [None]:
# Demonstrate how the function works
print(add_volume_bin({"n_tot_collabs": 1}))  # prints 1
print(add_volume_bin({"n_tot_collabs": 2}))  # prints 2
print(add_volume_bin({"n_tot_collabs": 4}))  # prints 3+
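
The same binning can also be expressed without a row-wise `apply`. The sketch below is an optional alternative, not the method used in this notebook: it uses `pd.cut` on a small made-up table (`toy_edges` and its values are hypothetical) to assign the 1, 2, and 3+ categories in one vectorized step.

```python
import pandas as pd

# Hypothetical toy data standing in for the aggregated edge table
toy_edges = pd.DataFrame({"n_tot_collabs": [1, 2, 3, 7]})

# Right-closed bins: (0, 1] -> "1", (1, 2] -> "2", (2, inf] -> "3+"
toy_edges["count_category"] = pd.cut(
    toy_edges["n_tot_collabs"],
    bins=[0, 1, 2, float("inf")],
    labels=["1", "2", "3+"],
).astype(str)

print(toy_edges)
```

One difference to keep in mind: `add_volume_bin` labels any value other than 1 or 2 as 3+, while `pd.cut` leaves values outside the bins (zero, negative, or missing) unlabeled, so the two only agree when every total is at least 1.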

## Read in Data

In [None]:
# Read in the NCI PI Network aggregated edge data 
# We skip the first row, as the data availability statement is found there.
agg_edges_df = pd.read_csv("../data/collaboration_network/agg_edges.csv", skiprows=1)
print(agg_edges_df.shape)

agg_edges_df.head()

In [None]:
# Read in the Cures Act PI Network aggregated edge data 
cures_agg_edges_df = pd.read_csv("../data/cures_agg_edges.csv")
print(cures_agg_edges_df.shape)

cures_agg_edges_df.head()

## Dataset Development

### Calculate Collaboration Frequency for the NCI PI Network

In [None]:
# First, preview and understand the data
# n_tot_collabs is the sum of n_pub_collabs and n_proj_collabs
agg_edges_df[["source", "target", "n_tot_collabs", "n_pub_collabs", "n_proj_collabs"]].head()
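
Two properties of the aggregated edge table matter for the steps that follow: the total column should equal the sum of the publication and project counts, and each source-target pair should appear only once. A minimal sketch of how both could be checked is shown here on a made-up table (`toy_edges`, the PI labels, and the counts are hypothetical); the same expressions could be pointed at `agg_edges_df`.

```python
import pandas as pd

# Hypothetical toy rows mirroring the columns previewed above
toy_edges = pd.DataFrame({
    "source": ["pi_a", "pi_a", "pi_b"],
    "target": ["pi_b", "pi_c", "pi_c"],
    "n_pub_collabs": [1, 0, 4],
    "n_proj_collabs": [0, 2, 1],
    "n_tot_collabs": [1, 2, 5],
})

# Rows where the total does not equal publications + projects
expected_total = toy_edges["n_pub_collabs"] + toy_edges["n_proj_collabs"]
mismatches = toy_edges[toy_edges["n_tot_collabs"] != expected_total]

# Number of repeated source-target pairs
n_duplicate_pairs = int(toy_edges.duplicated(subset=["source", "target"]).sum())

print(len(mismatches), n_duplicate_pairs)
```

If either number is nonzero on the real data, the category counts computed later would not correspond to one row per PI pair.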

In [None]:
# Build a new DataFrame and add a count_category column by applying the function defined at the start of the notebook
agg_edges_df_cp = agg_edges_df.copy()

agg_edges_df_cp["count_category"] = agg_edges_df_cp.apply(add_volume_bin, axis=1)

# Preview
agg_edges_df_cp[["source", "target", "n_tot_collabs", "count_category"]].head()

In [None]:
# Now, use a group by to see how many pairs fall into each count_category
# Because the data is unique for source-target pairs, we can count the size of the resulting groups
# to know how many pairs there are per category
bin_counts_nci = agg_edges_df_cp.groupby("count_category").size().reset_index(name="count")

# See the data
bin_counts_nci

In [None]:
# Add a percentage column (stored as a fraction of all pairs, between 0 and 1)
bin_counts_nci["percentage"] = bin_counts_nci["count"] / bin_counts_nci["count"].sum()

# See the data
bin_counts_nci
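
As a cross-check on the count and percentage columns, the share of pairs per category can also be computed in one call with `value_counts(normalize=True)`. This is a sketch on made-up labels (`toy_categories` is hypothetical), not a step in the original analysis.

```python
import pandas as pd

# Hypothetical category labels for four PI pairs
toy_categories = pd.Series(["1", "1", "2", "3+"], name="count_category")

# Fraction of pairs in each category, ordered by category label
shares = toy_categories.value_counts(normalize=True).sort_index()

print(shares)
```

Applied to the `count_category` column of the real table, the resulting fractions should match the percentage column above.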

### Calculate Collaboration Frequency for the Cures Act PI Network

In [None]:
# Build a new DataFrame and add a count_category column by applying the function defined at the start of the notebook
cures_agg_edges_df_cp = cures_agg_edges_df.copy()

cures_agg_edges_df_cp["count_category"] = cures_agg_edges_df_cp.apply(add_volume_bin, axis=1)

# Preview
cures_agg_edges_df_cp[["source", "target", "n_tot_collabs", "count_category"]].head()

In [None]:
# Again, use a group by to see how many pairs fall into each count_category
bin_counts_cures = cures_agg_edges_df_cp.groupby("count_category").size().reset_index(name="count")

# See the data
bin_counts_cures

In [None]:
# Add a percentage column (stored as a fraction of all pairs, between 0 and 1)
bin_counts_cures["percentage"] = bin_counts_cures["count"] / bin_counts_cures["count"].sum()

# See the data
bin_counts_cures
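
To compare the two networks directly, the two summary tables can be placed side by side. The sketch below assumes tables shaped like `bin_counts_nci` and `bin_counts_cures`, but uses made-up counts (`toy_nci`, `toy_cures`, and the `percentage_diff` column are hypothetical names introduced for illustration and do not reflect the study's results).

```python
import pandas as pd

# Hypothetical summary tables with the same columns as the ones built above
toy_nci = pd.DataFrame({"count_category": ["1", "2", "3+"], "count": [50, 30, 20]})
toy_cures = pd.DataFrame({"count_category": ["1", "2", "3+"], "count": [10, 5, 10]})

for df in (toy_nci, toy_cures):
    df["percentage"] = df["count"] / df["count"].sum()

# Outer merge keeps a category even if it is missing from one network
comparison = toy_nci.merge(
    toy_cures, on="count_category", how="outer", suffixes=("_nci", "_cures")
)

# Positive values mean the category is more common in the Cures Act network
comparison["percentage_diff"] = (
    comparison["percentage_cures"] - comparison["percentage_nci"]
)

print(comparison)
```

A difference in shares is descriptive only. Whether it is statistically meaningful would need a separate test, which this notebook does not perform.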