# Exploring the RECOVER Adult Cohort on BioData Catalyst

RECOVER is a first-of-its-kind, patient-centered research initiative to understand, diagnose, treat, and prevent Long COVID. RECOVER research includes observational cohort studies, electronic health records analysis, pathobiology studies, tissue pathology studies, and clinical trials.

RECOVER studies involve thousands of people from all walks of life, hundreds of research investigators, and millions of electronic health records (EHRs). RECOVER aims to achieve the following:

* Understand the range of recovery from Long COVID and the changes it can cause in people over time.
* Define risk factors, understand the number of people getting Long COVID, and determine whether there are specific, different Long COVID types.
* Study how Long COVID changes over time and how those changes may relate to other illnesses.
* Identify possible treatments for Long COVID symptoms.


Researchers can use BioData Catalyst Powered by PIC-SURE to search terms, apply filters, build cohorts, and export the RECOVER Adult data in an analysis-ready format.

------

## Set Up

### Pre-requisites
* Python 3.6 or later
* pip, the Python package manager, which is already available on most systems with a Python interpreter installed

### Install packages
The first step to using the PIC-SURE API is to install the packages needed. The following code installs the PIC-SURE API components from GitHub, specifically:
* PIC-SURE Client
* PIC-SURE Adapter
* *BDC-PIC-SURE* Adapter

**Note that if you are using the dedicated PIC-SURE environment within the *BDC Powered by Seven Bridges* platform, the necessary packages have already been installed.**

In [None]:
# Import packages
import sys
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# BDC Powered by Terra users uncomment the following line to specify package install location
# sys.path.insert(0, r"/home/jupyter/.local/lib/python3.7/site-packages")

In [None]:
# Install PIC-SURE packages
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import PicSureClient
import PicSureBdcAdapter

## Connecting to PIC-SURE

The following is required to get access to the PIC-SURE API:
* a network URL
* a user-specific security token

The following code specifies the network URL as the *BDC Powered by PIC-SURE* URL and references the user-specific token saved as `token.txt`.

If you have not already retrieved your user-specific token, please refer to the "Get your security token" section of the `README.md` file and the `Workspace_setup.ipynb` file.

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
token_file = "token.txt"

with open(token_file, "r") as f:
    my_token = f.read().strip()  # Drop any trailing newline or spaces

bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)
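
A token file can be empty or contain stray whitespace, which may interfere with authentication. The following is a minimal sketch of one way to read and validate the file before connecting. The helper name `read_token` and the demonstration token value are hypothetical and are not part of the PIC-SURE packages.

```python
import os
import tempfile


def read_token(path):
    """Read a token from a text file, dropping surrounding whitespace."""
    with open(path, "r") as f:
        token = f.read().strip()
    if not token:
        raise ValueError(f"No token found in {path}")
    return token


# Demonstration with a temporary file and a made-up token value
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "token.txt")
    with open(demo_path, "w") as f:
        f.write("not-a-real-token\n")
    demo_token = read_token(demo_path)

print(repr(demo_token))
```

In a real workspace, the same helper could be called on `token.txt` and its return value passed to the adapter.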

## Using the RECOVER Adult Cohort

The RECOVER Adult dataset includes many variables related to Long COVID and COVID-19 symptoms. For a complete view of all RECOVER variables in PIC-SURE, you can refer to the [PIC-SURE RECOVER Data Dictionary spreadsheet](https://docs.google.com/spreadsheets/d/1A-BGTOjEgaPRG0KqSNWLuFFHMRkflSMh4Y_wYL2AGag/edit?usp=sharing). 

PIC-SURE can also be used in the coding interface to conduct searches for variables, apply filters to build cohorts, and export the data in an analysis-ready format. 

For the purposes of this example notebook, let's use variables related to:
* Postacute sequelae of SARS-CoV-2 infection (PASC)
* Headaches or head pain 

Postacute sequelae of SARS-CoV-2 infection (PASC), also known as *Long COVID*, is defined as ongoing, relapsing, or new symptoms or conditions present 30 or more days after infection. A recent publication developed a preliminary rule for defining PASC based on a score derived from the symptoms most frequently reported by people with Long COVID. A PASC score between 0 and 34 is assigned based on a person's symptoms, where a greater score indicates more PASC symptoms. The publication also defined a cutoff based on this score to identify individuals as PASC positive or PASC negative:
* PASC score < 12: PASC negative
* PASC score >= 12: PASC positive
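
The cutoff above can be written as a small helper. This is a minimal sketch: the names `PASC_CUTOFF` and `classify_pasc` are hypothetical, chosen for illustration, and the scores used are toy values rather than RECOVER data.

```python
# Hypothetical helper illustrating the PASC cutoff described above
PASC_CUTOFF = 12  # Scores at or above this value count as PASC positive


def classify_pasc(score):
    """Return 'PASC positive' or 'PASC negative' for a numeric PASC score."""
    if score >= PASC_CUTOFF:
        return "PASC positive"
    return "PASC negative"


# Toy scores for illustration only
for example_score in (0, 11, 12, 20):
    print(example_score, classify_pasc(example_score))
```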

The RECOVER biostatistics team has used this definition to derive PASC scores for the RECOVER Adult cohort, which will be used in this notebook. For more information about these PASC scores, please refer to Thaweethai et al.'s [Development of a Definition of Postacute Sequelae of SARS-CoV-2 Infection](https://jamanetwork.com/journals/jama/fullarticle/2805540).



### PASC Scores
First, let's search for variables related to `PASC score`.

In [None]:
# Search for derived PASC score
dictionary = bdc.useDictionary().dictionary() # Set up the dictionary
pasc_search = dictionary.find("pasc score")
pasc_vars = pasc_search.dataframe()
pasc_vars = pasc_vars[pasc_vars.columnmeta_study_id == "phs003463"]
pasc_vars.head()

Here, we can see that there are many PASC-related variables in the RECOVER Adult dataset. The scores we are interested in are the `PASC score at time of survey, based on definition from Thaweethai et al. (2023)`, which are the variables generated by the RECOVER biostatistics team. We can find these by filtering the `columnmeta_name` column to show rows that contain this text. 

In [None]:
# Limit to only variables with matching text in columnmeta_name
# regex=False matches the text literally, so the parentheses need no escaping
biostats_pasc_vars = pasc_vars[pasc_vars.columnmeta_name.str.contains("PASC score at time of survey, based on definition from Thaweethai et al. (2023)", regex=False)]
biostats_pasc_vars.head()
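
The pandas `str.contains` method interprets its pattern as a regular expression by default, so characters such as the parentheses in `(2023)` need care when the goal is to match text literally. The sketch below uses Python's standard `re` module on an example string to show the difference between an unescaped and an escaped pattern. The variable names are illustrative only.

```python
import re

# Example name for illustration; real names come from the dictionary search
name = "PASC score at time of survey, based on definition from Thaweethai et al. (2023)"
target = "Thaweethai et al. (2023)"

# Unescaped: "(2023)" is read as a group matching "2023", so the literal
# parentheses in the name are never matched and the search fails.
unescaped_match = re.search(target, name) is not None

# Escaped: re.escape turns every regex metacharacter into a literal character.
escaped_match = re.search(re.escape(target), name) is not None

print(unescaped_match, escaped_match)
```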

As shown above, there are derived PASC scores at different times during the data collection, including a baseline measurement and followup visits. Let's use the baseline measurement and information from the first three followup visits.

In [None]:
# Save PASC variables for baseline and first three followups
baseline_pasc = biostats_pasc_vars.HPDS_PATH[biostats_pasc_vars.varId.str.contains("baseline")].values[0]
f1_pasc = biostats_pasc_vars.HPDS_PATH[biostats_pasc_vars.varId.str.contains("followup_1_")].values[0]
f2_pasc = biostats_pasc_vars.HPDS_PATH[biostats_pasc_vars.varId.str.contains("followup_2_")].values[0]
f3_pasc = biostats_pasc_vars.HPDS_PATH[biostats_pasc_vars.varId.str.contains("followup_3_")].values[0]
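
The selections above take the first matching row with `.values[0]`, which raises an `IndexError` when nothing matches. The trailing underscore in fragments such as `followup_1_` presumably prevents accidental matches on later visits such as a tenth follow-up. The sketch below illustrates both points with plain Python records. The helper `first_path`, the `varId` values, and the paths are all made up for illustration and do not reflect the actual RECOVER variable names.

```python
def first_path(records, fragment):
    """Return the HPDS_PATH of the first record whose varId contains fragment."""
    matches = [rec["HPDS_PATH"] for rec in records if fragment in rec["varId"]]
    if not matches:
        raise LookupError(f"No variable with varId containing {fragment!r}")
    return matches[0]


# Made-up records; the follow-up 10 entry is listed first on purpose
toy_records = [
    {"varId": "toy_score_followup_10_x", "HPDS_PATH": "toy/followup_10"},
    {"varId": "toy_score_followup_1_x", "HPDS_PATH": "toy/followup_1"},
    {"varId": "toy_score_baseline_x", "HPDS_PATH": "toy/baseline"},
]

# Without the trailing underscore, follow-up 10 is picked up by mistake
loose = first_path(toy_records, "followup_1")
# With the trailing underscore, only follow-up 1 matches
strict = first_path(toy_records, "followup_1_")

print(loose, strict)
```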

### Headache / Head Pain 
Next, let's search for variables related to `head pain`.

In [None]:
# Search for headache variables
headpain_search = dictionary.find("head pain")
headpain_vars = headpain_search.dataframe()
headpain_vars = headpain_vars[headpain_vars.studyId == "phs003463"]
headpain_vars.head()

We can see that there are many variables related to head pain, such as `pain_head___around` for head pain around the time of index and `pain_head___now` for head pain at the time of the survey. Let's use the `pain_head___now` variables.

In [None]:
headpain_now_vars = headpain_vars[headpain_vars.varId.str.contains("pain_head___now")]
headpain_now_vars.head()

As shown above, the `pain_head___now` variables were collected at different times during the study, including a baseline measurement and followup visits. Let's use the baseline measurement and information from the first three followup visits.

In [None]:
# Save head pain variables for baseline and first three followups
baseline_headpain = headpain_now_vars.HPDS_PATH[headpain_now_vars.varId.str.contains("baseline")].values[0]
f1_headpain = headpain_now_vars.HPDS_PATH[headpain_now_vars.varId.str.contains("followup_1_")].values[0]
f2_headpain = headpain_now_vars.HPDS_PATH[headpain_now_vars.varId.str.contains("followup_2_")].values[0]
f3_headpain = headpain_now_vars.HPDS_PATH[headpain_now_vars.varId.str.contains("followup_3_")].values[0]

### Build a Query
Now that we have our variables selected, we can build a query. For more information on how to apply filters to a query, please refer to the `1_PICSURE_API_101` notebook. 

For this query, we require that participants have information for all selected variables: PASC scores and head pain information for the baseline and the first three followups.

In [None]:
# Build a query
authPicSure = bdc.useAuthPicSure()
pasc_headpain_query = authPicSure.query() # Initiate a query

# Add variables as a "require" 
pasc_headpain_query.require().add([baseline_pasc, f1_pasc, f2_pasc, f3_pasc, baseline_headpain, f1_headpain, f2_headpain, f3_headpain])
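
The effect of requiring every variable can be pictured with a small example: only participants with a value for each required variable remain. This is a sketch of the outcome described above using made-up rows, not of how PIC-SURE implements the filter. All names in it are hypothetical.

```python
# Toy rows where None marks a missing value (not RECOVER data)
rows = [
    {"id": 1, "pasc_baseline": 5, "headpain_baseline": True},
    {"id": 2, "pasc_baseline": None, "headpain_baseline": False},
    {"id": 3, "pasc_baseline": 14, "headpain_baseline": None},
]
required = ["pasc_baseline", "headpain_baseline"]

# Keep a row only if every required column has a value
kept = [row for row in rows if all(row[col] is not None for col in required)]
kept_ids = [row["id"] for row in kept]

print(kept_ids)
```

Only the first toy participant has values for both required columns, so the other two are dropped.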

In [None]:
# Retrieve results
results = pasc_headpain_query.getResultsDataFrame(low_memory = False)
# results.head() # Uncomment to peek at the dataframe

The export has one column for each variable added to the query and one row for each RECOVER Adult study participant. It is saved as a pandas DataFrame and can be used for analysis.

### Analysis
Let's make a visualization to quickly compare PASC scores between participants with and without head pain at each visit.

In [None]:
# Create a Box Plot

# Get data for different boxes
has_pain = True
base_nopain = results[results[baseline_headpain] != has_pain][baseline_pasc]
base_pain = results[results[baseline_headpain] == has_pain][baseline_pasc]
f1_nopain = results[results[f1_headpain] != has_pain][f1_pasc]
f1_pain = results[results[f1_headpain] == has_pain][f1_pasc]
f2_nopain = results[results[f2_headpain] != has_pain][f2_pasc]
f2_pain = results[results[f2_headpain] == has_pain][f2_pasc]
f3_nopain = results[results[f3_headpain] != has_pain][f3_pasc]
f3_pain = results[results[f3_headpain] == has_pain][f3_pasc]


# Set up boxplot
fig, ax = plt.subplots()

# Baseline boxplots
bp = ax.boxplot([base_nopain, base_pain], positions = [1, 2], widths = 0.6, patch_artist=True)
for patch, color in zip(bp['boxes'], ['#1a568c', '#c0143c']):
    patch.set_facecolor(color)

# Followup 1 boxplots
bp = ax.boxplot([f1_nopain, f1_pain], positions = [4, 5], widths = 0.6, patch_artist=True)
for patch, color in zip(bp['boxes'], ['#1a568c', '#c0143c']):
    patch.set_facecolor(color)

# Followup 2 boxplots
bp = ax.boxplot([f2_nopain, f2_pain], positions = [7, 8], widths = 0.6, patch_artist=True)
for patch, color in zip(bp['boxes'], ['#1a568c', '#c0143c']):
    patch.set_facecolor(color)

# Followup 3 boxplots
bp = ax.boxplot([f3_nopain, f3_pain], positions = [10, 11], widths = 0.6, patch_artist=True)
for patch, color in zip(bp['boxes'], ['#1a568c', '#c0143c']):
    patch.set_facecolor(color)

# Settings and labels for aesthetics
ax.set_ylim(-1, 42) # Set Y axis
ax.set_xticks([1.5, 4.5, 7.5, 10.5]) # Set X axis ticks and labels
ax.set_xticklabels(['Baseline', 'Followup 1', 'Followup 2', 'Followup 3'])
ax.legend([bp["boxes"][0], bp["boxes"][1]], ['No Headpain', 'Headpain'], loc='upper right') # Add legend for box colors
ax.set_ylabel("PASC Score")
ax.set_title("RECOVER Adult PASC Scores and Headpain")

plt.show()
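
A numeric summary can complement the boxplot. The sketch below computes the median score per head pain group with Python's standard `statistics` module. The data are toy values invented for illustration, not RECOVER results, and a similar summary could likely be produced from the exported DataFrame with a pandas `groupby`.

```python
from statistics import median

# Toy (has_headpain, pasc_score) pairs for illustration only
toy_data = [
    (True, 14), (True, 9), (True, 17),
    (False, 3), (False, 6), (False, 1), (False, 8),
]

# Collect scores for each head pain group
scores_by_group = {True: [], False: []}
for has_headpain, score in toy_data:
    scores_by_group[has_headpain].append(score)

# Median score per group
medians = {group: median(values) for group, values in scores_by_group.items()}

print(medians)
```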