# Accessing *NHLBI BioData Catalyst® (BDC)* Harmonized variables using the Python PIC-SURE API

This tutorial notebook demonstrates how to query and work with the *BDC* cross-study harmonized variables using the Python PIC-SURE API. For a more step-by-step introduction to the Python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.

**Before running this notebook, please be sure to review the "Get your security token" documentation, which exists in the [`README.md` file](../README.md). It explains how to get a security token, which is mandatory to use the PIC-SURE API.**

To set up your token file, be sure to run the [`Workspace_setup.ipynb` file](./Workspace_setup.ipynb).

 -------   

# Environment set-up

## System Requirements
- Python 3.6 or later
- pip, the Python package manager, already available on most systems with a Python interpreter installed

## Install Packages

**Note that if you are using the dedicated PIC-SURE environment within the *BDC Powered by Seven Bridges* platform, the necessary packages have already been installed.**

In [None]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
# BDC Powered by Terra users uncomment the following line to specify package install location
# sys.path.insert(0, r"/home/jupyter/.local/lib/python3.7/site-packages")

In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import PicSureClient
import PicSureBdcAdapter

## Connecting to a PIC-SURE network

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
token_file = "token.txt"

with open(token_file, "r") as f:
    my_token = f.read().strip()  # drop any trailing newline or spaces from the token file
    
bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)
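
A token file saved from a text editor or the clipboard may end with a newline or be empty, and either case is easier to catch before connecting. The sketch below shows one way to read the file defensively. The helper name `read_token` and the example token value are hypothetical and not part of the PIC-SURE API; a temporary file stands in for `token.txt`.

```python
from pathlib import Path
import tempfile


def read_token(path):
    """Return the token stored at `path` with surrounding whitespace removed.

    Hypothetical helper for illustration; not part of the PIC-SURE client.
    """
    token = Path(path).read_text().strip()
    if not token:
        raise ValueError(f"Token file {path!s} is empty")
    return token


# Demonstration with a throwaway file standing in for token.txt
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "token.txt"
    demo_file.write_text("example-token-value\n")  # made-up token value
    demo_token = read_token(demo_file)
```

The cleaned string could then be passed to `PicSureBdcAdapter.Adapter` in place of the raw file contents.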

## About Harmonized Variables

The data harmonization effort aims to produce "a high quality, lasting resource of publicly available and thoroughly documented harmonized phenotype variables". The TOPMed Data Coordinating Center collaborates with Working Group members and phenotype experts on this endeavor. So far, 44 harmonized variables are accessible through the PIC-SURE API (in addition to the age at which each variable value has been collected for a given subject).

The following phenotypes are included as harmonized variables:

- Key NHLBI phenotypes    
    - Blood cell counts
    - VTE
    - Atherosclerosis-related phenotypes
    - Lipids
    - Blood pressure


- Common covariates
    - Height
    - Weight
    - BMI
    - Smoking status
    - Race/ethnicity

More information about the variable harmonization process is available at https://www.nhlbiwgs.org/sites/default/files/pheno_harmonization_guidelines.pdf

# Working with the DCC Harmonized Variables in PIC-SURE

## 1. Identifying harmonized variables of interest

First, let's explore what harmonized variables are available in PIC-SURE by searching for the keyword `harmonized`.

In [None]:
harmonized_dictionary = bdc.useDictionary().dictionary().find('harmonized')
harmonized_dataframe = harmonized_dictionary.dataframe()
print(harmonized_dataframe.shape)
harmonized_dataframe.head()

We can see that these variables all share a single study ID, the one assigned to the DCC Harmonized dataset.

In [None]:
harmonized_dataframe.studyId.unique()

We can also see that although there are only 44 DCC Harmonized variables, we have found 125 'harmonized' variables in PIC-SURE. This is because our results include subject IDs and 'metadata variables', which record the age of the subject when a given measure was taken or the units of a variable. Let's exclude these.

In [None]:
# Discarding "subject ID",
# the variables which only indicate the age of the subject when a given harmonized variable was measured,
# and variables which indicate the units of a given harmonized variable
vars_to_remove = harmonized_dataframe.columnmeta_name.str.contains("age_at|SUBJECT_ID|unit_")

# Use ~ (logical NOT) to invert the boolean mask; unary minus is not supported on boolean Series
harmonized_dataframe = harmonized_dataframe[~vars_to_remove]

print(harmonized_dataframe.shape)
harmonized_dataframe.head()
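
The metadata filter above can be tried on a toy table without a PIC-SURE connection. This is a minimal sketch: the variable names are invented for illustration and only the column name `columnmeta_name` is taken from the notebook. The pattern passed to `str.contains` is treated as a regular expression, so `|` means "or", and `~` inverts the resulting boolean mask.

```python
import pandas as pd

# Hypothetical variable names, invented for illustration
toy = pd.DataFrame({"columnmeta_name": [
    "SUBJECT_ID",
    "bmi_baseline_1",
    "age_at_bmi_baseline_1",
    "unit_bmi_baseline_1",
    "annotated_sex_1",
]})

# True for subject IDs, age-at-measurement variables, and unit variables
is_metadata = toy.columnmeta_name.str.contains("age_at|SUBJECT_ID|unit_")

# Keep only the rows where the mask is False
kept = toy[~is_metadata]
kept_names = kept.columnmeta_name.tolist()
```

Only the two non-metadata names remain in `kept_names`.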

In [None]:
harmonized_dataframe.columns

In [None]:
harmonized_dataframe[['varId', 'description', 'HPDS_PATH', 'columnmeta_hpds_path']]

We can now see our 44 harmonized variables. This is in line with the [DCC Harmonized Variables documentation](https://github.com/UW-GAC/topmed-dcc-harmonized-phenotypes). 

## 2. Selecting variables and retrieving data from the database

Let's say we are interested in the subset of Harmonized Variables pertaining to patient demographics. 

One way to do this is to select variables by the **datatable** or **variable group** they belong to. 

Here we filter on the `columnmeta_var_group_id` column. We can see the values of this column and how many variables are in each group:

In [None]:
harmonized_dataframe.columnmeta_var_group_id.value_counts()

Since we are interested in patient demographics, we filter our dataframe to include all harmonized variables which are part of the `demographic` variable group or data table. We should be left with 6 variables.

In [None]:
demographic_vars = harmonized_dataframe.columnmeta_var_group_id.str.contains("demographic")

demographic_harmonized_dataframe = harmonized_dataframe[demographic_vars]

demographic_harmonized_dataframe
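
Note that `str.contains` performs a substring match, so it would also select any group whose ID merely contains the word `demographic`. The toy example below contrasts it with an exact comparison. The variable names and the group `demographic_extra` are hypothetical; whether such a group exists in the real dictionary is not something this notebook shows.

```python
import pandas as pd

# Hypothetical dictionary rows, invented for illustration
toy = pd.DataFrame({
    "columnmeta_name": ["var_a", "var_b", "var_c", "var_d"],
    "columnmeta_var_group_id": ["demographic", "lipids",
                                "demographic", "demographic_extra"],
})

# Number of variables in each group
group_sizes = toy.columnmeta_var_group_id.value_counts()

# Substring match: also picks up "demographic_extra"
substring_match = toy[toy.columnmeta_var_group_id.str.contains("demographic")]

# Exact match: only rows whose group ID is exactly "demographic"
exact_match = toy[toy.columnmeta_var_group_id == "demographic"]
```

Comparing the row count of the filtered dataframe with the `value_counts()` output is a quick way to confirm that the filter selected what was intended.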

## Query PIC-SURE for participant-level data for harmonized variables of interest

In [None]:
# Initialize a query
authPicSure = bdc.useAuthPicSure()
demographic_query = authPicSure.query()
vars_of_interest = demographic_harmonized_dataframe['columnmeta_HPDS_PATH'].tolist()
vars_of_interest.append('\\DCC Harmonized data set\\demographic\\race_1\\')
demographic_query.anyof().add(vars_of_interest)
demographic_results = demographic_query.getResultsDataFrame()
demographic_results.head()
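
Because `race_1` is appended by hand, the same path could end up in `vars_of_interest` twice if it was already selected by the filter. It is not clear from this notebook whether the query object tolerates duplicates, so the sketch below shows one way to drop them while preserving order. It uses plain Python only and does not call the PIC-SURE API. Note that in a regular Python string each `\\` stands for a single backslash in the variable path.

```python
# Example paths in the same format as the notebook's variable paths
paths = [
    "\\DCC Harmonized data set\\demographic\\annotated_sex_1\\",
    "\\DCC Harmonized data set\\demographic\\race_1\\",
]

# Appending a path that is already present creates a duplicate
paths.append("\\DCC Harmonized data set\\demographic\\race_1\\")

# dict.fromkeys keeps the first occurrence of each key, in insertion order
unique_paths = list(dict.fromkeys(paths))
```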

## Visualizing our sex and race harmonized variables across study cohorts

In [None]:
import matplotlib.pyplot as plt

### Male:Female ratio across study cohorts

Below, we wrangle the data to calculate the male/female sex ratio per study cohort and prepare our data for plotting.


In [None]:
grouped_sex_res = demographic_results.groupby("\\DCC Harmonized data set\\demographic\\subcohort_1\\")
grouped_sex_counts = grouped_sex_res["\\DCC Harmonized data set\\demographic\\annotated_sex_1\\"].value_counts()
plot_sex_df = grouped_sex_counts.unstack()
plot_sex_df['mf_ratio'] = plot_sex_df.Male / plot_sex_df.Female
plot_sex_df['mf_ratio'] = plot_sex_df['mf_ratio'].fillna(0)
plot_sex_df
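
One subtlety of the calculation above: a cohort missing one sex entirely gets a `NaN` count after `unstack()`, so its ratio is `NaN` and is then reported as 0 by `fillna(0)`. That makes a cohort with only male participants look the same as one with only female participants. The toy example below, with invented cohort labels, illustrates this and shows an alternative that fills the counts first. It is a sketch for comparison, not a recommendation from the original authors.

```python
import pandas as pd

# Toy data: cohort B has only males, cohort C has only females
toy = pd.DataFrame({
    "subcohort": ["A", "A", "A", "B", "B", "C"],
    "sex": ["Male", "Female", "Female", "Male", "Male", "Female"],
})

# Count each sex per cohort; missing combinations become NaN
counts = toy.groupby("subcohort")["sex"].value_counts().unstack()

# Same order of operations as in the notebook: divide, then fill NaN with 0
ratio_as_in_notebook = (counts.Male / counts.Female).fillna(0)

# Alternative: fill missing counts with 0 first, then divide
filled = counts.fillna(0)
ratio_counts_first = filled.Male / filled.Female
```

With the second approach the all-male cohort yields an infinite ratio instead of 0, which may need separate handling before plotting.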

In [None]:
ax = plot_sex_df['mf_ratio'].plot(kind='bar', 
                                  title='Male/Female ratio across study cohorts',
                                  figsize = [12,6])
ax.set_xlabel("Study Cohort")
ax.set_ylabel("Male/Female ratio")

### Participant race percentages across study cohorts

Below, we wrangle the data to calculate the percentage of participants in given racial categories per study cohort and prepare our data for plotting.


In [None]:
demographic_results

In [None]:
grouped_race_res = demographic_results.groupby("\\DCC Harmonized data set\\demographic\\race_1\\")
grouped_race_counts = grouped_race_res["\\DCC Harmonized data set\\demographic\\subcohort_1\\"].value_counts()
grouped_race_counts

In [None]:
grouped_race_res = demographic_results.groupby("\\DCC Harmonized data set\\demographic\\subcohort_1\\")
grouped_race_counts = grouped_race_res["\\DCC Harmonized data set\\demographic\\race_1\\"].value_counts()
plot_race_df = grouped_race_counts.unstack()
plot_race_df = plot_race_df.fillna(0)

# Race categories included in the percentage calculation
race_categories = ['Asian',
                   'Black or African American',
                   'Native Hawaiian or other Pacific Islander',
                   'More than one race',
                   'Other race',
                   'White or Caucasian']

# Convert each category count to a percentage of the per-cohort total
total_n_race = plot_race_df[race_categories].sum(axis=1)
plot_race_df[race_categories] = plot_race_df[race_categories].div(total_n_race, axis=0) * 100

plot_race_df
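
The row-wise normalization performed above can be checked on a small table where the expected percentages are easy to work out by hand. In this sketch the cohort labels and counts are invented and only three categories are used. `DataFrame.div` with `axis=0` divides each row by the matching entry of the totals Series.

```python
import pandas as pd

# Toy per-cohort counts, invented for illustration
toy_counts = pd.DataFrame(
    {"Asian": [10, 0], "White or Caucasian": [30, 5], "Other race": [10, 15]},
    index=["cohort_A", "cohort_B"],
)

# Total number of participants per cohort (one value per row)
row_totals = toy_counts.sum(axis=1)

# Divide every row by its own total, then scale to percentages
toy_pct = toy_counts.div(row_totals, axis=0) * 100
```

Each row of `toy_pct` sums to 100, which is a useful sanity check before drawing a stacked bar chart.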

In [None]:
ax = plot_race_df.plot(kind = 'bar', 
                       stacked = True, 
                       title = 'Race percentage distribution across study cohorts',
                       figsize = [12,6])

ax.legend(loc='center left',bbox_to_anchor=(1.0, 0.5))
ax.set_xlabel("Study Cohort")
ax.set_ylabel("Race percentage")