# Accessing BioData Catalyst harmonized variables using the Python PIC-SURE API

This tutorial notebook demonstrates how to query and work with the BioData Catalyst cross-study harmonized variables using the Python PIC-SURE API. For a more step-by-step introduction to the Python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.

**Before running this notebook, please review the "Get your security token" section of the NHLBI_BioData_Catalyst [README.md file](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token). It explains how to obtain a security token, which is required to access the databases.**

 -------   

# Environment set-up

### System requirements
- Python 3.6 or later
- pip package manager
- bash interpreter

### Installation of external dependencies

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureClient
import PicSureBdcAdapter

from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

In [None]:
print("NB: This Jupyter Notebook was written using the following PIC-SURE API versions:\n- PicSureBdcAdapter: 1.0.0\n- PicSureClient: 1.1.0")
print("The installed PIC-SURE API libraries versions:\n- PicSureBdcAdapter: {0}\n- PicSureClient: {1}".format(PicSureBdcAdapter.__version__, PicSureClient.__version__))

## Connecting to a PIC-SURE network

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [None]:
with open(token_file, "r") as f:
    my_token = f.read()
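
A token file saved from a text editor often ends with a newline, which can cause problems once the token is sent in a request header. The sketch below shows one defensive way to load it. `read_token` is a hypothetical helper written for this illustration and is not part of the PIC-SURE libraries.

```python
from pathlib import Path


def read_token(path):
    """Return the token stored in `path`, without surrounding whitespace.

    Hypothetical helper for illustration; not part of the PIC-SURE API.
    """
    token = Path(path).read_text().strip()
    if not token:
        # Fail early with a clear message rather than at connection time.
        raise ValueError(f"Token file {path!s} is empty")
    return token
```

With this helper defined, `my_token = read_token(token_file)` could replace the cell above.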

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

## Harmonized Variables

The data harmonization effort aims to produce "a high quality, lasting resource of publicly available and thoroughly documented harmonized phenotype variables". The TOPMed Data Coordinating Center collaborates with Working Group members and phenotype experts on this endeavor. So far, 44 harmonized variables are accessible through the PIC-SURE API (in addition to the age at which each variable value was collected for a given subject).

The following phenotypes are included as harmonized variables:

- Key NHLBI phenotypes    
    - Blood cell counts
    - VTE
    - Atherosclerosis-related phenotypes
    - Lipids
    - Blood pressure


- Common covariates
    - Height
    - Weight
    - BMI
    - Smoking status
    - Race/ethnicity

More information about the variable harmonization process is available at https://www.nhlbiwgs.org/sites/default/files/pheno_harmonization_guidelines.pdf

### 1. Retrieving the variable dictionary from the HPDS database

Here we retrieve the harmonized variables information by searching the dictionary for the keyword `Harmonized`.

In [None]:
harmonized_dic = resource.dictionary().find("Harmonized").DataFrame()

In [None]:
pd.set_option("display.max_rows", 50)

In [None]:
%%capture
multiIndexdic = get_multiIndex_variablesDict(harmonized_dic)
multiIndexdic_sub = multiIndexdic.loc[~multiIndexdic["simplified_name"].str.contains(r"^[Aa]ge|SUBJECT_ID", regex=True), :]

In [None]:
print(multiIndexdic.shape)
print(multiIndexdic_sub.shape)

Overall, there are 81 harmonized variables. After discarding "subject ID" and the variables that only indicate the age of the subject when a given harmonized variable was measured, 44 remain.
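
To make the filter used for this count concrete, the sketch below applies an equivalent regular expression to a handful of made-up names. The names are assumptions for illustration only and are not taken from the dictionary.

```python
import re

# Equivalent to the pattern used in the filter, written without capture groups:
# a name is discarded if it starts with "Age"/"age" or contains "SUBJECT_ID".
pattern = re.compile(r"^[Aa]ge|SUBJECT_ID")

# Made-up simplified names, for illustration only.
names = [
    "age_at_bmi_baseline_1",
    "Age at measurement of height",
    "SUBJECT_ID",
    "bmi_baseline_1",
    "height_baseline_1",
]

# re.search mirrors what pandas' str.contains does with regex=True.
kept = [name for name in names if not pattern.search(name)]
print(kept)  # ['bmi_baseline_1', 'height_baseline_1']
```

Because `^` binds only to the first alternative, a name is matched on `age` only at its start, while `SUBJECT_ID` is matched anywhere in the name.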

In [None]:
multiIndexdic_sub

### 2. Selecting variables and retrieving data from the database

Let's say we are interested in the subset of Harmonized Variables pertaining to patient demographics. 

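We subset the variable dictionary to keep only the variables in the `01 - Demographics` category, then add the `\_harmonized_consent\` variable to the selection. The consent variable is included in the query and dropped from the resulting table afterwards.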

In [None]:
mask_demo = multiIndexdic_sub.index.get_level_values(1) == '01 - Demographics'
variablesDict = multiIndexdic_sub.loc[mask_demo,:]
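
The masking step above relies on comparing one level of a `MultiIndex` to a string. The sketch below reproduces it on a toy table. It assumes, as the cell above suggests, that the category sits in the second index level. Only `01 - Demographics` comes from this notebook; the other category, the level names and the variable names are made up.

```python
import pandas as pd

# Toy multi-level index; everything except "01 - Demographics" is made up.
index = pd.MultiIndex.from_tuples(
    [
        ("Harmonized", "01 - Demographics", "sex"),
        ("Harmonized", "01 - Demographics", "race"),
        ("Harmonized", "02 - Hypothetical category", "variable_a"),
    ],
    names=["level_0", "level_1", "level_2"],
)
toy_dict = pd.DataFrame(
    {"name": ["\\sex\\", "\\race\\", "\\variable_a\\"]}, index=index
)

# get_level_values(1) returns the second index level as a flat Index,
# so comparing it to a string yields one boolean per row.
mask = toy_dict.index.get_level_values(1) == "01 - Demographics"
demo_only = toy_dict.loc[mask, :]
print(demo_only["name"].tolist())
```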

In [None]:
selected_vars = variablesDict.loc[:, "name"].tolist()
#print(selected_vars)
selected_vars.append("\\_harmonized_consent\\")
#print(selected_vars)

In [None]:
pprint(selected_vars[:5])
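
Variable names such as the consent variable contain backslashes, which must be doubled in Python string literals. The sketch below shows what the literal actually holds, and how a backslash-delimited name could be split into its parts. `example_path` is a made-up name, and the assumption that other variable names follow the same delimited layout is ours.

```python
# In a regular Python string literal, "\\" stands for ONE backslash.
consent_var = "\\_harmonized_consent\\"
print(consent_var)       # \_harmonized_consent\
print(len(consent_var))  # 21

# Hypothetical variable name, assuming a backslash-delimited layout.
example_path = "\\Hypothetical study\\01 - Demographics\\Subject sex\\"

# Splitting on the backslash leaves empty strings at both ends; drop them.
levels = [part for part in example_path.split("\\") if part]
print(levels)  # ['Hypothetical study', '01 - Demographics', 'Subject sex']
```

A raw string is not a drop-in alternative here, because a raw string literal cannot end with a single backslash.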

Retrieving the data:

In [None]:
query = resource.query()
query.select().add(selected_vars)
facts = query.getResultsDataFrame(low_memory=False)

In [None]:
facts = facts.set_index("Patient ID")\
    .dropna(axis=0, how="all")\
    .drop(["\\_harmonized_consent\\"], axis=1)
facts.shape
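
The order of the operations in the chain above matters: `dropna(how="all")` runs while the consent column is still present, so a row is removed only if the consent value is missing too. The toy table below, with made-up values, illustrates the difference. Whether this matters for the real data depends on how the consent variable is populated, which this sketch does not assume.

```python
import numpy as np
import pandas as pd

consent = "\\_harmonized_consent\\"

# Toy result table; column names other than the consent one and all values are made up.
toy = pd.DataFrame(
    {
        "Patient ID": [1, 2, 3],
        consent: ["c1", "c1", np.nan],
        "sex": ["female", np.nan, np.nan],
        "race": ["A", np.nan, np.nan],
    }
)

# Same chain as in the notebook: rows are tested while the consent column is present.
as_in_notebook = (
    toy.set_index("Patient ID").dropna(axis=0, how="all").drop([consent], axis=1)
)

# Alternative order: drop the consent column first, then remove empty rows.
consent_first = (
    toy.set_index("Patient ID").drop([consent], axis=1).dropna(axis=0, how="all")
)

print(as_in_notebook.shape)  # (2, 2): patient 2 is kept because consent is set
print(consent_first.shape)   # (1, 2): patient 2 is removed as well
```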

## Studying the Sex Distribution Across Studies

In [None]:
sex_varname = facts.keys()[facts.keys().str.contains('Subject sex')][0]
study_varname = facts.keys()[facts.keys().str.contains('distinct subgroup within a study')][0]
race_varname = facts.keys()[facts.keys().str.contains('Harmonized race category')][0]
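
The lookups above take the first column whose name contains a given fragment, and fail with a bare `IndexError` if nothing matches. A stricter variant is sketched below. `find_column` is a hypothetical helper, the column names are made up, and it uses plain substring matching rather than the regular-expression matching that `str.contains` applies by default.

```python
def find_column(columns, fragment):
    """Return the single name in `columns` that contains `fragment`.

    Hypothetical helper; raises instead of silently picking the first match.
    """
    matches = [name for name in columns if fragment in name]
    if len(matches) != 1:
        raise LookupError(
            f"Expected one column containing {fragment!r}, found {len(matches)}"
        )
    return matches[0]


# Made-up column names, for illustration only.
toy_columns = ["\\demo\\Subject sex\\", "\\demo\\Harmonized race category\\"]
print(find_column(toy_columns, "Subject sex"))  # \demo\Subject sex\
```

It accepts any iterable of names, so `find_column(facts.columns, "Subject sex")` would be one way to use it here.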

In [None]:
import matplotlib.patches as mpatches
from matplotlib import cm
from matplotlib.offsetbox import (TextArea, DrawingArea, OffsetImage,
                                  AnnotationBbox)

In [None]:
plt.rcParams["figure.figsize"] = (14,8)
font = {'weight' : 'bold',
        'size'   : 12}
plt.rc('font', **font)

In [None]:
facts.head()

In [None]:
# Keep only the subjects with a recorded sex
subset_facts = facts.loc[pd.notnull(facts[sex_varname]),:]
# Proportion of each sex within each study (one row per study, one column per sex)
ratio_df = subset_facts.groupby(study_varname)[sex_varname]\
.apply(lambda x: x.value_counts(normalize=True))\
.unstack(1)
# Annotations are placed just to the right of the longest bar of each study
annotation_x_position = ratio_df.max(axis=1)
number_subjects = subset_facts.groupby(study_varname)[sex_varname].count()
annotation_gen = list(zip(number_subjects, annotation_x_position))

fig = ratio_df.plot.barh(title="Subjects sex-ratio across studies", figsize=(10, 12))
fig.legend(bbox_to_anchor=(1, 0.5))
fig.set_xlim(0, 1.15)
fig.set_ylabel(None)

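# Annotate one bar per study (the bars of the first plotted column come first in fig.patches)
for n, p in enumerate(fig.patches[:len(annotation_gen)]):
    nb_subject, x_position = annotation_gen[n]
    fig.annotate(nb_subject, (x_position + 0.03, p.get_y()+0.1), bbox=dict(facecolor='none',
                                                                       edgecolor='black',
                                                                       boxstyle='round'))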

handles, labels = fig.get_legend_handles_labels()
red_patch = mpatches.Patch(label='Study nb subjects', edgecolor="black", facecolor="white")
handles.append(red_patch)
fig.legend(handles=handles)
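
To make the arithmetic behind the plotted proportions explicit, the sketch below computes the same kind of per-study sex ratio without pandas, on a few made-up records. The study labels and values are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Made-up (study, sex) records; None marks a missing sex value.
records = [
    ("study_A", "female"),
    ("study_A", "female"),
    ("study_A", "male"),
    ("study_A", None),
    ("study_B", "male"),
    ("study_B", "female"),
]

# Count each sex per study, skipping missing values (mirrors the notnull filter).
counts = defaultdict(Counter)
for study, sex in records:
    if sex is not None:
        counts[study][sex] += 1

# Number of subjects with a recorded sex, and proportion of each sex, per study.
n_subjects = {study: sum(c.values()) for study, c in counts.items()}
ratios = {
    study: {sex: n / n_subjects[study] for sex, n in c.items()}
    for study, c in counts.items()
}

print(n_subjects)  # {'study_A': 3, 'study_B': 2}
print(ratios["study_B"])  # {'male': 0.5, 'female': 0.5}
```

The proportions of each study sum to 1, and the subject counts correspond to the boxed numbers annotated next to the bars.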