# Harmonization across studies with PIC-SURE

This tutorial notebook demonstrates how to query and work with the BioData Catalyst studies, with a focus on cross-study harmonization. For a step-by-step introduction to the Python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.

**Before running this notebook, please review the "Get your security token" documentation in the NHLBI_BioData_Catalyst [README.md file](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token). It explains how to obtain a security token, which is mandatory to access the databases.**

 -------   

# Environment set-up

### System requirements
- Python 3.6 or later
- pip package manager
- bash interpreter

### Installation of external dependencies

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import json

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

import PicSureClient
import PicSureBdcAdapter

from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

import re

In [None]:
print("NB: This Jupyter Notebook was written using the following PIC-SURE API library versions:\n- PicSureBdcAdapter: 1.0.0\n- PicSureClient: 1.1.0")
print("The installed PIC-SURE API libraries versions:\n- PicSureBdcAdapter: {0}\n- PicSureClient: {1}".format(PicSureBdcAdapter.__version__, PicSureClient.__version__))

## Connecting to a PIC-SURE network

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [None]:
with open(token_file, "r") as f:
    my_token = f.read().strip()  # Remove any trailing newline or surrounding whitespace

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

 -------   

## Harmonizing variables with PIC-SURE
One of the key challenges in conducting analyses across several studies is ensuring correct data harmonization, that is, combining data from different sources. There are many harmonization techniques; this notebook demonstrates how to find and extract similar variables from different studies in PIC-SURE. Two examples are shown:
1. Retrieving variables for *sex and gender* across studies with BMI
2. Harmonizing the variable *"orthopnea"* across studies


*For more information about the TOPMed DCC Harmonized Data Set in PIC-SURE, please refer to the [`2_HarmonizedVariables_analysis.ipynb`](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/blob/master/NHLBI_BioData_Catalyst/python/2_HarmonizedVariables_analysis.ipynb) notebook.*

### Sex and gender variables across studies

Let's start by doing separate searches for `sex` and `gender` to gain a better understanding of the variables that exist in PIC-SURE with these terms.

In [None]:
# Get dataframe of full results
full_dict = resource.dictionary().find().DataFrame()
full_multiindex_dict = get_multiIndex_variablesDict(full_dict)

In [None]:
sex = full_multiindex_dict['simplified_name'].str.contains('sex') # Find all instances where 'sex' in simplified_name
gender = full_multiindex_dict['simplified_name'].str.contains('gender') # Find all instances where 'gender' in simplified_name
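Note that `str.contains` is case-sensitive by default, so the searches above match `sex` but not `Sex`. The following is a minimal sketch on a toy Series, assuming a case-insensitive match is wanted. The variable names are made up for illustration and are not real PIC-SURE entries.

```python
import pandas as pd

# Hypothetical variable names, for illustration only
names = pd.Series(["Sex of participant", "sex", "GENDER", "Body mass index"])

# Default behaviour: case-sensitive substring match
case_sensitive = names.str.contains("sex")

# case=False ignores letter case; na=False treats missing names as non-matches
case_insensitive = names.str.contains("sex", case=False, na=False)

print(names[case_sensitive].tolist())
print(names[case_insensitive].tolist())
```

If the case-insensitive behaviour is preferred, the same `case=False` argument could be passed in the two searches on `simplified_name`.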

In [None]:
# Uncomment the following lines of code to preview the filtered dataframes
#full_multiindex_dict[sex] # Sex variables
#full_multiindex_dict[gender] # Gender variables

After reviewing the variables using the dataframe (or the [user interface](https://picsure.biodatacatalyst.nhlbi.nih.gov/psamaui/login)), let's say we are interested in sex/gender variables from the following studies:
- TOPMed Harmonized data set
- ECLIPSE (Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints)
- EOCOPD (Early Onset of COPD)

However, these concept paths are labelled differently in each of these studies. For example, some use the keyword `sex` while others use `gender`. To account for these differences, we need a way to search for multiple keywords at once.

First, let's get all of the concept paths associated with each study.

In [None]:
topmed_harmonized = resource.dictionary().find("DCC Harmonized data set").DataFrame()
eclipse = resource.dictionary().find("Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints (ECLIPSE)").DataFrame()
eocopd = resource.dictionary().find("NHLBI TOPMed: Boston Early-Onset COPD Study").DataFrame()

Now we will search for the terms of interest (`sex` and `gender`) and keep only the concept paths that match them.

Below is a simple user-defined function that you could use to accomplish this.

In [None]:
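# Function that returns the concept paths (index labels) of a dataframe (df)
# that contain any of the terms (list_of_terms), ignoring case
def find_vars(df, list_of_terms):
    # Join the terms into one regular expression; no capturing group is needed,
    # which also avoids the pandas warning about match groups in str.contains
    regex_version = '|'.join(list_of_terms)
    var_filter = df.index.str.contains(regex_version, flags=re.IGNORECASE)
    vars_list = list(df[var_filter].index)
    return vars_list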
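Because the search terms are joined into a regular expression, characters such as `(` or `.` in a term are read as regex syntax. The sketch below shows one possible variant that escapes each term with `re.escape` so it is matched literally. The function name `find_vars_escaped`, the toy dataframe, and its concept paths are hypothetical and only serve as an illustration.

```python
import re
import pandas as pd

def find_vars_escaped(df, list_of_terms):
    """Return index labels containing any term, matched literally and case-insensitively."""
    # re.escape keeps characters such as '(' or '.' from being read as regex syntax
    pattern = "|".join(re.escape(term) for term in list_of_terms)
    mask = df.index.str.contains(pattern, flags=re.IGNORECASE)
    return list(df.index[mask])

# Hypothetical concept paths and column, for illustration only
toy = pd.DataFrame(
    {"n": [10, 20, 30]},
    index=[
        "\\Toy Study (TS)\\Gender of participant\\",
        "\\Toy Study (TS)\\Body mass index\\",
        "\\Toy Study (TS)\\SEX\\",
    ],
)

matches = find_vars_escaped(toy, ["sex", "gender"])
literal_matches = find_vars_escaped(toy, ["(TS)"])
print(matches)
print(len(literal_matches))
```

Escaping matters mostly for study names or labels that contain parentheses; for plain keywords such as `sex` and `gender` both versions behave the same.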

In [None]:
# Search for 'sex' and 'gender' variables in TOPMed Harmonized dataset
topmed_var = find_vars(topmed_harmonized, ['sex', 'gender'])
print("Concept path from TOPMed Harmonized data set:\n", topmed_var)

In [None]:
# Search for 'sex' and 'gender' variables in ECLIPSE dataset
eclipse_var = find_vars(eclipse, ['sex', 'gender'])
print("Concept path from ECLIPSE data set:\n", eclipse_var)

In [None]:
# Search for 'sex' and 'gender' variables in EOCOPD dataset
eocopd_vars = find_vars(eocopd, ['sex', 'gender'])
print("Number of concept paths from EOCOPD data set:\n", len(eocopd_vars))

Since multiple concept paths in the EOCOPD dataset contain either `gender` or `sex`, we can inspect these concept paths to determine the variable we actually want.

In [None]:
# Uncomment following line to see full list of sex/gender variables in EOCOPD
#print("Full list of variables", eocopd_vars)

# Based on this, we can see that the variable we want for this analysis is the last in the list: Gender of participant
eocopd_var = find_vars(eocopd, ['gender of participant'])
print("Concept path from EOCOPD data set:\n", eocopd_var)

As part of our research, let's also say we are interested in body mass index (BMI) measurements across these studies. Let's save these concept paths to use in our queries as well.

In [None]:
topmed_bmi = find_vars(topmed_harmonized, ['body mass index'])
print(topmed_bmi)
eclipse_bmi = find_vars(eclipse, ['body mass index'])
print(eclipse_bmi)
eocopd_bmi = find_vars(eocopd, ['body mass index'])
print(eocopd_bmi)
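Each of these searches is expected to return exactly one concept path per study. A small guard can make that assumption explicit before the paths are used in a query. The helper below, `require_single_path`, is a hypothetical name introduced here for illustration, and the example path is made up.

```python
def require_single_path(paths, label):
    """Return the only concept path in `paths`, or raise if there is not exactly one."""
    if len(paths) != 1:
        raise ValueError(f"{label}: expected 1 concept path, found {len(paths)}")
    return paths[0]

# Hypothetical search results, for illustration only
ok = require_single_path(["\\Toy Study\\Body mass index\\"], "toy BMI")
print(ok)

try:
    require_single_path([], "missing BMI")
except ValueError as err:
    message = str(err)
    print(message)
```

A check like this fails early when a keyword matches nothing or matches several variables, rather than producing a query result with an unexpected number of columns.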

Now that we know and have saved our concept paths of interest, we can use these to build our query. 

**Note: queries with the TOPMed DCC Harmonized data set cannot be combined with concept paths from other datasets. Because of this, we will run two separate queries and combine the dataframes.**

In [None]:
# Initialize a query
eclipse_query = resource.query()
# Build query using these concept paths
eclipse_query.anyof().add(eclipse_var)
eclipse_query.anyof().add(eclipse_bmi)

In [None]:
# Check results
eclipse_results = eclipse_query.getResultsDataFrame(low_memory=False)
eclipse_results.head()

In [None]:
# Initialize a query
eocopd_query = resource.query()
# Build query using these concept paths
eocopd_query.anyof().add(eocopd_var)
eocopd_query.anyof().add(eocopd_bmi)

In [None]:
# Check results 
eocopd_results = eocopd_query.getResultsDataFrame(low_memory=False)
eocopd_results.head()

In [None]:
# Initialize a query
dcc_harmonized_query = resource.query()
# Build query using TOPMed harmonized concept paths
dcc_harmonized_query.anyof().add(topmed_var)
dcc_harmonized_query.anyof().add(topmed_bmi)

In [None]:
# Check results
dcc_harmonized_results = dcc_harmonized_query.getResultsDataFrame(low_memory=False)
dcc_harmonized_results.head()

Now that we have our patient-level dataframes, we can combine them into a single, cohesive dataframe.

The following function accomplishes three main tasks:
1. Removes extra columns, such as study accession IDs and consent information
2. Renames the BMI and Sex columns
3. Adds the Dataset column, which corresponds to the study

In [None]:
def clean_up_df(df, study):
    columns_to_drop = ['\\_Parent Study Accession with Subject ID\\', '\\_Topmed Study Accession with Subject ID\\', '\\_consents\\', '\\_harmonized_consent\\']
    df1 = df.drop(columns=columns_to_drop, errors='ignore')
    if 'body mass index' in df1.columns.values[1].lower():
        df1.columns = ['Patient ID', 'BMI', 'Sex']
    else:
        df1.columns = ['Patient ID', 'Sex', 'BMI']
    df2 = df1.dropna(subset=['BMI']).copy()  # Copy so adding a column does not raise SettingWithCopyWarning
    df2['Dataset'] = study
    return df2

In [None]:
clean_eclipse = clean_up_df(eclipse_results, 'ECLIPSE')
clean_eocopd = clean_up_df(eocopd_results, 'EOCOPD')
clean_dcc = clean_up_df(dcc_harmonized_results, 'TOPMed Harmonized')
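`clean_up_df` assigns the new column names by position, so it relies on the query result having exactly three columns left after the drop. As an alternative, the sketch below renames by column name instead. The function `tidy_study`, the toy dataframe, and its column names are hypothetical and assume that the patient identifier column is called `Patient ID`.

```python
import pandas as pd

def tidy_study(df, study, sex_col, bmi_col):
    """Keep patient ID, sex, and BMI; rename them; drop rows without BMI; tag the study."""
    out = df[["Patient ID", sex_col, bmi_col]].rename(
        columns={sex_col: "Sex", bmi_col: "BMI"}
    )
    out = out.dropna(subset=["BMI"]).copy()  # copy before adding a new column
    out["Dataset"] = study
    return out

# Hypothetical query result, for illustration only
raw = pd.DataFrame({
    "Patient ID": [1, 2, 3],
    "\\_consents\\": ["c1", "c1", "c2"],
    "\\Toy Study\\Body mass index\\": [22.5, None, 30.1],
    "\\Toy Study\\Gender of participant\\": ["Female", "Male", "Male"],
})

tidy = tidy_study(
    raw,
    "Toy Study",
    sex_col="\\Toy Study\\Gender of participant\\",
    bmi_col="\\Toy Study\\Body mass index\\",
)
print(tidy)
```

Selecting and renaming by name makes the result independent of the column order returned by the query, at the cost of passing the two concept paths explicitly.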

The datasets have been prepped. We can now merge them and begin our analysis.

In [None]:
# Combine individual dataframes
final_df = pd.concat([clean_eclipse, clean_eocopd, clean_dcc], ignore_index=True)

In [None]:
# Comparison of the datasets and sample harmonization
# Select BMI explicitly so that non-numeric columns (e.g. Dataset) are not averaged
separate = final_df.groupby(['Dataset', 'Sex'])[['BMI']].mean()
print(separate)
harmonized = final_df.groupby(['Sex'])[['BMI']].mean()
print(harmonized)
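When reading these numbers, keep in mind that the combined mean weights every participant equally, so larger studies contribute more to it than smaller ones. The toy sketch below, using made-up data, contrasts the pooled mean with the unweighted mean of the per-study means.

```python
import pandas as pd

# Made-up data: study A has three participants, study B has one
toy = pd.DataFrame({
    "Dataset": ["A", "A", "A", "B"],
    "Sex": ["Female", "Female", "Female", "Female"],
    "BMI": [20.0, 22.0, 24.0, 30.0],
})

# Mean BMI within each study and sex
per_study = toy.groupby(["Dataset", "Sex"])["BMI"].mean()

# Pooled mean: every participant counts once, so study A dominates
pooled = toy.groupby("Sex")["BMI"].mean()

# Unweighted mean of the per-study means: every study counts once
mean_of_means = per_study.groupby(level="Sex").mean()

print(pooled["Female"], mean_of_means["Female"])
```

Which of the two summaries is appropriate depends on the research question; the notebook uses the pooled version.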

In [None]:
# Make lists of male and female mean BMI for plotting
male_means = list(separate[separate.index.get_level_values('Sex')=='Male']['BMI'])
male_means.extend(harmonized[harmonized.index.get_level_values('Sex')=='Male']['BMI'])
female_means = list(separate[separate.index.get_level_values('Sex')=='Female']['BMI'])
female_means.extend(harmonized[harmonized.index.get_level_values('Sex')=='Female']['BMI'])

In [None]:
# Bar plot of the results
width = 0.2
labels = ['Male', 'Female']
x = np.arange(len(labels))

fig, ax = plt.subplots()
study1 = ax.bar(x - width*1.5, [male_means[0], female_means[0]], width, label='ECLIPSE')
study2 = ax.bar(x - width*0.5, [male_means[1], female_means[1]], width, label='EOCOPD')
study3 = ax.bar(x + width*0.5, [male_means[2], female_means[2]], width, label='TOPMed Harmonized')
study4 = ax.bar(x + width*1.5, [male_means[3], female_means[3]], width, label='Combined')

ax.set_ylabel('Body Mass Index (BMI)')
ax.set_title('Body Mass Index by Sex and Dataset')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend(bbox_to_anchor=(1.05, 1.0), loc='upper left')  # Place the legend outside the plot area
fig.tight_layout()

plt.show()