# Harmonization across studies with PIC-SURE

This tutorial notebook demonstrates how to query and work with the BioData Catalyst studies, with a focus on cross-study harmonization. For a more step-by-step introduction to the Python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.

**Before running this notebook, please review the "Get your security token" section of the NHLBI_BioData_Catalyst [README.md file](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token). It explains how to obtain a security token, which is required to access the databases.**

 -------   

# Environment set-up

### System requirements
- Python 3.6 or later
- pip package manager
- bash interpreter

### Installation of external dependencies

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import json
#from pprint import pprint

import pandas as pd
import numpy as np 
#import matplotlib.pyplot as plt
#from scipy import stats

import PicSureClient
import PicSureBdcAdapter

from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

import re

In [None]:
print("NB: This Jupyter Notebook was written using the following PIC-SURE API library versions:\n- PicSureBdcAdapter: 1.0.0\n- PicSureClient: 1.1.0")
print("The installed PIC-SURE API library versions are:\n- PicSureBdcAdapter: {0}\n- PicSureClient: {1}".format(PicSureBdcAdapter.__version__, PicSureClient.__version__))

## Connecting to a PIC-SURE network

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [None]:
with open(token_file, "r") as f:
    my_token = f.read()
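
Token files often end with a newline or stray spaces, which may end up in the value passed to the connection. The sketch below shows one way to guard against that, assuming the token is stored as plain text. `read_token` is a hypothetical helper written for this illustration and is not part of the PIC-SURE libraries; the token value is made up.

```python
import os
import tempfile


def read_token(path):
    """Read a security token from a text file, dropping surrounding whitespace."""
    with open(path, "r") as f:
        token = f.read().strip()
    if not token:
        raise ValueError("Token file is empty: {0}".format(path))
    return token


# Demonstration with a throwaway file and a made-up token value
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "token.txt")
    with open(demo_path, "w") as f:
        f.write("not-a-real-token\n")
    demo_token = read_token(demo_path)

print(repr(demo_token))
```

Failing early on an empty file gives a clearer message than an authentication error later in the notebook.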

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

 -------   

## Harmonizing variables with PIC-SURE
One of the key challenges to conducting analyses with several studies is ensuring correct data harmonization, or combining of data from different sources. There are many harmonization techniques, but this notebook will demonstrate how to find and extract similar variables from different studies in PIC-SURE. Two examples of this will be shown:
1. Retrieving variables for *sex and gender* across studies with BMI
2. Harmonizing the variable *"orthopnea"* across studies

### Sex and gender variables across studies

Let's start by doing separate searches for `sex` and `gender` to gain a better understanding of the variables that exist in PIC-SURE with these terms.

In [None]:
# Get dataframe of full results
full_dict = resource.dictionary().find().DataFrame()
full_multiindex_dict = get_multiIndex_variablesDict(full_dict)

In [None]:
sex = full_multiindex_dict['simplified_name'].str.contains('sex') # Find all instances where 'sex' in simplified_name
gender = full_multiindex_dict['simplified_name'].str.contains('gender') # Find all instances where 'gender' in simplified_name

In [None]:
# Uncomment the following lines of code to preview the filtered dataframes
#full_multiindex_dict[sex] # Sex variables
#full_multiindex_dict[gender] # Gender variables
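
By default, `str.contains` is case-sensitive and returns missing values for missing names. The sketch below shows how the `case` and `na` arguments change this on a toy dataframe. The names in `toy_dict` are made up for illustration, and whether the real `simplified_name` column contains capitalized or missing entries is an assumption to check against your own results.

```python
import pandas as pd

# Toy stand-in for the data dictionary; these names are made up for illustration
toy_dict = pd.DataFrame(
    {"simplified_name": ["Sex of participant", "gender", "Age at exam", None, "sex"]}
)

# case=False matches 'Sex' as well as 'sex'; na=False treats missing names as non-matches
toy_sex = toy_dict["simplified_name"].str.contains("sex", case=False, na=False)
toy_gender = toy_dict["simplified_name"].str.contains("gender", case=False, na=False)

# Combine the two boolean masks to keep rows matching either keyword
print(toy_dict[toy_sex | toy_gender])
```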

After reviewing the variables using the dataframe (or the [user interface](https://picsure.biodatacatalyst.nhlbi.nih.gov/psamaui/login)), let's say we are interested in sex/gender variables from the following studies:
- TOPMed Harmonized data set
- ECLIPSE (Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints)
- EOCOPD (Early Onset of COPD)

However, these concept paths are labelled differently for each of these studies. For example, some use the keyword `sex` while others use `gender`. To account for these differences, we need to develop a way to search for multiple keywords at once.

First, let's get all of the concept paths associated with each study.

In [None]:
topmed_harmonized = resource.dictionary().find("DCC Harmonized data set").DataFrame()
eclipse = resource.dictionary().find("Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints (ECLIPSE)").DataFrame()
eocopd = resource.dictionary().find("NHLBI TOPMed: Boston Early-Onset COPD Study").DataFrame()

Now we will search for the terms of interest (`sex` and `gender`) and keep only the concept paths that match them.

Below is a simple user-defined function that you could use to accomplish this.

In [None]:
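# Function that returns the variables (index labels) of a dataframe (df) that contain any of the terms (list_of_terms)
def find_vars(df, list_of_terms):
    # Join the terms into one alternation pattern, e.g. 'sex|gender'
    # (no capture group, which avoids a pandas warning about match groups)
    regex_version = '|'.join(list_of_terms)
    # Case-insensitive match against the concept paths stored in the index
    var_filter = df.index.str.contains(regex_version, flags=re.IGNORECASE)
    vars_list = list(df[var_filter].index)
    return vars_list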
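
Because the search terms are joined into a regular expression, a term containing characters such as parentheses (for example `Body mass index (kg/m2)`) would be interpreted as a pattern rather than as literal text. The sketch below shows a variant that escapes each term with `re.escape`. `find_vars_literal` is a hypothetical name, and the toy study, accession number, and `n_obs` column are made up for illustration.

```python
import re

import pandas as pd


def find_vars_literal(df, list_of_terms):
    """Return index labels of df containing any term, matched literally and case-insensitively."""
    pattern = "|".join(re.escape(term) for term in list_of_terms)
    var_filter = df.index.str.contains(pattern, flags=re.IGNORECASE)
    return list(df[var_filter].index)


# Toy dictionary indexed by made-up concept paths
toy_paths = [
    "\\Toy Study ( phs000000 )\\Gender of participant\\",
    "\\Toy Study ( phs000000 )\\Body mass index (kg/m2)\\",
    "\\Toy Study ( phs000000 )\\Age at enrollment\\",
]
toy_study = pd.DataFrame({"n_obs": [10, 8, 10]}, index=toy_paths)

print(find_vars_literal(toy_study, ["sex", "gender"]))
print(find_vars_literal(toy_study, ["index (kg/m2)"]))
```

Escaping is only needed when terms are meant as literal text; leave it out if you want to pass regular expressions as search terms.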

In [None]:
# Search for 'sex' and 'gender' variables in TOPMed Harmonized dataset
topmed_var = find_vars(topmed_harmonized, ['sex', 'gender'])
print("Concept path from TOPMed Harmonized data set:\n", topmed_var)

In [None]:
# Search for 'sex' and 'gender' variables in ECLIPSE dataset
eclipse_var = find_vars(eclipse, ['sex', 'gender'])
print("Concept path from ECLIPSE data set:\n", eclipse_var)

In [None]:
# Search for 'sex' and 'gender' variables in EOCOPD dataset
eocopd_vars = find_vars(eocopd, ['sex', 'gender'])
print("Number of concept paths from EOCOPD data set:\n", len(eocopd_vars))

Since there are multiple concept paths that contain either `gender` or `sex` in the EOCOPD dataset, we can investigate these concept paths to determine the true variable of interest.

In [None]:
# Uncomment following line to see full list of sex/gender variables in EOCOPD
#print("Full list of variables", eocopd_vars)

# Based on this, we can see that the variable we want for this analysis is the last in the list: Gender of participant
eocopd_var = find_vars(eocopd, ['gender of participant'])
print("Concept path from EOCOPD data set:\n", eocopd_var)
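
When a search returns many long concept paths, comparing only the final segment of each path can make the list easier to review. The sketch below assumes concept paths are backslash-delimited strings, as they appear in this notebook. `variable_name` is a hypothetical helper, and the paths are made up for illustration.

```python
def variable_name(concept_path):
    """Return the last non-empty segment of a backslash-delimited concept path."""
    segments = [segment for segment in concept_path.split("\\") if segment]
    return segments[-1] if segments else ""


# Made-up concept paths in the same backslash-delimited shape as the real ones
toy_paths = [
    "\\Toy Study ( phs000000 )\\Demographics table\\Gender of participant\\",
    "\\Toy Study ( phs000000 )\\Family table\\Sex of sibling\\",
]

for path in toy_paths:
    print(variable_name(path))
```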

As part of our research, let's also say we are interested in Body Mass Index (BMI) measurements across these studies. Let's save these concept paths to use in our queries as well.

In [None]:
topmed_bmi = find_vars(topmed_harmonized, ['body mass index'])
print(topmed_bmi)
eclipse_bmi = find_vars(eclipse, ['body mass index'])
print(eclipse_bmi)
eocopd_bmi = find_vars(eocopd, ['body mass index'])
print(eocopd_bmi)

Now that we know and have saved our concept paths of interest, we can use these to build our query. 

**Note: Queries with the TOPMed DCC Harmonized Data Set cannot be combined with concept paths from other datasets. Because of this, we will run two separate queries and combine the dataframes.**

In [None]:
# Initialize a query
non_harmonized_query = resource.query()
# Combine concept paths from ECLIPSE and EOCOPD
non_harmonized_paths = [*eclipse_var, *eocopd_var]
non_harmonized_bmi = [*eclipse_bmi, *eocopd_bmi]

In [None]:
# Build query using these concept paths
non_harmonized_query.anyof().add(non_harmonized_paths)
non_harmonized_query.anyof().add(non_harmonized_bmi)

In [None]:
# Check results
non_harmonized_results = non_harmonized_query.getResultsDataFrame(low_memory=False)
non_harmonized_results.head()

In [None]:
# Initialize a query
dcc_harmonized_query = resource.query()

In [None]:
# Build query using TOPMed harmonized concept paths
dcc_harmonized_query.anyof().add(topmed_var)
dcc_harmonized_query.anyof().add(topmed_bmi)

In [None]:
# Check results
dcc_harmonized_results = dcc_harmonized_query.getResultsDataFrame(low_memory=False)
dcc_harmonized_results.head()

Now that we have our patient-level dataframes, we can combine them into a single, cohesive dataframe.

In [None]:
columns_to_drop = ['Patient ID', '\\_Parent Study Accession with Subject ID\\', '\\_Topmed Study Accession with Subject ID\\', '\\_consents\\']#, '\\_harmonized_consent\\']
nonharm_results = non_harmonized_results.drop(columns=columns_to_drop)

In [None]:
# Concept paths of the BMI columns in each study
eclipse_bmi_col = '\\Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints (ECLIPSE) ( phs001252 )\\Body mass index (kg/m2)\\'
eocopd_bmi_col = "\\NHLBI TOPMed: Boston Early-Onset COPD Study ( phs000946 )\\Subject ID, subject age, gender, race, height, weight, BMI, age at sample collection, pregnancy, number of cigarettes per day, current or former smoker, and packs of cigarettes smoked per day multiplied by years of participants with early onset COPD and their pedigree and involved in the 'Boston Early-Onset COPD Study in the National Heart, Lung, and Blood Institute (NHLBI) Trans-Omics for Precision Medicine (TOPMed) Program' project.\\Body Mass Index [BMI ]\\"

# Fill missing ECLIPSE BMI values with the EOCOPD values (displays the result without storing it)
nonharm_results[eclipse_bmi_col].fillna(nonharm_results[eocopd_bmi_col])
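
`fillna` returns a new Series, so the filled values need to be assigned to a column if they are to be kept. The sketch below shows one possible way to do that and then stack the result with harmonized data, using toy dataframes. The column names (`study_a_bmi`, `study_b_bmi`, `bmi`, `source`) and all values are made up for illustration and do not come from PIC-SURE.

```python
import pandas as pd

# Toy stand-in for the non-harmonized results: each study fills only its own column
toy_nonharm = pd.DataFrame(
    {
        "study_a_bmi": [22.5, 30.1, None, None],
        "study_b_bmi": [None, None, 27.3, 19.8],
    }
)

# Take the study A value where present, otherwise fall back to study B
toy_nonharm["bmi"] = toy_nonharm["study_a_bmi"].fillna(toy_nonharm["study_b_bmi"])

# Toy stand-in for the harmonized results, which already have a single BMI column
toy_harm = pd.DataFrame({"bmi": [24.0, 28.6]})

# Stack the two sources, recording where each row came from
combined = pd.concat(
    [
        toy_nonharm[["bmi"]].assign(source="non_harmonized"),
        toy_harm.assign(source="dcc_harmonized"),
    ],
    ignore_index=True,
)
print(combined)
```

Keeping a `source` column makes it possible to check later whether differences between groups track the study of origin rather than the variable itself.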
                      
                      
                      
                      