# Harmonization across studies with PIC-SURE

This tutorial notebook demonstrates how to query and work with the BioData Catalyst studies, with a focus on cross-study harmonization. For a step-by-step introduction to the Python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.

**Before running this notebook, please review the "Get your security token" documentation in the NHLBI_BioData_Catalyst [README.md file](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token). It explains how to obtain a security token, which is required to access the databases.**

 -------   

# Environment set-up

### System requirements
- Python 3.6 or later
- pip package manager
- bash interpreter

### Installation of external dependencies

In [1]:
import sys
!{sys.executable} -m pip install -r requirements.txt

You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [2]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

Collecting git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
  Cloning https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git to /tmp/pip-req-build-knxto1vc
  Running command git clone -q https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git /tmp/pip-req-build-knxto1vc
Collecting httplib2
  Using cached httplib2-0.19.1-py3-none-any.whl (95 kB)
Collecting pyparsing<3,>=2.4.2
  Using cached pyparsing-2.4.7-py2.py3-none-any.whl (67 kB)
Building wheels for collected packages: PicSureHpdsLib
  Building wheel for PicSureHpdsLib (setup.py) ... done
  Created wheel for PicSureHpdsLib: filename=PicSureHpdsLib-0.9.0-py2.py3-none-any.whl size=22051 sha256=d87e4a747bff8b39a4c44c0b2dd1436e370873330819da519ca9b9baa5e5b009
  Stored in directory: /tmp/pip-ephem-wheel-cache-njfq6ovv/wheels/ae/d9/1a/c8c0ac8151b575c845efddc061fe014d86c51d1fd2c408907c
Successfully built PicSureHpdsLib
Installing collected packages: pyparsing, httplib2, PicSureHpdsLib
  Attempting un

In [3]:
import json
#from pprint import pprint

import pandas as pd
import numpy as np 
#import matplotlib.pyplot as plt
#from scipy import stats

import PicSureClient
import PicSureBdcAdapter

from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

import re

In [4]:
print("NB: This Jupyter Notebook was written using the following PIC-SURE API versions:\n- PicSureBdcAdapter: 1.0.0\n- PicSureClient: 1.1.0")
print("Installed PIC-SURE API library versions:\n- PicSureBdcAdapter: {0}\n- PicSureClient: {1}".format(PicSureBdcAdapter.__version__, PicSureClient.__version__))

NB: This Jupyter Notebook was written using the following PIC-SURE API versions:
- PicSureBdcAdapter: 1.0.0
- PicSureClient: 1.1.0
Installed PIC-SURE API library versions:
- PicSureBdcAdapter: 1.0.0
- PicSureClient: 1.1.0


## Connecting to a PIC-SURE network

In [5]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [6]:
# Read the security token; strip() removes a trailing newline or stray whitespace
with open(token_file, "r") as f:
    my_token = f.read().strip()

In [7]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

+--------------------------------------+------------------------------------------------------+
|  Resource UUID                       |  Resource Name                                       |
+--------------------------------------+------------------------------------------------------+
| 02e23f52-f354-4e8b-992c-d37c8b9ba140 |                                                      |
| 70c837be-5ffc-11eb-ae93-0242ac130002 |                                                      |
+--------------------------------------+------------------------------------------------------+


 -------   

## Harmonizing variables with PIC-SURE
A key challenge in analyzing several studies together is data harmonization: combining data from different sources so that equivalent variables can be compared. There are many harmonization techniques; this notebook demonstrates how to find and extract similar variables from different studies in PIC-SURE. Two examples are shown:
1. Retrieving variables for sex and gender across studies
2. Harmonizing the variable "orthopnea" across studies
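
To make the idea concrete, here is a minimal sketch of what harmonizing a categorical variable can involve: mapping study-specific codings onto one shared vocabulary. The toy data, the column names, and the `to_common` mapping are hypothetical and are not taken from PIC-SURE.

```python
import pandas as pd

# Hypothetical toy data: two studies that encode the same concept differently.
study_a = pd.DataFrame({"patient_id": [1, 2, 3], "sex": ["Male", "Female", "Female"]})
study_b = pd.DataFrame({"patient_id": [4, 5], "gender": ["M", "F"]})

# Hypothetical mapping from each study-specific code to a shared vocabulary.
to_common = {"Male": "male", "Female": "female", "M": "male", "F": "female"}


def harmonize(df, source_col, study_name):
    """Return patient_id, the harmonized value, and the originating study."""
    return pd.DataFrame({
        "patient_id": df["patient_id"],
        "sex_harmonized": df[source_col].map(to_common),
        "study": study_name,
    })


# Stack the per-study frames into one table with a common column.
harmonized = pd.concat(
    [harmonize(study_a, "sex", "A"), harmonize(study_b, "gender", "B")],
    ignore_index=True,
)
print(harmonized)
```

Keeping a `study` column alongside the harmonized value makes it possible to trace each row back to its source.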

### Sex and gender variables across studies

Let's start by doing separate searches for `sex` and `gender` to gain a better understanding of the variables that exist in PIC-SURE with these terms.

In [8]:
# Get dataframe of full results
full_dict = resource.dictionary().find().DataFrame()
full_multiindex_dict = get_multiIndex_variablesDict(full_dict)

In [9]:
sex = full_multiindex_dict['simplified_name'].str.contains('sex') # Find all instances where 'sex' in simplified_name
gender = full_multiindex_dict['simplified_name'].str.contains('gender') # Find all instances where 'gender' in simplified_name

In [10]:
# Uncomment the following lines of code to preview the filtered dataframes
#full_multiindex_dict[sex] # Sex variables
#full_multiindex_dict[gender] # Gender variables
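
Note that `str.contains` is case-sensitive by default, so a search for `'sex'` would not match a name spelled `Sex`. The sketch below shows a case-insensitive variant on a hypothetical stand-in for the `simplified_name` column; the example names are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the 'simplified_name' column of the dictionary dataframe.
names = pd.Series([
    "Sex",
    "Subject sex as recorded by the study.",
    "Gender of participant",
    "Age at enrollment",
    None,
])

# case=False ignores capitalization; na=False treats missing names as non-matches.
sex_mask = names.str.contains("sex", case=False, na=False)
gender_mask = names.str.contains("gender", case=False, na=False)

# Combine the two masks to list every name that mentions either term.
print(names[sex_mask | gender_mask].tolist())
```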

After reviewing the variables using the dataframe (or the [user interface](https://picsure.biodatacatalyst.nhlbi.nih.gov/psamaui/login)), let's say we are interested in sex/gender variables from the following studies:
- TOPMed Harmonized data set
- ECLIPSE (Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints)
- EOCOPD (Early Onset of COPD)

However, the sex/gender variables are named differently in each of these studies.

First, let's get all of the variables associated with each study.

In [11]:
topmed_harmonized = resource.dictionary().find("DCC Harmonized data set").DataFrame()
eclipse = resource.dictionary().find("Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints (ECLIPSE)").DataFrame()
eocopd = resource.dictionary().find("NHLBI TOPMed: Boston Early-Onset COPD Study").DataFrame()

Now we will search each study's dictionary for the terms of interest (`sex` and `gender`) and keep only the matching variables.

Below is a simple user-defined function that you could use to accomplish this.

In [12]:
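# Return the variables (index entries) of a dataframe (df) whose names contain any of the terms (list_of_terms)
def find_vars(df, list_of_terms):
    # Build a "(term1|term2|...)" alternation; the terms are treated as regular expressions
    regex_version = '('+('|').join(list_of_terms)+')'
    print("Using regex:", regex_version)
    # Boolean mask over the index: True where the variable path matches any term, ignoring case
    var_filter = df.index.str.contains(regex_version, flags=re.IGNORECASE)
    vars_list = list(df[var_filter].index)
    return vars_list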
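
Because `find_vars` treats each term as a regular expression, terms containing characters such as `[` or `(` would not match literally, and pandas may warn about match groups when a pattern contains a capturing group. The following is a sketch of one possible variant, not part of the original notebook: `find_vars_escaped` and the toy dataframe are hypothetical names introduced for illustration.

```python
import re

import pandas as pd


def find_vars_escaped(df, list_of_terms):
    """Hypothetical variant of find_vars: literal, case-insensitive match on the index."""
    # re.escape makes characters such as '(' or '[' match literally.
    # (?:...) is a non-capturing group, so the pattern contains no match groups.
    pattern = "(?:" + "|".join(re.escape(term) for term in list_of_terms) + ")"
    mask = df.index.str.contains(pattern, flags=re.IGNORECASE)
    return list(df[mask].index)


# Hypothetical toy dictionary indexed by concept path.
toy = pd.DataFrame(
    {"observationCount": [10, 20, 30]},
    index=[
        "\\Study A\\Sex\\",
        "\\Study A\\Age\\",
        "\\Study B\\Gender of participant [Male, Female]\\",
    ],
)
print(find_vars_escaped(toy, ["sex", "gender"]))
```

With escaping in place, a term such as `"[Male, Female]"` can be searched for as plain text.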

In [13]:
topmed_var = find_vars(topmed_harmonized, ['sex', 'gender'])
print("Variable from TOPMed Harmonized data set:\n", topmed_var)

Using regex: (sex|gender)
Variable from TOPMed Harmonized data set:
 ['\\DCC Harmonized data set\\01 - Demographics\\Subject sex  as recorded by the study.\\']




In [14]:
eclipse_var = find_vars(eclipse, ['sex', 'gender'])
print("Variable from ECLIPSE data set:\n", eclipse_var)

Using regex: (sex|gender)
Variable from ECLIPSE data set:
 ['\\Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints (ECLIPSE) ( phs001252 )\\Sex\\']


In [15]:
eocopd_vars = find_vars(eocopd, ['sex', 'gender'])
print("Number of variables from EOCOPD data set:\n", len(eocopd_vars))

Using regex: (sex|gender)
Number of variables from EOCOPD data set:
 14


Since multiple variables contain either `gender` or `sex`, we need to inspect them to determine the variable of interest.

In [16]:
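eocopd_vars  # Not displayed here because it is not the last expression in the cell; use print(eocopd_vars) to inspect the list
# Inspecting the list shows that the variable we want for this analysis is the last one: Gender of participant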
eocopd_var = find_vars(eocopd, ['gender of participant'])
print("Variable from EOCOPD data set:\n", eocopd_var)

Using regex: (gender of participant)
Variable from EOCOPD data set:
 ["\\NHLBI TOPMed: Boston Early-Onset COPD Study ( phs000946 )\\Subject ID, subject age, gender, race, height, weight, BMI, age at sample collection, pregnancy, number of cigarettes per day, current or former smoker, and packs of cigarettes smoked per day multiplied by years of participants with early onset COPD and their pedigree and involved in the 'Boston Early-Onset COPD Study in the National Heart, Lung, and Blood Institute (NHLBI) Trans-Omics for Precision Medicine (TOPMed) Program' project.\\Gender of participant [Male, Female]\\"]


Now that we know our variables of interest, we can use these to build our query.
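
Before building the query, it can help to check which study each selected concept path belongs to. The sketch below groups backslash-delimited paths by their first component. `group_by_study` and the shortened example paths are hypothetical; they only mimic the path style shown in the outputs above.

```python
from collections import defaultdict


def group_by_study(paths):
    """Group concept paths by their first backslash-delimited component."""
    groups = defaultdict(list)
    for path in paths:
        # Paths look like '\\Study name\\...\\Variable\\'; drop the empty pieces.
        parts = [piece for piece in path.split("\\") if piece]
        if not parts:
            continue  # skip empty or malformed paths
        groups[parts[0].strip()].append(path)
    return dict(groups)


# Shortened, hypothetical paths in the same backslash-delimited style.
example_paths = [
    "\\DCC Harmonized data set\\01 - Demographics\\Subject sex\\",
    "\\Study X ( phs000000 )\\Sex\\",
    "\\Study X ( phs000000 )\\Age\\",
]
for study, study_paths in group_by_study(example_paths).items():
    print(study, len(study_paths))
```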

In [17]:
my_query = resource.query()

In [18]:
full_list = [*topmed_var, *eclipse_var, *eocopd_var]
full_list

['\\DCC Harmonized data set\\01 - Demographics\\Subject sex  as recorded by the study.\\',
 '\\Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints (ECLIPSE) ( phs001252 )\\Sex\\',
 "\\NHLBI TOPMed: Boston Early-Onset COPD Study ( phs000946 )\\Subject ID, subject age, gender, race, height, weight, BMI, age at sample collection, pregnancy, number of cigarettes per day, current or former smoker, and packs of cigarettes smoked per day multiplied by years of participants with early onset COPD and their pedigree and involved in the 'Boston Early-Onset COPD Study in the National Heart, Lung, and Blood Institute (NHLBI) Trans-Omics for Precision Medicine (TOPMed) Program' project.\\Gender of participant [Male, Female]\\"]

In [19]:
my_query.select().add(full_list)

<PicSureHpdsLib.PicSureHpdsAttrListKeys.AttrListKeys at 0x7f5427215c88>

In [21]:
# A single query cannot mix harmonized and non-harmonized variables; run them as separate queries and combine the results afterwards

query_result = my_query.getResultsDataFrame(low_memory=False)
query_result.shape

ERROR: HTTP response was bad
https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure/query/sync
{'date': 'Fri, 27 Aug 2021 19:06:01 GMT', 'content-type': 'application/json', 'content-length': '91', 'connection': 'keep-alive', 'server': 'Apache/2.4.46 (Unix) OpenSSL/1.1.1k', 'strict-transport-security': 'max-age=31536000; includeSubdomains; preload', 'vary': 'Accept-Encoding', 'status': '401', '-content-encoding': 'gzip'}
{"errorType":"Unauthorized","message":"User is not authorized. [Token invalid or expired]"}


PicSureClientException: Error: An error has occurred with the server
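
The 401 response above reports an invalid or expired token. If the security token is a JSON Web Token (an assumption; this notebook does not state the token format), its expiry time can be inspected locally before sending a query. The helper names `jwt_expiry` and `is_expired` are hypothetical, and the sketch only decodes the payload; it does not verify the signature.

```python
import base64
import json
import time


def jwt_expiry(token):
    """Return the 'exp' claim (Unix seconds) of a JWT without verifying it, or None."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        return None  # not in header.payload.signature form
    payload_b64 = parts[1]
    # JWT segments use URL-safe base64 without padding; restore the padding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    try:
        payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    except (ValueError, UnicodeDecodeError):
        return None
    if not isinstance(payload, dict):
        return None
    return payload.get("exp")


def is_expired(token, now=None):
    """Return True/False if the expiry is known, or None if it cannot be determined."""
    exp = jwt_expiry(token)
    if exp is None:
        return None
    return (time.time() if now is None else now) >= exp


def _b64(obj):
    # Helper for the demo only: encode a dict as an unpadded URL-safe base64 segment.
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()


# Fabricated demo token (not a real credential) with an expiry of 1000 seconds.
demo_token = ".".join([_b64({"alg": "none"}), _b64({"exp": 1000}), "sig"])
print(is_expired(demo_token, now=2000))
```

If the token has expired, obtaining a fresh one as described in the README and saving it to `token.txt` is the likely fix.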

In [None]:
query_result.tail()
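
As the comment in the query cell notes, harmonized and non-harmonized variables need to be retrieved separately and combined afterwards. The sketch below shows one way to combine two result dataframes with pandas. The toy data and column names are hypothetical, and using `Patient ID` as the shared key is an assumption about the result format.

```python
import pandas as pd

# Hypothetical results of two separate queries; "Patient ID" as the shared key is an assumption.
harmonized_result = pd.DataFrame({
    "Patient ID": [1, 2, 3],
    "harmonized_sex": ["female", "male", "female"],
})
study_result = pd.DataFrame({
    "Patient ID": [2, 3, 4],
    "study_sex": ["Male", "Female", "Male"],
})

# An outer merge keeps patients that appear in only one of the two results;
# indicator=True adds a "_merge" column recording where each row came from.
combined = harmonized_result.merge(
    study_result, on="Patient ID", how="outer", indicator=True
)
print(combined)
```

An outer join is used here so that no patient is silently dropped; switch to `how="inner"` if only patients present in both results are wanted.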