# Identifying and Extracting Longitudinal Variables Using the Python PIC-SURE API

This tutorial notebook demonstrates how to identify and extract longitudinal variables using the Python PIC-SURE API. Longitudinal variables are defined here as variables that have multiple 'Exam' or 'Visit' descriptions within their concept path.


In this example, we will find the patient-level data for a lipid-related longitudinal variable within the Framingham Heart Study. We will:
1. Identify which longitudinal variables are associated with the keywords of interest (lipid, triglyceride), and how many exams / visits are associated with each one
2. Select a longitudinal variable of interest from a specific study (Framingham Heart Study)
3. Extract patient-level data into a dataframe where each row represents a patient and each column represents a visit
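
As a rough illustration of the target shape described in step 3, the sketch below reshapes a small, entirely hypothetical long-format table (the `patient_id`, `exam`, and `triglycerides` columns and their values are made up, and are not PIC-SURE output) into one row per patient and one column per exam. It only shows the intended layout, not how the API returns data.

```python
import pandas as pd

# Hypothetical long-format records: one row per (patient, exam) measurement.
records = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3],
    "exam": [1, 2, 1, 2, 1],
    "triglycerides": [150.0, 142.0, 98.0, 101.0, 180.0],
})

# Reshape so that each row is a patient and each column is an exam.
wide = records.pivot(index="patient_id", columns="exam", values="triglycerides")
wide.columns = [f"exam_{n}" for n in wide.columns]

print(wide)
```

Patients who missed an exam (patient 3 at exam 2 in this toy data) end up with a missing value in the corresponding column.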

For a more basic introduction to the Python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.

**Before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/harmonized_lipid_measurements_example/NHLBI_BioData_Catalyst#get-your-security-token).**

## Environment Set-Up

### System Requirements
- Python 3.6 or later
- pip, the Python package manager, which is already available on most systems with a Python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))

### Install Packages

In [1]:
import sys
!{sys.executable} -m pip install -r requirements.txt

Collecting tqdm>=4.38.0
  Downloading tqdm-4.60.0-py2.py3-none-any.whl (75 kB)
Installing collected packages: tqdm
Successfully installed tqdm-4.60.0


In [2]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

Collecting git+https://github.com/hms-dbmi/pic-sure-python-client.git
  Cloning https://github.com/hms-dbmi/pic-sure-python-client.git to /tmp/pip-req-build-0pc0ivdr
  Running command git clone -q https://github.com/hms-dbmi/pic-sure-python-client.git /tmp/pip-req-build-0pc0ivdr
Building wheels for collected packages: PicSureClient
  Building wheel for PicSureClient (setup.py) ... done
  Created wheel for PicSureClient: filename=PicSureClient-0.1.0-py2.py3-none-any.whl size=10225 sha256=68757d3d2e0bbd2ed09ca170a59d5d28fc9301106ec808ea3c1ed8f84f568fce
  Stored in directory: /tmp/pip-ephem-wheel-cache-37h4ehtv/wheels/31/ef/21/e362bba8de04e0072fafec9f77bd1abdf7e166213d27e98729
Successfully built PicSureClient
Installing collected packages: PicSureClient
Successfully installed PicSureClient-0.1.0
Collecting git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
  Cloning https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git to /tmp/pip-req-build-8sm6jlbh
  Runn

In [13]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureClient
import PicSureBdcAdapter

from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

import re

In [4]:
# Pandas DataFrame display options
pd.set_option("display.max_rows", 100)

# Matplotlib display parameters
plt.rcParams["figure.figsize"] = (14,8)
font = {'weight' : 'bold',
        'size'   : 12}
plt.rc('font', **font)

## Connecting to a PIC-SURE Network
**Again, before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/harmonized_lipid_measurements_example/NHLBI_BioData_Catalyst#get-your-security-token).**

In [5]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [6]:
with open(token_file, "r") as f:
    # Strip the trailing newline and any surrounding whitespace from the token
    my_token = f.read().strip()

In [7]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

|        certificates to be acceptable for connections.  This may be useful for           |
|        working in a development environment or on systems that host public              |
|        data.  BEST SECURITY PRACTICES ARE THAT IF YOU ARE WORKING WITH SENSITIVE        |
|        DATA THEN ALL SSL CERTS BY THOSE EVIRONMENTS SHOULD NOT BE SELF-SIGNED.          |
+--------------------------------------+------------------------------------------------------
|  Resource UUID                       |  Resource Name                                  
+--------------------------------------+------------------------------------------------------
| 02e23f52-f354-4e8b-992c-d37c8b9ba140
+--------------------------------------+------------------------------------------------------


## Longitudinal Lipid Variable Example
This example shows how to extract lipid measurements recorded across multiple visits for different cohorts.

### Access the data
First, we will create a multiIndex variable dictionary of all variables we have access to.

In [11]:
fullVariableDict = resource.dictionary().find().DataFrame()
variablesDict = get_multiIndex_variablesDict(fullVariableDict)
variablesDict

| level_0 | level_1 | level_2 | level_3 | level_4 | simplified_name | name | observationCount | categorical | categoryValues | nb_modalities | min | max | HpdsDataType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gene_with_variant | | | | | Gene_with_variant | Gene_with_variant | | True | [HTR4, AC121758.1, HTR6, HTR7, BBX, RN7SL563P,... | 39598.0 | | | info |
| Variant_class | | | | | Variant_class | Variant_class | | True | [SNV, insertion, deletion] | 3.0 | | | info |
| Variant_consequence_calculated | | | | | Variant_consequence_calculated | Variant_consequence_calculated | | True | [intergenic_variant, start_retained_variant, f... | 28.0 | | | info |
| Variant_frequency_as_text | | | | | Variant_frequency_as_text | Variant_frequency_as_text | | True | [Novel, Rare, Common] | 3.0 | | | info |
| Variant_frequency_in_gnomAD | | | | | Variant_frequency_in_gnomAD | Variant_frequency_in_gnomAD | | False | [] | 0.0 | | | info |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| _studies | NHLBI TOPMed: Whole Genome Sequencing of Venous Thromboembolism (WGS of VTE) ( phs001402 ) | | | | NHLBI TOPMed: Whole Genome Sequencing of Venou... | `\_studies\NHLBI TOPMed: Whole Genome Sequencin...` | 1535.0 | True | [TRUE] | 1.0 | | | phenotypes |
| _studies | NHLBI TOPMed: Women's Health Initiative (WHI) ( phs001237 ) | | | | NHLBI TOPMed: Women's Health Initiative (WHI) ... | `\_studies\NHLBI TOPMed: Women's Health Initiat...` | 11357.0 | True | [TRUE] | 1.0 | | | phenotypes |
| _studies | The Cleveland Clinic Foundation's Lone Atrial Fibrillation GWAS Study ( phs000820 ) | | | | The Cleveland Clinic Foundation's Lone Atrial ... | `\_studies\The Cleveland Clinic Foundation's Lo...` | 543.0 | True | [TRUE] | 1.0 | | | phenotypes |
| _studies | Women's Health Initiative Clinical Trial and Observational Study ( phs000200 ) | | | | Women's Health Initiative Clinical Trial and O... | `\_studies\Women's Health Initiative Clinical T...` | 143455.0 | True | [TRUE] | 1.0 | | | phenotypes |
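
The helper `get_multiIndex_variablesDict` is imported from `python_lib.utils`, and its implementation is not shown in this notebook. As a hedged sketch of the general idea only, the code below assumes that each backslash-delimited concept path in the `name` column is split into levels that become a pandas MultiIndex. The study name, paths, and the `split_path` function are hypothetical, and the real helper may work differently.

```python
import pandas as pd

# Hypothetical concept paths in the backslash-delimited style of the 'name' column.
names = [
    "\\Study A ( phs000001 )\\Lab results\\Triglycerides exam 1\\",
    "\\Study A ( phs000001 )\\Lab results\\Triglycerides exam 2\\",
    "\\_studies\\Study A ( phs000001 )\\",
]

def split_path(path):
    """Split a concept path on backslashes, dropping empty leading/trailing pieces."""
    return [part for part in path.split("\\") if part]

# One column per path level; shorter paths are padded with missing values.
levels = pd.DataFrame([split_path(n) for n in names])
levels.columns = [f"level_{i}" for i in levels.columns]

# Use the levels as a MultiIndex over the original names.
toy_dict = pd.DataFrame({"name": names})
toy_dict.index = pd.MultiIndex.from_frame(levels)

print(toy_dict.index.get_level_values(2))
```

Paths with fewer levels carry missing values in the deeper index levels, which is why filters on a given level need to handle non-string entries.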


In this example, we are interested in variables related to lipids. We can find all variables whose third concept path level (`level_2`) contains the search term 'lipid' or 'triglyceride' by applying the following filter to the multiIndex dictionary:

In [57]:
# Third level of the concept path; non-string entries (e.g. NaN) never match
level_2 = variablesDict.index.get_level_values(2)
mask_lipid = np.array([isinstance(i, str) and "lipid" in i for i in level_2])
mask_triglyceride = np.array([isinstance(i, str) and "triglyceride" in i for i in level_2])

lipid_vars = variablesDict.loc[mask_lipid | mask_triglyceride, :]  # element-wise OR

lipid_vars.shape

(24, 9)
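
A keyword filter of this kind can also be written with pandas' vectorized string methods. The sketch below is a minimal illustration on a toy index (the labels and counts are made up): a single regular expression with an alternation covers both keywords, and `na=False` treats missing labels as non-matches.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the third index level (level_2); NaN marks paths with fewer levels.
level_2 = pd.Index(["Total lipids exam 1", "triglyceride level", "Height", np.nan])
toy = pd.DataFrame({"observationCount": [10, 20, 30, 40]}, index=level_2)

# One regex alternation covers both keywords; na=False maps missing labels to False.
mask = np.asarray(level_2.str.contains("lipid|triglyceride", na=False), dtype=bool)
matches = toy.loc[mask]

print(matches)
```

The match is case-sensitive, like the `in` check on plain strings; passing `case=False` to `str.contains` would also catch labels such as "Lipid".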

### Identify the longitudinal lipid variables
This block of code does the following:

- uses the multiindex dataframe containing variables which are related to 'lipid' or 'triglyceride'
- filters for variables with keywords 'exam #' or 'visit #'
- extracts the exam number of each variable into column exam_number
- groups variables by study (level_0) and longitudinal variable (longvar)
- returns a table showing the variables that have more than one exam recorded
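
The steps listed above can be sketched on toy data as follows. This is an illustrative outline under stated assumptions, not the notebook's final implementation: the `toy_vars` table, its study and variable names, and the way `longvar` is derived (by stripping the exam/visit label from the description) are all hypothetical.

```python
import pandas as pd

# Toy stand-in for lipid_vars: study (level_0) and variable description.
toy_vars = pd.DataFrame({
    "level_0": ["Study A", "Study A", "Study A", "Study B", "Study B"],
    "simplified_name": [
        "Triglycerides, exam 1",
        "Triglycerides, exam 2",
        "Total lipids, exam 1",
        "Triglycerides visit 3",
        "Lipid medication use",
    ],
})

# Capture the keyword and its number, e.g. "exam 2" -> ("exam", "2"); non-matches give NaN.
pattern = r"(?i)\b(exam|visit)\s*#?\s*(\d+)"
extracted = toy_vars["simplified_name"].str.extract(pattern)
toy_vars["exam_number"] = pd.to_numeric(extracted[1])

# Keep variables that mention an exam or visit, then strip that label
# to obtain a name shared across exams (longvar).
longitudinal = toy_vars.dropna(subset=["exam_number"]).copy()
longitudinal["longvar"] = (
    longitudinal["simplified_name"]
    .str.replace(pattern, "", regex=True)
    .str.strip(" ,")
)

# Count distinct exams per study and variable; keep those with more than one.
exam_counts = (
    longitudinal.groupby(["level_0", "longvar"])["exam_number"]
    .nunique()
    .reset_index(name="n_exams")
)
multi_exam = exam_counts[exam_counts["n_exams"] > 1]

print(multi_exam)
```

In this toy data only "Triglycerides" in "Study A" is recorded at more than one exam, so it is the only row that remains.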

In [37]:
# Sanity check: the pattern should match 'exam' or 'visit' followed by a number, ignoring case
test = "results at ExAm 43 lipids"
print(re.search(r'.* (exam|visit) \d+ .*', test, re.IGNORECASE).group())

results at ExAm 43 lipids


In [56]:
# Intended filter on both keywords (exam or visit), kept for reference:
# longitudinal_lipid_vars = lipid_vars[lipid_vars["name"].str.contains(r'(?:exam|visit) \d+', case=False, na=False)]
# longitudinal_lipid_vars.shape

# Check the 'visit' keyword alone; na=False treats missing names as non-matches
lipid_vars[lipid_vars["name"].str.contains(r'visit \d+', na=False)]

| level_0 | level_1 | level_2 | level_3 | level_4 | simplified_name | name | observationCount | categorical | categoryValues | nb_modalities | min | max | HpdsDataType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
