# Identifying and Extracting Longitudinal Variables Using the Python PIC-SURE API

This tutorial notebook demonstrates how to identify and extract longitudinal variables using the Python PIC-SURE API. Here, a longitudinal variable is one that was recorded at several exams or visits, which shows up as multiple 'Exam' or 'Visit' descriptions within its concept paths.


In this example, we will find the patient-level data for a lipid-related longitudinal variable within the Framingham Heart Study. We will:
1. Identify which longitudinal variables are associated with the keywords of interest (lipid, triglyceride), and how many exams / visits are associated with each one
2. Select a longitudinal variable of interest from a specific study (Framingham Heart Study)
3. Extract patient-level data into a dataframe where each row represents a patient and each column represents a visit

For a more basic introduction to the Python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.
 
**Before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/harmonized_lipid_measurements_example/NHLBI_BioData_Catalyst#get-your-security-token).**

## Environment Set-Up

### System Requirements
- Python 3.6 or later
- pip, the Python package manager, already available on most systems with a Python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))

### Install Packages

In [1]:
import sys
!{sys.executable} -m pip install -r requirements.txt

Collecting tqdm>=4.38.0
  Downloading tqdm-4.60.0-py2.py3-none-any.whl (75 kB)
Installing collected packages: tqdm
Successfully installed tqdm-4.60.0


In [2]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

Collecting git+https://github.com/hms-dbmi/pic-sure-python-client.git
  Cloning https://github.com/hms-dbmi/pic-sure-python-client.git to /tmp/pip-req-build-0il4_nw0
  Running command git clone -q https://github.com/hms-dbmi/pic-sure-python-client.git /tmp/pip-req-build-0il4_nw0
Building wheels for collected packages: PicSureClient
  Building wheel for PicSureClient (setup.py) ... done
  Created wheel for PicSureClient: filename=PicSureClient-0.1.0-py2.py3-none-any.whl size=10225 sha256=fe6c81d3b2a1ef7e33cd2d4e062c12a790d16b66d304ac0362c77dd9938c3a6c
  Stored in directory: /tmp/pip-ephem-wheel-cache-sit0nsc4/wheels/31/ef/21/e362bba8de04e0072fafec9f77bd1abdf7e166213d27e98729
Successfully built PicSureClient
Installing collected packages: PicSureClient
Successfully installed PicSureClient-0.1.0
Collecting git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
  Cloning https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git to /tmp/pip-req-build-yqo0qdlv
  Runn

In [3]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureClient
import PicSureBdcAdapter

from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

import re

## Connecting to a PIC-SURE Network
**Again, before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/harmonized_lipid_measurements_example/NHLBI_BioData_Catalyst#get-your-security-token).**

In [5]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [6]:
# Read the security token and drop any trailing newline or surrounding whitespace
with open(token_file, "r") as f:
    my_token = f.read().strip()

In [7]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

|        certificates to be acceptable for connections.  This may be useful for           |
|        working in a development environment or on systems that host public              |
|        data.  BEST SECURITY PRACTICES ARE THAT IF YOU ARE WORKING WITH SENSITIVE        |
|        DATA THEN ALL SSL CERTS BY THOSE EVIRONMENTS SHOULD NOT BE SELF-SIGNED.          |
+--------------------------------------+------------------------------------------------------
|  Resource UUID                       |  Resource Name                                  
+--------------------------------------+------------------------------------------------------
| 02e23f52-f354-4e8b-992c-d37c8b9ba140
| 70c837be-5ffc-11eb-ae93-0242ac130002
+--------------------------------------+------------------------------------------------------


## Longitudinal Lipid Variable Example
This example shows how to extract lipid measurements recorded at multiple visits for different cohorts.

### Access the data
First, we will create a dataframe listing the concept paths of all the variables we have access to.

In [8]:
fullVariableDict = resource.dictionary().find().keys()
variablesDict = pd.DataFrame(fullVariableDict, columns=['name'])
variablesDict

Unnamed: 0,name
0,\Multi-Ethnic Study of Atherosclerosis (MESA) ...
1,\Framingham Cohort ( phs000007 )\Tests\ECG\TRE...
2,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
3,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
4,\NHLBI Atherosclerosis Risk in Communities (AR...
...,...
122414,Gene_with_variant
122415,Variant_class
122416,Variant_consequence_calculated
122417,Variant_frequency_as_text
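
Each concept path is a backslash-delimited string whose first level is the study name. As a minimal sketch (not part of the PIC-SURE API), the levels can be separated with pandas string methods. The paths below are toy values: `SOME ECG VARIABLE` is a hypothetical name, and the `level_i` column names are chosen here for illustration only.

```python
import pandas as pd

# Toy concept paths in the same backslash-delimited style as the dictionary above.
# "SOME ECG VARIABLE" is a made-up placeholder, not a real variable name.
paths = pd.Series([
    "\\Framingham Cohort ( phs000007 )\\Lab Work\\Blood\\Lipids\\RELATIVE WEIGHT, EXAM 1\\",
    "\\Framingham Cohort ( phs000007 )\\Tests\\ECG\\SOME ECG VARIABLE\\",
    "Gene_with_variant",
], name="name")

# Strip the leading/trailing backslashes, then split on the remaining separators.
# Paths with fewer levels are padded with missing values on the right.
levels = paths.str.strip("\\").str.split("\\", expand=True)
levels.columns = [f"level_{i}" for i in range(levels.shape[1])]

print(levels)
```

Under these assumptions, `level_0` holds the study name and the last non-missing level holds the variable description, which is where the 'Exam' or 'Visit' label usually appears.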


In this example, we are interested in variables related to lipids. We can find all variables whose concept path contains the search term 'lipid' or 'triglyceride' (case-insensitive) by applying the following filter to the variable dataframe:

In [10]:
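# Case-insensitive keyword masks; non-string entries never match
mask_lipid = [isinstance(i, str) and "lipid" in i.lower() for i in variablesDict['name']]
mask_triglyceride = [isinstance(i, str) and "triglyceride" in i.lower() for i in variablesDict['name']]
# Combine element-wise: a plain `or` between two lists returns the first non-empty list unchanged
lipid_vars = variablesDict.loc[np.logical_or(mask_lipid, mask_triglyceride), :]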
lipid_vars

Unnamed: 0,name
34,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
85,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
90,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
119,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
139,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
...,...
122375,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
122376,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
122405,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
122406,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
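
The keyword filter can also be written with pandas' vectorised string methods, where two boolean Series are combined element-wise with `|`. This is a minimal sketch on a toy dataframe; the concept paths under `Study A` are invented for illustration and do not come from the PIC-SURE dictionary.

```python
import pandas as pd

# Toy variable dictionary (hypothetical concept paths).
variables = pd.DataFrame({"name": [
    "\\Study A\\Lab Work\\Blood\\Lipids\\TOTAL LIPIDS, EXAM 7\\",
    "\\Study A\\Lab Work\\Blood\\TRIGLYCERIDES, VISIT 2\\",
    "\\Study A\\Tests\\ECG\\HEART RATE, EXAM 1\\",
    "Variant_class",
]})

# Lower-case once, then test each keyword as a literal substring.
names = variables["name"].astype(str).str.lower()
mask_lipid = names.str.contains("lipid", regex=False)
mask_trig = names.str.contains("triglyceride", regex=False)

# `|` is the element-wise OR for boolean Series.
matches = variables.loc[mask_lipid | mask_trig]
print(matches)
```

Only the first two toy rows are kept: the first matches 'lipid' and the second matches 'triglyceride'.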


### Identify the longitudinal lipid variables
The following blocks of code do the following:

- start from the dataframe of variables related to 'lipid' or 'triglyceride'
- keep the concept paths that contain 'exam #' or 'visit #' (case-insensitive)
- extract the exam or visit label of each variable into the column `exam_number`
- strip that label from the concept path to obtain the longitudinal variable (`longvar`), which still includes the study name
- count the exams recorded for each `longvar` and return a table sorted by that count, so that variables with more than one exam appear first
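
The steps above can be sketched end to end on a toy dataframe. This is an illustration under stated assumptions rather than the notebook's actual data: the `Study A` and `Study B` concept paths are invented, and `groupby(...).size()` is used here as one possible way to count exams.

```python
import re
import pandas as pd

# Toy lipid-related concept paths (hypothetical).
toy = pd.DataFrame({"name": [
    "\\Study A\\Lipids\\TOTAL LIPIDS, EXAM 1\\",
    "\\Study A\\Lipids\\TOTAL LIPIDS, EXAM 2\\",
    "\\Study A\\Lipids\\TOTAL LIPIDS, EXAM 3\\",
    "\\Study B\\Lipids\\Triglycerides at Visit 1\\",
    "\\Study B\\Lipids\\Lipid medication use\\",
]})

pattern = r"(?:exam|visit) \d+"

# 1. Keep paths that mention an exam or visit number.
has_visit = toy["name"].str.contains(pattern, flags=re.IGNORECASE, regex=True)
long_df = toy.loc[has_visit].copy()

# 2. Extract the label and 3. strip it to get the shared variable name.
long_df["exam_number"] = long_df["name"].str.extract(
    f"({pattern})", flags=re.IGNORECASE, expand=False)
long_df["longvar"] = long_df["name"].str.replace(
    pattern, "", flags=re.IGNORECASE, regex=True)

# 4. Count exams per longitudinal variable, most exams first.
n_exams = (long_df.groupby("longvar").size()
           .rename("n_exam").sort_values(ascending=False).reset_index())

# 5. Variables recorded at more than one exam.
multi_exam = n_exams[n_exams["n_exam"] > 1]
print(multi_exam)
```

In this toy case only the `TOTAL LIPIDS` variable of `Study A` has more than one exam, so it is the only row left in `multi_exam`.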

In [11]:
lipid_concept_paths = []
for i in lipid_vars['name']:
    # Keep concept paths that mention an exam or visit number, e.g. "EXAM 6"
    if re.search(r'(exam|visit) \d+', i, re.IGNORECASE):
        lipid_concept_paths.append(i)
len(lipid_concept_paths)

674

In [12]:
lipid_concept_paths

['\\Framingham Cohort ( phs000007 )\\Lab Work\\Blood\\Lipids\\X-RAY: ARTHRITIS, GOUTY, EXAM 6\\',
 '\\Framingham Cohort ( phs000007 )\\Lab Work\\Blood\\Lipids\\X-RAY: ARTHRITIS, GOUTY, EXAM 7\\',
 '\\Framingham Cohort ( phs000007 )\\Lab Work\\Blood\\Lipids\\INTERIM HISTORY OF PERSISTENT COUGH, EXAM 4\\',
 '\\Framingham Cohort ( phs000007 )\\Lab Work\\Blood\\Lipids\\X-RAY: ABNORMALITY OF AORTA, EXAM 4\\',
 '\\Framingham Cohort ( phs000007 )\\Tests\\X-ray\\BLOOD ANALYSIS: TOTAL LIPIDS, EXAM 7\\',
 "\\Framingham Cohort ( phs000007 )\\Lab Work\\Blood\\Lipids\\EXAMINER'S OPINION: NEUROCIRCULATORY ASTHENIA PRESENT, EXAM 4\\",
 '\\Framingham Cohort ( phs000007 )\\Lab Work\\Blood\\Lipids\\PRESENT HISTORY OF SMOKING: PORTION OF CIGARETTE SMOKED, EXAM 7\\',
 '\\Framingham Cohort ( phs000007 )\\Lab Work\\Blood\\Lipids\\PRESENT HISTORY OF SMOKING: NUMBER OF CIGARETTES/DAY, EXAM 7\\',
 '\\NHLBI Atherosclerosis Risk in Communities (ARIC) Candidate Gene Association Resource (CARe) ( phs000280 )\\Lipi

In [13]:
# Work on an explicit copy so that adding columns does not trigger SettingWithCopyWarning
df = lipid_vars[lipid_vars['name'].isin(lipid_concept_paths)].copy()
# Exam or visit label, e.g. "EXAM 6"
df['exam_number'] = df['name'].str.extract(r'(exam \d+|visit \d+)', flags=re.IGNORECASE, expand=False)
# Concept path with the label removed; shared by all exams of one longitudinal variable
df['longvar'] = df['name'].str.replace(r'(exam \d+|visit \d+)', '', flags=re.IGNORECASE, regex=True)
df


Unnamed: 0,name,exam_number,longvar
139,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,EXAM 6,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
226,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,EXAM 7,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
1014,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,EXAM 4,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
1142,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,EXAM 4,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
1415,\Framingham Cohort ( phs000007 )\Tests\X-ray\B...,EXAM 7,\Framingham Cohort ( phs000007 )\Tests\X-ray\B...
...,...,...,...
120692,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,EXAM 7,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
121013,\NHLBI Atherosclerosis Risk in Communities (AR...,Exam 1,\NHLBI Atherosclerosis Risk in Communities (AR...
121123,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,EXAM 7,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
122232,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,EXAM 2,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...
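
The `exam_number` column holds text labels such as `EXAM 7`, which sort alphabetically rather than chronologically (`EXAM 10` would come before `EXAM 2`). A minimal sketch of pulling out the integer part for ordering is shown below. The labels are toy values and the variable names are chosen for illustration.

```python
import re
import pandas as pd

# Toy exam/visit labels in mixed case, as they appear in concept paths.
labels = pd.Series(["EXAM 10", "Exam 2", "VISIT 1", "EXAM 7"])

# Capture only the digits that follow "exam" or "visit" and convert to integers.
numbers = labels.str.extract(
    r"(?:exam|visit) (\d+)", flags=re.IGNORECASE, expand=False).astype(int)

# Reorder the labels by their numeric exam/visit number.
ordered = labels.loc[numbers.sort_values().index]
print(ordered.tolist())
```

The conversion with `astype(int)` assumes every label matches the pattern; labels without a number would need to be filtered out first.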


In [34]:
n_exams = pd.DataFrame(
    df.pivot_table(index=['longvar'], aggfunc='size'), 
    columns=['n_exam']).sort_values(by='n_exam', ascending=False).reset_index()
n_exams

Unnamed: 0,longvar,n_exam
0,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,7
1,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,7
2,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,7
3,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,7
4,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,7
...,...,...
403,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,1
404,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,1
405,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,1
406,\Framingham Cohort ( phs000007 )\Lab Work\Bloo...,1


In [35]:
my_variable = n_exams['longvar'][0]
print(my_variable)

\Framingham Cohort ( phs000007 )\Lab Work\Blood\Lipids\RELATIVE WEIGHT, \


In [44]:
# Select the concept paths of every exam that belongs to the chosen longitudinal variable
query_vars = df.loc[[isinstance(i, str) and my_variable in i for i in df['longvar']], 'name']
query_vars

115480    \Framingham Cohort ( phs000007 )\Lab Work\Bloo...
115491    \Framingham Cohort ( phs000007 )\Lab Work\Bloo...
115492    \Framingham Cohort ( phs000007 )\Lab Work\Bloo...
115512    \Framingham Cohort ( phs000007 )\Lab Work\Bloo...
115526    \Framingham Cohort ( phs000007 )\Lab Work\Bloo...
115527    \Framingham Cohort ( phs000007 )\Lab Work\Bloo...
115539    \Framingham Cohort ( phs000007 )\Lab Work\Bloo...
Name: name, dtype: object

In [45]:
my_query = resource.query()
my_query.anyof().add(query_vars)

<PicSureHpdsLib.PicSureHpdsAttrListKeys.AttrListKeys at 0x7f84873fe860>

In [46]:
query_result = my_query.getResultsDataFrame(low_memory=False)

In [47]:
query_result

Unnamed: 0,Patient ID,"\Framingham Cohort ( phs000007 )\Lab Work\Blood\Lipids\RELATIVE WEIGHT, EXAM 1\","\Framingham Cohort ( phs000007 )\Lab Work\Blood\Lipids\RELATIVE WEIGHT, EXAM 2\","\Framingham Cohort ( phs000007 )\Lab Work\Blood\Lipids\RELATIVE WEIGHT, EXAM 3\","\Framingham Cohort ( phs000007 )\Lab Work\Blood\Lipids\RELATIVE WEIGHT, EXAM 4\","\Framingham Cohort ( phs000007 )\Lab Work\Blood\Lipids\RELATIVE WEIGHT, EXAM 5\","\Framingham Cohort ( phs000007 )\Lab Work\Blood\Lipids\RELATIVE WEIGHT, EXAM 6\","\Framingham Cohort ( phs000007 )\Lab Work\Blood\Lipids\RELATIVE WEIGHT, EXAM 7\",\_Parent Study Accession with Subject ID\,\_Topmed Study Accession with Subject ID\,\_consents\
0,54641,105.0,109.0,113.0,114.0,93.0,101.0,99.0,phs000007.v30_1,,phs000007.c1
1,54643,102.0,102.0,103.0,99.0,101.0,98.0,96.0,phs000007.v30_3,,phs000007.c1
2,54644,112.0,104.0,104.0,110.0,115.0,118.0,116.0,phs000007.v30_4,,phs000007.c1
3,54646,105.0,106.0,103.0,106.0,107.0,109.0,106.0,phs000007.v30_7,,phs000007.c1
4,54652,101.0,101.0,99.0,100.0,103.0,103.0,101.0,phs000007.v30_16,,phs000007.c1
...,...,...,...,...,...,...,...,...,...,...,...
5045,71762,85.0,86.0,87.0,90.0,83.0,84.0,89.0,phs000007.v30_26789,phs000974.v3_26789,phs000007.c1
5046,71766,97.0,100.0,100.0,99.0,96.0,95.0,95.0,phs000007.v30_26797,,phs000007.c1
5047,71770,99.0,109.0,108.0,116.0,113.0,114.0,111.0,phs000007.v30_26801,,phs000007.c1
5048,71774,76.0,71.0,79.0,78.0,77.0,79.0,79.0,phs000007.v30_26808,,phs000007.c1
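
The full concept paths make for unwieldy column names. As a hedged sketch of a possible post-processing step, the exam columns can be renamed to their exam number and, if needed, reshaped to one row per patient and exam. The toy dataframe below reuses the first two patients and the first two exams from the output above; the names `wide`, `visits`, `long_format`, and `relative_weight` are chosen here for illustration.

```python
import re
import pandas as pd

prefix = "\\Framingham Cohort ( phs000007 )\\Lab Work\\Blood\\Lipids\\RELATIVE WEIGHT, "

# Small stand-in for the query result (two patients, two exams).
wide = pd.DataFrame({
    "Patient ID": [54641, 54643],
    prefix + "EXAM 1\\": [105.0, 102.0],
    prefix + "EXAM 2\\": [109.0, 102.0],
    "\\_consents\\": ["phs000007.c1", "phs000007.c1"],
})

# Keep only the exam columns and rename each one to its integer exam number.
exam_cols = [c for c in wide.columns if re.search(r"exam \d+", c, re.IGNORECASE)]
rename = {c: int(re.search(r"exam (\d+)", c, re.IGNORECASE).group(1)) for c in exam_cols}
visits = wide.set_index("Patient ID")[exam_cols].rename(columns=rename)

# Optional: long format with one row per (patient, exam) pair.
long_format = visits.reset_index().melt(
    id_vars="Patient ID", var_name="exam", value_name="relative_weight")
print(long_format)
```

The wide table `visits` keeps one row per patient and one column per exam, while `long_format` is often more convenient for plotting trajectories or fitting models over time.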
