# Identifying and Extracting Longitudinal Variables using the Python PIC-SURE API

This tutorial notebook demonstrates how to identify and extract longitudinal variables using the Python PIC-SURE API. Longitudinal variables are defined here as variables that have multiple 'Exam' or 'Visit' descriptions within their concept paths.


In this example, we will find the patient-level data for a lipid-related longitudinal variable within the Framingham Heart Study. We will:
1. Identify which longitudinal variables are associated with the keywords of interest (lipid, triglyceride), and how many exams or visits are associated with each one
2. Select a longitudinal variable of interest from a specific study (Framingham Heart Study)
3. Extract patient-level data into a DataFrame where each row represents a patient and each column represents a visit
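
The target layout in step 3 can be illustrated on made-up data. The sketch below does not use the PIC-SURE API: the patient IDs, column names, and values are hypothetical, and it only shows how a long table of (patient, visit, value) records becomes one row per patient and one column per visit with `pandas.DataFrame.pivot`.

```python
import pandas as pd

# Hypothetical long-format records: one row per patient per visit
long_df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3],
    "visit": ["exam 1", "exam 2", "exam 1", "exam 2", "exam 1"],
    "triglycerides": [110.0, 125.0, 98.0, 101.0, 150.0],
})

# Wide format: rows are patients, columns are visits.
# Patient 3 has no "exam 2" record, so that cell is NaN.
wide_df = long_df.pivot(index="patient_id", columns="visit", values="triglycerides")
print(wide_df)
```

Missing visits show up as `NaN`, which makes gaps in follow-up easy to spot.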

For a more basic introduction to the Python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.
 
**Before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/harmonized_lipid_measurements_example/NHLBI_BioData_Catalyst#get-your-security-token).**

## Environment Set-Up

### System Requirements
- Python 3.6 or later
- pip, the Python package manager, already available on most systems with a Python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))

### Install Packages

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureClient
import PicSureBdcAdapter

from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

import re

In [None]:
# Pandas DataFrame display options
pd.set_option("display.max_rows", 100)

# Matplotlib display parameters
plt.rcParams["figure.figsize"] = (14,8)
font = {'weight' : 'bold',
        'size'   : 12}
plt.rc('font', **font)

## Connecting to a PIC-SURE Network
**Again, before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/harmonized_lipid_measurements_example/NHLBI_BioData_Catalyst#get-your-security-token).**

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [None]:
with open(token_file, "r") as f:
    # Strip surrounding whitespace, such as a trailing newline in the token file
    my_token = f.read().strip()

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

## Longitudinal Lipid Variable Example
Example showing how to extract lipid measurements from multiple visits for different cohorts

### Access the data
First, we will create a DataFrame listing the names (concept paths) of all variables we have access to.

In [None]:
fullVariableDict = resource.dictionary().find().keys()
variablesDict = pd.DataFrame(fullVariableDict, columns=['name'])
variablesDict


In this example, we are interested in variables related to lipids. We can find all variables related to the search terms 'lipid' and 'triglyceride' by applying the following filter to the variable dictionary:

In [None]:
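# Flag names containing each keyword (case-insensitive); non-string entries are skipped
mask_lipid = [isinstance(i, str) and "lipid" in i.lower() for i in variablesDict['name']]
mask_triglyceride = [isinstance(i, str) and "triglyceride" in i.lower() for i in variablesDict['name']]

# Element-wise OR; a plain `or` between two lists would return only the first mask
mask_either = [a or b for a, b in zip(mask_lipid, mask_triglyceride)]
lipid_vars = variablesDict.loc[mask_either, :]

lipid_vars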

### Identify the longitudinal lipid variables
This block of code does the following:

- uses the multiindex dataframe containing variables which are related to 'lipid' or 'triglyceride'
- filters for variables with keywords 'exam #' or 'visit #'
- extracts the exam number of each variable into column exam_number
- groups variables by study (level_0) and longitudinal variable (longvar)
- returns a table showing the variables that have more than one exam recorded
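
The steps listed above can be sketched end to end on a handful of made-up concept paths, using only the standard library. The study names and paths below are hypothetical and are not taken from PIC-SURE; the sketch illustrates the logic and is not the notebook's actual implementation.

```python
import re
from collections import defaultdict

# Hypothetical concept paths, for illustration only
paths = [
    "\\Study A\\Lipids\\Triglycerides, exam 1\\",
    "\\Study A\\Lipids\\Triglycerides, exam 2\\",
    "\\Study A\\Lipids\\Triglycerides, exam 3\\",
    "\\Study B\\Labs\\Total lipids at visit 1\\",
    "\\Study B\\Labs\\Lipid medication use\\",
]

# Matches an exam or visit label and captures its number
label = re.compile(r"(exam|visit) (\d+)", re.IGNORECASE)

exams_by_longvar = defaultdict(list)
for path in paths:
    match = label.search(path)
    if match is None:
        continue  # no exam/visit label: not a longitudinal candidate
    longvar = label.sub("", path, count=1)  # path with the label removed
    exams_by_longvar[longvar].append(int(match.group(2)))

# Keep only variables recorded at more than one exam or visit
longitudinal = {k: sorted(v) for k, v in exams_by_longvar.items() if len(v) > 1}
print(longitudinal)
```

Because the study name is part of the concept path, grouping on the label-free path also keeps variables from different studies apart.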

In [None]:
test = "results at ExAm 43 lipids"
# Raw string so that \d is passed to the regex engine unchanged
print(re.search(r'.* (exam|visit) \d+ .*', test, re.IGNORECASE).group())
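
The test pattern above requires a space on both sides of the label, so it would miss a label that starts a name or is directly followed by another character. The check below compares it with a looser pattern on made-up strings (these are not real concept paths), which may help when choosing a pattern for the filter.

```python
import re

strict = re.compile(r".* (exam|visit) \d+ .*", re.IGNORECASE)
loose = re.compile(r"(exam|visit) \d+", re.IGNORECASE)

# Made-up names, for illustration only
samples = [
    "results at ExAm 43 lipids",  # label surrounded by spaces
    "Triglycerides, exam 2\\",    # label followed by a backslash
    "Visit 3 lipid panel",        # label at the start of the name
    "Lipid medication use",       # no label at all
]

strict_hits = [bool(strict.search(s)) for s in samples]
loose_hits = [bool(loose.search(s)) for s in samples]

for s, a, b in zip(samples, strict_hits, loose_hits):
    print(f"{s!r}: strict={a}, loose={b}")
```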

In [None]:
# Keep the concept paths that contain an exam or visit label
lipid_concept_paths = []
for i in lipid_vars['name']:
    if re.search(r'(exam|visit) \d+', i, re.IGNORECASE):
        lipid_concept_paths.append(i)
len(lipid_concept_paths)

#re.search('.* (exam|visit) \d+ .*', lipid_vars["name"], re.IGNORECASE).group()
#lipid_vars[lipid_vars['name'].str.contains('.* (exam|visit) \d+ .*')==True]
#longitudinal_lipid_vars = lipid_vars[lipid_vars["name"].str.contains('(exam|visit) \d+')==True]
#longitudinal_lipid_vars.shape
#lipid_vars[lipid_vars["name"].str.contains('visit \d+')==True]# | lipid_vars["name"].str.contains('visit \d+')==True)]

In [None]:
lipid_concept_paths

In [None]:
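# Copy the filtered rows so that adding columns does not modify a view of lipid_vars
df = lipid_vars[lipid_vars['name'].isin(lipid_concept_paths)].copy()
# Exam or visit label found in the concept path, e.g. "exam 2"
df['exam_number'] = df['name'].str.extract(r'(exam \d+|visit \d+)', flags=re.IGNORECASE, expand=False)
# Concept path with the label removed; regex=True is explicit because
# newer pandas versions treat the pattern as a literal string by default
df['longvar'] = df['name'].str.replace(r'(exam \d+|visit \d+)', '', flags=re.IGNORECASE, regex=True)
# Count exams per longitudinal variable and keep those with more than one exam
dups = df.groupby('longvar').size()
dups = dups[dups > 1]
print(dups)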
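
The `exam_number` column holds text labels such as "exam 2", and text sorts alphabetically, so "exam 10" comes before "exam 2". When visits need to be ordered, for instance as columns of the final patient-by-visit table, the numeric part can serve as the sort key. A minimal sketch on made-up labels, where `visit_number` is a hypothetical helper and not part of any library:

```python
import re

# Made-up labels in the style produced by the extraction step
labels = ["exam 10", "exam 2", "exam 1"]

def visit_number(label):
    """Return the integer that follows 'exam' or 'visit' in a label."""
    return int(re.search(r"\d+", label).group())

alphabetical = sorted(labels)             # "exam 10" sorts before "exam 2"
numeric = sorted(labels, key=visit_number)
print(alphabetical)
print(numeric)
```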
