# Identifying and Extracting Longitudinal Variables using the Python PIC-SURE API

This tutorial notebook demonstrates how to identify and extract longitudinal variables using the Python PIC-SURE API. Here, a longitudinal variable is one recorded at multiple exams or visits, identified by an 'Exam' or 'Visit' description within its concept paths.


In this example, we will find the patient-level data for a lipid-related longitudinal variable within the Framingham Heart Study. We will:
1. Identify which longitudinal variables are associated with the keywords of interest (lipid, triglyceride), and how many exams / visits are associated with each one
2. Select a longitudinal variable of interest from a specific study (Framingham Heart Study)
3. Extract patient-level data into a dataframe where each row represents a patient and each column represents a visit

For a more basic introduction to the Python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.
 
**Before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/harmonized_lipid_measurements_example/NHLBI_BioData_Catalyst#get-your-security-token).**

## Environment Set-Up

### System Requirements
- Python 3.6 or later
- pip, the Python package manager, already available on most systems with a Python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))

### Install Packages

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureClient
import PicSureBdcAdapter

from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

import re

## Connecting to a PIC-SURE Network
**Again, before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/harmonized_lipid_measurements_example/NHLBI_BioData_Catalyst#get-your-security-token).**

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [None]:
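with open(token_file, "r") as f:
    # Strip surrounding whitespace so a trailing newline in the file is not sent as part of the token
    my_token = f.read().strip()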

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

## Longitudinal Lipid Variable Example
This example shows how to extract lipid measurements recorded at multiple visits for different cohorts.

### Access the data
First, we will create a variable dictionary of all variables we have access to.

In [None]:
# Search the data dictionary for the study of interest and keep the matching concept paths
fullVariableDict = resource.dictionary().find("NHLBI Atherosclerosis Risk in Communities (ARIC) Candidate Gene Association Resource (CARe) ( phs000280 )").keys()
#fullVariableDict
#variablesDict = get_multiIndex_variablesDict(fullVariableDict)
# Store the concept paths in a single-column dataframe
variablesDict = pd.DataFrame(list(fullVariableDict), columns=['name'])
variablesDict

In this example, we are interested in variables related to lipids. We can find all variables related to the search terms 'lipid' and 'triglyceride' by applying the following filter to the variable dictionary:

In [None]:
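mask_lipid = [type(i) == str and "lipid" in i.lower() for i in variablesDict['name']]
mask_triglyceride = [type(i) == str and "triglyceride" in i.lower() for i in variablesDict['name']]
# Combine the masks element-wise; `or` between two lists would simply return the first non-empty list
mask_any = [a or b for a, b in zip(mask_lipid, mask_triglyceride)]
lipid_vars = variablesDict.loc[mask_any,:]
lipid_vars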
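
As a self-contained illustration of the same keyword filter, the sketch below uses pandas string methods on a few made-up concept paths. The paths and the `toy_` names are hypothetical and only serve to show how two boolean masks combine element-wise.

```python
import pandas as pd

# Hypothetical concept paths, for illustration only
toy_dict = pd.DataFrame({"name": [
    "\\study\\LIPID PROFILE EXAM 1\\",
    "\\study\\Triglycerides, visit 2\\",
    "\\study\\Height at exam 1\\",
]})

# Case-insensitive substring match for each keyword
is_lipid = toy_dict["name"].str.contains("lipid", case=False, regex=False)
is_trig = toy_dict["name"].str.contains("triglyceride", case=False, regex=False)

# `|` combines the two boolean Series element-wise
toy_lipid_vars = toy_dict[is_lipid | is_trig]
print(toy_lipid_vars["name"].tolist())
```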

### Identify the longitudinal lipid variables
The code in this section:

- uses the dataframe containing variables which are related to 'lipid' or 'triglyceride'
- filters for variables with keywords 'exam #' or 'visit #'
- extracts the exam number of each variable into column `exam_number`
- groups variables by longitudinal variable (`longvar`)
- returns a table showing how many exams are recorded for each longitudinal variable

First, the longitudinal concept paths that have exam or visit number information will be saved to `longitudinal_concept_paths`.

In [None]:
longitudinal_concept_paths = []
for i in lipid_vars['name']:
    # One raw-string pattern covers both keywords, so a path is never appended twice
    if re.search(r'(exam|visit) \d+', i, re.IGNORECASE):
        longitudinal_concept_paths.append(i)
len(longitudinal_concept_paths)

Now `longitudinal_concept_paths` will be used to extract the longitudinal variables from the `lipid_vars` dataframe. The exam or visit number will be extracted into the `exam_number` column and the longitudinal variable will be saved to the `longvar` column.

In [None]:
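long_info = lipid_vars[lipid_vars['name'].isin(longitudinal_concept_paths)].reset_index()
# expand=False returns a Series, which can be assigned directly to a column
long_info['exam_number'] = long_info['name'].str.extract(r'(exam \d+|visit \d+)', flags=re.IGNORECASE, expand=False)
# regex=True is required explicitly in recent pandas versions, where str.replace defaults to literal matching
long_info['longvar'] = long_info['name'].str.replace(r'(exam \d+|visit \d+)', 'exam', flags=re.IGNORECASE, regex=True).str.lower()
long_info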
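
To make the two string operations concrete, here is a minimal sketch on made-up concept paths (the paths and the `toy` dataframe are hypothetical). It shows how the exam / visit label is pulled out and how replacing it with a common placeholder gives repeated measurements the same `longvar` key.

```python
import re
import pandas as pd

# Hypothetical concept paths, for illustration only
toy = pd.DataFrame({"name": [
    "\\study\\Lipids\\Triglycerides Exam 1\\",
    "\\study\\Lipids\\Triglycerides Exam 2\\",
    "\\study\\Lipids\\HDL visit 3\\",
]})

pattern = r"(exam \d+|visit \d+)"

# Pull out the matched exam / visit label
toy["exam_number"] = toy["name"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Replace the label with a common placeholder so repeated measurements share one key
toy["longvar"] = (
    toy["name"]
    .str.replace(pattern, "exam", flags=re.IGNORECASE, regex=True)
    .str.lower()
)
print(toy[["exam_number", "longvar"]])
```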

To find the longitudinal variables with the greatest number of exams, a new dataframe `longitudinal_lipid_vars` is used to display the `longvar` and number of exams (`n_exam`). 

In [None]:
longitudinal_lipid_vars = pd.DataFrame(
    long_info.pivot_table(index=['longvar'], aggfunc='size'), 
    columns=['n_exam']).sort_values(by='n_exam', ascending=False).reset_index()
longitudinal_lipid_vars
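
The same kind of count can also be produced with `groupby`, which makes it easy to keep only the variables recorded at more than one exam. This is a minimal sketch on a made-up table; the `longvar` values and the `toy_` names are hypothetical.

```python
import pandas as pd

# Hypothetical table with the same `longvar` column as `long_info`
toy_long_info = pd.DataFrame({
    "longvar": [
        "triglycerides exam", "triglycerides exam", "triglycerides exam",
        "hdl exam", "hdl exam",
        "ldl exam",
    ],
})

# Count the rows (exams / visits) per longitudinal variable, largest first
toy_counts = (
    toy_long_info.groupby("longvar")
    .size()
    .reset_index(name="n_exam")
    .sort_values(by="n_exam", ascending=False)
    .reset_index(drop=True)
)

# Keep only variables that were recorded at more than one exam
multi_exam = toy_counts[toy_counts["n_exam"] > 1]
print(multi_exam)
```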

Now that we know which longitudinal variables are available to us, we can choose a variable of interest and extract the patient and visit level data associated with it.

However, note that the `longvar` we extracted is not equivalent to the actual PIC-SURE concept path needed to query for this variable, so the table above cannot be used by itself to retrieve the data of interest. To build the query, we need the original concept paths stored in the `name` column. If needed, the variables can also be filtered for a specific study before a longitudinal variable is selected.

### Isolate variables of interest

In this example, we will choose to further investigate the first longitudinal variable in the `longitudinal_lipid_vars` dataframe we generated above.

In [None]:
my_variable = longitudinal_lipid_vars['longvar'][0]
print(my_variable)

To add the longitudinal variable of interest to our PIC-SURE query, we need to find the original concept paths that correspond to it in the tables we created before (`variablesDict` and `long_info`).

In [None]:
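# `longvar` replaced every exam / visit label with 'exam', so we split on that placeholder
# (the expression `'exam' or 'visit'` always evaluates to 'exam')
keywords = my_variable.split('exam')
keywords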

In [None]:
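def check_keywords(variable, keywords):
    """Return, for each concept path in `variable`, how many of `keywords` it contains."""
    final_result = []
    for var in variable:
        result = 0
        var = var.lower()  # keywords come from the lowercased `longvar`
        for i in keywords:
            if i in var:
                result += 1
        final_result.append(result)
    return final_result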

In [None]:
test_val = check_keywords(variablesDict['name'], keywords)
variablesDict['test_val'] = test_val
mask = variablesDict['test_val'] == len(keywords)
query_vars = variablesDict[mask]
qvars = query_vars['name']

In [None]:
query_vars = long_info.loc[[type(i) == str and my_variable in i.lower() for i in long_info['longvar']], 'name']
query_vars

The resulting `query_vars` variable contains the variables we will want to add to our query. 

### Create & run query
First, we will create a new query object.

In [None]:
my_query = resource.query()

We will use the `query.anyof().add()` method. This includes all of the input variables in the output, but only returns patient records that contain at least one non-null value for those variables. See the `1_PICSURE_API_101.ipynb` notebook for a more in-depth explanation of query methods.

In [None]:
my_query.anyof().add(query_vars)
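
To illustrate the "at least one non-null value" rule locally, the sketch below reproduces it on a made-up table with plain pandas. It only mimics the behavior described above and is not how PIC-SURE evaluates the query; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements: one row per patient, one column per exam
toy_values = pd.DataFrame(
    {
        "patient_id": [1, 2, 3, 4],
        "trig_exam_1": [110.0, np.nan, np.nan, 95.0],
        "trig_exam_2": [np.nan, 130.0, np.nan, 99.0],
    }
).set_index("patient_id")

# Keep patients with at least one non-null value across the requested variables
kept = toy_values.dropna(how="all")
print(kept.index.tolist())
```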

#### Update consent codes if necessary
Uncomment the code below and run it as necessary to restrict your query to certain consent codes.
As written, the example would restrict the query to the 'phs000179.c2' consent code.

In [None]:
# Delete current consents
#my_query.filter().delete("\\_consents\\")

# Add new consents
#my_query.filter().add("\\_consents\\", ['phs000179.c2'])

We can now run our query:

In [None]:
query_result = my_query.getResultsDataFrame(low_memory=False)

Our dataframe contains one column for each exam / visit of the longitudinal variable of interest, with each row representing a patient. To be included in the output, a patient must have at least one reported value for one of these exams / visits.

In [None]:
query_result
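
For many longitudinal analyses, a long format (one row per patient and visit) is easier to work with than the wide table above. The following is a minimal sketch on made-up data; the column names, including `patient_id`, are assumptions and will differ in a real `query_result`.

```python
import re
import pandas as pd

# Hypothetical wide table: one row per patient, one column per exam
toy_result = pd.DataFrame({
    "patient_id": [1, 2],
    "\\study\\Triglycerides exam 1\\": [110.0, None],
    "\\study\\Triglycerides exam 2\\": [120.0, 130.0],
})

# Reshape to one row per patient and concept path
long_df = toy_result.melt(id_vars="patient_id", var_name="concept_path", value_name="value")

# Recover the exam / visit number from the concept path
long_df["visit"] = (
    long_df["concept_path"]
    .str.extract(r"(?:exam|visit) (\d+)", flags=re.IGNORECASE, expand=False)
    .astype(int)
)

# Drop visits without a measurement and order by patient and visit
long_df = (
    long_df.dropna(subset=["value"])
    .sort_values(["patient_id", "visit"])
    .reset_index(drop=True)
)
print(long_df[["patient_id", "visit", "value"]])
```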