# Identifying and Extracting Longitudinal Variables using python PIC-SURE API

This tutorial notebook will demonstrate how to identify and extract longitudinal variables using the python PIC-SURE API. Longitudinal variables are defined as containing multiple 'Exam' or 'Visit' descriptions within their concept path.


In this example, we will find the patient-level data for a lipid-related longitudinal variable within the Framingham Heart Study. We will:
1. Identify which longitudinal variables are associated with the keywords of interest (lipid, triglyceride), and how many exams / visits are associated with each one
2. Select a longitudinal variable of interest from a specific study (Framingham Heart Study)
3. Extract patient-level data into a dataframe where each row represents a patient and each column represents a visit

For a more basic introduction to the python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.
 
**Before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token).**

## Environment set-up

### Pre-requisites
* python 3.6 or later
* pip python package manager, already available in most systems with a [python interpreter installed](https://pip.pypa.io/en/stable/installing/)

### Install packages
The first step to using the PIC-SURE API is to install the packages needed. The following code installs the PIC-SURE API components from GitHub, specifically:
* PIC-SURE Client
* PIC-SURE Adapter
* BioData Catalyst PIC-SURE Adapter

In [None]:
import re

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import sys
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git@new-search

import PicSureBdcAdapter

## Connecting to a PIC-SURE resource

The following is required to get access to the PIC-SURE API:
* a network URL
* a user-specific security token

The following code specifies the network URL as the BioData Catalyst Powered by PIC-SURE URL and references the user-specific token saved as `token.txt`.

If you have not already retrieved your user-specific token, please refer to the "Get your security token" section of the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token) file.

In [None]:
# Uncomment production URL when testing in production
# PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
PICSURE_network_URL = "https://biodatacatalyst.integration.hms.harvard.edu/picsure"
token_file = "token.txt"

with open(token_file, "r") as f:
    my_token = f.read().strip()  # drop any trailing newline or whitespace from the token
    
bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)

## Longitudinal Lipid Variable Example
In this example, we will extract lipid measurements from multiple visits for different cohorts.

First, let's search PIC-SURE for all variables that contain 'lipid' or 'triglyceride'. 

In [None]:
lipid_dictionary = bdc.useDictionary().dictionary().find('lipid|triglyceride')
lipid_dataframe = lipid_dictionary.dataframe()
print(lipid_dataframe.shape)
lipid_dataframe.head()


Next, filter to variables from the Framingham Heart Study (phs000007).

In [None]:
filtered_lipid_dataframe = lipid_dataframe[lipid_dataframe.studyId.str.contains('phs000007')]
filtered_lipid_dataframe

In [None]:
filtered_lipid_dataframe.columns

### Identify the longitudinal lipid variables
The code below:

- uses the dataframe containing variables related to 'lipid' or 'triglyceride'
- filters for variables whose description contains the keywords 'exam #' or 'visit #'
- extracts the exam number of each variable into the column `exam_number`
- groups variables by longitudinal variable name (`varname_noexam`)
- returns a table showing which variables have more than one exam recorded

First, let's see which lipid variables appear to be longitudinal by searching for the words 'exam' and 'visit'.

In [None]:
filtered_lipid_dataframe = lipid_dataframe[lipid_dataframe.studyId.str.contains('phs000007')]

# Keep variables whose description mentions an exam or visit; .copy() avoids SettingWithCopyWarning below
filtered_lipid_dataframe = filtered_lipid_dataframe[filtered_lipid_dataframe.description.str.contains('exam|visit', case = False, na = False)].copy()

# Save exam # as exam_number (raw strings keep \d from being treated as a string escape)
filtered_lipid_dataframe['exam_number'] = filtered_lipid_dataframe['description'].str.extract(r'(exam \d+|visit \d+)', flags = re.IGNORECASE, expand = False)
# regex = True is required in pandas 2.0+, where str.replace defaults to literal matching
filtered_lipid_dataframe['exam_number'] = filtered_lipid_dataframe['exam_number'].str.replace(r'(exam|visit)', '', flags = re.IGNORECASE, regex = True).str.strip()
filtered_lipid_dataframe['exam_number'] = filtered_lipid_dataframe['exam_number'].astype('int')

# Save variable name without exam # as varname_noexam
filtered_lipid_dataframe['varname_noexam'] = filtered_lipid_dataframe['description'].str.replace(r'(exam \d+|visit \d+)', '', flags = re.IGNORECASE, regex = True).str.lower()


filtered_lipid_dataframe = filtered_lipid_dataframe[['varId', 'var_name', 'description', 'dataTableId', 'dataTableDescription', 'exam_number', 'varname_noexam']]

filtered_lipid_dataframe = filtered_lipid_dataframe.drop_duplicates(subset=['description', 'exam_number', 'varname_noexam'])
filtered_lipid_dataframe
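
To see what the extraction step does without connecting to PIC-SURE, the sketch below applies the same idea to a small toy table. The descriptions and `varId` values are made up for illustration and are not real PIC-SURE output. It also uses a slightly different pattern that captures only the digits, which is one possible simplification rather than the notebook's exact approach.

```python
import re
import pandas as pd

# Toy variable descriptions (hypothetical, not real PIC-SURE output)
toy = pd.DataFrame({
    "varId": ["v1", "v2", "v3", "v4"],
    "description": [
        "TREATED FOR LIPIDS, EXAM 1",
        "TREATED FOR LIPIDS, EXAM 2",
        "Triglycerides, Visit 3",
        "Lipid panel notes",  # mentions no exam or visit number
    ],
})

# Match "exam N" or "visit N" and capture only the digits
pattern = r"(?:exam|visit)\s+(\d+)"

# Rows without a match get NaN in exam_number
toy["exam_number"] = toy["description"].str.extract(
    pattern, flags=re.IGNORECASE, expand=False
)

# Removing the exam/visit token leaves a name shared across exams
toy["varname_noexam"] = (
    toy["description"]
    .str.replace(pattern, "", flags=re.IGNORECASE, regex=True)
    .str.lower()
)

# Drop rows with no exam number, then convert the digits to integers
toy = toy.dropna(subset=["exam_number"]).astype({"exam_number": int})

# Number of distinct exams recorded for each longitudinal variable name
exam_counts = toy.groupby("varname_noexam")["exam_number"].nunique()
print(exam_counts)
```

Dropping unmatched rows before the integer conversion matters because a description can contain the word "exam" without a number, in which case the conversion would otherwise fail.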


In [None]:
longitudinal_lipid_summary = filtered_lipid_dataframe.pivot(index = 'exam_number', columns = 'varname_noexam', values = 'varId')
longitudinal_lipid_summary.fillna('', inplace=True)
longitudinal_lipid_summary
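
The pivot above produces one row per exam number and one column per longitudinal variable name. As a rough illustration on made-up data (the `varId` values and names below are hypothetical), the following sketch builds the same kind of summary and then counts how many exams each variable covers.

```python
import pandas as pd

# Toy table in the same shape as the filtered dataframe (hypothetical values)
toy = pd.DataFrame({
    "varId": ["v1", "v2", "v3"],
    "exam_number": [1, 2, 1],
    "varname_noexam": [
        "treated for lipids, ",
        "treated for lipids, ",
        "triglycerides, ",
    ],
})

# One row per exam, one column per variable name, cells hold the variable ID
summary = toy.pivot(
    index="exam_number", columns="varname_noexam", values="varId"
).fillna("")

# Count the non-empty cells in each column, i.e. exams per variable
exams_per_variable = (summary != "").sum()

# Variables recorded at more than one exam are the longitudinal candidates
longitudinal = exams_per_variable[exams_per_variable > 1].index.tolist()
print(summary)
print(longitudinal)
```

Note that `DataFrame.pivot` raises a `ValueError` when the same exam number and variable name pair occurs more than once, which is why duplicates are dropped before pivoting.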


Now that we know which longitudinal variables are available to us, we can choose a variable of interest and extract the patient and visit level data associated with it.


## Isolate variables of interest

In this example, we will further investigate the 'treated for lipids' variable, which appears to be the most robust.

We will add all the associated variable IDs to our PIC-SURE query.

To do so, we need the HPDS_PATH for each variable ID.


In [None]:
variable_ids = longitudinal_lipid_summary[['treated for lipids, ']]
hpds_paths = variable_ids.merge(lipid_dataframe[['varId', 'HPDS_PATH']], 
                                left_on = 'treated for lipids, ', 
                                right_on = "varId", 
                                how = 'left')
# Exams with no matching variable have no path after the left merge; drop them
hpds_paths = hpds_paths['HPDS_PATH'].dropna()
hpds_paths
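
The lookup above is a left merge from variable IDs to their paths. The sketch below reproduces it on toy data so the behavior for exams without a variable is visible. All IDs and paths here are invented for illustration.

```python
import pandas as pd

col = "treated for lipids, "

# Toy summary column: exam 3 has no variable, so its cell is an empty string
summary = pd.DataFrame(
    {col: ["v1", "v2", ""]},
    index=pd.Index([1, 2, 3], name="exam_number"),
)

# Toy data dictionary mapping variable IDs to paths (hypothetical values)
dictionary = pd.DataFrame({
    "varId": ["v1", "v2", "v9"],
    "HPDS_PATH": [
        "\\toy\\lipids\\exam1\\",
        "\\toy\\lipids\\exam2\\",
        "\\toy\\other\\",
    ],
})

# A left merge keeps every summary row; unmatched rows get NaN for HPDS_PATH
lookup = summary[[col]].merge(
    dictionary, left_on=col, right_on="varId", how="left"
)

# Keep only the paths that were actually found
paths = lookup["HPDS_PATH"].dropna().tolist()
print(paths)
```

Merging on columns resets the index, so the exam numbers are no longer attached to the rows of `lookup`; keep a copy of the summary if the exam numbers are needed later.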


## Query PIC-SURE for longitudinal variables of interest
First, we will create a new query object.

In [None]:
authPicSure = bdc.useAuthPicSure()

longitudinal_query = authPicSure.query()

We will use the `query.anyof().add()` method. This keeps every input variable as a column in the output, but returns only the participant records that have at least one non-null value among those variables. See the `1_PICSURE_API_101.ipynb` notebook for a more in-depth explanation of query methods.

In [None]:
longitudinal_query.anyof().add(hpds_paths)
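
The "at least one non-null value" rule can be pictured with plain pandas. The sketch below is only an analogy on toy data and does not call the PIC-SURE API; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy participants with two visit columns (hypothetical values)
toy = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "exam_1": ["Yes", np.nan, np.nan],
    "exam_2": [np.nan, "No", np.nan],
})
visit_cols = ["exam_1", "exam_2"]

# Drop a participant only when every visit column is missing,
# which mirrors the "any of" behavior described above
kept = toy.dropna(subset=visit_cols, how="all")
print(kept)
```

Participant 3 has no value at either visit and is excluded, while participants 1 and 2 are kept even though each is missing one visit.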

Next, we retrieve the query results as a dataframe.

In [None]:
longitudinal_results = longitudinal_query.getResultsDataFrame()


Our dataframe contains one column for each exam / visit of the longitudinal variable of interest, with each row representing a patient. To be included in the output, a patient must have at least one reported value across the exams / visits for that variable.

In [None]:
longitudinal_results
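
A wide table with one column per visit is often easier to summarize or plot after reshaping it to one row per patient and visit. The sketch below shows one way to do that with `DataFrame.melt` on toy data. The `Patient ID` column name and the concept paths are assumptions made for illustration, so check the actual column names of `longitudinal_results` before adapting it.

```python
import pandas as pd

# Toy wide-format results: one row per patient, one column per visit
# (column names and values are hypothetical)
wide = pd.DataFrame({
    "Patient ID": [1, 2],
    "\\toy\\lipids\\exam1\\": ["Yes", None],
    "\\toy\\lipids\\exam2\\": ["No", "Yes"],
})

# Reshape to long format: one row per (patient, visit) pair
long_format = wide.melt(
    id_vars="Patient ID", var_name="concept_path", value_name="value"
)

# Drop visits where the patient has no recorded value
long_format = long_format.dropna(subset=["value"])
print(long_format)
```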