# Identifying and Extracting Longitudinal Variables Using the Python PIC-SURE API for *NHLBI BioData Catalyst® (BDC)*

This tutorial notebook demonstrates how to identify and extract longitudinal variables using the Python PIC-SURE API. Here, a variable is considered longitudinal if it has multiple 'Exam' or 'Visit' descriptions within its concept path.


In this example, we will find the patient-level data for a lipid-related longitudinal variable within the Framingham Heart Study. We will:
1. Identify which longitudinal variables are associated with the keywords of interest (lipid, triglyceride), and how many exams / visits are associated with each one
2. Select a longitudinal variable of interest from a specific study (Framingham Heart Study)
3. Extract patient-level data into a dataframe where each row represents a patient and each column represents a visit

For a more basic introduction to the Python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.
 
**Before running this notebook, please be sure to review the "Get your security token" documentation, which exists in the [`README.md` file](../README.md). It explains how to get a security token, which is mandatory to use the PIC-SURE API.**

To set up your token file, be sure to run the [`Workspace_setup.ipynb` file](./Workspace_setup.ipynb).

## Environment set-up

### System requirements
- Python 3.6 or later
- pip, the Python package manager, which is already available on most systems with a Python interpreter installed

### Install packages

**Note that if you are using the dedicated PIC-SURE environment within the *BDC Powered by Seven Bridges* platform, the necessary packages have already been installed.**

In [None]:
import sys
import pandas as pd
import re
import numpy as np
# BDC Powered by Terra users uncomment the following line to specify package install location
# sys.path.insert(0, r"/home/jupyter/.local/lib/python3.7/site-packages")

In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import PicSureClient
import PicSureBdcAdapter

## Connecting to a PIC-SURE network

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
token_file = "token.txt"

with open(token_file, "r") as f:
    my_token = f.read()
    
bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)

## Longitudinal Lipid Variable Example
<font color='darkgreen'>**Goal: Extract lipid measurements from multiple visits. In this example, we will focus on the Framingham Heart Study (phs000007).**</font> 

In this notebook example, we will:
1. Identify lipid-related variables in the Framingham Heart Study
2. Identify which lipid variables are measured over time, for example across multiple visits or exams
3. Identify which longitudinal lipid variable(s) are of interest
4. Query PIC-SURE for the longitudinal lipid variable(s) of interest


### Identify lipid-related variables in the Framingham Heart Study

First, let's search the data dictionary in PIC-SURE. We will use a regular expression for the search term: `lipid|triglyceride`. This allows us to find all variables related to `lipid` *or* `triglyceride`. 

In [None]:
lipid_dictionary = bdc.useDictionary().dictionary().find('lipid|triglyceride')
lipid_dataframe = lipid_dictionary.dataframe()
print(lipid_dataframe.shape)
lipid_dataframe.head()
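
To illustrate what the `|` alternation in the search term does, here is a minimal sketch using Python's built-in `re` module on a few made-up descriptions. The descriptions are hypothetical, and the case-insensitive flag is an assumption chosen for this illustration; the PIC-SURE dictionary search may handle case differently.

```python
import re

# Hypothetical variable descriptions, invented for illustration only
descriptions = [
    "Treated for lipids, exam 1",
    "Triglycerides, visit 3",
    "Systolic blood pressure, exam 2",
]

# `|` is regex alternation: a description matches if it contains either term
pattern = re.compile(r"lipid|triglyceride", flags=re.IGNORECASE)
matches = [d for d in descriptions if pattern.search(d)]
print(matches)
```

Only the first two descriptions are kept, because the third contains neither term.
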

We are interested in variables from the Framingham Heart Study. The PHS number associated with this study is `phs000007`. If you don't know the PHS number for a study of interest, you can check the Data Access Dashboard in the PIC-SURE [User Interface](https://picsure.biodatacatalyst.nhlbi.nih.gov/psamaui/login).

Here, we filter our variables dataframe to only include those where the studyId matches our PHS number of interest.

In [None]:
filtered_lipid_dataframe = lipid_dataframe[lipid_dataframe.studyId.str.contains('phs000007')]
filtered_lipid_dataframe

As you can see, there are a number of variables in the Framingham Heart Study which are related to lipids or triglycerides. In this case study, we are interested specifically in `longitudinal` data, or variables which have been measured over time. 

### Identify the longitudinal lipid variables
To identify which lipid variables are measured over time, we will take advantage of the keywords `exam` and `visit`. A brief review of our lipid variables in the Framingham Heart Study shows that many descriptions contain an exam or visit number, indicating that those variables are longitudinal.


First, we will filter our dataframe of Framingham Heart Study variables related to `lipid` or `triglyceride` down to those whose description contains the keyword `exam` or `visit`.

In [None]:
filtered_lipid_dataframe = filtered_lipid_dataframe[filtered_lipid_dataframe.description.str.contains('exam|visit', case = False)].reset_index()

Next, we will extract the exam or visit number of each variable into column `exam_number`. In some cases, there may be variables that have the word "exam" or "visit" in the description, but do not have a visit or exam number. An example of this could be: "Since your last exam, have you had a lipid panel?"

We will remove these variables.

In [None]:
# Save the exam # as exam_info: convert the description to lowercase, then extract the exam/visit number
# If there is no visit number, use -1 as the visit
# (raw string avoids invalid-escape warnings for \d in newer Python versions)
exam_info = filtered_lipid_dataframe['description'].str.lower().str.extract(r'(exam \d+|visit \d+)')[0].fillna("-1")

# Remove "exam" or "visit" text to only get number, convert to integer
exam_info = list(exam_info.str.replace('(exam|visit)', '', regex=True).astype('int'))
exam_info = pd.DataFrame(data=exam_info, columns=['exam_number'])

# Save exam_info as new column "exam_number" in dataframe and drop those entries with no exam number, or exam number = -1
filtered_lipid_dataframe = pd.concat([filtered_lipid_dataframe, exam_info], axis=1)
filtered_lipid_dataframe = filtered_lipid_dataframe[filtered_lipid_dataframe.exam_number != -1]
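
The extraction step can also be sketched on a small toy dataframe. This variant is an alternative to the `-1` placeholder used above, not the notebook's own method: it captures only the digits and lets descriptions without a number become missing values, which are then dropped. The descriptions are made up for illustration.

```python
import pandas as pd

# Hypothetical descriptions; the last one mentions "exam" but has no exam number
toy = pd.DataFrame({"description": [
    "Treated for lipids, Exam 1",
    "Treated for lipids, Exam 12",
    "Triglycerides, Visit 3",
    "Since your last exam, have you had a lipid panel?",
]})

# Capture only the digits that follow "exam" or "visit"; no match gives NaN
number = toy["description"].str.lower().str.extract(r"(?:exam|visit) (\d+)")[0]
toy["exam_number"] = pd.to_numeric(number)

# Drop rows without an exam number, then store the number as an integer
toy_longitudinal = toy.dropna(subset=["exam_number"]).astype({"exam_number": int})
print(toy_longitudinal)
```

The question-style description is removed because it has no exam or visit number to extract.
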

In [None]:
filtered_lipid_dataframe.head()

Now we save the variable name without the exam number as `varname_noexam`. This prepares us for the next step, where we will group the data by the variable name root.

In [None]:
# Save variable name without exam # as varname_noexam
# (raw string avoids invalid-escape warnings for \d in newer Python versions)
filtered_lipid_dataframe['varname_noexam'] = filtered_lipid_dataframe['description'].str.lower().str.replace(r'(exam \d+|visit \d+)', '', regex=True)

Finally, we can return a summary table showing which variables have more than one exam recorded.

In [None]:
# Isolate columns of interest
filtered_lipid_dataframe = filtered_lipid_dataframe[['columnmeta_var_id', 
                                                     'columnmeta_name', 
                                                     'columnmeta_description', 
                                                     'columnmeta_var_group_id', 
                                                     'columnmeta_var_group_description', 
                                                     'exam_number', 'varname_noexam']]

# Remove duplicated rows
filtered_lipid_dataframe = filtered_lipid_dataframe.drop_duplicates(subset=['columnmeta_description', 
                                                                            'exam_number', 
                                                                            'varname_noexam'])

# Create summary table by pivoting the dataframe to show which variables have which exam # provided.
longitudinal_lipid_summary = filtered_lipid_dataframe.pivot(index = 'exam_number', columns = 'varname_noexam', values = 'columnmeta_var_id')
longitudinal_lipid_summary.fillna('', inplace=True)
longitudinal_lipid_summary
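
To see how many exams each variable has, one option is to count the non-missing entries per column of the pivoted table. The sketch below uses a toy dataframe with hypothetical variable IDs (`v1` to `v4`) and a simplified `var_id` column name; it is meant only to illustrate the pivot-and-count idea.

```python
import pandas as pd

# Toy data: hypothetical variable IDs, exam numbers, and variable name roots
toy = pd.DataFrame({
    "var_id": ["v1", "v2", "v3", "v4"],
    "exam_number": [1, 2, 1, 3],
    "varname_noexam": ["treated for lipids, ", "treated for lipids, ",
                       "triglycerides, ", "treated for lipids, "],
})

# One row per exam number, one column per variable name root
summary = toy.pivot(index="exam_number", columns="varname_noexam", values="var_id")

# Count exams per variable before filling gaps, since '' would count as non-null
exam_counts = summary.notna().sum().sort_values(ascending=False)
print(summary.fillna(""))
print(exam_counts)
```

Counting must happen before `fillna('')`, because empty strings are treated as recorded values.
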

Now that we know which longitudinal variables are available to us, we can choose a variable of interest and extract the patient and visit level data associated with it.


### Identify which longitudinal lipid variable(s) are of interest

We can see from the table above that the variable `treated for lipids` appears to be the most robust, with 32 exams recorded.

In this example, we will further investigate the `treated for lipids` variable by adding all the associated variable IDs to our PIC-SURE query.

To do so, we need the HPDS_PATH for each variable ID.


In [None]:
variable_ids = longitudinal_lipid_summary[['treated for lipids, ']]
hpds_paths = variable_ids.merge(lipid_dataframe[['varId', 'HPDS_PATH']], 
                                left_on = 'treated for lipids, ', 
                                right_on = "varId", 
                                how = 'left')
hpds_paths = hpds_paths['HPDS_PATH']
hpds_paths

### Query PIC-SURE for longitudinal variables of interest
First, we will create a new query object.

In [None]:
authPicSure = bdc.useAuthPicSure()

longitudinal_query = authPicSure.query()

We will use the `query.anyof().add()` method. The output will include all of the input variables as columns, but only those participant records that contain at least one non-null value for any of those variables. See the `1_PICSURE_API_101.ipynb` notebook for a more in-depth explanation of query methods.

In [None]:
longitudinal_query.anyof().add(hpds_paths)
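
The "at least one non-null value" rule can be pictured locally with pandas. This is only an analogy on toy data with hypothetical patient labels, not a description of how PIC-SURE evaluates the query.

```python
import numpy as np
import pandas as pd

# Toy results: patient_b has no value for any of the selected variables
toy = pd.DataFrame(
    {"exam_1": ["Yes", np.nan, np.nan], "exam_2": [np.nan, np.nan, "No"]},
    index=["patient_a", "patient_b", "patient_c"],
)

# Keep rows with at least one non-null value; all columns are retained
kept = toy.dropna(how="all")
print(kept)
```

Both columns remain in the output, while the patient with no recorded values is left out.
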

Retrieve the query results as a dataframe.

In [None]:
longitudinal_results = longitudinal_query.getResultsDataFrame()

In [None]:
longitudinal_results

Our dataframe contains each exam / visit for the longitudinal variable of interest, with each row representing a patient. In order to be included in the output, each patient must have at least one reported value for one of the exams / visits for the variable of interest.

### Visualize the results
Let's plot a graph to see whether patients were or were not treated for lipids over time.

In [None]:
import seaborn as sns

# Apply the default seaborn theme and set a large figure size for the heatmap
sns.set_theme(rc={'figure.figsize': (20, 15)})

First, we will clean the data by removing the subject identifiers and renaming the columns to simply represent the visit number. We can see that our data values are in the form "Yes", "No", or "No Data". We will map them to numeric codes: 1 for "Yes", -1 for "No", and 0 for "No Data".

In [None]:
# work on a copy so the in-place edits below do not modify longitudinal_results
plotdf = longitudinal_results.copy()

# drop columns not containing data
plotdf.drop(plotdf.columns[[0, 1, 2, 3]], axis=1, inplace=True)

# rename columns with just the visit number
cols = []
for c in plotdf.columns:
    cnew = re.sub('^.*LIPRX', '', c)
    cnew = cnew.strip('\\')
    cols.append(cnew)
plotdf.columns = cols

In [None]:
# map "Yes" / "No" / "No Data" values to numeric codes (1 / -1 / 0)
map_df = pd.DataFrame({'raw':['Yes', 'No', 'No Data'],
                      'numeric':[1,-1, 0]})
map_df = dict(zip(map_df.raw, map_df.numeric))
for column in plotdf:
    plotdf[column] = plotdf[column].map(map_df)
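
One thing to keep in mind with `Series.map` and a dictionary is that any value missing from the dictionary becomes `NaN`. The sketch below shows this on a toy series and lists the labels that were not mapped; the "Unknown" label is hypothetical and is included only to demonstrate the behavior.

```python
import pandas as pd

value_map = {"Yes": 1, "No": -1, "No Data": 0}

# Toy values: None stands for a missing entry, "Unknown" is a hypothetical extra label
toy = pd.Series(["Yes", "No", "No Data", None, "Unknown"])

# Values that are not keys of value_map are mapped to NaN
coded = toy.map(value_map)

# Labels that were present in the data but absent from the mapping
unmapped = toy[coded.isna() & toy.notna()].unique().tolist()
print(coded)
print(unmapped)
```

A check like this can help confirm that no unexpected labels were silently turned into missing values.
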

Although we have 12792 patients in this dataset with at least one 'treated for lipids' value, the data for many of them is quite sparse. Let's focus on visualizing patients who have at least 20 values recorded.

In [None]:
plotdf['sum'] = plotdf.count(axis=1)
plotdf = plotdf[plotdf['sum'] >= 20]
plotdf = plotdf.sort_values(by=['sum'])
plotdf = plotdf.drop(['sum'], axis=1)
plotdf = plotdf.fillna(0)
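
The filtering step relies on `DataFrame.count(axis=1)`, which counts the non-missing values in each row. Here is a small sketch on toy data with a threshold of 2 instead of 20; it keeps the counts in a separate Series rather than a temporary column, which is a stylistic choice and not a requirement.

```python
import numpy as np
import pandas as pd

# Toy coded data: rows are patients, columns are visits, NaN means not recorded
toy = pd.DataFrame({"1": [1, np.nan, -1],
                    "2": [1, np.nan, np.nan],
                    "3": [-1, 1, 0]})

# Number of recorded (non-missing) values per patient; a coded 0 still counts
n_recorded = toy.count(axis=1)

# Keep patients with at least 2 recorded values, then fill the gaps with 0
kept = toy[n_recorded >= 2].fillna(0)
print(n_recorded.tolist())
print(kept)
```

Note that a value coded as 0 ("No Data") counts as recorded, whereas a missing value does not.
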

In [None]:
plotdf

The heatmap below shows one row per patient, including only patients with at least 20 observations. Each column is a visit. We can see distinct trends regarding the reporting of lipid treatment over time.

In [None]:
n = 3  # number of discrete categories: No (-1), No Data (0), Yes (1)
cmap = sns.color_palette("Spectral", n)
ax = sns.heatmap(plotdf, cmap=cmap, yticklabels=False)
# modify colorbar: place one tick at the center of each color band
colorbar = ax.collections[0].colorbar
r = colorbar.vmax - colorbar.vmin
colorbar.set_ticks([colorbar.vmin + r / n * (0.5 + i) for i in range(n)])
colorbar.set_ticklabels(['No', 'No Data', 'Yes'])
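
The colorbar tick positions in the heatmap cell come from splitting the value range into equal bands and placing a tick at the center of each band. The arithmetic can be checked on its own, assuming the plotted values span -1 to 1 as they do after the mapping step:

```python
# Assumed value range after mapping: -1 ("No") to 1 ("Yes"), with 3 color bands
vmin, vmax, n = -1, 1, 3
r = vmax - vmin

# Each band is r / n wide; the tick sits half a band past the band's lower edge
ticks = [vmin + r / n * (0.5 + i) for i in range(n)]
print(ticks)
```

The three ticks land at roughly -2/3, 0, and 2/3, which are the centers of the bands labeled 'No', 'No Data', and 'Yes'.
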