# Identifying and Extracting Longitudinal Variables using the R PIC-SURE API

This tutorial notebook will demonstrate how to identify and extract longitudinal variables using the R PIC-SURE API. Longitudinal variables are defined as containing multiple 'Exam' or 'Visit' descriptions within their concept path. 

In this example, we will find the patient-level data for a lipid-related longitudinal variable within the Framingham Heart Study. We will:
1. Identify which longitudinal variables are associated with the keywords of interest (lipid, triglyceride), and how many exams or visits are associated with each one
2. Select a longitudinal variable of interest from a specific study (Framingham Heart Study)
3. Extract patient-level data into a dataframe where each row represents a patient and each column represents a visit

For a more basic introduction to the R PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.

**Before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token).**

## Environment Set-Up

### System Requirements
R >= 3.4

### Install Packages

In [None]:
system(command = 'conda install -c conda-forge r-devtools --yes')
devtools::install_github("hms-dbmi/pic-sure-r-client", ref="master", force=TRUE, quiet=TRUE)
devtools::install_github("hms-dbmi/pic-sure-r-adapter-hpds", ref="master", force=TRUE, quiet=TRUE)
devtools::install_github("hms-dbmi/pic-sure-biodatacatalyst-r-adapter-hpds", ref="new-search", force=TRUE)
library(hpds)


In [None]:
install.packages(c('dplyr', 'stringr', 'tidyr', 'ggplot2'))
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)

## Connecting to a PIC-SURE Network

In [None]:
# Uncomment production URL when testing in production
# PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
PICSURE_network_URL = "https://biodatacatalyst.integration.hms.harvard.edu/picsure"
token_file <- "token.txt"
token <- scan(token_file, what = "character")
connection <- picsure::connect(PICSURE_network_URL, token)
authPicSure = bdc::use.authPicSure(connection)

## Longitudinal Lipid Variable Example
<font color='darkgreen'>**Goal: Extract lipid measurements from multiple visits. In this example, we will focus on the Framingham Heart Study (phs000007).**</font> 

In this notebook example, we will:
1. Identify lipid-related variables in the Framingham Heart Study
2. Identify which lipid variables are measured over time, for example across multiple visits or exams
3. Identify which longitudinal lipid variable(s) are of interest
4. Query PIC-SURE for the longitudinal lipid variable(s) of interest


### Identify lipid-related variables in the Framingham Heart Study

First, let's search the data dictionary in PIC-SURE. We will use a regular expression for the search term: `lipid|triglyceride`. This allows us to find all variables related to `lipid` *or* `triglyceride`.

In [None]:
dictionary <- bdc::use.dictionary(connection) # set up the variable dictionary
lipid_dictionary <- bdc::find.in.dictionary(dictionary, "lipid|triglyceride")
lipid_df <- bdc::extract.dataframe(lipid_dictionary)

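To illustrate how the `|` alternation in the search term behaves, here is a minimal sketch using base R `grepl` on a made-up vector of variable descriptions. The descriptions are hypothetical and not taken from the PIC-SURE dictionary, and the dictionary search itself may handle case and matching differently.

```r
# Hypothetical variable descriptions, for illustration only
toy_descriptions <- c(
    "Treated for lipids, exam 1",
    "Triglycerides, exam 3",
    "Systolic blood pressure, exam 1",
    "Lipid lowering medication use"
)

# `|` matches either alternative; ignore.case makes the match case-insensitive
is_lipid_related <- grepl("lipid|triglyceride", toy_descriptions, ignore.case = TRUE)

# Keep only the descriptions that match one of the two keywords
toy_descriptions[is_lipid_related]
```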

We are interested in variables from the Framingham Heart Study. The PHS number associated with this study is `phs000007`. If you don't know the PHS number for a study of interest, you can check the Data Access Dashboard in the PIC-SURE [User Interface](https://picsure.biodatacatalyst.nhlbi.nih.gov/psamaui/login).

Here, we filter our variables dataframe to only include those where the `study_id` column matches our PHS number of interest.

In [None]:
filtered_lipid_df <- lipid_df %>% filter(grepl('phs000007', study_id))
head(filtered_lipid_df)

As you can see, there are a number of variables in the Framingham Heart Study which are related to lipids or triglycerides. In this case study, we are interested specifically in longitudinal data, that is, variables which have been measured over time.

### Identify the longitudinal lipid variables
To identify which lipid variables are measured over time, we will take advantage of the keywords `exam` and `visit`. A brief review of our lipid variables in the Framingham Heart Study shows that many variable descriptions contain an exam or visit number, indicating that the variables are longitudinal.

First, we will filter our dataframe of variables related to `lipid` or `triglyceride` in the Framingham Heart Study to those whose description contains the keywords `exam #` or `visit #`.

In [None]:
filtered_lipid_df <- filtered_lipid_df %>% 
    filter(grepl('(exam \\d+|visit \\d+)', var_description, ignore.case = TRUE))

In [None]:
filtered_lipid_df

Next, we will extract the exam or visit number of each variable into column `exam_number`.

In [None]:
filtered_lipid_df <- filtered_lipid_df %>%
    mutate(exam_number = str_extract(var_description, regex('exam \\d+|visit \\d+', ignore_case = TRUE)))


Now we save the variable description without the exam number as `varname_noexam`. This prepares us for the next step, where we will group the data by this shared root.

In [None]:
filtered_lipid_df <- filtered_lipid_df %>%
    mutate(varname_noexam = str_replace(var_description, regex('exam \\d+|visit \\d+', ignore_case = TRUE), ''))

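The filtering and extraction steps above can be sketched on a small made-up dataframe. The variable names and descriptions below are hypothetical; only the column names `var_name` and `var_description` and the regular expression follow the tutorial.

```r
library(dplyr)
library(stringr)

# Hypothetical dictionary rows, for illustration only
toy_df <- data.frame(
    var_name = c("LIPRX1", "LIPRX2", "TRIG3", "CHOL"),
    var_description = c("Treated for lipids, exam 1",
                        "Treated for lipids, exam 2",
                        "Triglycerides, Visit 3",
                        "Total cholesterol"),
    stringsAsFactors = FALSE
)

# Case-insensitive pattern for "exam <number>" or "visit <number>"
exam_pattern <- regex("exam \\d+|visit \\d+", ignore_case = TRUE)

toy_longitudinal <- toy_df %>%
    # keep only descriptions that mention an exam or visit number
    filter(str_detect(var_description, exam_pattern)) %>%
    # pull out the matched text, then strip it to get the shared root
    mutate(exam_number = str_extract(var_description, exam_pattern),
           varname_noexam = str_replace(var_description, exam_pattern, ""))

toy_longitudinal
```

Note that `str_extract` returns the matched text as written, so the capitalization of `exam` or `visit` in the description is preserved.
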
Finally, we can return a summary table showing which variables have more than one exam recorded.

In [None]:
# Isolate columns of interest
filtered_lipid_df <- filtered_lipid_df %>% 
    select(var_name, var_description, exam_number, varname_noexam) %>%
    distinct

# Create summary table by pivoting the dataframe to show which variables have which exam # provided.
longitudinal_lipid_summary <- filtered_lipid_df %>%
    pivot_wider(id_cols = exam_number,
                names_from = varname_noexam,
                values_from = var_name)

In [None]:
longitudinal_lipid_summary

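As a complementary view to the pivoted table, one could count the number of distinct exams per variable root. This is a minimal sketch on made-up rows; the data are hypothetical and `exam_counts` is not part of the original notebook.

```r
library(dplyr)

# Hypothetical rows shaped like filtered_lipid_df, for illustration only
toy_longitudinal <- data.frame(
    var_name = c("LIPRX1", "LIPRX2", "LIPRX3", "TRIG1"),
    exam_number = c("exam 1", "exam 2", "exam 3", "exam 1"),
    varname_noexam = c("Treated for lipids", "Treated for lipids",
                       "Treated for lipids", "Triglycerides"),
    stringsAsFactors = FALSE
)

# Count distinct exams for each variable root, most frequently measured first
exam_counts <- toy_longitudinal %>%
    group_by(varname_noexam) %>%
    summarise(n_exams = n_distinct(exam_number)) %>%
    arrange(desc(n_exams))

exam_counts
```
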
Now that we know which longitudinal variables are available to us, we can choose a variable of interest and extract the patient- and visit-level data associated with it.

### Identify which longitudinal lipid variable(s) are of interest

We can see from the table above that the variable `treated for lipids` appears to be the most robust, with 32 exams recorded.

In this example, we will further investigate the `treated for lipids` variable by adding all the associated variable IDs to our PIC-SURE query.

To do so, we need the HPDS_PATH for each variable ID.


In [None]:
hpds_paths <- lipid_df %>% filter(grepl('Treated for lipids,', var_description)) %>% pull(HPDS_PATH)
hpds_paths

### Query PIC-SURE for longitudinal variables of interest
First, we will create a new query object.

In [None]:
longitudinal_query <- bdc::new.query(authPicSure) # Start a new query


We will add each concept path to the query using `bdc::query.select.add`, applied to every entry of `hpds_paths` with `lapply`. This includes all of the selected variables as columns in the output. See the `1_PICSURE_API_101.ipynb` notebook for a more in-depth explanation of the query methods, including `anyof`, which restricts the output to participant records containing at least one non-null value for the listed variables.

In [None]:
invisible(lapply(hpds_paths, bdc::query.select.add, query = longitudinal_query))


Next, we retrieve the query results as a dataframe.

In [None]:
longitudinal_results <- bdc::query.run(longitudinal_query)
head(longitudinal_results)

### Visualize the results
Let's plot a graph to see whether patients were or were not treated for lipids over time.

First, we will clean the data by removing the subject identifiers and renaming the columns to simply represent the visit number. We can see that our data values are in the form "Yes", "No". We will map them to a numeric representation: 1 for "Yes", -1 for "No", and 0 for anything else.

In [None]:
plotdf <- longitudinal_results

# drop columns not containing data
plotdf <- plotdf[,-c(1:4)]

# rename columns with just the visit number
colnames(plotdf) <- gsub('LIPRX', '', str_extract(colnames(plotdf), 'LIPRX\\d+'))

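# drop duplicated visit columns (TODO: remove once duplicates are resolved upstream using var_id)
# logical indexing is safe even when no column names are duplicated
plotdf <- plotdf[, !duplicated(colnames(plotdf))]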

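The column renaming and de-duplication can be sketched on a small made-up result set. The column names below are hypothetical placeholders rather than real concept paths. The sketch uses a logical index, `!duplicated(...)`, because negative indexing with `-which(...)` selects no columns at all when nothing is duplicated.

```r
library(stringr)

# Hypothetical query output with path-like column names, for illustration only
toy_results <- data.frame(
    a = c("Yes", "No"),
    b = c("No", "No"),
    c = c("Yes", "No"),
    stringsAsFactors = FALSE
)
colnames(toy_results) <- c("/toy_study/table_a/LIPRX1/",
                           "/toy_study/table_a/LIPRX2/",
                           "/toy_study/table_b/LIPRX1/")

# Extract e.g. "LIPRX1" from each name, then keep only the visit number
colnames(toy_results) <- str_remove(str_extract(colnames(toy_results), "LIPRX\\d+"), "LIPRX")

# Keep the first column for each visit number
toy_results <- toy_results[, !duplicated(colnames(toy_results)), drop = FALSE]

colnames(toy_results)
```
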
In [None]:
# map yes/no values to a numeric representation:
# 'Yes' = 1, 'No' = -1, anything else (including missing values) = 0
my_func <- function(vec) {
    case_when(vec == 'Yes' ~ 1,
              vec == 'No' ~ -1,
              TRUE ~ 0)
}

plotdf <- plotdf %>%
    mutate_all(my_func)

Although we have 12792 patients in this dataset with at least one 'treated for lipids' value, the data for many of them is quite sparse. Let's focus on visualizing patients who have at least 20 values recorded.

In [None]:
plotdf <- plotdf %>%
    mutate(recorded_values = rowSums(. != 0)) %>%
    filter(recorded_values >= 20) %>%
    select(-recorded_values)

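The recoding and sparsity filter can be sketched together on a small made-up table. The data and the threshold of 2 are hypothetical (the tutorial uses 20), and `recode_yes_no` is an illustrative helper name. This sketch assumes missing values may appear as `NA`, which `case_when` sends to the fallback value of 0.

```r
library(dplyr)

# Hypothetical visit columns with some missing values, for illustration only
toy_visits <- data.frame(
    `1` = c("Yes", "No", NA),
    `2` = c("No", "No", NA),
    `3` = c("Yes", NA, "No"),
    check.names = FALSE, stringsAsFactors = FALSE
)

# Yes = 1, No = -1, anything else (including NA) = 0
recode_yes_no <- function(vec) {
    case_when(vec == "Yes" ~ 1,
              vec == "No" ~ -1,
              TRUE ~ 0)
}

toy_numeric <- toy_visits %>%
    mutate(across(everything(), recode_yes_no))

# Keep participants with at least `min_values` recorded (non-zero) values
min_values <- 2
toy_dense <- toy_numeric[rowSums(toy_numeric != 0) >= min_values, , drop = FALSE]

toy_dense
```
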
In the heatmap below, each row represents one patient with at least 20 recorded values, and each column represents a visit. We can see distinct trends in the reporting of lipid treatment over time.

In [None]:
# reshape to long format: one row per participant and visit
plotdf$id <- rownames(plotdf)
plotdf <- pivot_longer(plotdf, cols = -id, names_to = 'visit')
plotdf$visit <- factor(plotdf$visit, levels = c(1:32))
plotdf$value <- factor(plotdf$value)

In [None]:
ggplot(plotdf, aes(visit, id)) + 
    geom_tile(aes(fill = value)) + 
    scale_fill_manual(values=c("darkorange", "lightyellow", "forestgreen"),
                      labels = c('No', 'No Data', 'Yes')) +
    ylab(label = 'Participants') +
    theme(axis.text.y = element_blank()) 
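
The reshaping and plotting steps can be sketched on a small made-up table of recoded values. The data are hypothetical, and the sketch sets the factor levels of `value` explicitly so that the three colors line up with -1, 0, and 1 in that order.

```r
library(tidyr)
library(ggplot2)

# Hypothetical recoded values (1 = Yes, -1 = No, 0 = No Data), for illustration only
toy_wide <- data.frame(
    `1` = c(1, -1),
    `2` = c(1, 0),
    `3` = c(-1, 0),
    check.names = FALSE
)
toy_wide$id <- rownames(toy_wide)

# One row per participant and visit; `cols = -id` pivots every column except id
toy_long <- pivot_longer(toy_wide, cols = -id, names_to = "visit", values_to = "value")
toy_long$visit <- factor(toy_long$visit, levels = as.character(1:3))
toy_long$value <- factor(toy_long$value, levels = c(-1, 0, 1))

# Build the heatmap: visits on the x axis, participants on the y axis
toy_heatmap <- ggplot(toy_long, aes(visit, id)) +
    geom_tile(aes(fill = value)) +
    scale_fill_manual(values = c("darkorange", "lightyellow", "forestgreen"),
                      labels = c("No", "No Data", "Yes")) +
    ylab(label = "Participants")
```

Printing `toy_heatmap` renders the plot.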