# Identifying and Extracting Longitudinal Variables using R PIC-SURE API

This tutorial notebook demonstrates how to identify and extract longitudinal variables using the R PIC-SURE API. Longitudinal variables are defined here as variables whose concept paths contain multiple 'Exam' or 'Visit' descriptions.

In this example, we will find the patient-level data for a lipid-related longitudinal variable within the Framingham Heart Study. We will:
1. Identify which longitudinal variables are associated with the keywords of interest (lipid, triglyceride), and how many exams / visits are associated with each one
2. Select a longitudinal variable of interest from a specific study (Framingham Heart Study)
3. Extract patient-level data into a dataframe where each row represents a patient and each column represents a visit

For a more basic introduction to the R PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.

**Before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/harmonized_lipid_measurements_example/NHLBI_BioData_Catalyst#get-your-security-token).**

## Environment Set-Up

### System Requirements
R >= 3.4

### Install Packages

In [None]:
source("R_lib/requirements.R")

Install the latest R PIC-SURE API libraries from GitHub

In [None]:
Sys.setenv(TAR = "/bin/tar")
options(unzip = "internal")
install.packages("https://cran.r-project.org/src/contrib/Archive/devtools/devtools_1.13.6.tar.gz", repos=NULL, type="source")
install.packages("https://cran.r-project.org/src/contrib/R6_2.5.0.tar.gz", repos=NULL, type="source")
install.packages("https://cran.r-project.org/src/contrib/hash_2.2.6.1.tar.gz", repos=NULL, type="source")
install.packages(c("urltools"),repos = "http://cran.us.r-project.org")
devtools::install_github("hms-dbmi/pic-sure-r-client", force=T)
devtools::install_github("hms-dbmi/pic-sure-r-adapter-hpds", force=T)
devtools::install_github("hms-dbmi/pic-sure-biodatacatalyst-r-adapter-hpds", force=T)

Load user-defined functions

In [None]:
source("R_lib/utils.R")

## Connecting to a PIC-SURE Network
**Again, before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/harmonized_lipid_measurements_example/NHLBI_BioData_Catalyst#get-your-security-token).**

In [None]:
PICSURE_network_URL <- "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id <- "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file <- "token.txt"

In [None]:
token <- scan(token_file, what = "character")


In [None]:
myconnection <- picsure::connect(url = PICSURE_network_URL,
                                 token = token)

In [None]:
resource <- bdc::get.resource(myconnection,
                               resourceUUID = resource_id)

## Longitudinal Lipid Variable Example
Example showing how to extract lipid measurements from multiple visits for different cohorts

### Access the data
First, we will create a multiIndex variable dictionary of all variables we have access to.

In [None]:
varDict <- bdc::find.in.dictionary(resource) %>% bdc::extract.entries() # all variables
multiindex <- get_multiIndex_variablesDict(varDict) # get multiindex table of all variables

In this example, we are interested in variables related to lipids. We can find all variables related to the search terms 'lipid' and 'triglyceride' by applying the following filter to the multiIndex dictionary:

In [None]:
lipid_vars <- multiindex %>% filter(grepl('triglyceride', name, ignore.case = TRUE) |
                                   grepl('lipid', name, ignore.case = TRUE))
lipid_vars
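
To make the keyword filter concrete, here is a minimal sketch on a made-up dictionary. The data frame `toy_dict` and its variable names are hypothetical and are not taken from PIC-SURE; base R subsetting stands in for `dplyr::filter()` so that the snippet runs without extra packages.

```r
# Hypothetical variable names, for illustration only
toy_dict <- data.frame(
  name = c("\\Study A\\Lipids\\Triglycerides, Exam 1\\",
           "\\Study A\\Lipids\\Triglycerides, Exam 2\\",
           "\\Study A\\Demographics\\Age at Exam 1\\",
           "\\Study B\\Labs\\LIPID lowering medication\\"),
  stringsAsFactors = FALSE
)

# Keep rows whose name mentions either keyword, ignoring case
keep <- grepl("triglyceride", toy_dict$name, ignore.case = TRUE) |
        grepl("lipid", toy_dict$name, ignore.case = TRUE)
toy_lipid_vars <- toy_dict[keep, , drop = FALSE]

toy_lipid_vars$name
```

Because the match is case-insensitive, the 'LIPID' entry is kept, while the demographics variable is dropped.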

### Identify the longitudinal lipid variables
This block of code does the following:
- uses the multiindex dataframe containing variables which are related to 'lipid' or 'triglyceride'
- filters for variables with keywords 'exam #' or 'visit #'
- extracts the exam number of each variable into column `exam_number`
- groups variables by study (`level_0`) and longitudinal variable (`longvar`)
- returns a table of the variables that have more than one exam recorded, i.e. the longitudinal variables

In [None]:
longitudinal_lipid_vars <- lipid_vars %>% 

  # filter for longitudinal variables; here defined as variable names containing 'exam' or 'visit' followed by a number (case insensitive)
  filter((grepl('Exam \\d+', name, ignore.case = TRUE) | 
         grepl('Visit \\d+', name, ignore.case = TRUE))) %>%

  # extract the exam / visit number and store it as a new variable: 'exam_number'
  mutate(exam_number = str_extract(name, regex("(exam \\d+)|(visit \\d+)", ignore_case=T)),
         
         # extract the longitudinal variable name by removing the exam / visit number and store as new variable: 'longvar'
         longvar =  str_replace(name, regex('(exam \\d+.$)|(visit \\d+.$)', ignore_case = T), '')) %>%

  # group our variable data by study name (level_0) and longitudinal variable name (longvar)
  group_by(level_0, longvar) %>%

  # count number of exams / visits within each distinct study - variable pairing and store as new variable: 'n_exams'
  summarise(n_exams = n_distinct(exam_number)) %>% 

  # filter results to only include variables which have more than one exam / visit (n_exams > 1)
  filter(n_exams > 1) %>%

  # sort results by number of exams / visits 
  arrange(desc(n_exams))


longitudinal_lipid_vars
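
The effect of the two regular expressions is easier to see on a handful of names. The sketch below uses made-up variable names (not real PIC-SURE concept paths) and base R functions (`regexpr`, `regmatches`, `sub`, `tapply`) in place of the `stringr` / `dplyr` calls, so it runs without extra packages. It assumes, as the `.$` in the pattern suggests, that each name ends with a single trailing character after the exam number.

```r
# Hypothetical variable names, for illustration only
toy_names <- c("\\Study A\\Lipids\\Triglycerides, Exam 1\\",
               "\\Study A\\Lipids\\Triglycerides, Exam 2\\",
               "\\Study A\\Lipids\\HDL cholesterol, Visit 3\\")

# Exam / visit label: the first match of the pattern in each name
m <- regexpr("(exam \\d+)|(visit \\d+)", toy_names,
             ignore.case = TRUE, perl = TRUE)
exam_number <- regmatches(toy_names, m)

# Longitudinal variable name: drop the label and the one trailing character
longvar <- sub("((exam \\d+)|(visit \\d+)).$", "", toy_names,
               ignore.case = TRUE, perl = TRUE)

toy <- data.frame(longvar, exam_number, stringsAsFactors = FALSE)

# Number of distinct exams / visits per longitudinal variable
n_exams <- tapply(toy$exam_number, toy$longvar,
                  function(x) length(unique(x)))

# Only variables seen at more than one exam / visit count as longitudinal
n_exams[n_exams > 1]
```

Here only the triglyceride variable survives the final filter, because the HDL variable appears at a single visit.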
   

Now that we know which longitudinal variables are available to us, we can choose a variable of interest and extract the patient and visit level data associated with it.

However, note that the `longvar` we extracted is not equivalent to the actual PIC-SURE concept path needed to query for this variable, so the table above cannot be used by itself to retrieve the data of interest. Instead, we choose a study and longitudinal variable from the table and then look up the matching original variable names (`name`) in the dictionary.

### Isolate variables of interest

In this example, we will choose to further investigate the first longitudinal variable in the `longitudinal_lipid_vars` dataframe we generated above.

In [None]:
my_variable <- longitudinal_lipid_vars$longvar[1]
print(my_variable)

To add the longitudinal variable of interest to our PIC-SURE query, we will need to search for our variable within the overall multiindex data dictionary we created before (`multiindex`).

In [None]:
# remove all punctuation before matching, so that characters such as backslashes
# and parentheses in the names are not interpreted as regex syntax
query_vars <- multiindex %>% 
  filter(grepl(str_replace_all(my_variable, '[[:punct:]]', ''),
               str_replace_all(name, '[[:punct:]]', '')) &
         # keep only names that end in an exam / visit number
         grepl('(exam \\d+.$)|(visit \\d+.$)', tolower(name))) %>%
  pull(name)

query_vars


The resulting `query_vars` variable contains the variables we will want to add to our query. 
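
As a rough illustration of why punctuation is stripped before matching, the sketch below repeats the lookup on made-up names. The names, the `toy_variable` value, and the helper `strip_punct()` are all hypothetical; `gsub()` is used as a base R stand-in for `str_replace_all()`.

```r
# Hypothetical names, for illustration only
toy_names <- c("\\Study A (cohort)\\Lipids\\Triglycerides, Exam 1\\",
               "\\Study A (cohort)\\Lipids\\Triglycerides, Exam 2\\",
               "\\Study A (cohort)\\Lipids\\HDL cholesterol, Exam 1\\")
toy_variable <- "\\Study A (cohort)\\Lipids\\Triglycerides, "

# Hypothetical helper: remove every punctuation character
strip_punct <- function(x) gsub("[[:punct:]]", "", x)

# With punctuation removed, the variable name contains no regex metacharacters
is_match <- grepl(strip_punct(toy_variable), strip_punct(toy_names))

# Keep only names that also end in an exam / visit number
is_longitudinal <- grepl("((exam \\d+)|(visit \\d+)).$",
                         tolower(toy_names), perl = TRUE)

toy_names[is_match & is_longitudinal]
```

The result holds the full, unmodified names of the two triglyceride exams, which is the form needed when adding variables to a query.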

### Create & run query
First, we will create a new query object.

In [None]:
my_query <- bdc::new.query(resource = resource)

We will use the `bdc::query.select.add()` method, as we want to include all variables in `query_vars`. See the `1_PICSURE_API_101.ipynb` notebook for a more in-depth explanation of query methods.

In [None]:
bdc::query.select.add(query = my_query,
                      keys = lapply(query_vars, as.character))
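
As a small illustration of the `keys` argument built above, `lapply(query_vars, as.character)` turns a character vector into a list with one string per element. The paths below are hypothetical, and whether `bdc::query.select.add()` strictly requires a list rather than a vector is not stated here, so treat this only as a sketch of what the conversion produces.

```r
# Hypothetical concept paths, for illustration only
toy_query_vars <- c("\\Study A\\Lipids\\Triglycerides, Exam 1\\",
                    "\\Study A\\Lipids\\Triglycerides, Exam 2\\")

# One list element per variable name
toy_keys <- lapply(toy_query_vars, as.character)

is.list(toy_keys)
length(toy_keys)
```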

We can now run our query:

In [None]:
my_df <- bdc::query.run(my_query, result.type = "dataframe")

Our dataframe contains each exam / visit for the longitudinal variable of interest, with each row representing a patient.

In [None]:
my_df
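
To show how a wide table of this shape can be inspected, the sketch below builds a made-up dataframe with one row per patient and one column per exam, then counts how many exams have a recorded value for each patient. The column names and values are hypothetical; the columns of `my_df` will be named differently, so `exam_cols` would need to be adapted.

```r
# Hypothetical result, for illustration only:
# one row per patient, one column per exam
toy_df <- data.frame(
  patient_id = c(1, 2, 3),
  exam_1 = c(120, NA, 95),
  exam_2 = c(130, 110, NA),
  exam_3 = c(NA, 105, NA)
)

# All columns except the patient identifier hold exam values
exam_cols <- setdiff(names(toy_df), "patient_id")

# Number of exams with a recorded (non-missing) value for each patient
toy_df$n_recorded <- rowSums(!is.na(toy_df[, exam_cols]))

toy_df
```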