# PIC-SURE API use-case: quick analysis on CIBMTR data

This tutorial notebook is intended to get users up and running quickly with the R PIC-SURE API. It covers the main functionality of the API.

## PIC-SURE R API
### What is PIC-SURE?

As part of the BioData Catalyst initiative, the Patient Information Commons Standard Unification of Research Elements (PIC-SURE) platform has been integrating clinical and genomic datasets from multiple TOPMed and TOPMed-related studies funded by the National Heart, Lung, and Blood Institute (NHLBI).

The original data exposed through the PIC-SURE API vary widely in how they are organized. PIC-SURE hides this complexity and exposes the different study datasets in a single tabular format. By simplifying data extraction, it allows investigators to focus on downstream analyses and facilitates reproducible science.


### More about PIC-SURE
The API is available in two programming languages, Python and R, enabling investigators to query the databases the same way using either language.


The R/Python PIC-SURE API is only one component of the larger PIC-SURE project. Among other things, PIC-SURE also offers a graphical user interface that allows researchers to explore variables across multiple studies, filter patients that match criteria, and create cohorts from this interactive exploration.

The R API is actively developed by the Avillach Lab at Harvard Medical School.

PIC-SURE API GitHub repositories:
* https://github.com/hms-dbmi/pic-sure-r-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-r-client
* https://github.com/hms-dbmi/pic-sure-biodatacatalyst-r-adapter-hpds




 -------   

# Getting your own user-specific security token

**Before running this notebook, please review the "Get your security token" documentation in the NHLBI_BioData_Catalyst [README.md file](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token). It explains how to get a security token, which is mandatory to access the databases.**

# Environment set-up

### Pre-requisites
- R 3.4 or later

### Install packages

Install the following:
- packages listed in the `requirements.R` file
- PIC-SURE API components (from GitHub)
    - PIC-SURE Adapter
    - PIC-SURE Client

In [None]:
source("R_lib/requirements.R")

Install the latest R PIC-SURE API libraries from GitHub

In [None]:
Sys.setenv(TAR = "/bin/tar")
options(unzip = "internal")

# Use TRUE rather than T: T is an ordinary variable that can be reassigned
devtools::install_github("hms-dbmi/pic-sure-r-client", force = TRUE)
devtools::install_github("hms-dbmi/pic-sure-r-adapter-hpds", force = TRUE)

Load user-defined functions

In [None]:
source("R_lib/utils.R")

## Connecting to a PIC-SURE network

You will need the following information before connecting to the PIC-SURE network:
* resource ID: ID of the resource that you are trying to access. You can leave the default value for this project.
* user-specific token text file: A text file called `token.txt` should contain the token retrieved from your user profile in the PIC-SURE UI. This file needs to be located in the R root folder.

In [None]:
token_file <- "token.txt"
my_token <- scan(token_file, what = "character")
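
If the token file is missing or contains stray whitespace, the connection step may fail with an error that is hard to trace back to the token. The following is a minimal sketch of a more defensive read using only base R. The helper name `read_token` is hypothetical and is not part of the PIC-SURE libraries, and the token value is a placeholder.

```r
# Hypothetical helper (not part of the PIC-SURE API): read a token from a text file
read_token <- function(path) {
  if (!file.exists(path)) {
    stop("Token file not found: ", path)
  }
  # Read all lines, strip surrounding whitespace, and drop empty lines
  lines <- trimws(readLines(path, warn = FALSE))
  lines <- lines[nzchar(lines)]
  if (length(lines) != 1) {
    stop("Expected exactly one non-empty line in ", path)
  }
  lines
}

# Demonstration with a temporary file holding a placeholder value
demo_file <- tempfile(fileext = ".txt")
writeLines(c("  placeholder-token-value  ", ""), demo_file)
demo_token <- read_token(demo_file)
```

In the notebook, the same helper could be called as `read_token("token.txt")` in place of `scan()`.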

In [None]:
PICSURE_network_URL <- "https://curesc.hms.harvard.edu/picsure/"
connection <- picsure::connect(url = PICSURE_network_URL, 
                               token = my_token)

In [None]:
resource <- hpds::get.resource(connection,
                               resourceUUID = picsure::list.resources(connection))

In [None]:
picsure::list.resources(connection)

Two objects were created: a `connection` and a `resource` object, using the `picsure` and `hpds` libraries, respectively. 

Since we will be using only a single resource, **the `resource` object is the only one we will need to proceed with this data analysis.** It should be noted that the `connection` object is useful to get access to different databases stored in different resources. 

The `resource` object is connected to the specific resource ID and enables us to query and retrieve data from this source.

## Getting help with the R PIC-SURE API

The `?` operator displays the help page for any PIC-SURE library function. For example, we can learn more about getting a resource using the following code:

In [None]:
?hpds::get.resource

## Using the *variables dictionary*

Once a connection to the desired resource has been established, we first need to get an understanding of which variables are available in the database. We will use the dictionary functions of the `hpds` library, applied to the `resource` object, to do this.

A dictionary search retrieves the records that match a specific term. The `find.in.dictionary()` function can be used to retrieve information about the available variables. For instance, looking for variables containing the term 'Sex' is done this way: 

In [None]:
dictionary_search <- hpds::find.in.dictionary(resource, "Sex")

Objects created by the `find.in.dictionary()` function can expose the search results using three different functions: `extract.count()`, `extract.keys()`, and `extract.entries()`. 

In [None]:
print(list("Count"   = hpds::extract.count(dictionary_search), 
           "Keys"    = hpds::extract.keys(dictionary_search)[1:2], # Show first two keys
           "Entries" = hpds::extract.entries(dictionary_search)[1:2])) # Show first two entries

In [None]:
hpds::extract.entries(dictionary_search) %>% tail() #View last entries as a dataframe

Viewing the dictionary as a data frame allows us to:

* Use the various information exposed in the dictionary (patient count, variable type, ...) as criteria for variable selection.
* Use the row names of the data frame to get the actual variable names to be used in the query, as shown below.

Variable names aren't very practical to use right away for two reasons:
1. They are very long.
2. They contain backslashes that prevent copy-pasting.

However, retrieving the dictionary search result in the form of a dataframe can help access the variable names.
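
The backslash problem comes from R string literals, where each backslash has to be doubled when typed by hand. The sketch below illustrates this with made-up variable names (they are not taken from the CIBMTR dictionary) and shows how a fixed, non-regex pattern avoids a second layer of escaping when searching.

```r
# Made-up variable names for illustration; real names come from the dictionary
var_names <- c("\\Example study\\Demographics\\Sex\\",
               "\\Example study\\Demographics\\Age\\",
               "\\Other study\\Demographics\\Sex\\")

# Each "\\" typed in the source code is a single backslash in the actual string
writeLines(var_names[1])

# Select names with a fixed (non-regex) pattern, so no extra escaping is needed
sex_names <- var_names[grepl("\\Sex\\", var_names, fixed = TRUE)]
```

Pulling names out of the dictionary data frame, as done in the rest of this notebook, avoids typing these strings at all.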

Let's say we want to retrieve every variable for the "CIBMTR - Cure Sickle Cell Disease" study in the form of a DataFrame. We can do this using the code below:

In [None]:
plain_variablesDict <-  hpds::find.in.dictionary(resource, "CIBMTR - Cure Sickle Cell Disease") %>% 
                        hpds::extract.entries()

In [None]:
plain_variablesDict[1:5,]

### Extract Full Data Dictionary to CSV

Calling the `hpds::find.in.dictionary` function with an empty search term returns all entries, as described in the help documentation. We can therefore extract the entire data dictionary by performing an empty search:

In [None]:
fullVariableDict <- hpds::find.in.dictionary(resource, "") %>% 
                    hpds::extract.entries() %>%
                    mutate(categoryValues = stri_join_list(categoryValues, sep =', '))

Check that the `fullVariableDict` dataframe contains some values.

In [None]:
fullVariableDict[1:5,] # View first five rows (R indexing starts at 1)

We can then write the data frame that contains the full data dictionary to a CSV file.

In [None]:
dataDictFile <- "data_dictionary.csv" # Name of output file
write.csv(fullVariableDict, dataDictFile, row.names=FALSE)

You should now see a `data_dictionary.csv` file in the JupyterHub file explorer.
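
The `mutate()` step above matters because `categoryValues` appears to be a list column, and list columns generally cannot be written directly by `write.csv()`. The following base R sketch shows the same kind of flattening on a toy data frame. It is an assumption about why the `stri_join_list()` call is there, not a description of the library internals, and the toy values are invented.

```r
# Toy dictionary with a list column, mimicking the shape of categoryValues
toy_dict <- data.frame(name = c("var_a", "var_b"), stringsAsFactors = FALSE)
toy_dict$categoryValues <- list(c("Yes", "No"), character(0))

# Collapse each list element into one comma-separated string
toy_dict$categoryValues <- vapply(
  toy_dict$categoryValues,
  function(v) paste(v, collapse = ", "),
  character(1)
)

# The flattened data frame can now be written and read back
out_file <- tempfile(fileext = ".csv")
write.csv(toy_dict, out_file, row.names = FALSE)
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)
```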

### Parsing variable names

Though the plain dictionary is helpful, we can use a simple function, `get_multiIndex_variablesDict`, defined in `R_lib/utils.R`, to add a little more information and make long variable names easier to work with. 

Although not an official feature of the API, such functionality illustrates how to quickly scan and select groups of related variables.

Printing part of the "parsed names" dictionary allows us to quickly see the tree-like organization of the variables. Moreover, original and simplified variable names are now stored in the `name` and `simplified_name` columns, respectively. Simplified variable names are the last component of the variable name, which is usually the most informative and specific part of the full variable name.

In [None]:
# Display the variables tree hierarchy from the variables name
variablesDict <- get_multiIndex_variablesDict(plain_variablesDict)
head(variablesDict)
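
To make the idea concrete, here is a rough sketch of how such a parsing step could work. It is a hypothetical re-implementation for illustration only: the notebook's own helper lives in `R_lib/utils.R` and may differ, the function name `split_variable_names` is invented, and the variable names are made up. The output column names follow the `level_0`, `simplified_name`, and `name` columns used in this notebook.

```r
# Hypothetical sketch; the real helper in R_lib/utils.R may work differently
split_variable_names <- function(var_names) {
  # Split each name on the literal backslash and drop empty pieces
  pieces <- strsplit(var_names, "\\", fixed = TRUE)
  pieces <- lapply(pieces, function(p) p[nzchar(p)])
  depth <- max(lengths(pieces))
  # Pad shorter paths with NA so every row has the same number of levels
  padded <- lapply(pieces, function(p) c(p, rep(NA_character_, depth - length(p))))
  levels_df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(levels_df) <- paste0("level_", seq_len(depth) - 1)
  # The simplified name is the last component of each path
  levels_df$simplified_name <- vapply(pieces, function(p) p[length(p)], character(1))
  levels_df$name <- var_names
  levels_df
}

# Made-up variable names for illustration
toy_names <- c("\\Example study\\Demographics\\Sex\\",
               "\\Example study\\Year of visit\\")
toy_dict <- split_variable_names(toy_names)
```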

Below is a simple example to illustrate the ease of use of a multiIndex dictionary. Let's say we are interested in the variable called "5 - CRF data collection track only" from the "CIBMTR - Cure Sickle Cell Disease" study.

In [None]:
mask_study <- variablesDict[,1] == "CIBMTR - Cure Sickle Cell Disease"
mask_dctrack <- grepl("5 - CRF data collection track only", variablesDict[["level_1"]])
more_variables <- variablesDict[mask_study & mask_dctrack,]
more_variables

This simple filter can be easily combined with other filters to quickly select variables of interest.
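
As a sketch of what combining filters can look like, the example below builds three logical masks on a toy dictionary and joins them with `&`. The data frame, the study names, and the `categorical` column are invented for illustration and do not reflect the actual columns returned by the API.

```r
# Toy dictionary; the values and the `categorical` column are hypothetical
toy_dict <- data.frame(
  level_0 = c("Study A", "Study A", "Study B", "Study A"),
  simplified_name = c("Sex", "Year of transplant", "Sex", "Age at transplant"),
  categorical = c(TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# One mask per criterion
mask_study <- toy_dict[["level_0"]] == "Study A"
mask_transplant <- grepl("transplant", toy_dict[["simplified_name"]], fixed = TRUE)
mask_numeric <- !toy_dict[["categorical"]]

# Rows must satisfy all three criteria
selected <- toy_dict[mask_study & mask_transplant & mask_numeric, ]
```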

## Querying and retrieving data

The second cornerstone of the API is the set of `query` functions, which are how we retrieve data from the resource.

Several query functions enable us to build a query:

| Method | Arguments / Input | Output|
|--------|-------------------|-------|
| query.select.add() | variable names (string) or list of strings | all variables included in the list (no record subsetting)|
| query.require.add() | variable names (string) or list of strings | all variables; only records that do not contain null values for input variables |
| query.anyof.add() | variable names (string) or list of strings | all variables; only records that contain at least one non-null value for input variables |
| query.filter.add() | variable name and additional filtering values | input variable; only records that match filter criteria |

All four methods can be combined when building a query. The records eventually returned by the query must meet all of the specified filters.
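
The record-subsetting behavior described in the table can be mimicked locally on a toy data frame. This is only an analogy in base R, not a call to the PIC-SURE API, and the data are invented. It also assumes that a numeric lower bound is inclusive, which is what the use of `min=2000` for "after the year 1999" later in this notebook suggests.

```r
# Toy records with some missing values
toy <- data.frame(
  patient_id = 1:5,
  sex = c("Male", "Female", "Male", NA, "Male"),
  year = c(1998, 2003, 2005, 2010, NA),
  stringsAsFactors = FALSE
)

# "require": keep records with non-missing values for every listed variable
required <- toy[!is.na(toy$sex) & !is.na(toy$year), ]

# "anyof": keep records with a non-missing value for at least one listed variable
any_of <- toy[!is.na(toy$sex) | !is.na(toy$year), ]

# "filter": keep records matching a category and a numeric lower bound
filtered <- toy[!is.na(toy$sex) & toy$sex == "Male" &
                !is.na(toy$year) & toy$year >= 2000, ]
```

Because the conditions are joined with `&`, a record must pass every one of them, which mirrors how combined filters narrow the cohort.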

### Building the query

Let's say we are interested in the age at which patients from the following cohort received their transplant:
* patients in the "CIBMTR - Cure Sickle Cell Disease" study
* males
* patients with avascular necrosis
* patients that received their transplant after the year 1999

First, we need to get the variables from the study of interest. We can do this by searching for the study name in the `level_0` column of the multiIndexed dictionary and saving the resulting variable names from the `name` column.

In [None]:
# Selecting all variables from "CIBMTR" study
mask_study <- variablesDict[["level_0"]] == "CIBMTR - Cure Sickle Cell Disease"
varnames <- variablesDict[mask_study, ]$name %>% as.list()

Now we will find variables pertaining to sex and avascular necrosis. We can do this by searching for "Sex" and "Avascular necrosis" in the `simplified_name` column of the dictionary.

In [None]:
sex_var <- variablesDict[variablesDict["simplified_name"] == "Sex", ]$name 

avascular_necrosis_varname <- variablesDict[variablesDict["simplified_name"] == "Avascular necrosis", ]$name 
values <- variablesDict[mask_study, "categoryValues"]

Next, we can find the variable pertaining to "Year of transplant".

In [None]:
yr_transplant_varname <- variablesDict[variablesDict["simplified_name"] == "Year of transplant", ]$name

Now we can create a new query and apply our filters to retrieve the cohort of interest.

In [None]:
my_query <- hpds::new.query(resource = resource)
hpds::query.select.add(my_query, keys = sex_var)
hpds::query.filter.add(my_query, sex_var, "Male")

hpds::query.select.add(my_query, keys = avascular_necrosis_varname)
hpds::query.filter.add(my_query, avascular_necrosis_varname, "Yes")

hpds::query.select.add(my_query, keys = yr_transplant_varname)
hpds::query.filter.add(my_query, yr_transplant_varname, min=2000)

Using this cohort, we can add the variable of interest: "Patient age at transplant, years"

In [None]:
age_transplant_var <- variablesDict[variablesDict["simplified_name"] == "Patient age at transplant, years", ]$name
hpds::query.select.add(my_query, keys = age_transplant_var)

## Retrieving the data

Once our query object is built, we use the `query.run` function to retrieve the data corresponding to our query.

In [None]:
my_df <- hpds::query.run(my_query, result.type = "dataframe")

In [None]:
my_df

Once the data has been retrieved as a dataframe, you can use R functions to conduct analyses and create visualizations, such as this:

In [None]:
names(my_df)[2] <- "age_at_transplant" # Rename long column to age_at_transplant
ggplot(data = my_df) +
    geom_histogram(mapping = aes(x=age_at_transplant), bins=15) +
    labs(x = "Age received transplant, yrs old", y = "Count") +
    theme_bw()
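
Renaming a column by position, as above, depends on the order in which the query returns its columns. A slightly more defensive sketch is to rename by matching the column name instead. The toy data frame and the variable name below are invented for illustration and are not query results.

```r
# Made-up variable name and values for illustration
age_var <- "\\Example study\\Patient age at transplant, years\\"
toy_df <- data.frame(patient_id = 1:4, age = c(12, 25, 31, 8))
names(toy_df)[2] <- age_var

# Rename by matching the column name rather than relying on its position
names(toy_df)[names(toy_df) == age_var] <- "age_at_transplant"

# Quick numeric summary of the renamed column
age_summary <- summary(toy_df$age_at_transplant)
```

In the notebook, `age_transplant_var` would play the role of `age_var` here.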