# PIC-SURE API tutorial using the CureSC database

This tutorial notebook is meant to get you up and running quickly with the R PIC-SURE API. It covers the main functionalities of the API.

## PIC-SURE R API 
### What is PIC-SURE? 

PIC-SURE stands for Patient-centered Information Commons: Standardized Unification of Research Elements. The original data exposed through the PIC-SURE API are organized in many heterogeneous ways. PIC-SURE hides this complexity and exposes the different study datasets in a single tabular format. By simplifying data extraction, it lets investigators focus on downstream analyses and facilitates reproducible science.

### More about PIC-SURE
The API is available in two programming languages, Python and R, allowing investigators to query datasets in the same way using either language. The R/Python PIC-SURE API is a small part of the entire PIC-SURE platform.

The API is actively developed by the Avillach Lab at Harvard Medical School.

GitHub repositories:
* https://github.com/hms-dbmi/pic-sure-r-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-r-client

 -------   

# Getting your own user-specific security token

**Before running this notebook, please be sure you have [added your security token](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/Cure_Sickle_Cell#get-your-security-token). This documentation contains an explanation about how to get a security token, which is required to access the databases.**

# Environment set-up

### Prerequisites
- R 3.5 or later

### Package installation

In [26]:
list_packages <- c("ggrepel",
                   "jsonlite", 
                   "ggplot2",
                   "plyr",
                   "dplyr",
                   "tidyr",
                   "purrr",
                   "urltools",
                   "devtools",
                   "stringi"
                  )

for (package in list_packages) {
    # Install the package only if it is not already in the library
    if (!package %in% rownames(installed.packages())) {
        install.packages(package, dependencies = TRUE)
    }
    library(package, character.only = TRUE)
}

Install the latest R PIC-SURE API libraries from GitHub.

In [27]:
Sys.setenv(TAR = "/bin/tar")
options(unzip = "internal")

devtools::install_github("hms-dbmi/pic-sure-r-client", force = TRUE)
devtools::install_github("hms-dbmi/pic-sure-r-adapter-hpds", force = TRUE)

Downloading GitHub repo hms-dbmi/pic-sure-r-client@HEAD

✔  checking for file ‘/tmp/RtmpHFiulZ/remotes21f0278e3713/hms-dbmi-pic-sure-r-client-ffe158b/DESCRIPTION’
─  preparing ‘picsure’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘picsure_0.1.0.tar.gz’

Downloading GitHub repo hms-dbmi/pic-sure-r-adapter-hpds@HEAD

✔  checking for file ‘/tmp/RtmpHFiulZ/remotes21f01fa00590/hms-dbmi-pic-sure-r-adapter-hpds-1e35133/DESCRIPTION’
─  preparing ‘hpds’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘hpds_0.1.0.tar.gz’

Load user-defined functions

In [28]:
source("R_lib/utils.R")

## Connecting to a PIC-SURE network

You will need the following information before connecting to the PIC-SURE network:
* resource ID: ID of the resource that you are trying to access. You can leave the default value for this project.
* user-specific token text file: A text file called `token.txt` should contain the token retrieved from your user profile in the PIC-SURE UI. This file needs to be located in the R working directory, because the code below reads it through a relative path.

In [29]:
token_file <- "token.txt"
my_token <- scan(token_file, what = "character")
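As a hedged alternative to `scan()`, the sketch below reads the token with `readLines()`, trims stray whitespace, and fails early with a clear message when the file is missing or empty. `read_token` is a hypothetical helper written for this illustration, not part of the `picsure` or `hpds` libraries, and the token value used in the demonstration is fake.

```r
# Hypothetical helper (an assumption, not part of the PIC-SURE libraries):
# read a token file and fail early with a readable message.
read_token <- function(path = "token.txt") {
  if (!file.exists(path)) {
    stop("Token file not found: ", path)
  }
  # Join all lines and strip leading/trailing whitespace
  token <- trimws(paste(readLines(path, warn = FALSE), collapse = ""))
  if (!nzchar(token)) {
    stop("Token file is empty: ", path)
  }
  token
}

# Demonstration with a throwaway file and a fake token value
demo_file <- tempfile(fileext = ".txt")
writeLines("  fake.token.value  ", demo_file)
demo_token <- read_token(demo_file)
print(demo_token)
```

In the notebook itself, the call would simply be `my_token <- read_token("token.txt")`.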

In [30]:
PICSURE_network_URL <- "https://curesc.hms.harvard.edu/picsure/"
connection <- picsure::connect(url = PICSURE_network_URL, 
                               token = my_token)

[1] "57e29a43-38c3-4c4b-84c9-dda8138badbe"


In [31]:
resource <- hpds::get.resource(connection,
                               resourceUUID = picsure::list.resources(connection))

In [32]:
picsure::list.resources(connection)

Two objects were created: a `connection` and a `resource` object, using the `picsure` and `hpds` libraries, respectively. 

Since we will only be using a single resource, **the `resource` object is the only one we will need to proceed with this data analysis.** It should be noted that the `connection` object is useful for accessing different databases stored in different resources. 

The `resource` object is connected to the specific resource ID and enables us to query and retrieve data from this source.

## Getting help with the R PIC-SURE API

The `?` operator displays the help page for any PIC-SURE library function. For example, we can learn more about getting a resource using the following code:

In [33]:
?hpds::get.resource

“internal error -3 in R_decompress1”
ERROR while rich displaying an object: Error in fetch(key): lazy-load database '/home/ec2-user/anaconda3/envs/R/lib/R/library/hpds/help/hpds.rdb' is corrupt

Note: a "lazy-load database ... is corrupt" error like the one above typically appears when a package has been reinstalled while it was already loaded in the current session. Restarting the R kernel and loading the libraries again without reinstalling them usually lets the help page display.

## Using the *variables dictionary*

Once a connection to the desired resource has been established, we first need to understand which variables are available in the database. We will use the dictionary functions of the `hpds` library to do this.

The `find.in.dictionary()` function searches the dictionary of a resource and returns the records that match a specific term. It can also be used to retrieve information about all available variables. For instance, looking for variables containing the term 'Avascular necrosis' is done this way: 

In [34]:
dictionary_search <- hpds::find.in.dictionary(resource, "Avascular necrosis")

Objects created by the `find.in.dictionary()` function can expose the search results using three different methods: `extract.count()`, `extract.keys()`, and `extract.entries()`. 

In [35]:
print(list("Count"   = hpds::extract.count(dictionary_search), 
           "Keys"    = hpds::extract.keys(dictionary_search)[1:2], # Show first two keys
           "Entries" = hpds::extract.entries(dictionary_search)[1:2])) # Show first two entries

$Count
[1] 2

$Keys
$Keys[[1]]
[1] "\\CIBMTR - Cure Sickle Cell Disease\\5 - CRF data collection track only\\Avascular necrosis\\"

$Keys[[2]]
[1] "\\CIBMTR - Cure Sickle Cell Disease\\5 - CRF data collection track only\\Time from HCT to avascular necrosis, months\\"


$Entries
                                                                                                                     name
2                           \\CIBMTR - Cure Sickle Cell Disease\\5 - CRF data collection track only\\Avascular necrosis\\
21 \\CIBMTR - Cure Sickle Cell Disease\\5 - CRF data collection track only\\Time from HCT to avascular necrosis, months\\
   categorical
2         TRUE
21       FALSE



In [36]:
hpds::extract.entries(dictionary_search) %>% tail() # View the last few entries as a data frame

| | name `<chr>` | categorical `<lgl>` | observationCount `<int>` | patientCount `<int>` | min `<dbl>` | max `<dbl>` | HpdsDataType `<chr>` | categoryValues `<list>` | description `<lgl>` |
|---|---|---|---|---|---|---|---|---|---|
| 2 | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Avascular necrosis\` | TRUE | 732 | 732 | | | phenotypes | No , Not reported, Yes | |
| 21 | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Time from HCT to avascular necrosis, months\` | FALSE | 29 | 29 | 0.3947368 | 149.3092 | phenotypes | | |


Viewing the dictionary as a DataFrame allows us to:

* Use the various information exposed in the dictionary (patient count, variable type ...) as criteria for variable selection.
* Use the `name` column of the DataFrame to get the actual variable names to be used in the query, as shown below.

Variable names aren't very practical to use right away for two reasons:
1. They are very long.
2. They contain backslashes, which must be escaped in R strings and therefore prevent direct copy-pasting.

However, retrieving the dictionary search result in the form of a DataFrame makes the variable names easy to access programmatically.
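To make the two uses listed above concrete, here is a minimal sketch that works on a small stand-in data frame rather than on a live dictionary. The column names and the two rows mirror the search output shown earlier; `toy_dict` and the threshold of 100 patients are assumptions chosen only for illustration.

```r
# Toy stand-in for the data frame returned by hpds::extract.entries()
# (an assumption for illustration; only three of the real columns are kept).
toy_dict <- data.frame(
  name = c(
    "\\CIBMTR - Cure Sickle Cell Disease\\5 - CRF data collection track only\\Avascular necrosis\\",
    "\\CIBMTR - Cure Sickle Cell Disease\\5 - CRF data collection track only\\Time from HCT to avascular necrosis, months\\"
  ),
  categorical = c(TRUE, FALSE),
  patientCount = c(732L, 29L),
  stringsAsFactors = FALSE
)

# Use dictionary information as selection criteria:
# keep categorical variables observed in at least 100 patients.
selected <- toy_dict[toy_dict$categorical & toy_dict$patientCount >= 100, ]

# Pull the full variable names from the `name` column, so that they
# never have to be typed or copy-pasted by hand.
selected_names <- selected$name
print(selected_names)
```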

Let's say we want to retrieve every variable for the "CIBMTR - Cure Sickle Cell Disease" study in the form of a DataFrame. We can do this using the code below:

In [37]:
plain_variablesDict <-  hpds::find.in.dictionary(resource, "CIBMTR - Cure Sickle Cell Disease") %>% 
                        hpds::extract.entries()

In [38]:
plain_variablesDict[1:5,]

| | name `<chr>` | min `<dbl>` | categorical `<lgl>` | observationCount `<int>` | patientCount `<int>` | max `<dbl>` | HpdsDataType `<chr>` | categoryValues `<list>` | description `<lgl>` |
|---|---|---|---|---|---|---|---|---|---|
| 2 | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Time from HCT to diabetes, months\` | 0.1315789 | FALSE | 48 | 48 | 46.315789 | phenotypes | | |
| 210 | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Serum creatinine pre-conditioning unit\` | | TRUE | 732 | 732 | | phenotypes | Not reported, mg/dL | |
| 3 | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Hypertension (HTN) requiring therapy\` | | TRUE | 732 | 732 | | phenotypes | No , Not reported, Yes | |
| 4 | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Time from HCT to hyperlipidemia, months\` | 2.7960526 | FALSE | 2 | 2 | 4.243421 | phenotypes | | |
| 5 | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Time from HCT to encephalopathy, months\` | 0.1644737 | FALSE | 13 | 13 | 14.375 | phenotypes | | |


### Extract Full Data Dictionary to CSV

Calling `hpds::find.in.dictionary` with an empty search term returns all entries, as described in the help documentation. We can extract the entire data dictionary by performing an empty search:

In [39]:
fullVariableDict <- hpds::find.in.dictionary(resource, "") %>% 
                    hpds::extract.entries() %>%
                    # Collapse the list column into comma-separated strings
                    mutate(categoryValues = stri_join_list(categoryValues, sep = ", "))
    

Check that the `fullVariableDict` dataframe contains some values.

In [40]:
fullVariableDict[1:5,] # View first five rows

| | name `<chr>` | min `<dbl>` | categorical `<lgl>` | observationCount `<int>` | patientCount `<int>` | max `<dbl>` | HpdsDataType `<chr>` | categoryValues `<chr>` | description `<lgl>` |
|---|---|---|---|---|---|---|---|---|---|
| 2 | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Time from HCT to diabetes, months\` | 0.1315789 | FALSE | 48 | 48 | 46.315789 | phenotypes | | |
| 210 | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Serum creatinine pre-conditioning unit\` | | TRUE | 732 | 732 | | phenotypes | Not reported, mg/dL | |
| 3 | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Hypertension (HTN) requiring therapy\` | | TRUE | 732 | 732 | | phenotypes | No, Not reported, Yes | |
| 4 | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Time from HCT to hyperlipidemia, months\` | 2.7960526 | FALSE | 2 | 2 | 4.243421 | phenotypes | | |
| 5 | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Time from HCT to encephalopathy, months\` | 0.1644737 | FALSE | 13 | 13 | 14.375 | phenotypes | | |


We can then write the data frame that contains the full data dictionary to a CSV file.

In [41]:
dataDictFile <- "data_dictionary.csv" # Name of output file
write.csv(fullVariableDict, dataDictFile, row.names=FALSE)

You should now see a `data_dictionary.csv` file in the JupyterHub file explorer.
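The export above first turns the `categoryValues` list column into plain strings. The sketch below illustrates why, using base R only: a list column generally cannot be serialized by `write.csv()`, so each element is collapsed into one comma-separated string before writing. The data frame `toy` and its values are made up for the example, and `vapply()` with `paste()` is used here as an assumed base R stand-in for `stri_join_list()`.

```r
# Toy data frame with a list column, mimicking `categoryValues`
# (values are illustrative, not taken from the real dictionary).
toy <- data.frame(name = c("A", "B"), stringsAsFactors = FALSE)
toy$categoryValues <- list(c("No", "Yes"), character(0))

# Collapse each list element into a single comma-separated string
toy$categoryValues <- vapply(toy$categoryValues, paste,
                             character(1), collapse = ", ")

# Write to a temporary file and read it back to check the round trip
out_file <- tempfile(fileext = ".csv")
write.csv(toy, out_file, row.names = FALSE)
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)
print(round_trip)
```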

### Parsing variable names

Though the dictionary is helpful as is, we can use a simple function, `get_multiIndex_variablesDict`, defined in `R_lib/utils.R`, to add a little more information and make long variable names easier to work with.

Although not an official feature of the API, such functionality illustrates how to quickly scan and select groups of related variables.

Printing part of the "parsed names" dictionary lets us quickly see the tree-like organization of the variables. Moreover, original and simplified variable names are now stored in the `name` and `simplified_name` columns, respectively. A simplified variable name is the last component of the full variable name, which is usually the most informative about what the variable represents.

In [42]:
# Display the variable tree hierarchy derived from the variable names
variablesDict <- get_multiIndex_variablesDict(plain_variablesDict)
head(variablesDict)

| level_0 `<chr>` | level_1 `<chr>` | level_2 `<chr>` | simplified_name `<chr>` | name `<chr>` | observationCount `<int>` | categorical `<lgl>` | categoryValues `<list>` | min `<dbl>` | max `<dbl>` | HpdsDataType `<chr>` |
|---|---|---|---|---|---|---|---|---|---|---|
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Time from HCT to diabetes, months | Time from HCT to diabetes, months | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Time from HCT to diabetes, months\` | 48 | FALSE | | 0.1315789 | 46.315789 | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Serum creatinine pre-conditioning unit | Serum creatinine pre-conditioning unit | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Serum creatinine pre-conditioning unit\` | 732 | TRUE | Not reported, mg/dL | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Hypertension (HTN) requiring therapy | Hypertension (HTN) requiring therapy | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Hypertension (HTN) requiring therapy\` | 732 | TRUE | No , Not reported, Yes | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Time from HCT to hyperlipidemia, months | Time from HCT to hyperlipidemia, months | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Time from HCT to hyperlipidemia, months\` | 2 | FALSE | | 2.7960526 | 4.243421 | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Time from HCT to encephalopathy, months | Time from HCT to encephalopathy, months | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Time from HCT to encephalopathy, months\` | 13 | FALSE | | 0.1644737 | 14.375 | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Time from HCT to HTN, months | Time from HCT to HTN, months | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Time from HCT to HTN, months\` | 58 | FALSE | | 0.0 | 129.210526 | phenotypes |
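The contents of `R_lib/utils.R` are not shown in this notebook, so the following is only a sketch of how a full variable name could be split into hierarchy levels and a simplified name. `parse_variable_name` is a hypothetical function written for this illustration; it is not the actual implementation of `get_multiIndex_variablesDict`.

```r
# Hypothetical helper (an assumption, not the code in R_lib/utils.R):
# split a backslash-delimited variable name into its components.
parse_variable_name <- function(full_name) {
  # fixed = TRUE treats "\\" as a literal backslash, not a regex escape
  parts <- strsplit(full_name, "\\", fixed = TRUE)[[1]]
  # The leading backslash produces an empty first element; drop it
  parts <- parts[nzchar(parts)]
  list(
    levels = parts,
    simplified_name = parts[length(parts)]
  )
}

parsed <- parse_variable_name(
  "\\CIBMTR - Cure Sickle Cell Disease\\5 - CRF data collection track only\\Avascular necrosis\\"
)
print(parsed$levels)
print(parsed$simplified_name)
```

Each element of `levels` would correspond to one of the `level_0`, `level_1`, `level_2` columns, and the last element to `simplified_name`.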


Below is a simple example to illustrate the ease of use of a multiIndex dictionary. Let's say we are interested in the variables containing "Avascular necrosis".

In [43]:
mask_necrosis <- grepl("Avascular necrosis", variablesDict[["simplified_name"]])

In [44]:
mask_study <- variablesDict[,1] == "CIBMTR - Cure Sickle Cell Disease"
mask_necrosis <- grepl("Avascular necrosis", variablesDict[["simplified_name"]])
more_variables <- variablesDict[mask_study & mask_necrosis,]
more_variables

| level_0 `<chr>` | level_1 `<chr>` | level_2 `<chr>` | simplified_name `<chr>` | name `<chr>` | observationCount `<int>` | categorical `<lgl>` | categoryValues `<list>` | min `<dbl>` | max `<dbl>` | HpdsDataType `<chr>` |
|---|---|---|---|---|---|---|---|---|---|---|
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Avascular necrosis | Avascular necrosis | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Avascular necrosis\` | 732 | TRUE | No , Not reported, Yes | | | phenotypes |


This simple filter can be easily combined with other filters to quickly select variables of interest.
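As a sketch of such a combination, the example below runs on a small stand-in data frame whose simplified names are taken from the outputs above. It also shows that `grepl()` is case-sensitive by default, which is why the search for "Avascular necrosis" returns a single row and misses "Time from HCT to avascular necrosis, months". `toy_vars` is an assumption made for illustration only.

```r
# Toy stand-in for the multiIndex dictionary (illustration only)
toy_vars <- data.frame(
  simplified_name = c(
    "Avascular necrosis",
    "Time from HCT to avascular necrosis, months",
    "Sex"
  ),
  categorical = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Case-sensitive match: only the capitalized variable is found
mask_exact <- grepl("Avascular necrosis", toy_vars$simplified_name)

# Case-insensitive match: both necrosis-related variables are found
mask_any_case <- grepl("avascular necrosis", toy_vars$simplified_name,
                       ignore.case = TRUE)

# Combine with a second criterion: keep only continuous variables
continuous_necrosis <- toy_vars[mask_any_case & !toy_vars$categorical, ]
print(continuous_necrosis$simplified_name)
```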

## Querying and retrieving data

The second cornerstone of the API is the set of `query` functions, which is how we retrieve data from the resource.

Several functions enable us to build a query:

| Function | Arguments / Input | Output |
|--------|-------------------|-------|
| `query.select.add()` | variable names (string) or list of strings | all variables included in the list (no record subsetting) |
| `query.require.add()` | variable names (string) or list of strings | all variables; only records that do not contain null values for input variables |
| `query.anyof.add()` | variable names (string) or list of strings | all variables; only records that contain at least one non-null value for input variables |
| `query.filter.add()` | variable name and additional filtering values | input variable; only records that match filter criteria |

All four functions can be combined when building a query. The records eventually returned by the query have to meet all of the specified filters.
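To illustrate how the criteria combine, the sketch below emulates the record-selection logic described in the table on a small local data frame. It does not call the PIC-SURE API; `toy` and its values are invented, and the logical masks are only an assumed base R analogue of the "require", "anyof" and "filter" behaviours.

```r
# Invented records used only to illustrate the selection logic
toy <- data.frame(
  patient_id = 1:5,
  sex        = c("Male", "Female", "Male", NA, "Male"),
  necrosis   = c("Yes", "Yes", NA, "Yes", "No"),
  months     = c(12.5, NA, NA, 3.0, NA),
  stringsAsFactors = FALSE
)

# "require": the record must have a non-missing value for `sex`
require_mask <- !is.na(toy$sex)

# "anyof": at least one of `necrosis` or `months` must be non-missing
anyof_mask <- !is.na(toy$necrosis) | !is.na(toy$months)

# "filter": `sex` must match the requested value
filter_mask <- !is.na(toy$sex) & toy$sex == "Male"

# A record is returned only if it satisfies every criterion
result <- toy[require_mask & anyof_mask & filter_mask, ]
print(result$patient_id)
```

Patient 3 is dropped by the "anyof" criterion and patient 4 by the "require" criterion, even though each satisfies the others, which is the sense in which all criteria must hold at once.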

### Building the query

Let's say we want to select a cohort consisting of males with avascular necrosis from the CIBMTR - Cure Sickle Cell Disease study. 

First, we need to get the variables from the study of interest. We can do this by searching for the study name in the `level_0` column of the multiIndexed dictionary and saving the corresponding variable names from the `name` column.

In [45]:
# Select all variables from the "CIBMTR" study
mask_study <- variablesDict[["level_0"]] == "CIBMTR - Cure Sickle Cell Disease"
varnames <- variablesDict[mask_study, ]$name %>% as.list()

Now we will find variables pertaining to sex and avascular necrosis. We can do this by searching for "Sex" and "Avascular necrosis" in the `simplified_name` column of the dictionary.

In [46]:
sex_var <- variablesDict[variablesDict[["simplified_name"]] == "Sex", ]$name

avascular_necrosis_varname <- variablesDict[variablesDict[["simplified_name"]] == "Avascular necrosis", ]$name
values <- variablesDict[mask_study, "categoryValues"]

Now we can create a new query and apply our filters to get the data.

In [47]:
my_query <- hpds::new.query(resource = resource)
hpds::query.select.add(my_query, keys = sex_var)
hpds::query.filter.add(my_query, sex_var, "Male")

hpds::query.select.add(my_query, keys = avascular_necrosis_varname)
hpds::query.filter.add(my_query, avascular_necrosis_varname, "Yes")


### Retrieving the data

Once our query object is built, we use the `query.run` function to retrieve the data corresponding to our query.

In [48]:
my_df <- hpds::query.run(my_query, result.type = "dataframe")

In [49]:
dim(my_df)

In [50]:
head(my_df)

| | Patient ID `<int>` | `\CIBMTR - Cure Sickle Cell Disease\1 - Patient Related\Sex\` `<chr>` | `\CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Avascular necrosis\` `<chr>` |
|---|---|---|---|
| 1 | 42 | Male | Yes |
| 2 | 295 | Male | Yes |
| 3 | 336 | Male | Yes |
| 4 | 612 | Male | Yes |
| 5 | 725 | Male | Yes |
| 6 | 752 | Male | Yes |
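The columns of the returned data frame carry the full variable names, which are cumbersome in downstream code. The sketch below renames them to their last component. It runs on a stand-in data frame built from the first three rows shown above; `simplify_names` is a hypothetical helper, not part of the `hpds` library.

```r
# Stand-in for the query result (first three rows of the output above)
toy_result <- data.frame(
  c(42L, 295L, 336L),
  c("Male", "Male", "Male"),
  c("Yes", "Yes", "Yes"),
  stringsAsFactors = FALSE
)
names(toy_result) <- c(
  "Patient ID",
  "\\CIBMTR - Cure Sickle Cell Disease\\1 - Patient Related\\Sex\\",
  "\\CIBMTR - Cure Sickle Cell Disease\\5 - CRF data collection track only\\Avascular necrosis\\"
)

# Hypothetical helper: keep the last non-empty backslash-delimited component
simplify_names <- function(full_names) {
  vapply(strsplit(full_names, "\\", fixed = TRUE), function(parts) {
    parts <- parts[nzchar(parts)]
    parts[length(parts)]
  }, character(1))
}

names(toy_result) <- simplify_names(names(toy_result))
print(names(toy_result))
```

Simplified names are not guaranteed to be unique across a whole study, so it is worth checking for duplicates (for example with `anyDuplicated()`) before renaming a larger result.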
