# PIC-SURE API tutorial using the Genomic Information Commons database

This tutorial notebook aims to get you up and running quickly with the R PIC-SURE API. It covers the main functionalities of the API.

We will start with some general information about PIC-SURE and the basics of interacting with the PIC-SURE API using the R client libraries.

We will then walk through an example exploration of the data to informally answer this question:

Is there any relationship between the intractability of seizures and either age or calcium levels in the blood?

*spoiler alert: there is no relationship identified in this notebook*



## PIC-SURE R API 
### What is PIC-SURE? 

<!--img src="./img/PIC-SURE_logo.png" width= "360px"> -->

The databases exposed through the PIC-SURE API encompass a wide variety of underlying architectures and data organizations. PIC-SURE hides this complexity and exposes the different databases in the same format, allowing researchers to focus on the analysis and medical insights, thus easing the process of reproducible science.

### More about PIC-SURE
PIC-SURE stands for Patient-centered Information Commons: Standardized Unification of Research Elements. The API is available in two programming languages, Python and R, allowing investigators to query databases using either language.

PIC-SURE is a large project of which the R/Python PIC-SURE API is only one component. Among other things, PIC-SURE also offers a graphical user interface that lets research scientists quickly learn which variables and data are available for a specific data source.

The API is actively developed by the Avillach-Lab at Harvard Medical School.

GitHub repos for R client libraries:
* https://github.com/hms-dbmi/pic-sure-r-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-r-client



 -------   

# Getting your own user-specific security token

**Before running this notebook, please be sure to review the get_your_token.ipynb notebook. It explains how to get a security token, which is mandatory to access the databases.**

# Environment set-up

### Prerequisites
- R 3.5 or later

### Packages installation

In [None]:
list_packages <- c("ggrepel",
                   "jsonlite", 
                   "ggplot2",
                   "plyr",
                   "dplyr",
                   "tidyr",
                   "purrr",
                   "urltools",
                   "stringr",
                   "devtools")

# Install any missing package, then load it.
for (package in list_packages){
     if(! package %in% rownames(installed.packages())){
         install.packages(package, dependencies = TRUE)
     }
     library(package, character.only = TRUE)
}

#### Installing latest R PIC-SURE API libraries from github

In [None]:
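# Point R at the system tar and use the built-in unzip for GitHub installs.
Sys.setenv(TAR = "/bin/tar")
options(unzip = "internal")
# Install the client libraries from the specified GitHub refs.
devtools::install_github("hms-dbmi/pic-sure-r-client", ref = "dynamic_psama_url", force = TRUE)
devtools::install_github("hms-dbmi/pic-sure-r-adapter-hpds", ref = "adding-fields-performance-fix", force = TRUE)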

## Connecting to a PIC-SURE network

Several pieces of information are needed to access data through the PIC-SURE API: a network URL, a resource id, and a user security token that is specific to a given URL + resource.

In [None]:
PICSURE_network_URL <- "http://wildfly:8080/pic-sure-api-2/PICSURE/"
resource_id <- "4fec17a3-21de-4ab4-82e2-2372724e584e"
token_file <- "token.txt"

We need to read your token from the file you created:

In [None]:
my_token <- scan(token_file, what = "character")
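If the token file may contain stray whitespace or blank lines, a slightly more defensive read can help. The sketch below is one way to do that in base R, assuming the token sits on a single line; `read_token` is a hypothetical helper and not part of the PIC-SURE client libraries, and the token value is made up.

```r
# Hypothetical helper (not part of the PIC-SURE client libraries):
# read a single-line token from a text file and strip surrounding whitespace.
read_token <- function(path) {
  stopifnot(file.exists(path))
  lines <- readLines(path, warn = FALSE)
  # Keep only non-blank lines, then trim leading/trailing whitespace.
  token <- trimws(lines[nzchar(trimws(lines))])
  if (length(token) != 1) {
    stop("Expected exactly one non-empty line in the token file")
  }
  token
}

# Demonstration with a throwaway file and a made-up token value.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("  not-a-real-token  ", ""), demo_file)
demo_token <- read_token(demo_file)
```

Failing early on an empty or multi-line file makes a badly saved token easier to spot than an authentication error later on.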


Note that the resource_id above may be different in your environment. You can see what resources are available using the following line of code. 

In [None]:
connection <- picsure::connect_local(token = my_token)
picsure::list.resources(connection)

We then get a connection to the chosen resource. If the resource_id set above is not in the prior output, replace it with one that is.

In [None]:
resource <- hpds::get.resource(connection,
                               resourceUUID = resource_id)

Two objects are created here: a `connection` and a `resource` object, using respectively the `picsure` and `hpds` libraries. 

As we will only be using a single resource, **the `resource` object is actually the only one we will need to proceed with data analysis hereafter** (FYI, the `connection` object is useful for accessing different databases stored in different resources). 

The `resource` object is connected to the specific data source ID we specified, and lets us query and retrieve data from this source.

## Getting help with the R PIC-SURE API

The `?` operator displays the help page for any PIC-SURE library function.

In [None]:
?hpds::get.resource

## Using the *variables dictionary*

Once a connection to the desired resource has been established, we first need to get a quick idea of which variables are available in the database. To this end, we will use the dictionary search functions of the `hpds` library on the `resource` object.

The `find.in.dictionary()` function can retrieve the records matching a specific term, or information about all available variables. For instance, looking for variables containing the term 'seizure' is done this way: 

In [None]:
dictionary_search <- hpds::find.in.dictionary(resource, "seizure")
seizure_concepts <- hpds::extract.dataframe(dictionary_search)

**`hpds::extract.dataframe()` returns the result of the dictionary search as a data.frame.**

The dictionary provides various information about the variables, such as:
- observationCount: number of entries with non-null value
- categorical: type of the variable, True if categorical, False if continuous/numerical
- min/max: only provided for non-categorical variables
- HpdsDataType: 'phenotypes' or 'info'. 'info' records are variant annotations; filtering by these is still in development for the client libraries.

Hence, the dictionary lets you:
* Use the various pieces of variable information as criteria for variable selection.
* Use the `name` column of the data frame to get the actual variable names, to be used in the query, as shown below.


Variable names, as currently implemented in the API, aren't straightforward to use:
1. They can be very long.
2. They contain backslashes, which require escaping after copy-pasting.

However, using the dictionary to select variables helps with this. Here is the data frame of `seizure_concepts` retrieved above: 

In [None]:
seizure_concepts


 

So we now have a dataframe with information about all concepts containing `seizure`. What about `age` and `calcium` levels?

For a search term such as `age`, the initial results will also contain matches such as leakage, damage, shrinkage, etc., so we need to filter the results with the tools R provides for filtering dataframes.

In [None]:
age_concepts <- hpds::find.in.dictionary(resource, "age") %>% hpds::extract.dataframe()
age_concepts

So we have 933 concepts that matched our search for `age`, but we are interested only in the demographic `age`, not any of the other terms that contain `age`. Let's try filtering out the other terms using the `grepl` function.

In [None]:
demo_age_concepts = filter(age_concepts, grepl("demo", age_concepts$name, ignore.case = TRUE))
demo_age_concepts

Great! That was pretty straightforward. There will certainly be cases where things are not so clear, but hopefully this shows how you can narrow down the dictionary results using any of the tools available in R, because the results are just another dataframe. Let's save the specific `age` concept we want for later use.

In [None]:
demo_age_concept = demo_age_concepts[1,]
demo_age_concept
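The substring problem described above (a search for `age` also matching `leakage` or `damage`) can also be handled with a word-boundary regular expression. The sketch below uses a small made-up data frame in place of real dictionary results; the concept paths in it are illustrative only.

```r
# Made-up stand-in for a dictionary search result; only the `name` column matters here.
mock_concepts <- data.frame(
  name = c("\\ACT Demographics\\Years\\Age\\",
           "\\Findings\\Leakage of fluid\\",
           "\\Findings\\Brain damage\\",
           "\\Demographics\\Age at visit\\"),
  categorical = c(FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# "\\b" marks a word boundary, so "age" must appear as a whole word.
whole_word_age <- grepl("\\bage\\b", mock_concepts$name,
                        ignore.case = TRUE, perl = TRUE)
age_only <- mock_concepts[whole_word_age, ]

# Conditions can be chained, e.g. whole-word "age" under a path containing "demo".
demo_age_only <- age_only[grepl("demo", age_only$name, ignore.case = TRUE), ]
```

Because the dictionary result is an ordinary data frame, the same indexing works with any logical vector, whether it comes from `grepl` or from a comparison on another column.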

Ok, we have the `seizure` concepts and the `age` concept, so how about calcium? 

In [None]:
calcium_concepts <- hpds::find.in.dictionary(resource, "Calcium") %>% hpds::extract.dataframe()
calcium_concepts

At least there are fewer concepts, but I think we can narrow that down even further using `grepl`.

In [None]:
electrolyte_calcium_concepts = filter(calcium_concepts, grepl("Electrolytes", calcium_concepts$name))
electrolyte_calcium_concepts

Ok, now we are down to 2 concepts. Notice that the metadata for these concepts is identical: same min, max, observationCount, and patientCount. It is likely that these are just synonymous concepts and that the data in each is identical. In a real use-case we would want to confirm that, but for our example we will just select the second one as our `calcium_concept` and set it aside to add to our query later.

In [None]:
calcium_concept = electrolyte_calcium_concepts[2,]
calcium_concept
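The check mentioned above (confirming that two apparently synonymous concepts really hold the same data) could look something like the sketch below once both columns have been retrieved. The data frame and its column names are invented for illustration.

```r
# Invented example data: two columns standing in for two candidate calcium concepts.
mock_labs <- data.frame(
  patient_id = 1:5,
  calcium_a  = c(9.1, NA, 8.7, 10.2, NA),
  calcium_b  = c(9.1, NA, 8.7, 10.2, NA)
)

# Same pattern of missing values?
same_missing <- identical(is.na(mock_labs$calcium_a), is.na(mock_labs$calcium_b))

# Same values (within numerical tolerance) wherever values are present?
same_values <- isTRUE(all.equal(mock_labs$calcium_a, mock_labs$calcium_b))

concepts_match <- same_missing && same_values
```

`all.equal` is wrapped in `isTRUE` because it returns a description of the differences, rather than `FALSE`, when the vectors differ.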

## Querying and retrieving data

Besides the dictionary, the second cornerstone of the API is the set of `query` functions (`hpds::query.anyof.add`, `hpds::query.select.add`, `hpds::query.filter.add`, `hpds::query.require.add`). They are how a user tells the API which data to export. Any fields added using these functions will be returned in the results of the export.

First, we need to create a query object.

In [None]:
my_query <- hpds::new.query(resource = resource)

The query object will then be passed to the different query functions to build the query: `hpds::query.anyof.add`, `hpds::query.select.add`, `hpds::query.filter.add`, `hpds::query.require.add`. Each of these functions accepts a query object, a list of variable names, and possibly additional parameters.

- The `query.select.add()` function accepts variable names as a string or a list of strings, and makes the query return all variables included in the list, without any subsetting or filtering of records (i.e. subjects/rows).

- The `query.require.add()` function accepts variable names as a string or a list of strings, and makes the query return all the variables passed, and only records that do not contain any null values for those variables.

- The `query.anyof.add()` function accepts variable names as a string or a list of strings, and makes the query return all variables included in the list, and only records that contain at least one non-null value for any of those variables.

- The `query.filter.add()` function accepts a single variable name as a string, plus additional values to filter on for that variable. The query will return this variable and only the records that match this filter.

All four functions can be combined when building a query. The records eventually returned by the query have to meet all the specified filters.
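To make the difference between these four kinds of criteria concrete, the sketch below mimics each of them on a small local data frame using base R only. It does not call the PIC-SURE API; the data and column names are made up, and the real server-side behaviour may differ in details.

```r
# Made-up records: NA means "no value recorded for this patient".
records <- data.frame(
  id      = 1:5,
  age     = c(34, 51, NA, 8, 62),
  dx_a    = c("yes", NA, NA, "yes", NA),
  dx_b    = c(NA, "yes", NA, NA, NA),
  calcium = c(9.2, NA, 8.8, 10.1, 9.5),
  stringsAsFactors = FALSE
)

# select: return the listed variables, keep every record.
selected <- records[, c("id", "age")]

# require: keep only records with no missing value in the listed variables.
required <- records[complete.cases(records[, c("age", "calcium")]), ]

# anyof: keep records with at least one non-missing value among the listed variables.
any_of <- records[rowSums(!is.na(records[, c("dx_a", "dx_b")])) > 0, ]

# filter: keep records whose value meets a condition (here a numeric range).
filtered <- records[!is.na(records$age) & records$age >= 18 & records$age <= 60, ]

# Combining criteria: a record must satisfy all of them.
combined <- records[records$id %in% intersect(any_of$id, filtered$id), ]
```

In this toy data, `required` keeps patients 1, 4 and 5, `any_of` keeps 1, 2 and 4, and `filtered` keeps 1 and 2, so combining the last two leaves patients 1 and 2.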

### Building the query

First we want to make sure all participants in the results have at least one of the epilepsy and seizure terms; we use `query.anyof.add` to do this. Because all of these fields are added to the `anyof` list, they are also returned by the API, so we don't need to explicitly add them to `select`. To do this we need to extract just the `name` field of the concepts.

In [None]:
seizure_concept_paths = seizure_concepts[,1]
seizure_concept_paths 

In [None]:
hpds::query.anyof.add(my_query, keys = seizure_concept_paths)


Next we `select` the `age` and `calcium` concept paths the same way.

In [None]:
age_concept_path = demo_age_concept[,1]
calcium_concept_path = calcium_concept[,1]

In [None]:
hpds::query.select.add(my_query, keys = age_concept_path)
hpds::query.select.add(my_query, keys = calcium_concept_path)

## Retrieving the data

Once our query object is built, we use the `query.run` function to retrieve the data corresponding to our query.

In [None]:
my_df <- hpds::query.run(my_query, result.type = "dataframe")

In [None]:
dim(my_df)

In [None]:
my_df

Since our use-case is to compare non-intractable seizures against intractable seizures, and we have 1379 seizure concepts listed in our dataset, we have to derive computed variables from the many variables that represent each group.

To do this we first subset the query results, extracting the concepts that signify non-intractable seizures. Of course this is not labeled consistently: sometimes the label is `not intractable`, sometimes it is `without intractability`. This is a common issue in EHR data and cannot be avoided; we just have to account for it as we formulate our computed variables.

In the R client library, the column names of the dataframe returned by the query are transformed by replacing spaces and backslashes with periods, so we have to apply the same translation to our patterns in `str_subset`. Additionally, each column name is prefixed with a capital `X`.

The following function performs this translation and will likely be exposed through the R client library in the future.

In [None]:
concept_to_column_name <- function(concept_name) {
   # Replace every backslash and space with a period, then insert "X" at the start of the string.
   return (str_replace_all(str_replace_all(concept_name, "[\\\\ ]", "."), "^", "X"))
}

So the `age` concept path we selected can be transformed to the corresponding dataframe column name using the following call.

In [None]:
concept_to_column_name(age_concept_path)
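For paths made up only of letters, digits, spaces and backslashes, this translation gives the same result as base R's `make.names()`, which prepends `X` when a name does not start with a letter or a dot and replaces invalid characters with periods. The sketch below compares a base-R rewrite of the helper with `make.names()` on an example path. The helper name is hypothetical, and whether the client library itself relies on `make.names()` is an assumption not confirmed here.

```r
# Base-R equivalent of the helper above (no stringr needed); the name is hypothetical.
concept_to_column_name_base <- function(concept_name) {
  paste0("X", gsub("[\\\\ ]", ".", concept_name))
}

# Example concept path: backslash-delimited, with a space in the first level.
example_path <- "\\ACT Demographics\\Years\\Age\\"

by_hand    <- concept_to_column_name_base(example_path)
by_default <- make.names(example_path)
```

The two approaches diverge for paths containing other punctuation, such as parentheses or commas, which `make.names()` also replaces with periods; in that case the actual column names of the query result are the reference.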

So let's create a dataframe that only has our non-intractable variables. We start by listing the column names that match either `not.intractable` or `without.intractability`.

In [None]:
# Column names containing either of the two "non-intractable" labels.
not_intractable <- str_subset(colnames(my_df), pattern = fixed("not.intractable"))
without_intractability <- str_subset(colnames(my_df), pattern = fixed("without.intractability"))

not_intractable <- append(not_intractable, without_intractability)
not_intractable

Let's use that list of variable names to create a dataframe with one row per patient and a computed variable indicating whether that patient has any record of non-intractable seizures.

In [None]:
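# Start from the patient identifiers.
df_not_intractable <- data.frame(my_df$Patient.ID)

# Concatenate the values of all non-intractable columns into one "|"-separated string per row.
for (col in not_intractable)
{
    df_not_intractable$not_intractable = paste(df_not_intractable$not_intractable, unlist(my_df[col]), sep = "|", collapse=NULL)
}
# Flag rows whose concatenated string is longer than 13 characters.
df_not_intractable$not_intractable <- with(df_not_intractable, str_length(not_intractable) > 13)

df_not_intractable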
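The length threshold used above depends on what the concatenated string looks like for patients with no matching record. A less format-dependent way to express the same idea is to test each column for a recorded value and combine the tests. The sketch below does this on invented data, assuming missing observations appear as `NA` or empty strings; that assumption should be checked against the actual query results.

```r
# Invented stand-in for query results with two "non-intractable" columns.
mock_df <- data.frame(
  Patient.ID = c(101, 102, 103, 104),
  X.dx.not.intractable = c("Epilepsy, not intractable", NA, "", NA),
  X.dx.without.intractability = c(NA, NA, NA, "Seizure without intractability"),
  stringsAsFactors = FALSE
)

flag_cols <- c("X.dx.not.intractable", "X.dx.without.intractability")

# TRUE where a cell holds a non-missing, non-empty value.
values <- as.matrix(mock_df[flag_cols])
has_value <- !is.na(values) & values != ""

# A patient is flagged if any of the columns has a value.
mock_flags <- data.frame(
  Patient.ID = mock_df$Patient.ID,
  not_intractable = unname(rowSums(has_value) > 0)
)
```

Here patients 101 and 104 are flagged, while 102 (all `NA`) and 103 (an empty string only) are not.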

We then get a list of `intractable seizure` variable names by excluding the non-intractable variables from the columns in our query results. We also exclude the `Patient.ID`, `X.ACT.Demographics.Years.Age.` and `X.ACT.Laboratory.Tests.Chemistry.Electrolytes.Calcium.` variables because we don't want to duplicate them.

In [None]:
not_intractable<-append(not_intractable,"Patient.ID")
not_intractable<-append(not_intractable,"X.ACT.Demographics.Years.Age.")
not_intractable<-append(not_intractable,"X.ACT.Laboratory.Tests.Chemistry.Electrolytes.Calcium.")
cols = colnames(my_df)
intractable <- cols[(!cols %in% not_intractable)]
intractable

Let's use that list of variable names to create a dataframe with a computed variable indicating, for each patient, whether there was a record of intractable seizures, and include the non-intractable flag as well. We also add the `X.ACT.Demographics.Years.Age.` and `X.ACT.Laboratory.Tests.Chemistry.Electrolytes.Calcium.` variables.

In [None]:
df_intractable <- df_not_intractable

for (col in intractable)
{
    df_intractable$intractable = paste(df_intractable$intractable, unlist(my_df[col]), sep = "|", collapse=NULL)
}
df_intractable$intractable <- with(df_intractable, str_length(intractable) > 24)
df_intractable$age <- unlist(my_df['X.ACT.Demographics.Years.Age.'])
df_intractable$calcium <- unlist(my_df['X.ACT.Laboratory.Tests.Chemistry.Electrolytes.Calcium.'])

df_intractable


Now that we have transformed our data into a form that fits our specific question, let's use a visualization to make an informal determination.

In [None]:
ggplot(data = df_intractable %>% filter(calcium > 0)) + 
  geom_point(mapping = aes(x = age, y = calcium, color=intractable, fill=not_intractable), shape=21, size=2, alpha = 1)
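A scatter plot gives a visual impression; a per-group numeric summary can complement it. The sketch below computes group medians on invented data that reuses the column names of `df_intractable`. The values are made up and say nothing about the real dataset.

```r
# Invented data with the same column names as df_intractable.
mock_result <- data.frame(
  intractable = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE),
  age         = c(12, 40, 33, 58, 21, 7),
  calcium     = c(9.4, 8.9, 9.1, 0, 9.8, 9.0)
)

# Drop non-positive calcium values, mirroring the calcium > 0 filter used for the plot.
valid <- mock_result[mock_result$calcium > 0, ]

# Median age and calcium for each value of the intractable flag.
group_medians <- aggregate(cbind(age, calcium) ~ intractable,
                           data = valid, FUN = median)
```

Medians are used here because they are less sensitive to outlying lab values than means; any other summary function can be passed as `FUN`.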



So as is probably to be expected, there does not appear to be any real relationship between intractability and age or calcium levels. 

The purpose of this example is only to show how you can navigate the data dictionary in PIC-SURE and build queries to retrieve dataframes. This should provide you with a starting point for tackling your own use-cases.


### Keeping a list of all concepts for reference

Another thing to help you on your journey is that calling the `hpds::find.in.dictionary` function without a search term returns the entire dictionary, as shown in the help documentation. You can use standard R tools to browse through this dictionary locally for reference purposes.

While this is useful, keep in mind that new concepts can be added, moved around in the hierarchy, or removed at any time. PIC-SURE datasets tend to be lively.

In [None]:
write.csv(hpds::find.in.dictionary(resource) %>% hpds::extract.dataframe(),"all_concepts_gic.csv", row.names = FALSE)

In [None]:
read.csv("all_concepts_gic.csv")
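Once a copy of the dictionary is stored locally, it can be searched without calling the API. As an alternative to CSV, the sketch below stores a tiny invented dictionary in R's native RDS format, which restores the data frame exactly as it was saved, and then searches it with `grepl`. The `name` and `categorical` columns mirror those described earlier, but the rows are made up.

```r
# Tiny invented dictionary; real dictionaries have many more rows and columns.
mock_dictionary <- data.frame(
  name = c("\\Labs\\Calcium\\", "\\Demographics\\Age\\", "\\Dx\\Seizure\\"),
  categorical = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Save to a temporary RDS file and read it back.
rds_path <- tempfile(fileext = ".rds")
saveRDS(mock_dictionary, rds_path)
local_dictionary <- readRDS(rds_path)

# Search the concept paths locally: continuous variables mentioning calcium or age.
continuous_hits <- local_dictionary[
  grepl("calcium|age", local_dictionary$name, ignore.case = TRUE) &
    !local_dictionary$categorical, ]
```

A saved copy is a snapshot, so it is worth regenerating it from time to time.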