# PIC-SURE API tutorial using UDN database
This tutorial notebook is meant to get you up and running quickly with the R PIC-SURE API. It covers the API's main functionalities.

## R PIC-SURE API
### What is PIC-SURE?
The databases exposed through the PIC-SURE API rely on a wide variety of underlying architectures and data organizations. PIC-SURE hides this complexity and exposes the different databases in the same format, allowing researchers to focus on analysis and medical insights, which in turn makes reproducible science easier.

### More about PIC-SURE
PIC-SURE stands for Patient-centered Information Commons: Standardized Unification of Research Elements. The API is available in two different programming languages, Python and R, allowing investigators to query databases in the same way using any of those languages.

PIC-SURE is a large project of which the R/Python PIC-SURE API is only one building block. Among other things, PIC-SURE also offers a graphical user interface that lets research scientists quickly get an overview of the variables and data available for a specific data source.

The API is actively developed by the Avillach-Lab at Harvard Medical School.

GitHub repos:

* https://github.com/hms-dbmi/pic-sure-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-client

---

## Getting your own user-specific security token
**Before running this notebook, please be sure to review the get_your_token.ipynb notebook. It explains how to get a security token, which is mandatory to access the databases.**

### Environment set-up

#### Pre-requisites: 
* R >= 3.6

#### Packages installation and imports
Installing some of the packages may take a while; please be patient.



##### Install R packages for the analysis example

In [None]:
# R packages for analysis
list_packages <- c("stringr",
                   "ggplot2",
                   "urltools"
                   )

for (package in list_packages){
     # install the package only if it is not already available
     if(! package %in% rownames(installed.packages())){
         install.packages(package, repos = "http://cran.us.r-project.org", dependencies = TRUE)
     }
     library(package, character.only = TRUE)
}

##### Install latest R PIC-SURE API libraries from GitHub
To install the PIC-SURE libraries from GitHub, we first need to install the `devtools` package.

In [None]:
system(command = 'conda install -c conda-forge r-devtools --yes')

In [None]:
library("devtools")

In [None]:
# pic-sure api lib
devtools::install_github("hms-dbmi/pic-sure-r-adapter-hpds", force = TRUE)
devtools::install_github("hms-dbmi/pic-sure-r-client", force = TRUE)

##### Load user-defined functions

In [None]:
# R_lib for pic-sure
source("R_lib/utils.R")

## Connecting to a PIC-SURE network

### 1. Connect to the UDN data network
Several pieces of information are needed to access data through the PIC-SURE API: a network URL, a resource ID, and a user security token that is specific to a given URL + resource.

In [None]:
# Connection to the PIC-SURE API w/ key
# network information
PICSURE_network_URL <- "https://udn.hms.harvard.edu/picsure"
resource_id <- "c23b6814-7e5b-48d2-80d9-65511d7d2051"

In [None]:
# token is the individual user key given to connect to the UDN resource
token_file <- "token.txt"
my_token <- scan(token_file, what = "character")
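
The `scan()` call above reads the token file as-is. Below is a minimal sketch of a slightly more defensive way to read it, assuming the token sits on the first line of the file. `read_token()` and the dummy token value are hypothetical names introduced for illustration; they are not part of the PIC-SURE libraries.

```r
# Write a dummy token to a temporary file so the sketch is self-contained
# (in practice, the path would point to your own token file)
token_path <- tempfile(fileext = ".txt")
writeLines("  dummy.token.value  ", token_path)

# Hypothetical helper: read the first line of a token file and validate it
read_token <- function(path) {
  if (!file.exists(path)) {
    stop("Token file not found: ", path)
  }
  # keep only the first line and strip surrounding whitespace
  token <- trimws(readLines(path, n = 1, warn = FALSE))
  if (length(token) == 0 || !nzchar(token)) {
    stop("Token file is empty: ", path)
  }
  token
}

my_token_checked <- read_token(token_path)
```

Failing early with an explicit message is usually easier to debug than an authentication error raised later by the server.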

In [None]:
# get connection object
connection <- picsure::connect(url = PICSURE_network_URL,
                                 token = my_token)

In [None]:
# get resource object
resource <- hpds::get.resource(connection,
                               resourceUUID = resource_id)

Two objects are created here: a connection and a resource object, using respectively the `picsure` and `hpds` libraries.

As we will be using a single resource, **the resource object is actually the only one we will need to proceed with data analysis hereafter** (FYI, the connection object is useful to access different databases stored in different resources).

The resource object is connected to the specific data source ID we specified, and makes it possible to query and retrieve data from this source.

#### Getting help with the R PIC-SURE API

The `?` operator prints the help page for any PIC-SURE library function.

In [None]:
# get function documentation
?hpds::get.resource

### 2. Explore the data: data structures description

There are two ways to explore the data, each giving the user a different data structure: a **dictionary object** to explore variables and a **query object** to explore the patient records in UDN.

**Methods**:

* Search variables: find.in.dictionary() method
* Retrieve data: query() methods

**Data structures**:

* Dictionary object structure
* Query object structure
    

#### Explore variables using the _dictionary_

Once a connection to the desired resource has been established, we first need to get a quick idea of which variables are available in the database. To this end, we will use the `hpds::find.in.dictionary()` function on the `resource` object.

`find.in.dictionary()` returns a dictionary object that holds information either about the variables matching a specific search term or about all available variables. For instance, looking for variables containing the term 'aplasia' is done this way:

In [None]:
# create a dictionary object and search for a specific term, in this example for "aplasia"
lookup <- hpds::find.in.dictionary(resource, "aplasia")

We have created a dictionary object containing only the variables matched by the search term. To retrieve the search results from a dictionary object, we have 3 different methods: `extract.count()`, `extract.keys()`, and `extract.entries()`.

In [None]:
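# description of the dictionary search content
# R indexing is 1-based; [1:2] keeps the first two elements (the first two columns for a data frame)
print(list("Count"   = hpds::extract.count(lookup), 
           "Keys"    = hpds::extract.keys(lookup)[1:2],
           "Entries" = hpds::extract.entries(lookup)[1:2]))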

**hpds::extract.entries()** returns the result of the dictionary search as a data.frame.

In [None]:
# show table of records from the dictionary object
hpds::extract.entries(lookup) %>% tail(n = 2)

We can retrieve information about **ALL** variables. We do it without specifying a term in the dictionary search method:

In [None]:
# we search the whole set of variables
plain_variablesDict <- hpds::find.in.dictionary(resource, "") %>% 
hpds::extract.entries()

In [None]:
# description of the whole dictionary of variables
print(dim(plain_variablesDict))
head(plain_variablesDict, n = 2)

The UDN network resource contains 13414 variables described by 11 data fields:
* name
* HpdsDataType
* description
* categorical
* categoryValues
* values
* continuous
* min
* max
* observationCount
* patientCount

The dictionary provides various information about the variables, such as:

* observationCount: number of entries with non-null value
* categorical: type of the variable, True if categorical, False if continuous/numerical
* min/max: only provided for non-categorical variables
* HpdsDataType: 'phenotypes' or 'genotypes'. Currently, the API only exposes 'phenotypes' variables

Hence, the dictionary makes it possible to:

* Use the various pieces of variable information as criteria for variable selection.
* Use the `name` column of the data frame to get the actual variable names, to be used in the query, as shown below.
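
As an illustration of using dictionary fields as selection criteria, here is a minimal sketch on a mock dictionary. The data frame below is invented for the example (it is not UDN content) and only reuses some of the field names listed above; the thresholds are arbitrary.

```r
# Mock dictionary, invented for illustration; column names follow the
# dictionary fields listed above, values are made up
mock_dict <- data.frame(
  name             = c("\\Demo\\Age\\", "\\Demo\\Gender\\", "\\Pheno\\Rare finding\\"),
  categorical      = c(FALSE, TRUE, TRUE),
  observationCount = c(950, 1000, 12),
  patientCount     = c(1000, 1000, 1000),
  stringsAsFactors = FALSE
)

# Keep categorical variables observed in more than half of the patients
keep <- mock_dict$categorical &
  mock_dict$observationCount > 0.5 * mock_dict$patientCount

# The `name` column gives the variable names to pass to a query
selected_names <- mock_dict$name[keep]
print(selected_names)
```

The same logical-mask pattern should apply to the data frame returned by `hpds::extract.entries()`, provided its columns are named as in the field list above.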
 
Variable names (`name` **column** in the data frame), as currently implemented in the API, aren't straightforward to use because:

1. They are very long.
2. They contain backslashes, which must be escaped when a name is copy-pasted into an R string.

However, using the dictionary to select variables helps to deal with this.
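
To make the backslash issue concrete, here is a small sketch in base R. The variable name is hypothetical (it only mimics the backslash-separated structure), and this is not the implementation used in R_lib/utils.R.

```r
# Hypothetical variable name, for illustration only (not taken from UDN).
# In an R string literal, every backslash must be doubled.
var_name <- "\\Level one\\Level two\\Leaf variable\\"

# Split on the literal backslash (fixed = TRUE disables regex interpretation)
parts <- strsplit(var_name, "\\", fixed = TRUE)[[1]]

# Drop the empty component produced by the leading backslash
parts <- parts[nzchar(parts)]

# The last component is usually the most informative one
simplified_name <- parts[length(parts)]

print(parts)
print(simplified_name)
```

Names taken directly from the dictionary data frame are already valid R strings, so no manual escaping is needed in that case.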

##### Parsing variable names
We can use a utility function, `get_multiIndex_variablesDict()`, defined in R_lib/utils.R, to add a little more information and make variable names easier to work with.

Although not an official feature of the API, this functionality illustrates how to quickly scan and select groups of related variables.

Printing part of the "parsed names" dictionary quickly reveals the tree-like organisation of the variables. Moreover, the original and simplified variable names are now stored in the "name" and "simplified_name" columns, respectively (the simplified variable name is simply the last component of the variable name, which usually gives the best idea of what the variable is about).

In [None]:
# Display the variables tree hierarchy from the variables name
variablesDict <- get_multiIndex_variablesDict(plain_variablesDict)
head(variablesDict, n = 2)

Below is a simple example to illustrate the ease of use of a multiIndex dictionary. Let's say we are interested in filtering variables related to "aplasias" in the "nervous system".

In [None]:
mask_system <- variablesDict[,3] == "Abnormality of the nervous system"
mask_abnormality <- grepl("Aplasia", variablesDict[["name"]])
filtered_variables <- variablesDict[mask_system & mask_abnormality,]
print(dim(filtered_variables))
head(filtered_variables, n = 2)

Although simple, this kind of filter can easily be combined with others to quickly select the variables needed.

#### Explore patient records using _query_

Besides the dictionary, the second cornerstone of the API is the set of query methods (`hpds::query.select`, `hpds::query.require`, `hpds::query.anyof`, `hpds::query.filter`). They are the entry point to **query and retrieve data from the resource**.

First, we need to create a query object.

In [None]:
# create a query object for the resource
my_query <- hpds::new.query(resource = resource)

The query object is then passed to the different query methods to build the query:  <font color='orange'>hpds::query.select.add(), hpds::query.require.add(), hpds::query.anyof.add(), and hpds::query.filter.add()</font>. Each of those methods accepts a query object, a list of variable names, and optional additional parameters.

* The **query.select.add()** method accepts variable names, as a string or a list of strings, and makes the query return all the variables included in the list, without any subsetting of records (i.e. subjects/rows).
* The **query.require.add()** method accepts variable names, as a string or a list of strings, and makes the query return all the variables passed, and only the records that do not contain any null value for those variables.
* The **query.anyof.add()** method accepts variable names, as a string or a list of strings, and makes the query return all the variables included in the list, and only the records that contain at least one non-null value for those variables.
* The **query.filter.add()** method accepts a single variable name as a string, plus additional values to filter on for that variable. The query will return this variable and only the records that match the filter.

These 4 methods can be combined when building a query. The records ultimately returned by the query have to meet all the specified filters.
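
The semantics of the four methods can be mimicked on a small local data frame. The sketch below uses base R only and invented toy records; it illustrates the row/column logic described above and does not call the `hpds` functions themselves.

```r
# Toy records, invented for illustration (not UDN data)
records <- data.frame(
  patient_id = 1:4,
  age        = c(12, NA, 30, NA),
  gender     = c("F", "M", NA, NA),
  stringsAsFactors = FALSE
)
vars <- c("age", "gender")

# "select" logic: keep the listed columns, no row subsetting
selected <- records[, c("patient_id", vars)]

# "require" logic: only rows with no missing value in any listed variable
required <- records[complete.cases(records[, vars]), ]

# "anyof" logic: rows with at least one non-missing value among the variables
any_of <- records[rowSums(!is.na(records[, vars])) > 0, ]

# "filter" logic: rows matching a condition on a single variable
filtered <- records[!is.na(records$age) & records$age >= 18, ]

print(required$patient_id)
print(any_of$patient_id)
print(filtered$patient_id)
```

Combining several criteria corresponds to intersecting the row sets: a record is kept only if it passes every condition.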

##### Building the query
Let's say we want to check some demographics about the data in UDN. We will restrict the selection to variables whose observation count is greater than 50% of their patient count.

In [None]:
# select demographic variable names
demographicsDict <- hpds::find.in.dictionary(resource, "demographics") %>% 
    hpds::extract.entries()
mask_obs <- demographicsDict$observationCount > demographicsDict$patientCount * .50
selected_varnames <- as.list(demographicsDict[mask_obs, "name"])
print(length(selected_varnames))
selected_varnames

In [None]:
# build and query for demographics patient data
hpds::query.select.add(query=my_query, keys=selected_varnames)

##### Retrieving the data
Once our query object is finally built, we use the `query.run()` method to retrieve the data corresponding to our query.

In [None]:
# retrieve the query result as a dataframe
demographics_data <- hpds::query.run(my_query, result.type="dataframe")

In [None]:
print(dim(demographics_data))

In [None]:
head(demographics_data)

We have retrieved the patient records in UDN that meet the criteria specified in the query.

**NOTE**: The <font color='orange'>Patient ID</font> is a `COLUMN` of the resulting data frame.

From this point, we can proceed with the data management and analysis using any other R functions or libraries.

##### Visualize the demographics

In [None]:
# rename column names (assumes the data frame has exactly these 7 columns, in this order)
colnames(demographics_data) <- c("Patient_ID",
                                     "age_udn",
                                     "age_symptom",
                                     "age_current",
                                     "ethnicity",
                                     "gender",
                                     "race")

In [None]:
# visualize 
races <- table(demographics_data$race)
pie(races, main="Race distribution in UDN", radius = .5)
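
As a complement to the pie chart, the same distribution can be summarised as percentages. This is a minimal sketch on invented labels (not UDN data), assuming the `race` column is a character vector that may contain missing values.

```r
# Toy race labels, invented for illustration (not UDN data)
race <- c("White", "Asian", "White", NA, "Black", "White", "Asian")

# Count each category, keeping missing values visible
race_counts <- table(race, useNA = "ifany")

# Convert counts to percentages, sorted from most to least frequent
race_pct <- sort(round(100 * prop.table(race_counts), 1), decreasing = TRUE)
print(race_pct)
```

Keeping the missing values in the table makes it clear how much of the cohort has no recorded value, which a pie chart built with the default `table()` call silently drops.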