# PIC-SURE API tutorial using the Undiagnosed Diseases Network (UDN) database
This tutorial notebook is intended to get you up and running quickly with the R PIC-SURE API. It covers the main functionalities of the API.

## R PIC-SURE API
### What is PIC-SURE?
Databases exposed through the PIC-SURE API encompass a wide variety of underlying architectures and data organizations. PIC-SURE hides this complexity and exposes the different databases in the same format, allowing researchers to focus on analysis and medical insights and making reproducible science easier.

### More about PIC-SURE
PIC-SURE stands for Patient-centered Information Commons: Standardized Unification of Research Elements. The API is available in two different programming languages, Python and R, allowing investigators to query databases in the same way using either of those languages.

PIC-SURE is a larger project of which the R/Python PIC-SURE API is only one component. Among other things, PIC-SURE also offers a graphical user interface that allows researchers to explore variables across multiple studies, filter patients that match criteria, and create cohorts from this interactive exploration.

The API is actively developed by the Avillach-Lab at Harvard Medical School.

GitHub repositories:

* https://github.com/hms-dbmi/pic-sure-r-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-r-client
* https://github.com/hms-dbmi/pic-sure-biodatacatalyst-r-adapter-hpds

---

## Getting your own user-specific security token
**Before running this notebook, please be sure to review the "Get your security token" documentation, which exists in the NHLBI_BioData_Catalyst [README.md file](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token). It explains how to get a security token, which is mandatory to access the databases.**

### Environment set-up

#### Pre-requisites: 
* R version >= 3.6

#### Package installation and imports
The installation of some packages may take some time; please be patient.
- packages listed in the `requirements.R` file
- PIC-SURE API components (from GitHub)
    - PIC-SURE Adapter
    - PIC-SURE Client

#### Install latest R PIC-SURE API libraries from GitHub
To install the PIC-SURE libraries from GitHub, we first need to install the `devtools` package.

In [None]:
system(command = 'conda install -c conda-forge r-devtools --yes')

In [None]:
Sys.setenv(TAR = "/bin/tar")
options(unzip = "internal")
install.packages("https://cran.r-project.org/src/contrib/Archive/devtools/devtools_1.13.6.tar.gz", repos=NULL, type="source")
install.packages("https://cran.r-project.org/src/contrib/R6_2.5.1.tar.gz", repos=NULL, type="source")
install.packages("https://cran.r-project.org/src/contrib/hash_2.2.6.1.tar.gz", repos=NULL, type="source")
install.packages(c("urltools"),repos = "http://cran.us.r-project.org")
devtools::install_github("hms-dbmi/pic-sure-r-client", force = TRUE)
devtools::install_github("hms-dbmi/pic-sure-r-adapter-hpds", force = TRUE)
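After installation, it can be worth confirming that the environment meets the prerequisites. The following is a minimal sketch, assuming the two GitHub packages install under the names `picsure` and `hpds` (the namespaces used later in this notebook). It only reports what it finds and does not install anything.

```r
# Check the running R version against the stated prerequisite (>= 3.6)
r_ok <- getRversion() >= "3.6.0"

# Check whether each package can be loaded. The package names are an
# assumption taken from the namespaces used in this notebook (picsure::, hpds::)
required_pkgs <- c("picsure", "hpds")
installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)

print(r_ok)
print(installed)
```

A `FALSE` entry in `installed` suggests that the corresponding installation step above did not complete.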

##### Load user-defined functions

In [None]:
# R_lib for pic-sure
source("R_lib/utils.R")

## Connecting to a PIC-SURE resource

### 1. Connect to the UDN data network
The following is required to get access to data through the PIC-SURE API: 
- Network URL
- Resource id
- User-specific security token

In [None]:
# Connection to the PIC-SURE API w/ key
# network information
PICSURE_network_URL <- "https://udn.hms.harvard.edu/picsure"
resource_id <- "c23b6814-7e5b-48d2-80d9-65511d7d2051"

In [None]:
# token is the individual user key given to connect to the UDN resource
token_file <- "token.txt"
my_token <- scan(token_file, what = "character")

In [None]:
# get connection object
connection <- picsure::connect(url = PICSURE_network_URL,
                                 token = my_token)

In [None]:
# get resource object
resource <- hpds::get.resource(connection,
                               resourceUUID = resource_id)

Two objects are created here: a `connection` and a `resource` object.

Since we will only be using a single resource, **the `resource` object is actually the only one we will need to proceed with data analysis hereafter**.

It is connected to the specific data source ID we specified and enables us to query and retrieve data from this database.

#### Getting help with the R PIC-SURE API

You can get help with PIC-SURE library functions by using the `?` operator:

In [None]:
# get function documentation
?hpds::get.resource

### 2. Explore the data: data structures description

There are two methods to explore the data, each of which yields a different data structure: a **dictionary object** to explore variables and a **query object** to explore the patient records in UDN.

**Methods**:

    * Search variables: find.in.dictionary() method
    * Retrieve data: query() methods

**Data structures**:

    * Dictionary object structure
    * Query object structure
    

#### Explore variables using the _dictionary_

Once a connection to the desired resource has been established, we first need to get a quick idea of which variables are available in the database. To this end, we will search the dictionary of the `resource` object.

A dictionary object lets us retrieve information about either the variables matching a specific term or all available variables, using the `find.in.dictionary()` method. For instance, looking for variables containing the term 'aplasia' is done this way:

In [None]:
# create a dictionary object and search for a specific term, in this example for "aplasia"
lookup <- hpds::find.in.dictionary(resource, "aplasia")

We have created a dictionary object containing only the variables matched by the search term. To retrieve the search result from a dictionary object, we have three different methods: `extract.count()`, `extract.keys()`, and `extract.entries()`.

In [None]:
# description of the dictionary search content
# R indexing starts at 1, so the first two elements/columns are 1:2
print(list("Count"   = hpds::extract.count(lookup), 
           "Keys"    = hpds::extract.keys(lookup)[1:2],
           "Entries" = hpds::extract.entries(lookup)[1:5, 1:2]))

**hpds::extract.entries()** enables us to get the result of the dictionary search in a data.frame format.

In [None]:
# show table of records from the dictionary object
hpds::extract.entries(lookup) %>% tail(n = 2)

We can retrieve information about **ALL** variables. We do it without specifying a term in the dictionary search method:

In [None]:
# we search for ALL variables, and extract the resulting entries
plain_variablesDict <- hpds::find.in.dictionary(resource, "") %>% hpds::extract.entries()

In [None]:
# description of the whole dictionary of variables
print(dim(plain_variablesDict))
head(plain_variablesDict, n = 2)

The UDN network resource contains 13414 variables described by 11 data fields:
* name
* HpdsDataType
* description
* categorical
* categoryValues
* values
* continuous
* min
* max
* observationCount
* patientCount

The dictionary provides various information about the variables, such as:

* observationCount: number of entries with non-null value
* categorical: type of the variables, True if categorical, False if continuous/numerical
* min/max: only provided for non-categorical variables
* HpdsDataType: 'phenotypes' or 'genotypes'. Currently, the API only exposes 'phenotypes' variables

Hence, it enables us to:

* Use the various variables information as criteria for variable selection.
* Use the `name` column of the data frame to get the actual variable names, to be used in the query, as shown below.
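As an illustration of using dictionary fields as selection criteria, here is a minimal base R sketch on a mock data frame. The column names follow the fields listed above, but the rows and the threshold of 100 observations are invented for the example.

```r
# Mock dictionary entries: column names follow the fields listed above,
# but the rows are invented for illustration
mock_dict <- data.frame(
  name = c("\\A\\x\\", "\\A\\y\\", "\\B\\z\\"),
  categorical = c(TRUE, FALSE, FALSE),
  observationCount = c(120, 40, 300),
  stringsAsFactors = FALSE
)

# Keep continuous variables with at least 100 non-null observations
keep <- !mock_dict$categorical & mock_dict$observationCount >= 100

# The `name` column holds the variable names to pass to a query
selected_names <- mock_dict$name[keep]
print(selected_names)
```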
 
Variable names (`name` **column** in the dataframe), as currently implemented in the API, aren't straightforward to use because:

1. Very long
2. Presence of backslashes that requires modification right after copy-pasting.

However, using the dictionary to select variables can help to deal with this. 
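The backslash issue comes from R string syntax: in source code, each literal backslash has to be written as `\\`. The short sketch below uses a variable name that appears later in this notebook to show the difference between what is typed and what is stored.

```r
# In R source code each literal backslash is written as "\\"
var_name <- "\\00_Demographics\\Sex\\"

# writeLines() shows the stored string, which has single backslashes
writeLines(var_name)

# nchar() counts stored characters, so each "\\" counts once
n_chars <- nchar(var_name)
print(n_chars)

# fixed = TRUE matches the text literally, avoiding a second layer of
# regular-expression escaping
has_sex <- grepl("\\Sex\\", var_name, fixed = TRUE)
print(has_sex)
```

Names taken programmatically from the `name` column already contain single backslashes and need no further escaping.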

##### Parsing variable names
We can use a utility function, `get_multiIndex_variablesDict()`, defined in R_lib/utils.R, to add a little more information and make variable names easier to work with.

Although not an official feature of the API, such functionality illustrates how to quickly scan and select groups of related variables.

Printing part of the "parsed names" dictionary lets us quickly see the tree-like organisation of the variables. Moreover, original and simplified variable names are now stored in the "name" and "simplified_name" columns, respectively. The simplified variable name is simply the last component of the variable name, which is usually the most informative part for understanding what each variable is about.

In [None]:
# Display the variables tree hierarchy from the variables name
variablesDict <- get_multiIndex_variablesDict(plain_variablesDict)
head(variablesDict, n = 2)
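To give a rough idea of the parsing involved, the sketch below splits backslash-delimited names into their components and keeps the last one as a simplified name. `split_name()` is a hypothetical helper written for this illustration and is not the implementation found in R_lib/utils.R; the variable names are invented.

```r
# Hypothetical helper, not the implementation in R_lib/utils.R:
# split a backslash-delimited variable name into its components
split_name <- function(x) {
  parts <- strsplit(x, "\\", fixed = TRUE)[[1]]
  parts[nzchar(parts)]  # drop the empty piece before the leading backslash
}

# Invented variable names, for illustration only
mock_names <- c("\\Level A\\Level B\\var_1\\", "\\Level A\\var_2\\")

# The simplified name is the last component of each path
simplified <- vapply(mock_names,
                     function(x) utils::tail(split_name(x), 1),
                     character(1), USE.NAMES = FALSE)
print(simplified)
```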

Below is a simple example to illustrate the ease of use of a multiIndex dictionary. Let's say we are interested in filtering variables related to "aplasias" in the "nervous system".

In [None]:
mask_system <- variablesDict[,3] == "Abnormality of the nervous system"
mask_abnormality <- grepl("Aplasia", variablesDict[["name"]])
filtered_variables <- variablesDict[mask_system & mask_abnormality,]
print(dim(filtered_variables))
head(filtered_variables, n = 2)

Although pretty simple, it can be easily combined with other filters to quickly select necessary variables.

#### Explore patient records using _query_

Besides the dictionary, the second cornerstone of the API is the set of query methods (`hpds::query.select`, `hpds::query.require`, `hpds::query.anyof`, `hpds::query.filter`). They are the entry point to **query and retrieve data from the resource**.

First, we need to create a query object.

In [None]:
# create a query object for the resource
my_query <- hpds::new.query(resource = resource)

The query object is then passed to the different query methods to build the query: <font color='orange'>hpds::query.select.add(), hpds::query.require.add(), hpds::query.anyof.add(), and hpds::query.filter.add()</font>. Each of those methods accepts a query object, a list of variable names, and optional additional parameters.

| Method | Arguments / Input | Output|
|--------|-------------------|-------|
| query.select.add() | variable names (string) or list of strings | all variables included in the list (no record subsetting)|
| query.require.add() | variable names (string) or list of strings | all variables; only records that do not contain null values for input variables |
| query.anyof.add() | variable names (string) or list of strings | all variables; only records that contain at least one non-null value for input variables |
| query.filter.add() | variable name and additional filtering values | input variable; only records that match filter criteria |

All four methods can be combined when building a query. The records eventually returned by the query have to meet all of the specified filters.
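The record-subsetting behaviour described in the table can be mimicked locally on a small data frame. The sketch below is only an analogy in base R, using invented records; it is not how the PIC-SURE server evaluates a query.

```r
# Mock patient records (invented); NA stands for a null value
records <- data.frame(
  patient_id = 1:4,
  age = c(10, NA, 35, NA),
  sex = c("Female", "Male", NA, NA),
  stringsAsFactors = FALSE
)
vars <- c("age", "sex")

# "require": keep records with non-null values for every listed variable
require_rows <- complete.cases(records[, vars])

# "anyof": keep records with at least one non-null value among the variables
anyof_rows <- rowSums(!is.na(records[, vars])) > 0

# "filter": keep records whose value matches the criterion
filter_rows <- !is.na(records$sex) & records$sex == "Female"

print(records$patient_id[require_rows])
print(records$patient_id[anyof_rows])

# Combining methods: a record must satisfy every condition
print(records$patient_id[require_rows & filter_rows])
```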

##### Building the query
Let's say we want to check some demographics about the data in UDN. We will filter to variables that have observation counts > 50% patient counts.

In [None]:
# select demographic variable names
demographicsDict <- hpds::find.in.dictionary(resource, "demographics") %>% 
    hpds::extract.entries()
mask_obs <- demographicsDict %>% filter(observationCount > patientCount * 0.50)
selected_varnames <- mask_obs %>% pull(name) 
print(paste0('We have found ', length(selected_varnames), ' demographics variable(s) which have observation counts > 50% of patient counts (listed below).'))
selected_varnames

You may see warning messages containing the following text when building your query with multiple variables: 
“the condition has length > 1 and only the first element will be used”. These can be ignored.

To double check that your filter has been applied to your query, you can run ```hpds::query.show(query = my_query)```

In [None]:
# build and query for demographics patient data
hpds::query.select.add(query=my_query, keys=selected_varnames)

##### Retrieving the data
Once our query object is built, we use the `query.run()` method to retrieve the data corresponding to our query.

In [None]:
# retrieve the query result as a dataframe
demographics_data <- hpds::query.run(my_query, result.type="dataframe")

In [None]:
print(dim(demographics_data))

In [None]:
head(demographics_data)

##### Working with Variant Data
You can also use the query object to explore variant data. In this example, let's look at variants for the CHD8 gene.

In [None]:
# create a new query
my_query <- hpds::new.query(resource = resource)

In [None]:
# add a filter for a categorical variant: CHD8
hpds::query.filter.add(query=my_query, keys="Gene_with_variant", "CHD8")

Before retrieving the full data frame of variants, let's check that the approximate total count of variants returned by our query is of a reasonable size. Queries returning more than 100,000 variants could crash your notebook.

In [None]:
variantCount <- hpds::query.run(my_query, result.type="variantsApproximateCount")
variantCount
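This size check can be turned into an explicit guard before the full retrieval. The sketch below uses a hypothetical helper, `is_safe_variant_count()`, which is not part of the PIC-SURE API. It assumes the approximate count can be coerced to a number, and it is shown with placeholder values rather than a real query result.

```r
# Hypothetical guard, not part of the PIC-SURE API: decide whether a variant
# query is small enough to retrieve, using the 100,000 threshold given above
is_safe_variant_count <- function(count, limit = 100000) {
  count <- as.numeric(count)  # assumption: the count is coercible to numeric
  !is.na(count) && count <= limit
}

# Placeholder values standing in for the result of the count query
print(is_safe_variant_count(2500))
print(is_safe_variant_count(250000))
```

In practice, one might only request the variant data frame when the guard returns `TRUE`, and otherwise add further filters first.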

Another example of a genomic filter is looking at the variant frequency.
- Novel variants are not found in the rest of the population
- Rare variants are found in <1% of the population
- Common variants are found in >= 1% of the population
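The three categories above can be written as a small classification rule. This is a sketch under the assumption that frequency is expressed as a proportion between 0 and 1 and that a value of 0 stands for "not found in the rest of the population"; `classify_frequency()` is a hypothetical helper, and the database's own labelling may be derived differently.

```r
# Hypothetical helper: label a population frequency (a proportion between
# 0 and 1) with the categories listed above
classify_frequency <- function(freq) {
  ifelse(freq == 0, "Novel",
         ifelse(freq < 0.01, "Rare", "Common"))
}

# Invented frequencies, for illustration only
labels <- classify_frequency(c(0, 0.002, 0.01, 0.3))
print(labels)
```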

**Filtering by Gene prior to adding additional genomic or phenotypic filters is good practice to ensure the system does not become overwhelmed by a very large query.**

In [None]:
# Example querying for rare variants in the following genes of interest: CHD8, CHD9, and CHCHD10
my_query <- hpds::new.query(resource = resource)
hpds::query.filter.add(query=my_query, keys="Gene_with_variant", list("CHD8", "CHD9", "CHCHD10")) 
hpds::query.filter.add(query=my_query, keys="Variant_frequency_as_text", "Rare") 
variant_data <- hpds::query.run(my_query, result.type="variantsDataFrame")

head(variant_data)

In [None]:
dim(variant_data)

We can further add a phenotypic filter to this existing genomic query, to find rare variants in the genes of interest where the sex of the participant is "Female".

In [None]:
# Example combining variant and phenotype queries
hpds::query.filter.add(query=my_query, keys="\\00_Demographics\\Sex\\", "Female")
variant_data <- hpds::query.run(my_query, result.type="variantsDataFrame")
head(variant_data)
dim(variant_data)

In [None]:
head(demographics_data)

### Generating Patient ID Mapping
You may notice that the Patient IDs found in the demographics dataframe do not match the Patient IDs found from our genomic query. Phenotypic queries return 'Patient IDs' while genomic queries return 'UDN IDs'. You can create a mapping between these two types of IDs as demonstrated below, which you can use to merge phenotypic and genomic data.

In [None]:
mapping_query <- hpds::new.query(resource = resource)
hpds::query.select.add(query=mapping_query, keys='\\000_UDN ID\\')
id_mapping <- hpds::query.run(mapping_query, result.type="dataframe")
head(id_mapping)
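Once a mapping is available, merging the two kinds of data is a two-step join. The sketch below uses mock tables in base R; every column name and value is invented for illustration and would need to be replaced by the columns actually returned by the queries.

```r
# Mock tables: all column names and values here are invented and stand in
# for the mapping, phenotypic, and genomic query results
id_map <- data.frame(patient_id = c(1, 2, 3),
                     udn_id = c("UDN_A", "UDN_B", "UDN_C"),
                     stringsAsFactors = FALSE)
phenotypes <- data.frame(patient_id = c(1, 2, 3),
                         sex = c("Female", "Male", "Female"),
                         stringsAsFactors = FALSE)
variants <- data.frame(udn_id = c("UDN_A", "UDN_C", "UDN_C"),
                       gene = c("CHD8", "CHD8", "CHD9"),
                       stringsAsFactors = FALSE)

# Attach the second identifier to the phenotypic table, then join the variants
phenotypes_mapped <- merge(phenotypes, id_map, by = "patient_id")
combined <- merge(variants, phenotypes_mapped, by = "udn_id")
print(combined)
```

`merge()` performs an inner join by default, so participants without any variant in the genomic table are dropped from `combined`.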

In [None]:
sessionInfo()