# PIC-SURE API Use-Case: Querying on Genomic Variables

This tutorial notebook is intended to get you up and running quickly with the PIC-SURE R API. It covers the main functionalities of the API.

## PIC-SURE R API 
### What is PIC-SURE? 

As part of the BioData Catalyst initiative, the Patient Information Commons Standard Unification of Research Elements (PIC-SURE) platform has been integrating clinical and genomic datasets from multiple TOPMed and TOPMed-related studies funded by the National Heart, Lung, and Blood Institute (NHLBI).

The original data exposed through the PIC-SURE API are organized in highly heterogeneous ways. PIC-SURE hides this complexity and exposes the datasets of the different studies in a single tabular format. By easing the process of data extraction, it allows investigators to focus on downstream analyses and facilitates reproducible science.

Both phenotypic and genetic variables are accessible through the PIC-SURE API.

### More about PIC-SURE
The API is available in two programming languages, Python and R, enabling investigators to query the databases the same way using either language.

The R/Python PIC-SURE API is only one building block of the larger PIC-SURE project. Among other things, PIC-SURE also offers a graphical user interface that allows researchers to explore variables across multiple studies, filter patients that match criteria, and create cohorts from this interactive exploration.

The R API is actively developed by the Avillach-Lab at Harvard Medical School.

PIC-SURE API R Library GitHub repos:
* https://github.com/hms-dbmi/pic-sure-r-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-r-client



 -------

# Getting your own user-specific security token

**Before running this notebook, please be sure to review the `get_your_token.ipynb` notebook. It explains how to get a security token, which is mandatory to access the databases.**

# Environment set-up

### Prerequisites
- R 3.4 or later

### Packages installation

In [None]:
source("R_lib/requirements.R")

#### Installing the latest PIC-SURE API library from GitHub

We install the two components of the PIC-SURE API from GitHub: the PIC-SURE adapter and the PIC-SURE client.

In [None]:
Sys.setenv(TAR = "/bin/tar")
options(unzip = "internal")
# Use TRUE rather than T: T is an ordinary variable that can be reassigned
devtools::install_github("hms-dbmi/pic-sure-r-client", force = TRUE)
devtools::install_github("hms-dbmi/pic-sure-r-adapter-hpds", force = TRUE)
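
Before loading the libraries, it can be useful to confirm that the installation succeeded. The following is a minimal sketch using base R only. The package names `picsure` and `hpds` are taken from the namespaces used later in this notebook, and `check_installed` is a hypothetical helper name, not part of any PIC-SURE library.

```r
# Hypothetical helper: report which of the given packages can be loaded.
check_installed <- function(pkgs) {
    # requireNamespace() returns TRUE/FALSE without attaching the package
    vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

status <- check_installed(c("picsure", "hpds"))
print(status)

# List any package that could not be found
missing_pkgs <- names(status)[!status]
if (length(missing_pkgs) > 0) {
    message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```

If a package is reported as missing, rerunning the installation cell above is the first thing to try.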

In [None]:
library(stringr)
library(dplyr)

## Connecting to a PIC-SURE resource

Several pieces of information are required to access data through the PIC-SURE API: a network URL, a resource ID, and a user-specific security token.

In [None]:
PICSURE_network_URL <- "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id <- "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file <- "token.txt"

In [None]:
token <- scan(token_file, what = "character")
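
A stray space or newline in the token file is an easy mistake to make when saving the token. As a sketch using base R only, the token can be read and validated as shown below. `read_token` is a hypothetical helper, and the temporary file merely stands in for `token.txt`; the token value is a placeholder.

```r
# Hypothetical helper: read a token file and strip surrounding whitespace.
read_token <- function(path) {
    if (!file.exists(path)) {
        stop("Token file not found: ", path)
    }
    # Join all lines and drop leading/trailing whitespace
    token <- trimws(paste(readLines(path, warn = FALSE), collapse = ""))
    if (!nzchar(token)) {
        stop("Token file is empty: ", path)
    }
    token
}

# Demonstration with a temporary file standing in for token.txt
demo_file <- tempfile(fileext = ".txt")
writeLines("  example-token-value  ", demo_file)
demo_token <- read_token(demo_file)
print(demo_token)
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the server.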

In [None]:
connection <- picsure::connect(url = PICSURE_network_URL,
                               token = token)

In [None]:
resource <- hpds::get.resource(connection,
                               resourceUUID = resource_id)

Two objects are created here: a `connection` and a `resource` object.

As we will only be using one single resource, **the `resource` object is actually the only one we will need to proceed with data analysis hereafter**.

It is connected to the specific data source ID we specified, and lets us query and retrieve data from this database.

## Building the Query with the PIC-SURE API

We are going to create a new query request from the PIC-SURE resource that was specified above. We will limit the query to a single study, a single gender and an age range (phenotypic filters), and two genetic filters, and then run the query. First, we create a new query instance.

In [None]:
my_query <- hpds::new.query(resource=resource)


#### Limiting the Query to a Single Study

By default, new query objects are automatically populated with all the consent groups that you have access to. For this example, we are going to clear these and specify a single consent group so that only the SAGE study is accessed.

In [None]:
# Here we show all the studies that you have access to
my_query$filter()$show()

In [None]:
# Here we delete those accesses and add only a single study
hpds::query.filter.delete(my_query, "\\_Consents\\Short Study Accession with Consent Code\\")
hpds::query.filter.add(my_query, "\\_Consents\\Short Study Accession with Consent Code\\", c("phs000921.c2"))

In [None]:
# Here we show that we have only selected a single study
my_query$filter()$show()
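
The doubled backslashes in the concept paths above are R string escapes: each `\\` typed in the source represents a single backslash in the actual string. The short base R sketch below illustrates this; no PIC-SURE call is involved.

```r
# Concept path as typed in R source (each "\\" is one literal backslash)
consent_path <- "\\_Consents\\Short Study Accession with Consent Code\\"

# cat() prints the raw string, print() shows the escaped form
cat(consent_path, "\n")
print(consent_path)

# Count the literal backslashes contained in the string
matches <- gregexpr("\\", consent_path, fixed = TRUE)
n_backslashes <- lengths(regmatches(consent_path, matches))
print(n_backslashes)
```

Keeping this in mind helps when copying a path from the dictionary output into a filter: the printed (escaped) form can be pasted into R source as is, while the `cat()` form needs its backslashes doubled.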

#### List Available Phenotype Variables

Once a connection to the desired resource has been established, it is helpful to search for variables of interest to our query. Normally we would use the `dictionary` method of the `resource` object to create a data dictionary instance and search for variables. Here, to speed up the following steps, we call the underlying search API directly and convert the results into a data frame ourselves.

In [None]:
# use raw API calls and custom processing to speed up the following steps
api <- resource$connection_reference$INTERNAL_api_obj()
query <- list()
query$query <- ""
results <- api$search(resource_id, jsonlite::toJSON(query, auto_unbox=TRUE))
results <- jsonlite::fromJSON(results)


In [None]:
# custom processing of JSON objects into a data frame
output_df <- data.frame()
for (idx1 in seq_along(results$results)) {
    result_type <- names(results$results[idx1])
    temp_list <- unname(results$results[[idx1]])
    if (result_type == "phenotypes") {
        temp_categoricals <- list()
        # use a separate index so the outer loop variable is not overwritten
        for (idx2 in seq_along(temp_list)) {
            if (temp_list[[idx2]][["categorical"]] == TRUE) {
                temp_list[[idx2]][["min"]] <- NA
                temp_list[[idx2]][["max"]] <- NA
                temp_categoricals[[idx2]] <- temp_list[[idx2]][["categoryValues"]]
            } else {
                temp_categoricals[[idx2]] <- NA
            }
            temp_list[[idx2]][["categoryValues"]] <- NULL
        }
        temp_df <- data.frame(do.call(rbind.data.frame, temp_list))
        temp_df$HpdsDataType <- result_type
        temp_df$categoryValues <- temp_categoricals
        temp_df$values <- NA
        temp_df$description <- NA
        temp_df$continuous <- NA
    } else {
        temp_values <- list()
        # use a separate index so the outer loop variable is not overwritten
        for (idx2 in seq_along(temp_list)) {
            temp_values[[idx2]] <- temp_list[[idx2]][["values"]]
            temp_list[[idx2]][["values"]] <- NULL
        }
        temp_df <- data.frame(do.call(rbind.data.frame, temp_list))
        temp_df$name <- NA
        temp_df$min <- NA
        temp_df$categorical <- NA
        temp_df$patientCount <- NA
        temp_df$observationCount <- NA
        temp_df$max <- NA
        temp_df$HpdsDataType <- result_type
        temp_df$categoryValues <- NA
        temp_df$values <- temp_values        
    }
    output_df <- rbind(output_df, temp_df)
}


In [None]:
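# Extract the phenotype variables for SAGE
# fixed() matches "(SAGE)" literally; as a regex, the parentheses would be a group
sage_vars <- output_df %>% filter(str_detect(name, fixed("(SAGE)")))

# Display the phenotype variables
head(sage_vars)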
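
When filtering on a study tag such as `(SAGE)`, note that parentheses are grouping characters in a regular expression, so the regex pattern `"(SAGE)"` matches any name containing `SAGE`, with or without parentheses. The base R sketch below contrasts regex and literal matching on a toy data frame; the variable names in `toy_df` are invented for illustration.

```r
# Toy dictionary with invented variable names
toy_df <- data.frame(
    name = c("\\Study A (SAGE)\\Sex of participant\\",
             "\\Study B\\SAGE_score\\",
             "\\Study C\\Subject age\\"),
    stringsAsFactors = FALSE
)

# Regex match: the parentheses form a group, so this matches "SAGE" anywhere
regex_hits <- grepl("(SAGE)", toy_df$name)

# Literal match: only names containing the exact text "(SAGE)"
literal_hits <- grepl("(SAGE)", toy_df$name, fixed = TRUE)

print(data.frame(name = toy_df$name, regex_hits, literal_hits))
```

With `stringr`, the equivalent of `fixed = TRUE` is wrapping the pattern in `fixed()`.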

#### Add Phenotype Variable (GENDER) to the Query

In [None]:
found_terms <- hpds::find.in.dictionary(resource = resource, 
                                        term = "Sex of participant")

In [None]:
# View information about the "Sex of participant" variable for the "(SAGE)" study
for (val in hpds::extract.entries(found_terms)) {
    if (stringr::str_detect(val$name, stringr::fixed("(SAGE)"))) {
        print(val)
    }
}

The dictionary entry above shows that we can select "FEMALE", "MALE", or "NA" for gender. For this example, let's limit our search to females.

In [None]:
hpds::query.filter.add(query = my_query, 
                       keys = "\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study\\Sex of participant\\",
                       values = 'FEMALE')

In [None]:
my_query$filter()$show()
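
Concept paths are long, and typing them by hand is error-prone. One option is to assemble them from their components. The helper `make_concept_path` below is hypothetical and not part of the PIC-SURE libraries; it only assumes the backslash-delimited layout seen in the path used above.

```r
# Hypothetical helper: join path components with backslashes,
# adding the leading and trailing backslash seen in PIC-SURE concept paths.
make_concept_path <- function(...) {
    parts <- c(...)
    paste0("\\", paste(parts, collapse = "\\"), "\\")
}

sage_study <- "NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study"
sex_path <- make_concept_path(sage_study, "Sex of participant")
cat(sex_path, "\n")
```

The resulting string could then be passed as the `keys` argument of `hpds::query.filter.add`, and the study name only has to be written once when several variables from the same study are added.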

#### Add Phenotype Variable (AGE) to the Query

Following the data dictionary search pattern just shown, we search for SAGE study variables related to the "Subject Age".

In [None]:
# View information about the "subject age" variable
found_terms <- hpds::find.in.dictionary(resource = resource,
                                        term = "Subject Age")
for (val in hpds::extract.entries(found_terms)) {
    if (stringr::str_detect(val$name, stringr::fixed("(SAGE)"))) {
        print(val)
    }
}

In [None]:
hpds::query.filter.add(query = my_query,
                       keys = "\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study\\Subject age\\",
                       min = 8,
                       max = 35)
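
To make the effect of a `min`/`max` filter concrete, the sketch below applies the same range to a toy data frame in base R. The ages are invented, and the sketch assumes that both bounds are inclusive; this is an assumption about the server-side behavior, not something confirmed by this notebook.

```r
# Toy cohort with invented ages
toy_cohort <- data.frame(patient = 1:5, age = c(5, 8, 20, 35, 40))

# Assumption: both bounds are inclusive
age_min <- 8
age_max <- 35
in_range <- toy_cohort$age >= age_min & toy_cohort$age <= age_max

selected <- toy_cohort[in_range, ]
print(selected)
```

If the boundary behavior matters for an analysis, it can be checked on the real data by inspecting the minimum and maximum of the age column in the retrieved data frame.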

#### Add Genotype Variable (Variant_frequency_in_gnomAD) to the Query

In [None]:
# View information about "Variant_frequency_in_gnomAD" variable
found_terms <- hpds::find.in.dictionary(resource = resource,
                                        term = "Variant_frequency_in_gnomAD")
print(hpds::extract.entries(found_terms)) 

In [None]:
hpds::query.filter.add(query = my_query,
                       keys = "Variant_frequency_in_gnomAD",
                       min = 0,
                       max = 0.1)

#### List Available Genotype Variables

In [None]:
# Extract the genotype ("info") variables
geno_vars <- filter(output_df, HpdsDataType == "info")

# Display the genotype variables
head(geno_vars)

#### Add Genotype Variable (Gene_with_variant) to the Query

In [None]:
# View information about "Gene_with_variant" variable
found_terms <- hpds::find.in.dictionary(resource = resource,
                                        term = "Gene_with_variant")
for (x in hpds::extract.entries(found_terms)) {
    str(x)
}

In [None]:
# Look for entries with variants in the CHD8 gene 
hpds::query.filter.add(query = my_query,
                       keys = "Gene_with_variant",
                       values = "CHD8")

Now that all query criteria have been entered into the query instance we can view it by using the following line of code:

In [None]:
# Now we show the query as it is specified
hpds::query.show(query = my_query)


Next, we will take this query and retrieve the data for patients matching the criteria.

## Retrieving Data from the Query

Now that we have built a query called `my_query` containing the search criteria we are interested in, we will run a count query to find the number of matching patients, followed by a data query to download the data.

#### Getting Query Count

In [None]:
my_query_count <- hpds::query.run(query = my_query,
                                  result.type = "count")
print(my_query_count)
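
Running the count first is a cheap way to check that the filters are not too restrictive before downloading any data. A sketch of such a guard follows. `check_count` is a hypothetical helper, the value `42` is a placeholder rather than a real query result, and the conversion with `as.numeric` is there because the exact type returned by the count query is not assumed.

```r
# Hypothetical helper: decide whether a count is large enough to proceed.
check_count <- function(n_patients, min_expected = 1) {
    n <- suppressWarnings(as.numeric(n_patients))
    if (length(n) != 1 || is.na(n)) {
        stop("Count is not a single numeric value")
    }
    if (n < min_expected) {
        warning("Only ", n, " matching patients; consider relaxing the filters")
        return(FALSE)
    }
    TRUE
}

# Placeholder value standing in for my_query_count
ok_to_download <- check_count(42)
print(ok_to_download)
```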

#### Getting Query Data

Once our query object is built, we set `result.type = "dataframe"` to retrieve the data corresponding to our query.

In [None]:
my_query_df <- hpds::query.run(query = my_query,
                               result.type = "dataframe")

In [None]:
dim(my_query_df)

In [None]:
head(my_query_df, n=5)
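
For downstream analysis, long column names can be unwieldy. Assuming the columns of the retrieved data frame are named after full concept paths (an assumption about the output layout), the sketch below keeps only the last path segment. `short_name` is a hypothetical helper, and `toy_result` is an invented stand-in for `my_query_df`, including its `Patient ID` column.

```r
# Hypothetical helper: keep the last non-empty segment of a backslash-delimited path.
short_name <- function(path) {
    vapply(strsplit(path, "\\", fixed = TRUE), function(parts) {
        parts <- parts[nzchar(parts)]
        if (length(parts) == 0) "" else parts[length(parts)]
    }, character(1))
}

# Invented stand-in for the query result
toy_result <- data.frame(1:2, c("FEMALE", "FEMALE"), stringsAsFactors = FALSE)
names(toy_result) <- c("Patient ID", "\\Some Study (SAGE)\\Sex of participant\\")

names(toy_result) <- short_name(names(toy_result))
print(names(toy_result))
```

Shortened names can collide when two studies share a variable label, so it is worth checking for duplicates with `anyDuplicated(names(...))` before relying on them.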