# PIC-SURE API Use-Case: Querying on Genomic Variables

This tutorial notebook is meant to get you up and running quickly with the R PIC-SURE API. It covers the main functionalities of the API.

## PIC-SURE R API 
### What is PIC-SURE? 

As part of the BioData Catalyst initiative, the Patient Information Commons Standard Unification of Research Elements (PIC-SURE) platform has been integrating clinical and genomic datasets from multiple TOPMed and TOPMed-related studies funded by the National Heart, Lung, and Blood Institute (NHLBI).

The original data exposed through the PIC-SURE API are organized in many heterogeneous ways. PIC-SURE hides this complexity and exposes the datasets of the different studies in a single tabular format. By easing data extraction, it lets investigators focus on downstream analyses and facilitates reproducible science.

Both phenotypic and genetic variables are accessible through the PIC-SURE API.

### More about PIC-SURE
The API is available in two programming languages, Python and R, so investigators can query the databases the same way in either language.

PIC-SURE is a larger project, of which the R/Python PIC-SURE API is only one building block. Among other things, PIC-SURE also offers a graphical user interface that allows researchers to explore variables across multiple studies, filter patients who match criteria, and create cohorts from this interactive exploration.

The R API is actively developed by the Avillach-Lab at Harvard Medical School.

PIC-SURE API R Library GitHub repos:
* https://github.com/hms-dbmi/pic-sure-r-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-r-client



 -------

# Getting your own user-specific security token

**Before running this notebook, please be sure to review the `get_your_token.ipynb` notebook. It explains how to get a security token, which is mandatory to access the databases.**

# Environment set-up

### Prerequisite
- R 3.4 or later

### Packages installation

In [1]:
source("R_lib/requirements.R")

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
also installing the dependencies ‘ini’, ‘desc’, ‘fs’, ‘gh’, ‘git2r’, ‘rematch2’, ‘rlang’

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
also installing the dependencies ‘sys’, ‘askpass’

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


installing: 
-  ggplot2 
-  dplyr 
-  tidyr 
-  urltools 
-  devtools 
-  ggrepel 


also installing the dependency ‘triebeard’

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
also installing the dependencies ‘xfun’, ‘knitr’, ‘evaluate’, ‘tinytex’, ‘rmarkdown’, ‘prettydoc’

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
“As of rlang 0.4.0, dplyr must be at least version 0.8.0.
* dplyr 0.7.8 is too old for rlang 0.4.8.
* Please update dplyr with `install.packages("dplyr")` and restart R.”
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



#### Installing the latest PIC-SURE API library from GitHub

The two components of the PIC-SURE API, the PIC-SURE adapter and the PIC-SURE client, are installed from GitHub.

In [2]:
Sys.setenv(TAR = "/bin/tar")
options(unzip = "internal")
devtools::install_github("hms-dbmi/pic-sure-r-client", force = TRUE)
devtools::install_github("hms-dbmi/pic-sure-r-adapter-hpds", force = TRUE)

Downloading GitHub repo hms-dbmi/pic-sure-r-client@master
from URL https://api.github.com/repos/hms-dbmi/pic-sure-r-client/zipball/master
Installing picsure
Installing httr
Installing curl
'/opt/conda/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore  \
  --quiet CMD INSTALL '/tmp/RtmpkHu0Yb/devtools3a2e3e5675/curl'  \
  --library='/opt/conda/lib/R/library' --install-tests 

Installing jsonlite
'/opt/conda/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore  \
  --quiet CMD INSTALL '/tmp/RtmpkHu0Yb/devtools3a30fda959/jsonlite'  \
  --library='/opt/conda/lib/R/library' --install-tests 

Installing mime
'/opt/conda/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore  \
  --quiet CMD INSTALL '/tmp/RtmpkHu0Yb/devtools3a60913b94/mime'  \
  --library='/opt/conda/lib/R/library' --install-tests 

Installing R6
'/opt/conda/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore  \
  --quiet CMD INSTALL '/tmp/RtmpkHu0Yb/devtools3a4f91875e/R6'  \
  --

## Connecting to a PIC-SURE resource

Several pieces of information are required to access data through the PIC-SURE API: a network URL, a resource ID, and a user-specific security token.

In [29]:
PICSURE_network_URL <- "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id <- "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file <- "token.txt"

In [30]:
token <- scan(token_file, what = "character")
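
A token file sometimes carries a trailing newline or stray whitespace. The sketch below uses only base R and a temporary file standing in for `token.txt` to show one way of reading and sanity-checking a token before connecting. The helper name `read_token` is hypothetical and is not part of the `picsure` or `hpds` libraries.

```r
# Hypothetical helper (not part of picsure/hpds): read a token from a file,
# strip surrounding whitespace, and fail early if the file is missing or empty.
read_token <- function(path) {
  if (!file.exists(path)) {
    stop("Token file not found: ", path)
  }
  lines <- readLines(path, warn = FALSE)
  token <- trimws(paste(lines, collapse = ""))
  if (!nzchar(token)) {
    stop("Token file is empty: ", path)
  }
  token
}

# Demonstration with a temporary file standing in for token.txt
tmp <- tempfile(fileext = ".txt")
writeLines("  not-a-real-token  ", tmp)
demo_token <- read_token(tmp)
print(demo_token)
unlink(tmp)
```

Failing early with a clear message can be easier to diagnose than an authentication error returned later by the server.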

In [31]:
connection <- picsure::connect(url = PICSURE_network_URL,
                               token = token)

[1] "02e23f52-f354-4e8b-992c-d37c8b9ba140"


In [32]:
resource <- hpds::get.resource(connection,
                               resourceUUID = resource_id)

Two objects are created here: a `connection` and a `resource` object.

As we will only be using one single resource, **the `resource` object is actually the only one we will need to proceed with data analysis hereafter**.

It is connected to the data source whose ID we specified, and it lets us query and retrieve data from that database.

## Building the Query with the PIC-SURE API

We are going to create a new query request from the PIC-SURE resource specified above. We will limit the query to a single study, a single gender and age range (phenotype), and two genetic filters, and then run the query. First, we create a new query instance.

In [33]:
my_query <- hpds::new.query(resource = resource)


#### Limiting the Query to a Single Study

By default, new query objects are automatically populated with all the consent groups that you have access to. For this example, we will clear these and specify a single consent code, so that only the SAGE study is accessed.

In [34]:
# Here we show all the studies that you have access to
my_query$filter()$show()

{
    "numericFilters": [

    ],
    "categoryFilters": {
        "\\_Consents\\Short Study Accession with Consent Code\\": [
            "phs001217.c1",
            "phs001217.c0",
            "phs001345.c1",
            "phs000956.c2",
            "phs000946.c1",
            "phs001345.c0",
            "phs001368.c3",
            "phs001368.c2",
            "phs000956.c0",
            "phs001402.c1",
            "phs001368.c1",
            "phs001189.c1",
            "phs001368.c0",
            "phs001387.c0",
            "phs001143.c1",
            "phs001402.c0",
            "phs001143.c0",
            "phs000988.c1",
            "phs000810.c2",
            "phs001207.c1",
            "phs001207.c0",
            "phs000810.c1",
            "phs000988.c0",
            "phs001040.c1",
            "phs000951.c2",
            "phs000951.c1",
            "phs000997.c1",
            "phs001237.c0",
            "phs001237.c1",
            "phs001237.c2",
            "phs000951.c0",
     

In [35]:
# Here we delete those accesses and add only a single study
hpds::query.filter.delete(my_query, "\\_Consents\\Short Study Accession with Consent Code\\")
hpds::query.filter.add(my_query, "\\_Consents\\Short Study Accession with Consent Code\\", c("phs000921.c2"))
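
When the list of consent codes is long, the codes belonging to one study can be picked out programmatically rather than typed by hand. The following is a minimal base R sketch. The `consents` vector is a hand-copied subset of codes that appear in this notebook, not something retrieved from the API.

```r
# Hand-copied subset of consent codes, standing in for the full list
# shown by my_query$filter()$show()
consents <- c("phs001217.c1", "phs000956.c2", "phs000921.c2", "phs000946.c1")

# Keep only the codes that belong to one study accession (here phs000921)
study_accession <- "phs000921"
study_consents <- consents[startsWith(consents, paste0(study_accession, "."))]
print(study_consents)
```

The resulting vector could then be passed as the values of the consent filter, in place of the hard-coded `c("phs000921.c2")`.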

In [36]:
# Here we show that we have only selected a single study
my_query$filter()$show()

{
    "numericFilters": [

    ],
    "categoryFilters": {
        "\\_Consents\\Short Study Accession with Consent Code\\": [
            "phs000921.c2"
        ]
    },
    "variantInfoFilters": {
        "categoryVariantInfoFilters": [

        ],
        "numericVariantInfoFilters": [

        ]
    }
}
 

#### Add Phenotype Variable (GENDER) to the Query

Once a connection to the desired resource has been established, it is helpful to search for variables of interest for our query. To this end, we will use the `hpds::find.in.dictionary` function to search the data dictionary.

In [39]:
found_terms <- hpds::find.in.dictionary(resource = resource, term = "Sex of participant")

In [47]:
# View information about the "Sex of participant" variable for the "(SAGE)" study
for (val in hpds::extract.entries(found_terms)) {
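    # fixed() matches the parentheses literally instead of treating them as a regex group
    if (stringr::str_detect(val$name, stringr::fixed("(SAGE)"))) {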
        print(val)
    }
}
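
Parentheses are regular-expression metacharacters, so a pattern such as `"(SAGE)"` is read as a capture group around the bare text `SAGE`. The base R sketch below contrasts regex and literal matching. The second variable name is invented purely for illustration.

```r
sage_name <- "\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study\\Sex of participant\\"
other_name <- "\\Hypothetical study about SAGE advice\\Sex of participant\\"  # invented name

# As a regex, "(SAGE)" matches any name containing the text SAGE
regex_hits <- grepl("(SAGE)", c(sage_name, other_name))

# With fixed = TRUE the parentheses must be present literally
fixed_hits <- grepl("(SAGE)", c(sage_name, other_name), fixed = TRUE)

print(regex_hits)
print(fixed_hits)
```

Only the literal match distinguishes the study abbreviation written in parentheses from other occurrences of the same letters.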

$name
[1] "\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study\\Sex of participant\\"

$categoryValues
[1] "FEMALE" "MALE"   "NA"    

$categorical
[1] TRUE

$patientCount
[1] 2106

$observationCount
[1] 2106

$HpdsDataType
[1] "phenotypes"



The dictionary entry above shows that we can select "FEMALE", "MALE", or "NA" for gender. For this example, let's limit our search to females.

In [49]:
hpds::query.filter.add(query = my_query, keys="\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study\\Sex of participant\\", values='FEMALE')

[1] "ERROR: cannot add, key already exists: \\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study\\Sex of participant\\"

The error above indicates that this filter had already been added to the query by an earlier run of the cell. The query contents printed below confirm that the filter is in place.


In [50]:
my_query$filter()$show()

{
    "numericFilters": [

    ],
    "categoryFilters": {
        "\\_Consents\\Short Study Accession with Consent Code\\": [
            "phs000921.c2"
        ],
        "\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study\\Sex of participant\\": [
            "FEMALE"
        ]
    },
    "variantInfoFilters": {
        "categoryVariantInfoFilters": [

        ],
        "numericVariantInfoFilters": [

        ]
    }
}
 

## Using the *variables dictionary*

Once a connection to the desired resource has been established, we first need to find out which variables are available in the database. To this end, we will use the `find.in.dictionary` function.

For instance, looking for variables containing the term `COPD` in their names is done this way:

In [None]:
dictionary_search <- hpds::find.in.dictionary(resource, "COPD")

Four different functions can be used to retrieve results from a dictionary search: `extract.count()`, `extract.keys()`, `extract.entries()`, and `extract.dataframe()`.

In [None]:
print(list("Count"   = hpds::extract.count(dictionary_search),
           "Keys"    = hpds::extract.keys(dictionary_search)[1:3],
           "Entries" = hpds::extract.entries(dictionary_search)[1:3]))

In [None]:
df_dictionary_copd <- hpds::extract.dataframe(dictionary_search)

**`hpds::extract.dataframe()` returns the result of the dictionary search as a data.frame. This makes it possible to:**

* Use the various criteria exposed in the dictionary (patientCount, variable type ...) as criteria for variable selection.
* Use the row names of the data.frame to get the actual variable names, to be used in the query, as shown below.
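
Both points can be sketched on a toy table. The data.frame below is a stand-in for the output of `hpds::extract.dataframe()`: its column names follow the dictionary fields shown in this notebook, but the rows and study names are invented, and the real output may contain additional columns.

```r
# Toy stand-in for a dictionary data.frame; rows are invented for illustration
toy_dict <- data.frame(
  categorical      = c(TRUE, FALSE, TRUE),
  patientCount     = c(2106, 150, 900),
  observationCount = c(2106, 150, 900),
  row.names = c("\\Study A\\Sex of participant\\",
                "\\Study A\\Age at enrollment\\",
                "\\Study B\\Smoking status\\"),
  stringsAsFactors = FALSE
)

# Criterion-based selection: categorical variables with at least 500 patients
keep <- toy_dict$categorical & toy_dict$patientCount >= 500

# The row names carry the full variable names to use in a query
selected_names <- rownames(toy_dict)[keep]
print(selected_names)
```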

Variable names, as currently implemented in the API, aren't handy to use right away, for two reasons:
1. They are very long.
2. They contain backslashes, which require modification right after copy-pasting.

However, using the dictionary to select variables helps to deal with this. Let's say we want to retrieve every variable from the COPDGene study. One way to proceed is to retrieve the whole dictionary for those variables in the form of a data.frame, as below:
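
The backslash issue comes from R string syntax: inside a regular string literal, each backslash has to be doubled. The short base R sketch below, using the SAGE variable name shown earlier in this notebook, illustrates the difference between how the name is typed and what the string actually contains.

```r
# In an R string literal each backslash must be written twice
var_name <- "\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study\\Sex of participant\\"

# cat() prints the actual content of the string, with single backslashes
cat(var_name, "\n")

# Split on the literal backslash and drop empty pieces
parts <- strsplit(var_name, "\\", fixed = TRUE)[[1]]
parts <- parts[nzchar(parts)]
print(parts)
```

R 4.0 and later also offer raw string literals, which avoid the doubling, but they are not available in the R 3.4 baseline listed in the prerequisites.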

In [None]:
plain_variablesDict <- hpds::find.in.dictionary(resource, "COPDGene") %>% hpds::extract.dataframe()

Moreover, using the `hpds::find.in.dictionary` function without arguments returns every entry, as shown in the help documentation. *For now, this takes a long time in the R PIC-SURE API implementation; it will be fixed in a later version of the API.*

In [None]:
plain_variablesDict[10:20,]

The dictionary currently returned by the API provides information about the variables, such as:
- observationCount: number of entries with a non-null value
- categorical: type of the variable, TRUE if strings, FALSE if numerical
- min/max: only provided for numerical variables
- HpdsDataType: 'phenotypes' or 'genotypes'. Currently, the API only exposes 'phenotypes' variables

### Parsing variable names

Although the plain dictionary is already helpful, we can use a simple function, `get_multiIndex_variablesDict`, defined in `R_lib/utils.R`, to add a little more information to the variable dictionary and to simplify working with variable names.

Although not an official feature of the API, such functionality illustrates how to quickly select groups of related variables.

Printing part of the "parsed names" dictionary makes the tree-like organisation of the variable names quickly visible. Moreover, the original and simplified variable names are now stored in the "varName" and "simplified_varName" columns, respectively. The simplified variable name is simply the last component of the variable name, which is usually the most informative part for knowing what each variable is about.

In [None]:
# Display the variables tree hierarchy from the variables name
variablesDict <- get_multiIndex_variablesDict(plain_variablesDict)
head(variablesDict)
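
To make the idea concrete, here is a minimal base R sketch of this kind of name parsing. It is not the implementation found in `R_lib/utils.R`: the function name `parse_var_names`, the `level_*` column names, and the toy variable names are all hypothetical.

```r
# Hypothetical sketch of name parsing; the real helper lives in R_lib/utils.R
parse_var_names <- function(var_names) {
  # Split each name on the literal backslash and drop empty pieces
  pieces <- strsplit(var_names, "\\", fixed = TRUE)
  pieces <- lapply(pieces, function(p) p[nzchar(p)])
  n_levels <- max(lengths(pieces))
  # Pad shorter paths with NA so every row has the same number of levels
  padded <- lapply(pieces, function(p) {
    c(p, rep(NA_character_, n_levels - length(p)))
  })
  levels_df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(levels_df) <- paste0("level_", seq_len(n_levels))
  # The last component is usually the most informative part of the name
  levels_df$simplified_varName <- vapply(pieces, function(p) p[length(p)],
                                         character(1))
  levels_df$varName <- var_names
  levels_df
}

# Invented variable names, for illustration only
toy_names <- c("\\Study A\\Medical History\\Asthma\\",
               "\\Study A\\Sex of participant\\")
parsed <- parse_var_names(toy_names)
print(parsed)
```

Each level of the hierarchy becomes its own column, which is what makes selecting a whole subcategory a simple comparison on one column.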

Below is a simple example that illustrates how easy a parsed dictionary is to use. Let's say we are interested in every variable pertaining to the "Medical History" and "Medication History" subcategories.

In [None]:
mask_medication <- variablesDict[,3] == "Medication History"
mask_medical <- variablesDict[,3] == "Medical History"
medication_history_variables <- variablesDict[mask_medical | mask_medication,]
medication_history_variables

Although simple, this approach can easily be combined with other filters to quickly select the desired group of variables.

## Selecting consent groups

It can be helpful to limit query count responses to those patients who have provided specific study consents. This filtering is set on the resource object and is automatically applied to queries run against the resource. To see the list of currently active consent filters, use `hpds::show.consent.filters(resource)`. To set the values used for the consent filters, use `hpds::set.consent.filters(resource, c('group1', 'group2', 'group3'))`. The new consent filters will be applied to any new queries that are created.

In [None]:
hpds::show.consent.filters(resource)

In [None]:
hpds::set.consent.filters(resource, c('phs001001.c1', 'phs001001.c2', 'phs000289.c0', 'phs000820.c1'))

## Querying and retrieving data

Besides the dictionary, the second cornerstone of the API is the set of `query` functions (`hpds::query.anyof`, `hpds::query.select`, `hpds::query.filter`, `hpds::query.require`). They are the entry point for retrieving data from the resource.

First, we need to create a query object.

In [None]:
my_query <- hpds::new.query(resource = resource)

The query object is then passed to the different query functions to build the query: `hpds::query.anyof`, `hpds::query.select`, `hpds::query.filter`, `hpds::query.require`. Each of these methods accepts a query object, a list of variable names, and any additional parameters as arguments.

- The `query.select.add()` method accepts variable names as a string or list of strings, and makes the query return all variables included in the list, without any record (i.e., subjects/rows) subsetting.

- The `query.require.add()` method accepts variable names as a string or list of strings, and makes the query return all the variables passed, and only the records that do not contain any null values for those variables.

- The `query.anyof.add()` method accepts variable names as a string or list of strings, and makes the query return all variables included in the list, and only the records that contain at least one non-null value for those variables.

- The `query.filter.add()` method accepts a variable name, plus additional values to filter on for that variable. The query will return this variable and only the records that match the filter criteria.

All four methods can be combined when building a query. The records eventually returned by the query have to meet all the specified filters.
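
The row-subsetting behaviour of the four methods can be mimicked on a local table. The base R sketch below is only an analogy run on invented data. It does not call the API, and treating the filter bounds as inclusive is an assumption.

```r
# Toy patient table, invented for illustration (NA marks a missing value)
toy <- data.frame(
  patient_id = 1:5,
  age_quit   = c(25, NA, 45, 80, NA),
  sex        = c("FEMALE", "MALE", NA, "FEMALE", NA),
  stringsAsFactors = FALSE
)
vars <- c("age_quit", "sex")

# "select": keep the columns, no row subsetting
selected <- toy[, c("patient_id", vars)]

# "require": only rows with no missing value in any listed variable
required <- toy[complete.cases(toy[, vars]), ]

# "anyof": rows with at least one non-missing value among the listed variables
any_of <- toy[rowSums(!is.na(toy[, vars])) > 0, ]

# "filter": rows whose value falls within a range (20 to 70, assumed inclusive)
filtered <- toy[!is.na(toy$age_quit) & toy$age_quit >= 20 & toy$age_quit <= 70, ]

print(c(select = nrow(selected), require = nrow(required),
        anyof = nrow(any_of), filter = nrow(filtered)))
```

Combining several methods in one query corresponds to applying all of these row conditions together, so a record must satisfy every one of them to be returned.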

#### Building the query

In [None]:
mask <- variablesDict["simplified_name"] == "How old were you when you completely stopped smoking? [Years old]"
yo_stop_smoking_varname <- variablesDict[mask, "name"] %>% unlist() %>% unname()

In [None]:
mask_cat <- plain_variablesDict["categorical"] == TRUE
mask_count <- plain_variablesDict["observationCount"] > 4000
selected_vars <- plain_variablesDict[mask_cat & mask_count, "name"] %>% as.list()

In [None]:
hpds::query.filter.add(query = my_query,
                      keys = yo_stop_smoking_varname,
                      min=20,
                      max=70)
hpds::query.select.add(query = my_query,
                      keys = selected_vars[1:50])

## Retrieving the data

Once our query object is built, we use the `query.run` function to retrieve the data corresponding to our query.

In [None]:
my_df <- hpds::query.run(my_query, result.type = "dataframe")

In [None]:
dim(my_df)

In [None]:
head(my_df)

From this point, we can proceed with the data management and analysis using any other R function or libraries.
