# Introduction to the PIC-SURE API

This tutorial notebook aims to get the user quickly up and running with the R PIC-SURE API.

## PIC-SURE R API
### What is PIC-SURE?

As part of the BioData Catalyst initiative, the Patient Information Commons Standard Unification of Research Elements (PIC-SURE) platform has been integrating clinical and genomic datasets from multiple TOPMed and TOPMed-related studies funded by the National Heart, Lung, and Blood Institute (NHLBI).

The data exposed through the PIC-SURE API originate from studies with widely varying data organization. PIC-SURE hides this complexity and exposes the different study datasets in a single tabular format. By simplifying data extraction, it lets investigators focus on downstream analyses and facilitates reproducible science.


### More about PIC-SURE
The API is available in two programming languages, Python and R, enabling investigators to query the databases the same way using either language.


The R/Python PIC-SURE API is only one component of the larger PIC-SURE project. Among other things, PIC-SURE also offers a graphical user interface that allows researchers to explore variables across multiple studies, filter patients that match criteria, and create cohorts from this interactive exploration.

The R API is actively developed by the Avillach Lab at Harvard Medical School.

PIC-SURE API GitHub repositories:
* https://github.com/hms-dbmi/pic-sure-r-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-r-client
* https://github.com/hms-dbmi/pic-sure-biodatacatalyst-r-adapter-hpds




 -------

## Getting your own user-specific security token

**Before running this notebook, please be sure to review the "Get your security token" documentation, which exists in the NHLBI_BioData_Catalyst [README.md file](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token). It explains how to get a security token, which is mandatory to access the databases.**

### Environment set-up

#### Pre-requisites
- R 3.4 or later
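
One way to confirm this prerequisite is to compare the running R version against 3.4.0. The following is a minimal sketch using base R only; the variable names `min_version` and `version_ok` are illustrative and not part of the PIC-SURE API.

```r
# Minimum R version stated in the pre-requisites
min_version <- "3.4.0"

# getRversion() returns the running version as an object that supports comparison
version_ok <- getRversion() >= min_version

if (version_ok) {
    message("R version ", as.character(getRversion()), " meets the requirement.")
} else {
    warning("R ", min_version, " or later is required; found ", as.character(getRversion()))
}
```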

#### Install packages

The first step to using the PIC-SURE API is to install the packages needed. The following code installs the PIC-SURE API components from GitHub, specifically:

- PIC-SURE Client
- PIC-SURE Adapter
- BioData Catalyst PIC-SURE Adapter

In [None]:
install.packages('devtools')

In [None]:
devtools::install_github("hms-dbmi/pic-sure-r-client", ref="master", force=TRUE, quiet=TRUE)
devtools::install_github("hms-dbmi/pic-sure-r-adapter-hpds", ref="master", force=TRUE, quiet=TRUE)
devtools::install_github("hms-dbmi/pic-sure-biodatacatalyst-r-adapter-hpds", ref="new-search", force=TRUE)
library(hpds)
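
After installation, it can be useful to confirm that the packages are actually available before continuing. The sketch below assumes the installed package names are `picsure`, `hpds`, and `bdc`, which are the namespaces used in this notebook; it relies only on base R's `requireNamespace()`.

```r
# Package names as used in this notebook (picsure::, hpds, bdc::)
required_pkgs <- c("picsure", "hpds", "bdc")

# requireNamespace() returns TRUE when a package can be loaded, without attaching it
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report any packages that could not be loaded
missing_pkgs <- names(is_installed)[!is_installed]
if (length(missing_pkgs) > 0) {
    message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
    message("All PIC-SURE packages are installed.")
}
```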

## Connecting to a PIC-SURE resource

The following is required to get access to data through the PIC-SURE API: 
- Network URL
- User-specific security token

The following code specifies the network URL as the BioData Catalyst Powered by PIC-SURE URL and references the user-specific token saved as `token.txt`.

If you have not already retrieved your user-specific token, please refer to the "Get your security token" section of the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token) file.

In [None]:
# Uncomment production URL when testing in production
# PICSURE_network_URL <- "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
PICSURE_network_URL <- "https://biodatacatalyst.integration.hms.harvard.edu/picsure"
token_file <- "token.txt"
token <- scan(token_file, what = "character")
connection <- picsure::connect(PICSURE_network_URL, token)
authPicSure <- bdc::use.authPicSure(connection)
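
A missing or empty token file is a common cause of connection failures. The sketch below shows one possible way to validate the file before connecting. The helper `read_token()` is hypothetical (it is not part of the PIC-SURE packages), and a temporary file stands in for `token.txt`.

```r
# Hypothetical helper: read a token from a file and validate it before connecting
read_token <- function(path) {
    if (!file.exists(path)) {
        stop("Token file not found: ", path)
    }
    # Join lines and strip surrounding whitespace, e.g. a trailing newline
    token <- trimws(paste(readLines(path, warn = FALSE), collapse = ""))
    if (!nzchar(token)) {
        stop("Token file is empty: ", path)
    }
    token
}

# Demonstration with a temporary file standing in for token.txt
demo_file <- tempfile(fileext = ".txt")
writeLines("  example-token-value  ", demo_file)
demo_token <- read_token(demo_file)
print(demo_token)
```

The returned string could then be passed to `picsure::connect()` in place of the value read with `scan()`.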

### Getting help with the PIC-SURE API

You can get help with PIC-SURE library functions by using the `?` operator.

In [None]:
?bdc::find.in.dictionary

For example, the output above briefly describes the `find.in.dictionary` function in the `bdc` adapter and how to use it.

## Using the PIC-SURE variable dictionary

Now that you have set up your connection to the PIC-SURE API, let's determine which study or studies you are authorized to access. The dictionary method can be used to search the data dictionary for a specific term or to retrieve information about all the variables you are authorized to access. 


In [None]:
dictionary <- bdc::use.dictionary(connection) # set up the variable dictionary
all_variables <- bdc::find.in.dictionary(dictionary) # retrieve all the variables you have access to


In [None]:
all_variable_df <- bdc::extract.dataframe(all_variables)
studies <- unique(all_variable_df$study_id)

In [None]:
if (length(studies) == 0) {
    print("You are not authorized to access any studies.")
} else {
    print("You are authorized to access the following studies:")
    print(studies)
}

### *Note: if you do not see any studies listed above, you are not authorized to access any data. The rest of the notebook will not work.*


Let's save a study to use as an example for the rest of the notebook.

In [None]:
phs_number <- studies[1]
paste0("The phs accession being used is: ", phs_number)

Now, let's find all of the variables associated with that study. We can search for these using the `find.in.dictionary()` function and searching for the phs accession number. We can then view a dataframe of the variables returned from this search using the `extract.dataframe()` function.

In [None]:
my_variables <- bdc::find.in.dictionary(dictionary, phs_number) # Search for the phs accession number in our established variable dictionary
my_variables_df <- bdc::extract.dataframe(my_variables)

In [None]:
paste("There are", nrow(my_variables_df), "variables that were returned for your search.")
paste("Here are some of the variables you have access to:")
print(head(my_variables_df$var_description, 10)) # head() avoids NA padding when fewer than 10 variables exist

PIC-SURE integrates clinical and genomic datasets across BioData Catalyst, including TOPMed and TOPMed-related studies, COVID-19 studies, and BioLINCC studies. Each variable is organized as a concept path that contains information about the study, variable group, and variable. Though the specifics of the concept paths are dependent on the type of study, the overall information included is the same. 

Data Organization in PIC-SURE
---------------------------------------
| Data organization | TOPMed & TOPMed-related studies | BioLINCC & COVID-19 studies |
|-------------------|---------------------------------|-----------------------------|
| General organization | Data organized using the format implemented by the database of Genotypes and Phenotypes (dbGaP). Generally, a given study will have several tables, and those tables have several variables. | Data do not follow dbGaP format; there are no phv or pht accessions. Data are organized in groups of like variables, when available. For example, variables like Age, Gender, and Race could be part of the Demographics variable group. |
| Concept path structure | \phs\pht\phv\variable name\ | \phs\variable name |
| Variable ID | phv corresponding to the variable accession number | Equivalent to variable name | 
| Variable name | Encoded variable name that was used by the original submitters of the data | Encoded variable name that was used by the original submitters of the data |
| Variable description | Description of the variable | Description of the variable, as available |
| Dataset ID | pht corresponding to the trait table accession number | Equivalent to Dataset name | 
| Dataset name | Name of the trait table | Name of a group of like variables, as available | 
| Dataset description | Description of the trait table | Description of a group of like variables, as available |
| Study ID | phs corresponding to the study accession number | phs corresponding to the study accession number |
| Study description | Description of the study from dbGaP | Description of the study from dbGaP |
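
To make the concept path structure concrete, the sketch below splits a dbGaP-style path into its components using base R. The path shown is a made-up placeholder (not a real accession), and the component labels are illustrative names chosen to match the rows of the table above.

```r
# Hypothetical dbGaP-style concept path; real paths come from the data dictionary
concept_path <- "\\phs000000\\pht000000\\phv00000000\\AGE\\"

# Split on the literal backslash separator and drop empty elements
parts <- strsplit(concept_path, "\\", fixed = TRUE)[[1]]
parts <- parts[nzchar(parts)]

# Label each component according to the \phs\pht\phv\variable name\ structure
names(parts) <- c("study_id", "dataset_id", "variable_id", "variable_name")
print(parts)
```

For BioLINCC and COVID-19 studies, the same split would yield only the study ID and the variable name.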



We can also view additional information for individual variables using the `get.varInfo()` function.
Note that you will need the `HPDS_PATH` of each variable for this function.

In [None]:
first_var <- bdc::get.paths(my_variables)[1]
bdc::get.varInfo(my_variables, first_var)

Now you can try to search for a term on your own. Below is sample code on how to search for the term `sex`. To practice searching the data dictionary, you can change "sex" to a term you are interested in. If there are results for your term, you will see them displayed in the convenient dataframe format by using the `extract.dataframe()` function.

In [None]:
my_search <- bdc::find.in.dictionary(dictionary, 'sex') # Change sex to be your term of interest
my_search_df <- bdc::extract.dataframe(my_search)
if(nrow(my_search_df) == 0){
    print("Search term returned no results. Please try searching a different term.")
} else {
    (tail(my_search_df))
}
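
Once search results are in a dataframe, they can also be narrowed down locally with base R. The sketch below uses a small mock dataframe rather than real PIC-SURE output; the `var_name` column and its contents are assumptions made for illustration, while `var_description` mirrors the column used earlier in this notebook.

```r
# Mock dictionary dataframe standing in for the output of extract.dataframe()
mock_dictionary_df <- data.frame(
    var_name = c("SEX", "AGE", "BMI"),
    var_description = c("Sex of participant", "Age at enrollment", "Body mass index"),
    stringsAsFactors = FALSE
)

# Case-insensitive match of the term against the description column
term <- "sex"
is_match <- grepl(term, mock_dictionary_df$var_description, ignore.case = TRUE)
matches <- mock_dictionary_df[is_match, ]
print(matches)
```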


## Using PIC-SURE to build a query and retrieve data
You can also use the PIC-SURE API to build a query and retrieve data. With this functionality, you can filter based on specific variables, add others, and export the data as a dataframe into this notebook. 

The first step to this is setting up the `authQuery`.

In [None]:
authQuery_categorical_example <- bdc::new.query(authPicSure)

There are several methods that can be used to build a query, which are listed below.

| Method | Arguments / Input | Output|
|--------|-------------------|-------|
| query.select.add() | variable names (string) or list of strings | all variables included in the list (no record subsetting)|
| query.require.add() | variable names (string) or list of strings | all variables; only records that do not contain null values for input variables |
| query.anyof.add() | variable names (string) or list of strings | all variables; only records that contain at least one non-null value for input variables |
| query.filter.add() | variable name and additional filtering values | input variable; only records that match filter criteria |
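
The difference between these methods is easiest to see on a small example. The sketch below imitates the record-subsetting behavior described in the table using base R on mock data. It is an analogy only and does not call the PIC-SURE API; all names in it are illustrative.

```r
# Mock participant data with missing values (NA)
mock_data <- data.frame(
    patient_id = 1:4,
    var_a = c(10, NA, 30, NA),
    var_b = c("x", "y", NA, NA),
    stringsAsFactors = FALSE
)
vars <- c("var_a", "var_b")

# "require"-like: keep records with no missing values in any listed variable
require_like <- mock_data[complete.cases(mock_data[, vars]), ]

# "anyof"-like: keep records with at least one non-missing listed variable
anyof_like <- mock_data[rowSums(!is.na(mock_data[, vars])) > 0, ]

# "filter"-like: keep records whose value meets a criterion (here, var_a >= 20)
filter_like <- mock_data[!is.na(mock_data$var_a) & mock_data$var_a >= 20, ]

print(require_like$patient_id)
print(anyof_like$patient_id)
print(filter_like$patient_id)
```

In this mock data, participant 4 has no values for either variable and is therefore excluded by all three subsetting approaches.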

### Build a query with a categorical variable
Let's practice building a query by filtering on variables. First, let's select a categorical variable to use. We can identify one using the `data_type` column of the variable dataframe.

In [None]:
categorical_vars <- my_variables_df[my_variables_df$data_type == 'categorical',] # filter to categorical variables
categorical_vars <- categorical_vars[categorical_vars$values != '[]',] # make sure there are values for the variable

v <- sample(c(1:nrow(categorical_vars)), 1) # select a random categorical variable

categorical_var <- categorical_vars[v, 'HPDS_PATH']
categorical_description <- categorical_vars[v, 'var_description']
categories <- categorical_vars[v, 'values']
# select the first category available to filter on
filter_category <- strsplit(gsub('\\[|\\]', '', categories), ',')[[1]][1] # strip the surrounding brackets, then split on commas
filter_category <- trimws(filter_category)

In [None]:
paste("We will use the PIC-SURE variable", categorical_var, "which is", categorical_description)
print("Here are the categories associated with the variable:")
print(categories)
paste0("We will filter to participants with the value '", filter_category, "' for the variable.")
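
The category parsing step above can be tried in isolation. The sketch below applies the same bracket-stripping and splitting approach to a made-up `values` string. It assumes that category labels do not themselves contain commas, which may not hold for every variable.

```r
# Hypothetical example of a bracketed `values` string
categories_example <- "[Male, Female, Unknown]"

# Remove the brackets, split on commas, and trim whitespace around each value
stripped <- gsub("\\[|\\]", "", categories_example)
category_vector <- trimws(strsplit(stripped, ",", fixed = TRUE)[[1]])

print(category_vector)

# The first element is the value that would be used for filtering
first_category <- category_vector[1]
```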

We now have the PIC-SURE variable and the value to apply to the filter saved. We can use the `query.filter.add()` function to add this information to our query.

In [None]:
bdc::query.filter.add(authQuery_categorical_example, categorical_var, filter_category)

Note that although we are only filtering by one value here, you can filter by multiple values by passing a list to the `query.filter.add()` function.

Now we can export our filtered data to a dataframe in this notebook.

In [None]:
results_categorical <- bdc::query.run(authQuery_categorical_example)
head(results_categorical)

In the data dictionary dataframe shown previously, each row represented a single concept path or variable. In the query dataframe, the concept paths are added as columns with each row representing a participant with data that matches your query. 

The dataframe above should contain some automatically exported concept paths, such as `Patient ID`, `Parent Study Accession with Subject ID`, `Topmed Study Accession with Subject ID`, and `consents`, and the concept path we added to our query (`categorical_var`). Additionally, all participants should have the value we used to filter for our added concept path.

We can see how this query filtering worked by comparing the resulting dataframe to the full unfiltered data for this variable. Let's build a query that retrieves the data from all participants that have data for the categorical variable of interest using `query.require.add()`.

In [None]:
authQuery_categorical_example2 <- bdc::new.query(authPicSure) # Initialize a new query
bdc::query.require.add(authQuery_categorical_example2, categorical_var) # Require non-null values for categorical_var
full_results_categorical <- bdc::query.run(authQuery_categorical_example2)
head(full_results_categorical)

In [None]:
# Visualize the results with pie charts
df2 <- data.frame(table(full_results_categorical[, categorical_var]))
pie(df2$Freq, labels = df2$Var1, main = paste("Before filtering variable\n", categorical_var))

df1 <- data.frame(table(results_categorical[, categorical_var]))
pie(df1$Freq, labels = df1$Var1, main = paste("After filtering variable\n", categorical_var))

### Build a query with a continuous variable
Similarly, we can create a query using a continuous variable. Instead of using the `data_type` column, let's make use of the `min` and `max` columns to determine the range of values that can be selected for our query.

In [None]:
# Filter to a random continuous variable with a range of data
continuous_vars <- my_variables_df[!is.na(my_variables_df$max) & my_variables_df$max != my_variables_df$min,]
v <- sample(c(1:nrow(continuous_vars)), 1) 

continuous_var <- continuous_vars[v,'HPDS_PATH']
paste('We will use the PIC-SURE variable', continuous_var, 'which is', continuous_vars[v,'var_description'])

# Get range of values for chosen variable
# (named var_min/var_max to avoid masking base R's min() and max())
var_min <- as.numeric(continuous_vars[v, 'min'])
var_max <- as.numeric(continuous_vars[v, 'max'])
paste('The minimum value for this variable is:', var_min)
paste('The maximum value for this variable is:', var_max)

# Set a sub-range of values to filter our participants with
filter_min_value <- var_max - (var_max - var_min)/2
filter_max_value <- var_max
paste0("We will filter to participants with values between ", filter_min_value, " and ", filter_max_value, ".")
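
The sub-range computed above runs from the midpoint of the variable's range to its maximum. The sketch below reproduces that arithmetic on mock values and applies the range locally with base R. It assumes inclusive bounds, which may differ from how PIC-SURE applies the range; all names are illustrative.

```r
# Mock values for a continuous variable, including a missing value
values <- c(12, 18, 25, 31, 40, NA)

# Midpoint-to-maximum sub-range
value_min <- min(values, na.rm = TRUE)
value_max <- max(values, na.rm = TRUE)
lower <- value_max - (value_max - value_min) / 2
upper <- value_max

# Keep non-missing values inside the inclusive range
in_range <- values[!is.na(values) & values >= lower & values <= upper]
print(lower)
print(in_range)
```

With these mock values the range is 12 to 40, so the lower bound is 26 and only 31 and 40 remain.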


Again, we can use the `query.filter.add()` function to add the continuous variable to the query. Once added, we can retrieve our results dataframe.

In [None]:
authQuery_continuous_example <- bdc::new.query(authPicSure) # Start a new query
bdc::query.filter.add(authQuery_continuous_example, continuous_var, filter_min_value, filter_max_value)
results_continuous <- bdc::query.run(authQuery_continuous_example)
head(results_continuous)

In [None]:
paste("We will plot a histogram of our results to ensure we are only looking at values for",
      continuous_vars[v,'var_description'],
      "which fall between", filter_min_value, "and", filter_max_value, ".")

In [None]:
hist(results_continuous[,eval(continuous_var)], 
     main = paste('After filtering variable\n', continuous_vars[v,'var_description']))

Compared to all data values for this variable:

In [None]:
authQuery_continuous_example2 <- bdc::new.query(authPicSure) # Start a new query
bdc::query.require.add(authQuery_continuous_example2, continuous_var)
full_results_continuous <- bdc::query.run(authQuery_continuous_example2)

In [None]:
hist(full_results_continuous[,eval(continuous_var)], 
     main = paste('Before filtering variable\n', continuous_vars[v,'var_description']))

### Build a query with multiple variables
You can also add multiple variables to a single query. Let's build a query with the first five variables for the study of interest.

In [None]:
query_vars <- my_variables_df[c(1:5), 'HPDS_PATH']
print("We will add the following variables to the query:")
query_vars

We can use the `query.anyof.add()` function to add variables to the query. This will filter to participants that have data **for at least one of the variables added**.

In [None]:
authQuery <- bdc::new.query(authPicSure) # Start a new query
invisible(lapply(query_vars, bdc::query.anyof.add, query = authQuery))
results <- bdc::query.run(authQuery)

In [None]:
head(results)

### Selecting consent groups

PIC-SURE will limit results based on which study and consent groups you have been individually authorized to access. In some cases, such as instances where you can access multiple studies and/or consent groups, you may need to limit your results further to only a subset of the groups you have been authorized to access.

Let's see which studies and consent groups you are authorized to access using the `query.show()` function.

In [None]:
authQuery_consents <- bdc::new.query(authPicSure) # Start a new query
bdc::query.show(authQuery_consents)

The `\\_consents\\` section of the output shown above lists all of the phs accession numbers and consent codes that you are authorized to access. 

To query on specific consent groups in this list, you must first clear the list of values within the `\\_consents\\` section and then manually replace them. Let's practice this by copying and pasting a phs accession number and consent code, deleting the `\\_consents\\` field, and adding it back with our selected consent code.

*Note that trying to manually add a consent group that you are not authorized to access will result in errors downstream.*

In [None]:
bdc::query.filter.delete(authQuery_consents, '\\_consents\\')
bdc::query.show(authQuery_consents)

In [None]:
consent_group_filter <- "phs000964.c1" # Replace with a consent group that you are authorized to access
bdc::query.filter.add(authQuery_consents, '\\_consents\\', consent_group_filter)
bdc::query.show(authQuery_consents)

Now your query is set to select only variables and participants from the phs accession and consent code you selected. From here, you can build out your query as shown above.
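
Because a mistyped consent group causes errors downstream, it can help to check the string before adding it to a query. The sketch below splits the example value used above into its phs accession and consent code with base R. The format check assumes the pattern `phs<digits>.c<digits>`, inferred from that single example, so other valid formats may exist.

```r
# Consent group in the "<phs accession>.<consent code>" form
consent_group <- "phs000964.c1"

# Split on the literal dot; fixed = TRUE avoids treating "." as a regex wildcard
pieces <- strsplit(consent_group, ".", fixed = TRUE)[[1]]
phs_accession <- pieces[1]
consent_code <- pieces[2]

# Basic format check (assumed pattern, see note above)
is_valid_format <- grepl("^phs[0-9]+\\.c[0-9]+$", consent_group)

print(phs_accession)
print(consent_code)
print(is_valid_format)
```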

### Retrieving data from a query built in the PIC-SURE user interface (UI)

You can retrieve the results of a query that you have previously built using the [PIC-SURE Authorized Access UI](https://picsure.biodatacatalyst.nhlbi.nih.gov/psamaui/). After you have built your query and filtered to your cohort of interest, open the **Select and Package Data** tool in the Tool Suite. This will allow you to copy your query ID and bring it into a Jupyter notebook. **Note that query IDs are not permanent and may expire.**


<img src="https://drive.google.com/uc?id=1XD3L0obdgQZ3GgO2Xu-sxhMxzzXgqofL">

In [None]:
# To run this cell, replace the placeholder below with the ID of a query that you have run.
query_id <- '<<<Paste your Query ID here>>>'
results <- bdc::query.getResults(authPicSure, query_id)
# read.delim() already uses the first line as the header; check.names = FALSE keeps the column names unmodified
results <- read.delim(textConnection(results), sep = ",", check.names = FALSE)
head(results)
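
Column names in query results are concept paths, which contain backslashes and spaces. The sketch below uses mock CSV text (not real PIC-SURE output) to show how base R's `read.csv()` with `check.names = FALSE` keeps such names intact; the column names and values are made up for illustration.

```r
# Mock CSV text standing in for query results; the header holds a concept path
csv_text <- "Patient ID,\\phs000000\\AGE\\\n1,54\n2,61"

# check.names = FALSE keeps the header values as-is instead of sanitizing them
parsed <- read.csv(text = csv_text, check.names = FALSE, stringsAsFactors = FALSE)

print(colnames(parsed))
print(nrow(parsed))
```

Without `check.names = FALSE`, R would rewrite these headers into syntactically valid names, making them harder to match against the data dictionary.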