# PIC-SURE API Use-Case: Querying on Genomic Variables

This tutorial notebook is designed to get you up and running quickly with the Python PIC-SURE API. It covers the main functionalities of the API.

## PIC-SURE Python API
### What is PIC-SURE? 

<!--img src="./img/PIC-SURE_logo.png" width= "360px"> -->

As part of the BioData Catalyst initiative, the Patient Information Commons Standard Unification of Research Elements (PIC-SURE) platform has been integrating clinical and genomic datasets from multiple TOPMed and TOPMed-related studies funded by the National Heart, Lung, and Blood Institute (NHLBI).

The original data exposed through the PIC-SURE API are organized in many heterogeneous ways. PIC-SURE hides this complexity and exposes the datasets of the different studies in a single tabular format. By easing the process of data extraction, it allows investigators to focus on downstream analyses and facilitates reproducible science.

Both phenotypic and genetic variables are accessible through the PIC-SURE API.

### More about PIC-SURE
The API is available in two programming languages, Python and R, enabling investigators to query the databases in the same way from either language.

The R/Python PIC-SURE API is only one component of the larger PIC-SURE project. Among other things, PIC-SURE also offers a graphical user interface that allows researchers to explore variables across multiple studies, filter patients that match criteria, and create cohorts from this interactive exploration.

The Python API is actively developed by the Avillach Lab at Harvard Medical School.

PIC-SURE API GitHub repositories:
* https://github.com/hms-dbmi/pic-sure-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-client



 -------   

# Getting your own user-specific security token

**Before running this notebook, please review the `get_your_token.ipynb` notebook. It explains how to get the security token that is required to access the databases.**

# Environment set-up

### Prerequisites
- Python 3.6 or later
- pip, the Python package manager, which is already available on most systems with a Python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))

### Packages installation

The cells below install the packages listed in the `requirements.txt` file, as well as the two components of the PIC-SURE API from GitHub: the PIC-SURE adapter and the PIC-SURE client.

In [None]:
!cat requirements.txt

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git

Import all the external dependencies

In [None]:
import json
import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureHpdsLib
import PicSureClient

## Connecting to a PIC-SURE resource

Several pieces of information are required to get access to data through the PIC-SURE API: a network URL, a resource id, and a user-specific security token.

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [None]:
with open(token_file, "r") as f:
    my_token = f.read().strip()  # drop any trailing newline or whitespace
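
The cell above assumes that `token.txt` exists and contains a usable token. A minimal sketch of a more defensive loader follows; `read_token` is a hypothetical helper name introduced here for illustration and is not part of the PIC-SURE client.

```python
from pathlib import Path


def read_token(path):
    """Return the token stored in `path`, with surrounding whitespace removed.

    Hypothetical helper for illustration; not part of the PIC-SURE API.
    """
    token_path = Path(path)
    if not token_path.is_file():
        raise FileNotFoundError(f"Token file not found: {token_path}")
    token = token_path.read_text().strip()
    if not token:
        raise ValueError(f"Token file is empty: {token_path}")
    return token
```

With this helper, the token could be loaded with `my_token = read_token(token_file)`, and a missing or empty file would fail early with an explicit message rather than at connection time.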

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token)
adapter = PicSureHpdsLib.Adapter(connection)
resource = adapter.useResource(resource_id)

Two objects are created here: a `connection` and a `resource` object.

As we will only be using one single resource, **the `resource` object is actually the only one we will need to proceed with data querying**. 

It is connected to the specific data source ID we specified, and enables us to query and retrieve data from this database.

## Building the Query with the PIC-SURE API

We are going to create a new query request from the PIC-SURE resource that was specified above.  We will limit the query to a single study, a single gender and age range (phenotype), two genetic filters, and then run the query.  First we will create a new query instance.

In [None]:
my_query = resource.query()


#### Limiting the Query to a Single Study

By default, new query objects are automatically populated with all the consent groups that you have access to. For this example, we are going to clear these and specify a single consent group that gives access to the SAGE study only.

In [None]:
# Here we show all the studies that you have access to
my_query.filter().show()

In [None]:
# Here we delete those accesses and add only a single study
my_query.filter().delete("\\_Consents\\Short Study Accession with Consent Code\\")
my_query.filter().add("\\_Consents\\Short Study Accession with Consent Code\\", ['phs000921.c2'])

In [None]:
# Here we show that we have only selected a single study
my_query.filter().show()
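
The consent value used above, `phs000921.c2`, follows the "short study accession with consent code" layout named in the filter path: an accession, a dot, then a consent code. The sketch below splits such a value into its two parts, which can help when checking which studies a filter covers. `split_consent` is a hypothetical helper, and the `<accession>.<consent code>` layout is an assumption drawn from the value shown in this notebook.

```python
def split_consent(value):
    """Split a value such as 'phs000921.c2' into (accession, consent code).

    Hypothetical helper; assumes the '<accession>.<consent code>' layout.
    """
    accession, sep, consent = value.partition(".")
    if not sep or not accession or not consent:
        raise ValueError(f"Unexpected consent value: {value!r}")
    return accession, consent


accession, consent = split_consent("phs000921.c2")
print(accession, consent)
```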

#### Add Phenotype Variable (GENDER) to the Query

Once a connection to the desired resource has been established, it is helpful to search for variables of interest to include in our query. To this end, we will use the `dictionary` method of the `resource` object to create a data dictionary instance.

The `find()` method of a `dictionary` instance retrieves the records that match a specific search term, or information about all the available variables. For instance, looking for variables containing the term `SEX` in their names is done this way:

In [None]:
dictionary = resource.dictionary()
dictionary_search = dictionary.find("SEX")

We now loop through all the found "sex" variables and look for any entries that are part of our study of interest (we can search for the string "`(SAGE)`" in its name).  The output of this will allow us to know what values are valid for query. 

In [None]:
# View information about the "Sex of participant" variable
for x in dictionary_search.entries():
    # Use "in" rather than find() > 0, which would miss a match at index 0
    if "(SAGE)" in x["name"]:
        pprint.pprint(x)
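
The same filtering step can be wrapped in a small reusable function. This sketch runs on hand-written mock entries (not real dictionary output) and assumes only that each entry is a dict with a `name` key, as in the loop above. `entries_for_study` is a hypothetical helper name.

```python
def entries_for_study(entries, study_tag):
    """Return the entries whose 'name' contains `study_tag`."""
    return [e for e in entries if study_tag in e.get("name", "")]


# Mock entries for illustration only; real entries would come from
# dictionary_search.entries().
mock_entries = [
    {"name": "\\Study A (SAGE)\\Sex of participant\\"},
    {"name": "\\Study B (OTHER)\\Sex\\"},
]
matches = entries_for_study(mock_entries, "(SAGE)")
print(len(matches))
```

In the notebook, the call would look like `entries_for_study(dictionary_search.entries(), "(SAGE)")`.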

The dictionary entry above shows that we can select "FEMALE", "MALE", or "NA" for gender. For this example, let's limit our search to females.

In [None]:
my_query.filter().add("\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study\\Sex of participant\\",['FEMALE'])

#### Add Phenotype Variable (AGE) to the Query

Following the data dictionary search pattern just shown, we search for SAGE study variables related to the "Subject Age".

In [None]:
# View information about the "subject age" variable
dictionary = resource.dictionary()
dictionary_search = dictionary.find("SUBJECT AGE")
for x in dictionary_search.entries():
    # Use "in" rather than find() > 0, which would miss a match at index 0
    if "(SAGE)" in x["name"]:
        pprint.pprint(x)

In [None]:
my_query.filter().add("\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study\\Subject age\\", min=8, max=35)
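
The phenotype concept paths used in these filters are backslash-delimited and begin and end with a backslash, so writing them by hand means doubling every backslash in a Python string literal. A minimal sketch of a helper that assembles such a path from its components is shown below. `concept_path` is a hypothetical name introduced for illustration, not a PIC-SURE function.

```python
def concept_path(*parts):
    """Join path components into a backslash-delimited concept path."""
    return "\\" + "\\".join(parts) + "\\"


age_path = concept_path(
    "NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study",
    "Subject age",
)
print(age_path)
```

The resulting string is identical to the literal passed to `my_query.filter().add(...)` for the age filter, so it could be used in its place.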

#### Add Genotype Variable (Variant_frequency_in_gnomAD) to the Query

In [None]:
# View information about "Variant_frequency_in_gnomAD" variable
dictionary_search = dictionary.find("Variant_frequency_in_gnomAD")
for x in dictionary_search.entries():
    pprint.pprint(x)

In [None]:
my_query.filter().add("Variant_frequency_in_gnomAD", min=0, max=1)

#### Add Genotype Variable (Gene_with_variant) to the Query

In [None]:
# View information about "Gene_with_variant" variable
dictionary_search = dictionary.find("Gene_with_variant")
for x in dictionary_search.entries():
    temp = pprint.pformat(x)
    print(temp[0:400] + "\t...\n\t" + temp[-200:])
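
The slicing above always prints the first 400 and last 200 characters, so a short entry would have part of its text printed twice. A small sketch of a preview helper that only elides the middle when the text is actually long is given below; `preview` and its `head`/`tail` parameters are hypothetical names, not part of the PIC-SURE API.

```python
import pprint


def preview(obj, head=400, tail=200):
    """Pretty-format `obj`, eliding the middle when the text is long."""
    text = pprint.pformat(obj)
    if len(text) <= head + tail:
        return text  # short enough to show in full
    return text[:head] + "\t...\n\t" + text[-tail:]
```

In the loop above, `print(preview(x))` would then replace the two lines that build and slice `temp`.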

In [None]:
# Look for entries with variants in the CHD8 gene 
my_query.filter().add("Gene_with_variant", ["CHD8"])

Now that all query criteria have been entered into the query instance we can view it by using the following line of code:

In [None]:
# Now we show the query as it is specified
my_query.show()


Next, we will take this query and retrieve the data for patients matching its criteria.

## Retrieving Data from the Query

Now that we have built a query called `my_query` containing the search criteria we are interested in, we will run a count query to find the number of matching patients, followed by a data query to download the data.

#### Getting Query Count

In [None]:
my_query_count = my_query.getCount()
print(my_query_count)

#### Getting Query Data

Once our query object is built, we use the `getResultsDataFrame` function to retrieve the data corresponding to our query.

In [None]:
query_result = my_query.getResultsDataFrame(low_memory=False)

In [None]:
query_result.shape

In [None]:
query_result.head()