# PIC-SURE API Use-Case: Querying on Genomic Variables

This tutorial notebook is intended to get you up and running quickly with the python PIC-SURE API. It covers the main functionalities of the API.

## PIC-SURE python API 
### What is PIC-SURE? 

As part of the BioData Catalyst initiative, the Patient Information Commons Standard Unification of Research Elements (PIC-SURE) platform has been integrating clinical and genomic datasets from multiple TOPMed and TOPMed-related studies funded by the National Heart, Lung, and Blood Institute (NHLBI).

The data exposed through the PIC-SURE API come from studies whose underlying data organization is highly heterogeneous. PIC-SURE hides this complexity and exposes the different study datasets in a single tabular format. By simplifying data extraction, it lets investigators focus on downstream analysis and facilitates reproducible science.

Both phenotypic and genomic variables are accessible through the PIC-SURE API.

### More about PIC-SURE
The API is available in two different programming languages, python and R, enabling investigators to query the databases the same way using either language.

The R/python PIC-SURE API is a small part of the entire PIC-SURE platform.

The python API is actively developed by the Avillach Lab at Harvard Medical School.

PIC-SURE API GitHub repositories:

* https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-client

 -------   

# Getting your own user-specific security token

**Before running this notebook, please be sure to review the "Get your security token" documentation, which exists in the NHLBI_BioData_Catalyst [README.md file](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token). It explains how to get a security token, which is mandatory to access the databases.**

# Environment set-up

### Pre-requisites
- python 3.6 or later
- pip python package manager, already available in most systems with a python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))

### Install packages

Install the following:
- packages listed in the `requirements.txt` file (listed below, along with version numbers)
- PIC-SURE API components (from Github)
    - PIC-SURE Adapter 
    - PIC-SURE Client

In [None]:
!cat requirements.txt

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

Import all the external dependencies:

In [None]:
import json
import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureClient
import PicSureBdcAdapter

from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

## Connecting to a PIC-SURE resource

The following is required to get access to data through the PIC-SURE API: 
- Network URL
- Resource ID
- User-specific security token

If you have not already retrieved your user-specific token, please refer to the "Get your security token" section of the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token) file.

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [None]:
with open(token_file, "r") as f:
    my_token = f.read().strip()  # drop any trailing newline saved with the token

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

Two objects are created here: a `connection` and a `resource` object.

Since we will only be using a single resource, **the `resource` object is the only one we will need to proceed with the data analysis**.

It is connected to the specific resource we supplied and enables us to query and retrieve data from this database.
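
Before connecting, the three inputs listed above can be sanity-checked locally. The sketch below is a minimal example that uses only the standard library. `check_connection_settings` is a hypothetical helper, not part of the PIC-SURE client, and it assumes that resource IDs are UUID strings, as the value used above appears to be.

```python
import uuid
from urllib.parse import urlparse


def check_connection_settings(network_url, resource_id, token):
    """Return a list of problems found in the connection inputs (empty if none).

    Hypothetical helper for illustration: it performs local checks only
    and never contacts the PIC-SURE server.
    """
    problems = []

    parsed = urlparse(network_url)
    if parsed.scheme != "https" or not parsed.netloc:
        problems.append("network URL should be a full https:// address")

    try:
        uuid.UUID(resource_id)  # assumption: resource IDs are UUID strings
    except ValueError:
        problems.append("resource ID is not a valid UUID")

    if not token or token != token.strip():
        problems.append("token is empty or has leading/trailing whitespace")

    return problems


issues = check_connection_settings(
    "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure",
    "02e23f52-f354-4e8b-992c-d37c8b9ba140",
    "example-token",  # placeholder, not a real token
)
print(issues)
```

An empty list means the inputs look well formed; it does not guarantee that the token is still valid or that you are authorized for the resource.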

## Building the query with the PIC-SURE API

We are going to create a new query request from the PIC-SURE resource that was specified above. For this example, we will limit the query to a single study, two phenotypic filters (gender and age range), and one genomic filter.

First we will create a new query instance.

In [None]:
my_query = resource.query()


#### Limiting the query to a single study

By default, new query objects are automatically populated with all the consent groups that you are authorized to access. For this example, we are going to clear the existing consents and specify a single consent that represents accessing only the NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) study.

In [None]:
# Here we show all the studies that you have access to
resource.list_consents()

In [None]:
# Here we delete those accesses and add only a single study
my_query.filter().delete("\\_consents\\")
my_query.filter().add("\\_consents\\", ['phs000921.c2'])

In [None]:
# Here we show that we have only selected a single study
my_query.filter().show()

*Note that trying to manually add a consent group which you are not authorized to access will result in errors downstream.*
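
One way to avoid such downstream errors is to compare the consent codes you intend to add against the ones you are authorized for before touching the query. The sketch below assumes you have copied your authorized consent codes into a plain Python list; how `resource.list_consents()` returns them is not shown here, so that step is left out. `split_consents` is a hypothetical helper, and `phs999999.c1` is a made-up placeholder code.

```python
def split_consents(requested, authorized):
    """Split requested consent codes into (allowed, rejected) lists."""
    authorized_set = set(authorized)
    allowed = [code for code in requested if code in authorized_set]
    rejected = [code for code in requested if code not in authorized_set]
    return allowed, rejected


# Assumption: these are the consent codes listed for your own account.
authorized = ["phs000921.c2"]

allowed, rejected = split_consents(["phs000921.c2", "phs999999.c1"], authorized)
print("safe to add:", allowed)
print("not authorized:", rejected)
```

Only the codes in `allowed` would then be passed to `my_query.filter().add("\\_consents\\", allowed)`.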

#### List available phenotype variables

Once a connection to the desired resource has been established, it is helpful to search for variables related to our search query. We will use the `dictionary` method of the `resource` object to create a data dictionary instance to search for variables.

In [None]:
dictionary_entries = resource.dictionary().find("") # Get all variable entries
dict_df = dictionary_entries.DataFrame() # Export to dataframe
phenotype_vars = dict_df[dict_df.index.str.contains("(SAGE)", regex=False)] # Keep rows whose index (the variable key) contains "(SAGE)"

In [None]:
phenotype_vars

#### Add categorical phenotypic variable (gender) to the query

A `dictionary` instance enables us to retrieve matching records by searching for a specific term. The `find()` method can be used to retrieve information about all available variables. For instance, looking for variables containing the term `Sex of participant` is done this way: 

In [None]:
dictionary = resource.dictionary()
dictionary_search = dictionary.find("Sex of participant")

We will now loop through all of the `Sex of participant` variables we found to find entries that are part of our study of interest. To accomplish this, we will look for variables that contain "`(SAGE)`".  The output will allow us to see what values of the sex variable are valid to add to our query. 

In [None]:
# View information about the "Sex of participant" variable
target_key = None  # stays None if no SAGE entry is found
for x in dictionary_search.entries():
    if "(SAGE)" in x["name"]:  # keep only the SAGE study entry
        target_key = x["name"]
        pprint.pprint(x)
        break

The dictionary entry in the output above shows that we can select "`FEMALE`", "`MALE`", and/or "`NA`" for gender.  For this example let's limit our search to females.
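
The same selection logic can be written as a small reusable helper and combined with a check that the values you plan to filter on are actually allowed. This is a minimal sketch on made-up entries: the study names are placeholders, and the `categoryValues` field name is an assumption about how dictionary entries list their allowed values. Both functions are hypothetical and not part of the PIC-SURE API.

```python
def first_entry_for_study(entries, study_tag):
    """Return the first entry whose name contains study_tag, or None."""
    for entry in entries:
        # `in` matches the tag anywhere in the name, including at the start
        if study_tag in entry["name"]:
            return entry
    return None


def unsupported_values(entry, requested):
    """Return the requested values that the entry does not list as allowed."""
    allowed = set(entry.get("categoryValues", []))
    return [value for value in requested if value not in allowed]


# Hypothetical entries standing in for dictionary_search.entries()
entries = [
    {"name": "\\Other study (XYZ)\\Sex of participant\\",
     "categoryValues": ["F", "M"]},
    {"name": "\\Example study (SAGE)\\Sex of participant\\",
     "categoryValues": ["FEMALE", "MALE", "NA"]},
]

sage_entry = first_entry_for_study(entries, "(SAGE)")
print(sage_entry["name"])
print(unsupported_values(sage_entry, ["FEMALE", "OTHER"]))
```

Returning `None` when nothing matches makes a failed search explicit, so a missing key is caught before it reaches `filter().add()`.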

In [None]:
my_query.filter().add(target_key,['FEMALE'])

#### Add continuous phenotypic variable (age) to the query

Following the data dictionary search pattern shown above, we can search for the SAGE study variables related to `SUBJECT AGE`.

In [None]:
# View information about the "subject age" variable
dictionary = resource.dictionary()
dictionary_search = dictionary.find("SUBJECT AGE")
target_key = None  # reset so a failed search cannot reuse the previous key
for x in dictionary_search.entries():
    if "(SAGE)" in x["name"]:  # keep only the SAGE study entry
        target_key = x["name"]
        pprint.pprint(x)
        break

The dictionary entry in the output above shows the age range of data available for `SUBJECT AGE`.  

For this example let's limit our search to a minimum of 8 and maximum of 35.
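
A requested range can be compared with the range reported in the dictionary entry before it is added to the query. The sketch below assumes that entries for continuous variables carry `min` and `max` fields; the numbers in `age_entry` are invented for illustration and are not the real SAGE values. `clip_range` is a hypothetical helper.

```python
def clip_range(entry, requested_min, requested_max):
    """Clip a requested numeric range to the range stored in a dictionary entry."""
    if requested_min > requested_max:
        raise ValueError("requested_min must not exceed requested_max")
    low = max(requested_min, entry["min"])
    high = min(requested_max, entry["max"])
    if low > high:
        raise ValueError("requested range does not overlap the available data")
    return low, high


# Hypothetical entry; the real one is printed by the dictionary search above.
age_entry = {"name": "\\Example study (SAGE)\\SUBJECT AGE\\", "min": 5.0, "max": 40.0}

low, high = clip_range(age_entry, 8, 35)
print(low, high)
```

The clipped bounds could then be passed as the `min` and `max` arguments of `filter().add()`.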

In [None]:
my_query.filter().add(target_key, min=8, max=35)

#### List available genotypic variables

To start adding genomic filters to our query, we first need to understand which genomic variables exist.

In [None]:
dictionary_entries = resource.dictionary().find("")
dict_df = dictionary_entries.DataFrame()
genotype_vars = dict_df[dict_df["HpdsDataType"]=="info"]

In [None]:
genotype_vars

As shown in the output above, some genomic variables that can be used in queries include `Gene_with_variant`, `Variant_class`, and `Variant_severity`.
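
To see how genomic and phenotypic variables are separated, the entries can be grouped by their data type. This is a minimal sketch on made-up entries: it assumes each entry exposes the same `HpdsDataType` field that appears as a dataframe column above, and the `"phenotypes"` label and the study name are placeholders. `names_by_data_type` is a hypothetical helper.

```python
from collections import defaultdict


def names_by_data_type(entries):
    """Group variable names by the value of their HpdsDataType field."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["HpdsDataType"]].append(entry["name"])
    return dict(grouped)


# Hypothetical entries; only the "info" label is taken from the query above.
entries = [
    {"name": "Gene_with_variant", "HpdsDataType": "info"},
    {"name": "Variant_severity", "HpdsDataType": "info"},
    {"name": "\\Example study (SAGE)\\SUBJECT AGE\\", "HpdsDataType": "phenotypes"},
]

grouped = names_by_data_type(entries)
print(grouped["info"])
```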

#### Add genotypic variable (Gene_with_variant) to the query

Let's use `Gene_with_variant` to view a list of genes and get more information about this variable.

In [None]:
# View information about "Gene_with_variant" variable
dictionary_search = dictionary.find("Gene_with_variant")
target_key = dictionary_search.keys()[0]
for x in dictionary_search.entries():
    temp = pprint.pformat(x)
    print(temp[0:400] + "\t...\n\t" + temp[-200:])
    break

The output shown above provides a list of values that can be used for this variable, in this case genes affected by a variant. Let's narrow our query to include the CHD8 gene.

In [None]:
# Look for entries with variants in the CHD8 gene 
my_query.filter().add(target_key, ["CHD8"])

Now that all query criteria have been entered into the query instance we can view it by using the following line of code:

In [None]:
# Now we show the query as it is specified
my_query.show()


Next we will take this query and retrieve the data for participants with matching criteria.

## Retrieving data from the query

#### Getting query count

We have now built a query called `my_query` which contains the search criteria we are interested in.

Next, we will run a count query to find the number of matching participants.

Finally, we will run a data query to download the data.

In [None]:
my_query_count = my_query.getCount()
print(my_query_count)

#### Getting query data

Now that the query contains all of our search criteria, we can run it and retrieve the results.

In [None]:
query_result = my_query.getResultsDataFrame(low_memory=False)

In [None]:
query_result.shape

In [None]:
query_result.head() # Show first few rows of output

# Data analysis example: *SERPINA1* gene and COPD

In this example, we will create a query to explore the relationship between the COPD phenotype and variants in the SERPINA1 gene. Variations of the SERPINA1 gene have been found to be a strong risk factor for COPD, which you can read more about [here](https://pubmed.ncbi.nlm.nih.gov/31661293/).

To explore this relationship, we will narrow the cohort down to participants that meet the following criteria:
* participated in the COPDGene study
* have had COPD
* have a highly or moderately severe variant of the *SERPINA1* gene

Let's start by creating a new query and finding the variables pertaining to the COPDGene study using a multiIndex dictionary.

In [None]:
copd_query = resource.query()
copd_dictionary = resource.dictionary().find("COPDGene").DataFrame()
copdDict = get_multiIndex_variablesDict(copd_dictionary)

Now let's find the variable "COPD: have you ever had COPD" using the `simplified_name` column.
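
Full variable keys are long backslash-delimited paths, which is why matching on a shorter name is convenient. The implementation of `get_multiIndex_variablesDict` is not shown in this notebook, so the sketch below only illustrates the assumed idea that `simplified_name` is the last component of the key. `simplify_name` is a hypothetical function and the example key is invented.

```python
def simplify_name(full_name):
    """Return the last non-empty backslash-separated component of a variable key."""
    parts = [part for part in full_name.split("\\") if part]
    return parts[-1] if parts else ""


# Invented key following the backslash-delimited pattern used by PIC-SURE keys
full_name = "\\Example study ( phs000000 )\\Questionnaire\\COPD: have you ever had COPD\\"

print(simplify_name(full_name))
```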

In [None]:
mask_copd = copdDict['simplified_name'] == "COPD: have you ever had COPD" # Where is this variable in the dictionary?
copd_varname = copdDict.loc[mask_copd, "name"] # Filter to only that variable
copd_query.filter().add(copd_varname, "Yes")

To add the genomic filter, we can use a dictionary to find the variable `Gene_with_variant` and filter to the *SERPINA1* gene.

In [None]:
copd_dictionary = resource.dictionary()
gene_dictionary = copd_dictionary.find("Gene_with_variant")
gene_varname = gene_dictionary.keys()[0]
copd_query.filter().add(gene_varname, "SERPINA1")

In [None]:
severity_dictionary = copd_dictionary.find("Variant_severity")
severity_varname = severity_dictionary.keys()[0]
copd_query.filter().add(severity_varname, ["HIGH", "MODERATE"])

In [None]:
copd_query.getCount()

In [None]:
copd_result = copd_query.getResultsDataFrame(low_memory=False)
copd_result.shape