# *NHLBI BioData Catalyst® (BDC)* Powered by PIC-SURE API Use-Case: Querying on Genomic Variables

This tutorial notebook focuses on how to use the PIC-SURE API to query genomic and phenotypic data together.

For a more basic introduction to the Python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.
 
**Before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [`README.md` file](../README.md).**

 -------   

# Environment set-up

### Pre-requisites
- Python 3.6 or later
- pip, the Python package manager, which is already available on most systems with a Python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))

**Note that if you are using the dedicated PIC-SURE environment within the *BDC Powered by Seven Bridges* platform, the necessary packages have already been installed.**

### Install packages



In [None]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# BDC Powered by Terra users uncomment the following line to specify package install location
# sys.path.insert(0, r"/home/jupyter/.local/lib/python3.7/site-packages")
# Install into the interpreter that runs this notebook, as in the cell below
!{sys.executable} -m pip install matplotlib-venn
from matplotlib_venn import venn2


In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import PicSureClient
import PicSureBdcAdapter

## Connecting to a PIC-SURE network

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
token_file = "token.txt"

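with open(token_file, "r") as f:
    # Strip surrounding whitespace so a trailing newline in the file is not sent as part of the token
    my_token = f.read().strip()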
    
bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)
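
A token file saved from an editor often ends with a newline, and an empty file only surfaces as an authentication error later on. The sketch below shows one way to read and validate the token before connecting. `read_token` is a hypothetical helper written for this illustration, not part of the PIC-SURE client, and the temporary file only stands in for `token.txt`.

```python
from pathlib import Path
import tempfile


def read_token(path):
    """Return the token stored in `path` with surrounding whitespace removed.

    Hypothetical helper for illustration; not part of the PIC-SURE client.
    """
    token = Path(path).read_text().strip()
    if not token:
        raise ValueError(f"Token file {path} is empty")
    return token


# Demonstration with a throwaway file standing in for token.txt
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "token.txt"
    demo_file.write_text("not-a-real-token\n")
    demo_token = read_token(demo_file)

print(repr(demo_token))
```

In the notebook itself, the value returned by such a helper would be passed to `PicSureBdcAdapter.Adapter` in place of `my_token`.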

# Example 1: Walkthrough on building a query with genomic and phenotypic variables

## 1.1 Building the query with the PIC-SURE API

We are going to create a new query request from the PIC-SURE resource that was specified above. For this example, we will limit the query to a single study, two phenotypic filters (sex and age range), and two genomic filters.

First we will create a new query instance.

In [None]:
authPicSure = bdc.useAuthPicSure()
my_query = authPicSure.query()

### Find all SAGE variables

We can search for variables related to our search query using the `dictionary().find()` method. 

In this example, we will retrieve all variables available in the SAGE study (NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study, phs000921).

You can find information about the phs number associated with each study and what data are available from the Data Access Dashboard in the PIC-SURE [User Interface](https://picsure.biodatacatalyst.nhlbi.nih.gov/).

In [None]:
# Retrieve all SAGE variables
sage_dictionary = bdc.useDictionary().dictionary().find("phs000921")

# Save as dataframe
sage_df = sage_dictionary.dataframe() 
sage_df.head()              

## 1.2 Add categorical phenotypic variable (sex) to the query

First, we will search our SAGE data dictionary for a sex-related variable.

In [None]:
sex_var = sage_df[sage_df.columnmeta_name.str.contains('sex', case = False)]
sex_var

Let's examine the values within this variable.

In [None]:
sex_var['values']

We are only interested in FEMALE values in this use case, so let's apply this filter to our query accordingly.

In [None]:
my_query.filter().add(sex_var['HPDS_PATH'],['FEMALE'])

## 1.3 Add continuous phenotypic variable (age) to the query
For this example, we are only interested in children (<=18 years of age).

Following the data dictionary search pattern shown above, we can search for the SAGE study variables related to age.

In [None]:
# Search our SAGE dataframe for an age variable
age_var = sage_df[sage_df.columnmeta_name.str.contains('age', case = False)]
print(age_var.columnmeta_name, age_var.HPDS_PATH, '\n\n')

# Examine the values
print("Min: ", float(age_var['min']), "\nMax:", float(age_var['max']))

# Filter to children aged 18 years or younger
my_query.filter().add(age_var['HPDS_PATH'], max=18)
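
A plain substring search for `age` can also match names that merely contain those letters, which matters when the result is expected to be a single variable. The sketch below contrasts a substring search with an exact, case-insensitive name match. The dataframe is a toy stand-in with invented rows; only the `columnmeta_name` column name is taken from the notebook.

```python
import pandas as pd

# Toy stand-in for sage_df; the variable names are invented for illustration
toy_df = pd.DataFrame(
    {"columnmeta_name": ["AGE", "percentage_predicted", "disease_stage", "SEX"]}
)

# Substring search: also matches "percentage_predicted" and "disease_stage"
substring_hits = toy_df[toy_df.columnmeta_name.str.contains("age", case=False)]

# Exact-name search: the whole name must equal "age", ignoring case
exact_hits = toy_df[toy_df.columnmeta_name.str.lower() == "age"]

print(len(substring_hits), "substring matches")
print(list(exact_hits.columnmeta_name))
```

Checking the number of matching rows before using a variable in a filter is a cheap way to confirm that the search selected what was intended.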

## 1.4 Add genomic filters to the query

To start adding genomic filters to our query, we first need to understand what genomic annotations are available.

In [None]:
genotype_annotations = authPicSure.dictionary().genotype_annotations()
genotype_annotations

As shown in the output above, some genomic variables that can be used in queries include `Gene_with_variant`, `Variant_class`, and `Variant_severity`.

#### Add genotypic variable (Gene_with_variant) to the query

Let's explore which genes we can filter on by looking at the values associated with the `Gene_with_variant` annotation.

In [None]:
# get total list of genes
genes = genotype_annotations.iloc[[0]]['values'][0]
genes = genes.split(', ')

# print first 10 genes
print(genes[0:10])

# check if a certain gene of interest, e.g. CHD8, is in the gene list
gene_of_interest = 'CHD8'
gene_of_interest in genes
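
When several genes need to be checked, converting the list to a set and normalising case and whitespace makes the lookup more forgiving. The sketch below assumes, as in the cell above, that the annotation values arrive as one comma-separated string. The string used here is an invented stand-in, and `has_gene` is a hypothetical helper.

```python
# Invented stand-in for the comma-separated 'values' string of Gene_with_variant
toy_values = "CHD8, BRCA1, SERPINA1, TP53"

# Normalise each symbol once so that lookups ignore case and stray spaces
gene_set = {gene.strip().upper() for gene in toy_values.split(",")}


def has_gene(symbol, genes=gene_set):
    """Return True if `symbol` is in `genes`, ignoring case and whitespace."""
    return symbol.strip().upper() in genes


print(has_gene("chd8"))
print(has_gene("NOTAGENE"))
```

The match is on the whole symbol, so a partial string such as `CHD` is not treated as a hit.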

We can view the full list of `Variant_consequence_calculated` options in a similar way.

In [None]:
# get all "Variant_consequence_calculated" values
consequences = genotype_annotations.iloc[[2]]['values'][2]
consequences = consequences.split(', ')
print(sorted(consequences))

The gene list shown above provides the values that can be used for the `Gene_with_variant` filter, in this case the genes affected by a variant. Let's say we are interested in participants who have some kind of variant on the CHD8 gene. We can add this genomic filter to our query like so:

In [None]:
my_query.filter().add('Gene_with_variant', ["CHD8"])

We can further narrow our query by filtering on variant consequence. Let's look at missense variants in this example. 

In [None]:
my_query.filter().add('Variant_consequence_calculated', ["missense_variant"])

## 1.5 Retrieve data from the query

#### Getting query count

We have now built a query called `my_query` which contains the search criteria we are interested in:
- Sex = Female
- Age <= 18
- Participants have a missense variant on CHD8

We will run a count query to find the number of matching participants.
This is a great way to preview how many participants match your query criteria without extracting all of the data yet.


In [None]:
my_query_count = my_query.getCount()
print(my_query_count)

#### Getting query data

We will retrieve our results in a dataframe.
Note that since we only added sex and age to the query, these are the only phenotypic variables returned.

In [None]:
query_result = my_query.getResultsDataFrame(low_memory=False)

In [None]:
query_result.shape

In [None]:
query_result.head() # Show first few rows of output
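
Once results are in a dataframe, a quick check that every row satisfies the phenotypic filters can catch a mistyped filter early. The sketch below uses a toy dataframe with invented values and hypothetical column names `sex` and `age`; the actual column names in `query_result` may differ. `matches_filters` is a helper written for this illustration.

```python
import pandas as pd


def matches_filters(df, sex_col, age_col, max_age=18):
    """Return True if every row is FEMALE and at most `max_age` years old."""
    sex_ok = (df[sex_col] == "FEMALE").all()
    age_ok = (df[age_col] <= max_age).all()
    return bool(sex_ok and age_ok)


# Toy results with invented values; column names are assumptions
toy_result = pd.DataFrame({"sex": ["FEMALE", "FEMALE"], "age": [12.0, 18.0]})
toy_bad_result = pd.DataFrame({"sex": ["FEMALE", "FEMALE"], "age": [12.0, 19.0]})

print(matches_filters(toy_result, "sex", "age"))
print(matches_filters(toy_bad_result, "sex", "age"))
```

Note that a missing age compares as false against the threshold, so rows with missing values make the check fail rather than pass silently.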

# Example 2: Use case with *SERPINA1* gene variants as a risk factor for COPD

In this example, we will create a query to explore the relationship between the COPD phenotype and variants in the SERPINA1 gene. Variations of the SERPINA1 gene have been found to be a strong risk factor for COPD, which you can read more about [here](https://pubmed.ncbi.nlm.nih.gov/31661293/).

To explore this relationship, we will narrow the cohort down to participants that meet the following criteria:
* participated in the COPDGene study
* have had COPD
* have a *SERPINA1* gene variant with high or moderate severity

## 2.1 Create query
Let's start by creating a new query and finding the variables pertaining to the COPDGene study (phs000179).

In [None]:
# Retrieve all COPDGene variables
copd_dictionary = bdc.useDictionary().dictionary().find("phs000179")

# Save as dataframe
copd_df = copd_dictionary.dataframe() 
copd_df.head()


### Criterion 1: Participants who have had COPD
Let's search our dataframe of COPDGene variables to find the one we want to use to filter for participants who have ever had COPD.

In [None]:
copd_df[copd_df.columnmeta_description.str.contains('copd', case = False)][['HPDS_PATH', 'columnmeta_name', 'columnmeta_description', 'values']]

The variable with the description "COPD: have you ever had COPD" seems the most promising. Let's copy that variable's HPDS path to use in our query. Also note which values are present for this variable: we will filter to participants with the value "Yes".

In [None]:
copd_path = '\\phs000179\\pht002239\\phv00159731\\COPD\\'

In [None]:
copd_query = authPicSure.query()
copd_query.filter().add(copd_path, "Yes")

### Criterion 2: Participants with variants of high or moderate severity on *SERPINA1*

In [None]:
copd_query.filter().add('Gene_with_variant', ["SERPINA1"])

In [None]:
# what values are available for Variant_severity?
genotype_annotations.iloc[[4]]

In [None]:
copd_query.filter().add('Variant_severity', ["HIGH", "MODERATE"])
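
The cohort criteria listed at the start of this example read as follows: a participant must satisfy every criterion, and for severity either HIGH or MODERATE is acceptable. The sketch below mirrors that reading in plain Python on invented records. It illustrates the criteria as described in the text, not how PIC-SURE evaluates a query, and it simplifies each participant to a single variant. The field names and the `LOW` value are assumptions made for the example.

```python
# Invented participants; each is simplified to a single variant
toy_participants = [
    {"id": 1, "copd": "Yes", "gene": "SERPINA1", "severity": "HIGH"},
    {"id": 2, "copd": "Yes", "gene": "SERPINA1", "severity": "LOW"},
    {"id": 3, "copd": "No", "gene": "SERPINA1", "severity": "MODERATE"},
    {"id": 4, "copd": "Yes", "gene": "CHD8", "severity": "HIGH"},
    {"id": 5, "copd": "Yes", "gene": "SERPINA1", "severity": "MODERATE"},
]

# One set of accepted values per criterion
criteria = {
    "copd": {"Yes"},
    "gene": {"SERPINA1"},
    "severity": {"HIGH", "MODERATE"},
}


def meets_all(record, criteria):
    """True if the record has an accepted value for every criterion."""
    return all(record[field] in allowed for field, allowed in criteria.items())


matching_ids = [p["id"] for p in toy_participants if meets_all(p, criteria)]
print(matching_ids)  # [1, 5]
```

Participants 2, 3, and 4 each fail exactly one criterion, which shows why the combined cohort can be much smaller than the count for any single filter.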

## 2.2 Get results 
Now that the filtering is complete, we can use this final query to get counts and perform analysis on the data.

In [None]:
copd_query.getCount()

In [None]:
copd_result = copd_query.getResultsDataFrame(low_memory=False)
copd_result.shape

In [None]:
copd_result.head()