# *NHLBI BioData Catalyst® (BDC)* Powered by PIC-SURE API Use-Case: Querying on Genomic Variables

This tutorial notebook focuses on how to use the PIC-SURE API to query genomic and phenotypic data together.

For a more basic introduction to the Python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.
 
**Before running this notebook, please be sure to review the "Get your security token" documentation, which exists in the [`README.md` file](../README.md). It explains how to get a security token, which is mandatory to use the PIC-SURE API.**

To set up your token file, be sure to run the [`Workspace_setup.ipynb` file](./Workspace_setup.ipynb).

 -------   

# Environment set-up

### Pre-requisites
- Python 3.6 or later
- pip, the Python package manager, which is already available on most systems with a Python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))

**Note that if you are using the dedicated PIC-SURE environment within the *BDC Powered by Seven Bridges* platform, the necessary packages have already been installed.**

### Install packages



In [None]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# BDC Powered by Terra users: uncomment the following line to specify the package install location
# sys.path.insert(0, r"/home/jupyter/.local/lib/python3.7/site-packages")
!{sys.executable} -m pip install matplotlib-venn
from matplotlib_venn import venn2


In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import PicSureClient
import PicSureBdcAdapter

## Connecting to a PIC-SURE network

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
token_file = "token.txt"

with open(token_file, "r") as f:
    my_token = f.read()
    
bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)
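
Token files sometimes end with a trailing newline or are accidentally left empty, and either case may lead to a confusing authentication error later on. The following is a minimal sketch of a helper that strips surrounding whitespace and rejects an empty token. The name `read_token` and the throwaway demo file are hypothetical and are not part of the PIC-SURE client.

```python
import tempfile
from pathlib import Path


def read_token(path):
    """Read a token file, strip surrounding whitespace, and reject empty tokens.

    Hypothetical helper for illustration; not part of the PIC-SURE client.
    """
    token = Path(path).read_text().strip()
    if not token:
        raise ValueError(f"Token file {str(path)!r} is empty")
    return token


# Demonstrate with a throwaway file holding a placeholder value (not a real token)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "token.txt"
    demo_file.write_text("placeholder-token\n")
    demo_token = read_token(demo_file)

print(repr(demo_token))
```

In the notebook itself, the returned value could be passed to `PicSureBdcAdapter.Adapter` in place of `my_token`.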

# Example 1: Walkthrough on building a query with genomic and phenotypic variables

## 1.1 Building the query with the PIC-SURE API

We will create a new query against the PIC-SURE resource specified above. For this example, we will restrict the query to a single dataset (the TOPMed DCC Harmonized dataset) and filter on sex, body mass index, and two genomic annotations.

For more information about the TOPMed DCC Harmonized dataset, please refer to the [`2_TOPMed_DCC_Harmonized_Variables_analysis.ipynb` notebook](./2_TOPMed_DCC_Harmonized_Variables_analysis.ipynb).

First we will create a new query instance.

In [None]:
authPicSure = bdc.useAuthPicSure()
my_query = authPicSure.query()

### Find all TOPMed DCC Harmonized variables

We can search the data dictionary for variables matching a search term using the `dictionary().find()` method.

In this example, we will retrieve all variables available in the TOPMed DCC Harmonized dataset.

You can find information about the phs number associated with each study and what data are available from the Data Access Dashboard in the PIC-SURE [User Interface](https://picsure.biodatacatalyst.nhlbi.nih.gov/).

In [None]:
# Retrieve all TOPMed Harmonized variables
topmed_harmonized_dictionary = bdc.useDictionary().dictionary().find("DCC Harmonized")

# Save as dataframe
harmonized_df = topmed_harmonized_dictionary.dataframe() 
harmonized_df = harmonized_df[harmonized_df['studyId'] == "DCC Harmonized data set"]
print(harmonized_df.shape)
harmonized_df.head()              

## 1.2 Add categorical phenotypic variable (sex) to the query

First, we will search our TOPMed DCC Harmonized data dictionary for a sex-related variable.

In [None]:
sex_var = harmonized_df[harmonized_df.description.str.contains('Subject sex', case = False)]
sex_var

Let's examine the values within this variable.

In [None]:
sex_var['values']

In this use case, we are only interested in participants with the value `Female`, so let's apply this filter to our query.

In [None]:
my_query.filter().add(sex_var['HPDS_PATH'],['Female'])
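
A keyword search on the description column can match zero or several variables, so it can be worth confirming that exactly one variable was found before using it in a filter. The sketch below shows one way to do this with pandas on a made-up dictionary table. The helper `find_single_path`, the toy rows, and the toy paths are hypothetical; only the `HPDS_PATH` and `description` column names come from the dataframe used above.

```python
import pandas as pd

# Toy stand-in for the dictionary dataframe; rows and paths are made up
toy_dictionary = pd.DataFrame(
    {
        "HPDS_PATH": ["\\toy_study\\sex\\", "\\toy_study\\bmi\\", "\\toy_study\\age\\"],
        "description": ["Subject sex, as recorded", "Body mass index", "Age at visit"],
    }
)


def find_single_path(dictionary_df, keyword):
    """Return the HPDS_PATH of the one variable whose description contains keyword."""
    matches = dictionary_df[
        dictionary_df["description"].str.contains(keyword, case=False, regex=False)
    ]
    if len(matches) != 1:
        raise ValueError(f"Expected 1 match for {keyword!r}, found {len(matches)}")
    # .iloc[0] returns the path as a plain string rather than a one-element Series
    return matches["HPDS_PATH"].iloc[0]


sex_path = find_single_path(toy_dictionary, "subject sex")
print(sex_path)
```

An ambiguous keyword raises a `ValueError` instead of silently selecting several variables.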

## 1.3 Add continuous phenotypic variable (body mass index, BMI) to the query
For this example, we are only interested in obese participants (BMI >= 30).

Following the data dictionary search pattern shown above, we can search for the TOPMed DCC Harmonized dataset variables related to BMI.

In [None]:
# Search our TOPMed Harmonized dataframe for a BMI variable
bmi_var = harmonized_df[harmonized_df.description.str.contains('body mass index', case = False)]
print(bmi_var.columnmeta_name, bmi_var.HPDS_PATH, '\n\n')

# Examine the values
print(bmi_var["min"])
print(bmi_var["max"])

# Filter to obese participants with BMI 30 or more
my_query.filter().add(bmi_var['HPDS_PATH'], min=30)

## 1.4 Add genomic filters to the query

To start adding genomic filters to our query, we first need to understand what genomic annotations are available.

In [None]:
genotype_annotations = authPicSure.genotype_annotations()
genotype_annotations

The fat mass and obesity-associated gene, or *FTO* gene, [has been linked to obesity and other diseases](https://www.ncbi.nlm.nih.gov/gene/79068). Let's search for this gene.

In [None]:
genotype_annotation_values = authPicSure.genotype_annotation_values("Gene_with_variant", "FTO")
genotype_annotation_values

Suppose we also want only high-severity variants. The available severity levels are listed below.

In [None]:
severity = authPicSure.genotype_annotation_values("Variant_severity", '')
severity

Now we can add the filters to the query.

In [None]:
my_query.filter().add('Gene_with_variant', ["FTO"]) # FTO Gene of interest
my_query.filter().add('Variant_severity', ['HIGH']) # High severity variants

## 1.5 Retrieve data from the query

#### Getting query count

We have now built a query called `my_query` which contains the search criteria we are interested in:
- Sex = Female
- BMI >= 30
- Participants have a highly severe variant on *FTO*

We will run a count query to find the number of matching participants.
This is a great way to preview how many participants match your query criteria without extracting all of the data yet.


In [None]:
my_query_count = my_query.getCount()
print(my_query_count)
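
As a rough mental model, the count corresponds to the number of participants who satisfy every filter at once. The sketch below mimics the two phenotypic filters locally with pandas on a made-up table. It does not call PIC-SURE, the column names and values are hypothetical, and treating the lower BMI bound as inclusive is an assumption based on the "BMI >= 30" criterion stated in section 1.3.

```python
import pandas as pd

# Made-up participants for illustration only
toy_participants = pd.DataFrame(
    {
        "patient_id": [1, 2, 3, 4, 5],
        "sex": ["Female", "Male", "Female", "Female", "Male"],
        "bmi": [31.2, 35.0, 24.8, 30.0, 29.9],
    }
)

# One boolean mask per filter
is_female = toy_participants["sex"].isin(["Female"])
is_obese = toy_participants["bmi"] >= 30  # assumes the lower bound is inclusive

# Combining the masks with & keeps only rows that pass both filters
matching = toy_participants[is_female & is_obese]
toy_count = len(matching)
print(toy_count)
```

The genomic filters are not modelled here because variant-level data are not part of this toy table.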

#### Getting query data

We will retrieve our results in a dataframe.
Note that since we only added sex and BMI to the query, these are the only phenotypic variables returned.

In [None]:
query_result = my_query.getResultsDataFrame(low_memory=False)

In [None]:
query_result.shape

In [None]:
query_result.head() # Show first few rows of output

# Example 2: Use case with *SERPINA1* gene variants as a risk factor for COPD

In this example, we will create a query to explore the relationship between the COPD phenotype and variants in the SERPINA1 gene. Variations of the SERPINA1 gene have been found to be a strong risk factor for COPD, which you can read more about [here](https://pubmed.ncbi.nlm.nih.gov/31661293/).

To explore this relationship, we will narrow the cohort down to participants that meet the following criteria:
* participated in the COPDGene study
* have had COPD
* have a missense variant of the *SERPINA1* gene 

## 2.1 Create query
Let's start by creating a new query and finding the variables pertaining to the COPDGene study (phs000179).

In [None]:
# Retrieve all COPDGene variables
copd_dictionary = bdc.useDictionary().dictionary().find("phs000179")

# Save as dataframe
copd_df = copd_dictionary.dataframe() 
copd_df.head()


### Criterion 1: Participants who have had COPD
Let's search our dataframe of COPDGene variables to find the one we want to use to filter for participants who have ever had COPD.

In [None]:
copd_df[copd_df.columnmeta_description.str.contains('copd', case = False)][['HPDS_PATH', 'columnmeta_name', 'columnmeta_description', 'values']]

The variable with the description "COPD: have you ever had COPD" seems the most promising. Let's copy that variable's HPDS path to use in our query. Also note which values are present for this variable: we will filter to participants with the value "Yes".

In [None]:
copd_path = '\\phs000179\\pht002239\\phv00159731\\COPD\\'
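
The doubled backslashes in the path above are Python escape sequences: each `\\` in the source stands for a single backslash in the actual string. The short sketch below uses only built-in string methods to print the path as it is stored and to split it into its components. The names `copd_path_example`, `n_backslashes`, and `components` are introduced here for illustration.

```python
# Same literal as in the cell above; the doubled backslashes are escape sequences
copd_path_example = '\\phs000179\\pht002239\\phv00159731\\COPD\\'

# Printing shows the string as stored, with single backslashes
print(copd_path_example)

# Count the backslash characters actually stored in the string
n_backslashes = copd_path_example.count('\\')

# Split the path into its components, dropping the empty leading and trailing pieces
components = [part for part in copd_path_example.split('\\') if part]
print(components)
```

When copying a path from a dataframe display, where single backslashes are shown, each backslash therefore needs to be doubled in a regular string literal.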

In [None]:
copd_query = authPicSure.query()
copd_query.filter().add(copd_path, "Yes")

### Criterion 2: Participants with a missense variant on *SERPINA1*

A filter for the *SERPINA1* gene can be added in a way similar to the first query shown above.

In [None]:
copd_query.filter().add('Gene_with_variant', ["SERPINA1"])

We can take a look at the variant consequences available for filtering to confirm that missense variant is a filtering option.

In [None]:
consequences = authPicSure.genotype_annotation_values("Variant_consequence_calculated", '')
consequences

In [None]:
copd_query.filter().add('Variant_consequence_calculated', ["missense_variant"])

COPDGene, like many studies in BDC, has associated participant genomic data in the TOPMed version of the study. In order to map to these genomic files, we will need the sample identifier from the TOPMed study of COPDGene, which has an accession number of phs000951.

In [None]:
sample_id_search = bdc.useDictionary().dictionary().find("phs000951").dataframe()
sample_var = sample_id_search.HPDS_PATH[sample_id_search.description == "Sample ID"]
sample_var

To add this variable to the query, let's use `require()` to require participants to have genomic data from the TOPMed study. 

In [None]:
copd_query.require().add(sample_var)

## 2.2 Get results 
Now that the filtering is complete, we can use this final query to get counts and perform analysis on the data.

In [None]:
copd_query.getCount()

In [None]:
copd_result = copd_query.getResultsDataFrame(low_memory=False)
copd_result.shape

In [None]:
copd_result.head()

The sample identifier column can be used to identify genomic data files from the associated TOPMed study for analysis. The following code can be used to get a list of the sample IDs associated with your cohort. Note that some participants may have multiple samples. In these cases, the IDs are separated by a tab (`\t`) character.

In [None]:
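# Keep the participant ID and TOPMed sample ID columns; copy to avoid modifying a view of copd_result
clean_mapping = copd_result[['patient_id', '\\phs000951\\pht005051\\phv00253403\\SAMPLE_ID\\']].copy()
clean_mapping.columns = ["patient_id", "sample_id"]
# Split tab-separated sample IDs into lists, then expand to one row per sample
clean_mapping['sample_id'] = clean_mapping["sample_id"].str.split("\t")
mapping_df = clean_mapping.explode('sample_id')
mapping_df.head()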
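
To make the split-and-explode step concrete, here is a small worked sketch on made-up data. The participant and sample IDs are placeholders, not real identifiers, and the name `toy_mapping` is introduced only for this illustration.

```python
import pandas as pd

# Made-up IDs for illustration; real sample IDs will look different
toy_result = pd.DataFrame(
    {
        "patient_id": [101, 102],
        "sample_id": ["SAMPLE_A\tSAMPLE_B", "SAMPLE_C"],
    }
)

toy_mapping = toy_result.copy()
# Turn each tab-separated string into a list of sample IDs
toy_mapping["sample_id"] = toy_mapping["sample_id"].str.split("\t")
# Expand each list so that every sample ID gets its own row
toy_mapping = toy_mapping.explode("sample_id").reset_index(drop=True)
print(toy_mapping)
```

Participant 101 ends up on two rows, one per sample, while participant 102 keeps a single row.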