# PIC-SURE API Use-Case: Querying on Genomic Variables

This tutorial notebook focuses on how to use the PIC-SURE API to query genomic and phenotypic data together.

For a more basic introduction to the Python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.
 
**Before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token).**

 -------   

# Environment set-up

### Pre-requisites
- Python 3.6 or later
- pip, the Python package manager, already available on most systems with a Python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))

### Install packages



In [None]:
import sys

# Install the plotting helper and the PIC-SURE client libraries into the current kernel's environment
!{sys.executable} -m pip install matplotlib-venn
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git@new-search

# Import the libraries used in this notebook
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
import numpy as np
import pandas as pd

import PicSureBdcAdapter

## Connecting to a PIC-SURE network

In [None]:
# Uncomment production URL when testing in production
# PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
PICSURE_network_URL = "https://biodatacatalyst.integration.hms.harvard.edu/picsure"
token_file = "token.txt"

with open(token_file, "r") as f:
    my_token = f.read()
    
bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)
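
Token files often end with a trailing newline, which can cause authentication failures that are hard to diagnose. The following is a minimal sketch of a more defensive read. The helper `read_token` is hypothetical (it is not part of any PIC-SURE package), and a temporary file with a placeholder value stands in for `token.txt`.

```python
import os
import tempfile


def read_token(path):
    """Return the token stored at `path` with surrounding whitespace removed.

    Hypothetical helper for illustration; not part of any PIC-SURE package.
    """
    with open(path, "r") as f:
        token = f.read().strip()
    if not token:
        raise ValueError(f"Token file {path!r} is empty")
    return token


# Demonstrate with a temporary file standing in for token.txt
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "token.txt")
    with open(demo_path, "w") as f:
        f.write("abc.def.ghi\n")  # placeholder value, not a real token
    demo_token = read_token(demo_path)

print(repr(demo_token))
```

The returned string could then be passed to the adapter in place of the raw file contents.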

## Building the query with the PIC-SURE API

We are going to create a new query request from the PIC-SURE resource that was specified above. For this example, we will limit the query to a single study, two phenotypic filters (sex and age range), and a genomic filter.

First we will create a new query instance.

In [None]:
authPicSure = bdc.useAuthPicSure()
my_query = authPicSure.query()


### Find all SAGE variables

We can search for variables related to our search query using the `dictionary().find()` method. 

In this example, we will retrieve all variables available in the SAGE study (NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study, phs000921).

You can find information about the phs number associated with each study and what data are available from the Data Access Dashboard in the PIC-SURE [User Interface](https://picsure.biodatacatalyst.nhlbi.nih.gov/).

In [None]:
# Retrieve all SAGE variables
sage_dictionary = bdc.useDictionary().dictionary().find("phs000921")

# Save as dataframe
sage_df = sage_dictionary.dataframe() 
sage_df.head()

                      

### 1. Add categorical phenotypic variable (sex) to the query

First, we will search our SAGE data dictionary for a sex-related variable.

In [None]:
sex_var = sage_df[sage_df.columnmeta_name.str.contains('sex', case = False)]
sex_var

Let's examine the values within this variable.

In [None]:
sex_var['values']

We are only interested in FEMALE values in this use case, so let's apply this filter to our query.

In [None]:
my_query.filter().add(sex_var['HPDS_PATH'],['FEMALE'])

### 2. Add continuous phenotypic variable (age) to the query
For this example, we are only interested in children (<=18 years of age).

Following the data dictionary search pattern shown above, we can search for the SAGE study variables related to the `SUBJECT AGE`.

In [None]:
# Search our SAGE dataframe for an age variable
age_var = sage_df[sage_df.columnmeta_name.str.contains('age', case = False)]
print(age_var.columnmeta_name, age_var.HPDS_PATH, '\n\n')

# Examine the values
print("Min: ", float(age_var['min']), "\nMax:", float(age_var['max']))

# Filter to children aged 18 years or younger
my_query.filter().add(age_var['HPDS_PATH'], min=7.3, max=18)
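
A substring search such as `str.contains('age')` can match more than one variable, for example any name containing "percentage". The sketch below uses a toy dataframe whose column names mirror the ones above but whose rows are invented for illustration. It shows one way to check how many rows matched and to narrow the pattern with a word boundary before extracting a single path.

```python
import pandas as pd

# Toy stand-in for the data dictionary; rows are invented for illustration
toy_df = pd.DataFrame({
    "columnmeta_name": ["SUBJECT AGE", "Percentage of predicted FEV1", "SEX"],
    "HPDS_PATH": ["\\toy\\age\\", "\\toy\\fev1_pct\\", "\\toy\\sex\\"],
})

# Plain substring search: also matches the "age" inside "Percentage"
loose = toy_df[toy_df.columnmeta_name.str.contains("age", case=False)]

# Word-boundary regex: matches "AGE" only as a whole word
strict = toy_df[
    toy_df.columnmeta_name.str.contains(r"\bage\b", case=False, regex=True)
]

print(len(loose), len(strict))

# Extract a single path string once exactly one row remains
if len(strict) != 1:
    raise ValueError("Expected exactly one matching variable")
age_path = strict["HPDS_PATH"].iloc[0]
print(age_path)
```

Checking the number of matches before adding a filter helps avoid silently filtering on the wrong variable.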

### 3. Add genomic filters to the query

To start adding genomic filters to our query, we first need to understand which genomic variables exist.

In [None]:
resource = bdc.useDictionary()

In [None]:
sage_df.columns

In [None]:
dictionary_entries = resource.dictionary().find("")
dict_df = dictionary_entries.dataframe()
genotype_vars = dict_df[dict_df["HpdsDataType"]=="info"]

In [None]:
genotype_vars

As shown in the output above, some genomic variables that can be used in queries include `Gene_with_variant`, `Variant_class`, and `Variant_severity`.

Note that, for printing purposes, the full list of genes in the `categoryValues` column of the `Gene_with_variant` row was truncated. This provides a simpler preview of the genomic variables and avoids printing thousands of gene names in the dataframe.
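
A minimal sketch of this kind of preview truncation is shown here on a toy dataframe. The values are invented for illustration, and the helper `preview_values` is hypothetical rather than part of the PIC-SURE client.

```python
import pandas as pd


def preview_values(values, limit=3):
    """Return at most `limit` items, appending '...' when items were dropped."""
    values = list(values)
    if len(values) <= limit:
        return values
    return values[:limit] + ["..."]


# Toy stand-in for the genomic rows of the dictionary (invented values)
toy_genomic = pd.DataFrame(
    {
        "categoryValues": [
            ["CHD8", "SERPINA1", "BRCA1", "TP53", "APOE"],
            ["HIGH", "MODERATE", "LOW"],
        ]
    },
    index=["Gene_with_variant", "Variant_severity"],
)

# Truncate only a copy, so the full lists stay available for filtering
preview = toy_genomic.copy()
preview["categoryValues"] = preview["categoryValues"].apply(preview_values)
print(preview)
```

Working on a copy keeps the complete list of values intact for later use in filters.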

#### Add genotypic variable (Gene_with_variant) to the query

Let's use `Gene_with_variant` to view a list of genes and get more information about this variable.

In [None]:
# View the list of genes available for the "Gene_with_variant" variable
dictionary_search = resource.dictionary().find("Gene_with_variant").dataframe()
gene_list = dictionary_search.loc['Gene_with_variant', 'categoryValues']
print(sorted(gene_list))

We can also view the full list of `Variant_consequence_calculated` options.

In [None]:
# View the options available for the "Variant_consequence_calculated" variable
dictionary_search = resource.dictionary().find("Variant_consequence_calculated").dataframe()
consequence_list = dictionary_search.loc['Variant_consequence_calculated', 'categoryValues']
print(sorted(consequence_list))

The gene list printed earlier contains the values that can be used with `Gene_with_variant`, that is, the genes affected by a variant. Let's narrow our query to participants with a variant in the CHD8 gene.

In [None]:
# Look for entries with variants in the CHD8 gene 
dictionary_search = resource.dictionary().find("Gene_with_variant")
target_key = dictionary_search.keys()[0]
my_query.filter().add(target_key, ["CHD8"])
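
Before adding a categorical filter, it can help to confirm that each requested value actually appears in the list of available values. The following is a minimal sketch using a hypothetical helper, `check_values`, and an invented subset of gene names. It uses exact string comparison; whether PIC-SURE itself treats values case-sensitively is not something this sketch establishes.

```python
def check_values(requested, available):
    """Split `requested` into values present in and absent from `available`."""
    available_set = set(available)
    found = [v for v in requested if v in available_set]
    missing = [v for v in requested if v not in available_set]
    return found, missing


# Invented subset standing in for the full gene list
toy_gene_list = ["APOE", "CHD8", "SERPINA1"]

found, missing = check_values(["CHD8", "chd8"], toy_gene_list)
print("found:", found)
print("missing:", missing)
```

Any value reported as missing is worth checking for typos or capitalization differences before running the query.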

Now that all query criteria have been entered into the query instance we can view it using the following line of code:

In [None]:
# Now we show the query as it is specified
my_query.show()


Next we will take this query and retrieve the data for participants with matching criteria.

## Retrieving data from the query

#### Getting query count

We have now built a query called `my_query` which contains the search criteria we are interested in.

Next, we will run a count query to find the number of matching participants.

Finally, we will run a data query to download the data.

In [None]:
my_query_count = my_query.getCount()
print(my_query_count)

#### Getting query data

Now that the query contains all of our filters, we can run it and retrieve the results.

In [None]:
query_result = my_query.getResultsDataFrame(low_memory=False)

In [None]:
query_result.shape

In [None]:
query_result.head() # Show first few rows of output
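
A quick sanity check on the results is to confirm that the returned rows respect the filters and that the number of rows agrees with the count. The sketch below uses a toy dataframe and a hard-coded count in place of real query output. The column names are invented, and it assumes one row per participant.

```python
import pandas as pd

# Toy stand-ins for the count and the results dataframe (invented values)
toy_count = 3
toy_result = pd.DataFrame({
    "Patient ID": [101, 102, 103],
    "age": [8.5, 12.0, 17.9],
    "sex": ["FEMALE", "FEMALE", "FEMALE"],
})

# One row per participant is assumed, so row count should equal the query count
rows_match = len(toy_result) == toy_count

# `between` is inclusive on both ends by default, matching min=7.3, max=18
ages_in_range = toy_result["age"].between(7.3, 18).all()

# Every row should carry the categorical value that was filtered on
only_female = (toy_result["sex"] == "FEMALE").all()

print(rows_match, ages_in_range, only_female)
```

If any of these checks fail on real data, the filter paths or values used in the query are the first thing to revisit.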

# Data analysis example: *SERPINA1* gene and COPD

In this example, we will create a query to explore the relationship between the COPD phenotype and variants in the SERPINA1 gene. Variations of the SERPINA1 gene have been found to be a strong risk factor for COPD, which you can read more about [here](https://pubmed.ncbi.nlm.nih.gov/31661293/).

To explore this relationship, we will narrow the cohort down to participants that meet the following criteria:
* participated in the COPDGene study
* have had COPD
* have a *SERPINA1* gene variant with high or moderate severity

#### Initialize the query
Let's start by creating a new query and finding the variables pertaining to the COPDGene study using a multiIndex dictionary.

In [None]:
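# Helper from the tutorial repository's python_lib folder (assumed to be importable from this notebook)
from python_lib.utils import get_multiIndex_variablesDict

copd_query = authPicSure.query()  # queries are created from the authorized PIC-SURE resource, as above
copd_dictionary = resource.dictionary().find("COPDGene").dataframe()
copdDict = get_multiIndex_variablesDict(copd_dictionary)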

#### Add phenotypic variable (COPD: have you ever had COPD) to the query
Next we will find the full variable name for "COPD: have you ever had COPD" using the `simplified_name` column and filter to this data.

In [None]:
mask_copd = copdDict['simplified_name'] == "COPD: have you ever had COPD" # Where is this variable in the dictionary?
copd_varname = copdDict.loc[mask_copd, "name"] # Filter to only that variable
copd_query.filter().add(copd_varname, "Yes")

#### Add genomic variable (Gene_with_variant) to the query
To add the genomic filter, we can use a dictionary to find the variable `Gene_with_variant` and filter to the *SERPINA1* gene.

In [None]:
copd_dictionary = resource.dictionary()
gene_dictionary = copd_dictionary.find("Gene_with_variant")
gene_varname = gene_dictionary.keys()[0]
copd_query.filter().add(gene_varname, "SERPINA1")

#### Add genomic variable (Variant_severity) to the query
Finally, we can filter our results to include only variants of the *SERPINA1* gene with high or moderate severity. 

In [None]:
severity_dictionary = copd_dictionary.find("Variant_severity")
severity_varname = severity_dictionary.keys()[0]
copd_query.filter().add(severity_varname, ["HIGH", "MODERATE"])

#### Retrieve data from the query
Now that the filtering is complete, we can use this final query to get counts and perform analysis on the data.

In [None]:
copd_query.getCount()

In [None]:
copd_result = copd_query.getResultsDataFrame(low_memory=False)
copd_result.shape

In [None]:
copd_result.head()
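
One simple way to summarize how two cohorts relate is to count the participants unique to each and those shared by both. The sketch below uses invented participant IDs in place of real query results. In practice the sets would be built from the participant identifier column of each results dataframe.

```python
# Invented participant IDs standing in for two query results
copd_ids = {1, 2, 3, 4, 5}      # e.g. participants who have had COPD
variant_ids = {4, 5, 6}         # e.g. participants with a SERPINA1 variant

only_copd = len(copd_ids - variant_ids)      # in the first set only
only_variant = len(variant_ids - copd_ids)   # in the second set only
both = len(copd_ids & variant_ids)           # in both sets

venn_subsets = (only_copd, only_variant, both)
print(venn_subsets)
```

The tuple follows the (left only, right only, overlap) order that `venn2` from `matplotlib_venn` accepts through its `subsets` argument, so it could be passed to the plotting function imported at the top of this notebook.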