# PIC-SURE API Use-Case: Querying on Genomic Variables

This is a tutorial notebook, aimed to be quickly up and running with the python PIC-SURE API. It covers the main functionalities of the API.

## PIC-SURE python API 
### What is PIC-SURE? 

As part of the BioData Catalyst initiative, the Patient Information Commons Standard Unification of Research Elements (PIC-SURE) platform has been integrating clinical and genomic datasets from multiple TOPMed and TOPMed related studies funded by the National Heart Lung and Blood Institute (NHLBI).

Original data exposed through the PIC-SURE API encompasses a large heterogeneity of data organization underneath. PIC-SURE hides this complexity and exposes the different study datasets in a single tabular format. By simplifying the process of data extraction, it allows investigators to focus on downstream analysis and to facilitate reproducible science.

Both phenotypic and genomic variables are accessible through the PIC-SURE API.

### More about PIC-SURE
The API is available in two different programming languages, python and R, enabling investigators to query the databases the same way using either language.

The R/python PIC-SURE is a small part of the entire PIC-SURE platform.

The python API is actively developed by the Avillach Lab at Harvard Medical School.

PIC-SURE API GitHub repo:

* https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-client

 -------   

# Getting your own user-specific security token

**Before running this notebook, please be sure to review the "Get your security token" documentation, which exists in the NHLBI_BioData_Catalyst [README.md file](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token). It explains about how to get a security token, which is mandatory to access the databases.**

# Environment set-up

### Pre-requisites
- python 3.6 or later
- pip python package manager, already available in most systems with a python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))

### Install packages

Install the following:
- packages listed in the `requirements.txt` file (listed below, along with version numbers)
- PIC-SURE API components (from Github)
    - PIC-SURE Adapter 
    - PIC-SURE Client

In [1]:
!cat requirements.txt

numpy>=1.16.4
matplotlib>=3.1.1
pandas>=0.25.3
scipy>=1.3.1
tqdm>=4.38.0
statsmodels>=0.10.2


In [2]:
import sys
!{sys.executable} -m pip install -r requirements.txt

Collecting tqdm>=4.38.0
  Downloading tqdm-4.61.0-py2.py3-none-any.whl (75 kB)
[K     |████████████████████████████████| 75 kB 4.6 MB/s  eta 0:00:01
Installing collected packages: tqdm
Successfully installed tqdm-4.61.0


In [3]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

Collecting git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
  Cloning https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git to /tmp/pip-req-build-s55fksb6
  Running command git clone -q https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git /tmp/pip-req-build-s55fksb6
Collecting httplib2
  Using cached httplib2-0.19.1-py3-none-any.whl (95 kB)
Collecting pyparsing<3,>=2.4.2
  Using cached pyparsing-2.4.7-py2.py3-none-any.whl (67 kB)
Building wheels for collected packages: PicSureHpdsLib
  Building wheel for PicSureHpdsLib (setup.py) ... [?25ldone
[?25h  Created wheel for PicSureHpdsLib: filename=PicSureHpdsLib-0.9.0-py2.py3-none-any.whl size=22051 sha256=996fd77f58ea17692231d8665def7bc4d32f3172525c41a415a65eeb6556ca5d
  Stored in directory: /tmp/pip-ephem-wheel-cache-0ph81d9q/wheels/ae/d9/1a/c8c0ac8151b575c845efddc061fe014d86c51d1fd2c408907c
Successfully built PicSureHpdsLib
Installing collected packages: pyparsing, httplib2, PicSureHpdsLib
  Attempting un

Import all the external dependencies

In [4]:
import json
import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureClient
import PicSureBdcAdapter

from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

## Connecting to a PIC-SURE resource

The following is required to get access to data through the PIC-SURE API: 
- Network URL
- Resource ID
- User-specific security token

If you have not already retrieved your user-specific token, please refer to the "Get your security token" section of the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token) file.

In [5]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [6]:
with open(token_file, "r") as f:
    my_token = f.read()

In [7]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

+--------------------------------------+------------------------------------------------------+
|  Resource UUID                       |  Resource Name                                       |
+--------------------------------------+------------------------------------------------------+
| 02e23f52-f354-4e8b-992c-d37c8b9ba140 |                                                      |
| 70c837be-5ffc-11eb-ae93-0242ac130002 |                                                      |
+--------------------------------------+------------------------------------------------------+


Two objects are created here: a `connection` and a `resource` object.

Since we will only be using a single resource, **the `resource` object is  the only one we will need to proceed with the data analysis**. 

It is connected to the specific resource we supplied and enables us to query and retrieve data from this database.

## Building the query with the PIC-SURE API

We are going to create a new query request from the PIC-SURE resource that was specified above.  For this example, we will limit the query to a single study, a single phenotype (gender and age range), and two genomic filters. 

First we will create a new query instance.

In [8]:
my_query = resource.query()


#### Limiting the query to a single study

By default new query objects are automatically populated with all the consent groups for which you are authorized to access.  For this example we are going to clear the existing consents and specify a single consent that represents accessing only the NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) study.

In [9]:
# Here we show all the studies that you have access to
resource.list_consents()

Unnamed: 0,consent,harmonized,topmed
0,phs001359.c1,N,N
1,phs001143.c1,Y,N
2,phs000993.c2,N,Y
3,phs000280.c2,Y,N
4,phs001345.c1,N,Y
...,...,...,...
88,phs000964.c1,N,Y
89,phs000286.c0,Y,N
90,phs001416.c2,N,Y
91,phs001062.c2,N,Y


In [10]:
# Here we delete those accesses and add only a single study
my_query.filter().delete("\\_consents\\")
my_query.filter().add("\\_consents\\", ['phs000921.c2'])

Deleted key: \_consents\


<PicSureHpdsLib.PicSureHpdsAttrListKeyValues.AttrListKeyValues at 0x7fb6602c0668>

In [11]:
# Here we show that we have only selected a single study
my_query.filter().show()

| _restriction_type_ | _key__________________________________________________________________________________________________________ | _restriction_values_
|  categorical       | \\_topmed_consents\\                                                                                           | ['phs001215.c0', 'phs001217.c1', 'phs001217.c0', 'phs001345.c1', 'phs001215.c1', 'phs000946.c1', 'phs000954.c1', 'phs000921.c2', 'phs001368.c2', 'phs001368.c1', 'phs001189.c1', 'phs001032.c1', 'phs000988.c1', 'phs001207.c1', 'phs001207.c0', 'phs000988.c0', 'phs001040.c1', 'phs000951.c2', 'phs000974.c1', 'phs000951.c1', 'phs000997.c1', 'phs001237.c1', 'phs001237.c2', 'phs000974.c2', 'phs000993.c1', 'phs000920.c2', 'phs000993.c2', 'phs001293.c1', 'phs001293.c2', 'phs001293.c0', 'phs000964.c3', 'phs001062.c1', 'phs000964.c4', 'phs001062.c2', 'phs000964.c0', 'phs001189.c0', 'phs000964.c1', 'phs001368.c4', 'phs000964.c2', 'phs001211.c2', 'phs001211.c1', 'phs001416.c1', 'phs001412.c2', 'phs001024.c1', 'p

*Note that trying to manually add a consent group which you are not authorized to access will result in errors downstream.*

#### List available phenotype variables

Once a connection to the desired resource has been established, it is helpful to search for variables related to our search query. We will use the `dictionary` method of the `resource` object to create a data dictionary instance to search for variables.

In [12]:
dictionary_entries = resource.dictionary().find("") # Get all variable entries
dict_df = dictionary_entries.DataFrame() # Export to dataframe
phenotype_vars = dict_df[dict_df.index.str.contains("(SAGE)", regex=False)] # Look for SAGE in the KEY column

In [13]:
phenotype_vars

Unnamed: 0_level_0,min,categorical,observationCount,patientCount,max,HpdsDataType,categoryValues,description
KEY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\This sample attributes table includes body site where sample was collected, analyte type, tumor status, sequencing center, funding source, TOPMed phase, project, and study name.\Analyte Type\",,True,2105.0,2105.0,,phenotypes,[DNA],
"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\The subject consent data table contains subject IDs, consent group information, and subject aliases.\Consent group as determined by DAC\",,True,2106.0,2106.0,,phenotypes,"[Disease-Specific (Lung Diseases, IRB, COL) (D...",
"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\No2 Air Pollution measurement lifetime average\",,True,1708.0,1708.0,,phenotypes,[NA],
"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\De-identified Subject ID\",,True,2106.0,2106.0,,phenotypes,"[BUR02260558, BUR02260559, BUR02260560, BUR022...",
"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\This subject sample mapping data table includes a mapping of study subject IDs to sample IDs. Samples are the final preps submitted for genotyping, sequencing, and/or expression data. For example, if one patient (subject ID) gave one sample, and that sample was processed differently to generate 2 sequencing runs, there would be two rows, both using the same subject ID, but having 2 unique sample IDs. The data table also includes sample source.\Subject ID of Phenotype Data\",,True,2105.0,2105.0,,phenotypes,"[BUR02260558, BUR02260559, BUR02260560, BUR022...",
"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\This sample attributes table includes body site where sample was collected, analyte type, tumor status, sequencing center, funding source, TOPMed phase, project, and study name.\Tumor Status\",,True,2105.0,2105.0,,phenotypes,[Is not a tumor],
"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\This sample attributes table includes body site where sample was collected, analyte type, tumor status, sequencing center, funding source, TOPMed phase, project, and study name.\TOPMed Phase\",1.0,False,2105.0,2105.0,3.0,phenotypes,,
"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\The subject consent data table contains subject IDs, consent group information, and subject aliases.\Subject ID\",,True,2106.0,2106.0,,phenotypes,"[BUR02260558, BUR02260559, BUR02260560, BUR022...",
"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\Subject age\",7.3,False,2104.0,2104.0,41.0,phenotypes,,
"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\Bronchodilator response after 4 puffs of Proventil HFA Albuterol\",,True,834.0,834.0,,phenotypes,[NA],


#### Add categorical phenotypic variable (gender) to the query

A `dictionary` instance enables us to retrieve matching records by searching for a specific term. The `find()` method can be used to retrieve information about all available variables. For instance, looking for variables containing the term `Sex of participant` is done this way: 

In [14]:
dictionary = resource.dictionary()
dictionary_search = dictionary.find("Sex of participant")

We will now loop through all of the `Sex of participant` variables we found to find entries that are part of our study of interest. To accomplish this, we will look for variables that contain "`(SAGE)`".  The output will allow us to see what values of the sex variable are valid to add to our query. 

In [15]:
# View information about the "Sex of participant" variable
target_key = False
for x in dictionary_search.entries():
    if x["name"].find("(SAGE)") > 0:
        target_key = x["name"]
        pprint.pprint(x)
        break

{'HpdsDataType': 'phenotypes',
 'categorical': True,
 'categoryValues': ['FEMALE', 'MALE', 'NA'],
 'name': '\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and '
         'Environment (SAGE) Study ( phs000921 )\\Sex of participant\\',
 'observationCount': 2106,
 'patientCount': 2106}


The dictionary entry in the output above shows that we can select "`FEMALE`", "`MALE`", and/or "`NA`" for gender.  For this example let's limit our search to females.

In [16]:
my_query.filter().add(target_key,['FEMALE'])

<PicSureHpdsLib.PicSureHpdsAttrListKeyValues.AttrListKeyValues at 0x7fb6602c0668>

#### Add continuous phenotypic variable (age) to the query

Following the data dictionary search pattern shown above, we can search for the SAGE study variables related to the `SUBJECT AGE`.

In [17]:
# View information about the "subject age" variable
dictionary = resource.dictionary()
dictionary_search = dictionary.find("SUBJECT AGE")
for x in dictionary_search.entries():
    if x["name"].find("(SAGE)") > 0:
        target_key = x["name"]
        pprint.pprint(x)
        break

{'HpdsDataType': 'phenotypes',
 'categorical': False,
 'max': 41.0,
 'min': 7.3,
 'name': '\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and '
         'Environment (SAGE) Study ( phs000921 )\\Subject age\\',
 'observationCount': 2104,
 'patientCount': 2104}


The dictionary entry in the output above shows the age range of data available for `SUBJECT AGE`.  

For this example let's limit our search to a minimum of 8 and maximum of 35.

In [18]:
my_query.filter().add(target_key, min=8, max=35)

<PicSureHpdsLib.PicSureHpdsAttrListKeyValues.AttrListKeyValues at 0x7fb6602c0668>

#### List available genotypic variables

To start adding genomic filters to our query, we first need to understand which genomic variables exist.

In [38]:
dictionary_entries = resource.dictionary().find("")
dict_df = dictionary_entries.DataFrame()
genotype_vars = dict_df[dict_df["HpdsDataType"]=="info"]

In [39]:
genotype_vars

Unnamed: 0_level_0,min,categorical,observationCount,patientCount,max,HpdsDataType,categoryValues,description
KEY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Gene_with_variant,,True,,,,info,"[HTR4, AC121758.1, HTR6, HTR7, BBX, RN7SL563P,...","Description=""The official symbol for a gene af..."
Variant_class,,True,,,,info,"[SNV, insertion, deletion]","Description=""A standardized term from the Sequ..."
Variant_consequence_calculated,,True,,,,info,"[intergenic_variant, start_retained_variant, f...","Description=""A standardized term from the Sequ..."
Variant_frequency_as_text,,True,,,,info,"[Novel, Rare, Common]","Description=""The variant allele frequency in g..."
Variant_severity,,True,,,,info,"[MODERATE, HIGH, LOW]","Description=""The severity for the calculated c..."


As shown in the output above, some genomic variables that can be used in queries include `Gene_with_variant`, `Variant_class`, and `Variant_severity`.

Note that, for printing purposes, the full list of genes in `Gene_with_variant` row and `categoryValues` column was truncated. This is to provide a simpler preview of the genomic variables and to avoid printing thousands of gene names in the dataframe.

#### Add genotypic variable (Gene_with_variant) to the query

Let's use `Gene_with_variant` to view a list of genes and get more information about this variable.

In [37]:
# View gene list about "Gene_with_variant" variable
dictionary_search = dictionary.find("Gene_with_variant").DataFrame()
gene_list = dictionary_search.loc['Gene_with_variant', 'categoryValues']
print(sorted(gene_list))

['5S_rRNA', '5_8S_rRNA', '7SK', 'A1BG', 'A1CF', 'A2M', 'A2ML1', 'A2ML1-AS1', 'A2MP1', 'A3GALT2', 'A4GALT', 'A4GNT', 'AA06', 'AAAS', 'AACS', 'AACSP1', 'AADAC', 'AADACL2', 'AADACL2-AS1', 'AADACL3', 'AADACL4', 'AADACP1', 'AADAT', 'AAED1', 'AAGAB', 'AAK1', 'AAMDC', 'AANAT', 'AAR2', 'AARD', 'AARS', 'AARS2', 'AASDH', 'AASDHPPT', 'AASS', 'AATBC', 'AATF', 'AATK', 'AATK-AS1', 'AB015752.1', 'ABALON', 'ABAT', 'ABBA01006766.1', 'ABBA01006766.2', 'ABCA1', 'ABCA10', 'ABCA12', 'ABCA13', 'ABCA17P', 'ABCA2', 'ABCA3', 'ABCA4', 'ABCA5', 'ABCA6', 'ABCA7', 'ABCA8', 'ABCA9', 'ABCA9-AS1', 'ABCB1', 'ABCB10', 'ABCB10P1', 'ABCB10P3', 'ABCB10P4', 'ABCB11', 'ABCB4', 'ABCB5', 'ABCB6', 'ABCB8', 'ABCB9', 'ABCC1', 'ABCC10', 'ABCC11', 'ABCC12', 'ABCC13', 'ABCC2', 'ABCC3', 'ABCC4', 'ABCC5', 'ABCC6', 'ABCC6P1', 'ABCC6P2', 'ABCC8', 'ABCC9', 'ABCD1P2', 'ABCD1P3', 'ABCD1P4', 'ABCD1P5', 'ABCD2', 'ABCD3', 'ABCD4', 'ABCE1', 'ABCF1', 'ABCF2', 'ABCF2P1', 'ABCF3', 'ABCG1', 'ABCG2', 'ABCG4', 'ABCG5', 'ABCG8', 'ABHD1', 'ABHD10', '

We can also view the full list of `Variant_consequence_calculated` options.

In [40]:
# View options of the "Varaint_consequence_calculated" option
dictionary_search = dictionary.find("Variant_consequence_calculated").DataFrame()
consequence_list = dictionary_search.loc['Variant_consequence_calculated', 'categoryValues']
print(sorted(consequence_list))

['3_prime_UTR_variant', '5_prime_UTR_variant', 'TFBS_ablation', 'TF_binding_site_variant', 'coding_sequence_variant', 'downstream_gene_variant', 'frameshift_variant', 'incomplete_terminal_codon_variant', 'inframe_deletion', 'inframe_insertion', 'intergenic_variant', 'intron_variant', 'mature_miRNA_variant', 'missense_variant', 'non_coding_transcript_exon_variant', 'non_coding_transcript_variant', 'protein_altering_variant', 'regulatory_region_variant', 'splice_acceptor_variant', 'splice_donor_variant', 'splice_region_variant', 'start_lost', 'start_retained_variant', 'stop_gained', 'stop_lost', 'stop_retained_variant', 'synonymous_variant', 'upstream_gene_variant']


The gene list shown above provides a list of values that can be used for the `Gene_wivariable, in this case genes affected by a variant. Let's narrow our query to include the CHD8 gene.

In [22]:
# Look for entries with variants in the CHD8 gene 
my_query.filter().add(target_key, ["CHD8"])

ERROR: cannot add, key already exists ->  \NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\Subject age\


<PicSureHpdsLib.PicSureHpdsAttrListKeyValues.AttrListKeyValues at 0x7fb6602c0668>

Now that all query criteria have been entered into the query instance we can view it using the following line of code:

In [23]:
# Now we show the query as it is specified
my_query.show()

.__________[ Query.select()  Settings ]_____________________________________________________________________________________________________________________
| _key__________________________________________________________________________________________________________________________
|  \\_Topmed Study Accession with Subject ID\\                                                                                      |
|  \\_Parent Study Accession with Subject ID\\                                                                                      |
.__________[ Query.crosscounts()  has NO SELECTIONS ]_______________________________________________________________________________________________________
.__________[ Query.require() has NO SELECTIONS ]____________________________________________________________________________________________________________
.__________[ Query.anyof()  has NO SELECTIONS ]_____________________________________________________________________________________


Next we will take this query and retrieve the data for participants with matching criteria.

## Retrieving data from the query

#### Getting query count

We have now built a query called `my_query` which contains the search criteria we are interested in.

Next, we will run a count query to find the number of matching participants.

Finally, we will run a data query to download the data.

In [24]:
my_query_count = my_query.getCount()
print(my_query_count)

1077


#### Getting query data

Now that we have all our research variables being returned, we can now run the query and get the results.

In [25]:
query_result = my_query.getResultsDataFrame(low_memory=False)

In [26]:
query_result.shape

(1077, 6)

In [27]:
query_result.head() # Show first few rows of output

Unnamed: 0,Patient ID,"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\Sex of participant\","\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\Subject age\",\_Parent Study Accession with Subject ID\,\_Topmed Study Accession with Subject ID\,\_consents\
0,416098,FEMALE,32.0,,phs000921.v4_BUR02260558,phs000921.c2
1,416103,FEMALE,33.0,,phs000921.v4_BUR02260563,phs000921.c2
2,416104,FEMALE,33.0,,phs000921.v4_BUR02260564,phs000921.c2
3,416105,FEMALE,25.0,,phs000921.v4_BUR02260565,phs000921.c2
4,416109,FEMALE,32.0,,phs000921.v4_BUR02260569,phs000921.c2


# Data analysis example: *SERPINA1* gene and COPD

In this example, we will create a query to explore the relationship between the COPD phenotype and variants in the SERPINA1 gene. Variations of the SERPINA1 gene have been found to be a strong risk factor for COPD, which you can read more about [here](https://pubmed.ncbi.nlm.nih.gov/31661293/).

To explore this relationship, we will narrow the cohort down to participants that meet the following criteria:
* participated in the COPDgene study
* have had COPD
* have a *SERPINA1* gene variant with high or moderate severity

#### Initialize the query
Let's start by creating a new query and finding the variables pertaining to the COPDgene study using a multiIndex dictionary.

In [28]:
copd_query = resource.query()
copd_dictionary = resource.dictionary().find("COPDGene").DataFrame()
copdDict = get_multiIndex_variablesDict(copd_dictionary)

#### Add phenotypic variable (COPD: have you ever had COPD) to the query
Next we will find the full variable name for "COPD: have you ever had COPD" using the `simplified_name` column and filter to this data.

In [29]:
mask_copd = copdDict['simplified_name'] == "COPD: have you ever had COPD" # Where is this variable in the dictionary?
copd_varname = copdDict.loc[mask_copd, "name"] # Filter to only that variable
copd_query.filter().add(copd_varname, "Yes")

<PicSureHpdsLib.PicSureHpdsAttrListKeyValues.AttrListKeyValues at 0x7fb63f19a2b0>

#### Add genomic variable (Gene_with_variant) to the query
To add the genomic filter, we can use a dictionary to find the variable `Gene_with_variant` and filter to the *SERPINA1* gene.

In [30]:
copd_dictionary = resource.dictionary()
gene_dictionary = copd_dictionary.find("Gene_with_variant")
gene_varname = gene_dictionary.keys()[0]
copd_query.filter().add(gene_varname, "SERPINA1")

<PicSureHpdsLib.PicSureHpdsAttrListKeyValues.AttrListKeyValues at 0x7fb63f19a2b0>

#### Add genomic variable (Variant_severity) to the query
Finally, we can filter our results to include only variants of the *SERPINA1* gene with high or moderate severity. 

In [31]:
severity_dictionary = copd_dictionary.find("Variant_severity")
severity_varname = severity_dictionary.keys()[0]
copd_query.filter().add(severity_varname, ["HIGH", "MODERATE"])

<PicSureHpdsLib.PicSureHpdsAttrListKeyValues.AttrListKeyValues at 0x7fb63f19a2b0>

#### Retrieve data from the query
Now that the filtering is complete, we can use this final query to get counts and perform analysis on the data.

In [32]:
copd_query.getCount()

2304

In [33]:
copd_result = copd_query.getResultsDataFrame(low_memory=False)
copd_result.shape

(2304, 6)

In [34]:
copd_result.head()

Unnamed: 0,Patient ID,"\Genetic Epidemiology of COPD (COPDGene) Funded by the National Heart, Lung, and Blood Institute ( phs000179 )\Subject ID, died center, age at enrolment, race, ethnic, gender, body weight, body height, BMI, systolic and diastolic blood pressure, measurement of several parameters during 6 minutes work, CT slicer, CT scanner, heart rate, oxygen saturation and therapy, medical history of back pain, cancer, cardio vascular diseases, diabetes, digestive system diseases, eye diseases, general health, musculoskeletal diseases, painful joint type, respiratory tract disease, smoking, and walking limbs, medication history of treatment with beta-agonist, theophylline, inhaled corticosteroid, Oral corticosteroids, ipratropium bromide, and tiotroprium bromide, respiratory disease, St. George's Respiratory Questionnaire, SF-36 Health Survey, spirometry, and VIDA of participants with or without chronic obstructive pulmonary disease and involved in the 'Genetic Epidemiology of COPD (COPDGene) Funded by the National Heart, Lung, and Blood Institute' project.\COPD: have you ever had COPD\",\_Parent Study Accession with Subject ID\,\_Topmed Study Accession with Subject ID\,\_consents\,\_topmed_consents\
0,35416,Yes,phs000179.v6_COPDGene_A00282,phs000951.v4_COPDGene_A00282,phs000179.c2,phs000951.c2
1,35421,Yes,phs000179.v6_COPDGene_A01220,phs000951.v4_COPDGene_A01220,phs000179.c1,phs000951.c1
2,35428,Yes,phs000179.v6_COPDGene_A04559,phs000951.v4_COPDGene_A04559,phs000179.c1,phs000951.c1
3,35429,Yes,phs000179.v6_COPDGene_A04808,phs000951.v4_COPDGene_A04808,phs000179.c1,phs000951.c1
4,35430,Yes,phs000179.v6_COPDGene_A05032,phs000951.v4_COPDGene_A05032,phs000179.c1,phs000951.c1
