# PIC-SURE API Use-Case: Querying on Genomic Variables

This tutorial notebook is intended to get you up and running quickly with the R PIC-SURE API. It covers the main functionalities of the API.

## PIC-SURE R API 
### What is PIC-SURE? 


As part of the BioData Catalyst initiative, the Patient Information Commons Standard Unification of Research Elements (PIC-SURE) platform has been integrating clinical and genomic datasets from multiple TOPMed and TOPMed-related studies funded by the National Heart, Lung, and Blood Institute (NHLBI).

The original data exposed through the PIC-SURE API are organized in many heterogeneous ways. PIC-SURE hides this complexity and exposes the different study datasets in a single tabular format. By easing the process of data extraction, it allows investigators to focus on downstream analyses and facilitates reproducible science.

Both phenotypic and genomic variables are accessible through the PIC-SURE API.

### More about PIC-SURE
The API is available in two programming languages, Python and R, enabling investigators to query the databases the same way using either language.

The R/Python PIC-SURE API is a small part of the entire PIC-SURE platform.

The R API is actively developed by the Avillach Lab at Harvard Medical School.

PIC-SURE API R Library GitHub repos:
* https://github.com/hms-dbmi/pic-sure-r-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-r-client
* https://github.com/hms-dbmi/pic-sure-biodatacatalyst-r-adapter-hpds



 -------

# Getting your own user-specific security token

**Before running this notebook, please be sure to review the "Get your security token" documentation, which exists in the NHLBI_BioData_Catalyst [README.md file](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token). It explains how to get a security token, which is mandatory to access the databases.**

# Environment set-up

### Pre-requisites
- R 3.4 or later
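
A quick way to confirm the prerequisite is to compare the running R version against the stated minimum. This is a minimal sketch using only base R; the variable names are illustrative.

```r
# Compare the running R version against the stated minimum (3.4).
minimum_version <- "3.4.0"
r_version_ok <- getRversion() >= minimum_version

if (!r_version_ok) {
  warning("R ", as.character(getRversion()),
          " is older than the required ", minimum_version)
}
```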

### Install packages

In [1]:
source("R_lib/requirements.R")

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


installing: 
-  ggplot2 
-  dplyr 
-  tidyr 
-  urltools 
-  devtools 
-  ggrepel 


also installing the dependencies ‘systemfonts’, ‘textshaping’, ‘ragg’, ‘pkgdown’

“installation of package ‘pkgdown’ had non-zero exit status”
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: usethis


#### Installing the latest PIC-SURE API library from GitHub

The next cell installs the components of the PIC-SURE API from GitHub: the PIC-SURE client, the PIC-SURE HPDS adapter, and the BioData Catalyst HPDS adapter.

In [2]:
Sys.setenv(TAR = "/bin/tar")
options(unzip = "internal")
install.packages("https://cran.r-project.org/src/contrib/Archive/devtools/devtools_1.13.6.tar.gz", repos=NULL, type="source")
install.packages("https://cran.r-project.org/src/contrib/R6_2.5.0.tar.gz", repos=NULL, type="source")
install.packages("https://cran.r-project.org/src/contrib/hash_2.2.6.1.tar.gz", repos=NULL, type="source")
install.packages(c("urltools"),repos = "http://cran.us.r-project.org")
devtools::install_github("hms-dbmi/pic-sure-r-client", force=TRUE)
devtools::install_github("hms-dbmi/pic-sure-r-adapter-hpds", force=TRUE)
devtools::install_github("hms-dbmi/pic-sure-biodatacatalyst-r-adapter-hpds", force=TRUE)

“unable to access index for repository http://cran.us.r-project.org/src/contrib:
“package ‘urltools’ is not available (for R version 3.6.3)”
Downloading GitHub repo hms-dbmi/pic-sure-r-client@HEAD



✔  checking for file ‘/tmp/RtmpW4gvHw/remotes600a55d003f9/hms-dbmi-pic-sure-r-client-115deb5/DESCRIPTION’ ...
─  preparing ‘picsure’:
✔  checking DESCRIPTION meta-information ... 
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘picsure_0.1.0.tar.gz’
   


Downloading GitHub repo hms-dbmi/pic-sure-r-adapter-hpds@HEAD



✔  checking for file ‘/tmp/RtmpW4gvHw/remotes600a429d1b66/hms-dbmi-pic-sure-r-adapter-hpds-2cee5ee/DESCRIPTION’ ...
─  preparing ‘hpds’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘hpds_0.1.1.tar.gz’
   


Downloading GitHub repo hms-dbmi/pic-sure-biodatacatalyst-r-adapter-hpds@HEAD



✔  checking for file ‘/tmp/RtmpW4gvHw/remotes600ad74981e/hms-dbmi-pic-sure-biodatacatalyst-r-adapter-hpds-d019468/DESCRIPTION’ ... 
─  preparing ‘bdc’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘bdc_0.1.0.tar.gz’
   


In [3]:
library(stringr)
library(dplyr)

In [4]:
source("R_lib/utils.R")
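
Before moving on, it can be useful to confirm that the packages installed and loaded above are actually available, since some installation steps may fail quietly. The sketch below only reports missing packages and does not install anything. The package names `picsure`, `hpds`, and `bdc` are taken from the build output above.

```r
# Packages this notebook relies on (names taken from the cells and output above).
required_pkgs <- c("picsure", "hpds", "bdc", "stringr", "dplyr")

# requireNamespace() returns FALSE instead of raising an error when a package is absent.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```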

## Connecting to a PIC-SURE resource

The following is required to get access to data through the PIC-SURE API: 
- Network URL
- Resource ID
- User-specific security token


If you have not already retrieved your user-specific token, please refer to the "Get your security token" section of the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token) file.

In [5]:
PICSURE_network_URL <- "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id <- "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file <- "token.txt"

In [6]:
token <- scan(token_file, what = "character")
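
Because `scan()` splits its input on whitespace, a token file with stray spaces or line breaks can produce more than one element. The following is a minimal sketch of a stricter reader. `read_token` is a hypothetical helper, not part of the PIC-SURE packages, and the demo token is a placeholder written to a temporary file.

```r
# Hypothetical helper: read a token file and return a single trimmed string.
read_token <- function(path) {
  if (!file.exists(path)) {
    stop("Token file not found: ", path)
  }
  token <- trimws(paste(readLines(path, warn = FALSE), collapse = ""))
  if (!nzchar(token)) {
    stop("Token file is empty: ", path)
  }
  token
}

# Demonstration with a temporary file holding a placeholder value.
demo_file <- tempfile(fileext = ".txt")
writeLines("  example.token.value  ", demo_file)
demo_token <- read_token(demo_file)
```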

In [7]:
connection <- picsure::connect(url = PICSURE_network_URL,
                               token = token)

[1] "02e23f52-f354-4e8b-992c-d37c8b9ba140"
[2] "70c837be-5ffc-11eb-ae93-0242ac130002"


In [8]:
resource <- bdc::get.resource(connection,
                               resourceUUID = resource_id)

[1] "Loading data dictionary... (takes a minute)"


Two objects are created here: a `connection` and a `resource` object.

Since we will only be using a single resource, **the `resource` object is the only one we will need to proceed with the data analysis**.

It is connected to the specific resource we supplied and enables us to query and retrieve data from this database.

## Building the query with the PIC-SURE API

We are going to create a new query request from the PIC-SURE resource that was specified above. For this example, we will limit the query to a single study, two phenotypic filters (sex and age range), and one genomic filter.

First, we will create a new query instance.

In [9]:
my_query <- bdc::new.query(resource=resource)


#### Limiting the query to a single study

By default, new query objects are automatically populated with all the consent groups you are authorized to access. For this example, we are going to clear the existing consents and specify a single consent group that represents accessing only the NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) study.

In [10]:
# Here we show all the studies that you have access to
bdc::query.show(bdc::new.query(resource = resource))

{
    "query": {
        "fields": [
            "\\_Topmed Study Accession with Subject ID\\",
            "\\_Parent Study Accession with Subject ID\\"
        ],
        "crossCountFields": [

        ],
        "requiredFields": [

        ],
        "anyRecordOf": [

        ],
        "numericFilters": {

        },
        "categoryFilters": {
            "\\_consents\\": [
                "phs001001.c1",
                "phs001217.c1",
                "phs001217.c0",
                "phs001001.c2",
                "phs001345.c1",
                "phs000820.c1",
                "phs000946.c1",
                "phs000956.c2",
                "phs001368.c2",
                "phs001402.c1",
                "phs000956.c0",
                "phs001189.c1",
                "phs001368.c1",
                "phs001387.c0",
                "phs001143.c1",
                "phs001143.c0",
                "phs000988.c1",
                "phs000286.c4",
                "phs001207.c1",
        

NULL

In [11]:
# Here we delete the default consent filter so that we can add only the SAGE consent code
bdc::query.filter.delete(my_query, "\\_consents\\")

In [12]:
# Here we add only the consent code corresponding to the SAGE study
bdc::query.filter.add(query = my_query,
                      keys = "\\_consents\\",
                      as.list(c("phs000921.c2")))

*Note that trying to manually add a consent group which you are not authorized to access will result in errors downstream.*
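
One way to catch such a mistake early is to check a consent code against the list you are authorized for before adding it to the query. The sketch below uses an illustrative subset of codes. In practice the list would come from the `\\_consents\\` entry printed for a fresh query. `check_consent` is a hypothetical helper, and the `phs` + six digits + `.c` + number pattern is an assumption based on the codes shown in this notebook.

```r
# Illustrative subset of authorized consent codes (not a real authorization list).
authorized_consents <- c("phs001001.c1", "phs001217.c1", "phs000921.c2")

# Hypothetical helper: stop early if a code is malformed or not authorized.
check_consent <- function(code, authorized) {
  if (!grepl("^phs[0-9]{6}\\.c[0-9]+$", code)) {
    stop("Malformed consent code: ", code)
  }
  if (!code %in% authorized) {
    stop("Not authorized for consent group: ", code)
  }
  invisible(code)
}

sage_consent <- check_consent("phs000921.c2", authorized_consents)
```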

#### List available phenotypic variables

Once a connection to the desired resource has been established, it is helpful to search for variables related to our query. To this end, we will use the `bdc::find.in.dictionary` function to search the data dictionary of the `resource` object, and `bdc::extract.entries` to convert the search results into a data frame.

In [13]:
# search with an empty term to retrieve the full variable dictionary
fullVariablesDict <- bdc::find.in.dictionary(resource, "") %>% bdc::extract.entries()

# extract the phenotypic variables
fullPhenotypeVars <- fullVariablesDict[fullVariablesDict$HpdsDataType == "phenotypes", ]

# display phenotypic vars for SAGE study
fullPhenotypeVars[stringr::str_detect(fullPhenotypeVars$name, fixed("(SAGE)")), c("name", "patientCount", "observationCount", "categorical", "min", "max", "HpdsDataType")]

Unnamed: 0_level_0,name,patientCount,observationCount,categorical,min,max,HpdsDataType
Unnamed: 0_level_1,<fct>,<int>,<int>,<lgl>,<dbl>,<dbl>,<chr>
15,"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\This sample attributes table includes body site where sample was collected, analyte type, tumor status, sequencing center, funding source, TOPMed phase, project, and study name.\Analyte Type\",2105,2105,True,,,phenotypes
4609,"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\The subject consent data table contains subject IDs, consent group information, and subject aliases.\Consent group as determined by DAC\",2106,2106,True,,,phenotypes
4979,"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\No2 Air Pollution measurement lifetime average\",1708,1708,True,,,phenotypes
7186,"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\De-identified Subject ID\",2106,2106,True,,,phenotypes
12042,"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\This subject sample mapping data table includes a mapping of study subject IDs to sample IDs. Samples are the final preps submitted for genotyping, sequencing, and/or expression data. For example, if one patient (subject ID) gave one sample, and that sample was processed differently to generate 2 sequencing runs, there would be two rows, both using the same subject ID, but having 2 unique sample IDs. The data table also includes sample source.\Subject ID of Phenotype Data\",2105,2105,True,,,phenotypes
14455,"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\This sample attributes table includes body site where sample was collected, analyte type, tumor status, sequencing center, funding source, TOPMed phase, project, and study name.\Tumor Status\",2105,2105,True,,,phenotypes
15017,"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\This sample attributes table includes body site where sample was collected, analyte type, tumor status, sequencing center, funding source, TOPMed phase, project, and study name.\TOPMed Phase\",2105,2105,False,1.0,3.0,phenotypes
30201,"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\The subject consent data table contains subject IDs, consent group information, and subject aliases.\Subject ID\",2106,2106,True,,,phenotypes
34514,"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\Subject age\",2104,2104,False,7.3,41.0,phenotypes
39266,"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\Bronchodilator response after 4 puffs of Proventil HFA Albuterol\",834,834,True,,,phenotypes


#### Add categorical phenotypic variable (gender) to the query

The `bdc::find.in.dictionary` function retrieves matching records by searching for a specific term, and `bdc::extract.entries` converts the result into a data frame with information about each matching variable. For instance, looking for variables containing the term `Sex of participant` is done this way:

In [14]:
found_terms <- bdc::find.in.dictionary(resource = resource, 
                                        term = "Sex of participant")

We will now look for variables containing the term `Sex of participant` whose names also contain `(SAGE)`. This will allow us to find the specific name associated with our variable of interest and also which values of the sex variable are valid to add to our query.
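
Matching `(SAGE)` literally matters because parentheses are regular-expression metacharacters: as a pattern, `(SAGE)` is a capture group that matches the bare text `SAGE`. The sketch below illustrates the difference using base R's `grepl()` with `fixed = TRUE`, which plays the same role here as wrapping the pattern in stringr's `fixed()`. The variable names are made up for illustration.

```r
# Made-up variable names; only the first contains the literal text "(SAGE)".
var_names <- c(
  "\\Example Study (SAGE) Study\\Sex of participant\\",
  "\\Example SAGE-like Study\\Sex of participant\\",
  "\\Another Study\\Sex of participant\\"
)

# As a regular expression, "(SAGE)" is a group that matches "SAGE" anywhere.
regex_match <- grepl("(SAGE)", var_names)

# As a fixed string, the parentheses must appear in the name.
literal_match <- grepl("(SAGE)", var_names, fixed = TRUE)
```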

In [15]:
# View information about the "Sex of participant" variable for the "(SAGE)" study
found_terms_df <- bdc::extract.entries(found_terms)
found_terms_df[stringr::str_detect(found_terms_df$name, fixed("(SAGE)")), ]

Unnamed: 0_level_0,categorical,observationCount,patientCount,name,min,max,HpdsDataType,categoryValues,description
Unnamed: 0_level_1,<lgl>,<int>,<int>,<fct>,<dbl>,<dbl>,<chr>,<list>,<lgl>
5,True,2106,2106,"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\Sex of participant\",,,phenotypes,"FEMALE, MALE , NA",


The above dictionary entry shows that we can select "FEMALE", "MALE", or "NA" for gender.  For this example, let's limit our search to females.

In [16]:
bdc::query.filter.add(query = my_query, 
                       keys = "\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\\Sex of participant\\",
                       values = 'FEMALE')

In [17]:
bdc::query.show(my_query)

{
    "query": {
        "fields": [
            "\\_Topmed Study Accession with Subject ID\\",
            "\\_Parent Study Accession with Subject ID\\"
        ],
        "crossCountFields": [

        ],
        "requiredFields": [

        ],
        "anyRecordOf": [

        ],
        "numericFilters": {

        },
        "categoryFilters": {
            "\\_consents\\": [
                "phs000921.c2"
            ],
            "\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\\Sex of participant\\": [
                "FEMALE"
            ]
        },
        "variantInfoFilters": [
            {
                "categoryVariantInfoFilters": {

                },
                "numericVariantInfoFilters": {

                }
            }
        ],
        "expectedResultType": "DATAFRAME"
    },
    "resourceUUID": "02e23f52-f354-4e8b-992c-d37c8b9ba140"
}
 


NULL

#### Add continuous phenotypic variable (age) to the query

Following the data dictionary search pattern just shown, we can search for the SAGE study variables related to `Subject Age`.

In [18]:
# View information about the "subject age" variable
found_terms <- bdc::find.in.dictionary(resource = resource,
                                        term = "Subject Age")
found_terms_df <- bdc::extract.entries(found_terms)
found_terms_df[stringr::str_detect(found_terms_df$name, fixed("(SAGE)")), ]

Unnamed: 0_level_0,min,categorical,observationCount,patientCount,max,name,HpdsDataType,categoryValues,description
Unnamed: 0_level_1,<dbl>,<lgl>,<int>,<int>,<dbl>,<fct>,<chr>,<list>,<lgl>
21,7.3,False,2104,2104,41,"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\Subject age\",phenotypes,,


The dictionary entry in the output above shows the age range of data available for `Subject Age`.  

For this example, let's limit our search to a minimum age of 8 and maximum age of 35.

In [19]:
bdc::query.filter.add(query = my_query,
                       keys = "\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\\Subject age\\",
                       min = 8,
                       max = 35)
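
Before adding a numeric filter, it can help to confirm that the requested range overlaps the range reported in the dictionary (7.3 to 41 for this variable). This is a minimal sketch in base R. `check_range` is a hypothetical helper, not part of the `bdc` package, and it clips the request to the available range.

```r
# Hypothetical helper: validate a requested range against the dictionary range
# and clip it to the values that actually exist in the data.
check_range <- function(lower, upper, dict_min, dict_max) {
  stopifnot(is.numeric(lower), is.numeric(upper), lower <= upper)
  if (upper < dict_min || lower > dict_max) {
    stop("Requested range does not overlap the available data.")
  }
  c(lower = max(lower, dict_min), upper = min(upper, dict_max))
}

# Range used in this example, checked against the dictionary entry for age.
age_range <- check_range(8, 35, dict_min = 7.3, dict_max = 41)

# A wider request is clipped to the available range.
clipped_range <- check_range(5, 50, dict_min = 7.3, dict_max = 41)
```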

#### List available genotypic variables
To start adding genomic filters to our query, we first need to understand which genomic variables exist.

In [20]:
# extract the genomic variables (HpdsDataType == "info"); these are not specific to SAGE
geno_vars <- dplyr::filter(fullVariablesDict, HpdsDataType == "info")

geno_vars$categoryValues[geno_vars$name == "Gene_with_variant"] <- '<<<FULL GENE LIST REMOVED.>>>'
# display genotypic vars
geno_vars

Unnamed: 0_level_0,min,categorical,observationCount,patientCount,max,name,HpdsDataType,categoryValues,description
Unnamed: 0_level_1,<dbl>,<lgl>,<int>,<int>,<dbl>,<fct>,<chr>,<list>,<chr>
210001,,True,,,,Gene_with_variant,info,<<<FULL GENE LIST REMOVED.>>>,"Description=""The official symbol for a gene affected by a variant."""
211000,,True,,,,Variant_class,info,"SNV , insertion, deletion","Description=""A standardized term from the Sequence Ontology (http://www.sequenceontology.org) to describe the type of a variant. Possible values: SNV, deletion, insertion."""
310000,,True,,,,Variant_consequence_calculated,info,"intergenic_variant , start_retained_variant , frameshift_variant , 3_prime_UTR_variant , splice_acceptor_variant , intron_variant , splice_region_variant , upstream_gene_variant , 5_prime_UTR_variant , non_coding_transcript_exon_variant, stop_gained , non_coding_transcript_variant , start_lost , splice_donor_variant , synonymous_variant , missense_variant , mature_miRNA_variant , stop_lost , regulatory_region_variant , downstream_gene_variant , TFBS_ablation , stop_retained_variant , TF_binding_site_variant , coding_sequence_variant , inframe_deletion , protein_altering_variant , inframe_insertion , incomplete_terminal_codon_variant","Description=""A standardized term from the Sequence Ontology (http://www.sequenceontology.org) to describe the calculated consequence of a variant."""
410000,,True,,,,Variant_frequency_as_text,info,"Novel , Rare , Common","Description=""The variant allele frequency in gnomAD exomes of combined population as discrete text categories. Possible values: Novel, Rare (variant frequency less than 1%), Common (variant frequency greater than or equal to 1%)."""
510000,,True,,,,Variant_severity,info,"MODERATE, HIGH , LOW","Description=""The severity for the calculated consequence of a variant on a gene. Possible values: HIGH (frameshift, splice disrupting, or truncating variants), MODERATE (non-frameshift insertions or deletions, variants altering protein sequencing without affecting its length), LOW (other coding sequence variants including synonymous variants), MODIFIER (all others)."""


As shown in the output above, some genomic variables that can be used in queries include `Gene_with_variant`, `Variant_class`, and `Variant_severity`.

Note that, for printing purposes, the full list of genes in the `categoryValues` column of the `Gene_with_variant` row was removed. This is to provide a simpler preview of the genomic variables and to avoid printing thousands of gene names in the dataframe.
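
An alternative to blanking out a long list of values is to summarize each entry by its length and a short preview. The sketch below shows this on a small mock data frame with a list column. The values are illustrative and much shorter than the real gene list.

```r
# Mock dictionary with a list column, mimicking the shape of `categoryValues`.
geno_demo <- data.frame(name = c("Gene_with_variant", "Variant_class"),
                        stringsAsFactors = FALSE)
geno_demo$categoryValues <- list(
  c("A1BG", "A1CF", "A2M", "CHD8"),
  c("SNV", "insertion", "deletion")
)

# Number of allowed values per variable.
geno_demo$n_values <- lengths(geno_demo$categoryValues)

# Short preview: the first three values of each entry.
geno_demo$preview <- vapply(
  geno_demo$categoryValues,
  function(v) paste(head(v, 3), collapse = ", "),
  character(1)
)
```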

#### Add genotypic variable (Gene_with_variant) to the query
Let's use `Gene_with_variant` to view a list of genes and get more information about this variable.

In [21]:
# View gene list from the "Gene_with_variant" variable
found_terms <- bdc::find.in.dictionary(resource = resource,
                                        term = "Gene_with_variant") %>% bdc::extract.entries()
gene_list <- found_terms$categoryValues[found_terms$name == "Gene_with_variant"]
print(sort(gene_list[[1]]))

    [1] "5_8S_rRNA"                 "5S_rRNA"                  
    [3] "7SK"                       "A1BG"                     
    [5] "A1CF"                      "A2M"                      
    [7] "A2ML1"                     "A2ML1-AS1"                
    [9] "A2MP1"                     "A3GALT2"                  
   [11] "A4GALT"                    "A4GNT"                    
   [13] "AA06"                      "AAAS"                     
   [15] "AACS"                      "AACSP1"                   
   [17] "AADAC"                     "AADACL2"                  
   [19] "AADACL2-AS1"               "AADACL3"                  
   [21] "AADACL4"                   "AADACP1"                  
   [23] "AADAT"                     "AAED1"                    
   [25] "AAGAB"                     "AAK1"                     
   [27] "AAMDC"                     "AANAT"                    
   [29] "AAR2"                      "AARD"                     
   [31] "AARS"                      "AAR

The output shown above provides a list of values that can be used for this variable, in this case genes affected by a variant. Let's narrow our query to include the CHD8 gene.
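
With thousands of gene symbols, it is often easier to search the list than to read it. The following sketch looks up genes by prefix and checks that a specific symbol is present. It uses a short illustrative vector in place of the real gene list, and `find_genes` is a hypothetical helper.

```r
# Illustrative stand-in for the full gene list.
genes <- c("A1BG", "CHD1", "CHD7", "CHD8", "SERPINA1")

# Hypothetical helper: return all gene symbols starting with a given prefix.
find_genes <- function(genes, prefix) {
  sort(genes[startsWith(genes, prefix)])
}

chd_genes <- find_genes(genes, "CHD")

# Confirm that the gene of interest is a valid value before filtering on it.
has_chd8 <- "CHD8" %in% genes
```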



In [22]:
# Look for entries with variants in the CHD8 gene 
bdc::query.filter.add(query = my_query,
                       keys = "Gene_with_variant",
                       values = "CHD8")

Now that all query criteria have been entered into the query instance, we can view it by using the following line of code:

In [23]:
# Now we show the query as it is specified
bdc::query.show(query = my_query)

{
    "query": {
        "fields": [
            "\\_Topmed Study Accession with Subject ID\\",
            "\\_Parent Study Accession with Subject ID\\"
        ],
        "crossCountFields": [

        ],
        "requiredFields": [

        ],
        "anyRecordOf": [

        ],
        "numericFilters": {
            "\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\\Subject age\\": {
                "min": 8,
                "max": 35
            }
        },
        "categoryFilters": {
            "\\_consents\\": [
                "phs000921.c2"
            ],
            "\\_topmed_consents\\": [
                "phs001215.c0",
                "phs001217.c1",
                "phs001217.c0",
                "phs001345.c1",
                "phs001215.c1",
                "phs000946.c1",
                "phs000954.c1",
                "phs000921.c2",
                "phs001368.c2",
                "phs001368.c1",
              

NULL


Next, we will take this query and retrieve the data for patients with matching criteria.

## Retrieving data from the query

#### Getting query count

We have now built a query called `my_query` which contains the search criteria we are interested in.

Next, we will run a count query to find the number of matching participants.

Finally, we will run a data query to download the data.

In [24]:
my_query_count <- bdc::query.run(query = my_query,
                                  result.type = "count")
print(my_query_count)

[1] 215
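
Running the count first makes it possible to guard against unexpectedly large downloads. The sketch below wraps this pattern in a hypothetical helper, `fetch_if_small`, which takes a function that runs the query for a given result type. A stub stands in for the real query here so that the example is self-contained. The `max_rows` threshold is an arbitrary illustrative value.

```r
# Hypothetical helper: download the data only if the count is below a threshold.
fetch_if_small <- function(run_query, max_rows = 10000) {
  n <- run_query("count")
  if (n > max_rows) {
    stop("Query matches ", n, " participants, above the limit of ", max_rows)
  }
  run_query("dataframe")
}

# Stub standing in for the real query; it returns a count or a mock data frame.
stub_run <- function(result.type) {
  if (result.type == "count") {
    215
  } else {
    data.frame(id = seq_len(215))
  }
}

demo_df <- fetch_if_small(stub_run, max_rows = 1000)
```

With a live connection, the stub could be replaced by `function(rt) bdc::query.run(query = my_query, result.type = rt)`.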


#### Getting query data

Once our query object is finally built, we set `result.type = "dataframe"` to retrieve the data corresponding to our query.

In [25]:
my_query_df <- bdc::query.run(query = my_query,
                               result.type = "dataframe")

In [26]:
dim(my_query_df)

In [27]:
head(my_query_df, n=5)

Unnamed: 0_level_0,Patient ID,"\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\Sex of participant\","\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\Subject age\",\_Parent Study Accession with Subject ID\,\_Topmed Study Accession with Subject ID\,\_consents\,\_topmed_consents\
Unnamed: 0_level_1,<int>,<fct>,<dbl>,<lgl>,<fct>,<fct>,<fct>
1,417704,FEMALE,9.5,,phs000921.v4_BUR02262015B,phs000921.c2,phs000921.c2
2,417708,FEMALE,9.7,,phs000921.v4_BUR02262025B,phs000921.c2,phs000921.c2
3,417710,FEMALE,12.7,,phs000921.v4_BUR02262028B,phs000921.c2,phs000921.c2
4,417713,FEMALE,13.9,,phs000921.v4_BUR02262034B,phs000921.c2,phs000921.c2
5,417717,FEMALE,17.6,,phs000921.v4_BUR02262044B,phs000921.c2,phs000921.c2
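
The column names returned by the query are full variable paths, which are unwieldy in downstream code. One option is to rename each column to the last segment of its path. This is a sketch on a mock data frame that mirrors the output above. `simplify_name` is a hypothetical helper, and this approach assumes the last segments are unique, which may not hold when variables from several studies are combined.

```r
# Mock result with the same kind of column names as the query output.
demo_result <- data.frame(c(417704, 417708), c("FEMALE", "FEMALE"), c(9.5, 9.7),
                          stringsAsFactors = FALSE)
names(demo_result) <- c(
  "Patient ID",
  "\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\\Sex of participant\\",
  "\\NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study ( phs000921 )\\Subject age\\"
)

# Hypothetical helper: keep only the last non-empty backslash-delimited segment.
simplify_name <- function(path) {
  parts <- strsplit(path, "\\", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts)]
  parts[length(parts)]
}

names(demo_result) <- vapply(names(demo_result), simplify_name, character(1),
                             USE.NAMES = FALSE)
```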


# Data analysis example: *SERPINA1* gene and COPD

In this example, we will create a query to explore the relationship between the COPD phenotype and variants in the SERPINA1 gene. Variations of the SERPINA1 gene have been found to be a strong risk factor for COPD, which you can read more about [here](https://pubmed.ncbi.nlm.nih.gov/31661293/).

To explore this relationship, we will narrow the cohort down to participants that meet the following criteria:
* participated in the COPDGene study
* have had COPD
* have a *SERPINA1* gene variant with high or moderate severity

#### Initialize the query
Let's start by creating a new query and finding the variables pertaining to the COPDGene study using a dictionary.

In [28]:
copd_query <- bdc::new.query(resource=resource)
copd_dictionary <- bdc::find.in.dictionary(resource = resource,
                                           term = "COPDGene") %>% bdc::extract.entries()
copdDict <- get_multiIndex_variablesDict(copd_dictionary)

#### Add phenotypic variable (COPD: have you ever had COPD) to the query

Next we will find the full variable name for "COPD: have you ever had COPD" using the `simplified_name` column and filter to this data.

In [29]:
mask_copd <- copdDict['simplified_name'] == 'COPD: have you ever had COPD'
copd_varname <- copdDict[mask_copd, 'name'] %>%
    unlist() %>%
    unname()
copd_varname <- as.character(copd_varname)
bdc::query.filter.add(query=copd_query, keys=copd_varname, value='Yes')
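
A lookup by simplified name can silently return zero or several variable names, which would then be passed on to the filter. The sketch below makes the lookup strict by requiring exactly one match. It uses a mock dictionary with the `simplified_name` and `name` columns used above. `lookup_varname` is a hypothetical helper, and the study names are made up.

```r
# Mock dictionary with the two columns used in the lookup above.
demo_dict <- data.frame(
  simplified_name = c("COPD: have you ever had COPD",
                      "Asthma: have you ever had asthma"),
  name = c("\\Example Study\\COPD: have you ever had COPD\\",
           "\\Example Study\\Asthma: have you ever had asthma\\"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the full variable name, requiring a unique match.
lookup_varname <- function(dict, simplified) {
  hits <- as.character(dict$name[dict$simplified_name == simplified])
  if (length(hits) != 1) {
    stop("Expected exactly one match for '", simplified, "', found ", length(hits))
  }
  hits
}

demo_varname <- lookup_varname(demo_dict, "COPD: have you ever had COPD")
```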

#### Add genomic variable (Gene_with_variant) to the query

To add the genomic filter, we can use a dictionary to find the variable `Gene_with_variant` and filter to the *SERPINA1* gene.

In [30]:
# Look up the genomic variable name in the data dictionary
gene_dictionary <- bdc::find.in.dictionary(resource=resource,
                                           term="Gene_with_variant") %>% bdc::extract.entries()
gene_varname <- gene_dictionary$name
bdc::query.filter.add(query=copd_query, keys=gene_varname, value='SERPINA1')

#### Add genomic variable (Variant_severity) to the query
Finally, we can filter our results to include only variants of the *SERPINA1* gene with high or moderate severity.

In [31]:
severity_dictionary <- bdc::find.in.dictionary(resource=resource,
                                              term = 'Variant_severity') %>% bdc::extract.entries()
severity_varname <- severity_dictionary$name
bdc::query.filter.add(query=copd_query, keys=severity_varname, value=list("HIGH", "MODERATE"))

#### Retrieve data from the query

Now that the filtering is complete, we can use this final query to get counts and perform analysis on the data.

In [32]:
copd_results <- bdc::query.run(copd_query, result.type='dataframe')

In [33]:
dim(copd_results)

In [34]:
head(copd_results)

Unnamed: 0_level_0,Patient ID,"\Genetic Epidemiology of COPD (COPDGene) Funded by the National Heart, Lung, and Blood Institute ( phs000179 )\Subject ID, died center, age at enrolment, race, ethnic, gender, body weight, body height, BMI, systolic and diastolic blood pressure, measurement of several parameters during 6 minutes work, CT slicer, CT scanner, heart rate, oxygen saturation and therapy, medical history of back pain, cancer, cardio vascular diseases, diabetes, digestive system diseases, eye diseases, general health, musculoskeletal diseases, painful joint type, respiratory tract disease, smoking, and walking limbs, medication history of treatment with beta-agonist, theophylline, inhaled corticosteroid, Oral corticosteroids, ipratropium bromide, and tiotroprium bromide, respiratory disease, St. George's Respiratory Questionnaire, SF-36 Health Survey, spirometry, and VIDA of participants with or without chronic obstructive pulmonary disease and involved in the 'Genetic Epidemiology of COPD (COPDGene) Funded by the National Heart, Lung, and Blood Institute' project.\COPD: have you ever had COPD\",\_Parent Study Accession with Subject ID\,\_Topmed Study Accession with Subject ID\,\_consents\,\_topmed_consents\
Unnamed: 0_level_1,<int>,<fct>,<fct>,<fct>,<fct>,<fct>
1,35416,Yes,phs000179.v6_COPDGene_A00282,phs000951.v4_COPDGene_A00282,phs000179.c2,phs000951.c2
2,35421,Yes,phs000179.v6_COPDGene_A01220,phs000951.v4_COPDGene_A01220,phs000179.c1,phs000951.c1
3,35428,Yes,phs000179.v6_COPDGene_A04559,phs000951.v4_COPDGene_A04559,phs000179.c1,phs000951.c1
4,35429,Yes,phs000179.v6_COPDGene_A04808,phs000951.v4_COPDGene_A04808,phs000179.c1,phs000951.c1
5,35430,Yes,phs000179.v6_COPDGene_A05032,phs000951.v4_COPDGene_A05032,phs000179.c1,phs000951.c1
6,35431,Yes,phs000179.v6_COPDGene_A05113,phs000951.v4_COPDGene_A05113,phs000179.c1,phs000951.c1
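
As a first step in analyzing the results, it can be useful to tabulate participants by consent group. The sketch below does this on a mock data frame that reproduces the `Patient ID` and `\\_consents\\` columns of the six rows shown above. With real data, `copd_results` would take the place of `demo_results`.

```r
# Mock data reproducing two columns of the six rows shown above.
demo_results <- data.frame(
  c(35416, 35421, 35428, 35429, 35430, 35431),
  c("phs000179.c2", rep("phs000179.c1", 5)),
  stringsAsFactors = FALSE
)
names(demo_results) <- c("Patient ID", "\\_consents\\")

# Count participants per consent group.
consent_counts <- table(demo_results[["\\_consents\\"]])
print(consent_counts)
```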
