# PIC-SURE API tutorial using the Undiagnosed Diseases Network (UDN) database
This is a tutorial notebook, aimed to be quickly up and running with the R PIC-SURE API. It covers the main functionalities of the API.

## R PIC-SURE API
### What is PIC-SURE?
Databases exposed through the PIC-SURE API encompass a wide heterogeneity of architectures and data organizations underneath. PIC-SURE hides this complexity and exposes the different databases in the same format, allowing researchers to focus on the analysis and medical insights, thus easing the process of reproducible sciences.

### More about PIC-SURE
PIC-SURE stands for Patient-centered Information Commons: Standardized Unification of Research Elements. The API is available in two different programming languages, Python and R, allowing investigators to query databases in the same way using either of those languages.

PIC-SURE is a larger project from which the R/python PIC-SURE API is only a brick. Among other things, PIC-SURE also offers a graphical user interface that allows researchers to explore variables across multiple studies, filter patients that match criteria, and create cohorts from this interactive exploration.

The API is actively developed by the Avillach-Lab at Harvard Medical School.

GitHub repo:

* https://github.com/hms-dbmi/pic-sure-r-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-r-client
* https://github.com/hms-dbmi/pic-sure-biodatacatalyst-r-adapter-hpds

---

## Getting your own user-specific security token
**Before running this notebook, please be sure to review the "Get your security token" documentation, which exists in the NHLBI_BioData_Catalyst [README.md file](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token). It explains about how to get a security token, which is mandatory to access the databases.**

### Environment set-up

#### Pre-requisites: 
* R version >= 3.6

#### Package installation and imports
The installation of some packages may take some time, please be patient.
- packages listed in the `requirements.R` file
- PIC-SURE API components (from Github)
    - PIC-SURE Adapter
    - PIC-SURE Client

#### Install latest R PIC-SURE API libraries from GitHub
To install the PIC-SURE libraries from GitHub, we need to install first the `devtools` package.

In [1]:
system(command = 'conda install -c conda-forge r-devtools --yes')

In [2]:
Sys.setenv(TAR = "/bin/tar")
options(unzip = "internal")
install.packages("https://cran.r-project.org/src/contrib/Archive/devtools/devtools_1.13.6.tar.gz", repos=NULL, type="source")
install.packages("https://cran.r-project.org/src/contrib/R6_2.5.0.tar.gz", repos=NULL, type="source")
install.packages("https://cran.r-project.org/src/contrib/hash_2.2.6.1.tar.gz", repos=NULL, type="source")
install.packages(c("urltools"),repos = "http://cran.us.r-project.org")
devtools::install_github("hms-dbmi/pic-sure-r-client", force=T)
devtools::install_github("hms-dbmi/pic-sure-r-adapter-hpds", force=T)

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done

“Devtools is incompatible with the current version of R. `load_all()` may function incorrectly.”
Downloading GitHub repo hms-dbmi/pic-sure-r-client@master
from URL https://api.github.com/repos/hms-dbmi/pic-sure-r-client/zipball/master

Installing picsure

'/home/ec2-user/anaconda3/envs/R/lib/R/bin/R' --no-site-file --no-environ  \
  --no-save --no-restore --quiet CMD INSTALL  \
  '/tmp/Rtmp32wsAO/devtools444a291c8774/hms-dbmi-pic-sure-r-client-115deb5'  \
  --library='/home/ec2-user/anaconda3/envs/R/lib/R/library' --install-tests 



Downloading GitHub repo hms-dbmi/pic-sure-r-adapter-hpds@master
from URL https://api.github.com/repos/hms-dbmi/pic-sure-r-adapter-hpds/zipball/master

Installing hpds

'/home/ec2-user/anaconda3/envs/R/lib/R/bin/R' --no-site-file --no-environ  \
  --no-save --no-restore --quiet CMD INSTALL  \
  '/tmp/Rtmp32wsAO/devtools444a6e8e77ba/hms-dbmi-pic-sure-r-adapter-hpds-2cee5ee'  \
 

##### Load user-defined functions

In [3]:
# R_lib for pic-sure
source("R_lib/utils.R")


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




## Connecting to a PIC-SURE resource

### 1. Connect to the UDN data network
The following is required to get access to data through the PIC-SURE API: 
- Network URL
- Resource id
- User-specific security token

In [5]:
# Connection to the PIC-SURE API w/ key
# network information
PICSURE_network_URL <- "https://udn.hms.harvard.edu/picsure"
resource_id <- "c23b6814-7e5b-48d2-80d9-65511d7d2051"

In [8]:
# token is the individual user key given to connect to the UDN resource
token_file <- "token.txt"
my_token <- scan(token_file, what = "character")

In [9]:
# get connection object
connection <- picsure::connect(url = PICSURE_network_URL,
                                 token = my_token)

[1] "c23b6814-7e5b-48d2-80d9-65511d7d2051"


In [10]:
# get resource object
resource <- hpds::get.resource(connection,
                               resourceUUID = resource_id)

Two objects are created here: a `connection` and a `resource` object.

Since we will only be using a single resource, **the `resource` object is actually the only one we will need to proceed with data analysis hereafter**.

It is connected to the specific data source ID we specified and enables us to query and retrieve data from this database.

#### Getting help with the R PIC-SURE API

You can get help with PIC-SURE library functions by using the `?` operator

In [11]:
# get function documentation
?hpds::get.resource()

### 2. Explore the data: data structures description

There are two methods to explore the data from which the user get two different data structures: a **dictionary object** to explore variables and a **query object** to explore the patient records in UDN. 

**Methods**:

    * Search variables: find.in.dictionary() method
    * Retrieve data: query() methods

**Data structures**:

    * Dictionary object structure
    * Query object structure
    

#### Explore variables using the _dictionary_

Once a connection to the desired resource has been established, we first need to get a quick idea of which variables are available in the database. To this end, we will use the `dictionary` method of the `resource` object.

A dictionary object offers the possibility to retrieve information about either matching variables according to a specific term or all available variables, using the `find.in.dictionary()` method. For instance, looking for variables containing the term 'aplasia' is done this way:

In [12]:
# create a dictionary object and search for a specific term, in this example for "aplasia"
lookup <- hpds::find.in.dictionary(resource, "aplasia")

We have created the dictionary object with only variables matched by the search term. To retrieve the search result from dictionary objects we have 4 different methods: `extract.count()`, `extract.keys()`, and `extract.entries()`.

In [13]:
# description of the dictionary search content
print(list("Count"   = hpds::extract.count(lookup), 
           "Keys"    = hpds::extract.keys(lookup)[0:2],
           "Entries" = hpds::extract.entries(lookup)[0:2]))

$Count
[1] 1845

$Keys
$Keys[[1]]
[1] "\\04_Clinical symptoms and physical findings (in HPO, from PhenoTips)\\Phenotypic abnormality\\Abnormality of the skeletal system\\Abnormality of limb bone\\Abnormality of limb bone morphology\\Abnormality of digit\\Abnormality of toe\\Abnormality of the phalanges of the toes\\Abnormality of the phalanges of the 5th toe\\Aplasia/Hypoplasia of the phalanges of the 5th toe\\Aplasia/Hypoplasia of the middle phalanx of the 5th toe\\"

$Keys[[2]]
[1] "\\04_Clinical symptoms and physical findings (in HPO, from PhenoTips)\\Phenotypic abnormality\\Abnormality of limbs\\Abnormality of limb bone\\Abnormality of limb bone morphology\\Aplasia involving bones of the extremities\\Aplasia/hypoplasia involving bones of the upper limbs\\Aplasia/hypoplasia involving bones of the hand\\Aplasia/Hypoplasia of fingers\\Aplasia/Hypoplasia of the thumb\\Absent thumb\\"


$Entries
     patientCount categorical
1               1        TRUE
2               1        TRUE
3 

**hpds::extract.entries()** enables us to get the result of the dictionary search in a data.frame format.

In [14]:
# show table of records from the dictionary object
hpds::extract.entries(lookup) %>% tail(, n =2)

Unnamed: 0_level_0,patientCount,categorical,observationCount,name,min,max,HpdsDataType,categoryValues,description
Unnamed: 0_level_1,<int>,<lgl>,<int>,<chr>,<lgl>,<lgl>,<chr>,<list>,<lgl>
1844,1,True,1,"\04_Clinical symptoms and physical findings (in HPO, from PhenoTips)\Phenotypic abnormality\Abnormality of the skeletal system\Abnormality of skeletal morphology\Aplasia/hypoplasia involving the skeleton\Aplasia/hypoplasia of the extremities\Aplasia involving bones of the extremities\Aplasia/hypoplasia involving bones of the lower limbs\Aplasia/Hypoplasia involving bones of the feet\Aplasia/Hypoplasia of toe\Absent toe\Aplasia/Hypoplasia of the distal phalanges of the toes\Aplasia/Hypoplasia of the distal phalanx of the 4th toe\Short distal phalanx of the 4th toe\",,,phenotypes,Positive,
1845,1,True,1,"\04_Clinical symptoms and physical findings (in HPO, from PhenoTips)\Phenotypic abnormality\Abnormality of the skeletal system\Abnormality of skeletal morphology\Aplasia/hypoplasia involving the skeleton\Aplasia/hypoplasia of the extremities\Aplasia involving bones of the extremities\Aplasia/hypoplasia involving bones of the lower limbs\Aplasia/Hypoplasia involving bones of the feet\Aplasia/Hypoplasia of toe\Absent toe\Aplasia/Hypoplasia of the 2nd toe\Short phalanx of the 2nd toe\Short distal phalanx of the 2nd toe\",,,phenotypes,Positive,


We can retrieve information about **ALL** variables. We do it without specifying a term in the dictionary search method:

In [15]:
# we search the whole set of variables
plain_variablesDict <- hpds::find.in.dictionary(resource, "") %>% 
hpds::extract.entries()

In [16]:
# description of the whole dictionary of variables
print(dim(plain_variablesDict))
head(plain_variablesDict, n = 2)

[1] 9248    9


Unnamed: 0_level_0,patientCount,categorical,observationCount,name,min,max,HpdsDataType,categoryValues,description
Unnamed: 0_level_1,<int>,<lgl>,<int>,<chr>,<dbl>,<dbl>,<chr>,<list>,<chr>
1,1,True,1,"\04_Clinical symptoms and physical findings (in HPO, from PhenoTips)\Phenotypic abnormality\Abnormality of the nervous system\Abnormality of nervous system physiology\Seizures\Generalized seizures\Absence seizures\Typical absence seizures\",,,phenotypes,Positive,
2,1,True,1,"\04_Clinical symptoms and physical findings (in HPO, from PhenoTips)\Phenotypic abnormality\Abnormality of the skeletal system\Abnormality of skeletal morphology\Abnormal appendicular skeleton morphology\Abnormality of limb bone morphology\Abnormality of limb epiphysis morphology\Abnormality of upper limb epiphysis morphology\Abnormality of the epiphyses of the hand\Abnormality of the epiphyses of the phalanges of the hand\Abnormality of the epiphyses of the middle phalanges of the hand\Abnormality of the epiphysis of the middle phalanx of the 2nd finger\",,,phenotypes,Positive,


The UDN network resource contains 13414 variables described by 11 data fields:
* name
* HpdsDataType
* description
* categorical
* categoryValues
* values
* continuous
* min
* max
* observationCount
* patientCount

The dictionary provides various information about the variables, such as:

* observationCount: number of entries with non-null value
* categorical: type of the variables, True if categorical, False if continuous/numerical
* min/max: only provided for non-categorical variables
* HpdsDataType: 'phenotypes' or 'genotypes'. Currently, the API only expsoses'phenotypes' variables

Hence, it enables us to:

* Use the various variables information as criteria for variable selection.
* Use the row names of the DataFrame to get the actual variables names, to be used in the query, as shown below.
 
Variable names (`name` **column** in the dataframe), as currently implemented in the API, aren't straightforward to use because:

1. Very long
2. Presence of backslashes that requires modification right after copy-pasting.

However, using the dictionary to select variables can help to deal with this. 

##### Parsing variable names
We can use an utils function, `get_multiIndex()`, defined in R_lib/utils.R, to add a little more information and ease working with variables names.

Although not an official feature of the API, such functionality illustrates how to quickly scan and select groups of related variables.

Printing part of the "parsed names" Dictionary allows to quickly see the tree-like organisation of the variables. Moreover, original and simplified variable names are now stored respectively in the "name" and "simplified_name" columns (simplified variable names is simply the last component of the variable name, which usually makes the most sense to know what each variable is about).

In [17]:
# Display the variables tree hierarchy from the variables name
variablesDict <- get_multiIndex_variablesDict(plain_variablesDict)
head(variablesDict, n = 2)

level_0,level_1,level_2,level_3,level_4,level_5,level_6,level_7,level_8,level_9,⋯,level_13,level_14,simplified_name,name,observationCount,categorical,categoryValues,min,max,HpdsDataType
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<int>,<lgl>,<list>,<dbl>,<dbl>,<chr>
"04_Clinical symptoms and physical findings (in HPO, from PhenoTips)",Phenotypic abnormality,Abnormality of the nervous system,Abnormality of nervous system physiology,Seizures,Generalized seizures,Absence seizures,Typical absence seizures,,,⋯,,,Typical absence seizures,"\04_Clinical symptoms and physical findings (in HPO, from PhenoTips)\Phenotypic abnormality\Abnormality of the nervous system\Abnormality of nervous system physiology\Seizures\Generalized seizures\Absence seizures\Typical absence seizures\",1,True,Positive,,,phenotypes
"04_Clinical symptoms and physical findings (in HPO, from PhenoTips)",Phenotypic abnormality,Abnormality of the skeletal system,Abnormality of skeletal morphology,Abnormal appendicular skeleton morphology,Abnormality of limb bone morphology,Abnormality of limb epiphysis morphology,Abnormality of upper limb epiphysis morphology,Abnormality of the epiphyses of the hand,Abnormality of the epiphyses of the phalanges of the hand,⋯,,,Abnormality of the epiphysis of the middle phalanx of the 2nd finger,"\04_Clinical symptoms and physical findings (in HPO, from PhenoTips)\Phenotypic abnormality\Abnormality of the skeletal system\Abnormality of skeletal morphology\Abnormal appendicular skeleton morphology\Abnormality of limb bone morphology\Abnormality of limb epiphysis morphology\Abnormality of upper limb epiphysis morphology\Abnormality of the epiphyses of the hand\Abnormality of the epiphyses of the phalanges of the hand\Abnormality of the epiphyses of the middle phalanges of the hand\Abnormality of the epiphysis of the middle phalanx of the 2nd finger\",1,True,Positive,,,phenotypes


Below is a simple example to illustrate the ease of use of a multiIndex dictionary. Let's say we are interested in filtering variables related to "aplasias" in the "nervous system".

In [18]:
mask_system <- variablesDict[,3] == "Abnormality of the nervous system"
mask_abnormality <- grepl("Aplasia", variablesDict[["name"]])
filtered_variables <- variablesDict[mask_system & mask_abnormality,]
print(dim(filtered_variables))
head(filtered_variables, n = 2)

[1] 35 23


level_0,level_1,level_2,level_3,level_4,level_5,level_6,level_7,level_8,level_9,⋯,level_13,level_14,simplified_name,name,observationCount,categorical,categoryValues,min,max,HpdsDataType
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<int>,<lgl>,<list>,<dbl>,<dbl>,<chr>
"04_Clinical symptoms and physical findings (in HPO, from PhenoTips)",Phenotypic abnormality,Abnormality of the nervous system,Abnormality of nervous system morphology,Morphological abnormality of the central nervous system,Abnormality of brain morphology,Abnormality of forebrain morphology,Abnormality of the cerebrum,Aplasia/Hypoplasia of the cerebrum,Anencephaly,⋯,,,Arrhinencephaly,"\04_Clinical symptoms and physical findings (in HPO, from PhenoTips)\Phenotypic abnormality\Abnormality of the nervous system\Abnormality of nervous system morphology\Morphological abnormality of the central nervous system\Abnormality of brain morphology\Abnormality of forebrain morphology\Abnormality of the cerebrum\Aplasia/Hypoplasia of the cerebrum\Anencephaly\Arrhinencephaly\",1,True,Positive,,,phenotypes
"04_Clinical symptoms and physical findings (in HPO, from PhenoTips)",Phenotypic abnormality,Abnormality of the nervous system,Abnormality of nervous system morphology,Morphological abnormality of the central nervous system,Abnormality of brain morphology,Abnormality of hindbrain morphology,Abnormality of the metencephalon,Abnormality of the cerebellum,Cerebellar malformation,⋯,Dandy-Walker malformation,,Dandy-Walker malformation,"\04_Clinical symptoms and physical findings (in HPO, from PhenoTips)\Phenotypic abnormality\Abnormality of the nervous system\Abnormality of nervous system morphology\Morphological abnormality of the central nervous system\Abnormality of brain morphology\Abnormality of hindbrain morphology\Abnormality of the metencephalon\Abnormality of the cerebellum\Cerebellar malformation\Abnormality of the cerebellar vermis\Aplasia/Hypoplasia of the cerebellar vermis\Cerebellar vermis hypoplasia\Dandy-Walker malformation\",8,True,Positive,,,phenotypes


Although pretty simple, it can be easily combined with other filters to quickly select necessary variables.

#### Explore patient records using _query_

Beside from the dictionary, the second cornerstone of the API are the query methods (`hpds::query.select`, `hpds::query.require`, `hpds::query.anyof`, `hpds::query.filter`). They are the entering point to **query and retrieve data from the resource**.

First, we need to create a query object.

In [123]:
# create a query object for the resource
my_query <- hpds::new.query(resource = resource)

The query object created will be then passed to the different query methods to build the query:  <font color='orange'>hpds::query.select.add(), hpds::query.require.add(), hpds::query.anyof.add(), and hpds::query.filter.add()</font>. Each of those methods accept a query object, a list of variable names, and eventual additional parameters.

| Method | Arguments / Input | Output|
|--------|-------------------|-------|
| query.select.add() | variable names (string) or list of strings | all variables included in the list (no record subsetting)|
| query.require.add() | variable names (string) or list of strings | all variables; only records that do not contain null values for input variables |
| query.anyof.add() | variable names (string) or list of strings | all variables; only records that contain at least one non-null value for input variables |
| query.filter.add() | variable name and additional filtering values | input variable; only records that match filter criteria |

All those 4 methods can be combined when building a query. The record eventually returned by the query have to meet all the different specified filters.

##### Building the query
Let's say we want to check some demographics about the data in UDN. We will filter to variables that have observation counts > 50% patient counts.

In [124]:
# select demographic variable names
demographicsDict <- hpds::find.in.dictionary(resource, "demographics") %>% 
    hpds::extract.entries()
mask_obs <- demographicsDict %>% filter(observationCount > patientCount * 0.50)
selected_varnames <- mask_obs %>% pull(name) 
print(paste0('We have found ', length(selected_varnames), ' demographics variable(s) which have observation counts > 50% of patient counts (listed below).'))
selected_varnames

[1] "We have found 6 demographics variable(s) which have observation counts > 50% of patient counts (listed below)."


You may warning messages containing the following text when building your query with multiple variables: 
“the condition has length > 1 and only the first element will be used” - this can be ignored.

To double check that your filter has been applied to your query, you can run ```hpds::query.show(query = my_query)```

In [125]:
# build and query for demographics patient data
hpds::query.select.add(query=my_query, keys=selected_varnames)

“the condition has length > 1 and only the first element will be used”
“the condition has length > 1 and only the first element will be used”
“the condition has length > 1 and only the first element will be used”


##### Retrieving the data
Once our query object is  built, we use the `query.run()` method to retrieve the data corresponding to our query.

In [127]:
# retrieve the query result as a dataframe
demographics_data <- hpds::query.run(my_query, result.type="dataframe")

In [128]:
print(dim(demographics_data))

[1] 2048    7


In [129]:
head(demographics_data)

Unnamed: 0_level_0,Patient ID,\00_Demographics\Age at UDN Evaluation (in years)\,\00_Demographics\Age at symptom onset in years\,\00_Demographics\Current age in years\Current age in years\,\00_Demographics\Ethnicity\,\00_Demographics\Gender\,\00_Demographics\Race\
Unnamed: 0_level_1,<int>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>
1,6,25,11,30,Not Hispanic or Latino,Male,Asian
2,16,9,0,11,Not Hispanic or Latino,Male,White
3,17,21,6,24,Hispanic or Latino,Female,White
4,20,9,4,12,Not Hispanic or Latino,Male,Black or African American
5,21,7,0,9,Hispanic or Latino,Female,White
6,24,5,0,10,Not Hispanic or Latino,Male,White


##### Working with Variant Data
You can also use the query object to explore variant data. In this example, let's look at variants for the CHD8 gene.

In [141]:
# create a new query
my_query <- hpds::new.query(resource = resource)

In [142]:
# add a filter for a categorical variant: CHD8
hpds::query.filter.add(query=my_query, keys="Gene_with_variant", "CHD8")

Before calling the full data frame of variants, let's ensure that the approximate total count of variants being returned by our query is of a reasonable size. Queries returning more than 100,000 variants could crash your workbook. 

In [143]:
variantCount <- hpds::query.run(my_query, result.type="variantsApproximateCount")
variantCount


“NAs introduced by coercion”


In [148]:
variant_data <- hpds::query.run(my_query, result.type="variantsDataFrame")
head(variant_data)

Unnamed: 0_level_0,CHROM,POSITION,REF,ALT,Variant_consequence_calculated,Variant_class,Gene_with_variant,Variant_severity,Variant_frequency_as_text,Patients.with.this.variant.in.subset,⋯,UDN336336,UDN906298,UDN556501,UDN889016,UDN679990,UDN909615,UDN373820,UDN343939,UDN340901,UDN001168
Unnamed: 0_level_1,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,14,21380203,G,A,"upstream_gene_variant,downstream_gene_variant,intron_variant",SNV,"SUPT16H,CHD8,LOC107984643,AL135744.1",MODIFIER,Novel,1/901,⋯,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
2,14,21380208,T,C,"upstream_gene_variant,downstream_gene_variant,intron_variant",SNV,"SUPT16H,CHD8,LOC107984643,AL135744.1",MODIFIER,Novel,4/901,⋯,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
3,14,21380286,G,A,"upstream_gene_variant,downstream_gene_variant,intron_variant",SNV,"SUPT16H,CHD8,LOC107984643,AL135744.1",MODIFIER,Novel,1/901,⋯,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
4,14,21380296,ATT,A,"upstream_gene_variant,downstream_gene_variant,intron_variant",deletion,"SUPT16H,CHD8,LOC107984643,AL135744.1",MODIFIER,Novel,1/901,⋯,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
5,14,21380296,A,AT,"upstream_gene_variant,downstream_gene_variant,intron_variant",insertion,"SUPT16H,CHD8,LOC107984643,AL135744.1",MODIFIER,Novel,3/901,⋯,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/0
6,14,21380296,A,ATTT,"upstream_gene_variant,downstream_gene_variant,intron_variant",insertion,"SUPT16H,CHD8,LOC107984643,AL135744.1",MODIFIER,Novel,38/901,⋯,0/0,0/0,0/0,0/0,0/0,0/0,0/0,0/1,0/0,0/0


Another example of a genomic filter is looking at the variant frequency.
- Novel variants are not found in the rest of the population
- Rare variants are found in <1% of the population
- Common variants are found in >= 1% of the population

In [151]:
# what is Variant_frequency_in_ExAC?

hpds::find.in.dictionary(resource, "Variant_frequency") %>% 
    hpds::extract.entries()

description,name,min,categorical,patientCount,observationCount,max,HpdsDataType,categoryValues
<chr>,<chr>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<chr>,<list>
"Description=""The variant allele frequency in gnomAD combined population as discrete text categories. Possible values: Novel, Rare (variant frequency less than 1%), Common (variant frequency greater than or equal to 1%).""",Variant_frequency_as_text,,True,,,,info,"Novel , Rare , Common"


In [155]:
# Example querying for novel variants
my_query <- hpds::new.query(resource = resource)
hpds::query.filter.add(query=my_query, keys="Variant_frequency_as_text")
#variant_data <- hpds::query.run(my_query, result.type="variantsDataFrame")
variant_data <- hpds::query.run(my_query)

head(variant_data)

ERROR: HTTP response was bad
Response [https://udn.hms.harvard.edu/picsure/query/sync/]
  Date: 2021-07-29 21:04
  Status: 502
  Content-Type: text/html; charset=iso-8859-1
  Size: 341 B
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request<p>Reason: <strong>Error reading...
</body></html>


{results:{},error:True}
<lgl>,<lgl>


Finally, we can combined genomic and phenotypic filters into a single query:

In [30]:
# Example combining variant and phenotype queries
my_query <- hpds::new.query(resource = resource)
hpds::query.filter.add(query=my_query, keys="Variant_frequency_in_text", 0.4, 0.5)
hpds::query.filter.add(query=my_query, keys="\\00_Demographics\\Gender\\", "Female")
variant_data <- hpds::query.run(my_query, result.type="variantsDataFrame")
head(variant_data)

[1] "ERROR: cannot add, key does not exist in resource: Variant_frequency_in_ExAC"


No.Variants.Found
<lgl>


In [154]:
?query.filter.add