Querying Variants using Storage CLI

Nacho edited this page Nov 2, 2015 · 3 revisions

Overview

There are two ways of querying indexed data in OpenCGA Storage from the Command Line Interface (CLI):

  • opencga.sh: a top-level CLI for querying data using OpenCGA Catalog.
  • opencga-storage.sh: a low-level CLI for querying data using variant attributes such as region, gene, annotation, stats or genotypes.

Both CLIs implement similar parameters and functionality for querying variants. The main difference is that the top-level CLI uses OpenCGA Catalog to enforce ACL-based security (authentication and authorization) and to support more complex queries that use sample annotations.

Both executables can be found in the $OPENCGA_HOME/bin folder.

Using opencga-storage.sh

In version v0.7.0 this is the most complete way of querying data. It allows querying by:

  • genomic regions and feature IDs such as genes and SNPs
  • variant annotations such as consequence types, conservation scores, protein substitution scores (PolyPhen, SIFT), population frequencies, ClinVar, ...
  • sample genotypes
  • variant stats in the study
  • basic aggregations such as ranks, group-by or counts

All these filters can be combined. Some query modifiers are also implemented:

  • skip and limit
  • count: this can be added to all CLIs to return just the number of results

To see all the parameters, run the following from the $OPENCGA_HOME folder:

./bin/opencga-storage.sh fetch-variants -h
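The skip and limit modifiers mentioned above can be used to walk through a large result set page by page. The following is a minimal sketch, not part of OpenCGA: page_commands is a hypothetical helper, and --skip is an assumed flag name (only --limit appears in the examples on this page), so check the help output above before relying on it. The function only prints the commands, so it can be tried without an OpenCGA installation.

```shell
# Print one fetch-variants command per page (dry run; nothing is executed).
# ASSUMPTION: --skip is the flag name for the "skip" modifier; confirm with -h.
# The body runs in a subshell so its variables do not leak to the caller.
page_commands() (
  database=$1
  region=$2
  page_size=$3
  pages=$4
  page=0
  while [ "$page" -lt "$pages" ]; do
    skip=$((page * page_size))
    echo "./bin/opencga-storage.sh fetch-variants --database $database --region $region --skip $skip --limit $page_size"
    page=$((page + 1))
  done
)

# Three pages of 100 variants each for one region.
page_commands DATABASE_NAME 22:15000000-20000000 100 3
```

Each printed line can be run from the $OPENCGA_HOME folder once the flag names have been confirmed.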

Design considerations

There are some design decisions you must be aware of:

  1. The comma character ',' is used in different places in the CLI and can behave in two ways. When it enumerates query values such as regions, genes or SO terms, it acts as a logical OR, as in the region list 1:1800000-1900000,1:2000000-2100000. When it separates query fields, as in "sift<0.2,polyphen<0.5", it acts as a logical AND.

  2. Regardless of where regions, genes or SNP IDs appear in the CLI, they always behave as a logical OR. For instance, in the following command the region and gene parameters are combined with a logical OR:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 1:1800000-1900000,1:2000000-2100000 --gene BRCA2

  3. All the other CLI parameters are combined with a logical AND, so the following query returns only variants in the specified region with a SIFT score below 0.2 AND a PolyPhen score below 0.5:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 22:15000000-20000000 --protein-substitution "sift<0.2,polyphen<0.5"
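When queries are generated from a script, the two roles of the comma are easy to confuse. The sketch below keeps them apart by naming the variables after what they hold. join_by_comma is a hypothetical helper written for this illustration, not an OpenCGA command, and the final line only prints the resulting command.

```shell
# join_by_comma a b c  ->  a,b,c
# The body runs in a subshell so the IFS change does not leak to the caller.
join_by_comma() (
  IFS=','
  printf '%s' "$*"
)

# Values of the same kind (two regions): the CLI treats this comma as OR.
region_arg=$(join_by_comma "1:1800000-1900000" "1:2000000-2100000")

# Different query fields (SIFT and PolyPhen): the CLI treats this comma as AND.
score_arg=$(join_by_comma "sift<0.2" "polyphen<0.5")

# Print the command instead of running it; the score filter is quoted
# because '<' would otherwise be read by the shell as a redirection.
printf './bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region %s --protein-substitution "%s"\n' \
  "$region_arg" "$score_arg"
```

The printed query reads as: variants in either region that also satisfy both score thresholds.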

Example queries

Using variant attributes

To fetch variants for a specific region:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 22:15000000-20000000

For several regions, separate them with ',':

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 1:1800000-1900000,1:2000000-2100000

You can also add a list of genes:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 1:1800000-1900000,1:2000000-2100000 --gene BRCA2,TP53

Note: remember that regions and genes are always combined with a logical OR.

To select SNVs, INDELs or SVs you can use the --type parameter:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 1:1800000-1900000,1:2000000-2100000 --type INDEL

Using variant annotation info

To query by SIFT or PolyPhen2 scores, use --protein-substitution:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 22:15000000-20000000 --protein-substitution "sift<0.2"

To use both, remember that here the ',' acts as a logical AND:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 22:15000000-20000000 --protein-substitution "sift<0.2,polyphen<0.5"

To return only the number of variants, you can always add --count:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 22:1500000-2000000 --protein-substitution "sift<0.2" --count

To query using Consequence Type terms from the Sequence Ontology (SO), you can use the terms listed at http://www.ensembl.org/info/genome/variation/predicted_data.html. Use commas to add several terms:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 21:9411443-19411443 --consequence-type SO:0001623,SO:0001624 --count

You can always combine parameters in a logical AND, so the next query returns variants annotated with those SO terms in the specified region:

./bin/opencga-storage.sh fetch-variants --database DATABASE_NAME --region 21:9411443-19411443 --consequence-type SO:0001623,SO:0001624 --count

To query using conservation scores you can use --conservation. The next query uses both PhastCons and PhyloP, separated by ','. Since they are different query fields, they act as a logical AND:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --region 1 --conservation "phastCons<0.1,phylop<0.2" --count

You can also query using population frequencies from the 1000 Genomes Project, EVS and ExAC with the --population-freqs parameter:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --output-format json --population-freqs "1000GENOMES_phase_1:EUR<0.01" --count

or several populations together, separated by commas. Since they are different populations and query fields, this is a logical AND:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --output-format json --population-freqs "1000GENOMES_phase_1:EUR<0.01,1000GENOMES_phase_1:AFR<0.01" --count
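Typing the same study name and threshold for every population gets repetitive. As a rough sketch, a small helper can assemble the filter string. population_filter is a hypothetical function written for this page; it only builds text in the STUDY:POPULATION<threshold form used in the examples above and does not check that the study or population names exist in the database.

```shell
# population_filter STUDY THRESHOLD POP [POP...]
# Builds "STUDY:POP<THRESHOLD" terms joined by ','. Each term is a separate
# query field, so the CLI combines them with a logical AND.
# The body runs in a subshell so its variables do not leak to the caller.
population_filter() (
  study=$1
  threshold=$2
  shift 2
  filter=
  for population in "$@"; do
    # ${filter:+$filter,} adds the separating comma only after the first term.
    filter="${filter:+$filter,}${study}:${population}<${threshold}"
  done
  printf '%s' "$filter"
)

freq_arg=$(population_filter 1000GENOMES_phase_1 0.01 EUR AFR)

# Dry run: print the command with the filter quoted.
printf './bin/opencga-storage.sh fetch-variants --database opencga_test_demo --population-freqs "%s" --count\n' "$freq_arg"
```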

Sample genotype

To query by specific sample genotypes you can use the --sample-genotype parameter. You must separate samples by ';', and the accepted genotypes for each sample by ','. This executes an AND between samples and an OR between the genotypes of each sample, so in:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --sample-genotype "15:0/0;20:0/1,1/1" --limit 15

variants which are 0/0 for sample 15 and 0/1 or 1/1 for sample 20 are returned. (Note: in a few days sample names will be allowed.)
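Because --sample-genotype mixes two separators, building its value by hand is error-prone when there are many samples. The sketch below is one possible way to assemble it. sample_term and join_samples are hypothetical helpers for this illustration, and the sample IDs are the ones from the example above.

```shell
# sample_term SAMPLE GT [GT...]  ->  SAMPLE:GT,GT
# Genotypes for one sample are joined by ',' (logical OR).
sample_term() (
  sample=$1
  shift
  IFS=','
  printf '%s:%s' "$sample" "$*"
)

# join_samples TERM [TERM...]  ->  TERM;TERM
# Samples are joined by ';' (logical AND).
join_samples() (
  IFS=';'
  printf '%s' "$*"
)

genotype_arg=$(join_samples "$(sample_term 15 0/0)" "$(sample_term 20 0/1 1/1)")

# Dry run: print the command. The value must stay quoted, since an unquoted
# ';' would end the command in the shell.
printf './bin/opencga-storage.sh fetch-variants --database opencga_test_demo --sample-genotype "%s" --limit 15\n' "$genotype_arg"
```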

Building more complex queries

You can combine all the parameters above to execute more complex queries:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --region 1:50000-3000000 --sample-genotype "15:0/0;20:0/1,1/1" --protein-substitution "sift<0.2,polyphen<0.5" --conservation "phastCons<0.1"
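Long combined queries are easier to maintain when each filter is optional. The following sketch shows one way to do that in a script: build_query is a hypothetical helper that takes flag/value pairs, drops any pair whose value is empty, and prints the resulting command rather than running it. The flags used are the ones shown on this page.

```shell
# build_query DATABASE [FLAG VALUE]...
# Prints a fetch-variants command, skipping every flag whose value is empty.
# Since each remaining flag is a different parameter, the CLI combines them
# with a logical AND.
build_query() (
  cmd="./bin/opencga-storage.sh fetch-variants --database $1"
  shift
  while [ "$#" -ge 2 ]; do
    if [ -n "$2" ]; then
      # Quote the value so characters such as '<' and ';' survive copy-paste.
      cmd="$cmd $1 \"$2\""
    fi
    shift 2
  done
  printf '%s\n' "$cmd"
)

# The conservation filter is left empty here, so it is omitted from the output.
build_query opencga_test_demo \
  --region "1:50000-3000000" \
  --sample-genotype "15:0/0;20:0/1,1/1" \
  --protein-substitution "sift<0.2,polyphen<0.5" \
  --conservation ""
```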

Some aggregations and rankings

To group variants per gene or consequence type you can use the --group-by parameter:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --region 1:1245816-3245819 --group-by gene

You can also rank genes or consequence types using --rank:

./bin/opencga-storage.sh fetch-variants --database opencga_test_demo --region 1:1245816-3245819 --rank gene
