Indexing Variant Data v0.7.0

Jacobo Coll Moragón edited this page Aug 30, 2016 · 4 revisions

Indexing VCF files

VCF files can be indexed either with the pipeline implemented in the CLI or with the Java API. The aim of indexing is to allow queries over the indexed data. Indexing happens in two consecutive steps: transformation and load. During transformation, the VCF data is normalized and converted into an internal variant data model (see Data Models). During load, this normalized and validated file is loaded into the active storage engine plugin. For more information about the indexing process, see OpenCGA Storage Overview.

For this walkthrough, we are going to use sample VCF data from the 1000 Genomes Project. You can use any other file, but all the examples below use the VCF file ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz

Using the Command Line Interface

OpenCGA provides two ways of indexing VCF data:

  1. OpenCGA Analysis: this is a high level CLI that uses OpenCGA Catalog.
  2. OpenCGA Storage: this is a low level CLI completely independent of OpenCGA Catalog. You should use this one if you already have a metadata server and only need the Storage indexing capabilities.

OpenCGA Analysis and Catalog

The complete description of the OpenCGA command line interface is at Command Line; this is just a quick start example. This tutorial targets version 0.7.0. To check the installed version, execute:

./opencga.sh --version
Version 0.7.0
git version: master bc312857972c90ded9448b251674f04afd2bcf74
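
The version check above can be automated before running the rest of the tutorial. The following is a minimal sketch, assuming the output always contains a line of the form "Version X.Y.Z" as in the sample shown; the function names `parse_opencga_version` and `check_opencga_version` are hypothetical helpers, not part of OpenCGA.

```shell
#!/bin/sh
# Extract the version number from `./opencga.sh --version` output read on stdin.
# Assumption: the output has a line starting with "Version " followed by the number.
parse_opencga_version() {
    sed -n 's/^Version[[:space:]]\{1,\}\([0-9][0-9.]*\).*$/\1/p' | head -n 1
}

# Succeed only when the version read on stdin equals the expected one
# (defaults to 0.7.0, the version this tutorial was written for).
check_opencga_version() {
    expected="${1:-0.7.0}"
    found="$(parse_opencga_version)"
    [ "$found" = "$expected" ]
}
```

A possible use is `./opencga.sh --version | check_opencga_version 0.7.0 || echo "unexpected OpenCGA version" >&2`.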

First of all, you will need a user account. This user information is stored only in your Catalog. For instance:

./opencga.sh users create -u myuser -p mypass -e my@e.mail -n "my name"

Then, we can log in to this account by executing the command below. This creates a token that is used to authenticate the user.

./opencga.sh users login -u myuser -p mypass

Then, you can organize your data in several projects, with several studies in each project. Here we will have only one of each:

./opencga.sh projects create -a myproject -n "Default project" -d "First project created."
./opencga.sh studies create -a mystudy --project-id myuser@myproject -n "Default study" -d "First study created."

Now we can do the actual indexing. First, we have to tell Catalog where the file we want to operate on is:

./opencga.sh files create -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz --study-id myuser@myproject/mystudy --bioformat VARIANT

Then ask for the indexing. If you have properly configured your storage engine (currently MongoDB or HBase) and want to do both the transformation and the load, you can run directly:

./opencga.sh files index -id myuser@myproject/mystudy/ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz
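
The Catalog identifiers used in these commands follow a visible pattern: `user@project` for a project, `user@project/study` for a study, and `user@project/study/file` for a file. The sketch below builds them from their parts so they are not retyped by hand. The function names are hypothetical, and the pattern is inferred only from the examples on this page, so treat it as an assumption.

```shell
#!/bin/sh
# Build "user@project" from a user name and a project alias.
catalog_project_id() {
    printf '%s@%s\n' "$1" "$2"
}

# Build "user@project/study" from user, project alias and study alias.
catalog_study_id() {
    printf '%s/%s\n' "$(catalog_project_id "$1" "$2")" "$3"
}

# Build "user@project/study/file" from user, project, study and file name.
# Assumption: the file sits at the root of the study, as in the example above.
catalog_file_id() {
    printf '%s/%s\n' "$(catalog_study_id "$1" "$2" "$3")" "$4"
}
```

For example, `./opencga.sh files index -id "$(catalog_file_id myuser myproject mystudy ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz)"` would expand to the command shown above.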

If you also want to annotate the variants and compute statistics about the genotypes, run instead:

./opencga.sh files index -u myuser -p mypass -id myuser@myproject/mystudy/ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz --calculate-stats --annotate

You can always annotate the variants from a study by executing:

./opencga.sh studies annotate-variants -id myuser@myproject/mystudy

Now your VCF should be indexed and ready to receive queries. You can query it with:

./opencga-storage.sh fetch-variants -r 22:16050050-16050100 --database opencga_myuser_myproject
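
The query above relies on two conventions: a region written as `chromosome:start-end`, and a database name that looks like `opencga_<user>_<project>`. The sketch below checks a region string before it is sent and derives the database name. Both helpers are hypothetical. The naming pattern is an assumption inferred from this single example, and the CLI may accept region forms that this strict check rejects.

```shell
#!/bin/sh
# Assumption: Catalog names the database "opencga_<user>_<project>",
# as suggested by the example "opencga_myuser_myproject".
catalog_database_name() {
    printf 'opencga_%s_%s\n' "$1" "$2"
}

# Return success when the argument looks like chromosome:start-end,
# with numeric coordinates and start <= end.
is_valid_region() {
    region="$1"
    case "$region" in
        *:*-*) ;;
        *) return 1 ;;
    esac
    chrom="${region%%:*}"      # text before the first ':'
    range="${region#*:}"       # text after the first ':'
    start="${range%%-*}"       # text before the first '-'
    end="${range#*-}"          # text after the first '-'
    [ -n "$chrom" ] || return 1
    case "$start" in ''|*[!0-9]*) return 1 ;; esac
    case "$end" in ''|*[!0-9]*) return 1 ;; esac
    [ "$start" -le "$end" ]
}
```

A caller could guard the query with `is_valid_region "$region" && ./opencga-storage.sh fetch-variants -r "$region" --database "$(catalog_database_name myuser myproject)"`.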

See the command help for more options:

./opencga-storage.sh fetch-variants --help

OpenCGA Storage

⚠️ This is a low level CLI. Any metadata must be recorded by the application.

A VCF file can be indexed in one or two steps, depending on whether you want to delay the database load.

A simple indexing run can be done with the next command. Note that at this level you must manage your own ids; the examples below use study id 1:

./opencga-storage.sh index-variants --study-id 1 --study-name default -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz --database chr22_test_db

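If your dataset is big and you want to work in smaller steps, it is recommended to split the ETL process in two: first transform the VCF, then load the file produced by the transformation (here with the suffix .variants.json.gz):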

./opencga-storage.sh index-variants --study-id 1 --study-name default -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz --transform 

./opencga-storage.sh index-variants --study-id 1 --study-name default -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz.variants.json.gz --database chr22_test_db --load  
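
The two commands above can be chained so that the load runs only when the transformation succeeds. This is a minimal sketch using only the flags shown on this page. The function names and the `OPENCGA_STORAGE` variable are hypothetical. It also assumes the transformed file is written next to the input with the suffix `.variants.json.gz`, as in the example.

```shell
#!/bin/sh
# Name of the file expected from --transform.
# Assumption: ".variants.json.gz" is appended to the input name.
transformed_file_name() {
    printf '%s.variants.json.gz\n' "$1"
}

# Transform, then load only if the transformation exits successfully.
# Arguments: input VCF, study id, study name, database name.
# OPENCGA_STORAGE (hypothetical) overrides the path to the CLI script.
index_in_two_steps() {
    input="$1"
    study_id="$2"
    study_name="$3"
    database="$4"
    cli="${OPENCGA_STORAGE:-./opencga-storage.sh}"

    "$cli" index-variants --study-id "$study_id" --study-name "$study_name" \
        -i "$input" --transform || return 1

    "$cli" index-variants --study-id "$study_id" --study-name "$study_name" \
        -i "$(transformed_file_name "$input")" --database "$database" --load
}
```

With this in place, `index_in_two_steps ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz 1 default chr22_test_db` would run the same pair of commands. Setting `OPENCGA_STORAGE=echo` prints the commands instead of running them, which is handy for a dry run.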

Then you can query normally with:

./opencga-storage.sh fetch-variants --database chr22_test_db --limit 10

See the command help for more options:

./opencga-storage.sh fetch-variants --help