Storage Command Line

Nacho edited this page Dec 27, 2015 · 16 revisions

Overview

OpenCGA Storage (v0.7+) provides two command line interfaces (CLIs) for working with the storage engines. They are split by functionality into a client script and a server script:

  • client: loads, indexes and queries data, and provides other specific features such as variant annotation. It is available in the script opencga-storage.sh
  • server: two servers for querying data have been implemented: a standard RESTful web service using Jetty, and a gRPC server that offers a higher-performance and more scalable solution. It is available in the script opencga-storage-server.sh

Both Storage CLIs are organized in two levels, commands and subcommands, to group functionality and provide specific parameters.

Client CLI

The available commands are feature, alignment and variant. Their subcommands are:

  • feature
    • index: GFF/BED files are indexed using tabix by default; some plugins can override this and index in MongoDB or HBase
    • query: executes region-based queries
  • alignment
    • index: BAM/CRAM files are indexed using samtools by default, but some Storage plugins can use more advanced technologies such as Apache Hive
    • query (formerly fetch-alignments): executes the queries implemented in AlignmentDBAdaptor, such as region-based or coverage-based queries
    • stats: calculates basic file statistics and coverage
    • benchmark: runs the common benchmark framework to compare indexing and query times across different plugin implementations
  • variant
    • index: VCF/BCF (and gVCF) files are indexed using tabix by default, but some Storage plugins can use more advanced technologies such as MongoDB and Apache HBase to provide a much higher-performance and more scalable solution
    • query (formerly fetch-variants): executes the queries implemented in VariantDBAdaptor, such as region-based queries or queries by variant annotation
    • query-grpc: a gRPC client that sends queries to a remote gRPC server
    • annotate: creates and loads the variant annotation from CellBase or Ensembl VEP; annotations are indexed with the data in the mongodb and hadoop plugins
    • stats: calculates variant and sample stats for different cohorts; stats are indexed with the data in the mongodb and hadoop plugins
    • sample: sample-based aggregation queries
    • admin: removes variants, samples, … from databases
    • benchmark: runs the common benchmark framework to compare indexing and query times across different plugin implementations

Server CLI

The available commands are rest and grpc. Their subcommands are:

  • rest
    • start: starts Jetty for RESTful web services, at port 9090 by default
    • stop: stops the Jetty server
    • status: prints some useful information about the server status
  • grpc
    • start: starts the gRPC server, at port 9091 by default
    • stop: stops the gRPC server
    • status: prints some useful information about the server status
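
The two-level layout (command, then subcommand) can be illustrated with a small Python sketch. This is not how OpenCGA implements its CLI; it only mirrors the server commands and default ports listed above. The program name and the --port option are hypothetical and exist only to show a subcommand-specific parameter.

```python
import argparse

# Default ports as stated in the list above.
DEFAULT_PORTS = {"rest": 9090, "grpc": 9091}
SUBCOMMANDS = ("start", "stop", "status")


def build_parser():
    """Build a two-level parser of the form: <command> <subcommand>."""
    parser = argparse.ArgumentParser(prog="storage-server-sketch")
    commands = parser.add_subparsers(dest="command", required=True)
    for command, port in DEFAULT_PORTS.items():
        command_parser = commands.add_parser(command)
        subcommands = command_parser.add_subparsers(dest="subcommand", required=True)
        for name in SUBCOMMANDS:
            sub = subcommands.add_parser(name)
            # --port is a hypothetical option, used only to show a
            # subcommand-specific parameter with a per-command default.
            sub.add_argument("--port", type=int, default=port)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["grpc", "start"])
    print(args.command, args.subcommand, args.port)
```

Grouping parameters under each subcommand keeps the top-level command list short while still letting every subcommand define its own options.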

General considerations

Dynamic parameters

Dynamic parameters are not predefined command line options; they override internal configuration parameters. Depending on the biotype (alignment or variant) and the selected storage engine, these parameters are added to the options field of the configuration file that was read.

-D<configuration-parameter-name>=<value>
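
As a rough illustration of this mechanism, the Python sketch below separates -D tokens from the other arguments and merges them into an options dictionary. It is a minimal model, not the OpenCGA implementation; the function names and the parameter name example.batch.size are made up for the example.

```python
def parse_dynamic_params(argv):
    """Split argv into (-D overrides, remaining arguments)."""
    overrides, remaining = {}, []
    for token in argv:
        if token.startswith("-D") and "=" in token:
            # Split on the first "=" only, so values may contain "=".
            name, value = token[2:].split("=", 1)
            if name:
                overrides[name] = value
                continue
        remaining.append(token)
    return overrides, remaining


def apply_overrides(options, overrides):
    """Return a copy of an options field with the overrides applied."""
    merged = dict(options)
    merged.update(overrides)
    return merged


if __name__ == "__main__":
    # "example.batch.size" is a made-up name, not a real OpenCGA option.
    argv = ["variant", "index", "-Dexample.batch.size=500"]
    overrides, remaining = parse_dynamic_params(argv)
    print(remaining)
    print(apply_overrides({"example.batch.size": "100"}, overrides))
```

Values are kept as strings here; converting them to numbers or booleans would depend on how each option is consumed.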

Storage Configuration file

The file storage-configuration.yml should be placed in $OPENCGA_HOME/conf/ and contains all the configuration needed by OpenCGA-Storage. There are three main blocks: storageEngines, server and cellbase.

Old CLI Commands (v0.4-v0.6)

The Storage command line interface defines this set of commands:

  • index-variants: indexes a variants file
  • fetch-variants: searches over indexed variants
  • annotate-variants: creates and loads variant annotations into the database
  • stats-variants: creates and loads stats into a database
  • create-accessions: creates accession IDs for an input file
  • index-alignments: indexes an alignment file
  • fetch-alignments: searches over indexed alignments

Storage Configuration file

The storage-configuration.yml file in $OPENCGA_HOME/conf/ has three main blocks: storageEngines, server and cellbase. Its structure is:

  • Storage configuration
    • Storage Engine configuration
      • Variant
      • Alignment
    • Server configuration
    • CellBase configuration

Storage Engines configuration

A set of configuration options can be defined for each installed storage engine (mongodb, hadoop, ...). Each engine contains a section for every supported biotype, currently alignment and variant.
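
The nesting described here can be pictured with a simplified Python dictionary. This is an assumption-laden sketch rather than the real file layout: only the three block names, the engine names, the two biotypes and the options field come from the text, while the exact nesting, the helper functions and the key example.option are illustrative.

```python
# Hypothetical, simplified mirror of storage-configuration.yml as a dict.
# The real file may arrange these fields differently.
EXAMPLE_CONFIG = {
    "storageEngines": {
        "mongodb": {
            "alignment": {"options": {}},
            "variant": {"options": {"example.option": "value"}},
        },
        "hadoop": {
            "alignment": {"options": {}},
            "variant": {"options": {}},
        },
    },
    "server": {},
    "cellbase": {},
}

REQUIRED_BLOCKS = ("storageEngines", "server", "cellbase")
BIOTYPES = ("alignment", "variant")


def missing_blocks(config):
    """Return the names of the main blocks absent from the configuration."""
    return [block for block in REQUIRED_BLOCKS if block not in config]


def engine_options(config, engine, biotype):
    """Look up the options field for one storage engine and biotype."""
    if biotype not in BIOTYPES:
        raise ValueError(f"unsupported biotype: {biotype}")
    return config["storageEngines"][engine][biotype]["options"]


if __name__ == "__main__":
    print(missing_blocks(EXAMPLE_CONFIG))
    print(engine_options(EXAMPLE_CONFIG, "mongodb", "variant"))
```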

Variant ETL configuration

Options common to all storage engines for variants are defined in VariantStorageManager::Options.

Alignment ETL configuration

Options common to all storage engines for alignments are defined in AlignmentStorageManager::Options.

Server configuration
CellBase configuration