
Storage Implementation

Jacobo Coll Moragón edited this page Mar 10, 2016 · 2 revisions

⛔ OUTDATED ⛔

Go to OpenCGA Storage Overview


Like many organizations involved in Big Data projects, we decided to use technologies such as Hadoop and MongoDB. Relatively small datasets can be stored in a Mongo-only backend, while projects expecting a very large volume of data can opt for a combination of both technologies, which we named Monbase :)

Mongo-only backend

This schema uses Mongo for storing all the raw data from the file. The Mongo collection establishes the relationships between variants and sources (typically files). It also stores the most commonly accessed variant statistics for each source. When translated to JSON, the schema would be something similar to the following:

{
        "_id" : ObjectId("53207bebe41ada68993d75de"),
        "id" : "rs12345",        
        "chromosome" : "X",
        "start" : 60034,
        "end" : 60034,
        "assembly": "GRCh37",
        "length" : 1,
        "ref" : "C",
        "alt" : "A",
        "type" : "SNV",
        "hgvs" : [
                        {"name": "X:g.64224271C>T", "type": "genomic"},
                        {"name": "ENST00000581797.1:c.-73G>A", "type": "RNA"}
        ],
        "effect" : [
                        {"geneName": "", "so": "regulatory_region_variant"},
                        {"geneName": "BRCA2", "so": "coding_sequence_variant"}
        ],       
        "files" : [
                {
                        "fileId" : "chrX",
                        "studyId" : "1000G",
                        "attributes" : {
                                "QUAL" : "256",
                                "FILTER" : "PASS",
                                "AA" : "...",
                                "AC" : "117",
                                "AF" : "0.05",
                                "AFR_AF" : "0.09",
                                "AMR_AF" : "0.05",
                                "AN" : "2184",
                                "ASN_AF" : "0.07",
                                "AVGPOST" : "0.9664",
                                "ERATE" : "0.0027",
                                "EUR_AF" : "0.02",
                                "LDAF" : "0.0610",
                                "RSQ" : "0.7797",
                                "THETA" : "0.0087",
                                "VT" : "INDEL"
                        },
                        "samples" : [
                                {
                                    "id": "NA20818",
                                    "attributes": {
                                        "GL" : "0.00,-1.20,-22.90",
                                        "GT" : "C|C",
                                        "DS" : "0.000"
                                    }
                                },
                                {
                                    "id": "NA20819",
                                    "attributes": {
                                        "GL" : "0.00,-2.10,-34.30",
                                        "GT" : "C|C",
                                        "DS" : "0.000"
                                    }
                                },
                                {
                                    "id": "NA20826",
                                    "attributes": {
                                        "GL" : "0.00,-0.60,-11.40",
                                        "GT" : "C|A",
                                        "DS" : "0.000"
                                    }
                                }
                        ],
                        "stats" : {
                                "maf" : 0.0535714291036129,
                                "mgf" : 0.002747252816334367,
                                "alleleMaf" : "A",
                                "genotypeMaf" : "A|A",
                                "missAllele" : 0,
                                "missGenotypes" : 0,
                                "genotypeCount" : {
                                        "0/0" : 978,
                                        "0/1" : 111,
                                        "1/1" : 3
                                }
                        }
                }
        ]
}
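The "stats" block in the example above is consistent with its "genotypeCount" map, and the relationship can be sketched in a few lines of Python. This is only an illustration, not the OpenCGA implementation: the function name `variant_stats` is hypothetical, and it assumes that MAF and MGF are the frequencies of the least frequent observed allele and genotype, and that any genotype containing "." counts as missing. Alleles are reported as indices (0 = reference, 1 = alternate), as in the "genotypeCount" keys.

```python
from collections import Counter


def variant_stats(genotype_count):
    """Recompute summary statistics from a genotypeCount map.

    Hypothetical helper. Assumes MAF/MGF are the frequencies of the least
    frequent observed allele/genotype, and that genotypes containing "."
    are missing. Raises ValueError if no genotype is called.
    """
    alleles = Counter()
    genotypes = Counter()
    miss_genotypes = 0
    for genotype, count in genotype_count.items():
        calls = genotype.replace("|", "/").split("/")
        if "." in calls:
            miss_genotypes += count
            continue
        genotypes[genotype] += count
        for allele in calls:
            alleles[allele] += count

    total_alleles = sum(alleles.values())
    total_genotypes = sum(genotypes.values())
    minor_allele, minor_allele_n = min(alleles.items(), key=lambda kv: kv[1])
    minor_genotype, minor_genotype_n = min(genotypes.items(), key=lambda kv: kv[1])
    return {
        "maf": minor_allele_n / total_alleles,
        "mgf": minor_genotype_n / total_genotypes,
        "alleleMaf": minor_allele,
        "genotypeMaf": minor_genotype,
        "missGenotypes": miss_genotypes,
    }


# Genotype counts taken from the example document.
stats = variant_stats({"0/0": 978, "0/1": 111, "1/1": 3})
print(stats)  # maf = 117/2184, mgf = 3/1092
```

With the counts from the example, the alternate allele is seen 117 times out of 2184 alleles, which agrees with the "AC" and "AN" attributes and with the stored "maf" and "mgf" values up to rounding.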

Whether to store samples, statistics and variant effects can be configured.
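Documents of this shape can be selected by region and study with an ordinary MongoDB filter. The sketch below only builds the filter as a Python dictionary; the helper name `region_filter` is hypothetical, and the field names are taken from the example document rather than from the actual OpenCGA code. The resulting dictionary could be passed to a driver call such as a collection's `find`.

```python
def region_filter(chromosome, start, end, study_id=None):
    """Build a MongoDB filter for variants overlapping [start, end].

    Hypothetical helper; field names follow the example document.
    """
    query = {
        "chromosome": chromosome,
        # A variant overlaps the region if it starts before the region
        # ends and ends after the region starts.
        "start": {"$lte": end},
        "end": {"$gte": start},
    }
    if study_id is not None:
        # Dot notation matches any element of the "files" array.
        query["files.studyId"] = study_id
    return query


query = region_filter("X", 60000, 61000, study_id="1000G")
print(query)
```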

Monbase (Mongo + HBase) backend

This schema uses HBase for storing raw data and Mongo as an index for the HBase database.

In HBase, variants are stored in a table whose row key is chromosome:position. There are two column families: one (data, d) stores the variant and sample information, and the other (info, i) stores the variant statistics for a given file.

| Row Key | Column Family: Data (d) | Column Family: Info (i) |
|---|---|---|
| chromosome:position | columns in the input file | study statistics |
| 1:123456 | d:f1_ref = { A } | i:f1_stats = { MAF : 0.05, MGF : 0.20, miss : 10 } |
| | d:f1_alt = { C } | i:f1_stats = { MAF : 0.04, MGF : 0.15, miss : 2 } |
| | d:f1_NA001 = { GT : A/C, DP = 5 } | |
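The row layout above can be sketched as a mapping from one variant to a row key plus a set of `family:qualifier` cells. This is a minimal illustration in plain Python with no HBase client involved; the function name `variant_to_cells` and its parameters are hypothetical, and the qualifier naming (`<fileId>_ref`, `<fileId>_alt`, `<fileId>_<sampleId>`, `<fileId>_stats`) is inferred from the table rather than taken from the real code.

```python
def variant_to_cells(chromosome, position, file_id, ref, alt, samples, stats):
    """Lay out one variant as an HBase-style row (hypothetical helper).

    Returns the row key and a dict of "family:qualifier" -> value.
    """
    row_key = f"{chromosome}:{position}"
    cells = {
        # Data family (d): variant alleles and per-sample columns.
        f"d:{file_id}_ref": ref,
        f"d:{file_id}_alt": alt,
        # Info family (i): pre-calculated statistics for this file.
        f"i:{file_id}_stats": stats,
    }
    for sample_id, fields in samples.items():
        cells[f"d:{file_id}_{sample_id}"] = fields
    return row_key, cells


row_key, cells = variant_to_cells(
    "1", 123456, "f1", "A", "C",
    samples={"NA001": {"GT": "A/C", "DP": 5}},
    stats={"MAF": 0.05, "MGF": 0.20, "miss": 10},
)
print(row_key)
for column, value in sorted(cells.items()):
    print(column, value)
```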

Storing pre-calculated statistics makes it possible to retrieve global information for a file very quickly. Statistics for a subset of the samples in a file, or for combinations of samples from multiple studies, must be calculated on demand and are not stored afterwards; saving all of them would not be affordable.
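An on-demand calculation for a subset of samples might look like the following sketch. It is an assumption about how such a computation could be written, not the actual implementation: the function name `subset_allele_frequency` is hypothetical, and genotypes are given as allele strings such as "C|A", as in the Mongo-only example.

```python
from collections import Counter


def subset_allele_frequency(sample_genotypes, subset, allele):
    """Frequency of `allele` among the called alleles of the chosen samples.

    Hypothetical helper. Samples absent from `sample_genotypes` and
    missing calls (".") are ignored.
    """
    counts = Counter()
    for sample_id in subset:
        genotype = sample_genotypes.get(sample_id)
        if genotype is None:
            continue
        for call in genotype.replace("|", "/").split("/"):
            if call != ".":
                counts[call] += 1
    total = sum(counts.values())
    return counts[allele] / total if total else 0.0


# Genotypes taken from the Mongo-only example document.
genotypes = {"NA20818": "C|C", "NA20819": "C|C", "NA20826": "C|A"}
freq = subset_allele_frequency(genotypes, ["NA20819", "NA20826"], "A")
print(freq)  # 1 "A" call out of 4 called alleles -> 0.25
```

The result is computed from the raw per-sample data each time and is not written back, which matches the trade-off described above.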

The Mongo collection establishes the relationships between variants and files. It also stores the most commonly accessed variant statistics for each file. When translated to JSON, the schema would be something similar to the following:

{
 "position" : "1:123456",
 "sources" : [ 
    {   
      "sourceId" : "f1",  
      "sourceName" : "file.vcf",
      "ref" : "A",
      "alt" : [ "AT", "TT" ],
      "stats" : {
         "MAF" : 0.05,
         "allele_maf" : "A",
         "missing" : 2,
         "genotype_count" : {
            "0/0" : 12,
            "0/1" : 23,
            "./." : 2
         }
      }
    },
    { "sourceId" : "f2", ...  }
  ]
}
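To illustrate how the Mongo document can act as an index into HBase, the sketch below takes an index document and a source ID and derives the HBase row key and the columns to read. It is a hypothetical helper (`hbase_lookup`) that assumes the column naming shown in the HBase table; it does not call either database.

```python
def hbase_lookup(index_doc, source_id):
    """Return (row_key, columns) for a source, or None if not indexed.

    Hypothetical helper; column names assume the d:/i: layout of the
    HBase table described earlier.
    """
    for source in index_doc["sources"]:
        if source["sourceId"] == source_id:
            columns = [
                f"d:{source_id}_ref",
                f"d:{source_id}_alt",
                f"i:{source_id}_stats",
            ]
            # The "position" field doubles as the HBase row key.
            return index_doc["position"], columns
    return None


index_doc = {
    "position": "1:123456",
    "sources": [{"sourceId": "f1", "sourceName": "file.vcf"}],
}
print(hbase_lookup(index_doc, "f1"))
```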