[Step 3] Database structure

Mikhail Dozmorov edited this page Feb 24, 2016 · 1 revision

The database is organized to simplify navigation and the selection of cell type-specific regulatory datasets, or of whole categories of such datasets.

The ENCODE data is organized using the [category]/([subcategory])/[tier] schema.

  • The DNase, Histone, and TFBS_cellspecific categories contain the corresponding cell type-specific regulatory data. The TFBS_combined category contains a non-cell type-specific summary of the binding of 161 transcription factors. The [chromStates](ENCODE chromStates) category contains cell type-specific chromatin states obtained using different methods.
  • The tier system reflects [cell type specificity and quality](ENCODE cell types) of the data.
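
As an illustration of the schema above, here is a minimal Python sketch that splits an ENCODE-style relative path into its levels. It assumes that the parentheses in the schema mean the subcategory level is optional, and the example names `Tier1` and `ChromHMM` are hypothetical placeholders, not confirmed folder names from the database.

```python
from typing import NamedTuple, Optional


class EncodePath(NamedTuple):
    category: str
    subcategory: Optional[str]  # assumed optional, as the parentheses suggest
    tier: str


def parse_encode_path(path: str) -> EncodePath:
    """Split 'category/tier' or 'category/subcategory/tier' into its levels."""
    parts = [p for p in path.strip("/").split("/") if p]
    if len(parts) == 2:
        category, tier = parts
        return EncodePath(category, None, tier)
    if len(parts) == 3:
        category, subcategory, tier = parts
        return EncodePath(category, subcategory, tier)
    raise ValueError(f"Expected 2 or 3 path levels, got {len(parts)}: {path!r}")


# Hypothetical paths, for illustration only
print(parse_encode_path("DNase/Tier1"))
print(parse_encode_path("chromStates/ChromHMM/Tier1"))
```

A path with any other number of levels raises `ValueError`, which makes it easy to spot folders that do not follow the schema.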

The Roadmap Epigenomics data follows the [category]/[cell/tissue type] schema.

  • The DNase and Histone categories contain the corresponding cell type-specific regulatory data. The _bPk/_gPk/_nPk suffixes correspond to peaks called using the broad/gapped/narrow peak settings, respectively. See the c. Peak Calling section for more details. We recommend using the _bPk data. The processed/imputed suffixes correspond to experimentally obtained/computationally imputed regulatory data, respectively. See Imputed signal tracks for more details. We recommend using the processed data.
  • The cell/tissue type system organizes data derived from [general anatomical categories](Roadmap cell types).
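
The suffix conventions above can be decoded programmatically. The following Python sketch maps the suffixes in a Roadmap category name to their meaning. How the suffixes are joined to the assay name is an assumption here: the example names `Histone_bPk-processed` and `DNase_nPk-imputed` are hypothetical and may not match the actual folder names.

```python
# Suffix meanings taken from the description above
PEAK_TYPES = {"_bPk": "broad", "_gPk": "gapped", "_nPk": "narrow"}
ORIGINS = {
    "processed": "experimentally obtained",
    "imputed": "computationally imputed",
}


def describe_roadmap_category(name: str) -> dict:
    """Report the assay, peak type, and data origin encoded in a category name."""
    peaks = next((label for tag, label in PEAK_TYPES.items() if tag in name), None)
    origin = next((label for tag, label in ORIGINS.items() if tag in name), None)
    # Assumption: the assay name is everything before the first '_' or '-'
    assay = name.split("_", 1)[0].split("-", 1)[0]
    return {"assay": assay, "peaks": peaks, "origin": origin}


# Hypothetical category names, for illustration only
print(describe_roadmap_category("Histone_bPk-processed"))
print(describe_roadmap_category("DNase_nPk-imputed"))
```

Fields that cannot be determined from the name are returned as `None` rather than guessed.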

The file names generally follow the [cell]-[factor]-[category] schema, so that regulatory datasets can be identified quickly without consulting a detailed description.
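
A file name that follows this schema can be split into its three fields. The Python sketch below is a minimal example under stated assumptions: the input is the file name without its extension, and the name `Gm12878-CTCF-TFBS` is hypothetical. Because cell names may themselves contain hyphens, the sketch splits from the right; a factor name containing a hyphen would still be split incorrectly, so treat the result as a best guess.

```python
def parse_dataset_name(name: str) -> dict:
    """Split '[cell]-[factor]-[category]' (no file extension) into its fields."""
    # Split from the right so that hyphens inside the cell name are preserved
    parts = name.rsplit("-", 2)
    if len(parts) != 3:
        raise ValueError(f"Name does not follow cell-factor-category: {name!r}")
    cell, factor, category = parts
    return {"cell": cell, "factor": factor, "category": category}


# Hypothetical file names, for illustration only
print(parse_dataset_name("Gm12878-CTCF-TFBS"))
print(parse_dataset_name("HeLa-S3-CTCF-TFBS"))
```

Since the schema is only followed "generally", names that do not split into three fields raise `ValueError` and can be reviewed by hand.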