Coda: a convolutional denoising algorithm for genome-wide ChIP-seq data

Coda uses convolutional neural networks to learn a mapping from noisy to high-quality ChIP-seq data. The trained networks can then be used to remove noise and improve the quality of new ChIP-seq data. For more details, please refer to our paper:

Koh PW, Pierson E, Kundaje A. Denoising genome-wide histone ChIP-seq with convolutional neural networks. Bioinformatics (2017) 33 (14): i225-i233 URL: (ISMB 2017 Proceedings)

bioRxiv doi:


The code is written in Python 2.7 and requires the following Python packages to run:

  • Numpy (1.11.1)
  • Scipy (0.18.0)
  • Scikit-learn (0.17.1)
  • Pandas (0.18.1)
  • h5py (2.6.0)
  • rpy2 (2.8.1)
  • Keras (1.0.7)
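
To compare an existing environment against the versions listed above, a small check along the following lines may help. It is only a sketch: `EXPECTED_VERSIONS` and `check_versions` are hypothetical names that are not part of this repository, and the script assumes each package exposes a `__version__` attribute under its usual import name (for example, `sklearn` for Scikit-learn).

```python
import importlib

# Import names paired with the versions listed above.
# (Hypothetical helper; not part of the Coda repository.)
EXPECTED_VERSIONS = {
    "numpy": "1.11.1",
    "scipy": "0.18.0",
    "sklearn": "0.17.1",
    "pandas": "0.18.1",
    "h5py": "2.6.0",
    "rpy2": "2.8.1",
    "keras": "1.0.7",
}


def check_versions(expected):
    """Return {module_name: (found_version_or_None, status)}."""
    report = {}
    for module_name, wanted in expected.items():
        try:
            module = importlib.import_module(module_name)
        except Exception:
            # The package is not installed or fails to import.
            report[module_name] = (None, "missing")
            continue
        found = getattr(module, "__version__", None)
        status = "ok" if found == wanted else "differs"
        report[module_name] = (found, status)
    return report


if __name__ == "__main__":
    results = check_versions(EXPECTED_VERSIONS)
    for name, (found, status) in sorted(results.items()):
        print("%-8s expected %-7s found %-10s %s"
              % (name, EXPECTED_VERSIONS[name], found, status))
```

A status of "differs" is not necessarily a problem, but the code has only been described as working with the versions listed above.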

In addition, if you want to process your own data, you will need:

  • AQUAS ChIP-seq pipeline
  • SAMtools (1.2)
  • BEDtools (2.23)
  • ucsc_tools (3.0.9)
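
Before processing your own data, it can be useful to confirm that the command-line tools are on your PATH. The sketch below is illustrative only: `find_executable` and `REQUIRED_TOOLS` are hypothetical names, and `bedGraphToBigWig` is used as an assumed example of a UCSC utility; the exact UCSC binaries that the pipeline calls are not stated here. It assumes a Unix-like system.

```python
import os


def find_executable(name):
    """Return the full path of `name` if it is an executable on PATH, else None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


# Assumed binary names; adjust to match your installation.
REQUIRED_TOOLS = ["samtools", "bedtools", "bedGraphToBigWig"]

if __name__ == "__main__":
    for tool in REQUIRED_TOOLS:
        print("%-18s %s" % (tool, find_executable(tool) or "NOT FOUND"))
```

This only checks that the tools can be found; it does not verify their versions.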

Training and testing a model with pre-processed data

The fastest way to get started is to download data that has already been pre-processed. We have uploaded processed ChIP-seq data from lymphoblastoid cell lines GM12878 and GM18526, taken from [1]. Each cell line has two sets of ChIP-seq data, one derived from 1M reads per mark and the other from 100M+ reads per mark. The instructions below will train a model to recover high-depth data from low-depth data on GM12878, and then apply it to low-depth data on GM18526, evaluating the model output against high-depth data on GM18526:

  1. Clone the repo and install the dependencies above.

  2. Edit to reflect the paths where you want to store the data, code, results, etc.

  3. Run This runs a few test imports to make sure you have the required libraries, and sets up the directory structure as specified in

  4. Run This copies the required data (including the hg19 blacklist and chromosome sizes) to the appropriate folders. Note that the data is 6 GB in size, so please run this script in a location with enough free space.

  5. Finally, run python to get the experiments going. Numerical results will be written to RESULTS_ROOT. Output tracks (reconstructed signal and peak calls) will be written to RESULTS_BIGWIG_ROOT. We make use of the R 'PRROC' package, written by Jan Grau and Jens Keilwagen, to evaluate peak calls.
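
For intuition about the peak-call evaluation in step 5, the area under a precision-recall curve can be approximated by average precision. The following is a minimal pure-Python sketch, not the code used in this repository: `average_precision` is a hypothetical helper, ties between scores are not treated specially, and PRROC may use a different interpolation, so its values need not match this estimate exactly.

```python
def average_precision(labels, scores):
    """Approximate the area under the precision-recall curve.

    labels: iterable of 0/1 values (1 = true peak in the high-depth data).
    scores: iterable of predicted scores, higher = more confident.
    """
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_positives = sum(1 for _, label in ranked if label)
    if total_positives == 0:
        raise ValueError("need at least one positive label")

    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
            # Precision measured at each rank where a true peak is recovered.
            precision_sum += true_positives / float(rank)
    return precision_sum / total_positives


if __name__ == "__main__":
    # Two true peaks ranked 1st and 3rd out of four candidates.
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

In the example, precision is 1 at the first true peak and 2/3 at the second, so the average is 5/6.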

Processing your own data

We use the AQUAS ChIP-seq pipeline to process raw ChIP-seq data. The script (and the contents of the scripts folder) contains wrapper functions that call the AQUAS pipeline for you.

Please install the AQUAS pipeline before proceeding. Note that this pipeline is still under development and may change in non-backwards-compatible ways. Our code has been tested with commit 7b7dd27d42d46ac52f5687f80904c576d1b6595d of the AQUAS pipeline.

To create the processed data that we provided above, you may run the following steps:

  1. Follow steps 1-3 of the above section.

  2. Download the files corresponding to GM12878 and GM18526:

  3. Run python make_intervals hg19. You only need to do this once.

  4. Run python run_GM_pipeline.

This code assumes that you've downloaded the files to a shared location (REMOTE_ROOT, specified in It makes copies of the files in a local directory, RAW_ROOT, before proceeding. This setup is useful if REMOTE_ROOT is shared across multiple machines and RAW_ROOT is local to the machine that you're running the code on, because the pipeline performs many I/O operations that are faster on local storage. If you do not need this, modify merge_BAMs() in to remove the copying.
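
The idea of staging files from a shared location to local storage can be sketched as follows. This is an illustration of the pattern only, not the repository's merge_BAMs() implementation: `stage_locally` is a hypothetical helper, and the file name and paths in the usage comment are placeholders.

```python
import os
import shutil


def stage_locally(remote_root, raw_root, filenames):
    """Copy files from the shared directory to the local one; return local paths.

    A file is copied only if the local copy is absent or differs in size,
    so re-running the function does not repeat finished transfers.
    """
    if not os.path.isdir(raw_root):
        os.makedirs(raw_root)

    local_paths = []
    for name in filenames:
        src = os.path.join(remote_root, name)
        dst = os.path.join(raw_root, name)
        if (not os.path.exists(dst)
                or os.path.getsize(dst) != os.path.getsize(src)):
            shutil.copy2(src, dst)  # copy2 also preserves timestamps
        local_paths.append(dst)
    return local_paths


# Example usage (placeholder paths and file name):
# local_bams = stage_locally("/shared/coda/raw", "/scratch/coda/raw",
#                            ["example.bam"])
```

Comparing file sizes is a cheap heuristic for detecting incomplete copies; a checksum would be more reliable but slower for large BAM files.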

To process your own data, simply modify the paths in or copy your data to the right directories. While we start from BAM files in this example, the AQUAS pipeline can start from a variety of input files (e.g., FASTQ, tagAligns). Edit scripts/ and scripts/ if you want to change the parameters that are passed into AQUAS.


If you have any questions, please contact:


[1] Kasowski M, Kyriazopoulou-Panagiotopoulou S, Grubert F, Zaugg JB, Kundaje A, Liu Y, et al. Extensive variation in chromatin states across humans. Science (2013) 342 (6159): 750-752
