Coda: a convolutional denoising algorithm for genome-wide ChIP-seq data
Coda uses convolutional neural networks to learn a mapping from noisy to high-quality ChIP-seq data. These trained networks can then be used to remove noise and improve the quality of new ChIP-seq data. For more details, please refer to our paper
Koh PW, Pierson E, Kundaje A, Denoising genome-wide histone ChIP-seq with convolutional neural networks. Bioinformatics (2017) 33 (14): i225-i233 URL:https://doi.org/10.1093/bioinformatics/btx243 (ISMB 2017 Proceedings)
bioRxiv doi: https://doi.org/10.1101/052118
The code is written in Python 2.7 and requires the following Python packages to run:
- Numpy (1.11.1)
- Scipy (0.18.0)
- Scikit-learn (0.17.1)
- Pandas (0.18.1)
- h5py (2.6.0)
- rpy2 (2.8.1)
- Keras (1.0.7)
In addition, if you want to process your own data, you will need:
- AQUAS ChIP-seq pipeline
- SAMtools (1.2)
- BEDtools (2.23)
- ucsc_tools (3.0.9)
Training and testing a model with pre-processed data
The fastest way to get started is to download data that has already been pre-processed. We have uploaded processed ChIP-seq data from lymphoblastoid cell lines GM12878 and GM18526, taken from . Each cell line has two sets of ChIP-seq data, one derived from 1M reads per mark and the other from 100M+ reads per mark. The instructions below will train a model to recover high-depth data from low-depth data on GM12878, and then apply it to low-depth data on GM18526, evaluating the model output against high-depth data on GM18526:
Clone the repo and install the dependencies above.
diConstants.pyto reflect the paths where you want to store the data, code, results, etc.
setup.py. This runs a few test imports to make sure you have the required libraries, and sets up the directory structure as specified in
copyData.py. This copies the required data (including hg19 blacklist and chromosome sizes) to the appropriate folders. Note that the data is 6GB in size, so please run this script in a location where there's enough space!
python runGMExperiments.pyto get the experiments going. Numerical results will be written to
RESULTS_ROOT. Output tracks (reconstructed signal and peak calls) will be written to
RESULTS_BIGWIG_ROOT. We make use of the R 'PRROC' package, written by Jan Grau and Jens Keilwagen, to evaluate peak calls.
Processing your own data
We use the AQUAS ChIP-seq pipeline (https://github.com/kundajelab/TF_chipseq_pipeline)
to process raw ChIP-seq data. The script
prepData.py (and the contents of the
contains wrapper functions that call the AQUAS pipeline for you.
Please install the AQUAS pipeline before proceeding. Note that this pipeline is still under some development and might be changing in non-backwards-compatible ways. Our code has been tested with commit 7b7dd27d42d46ac52f5687f80904c576d1b6595d of the AQUAS pipeline.
To create the processed data that we provided above, you may run the following steps:
Follow steps 1-3 of the above section.
Download the files corresponding to GM12878 and GM18526: http://gbsc-share.stanford.edu/chromovar/rawdata/mapped/bam/personal/reconcile/dedup/
python prepData.py make_intervals hg19. You only need to do this once.
python prepData.py run_GM_pipeline.
This code assumes that you've downloaded the files to a shared location
REMOTE_ROOT, specified in diConstants.py). It makes copies of the files in a
RAW_ROOT, before proceeding. This setup is useful if
is shared across multiple machines and
RAW_ROOT is local to the machine that you're
running the code on, because there will be a lot of IO operations that will be faster
if done locally. If you do not need this, modify
to remove the copying.
To process your own data, simply modify the paths in
diConstants.py or copy your
data to the right directories. While we start from BAM files in this example, the AQUAS
pipeline can start from a variety of input files (e.g., FASTQ, tagAligns). Edit
scripts/findPeaks.sh if you want to change the parameters that
are passed into AQUAS.
If you have any questions, please contact:
- Pang Wei Koh email@example.com
- Emma Pierson firstname.lastname@example.org
- Anshul Kundaje email@example.com
 Kasowski M, Kyriazopoulou-Panagiotopoulou S, Grubert F, Zaugg JB, Kundaje A, Liu Y, et al. Extensive variation in chromatin states across humans. Science (New York, NY). 2013 11;342(6159):750–2