Skip to content

Latest commit

 

History

History
230 lines (165 loc) · 12.8 KB

ChIP-seq.rst

File metadata and controls

230 lines (165 loc) · 12.8 KB

ChIP-seq

What it does

The ChIP-seq pipeline takes one or more BAM files and attempts to find peaks. If multiple samples and a sample sheet are provided, then CSAW is additionally used to call differential peaks. Both sharp and broad peak calling are supported.

../images/ChIPseq_pipeline.png

In addition to peaks, bigWig tracks are also generated.

Input requirements

The DNA mapping pipeline generates output that is fully compatible with the ChIP-seq pipeline input requirements! When running the ChIP-seq pipeline, please specify the output directory of DNA-mapping pipeline as the working directory (-w).

If you need to provides file NOT generated by the DNA-mapping pipeline, then you must provide a directory with the following structure:

.
├── deepTools_qc
│   └── bamPEFragmentSize
│       ├── fragmentSize.metric.tsv
│       └── fragmentSizes.png
├── filtered_bam
│   ├── sample1.filtered.bam
│   ├── sample1.filtered.bam.bai
│   ├── sample2.filtered.bam
│   └── sample2.filtered.bam.bai
├── Sambamba
│   ├── flagstat_report_all.tsv
│   ├── sample1.markdup.txt
│   └── sample2.markdup.txt
└── sampleSheet.yaml
  • deepTools_qc contains the output of bamPEFragmentSize from deepTools, run on all the BAM files.
  • Sambamba directory contains the output of flagstat command from sambamba (the .markdup.txt files) and a single file summarizing that with columns sample (sample name, such as sampl1), total (total reads), dup (number of duplicate reads), and mapped (number of mapped reads).
  • filtered_bam directory contains the input BAM files (either filtered or unfiltered, however you prefer).
  • sampleSheet.tsv (OPTIONAL) is only needed to test for differential binding.

Sample configuration

The ChIP-seq sample configuration yaml file describes what type of peak calling to perform on each sample and which sample to use as the input control

chip_dict:
  SRR6761497:
    control: SRR6761502
    broad: True
  SRR6761498:
    control: SRR6761502
    broad: True
  SRR6761495:
    control: SRR6761502
    broad: False
  SRR6761499:
    control: SRR6761502
    broad: False

As you can see above, the same control can be used for multiple samples.

Note

Set the flag broad to True for broad marks, such as H3K27me and H3K9me3

Spikein Normalization

If chromatin from an external organism was spikein in, it is possible to obtain spikein-derived scaling factors for the ChIP (and input) samples with the flag --useSpikeInForNorm. This requires providing a hybrid bam file, with reads aligned to a hybrid genome of host and spikein chromosomes. Spikein chromosome extention can be specified with --spikeinExt. Scale factors can be obtained either from whole spikein genome in the ChIP samples, from windows centered on TSS in the spikein genome in the ChIP samples, or from whole spikein genome in the input samples . The default scale factors from whole spikein genome in the ChIP samples can be changed to something else with --getSizeFactorsFrom.

DESeq2-style scaling factors produced with deepTools multiBamSummary will then be used to create bam coverage tracks and passed to CSAW as size Factors if sample sheet is provided.

A hybrid genome can be obtained with createIndices workflow and can be passed to the DNA-mapping workflow without any particular arguments.

Differential Binding analysis

If you wish to perform differential binding analysis between two group of samples, for example wild-type vs Knock-outs, via snakePipes. You would require a sample-sheet and the --sampleSheet option. Sample sheet may contain only a subset of samples used in the previous steps e.g. for peak calling. In addition, input samples are filtered out prior to the analysis using the sample configuration yaml (see above).

The sample sheet is a tab-separated file with two columns, named name and condition. An example is below:

name    condition
sample1 wild-type
sample2 wild-type
SRR7013047      wild-type
SRR7013048      mutant
SRR7013049      mutant
SRR7013050      mutant

For comparison between two conditions, the name you assign to "condition" is not relevant, but rather the order is. The group mentioned first (in the above case "wild-type") would be used as a "control" and the group mentioned later would be used as "test".

The differential binding module utilizes the R package CSAW to detect significantly different peaks between two conditions. The analysis is performed on a "union" of peaks from all samples mentioned in the sample sheet. This merged set of regions are provided as an output inside the CSAW folder as the file 'DiffBinding_allregions.bed'. All differentially bound regions are available in 'CSAW/DiffBinding_significant.bed'. Two thresholds are applied to produce Filtered.results.bed : FDR (default 0.05 ) as well as absolute log fold change (1). These can be specified either in the defaults.yaml dictionary or via commandline parameters '--FDR' and '--LFC'. Additionally, filtered results are split into up to 3 bed files, representing direction change (UP, DOWN, or MIXED).

If the user provides additional columns between 'name' and 'condition' in the sample sheet, the variables stored there will be used as blocking factors in the order they appear in the sample sheet. Condition will be the final column and it will be used for any statistical inference.

Note

In order to include or exclude peaks from selected samples in the union of peaks used in the differential binding analysis, the user may provide an additional column named 'UseRegions' and set it to True or False, accordingly. This column must supersede the 'condition' column in the column order.

Merged regions from filtered results with any direction change are further used to produce deepTools heatmaps, using log2 ratio of chip signal to input or depth-normalized coverage. For this purpose, the regions are rescaled to 1kb, and extended by 0.2kb on each side.

An html report summarizing the differential binding analysis is produced in the same folder.

Filtered results are also annotated with the distance to the closest gene using bedtools closest and written as '.txt' files to the AnnotatedResults_* folder.

Configuration file

There is a configuration file in snakePipes/workflows/ChIP-seq/defaults.yaml:

pipeline: chip-seq
configFile:
clusterConfigFile:
local: false
maxJobs: 5
## workingdir need to be required DNA-mapping output dir, 'outdir' is set to workingdir internally
workingdir:
## preconfigured target genomes (mm9,mm10,dm3,...) , see /path/to/snakemake_workflows/shared/organisms/
## Value can be also path to your own genome config file!
genome:
## Which peak caller should be used?
peakCaller: 'MACS2'
## paired end data?
pairedEnd: true
## Bin size of output files in bigWig format
bwBinSize: 25
## Median/mean fragment length, only relevant for single-end data (default: 200)
fragmentLength: 200
verbose: false
# sampleInfo_DB
sample_info:
# windowSize
windowSize: 150
plot_format: png
##dummy string to skip filtering annotation
filter_annotation:
##parameters to filter DB regions on
fdr: 0.05
absBestLFC: 1

The only parameters that are useful to change are bwBinSize, fragmentLength, and windowSize. Note however that those can be more conveniently changed on the command line.

Understanding the outputs

The ChIP-seq pipeline will generate additional output as follows:

.
├── deepTools_ChIP
│   ├── bamCompare
│   │   ├── sample1.filtered.log2ratio.over_SRR6761502.bw
│   │   ├── sample1.filtered.subtract.SRR6761502.bw
│   │   ├── sample2.filtered.log2ratio.over_SRR6761502.bw
│   │   └── sample2.filtered.subtract.SRR6761502.bw
│   └── plotFingerprint
│       ├── plotFingerprint.metrics.txt
│       └── plotFingerprint.png
├── histoneHMM
│   ├── sample2.filtered.histoneHMM-em-posterior.txt.gz
│   ├── sample2.filtered.histoneHMM-regions.gff.gz
│   ├── sample2.filtered.histoneHMM-regions.gff.gz.tbi
│   ├── sample2.filtered.histoneHMM.txt.gz
│   ├── sample2.filtered.histoneHMM-zinba-emfit.pdf
│   ├── sample2.filtered.histoneHMM-zinba-params-em.RData
│   └── sample2.filtered.histoneHMM-zinba-params-em.txt
├── Genrich
│   └── sample2.narrowPeak
└── MACS2
    ├── sample1.filtered.BAM_peaks.narrowPeak
    ├── sample1.filtered.BAM_peaks.qc.txt
    ├── sample1.filtered.BAM_peaks.xls
    ├── sample1.filtered.BAMPE_peaks.narrowPeak
    ├── sample1.filtered.BAMPE_peaks.xls
    ├── sample1.filtered.BAMPE_summits.bed
    ├── sample1.filtered.BAM_summits.bed
    ├── sample2.filtered.BAM_peaks.broadPeak
    ├── sample2.filtered.BAM_peaks.gappedPeak
    ├── sample2.filtered.BAM_peaks.qc.txt
    ├── sample2.filtered.BAM_peaks.xls
    ├── sample2.filtered.BAMPE_peaks.broadPeak
    ├── sample2.filtered.BAMPE_peaks.gappedPeak
    └── sample2.filtered.BAMPE_peaks.xls

Following up on the DNA-mapping module results (see :doc:`DNA-mapping`), the workflow produces the following output directories :

  • deepTools_ChIP: Contains output from two of the deepTools modules. The bamCompare output contains the input-normalized coverage files for the samples, which is very useful for downstream analysis, such as visualization in IGV and plotting the heatmaps. The plotFingerPrint output is a useful QC plot to assess signal enrichment in the ChIP samples.
  • Genrich: This folder contains the output of Genrich. This will only exist IF you specified --peakCaller Genrich and you have samples with non-broad peaks. The output is in narrowPeak format, like that from MACS2.
  • MACS2: This folder contains the output of MACS2 on the ChIP samples, MACS2 would perform either a narrow or broad peak calling on the samples, as indicated by the ChIP sample configuration file (see :ref:`ChIPconfig`). The outputs files would contain the respective tags (narrowPeak or broadPeak). This folder will only exist if you have non-broad marks and use MACS2 for peak calling
  • histoneHMM: This folder contains the output of histoneHMM. This folder will only exist if you have broad marks.
  • CSAW_sampleSheet: This folder is created optionally, if you provide a sample sheet for differential binding analysis. (see :ref:`diffBinding`)
  • AnnotatedResults_sampleSheet: This folder is created optionally, if you provide a sample sheet for differential binding analysis. (see :ref:`diffBinding`). Differentially bound regions annotated with distance to nearest gene are stored here.

Note

Although in case of broad marks, we also perform the MACS2 broadpeak analysis (output available as MACS2/<sample>.filtered.BAM_peaks.broadPeak), we would recommend using the histoneHMM outputs in these cases, since histoneHMM produces better results than MACS2 for broad peaks.

Note

The _sampleSheet suffix for the CSAW_sampleSheet is drawn from the name of the sample sheet you use. So if you instead named the sample sheet mySampleSheet.txt then the folder would be named CSAW_mySampleSheet. This facilitates using multiple sample sheets.

Note

At the moment Genrich is NOT jointly calling peaks within a group since it's not aware of which samples contain which antibody. It is utilizing the input control if one exists.

Command line options

.. argparse::
    :func: parse_args
    :filename: ../snakePipes/workflows/ChIP-seq/ChIP-seq
    :prog: ChIP-seq
    :nodefault: