ChIP Seq Alignment and Processing Pipeline

This page explains the steps to get from fastq files to BAM and BED files in ChIP-Seq and then how to call peaks using JAMM. This is by no means the only pipeline, but it's one option that works well.

Things you will need installed on your computer:

Bowtie2
samtools
bedtools
JAMM (to call peaks)

Things you will need to have:

Bowtie2 indeces, which you can get from here: http://support.illumina.com/sequencing/sequencing_software/igenome.html

1. Step One, align your fastq reads to the genome using bowtie2 with default parameters:

Bowtie2 indeces (files that end in ".bt2") is how bowtie2 defines the genome. They are not one file, there are multiple files but all end in ".bt2". Let's say your Bowtie2 indeces are here:

/home/someusername/bowtie2Indeces/somegenome.1.bt2 /home/someusername/bowtie2Indeces/somegenome.1.bt2 /home/someusername/bowtie2Indeces/somegenome.rev.1.bt2

and let's say your fastq file that you want to process is here:

/home/someusername/myFastqFile.fastq

You can run bowtie2 like so:

bowtie2 -x /home/someusername/bowtie2Indeces/somegenome -U /home/someusername/myFastqFile.fastq -S /home/someusername/myFastqFile.sam

"myFastqFile.sam" is the output file. They are called SAM files and define where your reads are in the genome, along with other information about the quality of the alignment, the mismatches in the alignment and so on.

Bowtie2 will also output some information about the alignment statistics directly to the terminal. Take note of those numbers. If alignment percentage is low (I'd say < 90% or 80%), something might be off with your data. This can be that your reads need adapter trimming. We will not cover this here, but one thing you can do is check the quality of your fastq file using a program called FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

2. Step Two, filter your SAM file for nonuniquely aligned reads that have two or less mismatches:

This is an optional step, but is a common one to do. In some cases, you can trust Bowtie2 to do the right thing and figure out where your reads should go but the more conservative approach is to remove reads that aligned to multiple locations in the genome (nonuniquely aligned) and those that had more than a certain amount of mismatches.

To do this, do:

samtools view -Sh /home/someusername/myFastqFile.sam | grep -e "^@" -e "XM:i:[012][^0-9]" | grep -v "XS:i:" > /home/someusername/myFastqFile.sam.filtered.sam

3. Step Three, convert your SAM file to a BAM file and sort it:

A BAM file is a binary compressed SAM file. It makes sense to do this conversion, because it's a lot easier to work with these smaller binary BAM files than with the huge text SAM files.

To do this conversion, type:

samtools view -S -b /home/someusername/myFastqFile.sam.filtered.sam > /home/someusername/myFastqFile.sam.filtered.sam.bam

and to sort this BAM file, type:

samtools sort /home/someusername/myFastqFile.sam.filtered.sam.bam /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted

This last statement will produce a file called: /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam, which is a BAM file that is sorted by chromosome coordinates.

4. Step Four, remove potential PCR duplicates from your BAM file:

PCR amplification can produce reads that are just artifacts. The conservative way is to just "collapse" your reads, meaning if multiple reads map to the same location in the genome, keep only one. This removes all potential PCR duplicates.

This can be easily done using samtools like so:

samtools rmdup -s /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam.nodup.bam

This file: /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam.nodup.bam, is your final BAM file..congratulations :)

5. Step Five, index your BAM file:

Producing an index for your final BAM file will allow you to load your BAM file in IGV browser or UCSC browser.

type:

samtools index /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam.nodup.bam

This will produce a new file here: /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam.nodup.bam.bai

Once you produced this file, you can load your BAM file "myFastqFile.sam.filtered.sam.bam.sorted.bam.nodup.bam" in a genome browser to check how your data looks like.

6. Step Six: convert your BAM file into a BED file:

BED files are simple text, tab-delimited files defining genomic coordinates. BAM files can be converted into BED files. BED files are bigger files but can be easier to analyze and manipulate. To convert your final BAM file into a BED file, type:

bedtools bamtobed -i /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam.nodup.bam > /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam.nodup.bam.bed

7. Step Seven: call peaks with JAMM

JAMM takes as input a folder where all your replicates are located in BED file format. It will analyze all files in this folder as replicates.

to run JAMM on your final BED file that you just produced, type:

bash /path/to/JAMM/folder/JAMM.sh -s /home/someusername -g /path/to/your/chromosome/size/file -o /home/someusername

This will create two folders, one called "peaks" where you can find the resulting peak files and the other called "xcorr" with the JAMM-predicted fragment length.

This command will give you a large list of peaks, so that you can decide on how to threshold it. If you don't like that, try typing:

bash /path/to/JAMM/folder/JAMM.sh -s /home/someusername -g /path/to/your/chromosome/size/file -o /home/someusername -e auto

This will give you a thresholded list of peaks. Check other pages on the WIKI to find out more about how to run JAMM and its output.

Questions or Corrections or Comments?

send me an email here
reach out on twitter

Provide feedback

Saved searches

Use saved searches to filter your results more quickly