-
Notifications
You must be signed in to change notification settings - Fork 13
ChIP Seq Alignment and Processing Pipeline
This page explains the steps to get from fastq files to BAM and BED files in ChIP-Seq and then how to call peaks using JAMM. This is by no means the only pipeline, but it's one option that works well.
Things you will need installed on your computer:
- Bowtie2
- samtools
- bedtools
- JAMM (to call peaks)
Things you will need to have:
- Bowtie2 indeces, which you can get from here: http://support.illumina.com/sequencing/sequencing_software/igenome.html
Bowtie2 indeces (files that end in ".bt2") is how bowtie2 defines the genome. They are not one file, there are multiple files but all end in ".bt2". Let's say your Bowtie2 indeces are here:
/home/someusername/bowtie2Indeces/somegenome.1.bt2
/home/someusername/bowtie2Indeces/somegenome.1.bt2
/home/someusername/bowtie2Indeces/somegenome.rev.1.bt2
and let's say your fastq file that you want to process is here:
/home/someusername/myFastqFile.fastq
You can run bowtie2 like so:
bowtie2 -x /home/someusername/bowtie2Indeces/somegenome -U /home/someusername/myFastqFile.fastq -S /home/someusername/myFastqFile.sam
"myFastqFile.sam" is the output file. They are called SAM files and define where your reads are in the genome, along with other information about the quality of the alignment, the mismatches in the alignment and so on.
Bowtie2 will also output some information about the alignment statistics directly to the terminal. Take note of those numbers. If alignment percentage is low (I'd say < 90% or 80%), something might be off with your data. This can be that your reads need adapter trimming. We will not cover this here, but one thing you can do is check the quality of your fastq file using a program called FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
This is an optional step, but is a common one to do. In some cases, you can trust Bowtie2 to do the right thing and figure out where your reads should go but the more conservative approach is to remove reads that aligned to multiple locations in the genome (nonuniquely aligned) and those that had more than a certain amount of mismatches.
To do this, do:
samtools view -Sh /home/someusername/myFastqFile.sam | grep -e "^@" -e "XM:i:[012][^0-9]" | grep -v "XS:i:" > /home/someusername/myFastqFile.sam.filtered.sam
A BAM file is a binary compressed SAM file. It makes sense to do this conversion, because it's a lot easier to work with these smaller binary BAM files than with the huge text SAM files.
To do this conversion, type:
samtools view -S -b /home/someusername/myFastqFile.sam.filtered.sam > /home/someusername/myFastqFile.sam.filtered.sam.bam
and to sort this BAM file, type:
samtools sort /home/someusername/myFastqFile.sam.filtered.sam.bam /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted
This last statement will produce a file called: /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam
, which is a BAM file that is sorted by chromosome coordinates.
PCR amplification can produce reads that are just artifacts. The conservative way is to just "collapse" your reads, meaning if multiple reads map to the same location in the genome, keep only one. This removes all potential PCR duplicates.
This can be easily done using samtools like so:
samtools rmdup -s /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam.nodup.bam
This file: /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam.nodup.bam
, is your final BAM file..congratulations :)
Producing an index for your final BAM file will allow you to load your BAM file in IGV browser or UCSC browser.
type:
samtools index /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam.nodup.bam
This will produce a new file here: /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam.nodup.bam.bai
Once you produced this file, you can load your BAM file "myFastqFile.sam.filtered.sam.bam.sorted.bam.nodup.bam" in a genome browser to check how your data looks like.
BED files are simple text, tab-delimited files defining genomic coordinates. BAM files can be converted into BED files. BED files are bigger files but can be easier to analyze and manipulate. To convert your final BAM file into a BED file, type:
bedtools bamtobed -i /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam.nodup.bam > /home/someusername/myFastqFile.sam.filtered.sam.bam.sorted.bam.nodup.bam.bed
JAMM takes as input a folder where all your replicates are located in BED file format. It will analyze all files in this folder as replicates.
to run JAMM on your final BED file that you just produced, type:
bash /path/to/JAMM/folder/JAMM.sh -s /home/someusername -g /path/to/your/chromosome/size/file -o /home/someusername
This will create two folders, one called "peaks" where you can find the resulting peak files and the other called "xcorr" with the JAMM-predicted fragment length.
This command will give you a large list of peaks, so that you can decide on how to threshold it. If you don't like that, try typing:
bash /path/to/JAMM/folder/JAMM.sh -s /home/someusername -g /path/to/your/chromosome/size/file -o /home/someusername -e auto
This will give you a thresholded list of peaks. Check other pages on the WIKI to find out more about how to run JAMM and its output.