This pipeline performs the following tasks:
- Intallation and Reference Genomes
- Raw read QC (FastQC)
- Adapter trimming (Trim Galore)
- Alignment (BWA or bowtie2)
- Peak calling(MACS2)
- Differential Peak calling(DiffBind)
- Motif identification(HOMER)
- Inferring transcriptional regulators(LISA)
- Peaks data visualisation (IGV)
- Visualization(R)
- Running the pepline
- Experiments should have two or more biological replicates
- Each ChIP-seq experiment should have a corresponding input control experiment with matching run type, read length, and replicate structure.
- Library complexity is measured using the Non-Redundant Fraction (NRF) and PCR Bottlenecking Coefficients 1 and 2, or PBC1 and PBC2. Preferred values are as follows: NRF>0.9, PBC1>0.9, and PBC2>10.
- For narrow-peak histone experiments, each replicate should have 20 million usable fragments.(H3F3A, H3K27me3, H3K36me3, H3K4me1, H3K79me2, H3K79me3, H3K9me1, H3K9me2, H4K20me1)
- For broad-peak histone experiments, each replicate should have 45 million usable fragments.(H2AFZ, H3ac, H3K27ac, H3K4me2, H3K4me3, H3K9ac)
- H3K9me3 is an exception as it is enriched in repetitive regions of the genome. Compared to other broad marks, there are few H3K9me3 peaks in non-repetitive regions of the genome in tissues and primary cells. This results in many ChIP-seq reads that map to a non-unique position in the genome. Tissues and primary cells should have 45 million total mapped reads per replicate.
- For transcription factor experiments, each replicate should have 10 million usable fragments.
- Linux/Unix
- Python
- R
We uses the Miniconda3 package management system to harmonize all of the software packages. Use the following commands to install Minicoda3:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda create -n chip-seq
conda activate chip-seq
Tools needed for this analysis are: R, bedtools, FastQC, Trim Galore, MACS2, BWA, Lisa.
conda config --add channels bioconda
conda config --add channels conda-forge
conda install -c r r
conda install -c bioconda bedtools
conda install -c bioconda fastqc
conda install trim-galore
conda install -c bioconda macs2
conda install -c bioconda bwa
conda install -c liulab-dfci lisa2
conda install -c bioconda homer
Obtain a reference genome from Ensembl, iGenomes, NCBI or UCSC. In this example analysis we will use the mouse mm10 version of the genome from UCSC.
FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines.
run FastQC interactively or using ht CLI, which offers the following options:
fastqc seqfile1 seqfile2 .. seqfileN
Use trim_glore to trim sequence adapter from the read FASTQ files.
trim_galore -q 20 --phred33 --stringency 3 --length 20 -e 0.1 \
--paired $dir/cmp/01raw_data/$fq1 $dir/cmp/01raw_data/$fq2 \
--gzip -o $input_data
Reads were preprocessed and aligned reads to the appropriate reference genome (e.g., GRCh38 for human) using bowtie2, as shown below.
#set global variables
GENOME=/anno/human/genome/Homo_sapiens.GRCh38.dna.primary_assemblyexport
BASE_PATH=/home/user/human/ChIP-seq/
SUFFIX=_trim_paired.fastq.gz
cd $BASE_PATH/GENE_ID/replicate1
# replace _read_ID_ with your IDs
R1=QC/_read_ID_$SUFFIX
R2=QC/_read_ID_$SUFFIX
# do alignment
bowtie2 -p 4 -x $GENOME -1 $R1 -2 $R2 -S alignment.sam 2> log.txt
# if single-read, use
READ=QC/_read_ID_$SUFFIX
bowtie2 -p 4 -x $GENOME -U $READ -S alignment.sam 2> log.txt
#remove unmapped reads and duplicated reads (268= Read unmapped (4) or Mate unmapped (8) or Not primary alignment (256))
samtools view -h -F 268 -q 5 -bS alignment.sam > unique_alignment.bam
samtools sort unique_alignment.bam -o unique_alignment_sorted.bam
samtools rmdup unique_alignment_sorted.bam unique_alignment_sorted_rd.bam
rm alignment.sam unique_alignment.bam unique_alignment_sorted.bam
Peak calling was performed using MACS2 (200bp each, ±100bp from the peak summit), followed by removal of peak regions overlapping any blacklisted regions.
## for narrow peaks:
macs2 callpeak -t IP.bam -c Input.bam -n test -p 0.01 --nomodel --extsize fragment_length --keep-dup all -g hs
## for borad regions:
macs2 callpeak -t IP.bam -c Input.bam -n test --broad -p 0.01 --nomodel --extsize fragment_length --keep-dup all -g hs
HOMER contains two tools, findMotifs.pl and findMotifsGenome.pl, that manage all the steps for discovering motifs in promoter and genomic regions, respectively.
findMotifs.pl <inputfile.txt> <promoter set> <output directory> [options]
##findMotifsGenome.pl H3K4Me3.bed hg19 H3K4Me3_motif -len 8,10,12
##This will search for motifs of length 8,10,and 12 from -400 to +100 relative to the TSS, using 4 threads (i.e. 4 CPUs)
##findMotifs.pl will produce a number of output files in the "output directory". The primary output will be in HTML files that should be opened with you favorite web browser.
DiffBind is an R Bioconductor package that is used for identifying sites that are differentially enriched between two or more sample groups.
Rscript ~rcode/DiffBind.R
Running the pepline using [CHIPS(https://github.com/liulab-dfci/CHIPS)]
- ENCODE: Encyclopedia of DNA Elements ENCODExplorer: A compilation of metadata from ENCODE. A bioc package to access the meta data of ENCODE and download the raw files.
- ENCODE Factorbook
- ChromNet ChIP-seq interactions
paper: Learning the human chromatin network using all ENCODE ChIP-seq datasets - The International Human Epigenome Consortium (IHEC) epigenome data portal
- GEO. Sequences are in .sra format, need to use sratools to dump into fastq.
- European Nucleotide Archive. Sequences are available in fastq format.
- Data bases and software from Sheirly Liu's lab at Harvard
- Blueprint epigenome
- A collection of tools and papers for nucelosome positioning and TF ChIP-seq
- review paper:Deciphering ENCODE
- EpiFactors is a database for epigenetic factors, corresponding genes and products.
- biostar handbook. My ChIP-seq chapter is out April 2017!
- ReMap 2018 An integrative ChIP-seq analysis of regulatory regions. The ReMap atlas consits of 80 million peaks from 485 transcription factors (TFs), transcription coactivators (TCAs) and chromatin-remodeling factors (CRFs) from public data sets. The atlas is available to browse or download either for a given TF or cell line, or for the entire dataset.