GitHub - lzy604/ChIP-seq: ChIP-seq

ChIP-seq PIPELINE

This pipeline performs the following tasks:

Intallation and Reference Genomes
Raw read QC (FastQC)
Adapter trimming (Trim Galore)
Alignment (BWA or bowtie2)
Peak calling(MACS2)
Differential Peak calling(DiffBind)
Motif identification(HOMER)
Inferring transcriptional regulators(LISA）
Peaks data visualisation (IGV)
Visualization(R)
Running the pepline

ChIP-seq Data Standards

Experiments should have two or more biological replicates
Each ChIP-seq experiment should have a corresponding input control experiment with matching run type, read length, and replicate structure.
Library complexity is measured using the Non-Redundant Fraction (NRF) and PCR Bottlenecking Coefficients 1 and 2, or PBC1 and PBC2. Preferred values are as follows: NRF>0.9, PBC1>0.9, and PBC2>10.
For narrow-peak histone experiments, each replicate should have 20 million usable fragments.（H3F3A, H3K27me3, H3K36me3, H3K4me1, H3K79me2, H3K79me3, H3K9me1, H3K9me2, H4K20me1）
For broad-peak histone experiments, each replicate should have 45 million usable fragments.(H2AFZ, H3ac, H3K27ac, H3K4me2, H3K4me3, H3K9ac)
H3K9me3 is an exception as it is enriched in repetitive regions of the genome. Compared to other broad marks, there are few H3K9me3 peaks in non-repetitive regions of the genome in tissues and primary cells. This results in many ChIP-seq reads that map to a non-unique position in the genome. Tissues and primary cells should have 45 million total mapped reads per replicate.
For transcription factor experiments, each replicate should have 10 million usable fragments.

System requirements

Linux/Unix
Python
R

Installation

We uses the Miniconda3 package management system to harmonize all of the software packages. Use the following commands to install Minicoda3：

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Create an isolated environment for ChIP-seq

conda create -n chip-seq
conda activate chip-seq

Install tools

Tools needed for this analysis are: R, bedtools, FastQC, Trim Galore, MACS2, BWA, Lisa.

conda config --add channels bioconda
conda config --add channels conda-forge
conda install -c r r 
conda install -c bioconda bedtools 
conda install -c bioconda fastqc
conda install trim-galore
conda install -c bioconda macs2
conda install -c bioconda bwa 
conda install -c liulab-dfci lisa2 
conda install -c bioconda homer

Genome files

Obtain a reference genome from Ensembl, iGenomes, NCBI or UCSC. In this example analysis we will use the mouse mm10 version of the genome from UCSC.

Quality control on FastQ files

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines.

run FastQC interactively or using ht CLI, which offers the following options:

fastqc seqfile1 seqfile2 .. seqfileN

Adapter Trim[OPTIONAL]

Use trim_glore to trim sequence adapter from the read FASTQ files.

trim_galore -q 20 --phred33 --stringency 3 --length 20 -e 0.1 \
            --paired $dir/cmp/01raw_data/$fq1 $dir/cmp/01raw_data/$fq2  \
            --gzip -o $input_data

Alignment

Reads were preprocessed and aligned reads to the appropriate reference genome (e.g., GRCh38 for human) using bowtie2, as shown below.

#set global variables
GENOME=/anno/human/genome/Homo_sapiens.GRCh38.dna.primary_assemblyexport
BASE_PATH=/home/user/human/ChIP-seq/
SUFFIX=_trim_paired.fastq.gz
cd $BASE_PATH/GENE_ID/replicate1

# replace _read_ID_ with your IDs
R1=QC/_read_ID_$SUFFIX
R2=QC/_read_ID_$SUFFIX

# do alignment
bowtie2 -p 4 -x $GENOME -1 $R1 -2 $R2 -S alignment.sam 2> log.txt

# if single-read, use
READ=QC/_read_ID_$SUFFIX
bowtie2 -p 4 -x $GENOME -U $READ -S alignment.sam 2> log.txt

#remove unmapped reads and duplicated reads (268= Read unmapped (4) or  Mate unmapped (8) or  Not primary alignment (256))
samtools view -h -F 268 -q 5 -bS alignment.sam > unique_alignment.bam
samtools sort unique_alignment.bam -o unique_alignment_sorted.bam
samtools rmdup unique_alignment_sorted.bam unique_alignment_sorted_rd.bam
rm alignment.sam unique_alignment.bam unique_alignment_sorted.bam

Peak calling

Peak calling was performed using MACS2 (200bp each, ±100bp from the peak summit), followed by removal of peak regions overlapping any blacklisted regions.

## for narrow peaks:
macs2 callpeak -t IP.bam -c Input.bam -n test -p 0.01 --nomodel --extsize fragment_length --keep-dup all -g hs

## for borad regions:
macs2 callpeak -t IP.bam -c Input.bam -n test --broad -p 0.01 --nomodel --extsize fragment_length --keep-dup all -g hs

Motif discovery

HOMER contains two tools, findMotifs.pl and findMotifsGenome.pl, that manage all the steps for discovering motifs in promoter and genomic regions, respectively.

findMotifs.pl <inputfile.txt> <promoter set> <output directory> [options]

##findMotifsGenome.pl H3K4Me3.bed hg19 H3K4Me3_motif -len 8,10,12 
##This will search for motifs of length 8，10，and 12 from -400 to +100 relative to the TSS, using 4 threads (i.e. 4 CPUs)

##findMotifs.pl will produce a number of output files in the "output directory".  The primary output will be in HTML files that should be opened with you favorite web browser.

Differential Peak calling

DiffBind is an R Bioconductor package that is used for identifying sites that are differentially enriched between two or more sample groups.

Rscript ~rcode/DiffBind.R

Running the pepline using [CHIPS(https://github.com/liulab-dfci/CHIPS)]

Resources for ChIP-seq

ENCODE: Encyclopedia of DNA Elements ENCODExplorer: A compilation of metadata from ENCODE. A bioc package to access the meta data of ENCODE and download the raw files.
ENCODE Factorbook
ChromNet ChIP-seq interactions
paper: Learning the human chromatin network using all ENCODE ChIP-seq datasets
The International Human Epigenome Consortium (IHEC) epigenome data portal
GEO. Sequences are in .sra format, need to use sratools to dump into fastq.
European Nucleotide Archive. Sequences are available in fastq format.
Data bases and software from Sheirly Liu's lab at Harvard
Blueprint epigenome
A collection of tools and papers for nucelosome positioning and TF ChIP-seq
review paper:Deciphering ENCODE
EpiFactors is a database for epigenetic factors, corresponding genes and products.
biostar handbook. My ChIP-seq chapter is out April 2017!
ReMap 2018 An integrative ChIP-seq analysis of regulatory regions. The ReMap atlas consits of 80 million peaks from 485 transcription factors (TFs), transcription coactivators (TCAs) and chromatin-remodeling factors (CRFs) from public data sets. The atlas is available to browse or download either for a given TF or cell line, or for the entire dataset.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
R_code		R_code
shell_command		shell_command
.DS_Store		.DS_Store
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ChIP-seq PIPELINE

ChIP-seq Data Standards

System requirements

Installation

Create an isolated environment for ChIP-seq

Install tools

Genome files

Quality control on FastQ files

Adapter Trim[OPTIONAL]

Alignment

Peak calling

Motif discovery

Differential Peak calling

Running the pepline using [CHIPS(https://github.com/liulab-dfci/CHIPS)]

Resources for ChIP-seq

About

Releases

Packages

Languages

lzy604/ChIP-seq

Folders and files

Latest commit

History

Repository files navigation

ChIP-seq PIPELINE

ChIP-seq Data Standards

System requirements

Installation

Create an isolated environment for ChIP-seq

Install tools

Genome files

Quality control on FastQ files

Adapter Trim[OPTIONAL]

Alignment

Peak calling

Motif discovery

Differential Peak calling

Running the pepline using [CHIPS(https://github.com/liulab-dfci/CHIPS)]

Resources for ChIP-seq

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages