# 2016 AMIA: Introduction to ChIP-seq

<center>Kyle Hernandez, Ph.D. khernandez at bsd.uchicago.edu</center>
<center>Center for Research Informatics, University of Chicago</center>

*This document briefly covers the basics of what ChIP-Seq is and the types of questions it can answer. It is by no means exhaustive. I provide some links and citations for you to read through on your own time if you are interested in more in-depth knowledge. In addition, there are some more in-depth notebooks available in the workshop_extended directory which you can go through on your own time.*

## What is ChIP-seq?

<img src="workshop_extended/assets/f01_chipseq_overview.jpg" alt="Figure 01" style="width: 400px; float: right;"/>

* **Ch**romatin **I**mmuno**P**recipitation followed by **seq**uencing
* Sequencing of the genomic DNA fragments that co-precipitate with the protein of interest using high-throughput sequencing technologies
* Detect epigenetic changes
    * A "discovery" tool
    * Genome-wide

## There are different types of ChIP-seq

| Protein of interest | Enriched genomic DNA fragments |
| ------------------- | ------------------------------ |
| Transcription factors | Promoter, enhancer, silencer, insulator, other cis elements |
| RNA polymerase | Regions under active transcription |
| DNA polymerase | Regions under replication |
| Modified histones | Chromatin modification |

<center>...and more! We will focus on **transcription factors** in these sessions.</center>

## What can we learn from ChIP-seq?

* Location
    * Where does my protein of interest bind?
* Quantification
    * How strong is the signal?
* Annotation
    * Which type of sequence motif is enriched/present in the peaks? _(We won't have time to go over motif analysis, but the [MEME suite](http://meme-suite.org/) is a great set of tools)_
    * What are the target genes?
    * Which networks/pathways are my target genes enriched in?

## What we can't learn from ChIP-seq

* Gene expression changes (RNAseq)
* DNA sequence changes (WES, WGS, target amplicon sequencing)
* DNA methylation changes (MeDIPseq, bisulfite sequencing)
* RNA-protein interaction (CLIPseq)

## Overview of a ChIP-seq experiment

<img src="workshop_extended/assets/f02_chipseq_experiment.png" alt="Figure 02" width="800px" style="float: right;"/>
* Cross-link proteins to DNA (usually with formaldehyde)
* Cell disruption and sonication to shear the chromatin to a target size (100-300 bp)
* The protein of interest and its bound DNA are enriched by purification with an antibody
* Next-generation sequencing
* Identify putatively enriched genomic regions

## Experimental design

* Antibody quality is important
    * Finding a _sensitive_ and _specific_ antibody to the protein of interest is the most crucial and challenging step
    * 20-35% of commercial "ChIP-grade" antibodies are unusable ([modENCODE](http://www.modencode.org/))
    * Check your antibody ahead of time if possible (e.g., Western blot)
    * Antibody list from ENCODE: https://www.encodeproject.org/search/?type=AntibodyLot

## Experimental design

* You should use control samples to control for background noise
    * "Input" - genomic DNA that was crosslinked and fragmented but not immunoprecipitated; _most commonly used control_
    * "Mock IP" - DNA obtained with a control antibody that reacts with an irrelevant, non-nuclear antigen (e.g., IgG); crosslinking + fragmentation + IP with IgG antibody
* You should use biological replicates
    * Recommend at least 2 biological replicates
    * Used to establish biological variability
    * Reduces false positives (more power)

## Experimental design

* The million dollar question: How many reads do I need?
    * Transcription factors (sharp peaks)
        * 20+ million reads per sample
        * 40+ million per condition with 2 replicates
        * 150 million reads per Illumina HiSeq lane
    * Try to keep sequencing depths similar between different IP experiments and between IP and INPUT
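
The read-depth numbers above translate directly into a lane budget. The following is a minimal sketch of that arithmetic; the experiment layout (2 conditions, 2 IP replicates each, 1 input per condition) is a hypothetical example, and real lane yields vary by instrument and run.

```python
import math

READS_PER_LANE = 150_000_000    # figure quoted above for one HiSeq lane
READS_PER_SAMPLE = 20_000_000   # minimum quoted above for a transcription factor


def samples_per_lane(reads_per_lane: int, reads_per_sample: int) -> int:
    """Number of samples that fit on one lane at the target depth."""
    return reads_per_lane // reads_per_sample


def lanes_needed(n_samples: int, reads_per_lane: int, reads_per_sample: int) -> int:
    """Lanes required to sequence n_samples at the target depth."""
    per_lane = samples_per_lane(reads_per_lane, reads_per_sample)
    return math.ceil(n_samples / per_lane)


# Hypothetical design: 2 conditions x 2 IP replicates, plus 1 input per condition
n_samples = 2 * 2 + 2

print(samples_per_lane(READS_PER_LANE, READS_PER_SAMPLE))         # 7
print(lanes_needed(n_samples, READS_PER_LANE, READS_PER_SAMPLE))  # 1
```

Under these assumptions, all six samples fit on a single lane with a little headroom.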

## A must read for ChIP-Seq projects

<center><blockquote>Landt et al., 2012. Genome Research 22:1813-1831</blockquote></center>

<img src="workshop_extended/assets/f03_chipseq_encode_paper.png" alt="Figure 03" width="800px" style="float: center;"/>

# 2016 AMIA: Analysis ChIP-seq

<center>Kyle Hernandez, Ph.D. khernandez at bsd.uchicago.edu</center>
<center>Center for Research Informatics, University of Chicago</center>

_So far we have introduced the concept of ChIP-Seq and discussed important experimental design aspects. Here, we will focus on a subset of the analytical steps after receiving your ChIP-Seq data from the sequencing facility._

## File formats

We do not have time to discuss file formats; however, we have provided a document (`workshop_extended/2016-AMIA-Workshop-Common-Formats`) that goes into more detail. The most commonly used file formats in ChIP-seq analyses are:

* [fastq](https://en.wikipedia.org/wiki/FASTQ_format) - sequence data
* [fasta](https://en.wikipedia.org/wiki/FASTA_format) - sequence data
* [SAM/BAM](http://samtools.github.io/hts-specs/SAMv1.pdf) - alignment data
* [bed](https://genome.ucsc.edu/FAQ/FAQformat#format1) - peak data
* [narrowPeak](https://genome.ucsc.edu/FAQ/FAQformat#format12) - peak data
* [bigWig](https://genome.ucsc.edu/FAQ/FAQformat#format6.1) - normalized enrichment data
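
As a small illustration of what peak data looks like, here is a sketch that parses one narrowPeak record into named fields. The column order follows the UCSC narrowPeak description linked above; the example line and the `NarrowPeak` / `parse_narrowpeak` names are made up for illustration.

```python
from typing import NamedTuple


class NarrowPeak(NamedTuple):
    chrom: str
    start: int           # 0-based start coordinate
    end: int             # end coordinate (exclusive)
    name: str
    score: int
    strand: str          # "+", "-" or "." when no strand is assigned
    signal_value: float  # overall enrichment for the region
    p_value: float       # -log10 p-value, -1 if not assigned
    q_value: float       # -log10 q-value, -1 if not assigned
    summit: int          # summit offset from start, -1 if not called


def parse_narrowpeak(line: str) -> NarrowPeak:
    """Split one tab-delimited narrowPeak line into typed fields."""
    f = line.rstrip("\n").split("\t")
    return NarrowPeak(
        f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
        float(f[6]), float(f[7]), float(f[8]), int(f[9]),
    )


# Made-up record, for illustration only
example = "chr1\t9356548\t9356648\tpeak_1\t182\t.\t5.0945\t-1\t4.6075\t50"
peak = parse_narrowpeak(example)

# Peak width and the absolute coordinate of the summit
print(peak.chrom, peak.end - peak.start, peak.start + peak.summit)
```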

## Basic ChIP-Seq Workflow

<img src="workshop_extended/assets/f04_chipseq_basic_workflow.png" alt="Figure 01" width="850px" style="float: right;"/>
* Starts with your sequencing data (in fastq format)
* Focus only on peak calling and ChIP-Seq quality statistics
* Hands-on session that covers the annotation module
* More information is available in the extended documents in the GitHub repository.

## Peak calling

**<center>Goal: detect regions (peaks) of enrichment in our IP samples</center>**

* At this point in the workflow we have a set of reads aligned to our reference genome (BAM format)
* In simple terms, peak calling software searches for regions in the genome with a greater than expected number of alignments ("sequencing tags") compared to the "background noise".
* A control sample (e.g., IgG or input) helps with modelling the "noise" and greatly reduces the number of false positives.
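
To make the idea of "greater than expected" concrete, here is a toy sketch that tests fixed windows against a Poisson background estimated from the control. It is not the algorithm of any particular peak caller: the window counts, the p-value cutoff, the function names, and the assumption that IP and control were sequenced to the same depth are all illustrative.

```python
import math


def poisson_sf(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)   # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i     # P(X = i) computed from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_enriched_windows(ip_counts, control_counts, p_cutoff=1e-5):
    """Return indices of windows where the IP count exceeds the background."""
    # Genome-wide background rate estimated from the control sample
    background = sum(control_counts) / len(control_counts)
    enriched = []
    for idx, (ip, ctrl) in enumerate(zip(ip_counts, control_counts)):
        # Use the larger of the local and global control rates (conservative)
        lam = max(ctrl, background)
        if poisson_sf(ip, lam) < p_cutoff:
            enriched.append(idx)
    return enriched


# Made-up tag counts per window; window 3 carries the binding site
ip_counts = [3, 5, 4, 40, 6, 2]
control_counts = [4, 5, 3, 6, 5, 3]

print(call_enriched_windows(ip_counts, control_counts))  # [3]
```

Without the control, a window that is noisy in every sample (for example, a repetitive region) could look enriched; taking the local control count into account is what suppresses those false positives in this sketch.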

## What does the ChIP-seq signal look like?

<img src="workshop_extended/assets/f06_chipseq_tag_shift.png" width="450px" alt="Figure 03" style="float: left;"/>
* Enriched sequence tags cluster at locations bound by the <span style="color:#FF8C00">protein of interest</span> (e.g., transcription factor)
* Sequencing tags accumulate on both the <span style="color:#B22222">forward</span> and <span style="color:#6495ED">reverse</span> strands on either side of the binding site: forward-strand tags pile up upstream of it and reverse-strand tags downstream. That is to say, the tags are _shifted_ away from the center.
* The distance (shift) from the center depends on the _fragment size_ of your sequencing library
* The input control lacks this pattern of shifted, strand-specific sequence tags

## ChIP-Seq quality control

There are several metrics and tools for determining the quality of your ChIP-Seq experiment (see 
[ENCODE guidelines](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431496/)); however, we will only touch 
on two metrics:

1. Relative strand correlation (RSC)
2. Fraction of reads falling within peak regions (FRiP)


## RSC scores

<img src="workshop_extended/assets/f11_shift.png" width="450px" alt="Figure 04" style="float: left;"/>
* Quantify the sequencing tag clustering (IP enrichment) genome-wide
* Pearson correlation of sequence tag densities between the strands after shifting by _k_ base pairs
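
The strand cross-correlation described above can be sketched on simulated data. Everything here is a toy assumption: the genome length, the binding sites, the 150 bp fragment length, and the noise-free tag placement. Real tools work from aligned reads and handle many details this sketch ignores.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def strand_cross_correlation(forward, reverse, shifts):
    """Correlate forward tag density with reverse tag density shifted by k bp."""
    n = len(forward)
    return {k: pearson(forward[: n - k], reverse[k:]) for k in shifts}


GENOME_LENGTH = 2000       # toy genome
FRAGMENT_LENGTH = 150      # assumed library fragment size
SITES = [500, 1200, 1700]  # assumed binding sites

forward = [0] * GENOME_LENGTH
reverse = [0] * GENOME_LENGTH
for site in SITES:
    for jitter in range(-10, 11):
        # Forward tags start upstream of the site, reverse tags downstream
        forward[site - FRAGMENT_LENGTH // 2 + jitter] += 1
        reverse[site + FRAGMENT_LENGTH // 2 + jitter] += 1

profile = strand_cross_correlation(forward, reverse, range(0, 301, 5))
best_shift = max(profile, key=profile.get)
print(best_shift)  # 150
```

In this idealized case the correlation peaks at a shift equal to the fragment length, because that is the shift at which the two strand profiles line up.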

## RSC scores
<img src="workshop_extended/assets/f07_chipseq_cc_plots.png" alt="Figure 04" width="1000px" style="float: center;"/>
* Two peaks are produced when cross-correlation is plotted against the shift value:
    1. A peak of enrichment corresponding to the predominant fragment length
    2. A peak corresponding to the read length (called a "phantom" peak)
* The RSC is the ratio of the fragment-length peak to the read-length peak, after the minimum (background) cross-correlation has been subtracted from each
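
As a worked example, the RSC can be computed from a cross-correlation profile. This sketch assumes the definition used in the ENCODE guidelines, in which the minimum cross-correlation is subtracted from both peaks before taking the ratio. The profile values, the 50 bp read length, and the 150 bp fragment length are made up for illustration.

```python
def relative_strand_correlation(cc_by_shift, read_length, fragment_length):
    """RSC = (cc at fragment length - min cc) / (cc at read length - min cc)."""
    cc_min = min(cc_by_shift.values())
    cc_fragment = cc_by_shift[fragment_length]
    cc_phantom = cc_by_shift[read_length]
    return (cc_fragment - cc_min) / (cc_phantom - cc_min)


# Made-up cross-correlation values keyed by shift (bp)
profile = {0: 0.20, 50: 0.26, 100: 0.24, 150: 0.32, 200: 0.25, 300: 0.20}

rsc = relative_strand_correlation(profile, read_length=50, fragment_length=150)
print(f"RSC = {rsc:.2f}")  # RSC = 2.00
```

Here the fragment-length peak sits twice as far above background as the phantom peak, so the RSC is about 2.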

## RSC scores
<img src="workshop_extended/assets/f07_chipseq_cc_plots.png" alt="Figure 04" width="1200px" style="float: center;"/>

* A good estimate for the signal-to-noise ratio in ChIP-seq experiments
* High-quality ChIP-seq datasets tend to have a **larger** fragment-length peak compared with the read-length peak
* ENCODE guidelines suggest that you **repeat samples with RSC values less than 0.8**

## FRiP

* Fraction of your mapped reads that fall into peak regions identified by a peak-calling algorithm
* Rough metric for estimating the global enrichment of ChIP-seq data
* Even in highly enriched ChIP-seq experiments, only a minority of reads occur in peaks (the majority are background)
* ENCODE has shown that FRiP values correlate positively and linearly with RSC
* ENCODE guidelines suggest that you **repeat experiments with FRiP values below 1%**
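
The FRiP calculation itself is simple, as this minimal sketch shows. It makes several simplifying assumptions: each mapped read is represented by a single position, peaks are 0-based half-open intervals, and the toy reads and peaks are made up (so the resulting value is far higher than anything expected from a real experiment).

```python
def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak.

    read_positions: list of (chrom, position) tuples
    peaks: list of (chrom, start, end) tuples, 0-based and half-open
    """
    in_peaks = sum(
        any(chrom == p_chrom and start <= pos < end for p_chrom, start, end in peaks)
        for chrom, pos in read_positions
    )
    return in_peaks / len(read_positions)


# Made-up peaks and mapped read positions
peaks = [("chr1", 100, 200), ("chr2", 50, 80)]
reads = [
    ("chr1", 150), ("chr1", 199), ("chr1", 200), ("chr1", 950),
    ("chr2", 60), ("chr2", 10), ("chr2", 500), ("chr3", 5),
]

value = frip(reads, peaks)
print(value)  # 0.375
```

The linear scan over peaks is fine for a toy example; with millions of reads, an interval index or a dedicated tool would be the practical choice.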

**NOTE: In the session questions, we accidentally provided really high FRiP estimates and those values will probably never be seen in your experiments**