## Overview: <a class="anchor" id="top"></a>
* [Introduction to ChIP-Seq](#01)
    * [What is ChIP-seq?](#01.1)
    * [What can we learn from ChIP-seq?](#01.2)
    * [What we can't learn from ChIP-seq](#01.3)
    * [Overview of a ChIP-seq experiment](#01.4)
    * [Experimental design](#01.5)
    * [A must read for ChIP-Seq projects](#01.6)
* [Analysis of ChIP-Seq](#02)
    * [File formats](#02.1)
    * [Basic ChIP-Seq workflow](#02.2)
    * [Peak calling](#02.3)
    * [ChIP-Seq quality control](#02.4)

# Introduction to ChIP-seq<a class="anchor" id="01"></a> <small>[[top](#top)]</small>

Kyle Hernandez, Ph.D. [khernandez@bsd.uchicago.edu](mailto:khernandez@bsd.uchicago.edu)

**2016 AMIA Pre-conference symposium**

This document briefly covers the basics of what ChIP-Seq is and the types of questions it can answer. It is
by no means exhaustive. I provide links and citations for readers who want more in-depth knowledge, and the
`workshop_extended` directory contains more detailed notebooks that you can go through on your own time.

## What is ChIP-seq?<a class="anchor" id="01.1"></a> <small>[[top](#top)]</small>

<img src="workshop_extended/assets/f01_chipseq_overview.jpg" alt="Figure 01" style="width: 300px; float: right;"/>

* **Ch**romatin **I**mmuno**P**recipitation followed by **seq**uencing
* Sequencing of the genomic DNA fragments that co-precipitate with the protein of interest using high-throughput sequencing technologies
* Detect epigenetic changes
    * A "discovery" tool
    * Genome-wide

### Various types of ChIP-seq

| Protein of interest | Enriched genomic DNA fragments |
| ------------------- | ------------------------------ |
| Transcription factors | Promoter, enhancer, silencer, insulator, other cis elements |
| RNA polymerase | Regions under active transcription |
| DNA polymerase | Regions under replication |
| Modified histones | Chromatin modification |

...and more! We will focus on **transcription factors** in these sessions.

## What can we learn from ChIP-seq?<a class="anchor" id="01.2"></a> <small>[[top](#top)]</small>

* Location
    * Where does my protein of interest bind?
* Quantification
    * How strong is the signal?
* Annotation
    * Which type of sequence motif is enriched/present in the peaks? _(We won't have time to go over motif analysis, but the [MEME suite](http://meme-suite.org/) is a great set of tools)_
    * What are the target genes?
    * Which networks/pathways are my target genes enriched in?

## What we can't learn from ChIP-seq<a class="anchor" id="01.3"></a> <small>[[top](#top)]</small>

* Gene expression changes (RNAseq)
* DNA sequence changes (WES, WGS, targeted amplicon sequencing)
* DNA methylation changes (MeDIPseq, bisulfite sequencing)
* RNA-protein interaction (CLIPseq)

## Overview of a ChIP-seq experiment<a class="anchor" id="01.4"></a> <small>[[top](#top)]</small>

* Cross-link proteins to DNA (usually with formaldehyde)
* Cell disruption and sonication to shear the chromatin to a target size (100-300bp)
* The protein of interest and its bound DNA are enriched by purification with an antibody
* Next-generation sequencing
* Identify putatively enriched genomic regions

<img src="workshop_extended/assets/f02_chipseq_experiment.png" alt="Figure 02" style="float: center;"/>

## Experimental design<a class="anchor" id="01.5"></a> <small>[[top](#top)]</small>

* Antibody quality is important
    * Finding a _sensitive_ and _specific_ antibody to the protein of interest is the most crucial and challenging step
    * 20-35% of commercial "ChIP-grade" antibodies are unusable ([modENCODE](http://www.modencode.org/))
    * Check your antibody ahead of time if possible (e.g., Western blot)
    * Antibody list from ENCODE: https://www.encodeproject.org/search/?type=AntibodyLot
* You should use control samples to control for background noise
    * "input" - genomic DNA that has been crosslinked and fragmented but not immunoprecipitated; _most commonly used control_
    * "Mock IP" - DNA obtained with a control antibody that reacts with an irrelevant, non-nuclear antigen (e.g., IgG); crosslinking + fragmentation + IP with IgG antibody
* You should use biological replicates
    * Recommend at least 2 biological replicates
    * Used to establish biological variability
    * Reduces false positives (more power)
* The million dollar question: How many reads do I need?
    * Transcription factors (sharp peaks)
        * 20+ million reads per sample
        * 40+ million per condition with 2 replicates
        * 150 million reads per Illumina HiSeq lane
            * multiplex 4 samples (2 IP + 2 INPUT / lane)
    * Histone modification / Nucleosome positioning (broad peaks)
        * 40+ million per sample
        * 400 million or more may be needed!
    * It is important to _try_ and keep similar sequencing depths between different IP experiments (e.g., treatments), and between IP and INPUT (or fewer in INPUT)
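
The lane arithmetic above can be checked with a few lines of Python. This is only a sketch: `reads_per_sample` is a hypothetical helper, and it assumes the lane is split perfectly evenly across the multiplexed samples, which real runs only approximate.

```python
def reads_per_sample(lane_reads, n_samples):
    """Expected reads per sample if a lane is shared evenly (an idealization)."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return lane_reads / n_samples


LANE_READS = 150_000_000  # approximate HiSeq lane yield quoted above
TF_TARGET = 20_000_000    # suggested minimum for a transcription-factor sample

# 2 IP + 2 INPUT samples multiplexed on one lane
per_sample = reads_per_sample(LANE_READS, 4)
print(f"{per_sample / 1e6:.1f} million reads per sample")
print("meets transcription-factor target:", per_sample >= TF_TARGET)
```

Under these assumptions, four samples per lane leaves some headroom over the 20 million read target, while eight samples per lane would fall just short of it.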



## A must read for ChIP-Seq projects<a class="anchor" id="01.6"></a> <small>[[top](#top)]</small>

> Landt et al., 2012. Genome Research 22:1813-1831

<img src="workshop_extended/assets/f03_chipseq_encode_paper.png" alt="Figure 03" style="float: center;"/>

<hr>

# Analysis of ChIP-Seq<a class="anchor" id="02"></a> <small>[[top](#top)]</small>

So far we have introduced the concept of ChIP-Seq and discussed important experimental design aspects. Here, we will focus on a subset of the analytical steps after receiving your ChIP-Seq data from the sequencing facility.

## File formats<a class="anchor" id="02.1"></a> <small>[[top](#top)]</small>

We do not have time to discuss file formats; however, we have provided a document (`workshop_extended/2016-AMIA-Workshop-Common-Formats`) that goes into more detail. The most commonly used file formats in ChIP-seq analyses are:

* [fastq](https://en.wikipedia.org/wiki/FASTQ_format) - sequence data
* [fasta](https://en.wikipedia.org/wiki/FASTA_format) - sequence data
* [SAM/BAM](http://samtools.github.io/hts-specs/SAMv1.pdf) - alignment data
* [bed](https://genome.ucsc.edu/FAQ/FAQformat#format1) - peak data
* [narrowPeak](https://genome.ucsc.edu/FAQ/FAQformat#format12) - peak data
* [bigWig](https://genome.ucsc.edu/FAQ/FAQformat#format6.1) - normalized enrichment data
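
As an illustration of how simple these tab-delimited peak formats are, here is a minimal sketch that parses one narrowPeak record following the ten-column layout in the UCSC format description linked above. The `NarrowPeak` class, the `parse_narrowpeak_line` helper, and the example record are all made up for this illustration.

```python
from typing import NamedTuple


class NarrowPeak(NamedTuple):
    chrom: str
    start: int           # 0-based start coordinate
    end: int             # end coordinate (exclusive)
    name: str
    score: int
    strand: str
    signal_value: float
    p_value: float       # -log10 p-value, -1 if not assigned
    q_value: float       # -log10 q-value, -1 if not assigned
    summit: int          # summit offset from start, -1 if not called


def parse_narrowpeak_line(line):
    """Parse one tab-delimited narrowPeak record into a NarrowPeak tuple."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 10:
        raise ValueError(f"expected 10 columns, found {len(f)}")
    return NarrowPeak(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                      float(f[6]), float(f[7]), float(f[8]), int(f[9]))


# A made-up record, not real data
example = "chr1\t1000\t1400\tpeak_1\t250\t.\t12.5\t30.2\t27.1\t180"
peak = parse_narrowpeak_line(example)
print(peak.chrom, "width:", peak.end - peak.start,
      "summit at:", peak.start + peak.summit)
```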

## Basic ChIP-Seq Workflow<a class="anchor" id="02.2"></a> <small>[[top](#top)]</small>

This workflow starts after you have received your sequencing data back from the sequencing facility (in fastq format). Due to time constraints we will focus only on peak calling and ChIP-Seq quality statistics, with a hands-on session that covers the annotation module (starred modules in the figure). More information is available in the extended documents in the GitHub repository.


<img src="workshop_extended/assets/f04_chipseq_basic_workflow.png" alt="Figure 01" style="float: center;"/>

## Peak calling<a class="anchor" id="02.3"></a> <small>[[top](#top)]</small>

> Goal: detect regions (peaks) of enrichment in our IP samples

At this point in the workflow we have a set of reads aligned to our reference genome (BAM format) for each 
sample/treatment/control in the experiment. In simple terms, peak calling software searches for regions in the 
genome with a greater than expected number of alignments ("sequencing tags") compared to the "background noise". 
When you use a control sample (e.g., IgG, or input), peak calling software can better model the "noise" 
and greatly reduce the number of false positives.
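
To make the idea of "more alignments than expected" concrete, here is a toy sketch that tests fixed windows against a Poisson background estimated from a control sample. It is a simplified illustration of the general principle, not the algorithm used by MACS2 or any other peak caller; the function names, the pseudocount, the threshold, and the window counts are all assumptions chosen for the example.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summing the upper tail in log space."""
    if k <= 0:
        return 1.0
    total, i = 0.0, k
    while True:
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        # Past the mean the terms shrink quickly; stop once they are negligible
        if i > lam and term < 1e-15 * max(total, 1e-300):
            break
        i += 1
    return min(total, 1.0)


def call_windows(ip_counts, control_counts, alpha=1e-5, pseudocount=1.0):
    """Return (window index, IP count, p-value) for windows enriched over control."""
    # Scale the control to the IP library size (a crude normalization)
    scale = sum(ip_counts) / sum(control_counts)
    hits = []
    for i, (ip, ctrl) in enumerate(zip(ip_counts, control_counts)):
        lam = (ctrl + pseudocount) * scale  # expected background tags
        p = poisson_sf(ip, lam)
        if p < alpha:
            hits.append((i, ip, p))
    return hits


# Made-up tag counts per window; window 3 carries the "peak"
ip = [4, 5, 3, 60, 6, 4, 5, 3]
control = [5, 4, 4, 6, 5, 5, 4, 5]
hits = call_windows(ip, control)
for index, count, p in hits:
    print(f"window {index}: {count} tags, p = {p:.2e}")
```

Real peak callers refine this considerably, for example by modeling local background and by correcting for multiple testing.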

<img src="workshop_extended/assets/f06_chipseq_tag_shift.png" width="450px" alt="Figure 03" style="float: left;"/>

### What does the ChIP-Seq signal look like?

* Enriched sequence tags cluster at locations bound by the <span style="color:#FF8C00">protein of interest</span> (e.g., transcription factor)
* Sequencing tags accumulate on both the <span style="color:#B22222">forward</span> and <span style="color:#6495ED">reverse</span> strands centered around the binding site. That is to say, the tags are _shifted_ away from the center.
* The distance (shift) from the center depends on the _fragment size_ of your sequencing library
* The input control sequences lack this pattern of shifted, strand-specific sequence tags
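
The strand shift described above can be sketched with toy coordinates. This example assumes a known fragment size and represents each tag by its 5' position; `shift_tags` is a hypothetical helper and the positions are invented. Moving forward-strand tags downstream and reverse-strand tags upstream by half the fragment size piles both strands up around the binding site.

```python
def shift_tags(forward_starts, reverse_starts, fragment_size):
    """Shift 5' tag positions by half the fragment size toward the fragment center."""
    half = fragment_size // 2
    return ([p + half for p in forward_starts] +
            [p - half for p in reverse_starts])


site = 1000          # hypothetical binding site
fragment_size = 200  # assumed library fragment size

# Forward tags start upstream of the site, reverse tags start downstream of it
forward = [site - 100 + j for j in (-4, 0, 3)]
reverse = [site + 100 + j for j in (-2, 1, 5)]

shifted = sorted(shift_tags(forward, reverse, fragment_size))
print("before:", forward, reverse)
print("after: ", shifted)
```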

[MACS2](http://liulab.dfci.harvard.edu/MACS/) is one of the most popular tools for detecting ChIP-seq peaks and is a good place to start for those who are new to peak detection.

```
Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008) vol. 9 (9) pp. R137
```


## ChIP-Seq quality control<a class="anchor" id="02.4"></a> <small>[[top](#top)]</small>

There are several metrics and tools out there for determining the quality of your ChIP-Seq experiment (see 
[ENCODE guidelines](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431496/)); however, we will only touch 
on two metrics:

1. Relative strand correlation (RSC)
2. Fraction of reads falling within peak regions (FRiP)

### RSC scores

* Quantify the sequencing tag clustering (IP enrichment) genome-wide
* Pearson correlation between the strands after shifting the strands by _k_ base pairs
* Two peaks are produced when cross-correlation is plotted against the shift value:
    1. A peak of enrichment corresponding to predominant fragment length
    2. A peak corresponding to the read length (called a "phantom" peak)
    
<img src="workshop_extended/assets/f07_chipseq_cc_plots.png" alt="Figure 04" style="float: center;"/>

* The ratio between the fragment-length peak and the read-length peak is the RSC $$RSC=\frac{cc\left(fragment\_length\right)-\min\left(cc\right)}{cc\left(read\_length\right)-\min\left(cc\right)}$$
* A good estimate for the signal-to-noise ratio in ChIP-seq experiments
* High-quality ChIP-seq datasets tend to have a **larger** fragment-length peak compared with the read-length peak
* ENCODE guidelines suggest that you **repeat samples with RSC values less than 0.8**
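
The RSC formula can be worked through on contrived data. The sketch below builds per-base 5' tag counts for each strand, computes the Pearson correlation at every strand shift, and applies the formula. All names and the toy data are assumptions for illustration: the "phantom" signal is planted by hand, and the resulting score is far cleaner than anything real data would give.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def cross_correlation(fwd, rev, max_shift):
    """cc[k] = correlation between forward counts and reverse counts shifted by k bp."""
    n = len(fwd)
    return [pearson(fwd[: n - k], rev[k:]) for k in range(max_shift + 1)]


def relative_strand_correlation(cc, fragment_length, read_length):
    """RSC = (cc(fragment_length) - min(cc)) / (cc(read_length) - min(cc))."""
    floor = min(cc)
    return (cc[fragment_length] - floor) / (cc[read_length] - floor)


GENOME_LEN, READ_LEN, FRAG_LEN = 2000, 50, 200  # toy values
fwd = [0] * GENOME_LEN
rev = [0] * GENOME_LEN

# True binding sites: tags on the two strands sit one fragment length apart
for site in (300, 700, 1100, 1500):
    for j in (-6, -3, 0, 0, 3, 6):
        fwd[site - FRAG_LEN // 2 + j] += 1
        rev[site + FRAG_LEN // 2 + j] += 1

# Hand-planted "phantom" signal: tag pairs one read length apart
for p in (130, 580, 1030, 1480):
    fwd[p] += 1
    rev[p + READ_LEN] += 1

cc = cross_correlation(fwd, rev, max_shift=400)
estimated_fragment = max(range(len(cc)), key=lambda k: cc[k])
score = relative_strand_correlation(cc, estimated_fragment, READ_LEN)
print("estimated fragment length:", estimated_fragment)
print("RSC:", round(score, 2))
```

In this toy the fragment-length peak dominates, so the score lands well above the 0.8 guideline; a weak IP would push the two peaks closer together and the score down.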

### FRiP

* Fraction of your mapped reads that fall into peak regions identified by a peak-calling algorithm
* Rough metric for estimating the global enrichment of ChIP-seq data
* Even in highly enriched ChIP-seq experiments, only a minority of reads occur in peaks (the majority are background)
* ENCODE has shown that FRiP values correlate positively and linearly with RSC
* ENCODE guidelines suggest that you **repeat experiments with FRiP values below 1%**

**NOTE: In the session questions, we accidentally provided really high FRiP estimates and those values will probably never be seen in your experiments**
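
For reference, the FRiP calculation itself is short. The sketch below is a simplification: each read is reduced to a single position on one chromosome, and peaks are assumed to be sorted, non-overlapping, half-open intervals. The `frip` helper and the toy reads and peaks are invented for this example.

```python
from bisect import bisect_right


def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any (start, end) peak interval."""
    starts = [start for start, _ in peaks]
    in_peaks = 0
    for pos in read_positions:
        # Find the last peak starting at or before this read
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < peaks[i][1]:
            in_peaks += 1
    return in_peaks / len(read_positions)


# Toy data: 100 evenly spaced reads and two narrow peaks
reads = list(range(0, 1000, 10))
peaks = [(100, 120), (500, 510)]

value = frip(reads, peaks)
print(f"FRiP = {value:.1%}")
print("passes the 1% guideline:", value >= 0.01)
```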
