# Introduction to ChIP-seq

Kyle Hernandez, Ph.D. [khernandez@bsd.uchicago.edu](mailto:khernandez@bsd.uchicago.edu)

License: LGPLv3

**2016 AMIA Pre-conference symposium**

This document briefly covers the basics of what ChIP-seq is and the types of questions it can answer. It is
by no means exhaustive. I provide some links and citations for you to read on your own time if you are
interested in more in-depth knowledge.

## What is ChIP-seq?

<img src="assets/f01_chipseq_overview.jpg" alt="Figure 01" style="width: 300px; float: right;"/>

* **Ch**romatin **I**mmuno**P**recipitation followed by **seq**uencing
* Sequencing of the genomic DNA fragments that co-precipitate with the protein of interest using high-throughput sequencing technologies
* Detect epigenetic changes
    * A "discovery" tool
    * Unbiased (theoretically)
    * Genome-wide

### Various types of ChIP-seq

| Protein of interest | Enriched genomic DNA fragments |
| ------------------- | ------------------------------ |
| Transcription factors | Promoter, enhancer, silencer, insulator, other cis elements |
| RNA polymerase | Regions under active transcription |
| DNA polymerase | Regions under replication |
| Modified histones | Chromatin modification |

...and more! We will focus on **transcription factors** in these sessions.


## What can we learn from ChIP-seq?

* Location
    * Where does my protein of interest bind?
* Quantification
    * How strong is the signal?
* Annotation
    * Which type of sequence motif is enriched/present in the peaks? _(We won't have time to go over motif analysis, but the [MEME suite](http://meme-suite.org/) is a great set of tools)_
    * What are the target genes?
    * Which networks/pathways are my target genes enriched in?
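
To make the "How strong is the signal?" question concrete, here is a minimal sketch of one common way to express signal strength: the read count in a region of the IP sample divided by the depth-scaled read count in the control. The function name, its parameters, the pseudocount, and the example counts are all assumptions for illustration; this is not the formula of any particular peak caller.

```python
def fold_enrichment(ip_count, input_count, ip_total, input_total, pseudocount=1.0):
    """Depth-normalized IP/INPUT ratio for one genomic region.

    All names are hypothetical; this is an illustrative sketch only.
    """
    if ip_total <= 0 or input_total <= 0:
        raise ValueError("library sizes must be positive")
    # Rescale INPUT counts to the IP library size so the depths are comparable.
    scaled_input = input_count * (ip_total / input_total)
    # The pseudocount avoids division by zero in regions with no INPUT reads.
    return (ip_count + pseudocount) / (scaled_input + pseudocount)


# Made-up counts for a single candidate region.
print(fold_enrichment(ip_count=120, input_count=20,
                      ip_total=20_000_000, input_total=20_000_000))
```

Because the control is rescaled by library size, an INPUT sample sequenced twice as deep with twice as many reads in the region gives the same ratio.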

## What we can't learn from ChIP-seq

* Gene expression changes (RNAseq)
* DNA sequence changes (WES, WGS, target amplicon sequencing)
* DNA methylation changes (MeDIPseq, bisulfite sequencing)
* RNA-protein interaction (CLIPseq)

## Overview of a ChIP-seq Experiment


<img src="assets/f02_chipseq_experiment.png" alt="Figure 02" style="float: center;"/>

## Experimental Design

* Antibody quality is important
    * Finding a _sensitive_ and _specific_ antibody to the protein of interest is the most crucial and challenging step
    * 20-35% of commercial "ChIP-grade" antibodies are unusable (modENCODE)
    * Check your antibody ahead of time if possible (e.g., Western blot)
    * Antibody list from ENCODE: https://www.encodeproject.org/search/?type=AntibodyLot
* You should use control samples to control for background noise
    * "Input" - genomic DNA that is crosslinked and fragmented but not immunoprecipitated (no IP); _most commonly used control_
    * "Mock IP" - DNA obtained by crosslinking + fragmentation + IP with a control antibody that reacts with an irrelevant, non-nuclear antigen (e.g., IgG); the amount of DNA recovered is limited, leading to inconsistent results
* You should use biological replicates
    * Recommend at least 2 biological replicates
    * Used to establish biological variability
    * Reduces false positives (more power)
* The million dollar question: How many reads do I need?
    * Transcription factors (sharp peaks)
        * 20+ million reads per sample
        * 40+ million per condition with 2 replicates
        * 150 million reads per Illumina HiSeq lane
            * multiplex 4 samples (2 IP + 2 INPUT / lane)
    * Histone modification / Nucleosome positioning (broad peaks)
        * 40+ million per sample
        * 400 million or more may be needed!
    * It is important to _try_ and keep similar sequencing depths between different IP experiments (e.g., treatments), and between IP and INPUT (or fewer in INPUT)
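
The read-depth arithmetic above can be sketched in a few lines. This assumes that reads are split evenly across multiplexed samples, which real runs only approximate, and it uses the 150 million reads per lane and the 20+/40+ million per-sample targets quoted in the list. The function and constant names are hypothetical.

```python
import math

LANE_READS = 150_000_000      # reads per Illumina HiSeq lane (figure from the list above)
TF_MIN_READS = 20_000_000     # transcription factors (sharp peaks)
BROAD_MIN_READS = 40_000_000  # histone marks / nucleosome positioning (broad peaks)


def reads_per_sample(lane_reads, samples_per_lane):
    """Expected reads per sample, assuming the lane is split evenly."""
    if samples_per_lane < 1:
        raise ValueError("need at least one sample per lane")
    return lane_reads / samples_per_lane


def max_samples_per_lane(lane_reads, min_reads_per_sample):
    """Largest number of samples one lane can hold at the target depth."""
    return lane_reads // min_reads_per_sample


def lanes_needed(n_samples, lane_reads, min_reads_per_sample):
    """Lanes required so the total yield covers every sample's target depth."""
    return math.ceil(n_samples * min_reads_per_sample / lane_reads)


# 2 IP + 2 INPUT samples multiplexed on one lane, as in the list above.
per_sample = reads_per_sample(LANE_READS, 4)
print(f"{per_sample / 1e6:.1f} million reads per sample")
print("meets TF target:", per_sample >= TF_MIN_READS)
print("lanes for 4 broad-peak samples:", lanes_needed(4, LANE_READS, BROAD_MIN_READS))
```

Under these assumptions, four samples on one lane clear the transcription factor target but fall short of the broad-peak target, which is why broad-peak experiments usually need fewer samples per lane or more lanes.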

## A must-read for ChIP-seq projects

> Landt et al., 2012. Genome Research 22:1813-1831

<img src="assets/f03_chipseq_encode_paper.png" alt="Figure 03" style="float: center;"/>