/
EpiCompare.Rmd
341 lines (263 loc) · 12.4 KB
/
EpiCompare.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
---
title: "`EpiCompare`: Getting started"
author: "<h4>Authors: <i>`r auths <- eval(parse(text = gsub('person','c',read.dcf('../DESCRIPTION', fields = 'Authors@R'))));paste(auths[names(auths)=='given'],auths[names(auths)=='family'], collapse = ', ')`</i></h4>"
date: "<h4>Vignette updated: <i>`r format( Sys.Date(), '%b-%d-%Y')`</i></h4>"
csl: nature.csl
output:
BiocStyle::html_document:
vignette: >
%\VignetteIndexEntry{EpiCompare}
%\usepackage[utf8]{inputenc}
%\VignetteEngine{knitr::rmarkdown}
---
# Overview
The *EpiCompare* package is designed to facilitate the comparison of epigenomic
datasets for quality control and benchmarking purposes. The package combines
several downstream analysis tools for epigenomic data and generates a single
report that collates all results of the analysis. This allows users to conduct
downstream analysis of multiple epigenomic datasets simultaneously and compare
the results in a simple and efficient way.
## Introduction
For many years, ChIP-seq has been the standard method for epigenomic profiling,
but it suffers from a host of limitations. Recently, many other epigenomic
technologies (e.g. CUT&Run, CUT&Tag and TIP-seq etc.), designed to overcome
these constraints, have been developed. To better understand the performance of
these novel approaches, it is important that we systematically compare these
technologies and benchmark against a “gold-standard”.
There are many tools in R (e.g. *ChIPseeker*) that can be used to conduct
downstream analysis and comparison of epigenomic datasets. However, these are
often scattered across different packages and difficult to use for researchers
with none or little computational experience.
*EpiCompare* is designed to provide a simple and comprehensive way of analysing
and comparing epigenomic datasets. It combines many useful downstream analysis
tools, which can easily be controlled by users and it collates the results in a
single report. This allows researchers to systematically compare different
epigenomic technologies.
While the main functionality of *EpiCompare* is to contrast epigenomic
technologies, it can also be used to compare datasets generated using different
experimental conditions and data analysis workflows of one technology. This
can help researchers to establish a consensus regarding the optimal use of the
method.
Currently, *EpiCompare* only works for human genome as it uses human-based
hg19 and/or hg38 genome references.
# Data
The *EpiCompare* package contains a small subset of histone mark H3K27ac profile
data obtained/generated from:
* ENCODE (data accession: ENCFF044JNJ)
* CUT&Tag from Kaya-Okur et al., (2019). (PMID: 31036827)
* CUT&Run from Meers et al., (2019). (PMID: 31232687)
It also contains human genome hg19 and hg38 blacklisted regions obtained from
ENCODE. The ENCODE blacklist includes regions of the genome that have anomalous
and/or unstructured signals independent of the cell-line or experiment. Removal
of ENCODE blacklist is recommended for quality measure.
These dataset will be used to showcase *EpiCompare* functionality
# Installation
To install the package, run the following:
```R
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("EpiCompare")
```
# Running *EpiCompare*
In this example analysis, we will compare CUT&Run and CUT&Tag of histone mark
H3K27ac against ENCODE ChIP-seq.
## Load package and example datasets
Once installed, load the package:
```{r setup_vignette}
library(EpiCompare)
```
Load example datasets used in this analysis:
```{r}
data("encode_H3K27ac") # ENCODE ChIP-seq
data("CnT_H3K27ac") # CUT&Tag
data("CnR_H3K27ac") # CUT&Run
data("hg19_blacklist") # hg19 genome blacklist
data("CnT_H3K27ac_picard") # CUT&Tag Picard summary output
data("CnR_H3K27ac_picard") # CUT&Run Picard summary output
```
## Prepare input data
### Peaklist
*EpiCompare* accepts datasets both as `GRanges` object and as paths to BED
files. Peakfiles (`GRanges` or paths) that you would like to analyse must be
listed and named (see below).
```R
# To import BED files as GRanges object
peak_GRanges <-ChIPseeker::readPeakFile("/path/to/peak/file.bed",as = "GRanges")
# EpiCompare also accepts paths (to BED files) as input
peak_path <- "/path/to/BED/file1.bed"
# Create named peak list
peaklist <- list(peak_GRanges, peak_path)
names(peaklist) <- c("sample1", "sample2")
```
In this example, we will use built-in data, which have been converted into
`GRanges` object previously (`CnT_H3K27ac` and `CnR_H3K27ac`).
```{r}
peaklist <- list(CnT_H3K27ac, CnR_H3K27ac) # create list of peakfiles
names(peaklist) <- c("CnT", "CnR") # set names
```
### Blacklist
ENCODE blacklist contains regions of the genome that have anomalous
and/or unstructured signals independent of the cell-line or experiment. Removal
of these regions from peakfiles is recommended for quality measure.
*EpiCompare* has three built-in blacklist files, hg19, hg38 and mm10, downloaded
from ENCODE. Run `?hg19_blacklist` for more information.
In this example analysis, since all peakfiles (`encode_H3K27ac`, `CnT_H3K27ac`,
`CnR_H3K27ac`) were generated using human genome reference build hg19,
`hg19_blacklist` will be used. For hg38, use `data(hg38_blacklist)`.
Please ensure that you specify the correct blacklist.
### Picard summary files
Note that this is *OPTIONAL*. If you want the report to include metrics on
DNA fragments (e.g. mapped fragments and duplication rate), please input summary
files from Picard.
[Picard *MarkDuplicates*](https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates) can be used to mark duplicate reads that are found within the alignment. This
tool outputs a metrics file with the ending `.MarkDuplicates.metrics.txt`. To
import this text file into R as data frame, use:
```R
picard <- read.table("/path/to/Picard/.MarkDuplicates.metrics.txt", header = TRUE, fill = TRUE)
```
In this example. we will use built-in data, which have been converted into data
frame previously (`CnT_H3K27ac_picard` and `CnR_H3K27ac_picard`). The files
must be listed and named:
```{r}
# create list of Picard summary
picard <- list(CnT_H3K27ac_picard, CnR_H3K27ac_picard)
names(picard) <- c("CnT", "CnR") # set names
```
### Reference file
This is OPTIONAL. If reference peak file is provided, `stat_plot` and
`chromHMM_plot` of overlapping peaks are included in the report (see
*Optional plots* section below).
Reference file must be listed and named. In this example, we will use built-in
data (`encode_H3K27ac`), which has been converted to `GRanges` previously:
```{r}
reference_peak <- list("ENCODE_H3K27ac" = encode_H3K27ac)
```
### Output Directory
When running `EpiCompare()`, please ensure that you specify `output_dir`. All
outputs (figures and HTML report) will be saved in the specified `output_dir`.
## Run EpiCompare
Running EpiCompare is done using the function, `EpiCompare()` . Users can choose
which analyses to run and include in the report by setting parameters to `TRUE`
or `FALSE`.
```{r}
EpiCompare(peakfiles = peaklist,
genome_build = "hg19",
blacklist = hg19_blacklist,
picard_files = picard,
reference = reference_peak,
upset_plot = FALSE,
stat_plot = FALSE,
chromHMM_plot = FALSE,
chromHMM_annotation = "K562",
chipseeker_plot = FALSE,
enrichment_plot = FALSE,
tss_plot = FALSE,
interact = FALSE,
save_output = FALSE,
output_filename = "EpiCompare_test",
output_timestamp = FALSE,
output_dir = tempdir())
```
### Optional plots
By default, these plots will not be included in the report unless set `TRUE`.
* `upset_plot` : Upset plot showing the number of overlapping peaks between
samples. *EpiCompare* uses `UpSetR` package.
* `stat_plot` : A `reference` peakfile must be included for this plot. The plot
displays distribution of statistical significance (q-values) of sample peaks
that are overlapping/non-overlapping with the `reference` dataset.
* `chromHMM_plot` : [ChromHMM](http://compbio.mit.edu/ChromHMM/#:~:text=ChromHMM%20is%20software%20for%20learning,and%20spatial%20patterns%20of%20marks.)
annotation of peaks. If `reference` is provided, ChromHMM annotation of
overlapping and non-overlapping peaks with the `reference` is also included in
the report.
* `chipseeker_plot` : `ChIPseeker` functional annotation of peaks.
* `enrichment_plot` : KEGG pathway and GO enrichment analysis of peaks.
* `tss_plot` : Peak frequency around (+/- 3000bp) transcriptional start site.
Note that it may take awhile to generate this plot for large sample sizes.
### Other options
* `chromHMM_annotation` : Cell-line annotation for ChromHMM. Default is K562.
Options are:
+ "K562" = K-562 cells
+ "Gm12878" = Cellosaurus cell-line GM12878
+ "H1hesc" = H1 Human Embryonic Stem Cell
+ "Hepg2" = Hep G2 cell
+ "Hmec" = Human Mammary Epithelial Cell
+ "Hsmm" = Human Skeletal Muscle Myoblasts
+ "Huvec" = Human Umbilical Vein Endothelial Cells
+ "Nhek" = Normal Human Epidermal Keratinocytes
+ "Nhlf" = Normal Human Lung Fibroblasts
* `interact` : By default, all heatmaps (percentage overlap and
ChromHMM heatmaps) in the report will be interactive. If set FALSE, all heatmaps
will be static. N.B. If `interact=TRUE`, interactive heatmaps will be saved as
html files, which may take time for larger sample sizes.
* `output_filename` : By default, the report is named EpiCompare.html. You can
specify the filename of the report here.
* `output_timestamp` : By default FALSE. If TRUE, the filename of the report
includes the date.
## Output
*EpiCompare* outputs
* An HTML report consisting of three sections:
+ 1. **General Metrics**: Metrics on peaks (percentage of blacklisted and
non-standard peaks, and peak widths) and fragments (duplication rate) of
samples.
+ 2. **Peak Overlap**: Percentage and statistical significance of overlapping
and non-overlapping peaks. Also includes upset plot.
+ 3. **Functional Annotation**: Functional annotation (ChromHMM, ChIPseeker
and enrichment analysis) of peaks. Also includes peak enrichment around TSS.
* `EpiCompare_file` containing all plots generated by EpiCompare if
`save_output = TRUE`.
Both outputs are saved in the specified `output_dir`.
# Future Enhancements
In the current version, *EpiCompare* only recognizes certain BED formats. We hope
to improve this. Moreover, if there are other downstream analysis tools that may
be suitable in *EpiCompare*, feel free to report this through
[Github](https://github.com/neurogenomics/EpiCompare).
# Code used to generate the example report
An example report comparing ATAC-seq, DNase-seq and ChIP-seq of K562 can be found
[here](https://neurogenomics.github.io/EpiCompare/articles/example_report.html).
Code used to generate this:
<details>
```R
## Load data
# ATAC-seq data: https://www.encodeproject.org/files/ENCFF558BLC/
atac1_hg38 <- ChIPseeker::readPeakFile("/path/to/bed/ENCFF558BLC.bed", as="GRanges")
# Dnase-seq data: https://www.encodeproject.org/files/ENCFF274YGF/
dna1_hg38 <- ChIPseeker::readPeakFile("/path/to/bed/ENCFF274YGF.bed", as="GRanges")
# Dnase-seq data: https://www.encodeproject.org/files/ENCFF185XRG/
dna2_hg38 <- ChIPseeker::readPeakFile("/path/to/bed/ENCFF185XRG.bed", as="GRanges")
# ChIP-seq data: https://www.encodeproject.org/files/ENCFF038DDS/
chip_hg38 <- ChIPseeker::readPeakFile("/path/to/bed/ENCODE_H3K27ac_hg38_ENCFF038DDS.bed", as="GRanges")
## Peaklist
peaklist <- list("ATAC_ENCFF558BLC" = atac1_hg38_unique,
"Dnase_ENCFF274YGF" = dna1_hg38,
"ChIP_ENCFF038DDS" = chip_hg38)
## Reference
reference <- list("Dnase_ENCFF185XRG_reference"=dna2_hg38)
## Blacklist
data("hg38_blacklist")
## Run Epicompare
EpiCompare(peakfiles = peaklist,
genome_build = "hg38",
genome_build_output = "hg38",
blacklist = hg38_blacklist,
reference = reference,
picard_files = NULL,
upset_plot = T,
stat_plot = T,
save_output = F,
chromHMM_plot = T,
chromHMM_annotation = "K562",
chipseeker_plot = T,
enrichment_plot = T,
tss_plot = T,
precision_recall_plot =T,
corr_plot = T,
interact = T,
output_dir = "/path/")
```
</details>
# Session Information
<details>
```{r Session_Info_vignette}
utils::sessionInfo()
```
</details>