/
derfinder-quickstart.Rmd
424 lines (330 loc) · 18.1 KB
/
derfinder-quickstart.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
---
title: "derfinder quick start guide"
author:
- name: Leonardo Collado-Torres
affiliation:
- &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus
- &ccb Center for Computational Biology, Johns Hopkins University
email: lcolladotor@gmail.com
output:
BiocStyle::html_document:
self_contained: yes
toc: true
toc_float: true
toc_depth: 2
code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('derfinder')`"
vignette: >
%\VignetteIndexEntry{derfinder quick start guide}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Basics
## Install `r Biocpkg('derfinder')`
`R` is an open-source statistical environment which can be easily modified to enhance its functionality via packages. `r Biocpkg('derfinder')` is a `R` package available via the [Bioconductor](http://bioconductor/packages/derfinder) repository for packages. `R` can be installed on any operating system from [CRAN](https://cran.r-project.org/) after which you can install `r Biocpkg('derfinder')` by using the following commands in your `R` session:
```{r 'installDer', eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("derfinder")
## Check that you have a valid Bioconductor installation
BiocManager::valid()
```
## Required knowledge
`r Biocpkg('derfinder')` is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data. That is, packages like `r Biocpkg('Rsamtools')`, `r Biocpkg('GenomicAlignments')` and `r Biocpkg('rtracklayer')` that allow you to import the data. A `r Biocpkg('derfinder')` user is not expected to deal with those packages directly but will need to be familiar with `r Biocpkg('GenomicRanges')` to understand the results `r Biocpkg('derfinder')` generates. It might also prove to be highly beneficial to check the `r Biocpkg('BiocParallel')` package for performing parallel computations.
If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in [this blog post](http://lcolladotor.github.io/2014/10/16/startbioc/#.VkOKbq6rRuU).
## Asking for help
As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But `R` and `Bioconductor` have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the [Bioconductor support site](https://support.bioconductor.org/) as the main resource for getting help: remember to use the `derfinder` tag and check [the older posts](https://support.bioconductor.org/t/derfinder/). Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the [posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.
We would like to highlight the `r Biocpkg('derfinder')` user [Jessica Hekman](https://support.bioconductor.org/u/6877/). She has used `r Biocpkg('derfinder')` with non-human data, and in the process of doing so discovered some small bugs or sections of the documentation that were not clear.
## Citing `r Biocpkg('derfinder')`
We hope that `r Biocpkg('derfinder')` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!
```{r 'citation'}
## Citation info
citation("derfinder")
```
# Quick start to using to `r Biocpkg('derfinder')`
```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE}
## Track time spent on making the vignette
startTime <- Sys.time()
## Bib setup
library("RefManageR")
## Write bibliography information
bib <- c(
derfinder = citation("derfinder")[1],
BiocStyle = citation("BiocStyle"),
knitr = citation("knitr")[3],
rmarkdown = citation("rmarkdown")[1],
brainspan = RefManageR::BibEntry(
bibtype = "Unpublished",
key = "brainspan",
title = "Atlas of the Developing Human Brain [Internet]. Funded by ARRA Awards 1RC2MH089921-01, 1RC2MH090047-01, and 1RC2MH089929-01.",
author = "BrainSpan", year = 2011, url = "http://www.brainspan.org/"
),
originalder = citation("derfinder")[2],
R = citation(),
IRanges = citation("IRanges"),
sessioninfo = citation("sessioninfo"),
testthat = citation("testthat"),
GenomeInfoDb = RefManageR::BibEntry(
bibtype = "manual",
key = "GenomeInfoDb",
author = "Sonali Arora and Martin Morgan and Marc Carlson and H. Pagès",
title = "GenomeInfoDb: Utilities for manipulating chromosome and other 'seqname' identifiers",
year = 2017, doi = "10.18129/B9.bioc.GenomeInfoDb"
),
GenomicRanges = citation("GenomicRanges"),
ggplot2 = citation("ggplot2"),
bumphunter = citation("bumphunter")[1],
TxDb.Hsapiens.UCSC.hg19.knownGene = citation("TxDb.Hsapiens.UCSC.hg19.knownGene"),
AnnotationDbi = RefManageR::BibEntry(
bibtype = "manual",
key = "AnnotationDbi",
author = "Hervé Pagès and Marc Carlson and Seth Falcon and Nianhua Li",
title = "AnnotationDbi: Annotation Database Interface",
year = 2017, doi = "10.18129/B9.bioc.AnnotationDbi"
),
BiocParallel = citation("BiocParallel"),
derfinderHelper = citation("derfinderHelper"),
GenomicAlignments = citation("GenomicAlignments"),
GenomicFeatures = citation("GenomicFeatures"),
GenomicFiles = citation("GenomicFiles"),
Hmisc = citation("Hmisc"),
qvalue = citation("qvalue"),
Rsamtools = citation("Rsamtools"),
rtracklayer = citation("rtracklayer"),
S4Vectors = RefManageR::BibEntry(
bibtype = "manual", key = "S4Vectors",
author = "Hervé Pagès and Michael Lawrence and Patrick Aboyoun",
title = "S4Vectors: S4 implementation of vector-like and list-like objects",
year = 2017, doi = "10.18129/B9.bioc.S4Vectors"
),
bumphunterPaper = RefManageR::BibEntry(
bibtype = "article",
key = "bumphunterPaper",
title = "Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies",
author = "Jaffe, Andrew E and Murakami, Peter and Lee, Hwajin and Leek, Jeffrey T and Fallin, M Daniele and Feinberg, Andrew P and Irizarry, Rafael A",
year = 2012, journal = "International Journal of Epidemiology"
),
derfinderData = citation("derfinderData"),
RefManageR = citation("RefManageR")[1]
)
```
Here is a very quick example of a DER Finder analysis. This analysis is explained in more detail later on in this document.
```{r 'ultraQuick', eval = FALSE}
## Load libraries
library("derfinder")
library("derfinderData")
library("GenomicRanges")
## Determine the files to use and fix the names
files <- rawFiles(system.file("extdata", "AMY", package = "derfinderData"),
samplepatt = "bw", fileterm = NULL
)
names(files) <- gsub(".bw", "", names(files))
## Load the data from disk -- On Windows you have to load data from Bam files
fullCov <- fullCoverage(files = files, chrs = "chr21", verbose = FALSE)
## Get the region matrix of Expressed Regions (ERs)
regionMat <- regionMatrix(fullCov, cutoff = 30, L = 76, verbose = FALSE)
## Get pheno table
pheno <- subset(brainspanPheno, structure_acronym == "AMY")
## Identify which ERs are differentially expressed, that is, find the DERs
library("DESeq2")
## Round matrix
counts <- round(regionMat$chr21$coverageMatrix)
## Round matrix and specify design
dse <- DESeqDataSetFromMatrix(counts, pheno, ~ group + gender)
## Perform DE analysis
dse <- DESeq(dse, test = "LRT", reduced = ~gender, fitType = "local")
## Extract results
mcols(regionMat$chr21$regions) <- c(mcols(regionMat$chr21$regions), results(dse))
## Save info in an object with a shorter name
ers <- regionMat$chr21$regions
ers
```
# Introduction
`r Biocpkg('derfinder')` is an R package that implements the DER Finder approach `r Citep(bib[['originalder']])` for RNA-seq data. Briefly, this approach is an alternative to feature-counting and transcript assembly. The basic idea is to identify contiguous base-pairs in the genome that present differential expression signal. These base-pairs are grouped into _d_ifferentially _e_xpressed _r_regions (DERs). This approach is annotation-agnostic which is a feature you might be interested in. In particular, `r Biocpkg('derfinder')` contains functions that allow you to identify DERs via two alternative methods. You can find more details on the full [derfinder users guide](derfinder-users-guide.html).
# Sample DER Finder analysis
This is a brief overview of what a DER Finder analysis looks like. In particular, here we will be identifying expressed regions (ERs) without relying on annotation. Next, we'll identify candidate differentially expressed regions (DERs). Finally, we'll compare the DERs with known annotation features.
We first load the required packages.
```{r 'start', message=FALSE}
## Load libraries
library("derfinder")
library("derfinderData")
library("GenomicRanges")
```
Next, we need to locate the chromosome 21 coverage files for a set of 12 samples. These samples are a small subset from the _BrainSpan Atlas of the Human Brain_ `r Citep(bib[['brainspan']])` publicly available data. The function `rawFiles()` helps us in locating these files.
```{r 'locateAMYfiles'}
## Determine the files to use and fix the names
files <- rawFiles(system.file("extdata", "AMY", package = "derfinderData"),
samplepatt = "bw", fileterm = NULL
)
names(files) <- gsub(".bw", "", names(files))
```
Next, we can load the full coverage data into memory using the `fullCoverage()` function. Note that the _BrainSpan_ data is already normalized by the total number of mapped reads in each sample. However, that won't be the case with most data sets in which case you might want to use the `totalMapped` and `targetSize` arguments. The function `getTotalMapped()` will be helpful to get this information.
```{r 'getData', eval = .Platform$OS.type != "windows"}
## Load the data from disk
fullCov <- fullCoverage(
files = files, chrs = "chr21", verbose = FALSE,
totalMapped = rep(1, length(files)), targetSize = 1
)
```
```{r 'getDataWindows', eval = .Platform$OS.type == "windows", echo = FALSE}
## Load data in Windows case
foo <- function() {
load(system.file("extdata", "fullCov", "fullCovAMY.RData",
package = "derfinderData"
))
return(fullCovAMY)
}
fullCov <- foo()
```
Now that we have the data, we can identify expressed regions (ERs) by using a cutoff of 30 on the base-level mean coverage from these 12 samples. Once the regions have been identified, we can calculate a coverage matrix with one row per ER and one column per sample (12 in this case). For doing this calculation we need to know the length of the sequence reads, which in this study were 76 bp long.
```{r 'regionMatrix'}
## Get the region matrix of Expressed Regions (ERs)
regionMat <- regionMatrix(fullCov, cutoff = 30, L = 76, verbose = FALSE)
```
`regionMatrix()` returns a list of elements, each with three pieces of output. The actual ERs are arranged in a `GRanges` object named `regions`.
```{r 'exploreRegMatRegs'}
## regions output
regionMat$chr21$regions
```
`bpCoverage` is the base-level coverage list which can then be used for plotting.
```{r 'exploreRegMatBP'}
## Base-level coverage matrices for each of the regions
## Useful for plotting
lapply(regionMat$chr21$bpCoverage[1:2], head, n = 2)
## Check dimensions. First region is 565 long, second one is 138 bp long.
## The columns match the number of samples (12 in this case).
lapply(regionMat$chr21$bpCoverage[1:2], dim)
```
The end result of the coverage matrix is shown below. Note that the coverage has been adjusted for read length. Because reads might not fully align inside a given region, the numbers are generally not integers but can be rounded if needed.
```{r 'exploreRegMatrix'}
## Dimensions of the coverage matrix
dim(regionMat$chr21$coverageMatrix)
## Coverage for each region. This matrix can then be used with limma or other pkgs
head(regionMat$chr21$coverageMatrix)
```
We can then use the coverage matrix and packages such as `r Biocpkg('limma')`, `r Biocpkg('DESeq2')` or `r Biocpkg('edgeR')` to identify which ERs are differentially expressed. Here we'll use `r Biocpkg('DESeq2')` for which we need some phenotype data.
```{r 'phenoData'}
## Get pheno table
pheno <- subset(brainspanPheno, structure_acronym == "AMY")
```
Now we can identify the DERs using a rounded version of the coverage matrix.
```{r 'identifyDERsDESeq2'}
## Identify which ERs are differentially expressed, that is, find the DERs
library("DESeq2")
## Round matrix
counts <- round(regionMat$chr21$coverageMatrix)
## Round matrix and specify design
dse <- DESeqDataSetFromMatrix(counts, pheno, ~ group + gender)
## Perform DE analysis
dse <- DESeq(dse, test = "LRT", reduced = ~gender, fitType = "local")
## Extract results
mcols(regionMat$chr21$regions) <- c(
mcols(regionMat$chr21$regions),
results(dse)
)
## Save info in an object with a shorter name
ers <- regionMat$chr21$regions
ers
```
We can then compare the DERs against known annotation to see which DERs overlap known exons, introns, or intergenic regions. A way to visualize this information is via a Venn diagram which we can create using `vennRegions()` from the `r Biocpkg('derfinderPlot')` package as shown in Figure \@ref(fig:vennRegions).
```{r 'vennRegions', fig.cap = "Venn diagram showing ERs by annotation class."}
## Find overlaps between regions and summarized genomic annotation
annoRegs <- annotateRegions(ers, genomicState$fullGenome, verbose = FALSE)
library("derfinderPlot")
venn <- vennRegions(annoRegs,
counts.col = "blue",
main = "Venn diagram using TxDb.Hsapiens.UCSC.hg19.knownGene annotation"
)
```
We can also identify the nearest annotated feature. In this case, we'll look for the nearest known gene from the UCSC hg19 annotation.
```{r 'nearestGene'}
## Load database of interest
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
txdb <- keepSeqlevels(TxDb.Hsapiens.UCSC.hg19.knownGene, "chr21")
## Find nearest feature
library("bumphunter")
genes <- annotateTranscripts(txdb)
annotation <- matchGenes(ers, subject = genes)
## Restore seqlevels
txdb <- restoreSeqlevels(txdb)
## View annotation results
head(annotation)
## You can use derfinderPlot::plotOverview() to visualize this information
```
We can check the base-level coverage information for some of our DERs. In this example we do so for the first 5 ERs (Figures \@ref(fig:firstfive1), \@ref(fig:firstfive2), \@ref(fig:firstfive3), \@ref(fig:firstfive4), \@ref(fig:firstfive5)).
```{r 'firstfive', fig.cap = paste0("Base-pair resolution plot of differentially expressed region ", 1:5, ".")}
## Extract the region coverage
regionCov <- regionMat$chr21$bpCoverage
plotRegionCoverage(
regions = ers, regionCoverage = regionCov,
groupInfo = pheno$group, nearestAnnotation = annotation,
annotatedRegions = annoRegs, whichRegions = seq_len(5), txdb = txdb,
scalefac = 1, ask = FALSE, verbose = FALSE
)
```
You can then use the `r Biocpkg('regionReport')` package to generate interactive HTML reports exploring the results.
If you are interested in using `r Biocpkg('derfinder')` we recommend checking the [derfinder users guide](derfinder-users-guide.html) and good luck with your analyses!
# Reproducibility
This package was made possible thanks to:
* R `r Citep(bib[['R']])`
* `r Biocpkg('AnnotationDbi')` `r Citep(bib[['AnnotationDbi']])`
* `r Biocpkg('BiocParallel')` `r Citep(bib[['BiocParallel']])`
* `r Biocpkg('bumphunter')` `r Citep(bib[['bumphunter']])` and `r Citep(bib[['bumphunterPaper']])`
* `r Biocpkg('derfinderHelper')` `r Citep(bib[['derfinderHelper']])`
* `r Biocpkg('GenomeInfoDb')` `r Citep(bib[['GenomeInfoDb']])`
* `r Biocpkg('GenomicAlignments')` `r Citep(bib[['GenomicAlignments']])`
* `r Biocpkg('GenomicFeatures')` `r Citep(bib[['GenomicFeatures']])`
* `r Biocpkg('GenomicFiles')` `r Citep(bib[['GenomicFiles']])`
* `r Biocpkg('GenomicRanges')` `r Citep(bib[['GenomicRanges']])`
* `r CRANpkg('Hmisc')` `r Citep(bib[['Hmisc']])`
* `r Biocpkg('IRanges')` `r Citep(bib[['IRanges']])`
* `r Biocpkg('qvalue')` `r Citep(bib[['qvalue']])`
* `r Biocpkg('Rsamtools')` `r Citep(bib[['Rsamtools']])`
* `r Biocpkg('rtracklayer')` `r Citep(bib[['rtracklayer']])`
* `r Biocpkg('S4Vectors')` `r Citep(bib[['S4Vectors']])`
* `r Biocexptpkg('derfinderData')` `r Citep(bib[['derfinderData']])`
* `r CRANpkg('sessioninfo')` `r Citep(bib[['sessioninfo']])`
* `r CRANpkg('ggplot2')` `r Citep(bib[['ggplot2']])`
* `r CRANpkg('knitr')` `r Citep(bib[['knitr']])`
* `r Biocpkg('BiocStyle')` `r Citep(bib[['BiocStyle']])`
* `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`
* `r CRANpkg('rmarkdown')` `r Citep(bib[['rmarkdown']])`
* `r CRANpkg('testthat')` `r Citep(bib[['testthat']])`
* `r Biocannopkg('TxDb.Hsapiens.UCSC.hg19.knownGene')` `r Citep(bib[['TxDb.Hsapiens.UCSC.hg19.knownGene']])`
Code for creating the vignette
```{r createVignette, eval=FALSE}
## Create the vignette
library("rmarkdown")
system.time(render("derfinder-quickstart.Rmd", "BiocStyle::html_document"))
## Extract the R code
library("knitr")
knit("derfinder-quickstart.Rmd", tangle = TRUE)
```
Date the vignette was generated.
```{r reproducibility1, echo=FALSE}
## Date the vignette was generated
Sys.time()
```
Wallclock time spent generating the vignette.
```{r reproducibility2, echo=FALSE}
## Processing time in seconds
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits = 3)
```
`R` session information.
```{r reproducibility3, echo=FALSE}
## Session info
library("sessioninfo")
options(width = 120)
session_info()
```
# Bibliography
This vignette was generated using `r Biocpkg('BiocStyle')` `r Citep(bib[['BiocStyle']])`
with `r CRANpkg('knitr')` `r Citep(bib[['knitr']])` and `r CRANpkg('rmarkdown')` `r Citep(bib[['rmarkdown']])` running behind the scenes.
Citations made with `r CRANpkg('RefManageR')` `r Citep(bib[['RefManageR']])`.
```{r vignetteBiblio, results = 'asis', echo = FALSE, warning = FALSE, message = FALSE}
## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))
```