/
AnalyzingMultipleDatasets.Rmd
215 lines (174 loc) · 7.65 KB
/
AnalyzingMultipleDatasets.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
---
title: "Analyzing Multiple Datasets"
package: BRGenomics
output:
BiocStyle::html_document:
toc: true
toc_float: true
BiocStyle::pdf_document:
toc: true
vignette: |
%\VignetteIndexEntry{Analyzing Multiple Datasets}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
---
# Lists
To efficiently work with several datasets, we recommend storing the GRanges
objects within a standard, named list, e.g. `grl <- list(a_rep1 = gr1, b_rep1 =
gr2, ...)`.
```{r, message = FALSE}
library(BRGenomics)
data("PROseq")
data("txs_dm6_chr4")
```
```{r}
# make 3 datasets
ps1 <- PROseq[seq(1, length(PROseq), 3)]
ps2 <- PROseq[seq(2, length(PROseq), 3)]
ps3 <- PROseq[seq(3, length(PROseq), 3)]
# use the "=" assignment in list() to give names to the list elements
ps_list <- list(ps1 = ps1, ps2 = ps2, ps3 = ps3)
names(ps_list)
ps_list
```
A named list like the one above can be passed as an argument to nearly every
function in BRGenomics, and many functions will automatically return
dataframes, or melted dataframes that use the list names as the sample names
(which can simplify plotting with `ggplot2` or `lattice`).
---
_Note that BRGenomics does also support the use of `GRangesList` or
`CompressedGRangesList` classes for grouping multiple datasets. However, care
should be taken when using these, as many functions with methods for `GRanges`
objects also have methods for `GRangesList` objects. Furthermore, `GRangesList`
objects can be automatically coerced into `CompressedGRangesList` objects,
which can lower memory usage but can also incur significant performance
penalties._
---
## Example: Summarizing Signal from Multiple Samples
We can pass our list of GRanges directly to `getCountsByRegions` to
simultaneously count reads for each dataset, and melt the result for plotting
with `ggplot`:
```{r}
getCountsByRegions(ps_list, txs_dm6_chr4[1:5], ncores = 1)
```
```{r}
# melt, and use the optional region_names argument
txs_counts <- getCountsByRegions(ps_list, txs_dm6_chr4, melt = TRUE,
region_names = txs_dm6_chr4$tx_name,
ncores = 1)
head(txs_counts)
```
```{r}
library(ggplot2)
ggplot(txs_counts, aes(x = sample, y = signal, fill = sample)) +
geom_violin() +
theme_bw()
```
## Example: Making Browser Shots in R
By using the `getCountsByPositions()` on several datasets over a single
region-of-interest, we can bake our own genome browser shots within R. Using
the same region we plotted earlier:
```{r}
cbp_maxtx <- getCountsByPositions(ps_list, txs_dm6_chr4[135],
melt = TRUE, ncores = 1)
head(cbp_maxtx)
```
```{r, fig.height=4, fig.width=6}
ggplot(cbp_maxtx, aes(x = position, y = signal)) +
facet_wrap(~sample, ncol = 1, strip.position = "right") +
geom_col(size = 0.5, color = "darkgray") +
coord_cartesian(expand = FALSE) +
labs(title = txs_dm6_chr4$tx_name[135],
x = "Distance from TSS", y = "PRO-seq Signal") +
theme_classic() +
theme(strip.text.y = element_text(angle = 0),
strip.background = element_blank(),
axis.line.x = element_blank(),
axis.ticks.x = element_blank())
```
# Multiplexed GRanges
There is another data structure that is widely supported with BRGenomics,
called a __multiplexed GRanges__. A multiplexed GRanges is a single GRanges
object that contains multiple metadata fields containing basepair-resolution
coverage data for different datasets. We currently recommend users use lists of
GRanges objects, but you might find that multiplexed GRanges objects have
performance benefits for your data (notes on this below).
A multiplexed GRanges object can be made using the `mergeGRangesData()` with
option `multiplex = TRUE`.
```{r}
ps_multi <- mergeGRangesData(ps1, ps2, ps3, multiplex = TRUE, ncores = 1)
ps_multi
```
As in any GRanges object, the metadata fields are accessible as a dataframe
with the `mcols()` function, and individual columns are accessible with the `$`
operator.^[The dataframe, by default, is not a base R `data.frame`, but rather
an S4 `DataFrame`. The distinction isn't important for end users, and it's
unlikely users will encounter any reason to coerce the class using
`as.data.frame`.]
```{r}
mcols(ps_multi)
```
```{r, collapse = TRUE}
ps_multi$ps1[1:5]
```
This data structure can be passed to most functions in BRGenomics. To be
explicit, users should set the `field` argument, when it exists, to set the
datasets for which calculations should be performed:
```{r}
# for all datasets (all fields), get counts in the first 5 transcripts
getCountsByRegions(ps_multi, txs_dm6_chr4[1:5],
field = names(mcols(ps_multi)),
ncores = 1)
# get counts for ps2 dataset only
getCountsByRegions(ps_multi, txs_dm6_chr4[1:5],
field = "ps2", ncores = 1)
# if no field is given, most functions will default to using all fields
getCountsByRegions(ps_multi, txs_dm6_chr4[1:5],
ncores = 1)
```
In our experience with PRO-seq data, and to our surprise, a list of GRanges
objects typically outperforms multiplexed GRanges on a typical laptop, despite
the theoretical benefits of multiplexing.
---
_In principle, the multiplexed GRanges structure is designed to reduce memory
usage and increase performance, but this has not been our experience when
dealing with sparse, basepair-resolution data like PRO-seq. Overlapping signal
with regions of interest (e.g. in `getCountsByRegions()` or
`getCountsByPosition()`) is relatively efficient for reasonably sized GRanges
objects, and these calculations scale reasonably well with multicore
processing. While multiplexing means that signal overlapping only has to be
performed once, we've found that this is a relatively small benefit in practice
that is easily offset by having to do signal calculations on large sparse
vectors of signal counts. However, the relative benefits of using lists or
multiplexing are likely to depend on the nature of the data being analyzed, as
well as well as computer hardware._
---
# Saving Binary R Files
While the handling of GRanges data in BRGenomics is relatively fast, the
initial importation of bigWig or bedGraph files as GRanges objects remains a
noticeable bottleneck. This may be tolerable for interactive workflows, in
which data is imported once before undergoing lengthy analysis, but the
bottleneck is a significant detriment for users who regularly import data.
To avoid this bottleneck, users can save reusable data structures as binary R
files, which effectively save the memory state of R objects. Not only are these
objects rapidly reloaded into memory upon importation, but they have the added
benefit of saving the user from repeating data formatting.
Any R object can be saved to storage using the `saveRDS()` function, and
re-imported using `readRDS()`.
```{r, eval = FALSE}
# save the PRO-seq GRanges for later import
saveRDS(PROseq, file = "~/PROseq.RData")
# save a list of GRanges
saveRDS(ps_list, file = "~/ps_list.RData")
# re-import
ps_list <- readRDS("~/ps_list.RData")
```
The `save()` and `load()` commands can also be used to accomplish the same
thing, although they work slightly differently.^[Unlike the `saveRDS()`/
`readRDS()` commands, the `read()`/`load()` commands maintain the original name
of the objects. For instance, if you `save(PROseq, file = "~/ps.RData")`, and
in a new R session run `read("~/ps.RData")`, a new object called `PROseq` will
be created in your new environment. (Note that this is the same way that
RStudio saves your current working environment to disk, i.e. it saves the
entire environment into an RData file, which can then be reloaded, remaking
every data object).]