/
SpikeInNormalization.Rmd
290 lines (221 loc) · 11.2 KB
/
SpikeInNormalization.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
---
title: "Spike-in Normalization"
package: BRGenomics
output:
BiocStyle::html_document:
toc: true
toc_float: true
BiocStyle::pdf_document:
toc: true
vignette: |
%\VignetteIndexEntry{Spike-in Normalization}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
---
BRGenomics includes useful utilities for spike-in normalization.
A typical approach is to add the spike-in (either exogenous cells or synthetic
oligonucleotides) before library preparation, and to subsequently map to a
__combined genome__ containing both the target organisms chromosomes (to map the
experimental reads) as well as sequences/chromosomes for the spike-in.
This so-called "competitive alignment" results in the creation of BAM files
containing a mix of chromosomes, for which it should be straightforward to
identify the spike-in chromosomes.
# Counting Spike-in Reads
```{r, message = FALSE}
library(BRGenomics)
```
```{r, echo = FALSE}
gr1_rep1 <- GRanges(seqnames = c("chr1", "chr2", "spikechr1", "spikechr2"),
ranges = IRanges(start = 1:4, width = 1),
strand = "+")
gr2_rep2 <- gr2_rep1 <- gr1_rep2 <- gr1_rep1
# set readcounts
score(gr1_rep1) <- c(1, 1, 1, 1) # 2 exp + 2 spike = 4 total
score(gr2_rep1) <- c(2, 2, 1, 1) # 4 exp + 2 spike = 6 total
score(gr1_rep2) <- c(1, 1, 2, 1) # 2 exp + 3 spike = 5 total
score(gr2_rep2) <- c(4, 4, 2, 2) # 8 exp + 4 spike = 12 total
grl <- list(gr1_rep1, gr2_rep1,
gr1_rep2, gr2_rep2)
names(grl) <- c("gr1_rep1", "gr2_rep1",
"gr1_rep2", "gr2_rep2")
```
For this section, we'll use a list of 4 dummy datasets containing normal, as
well as spike-in chromosomes. Consider the first 2 datasets:
```{r}
grl[1:2]
```
We can identify the spike-in chromosomes either by full names, or by a regular
expression that matches the spike-in chromosomes. In this case, we named our
spike-in chromosomes to contain the string "spike" which makes them easy to
identify.
To count the reads for each dataset:
```{r}
getSpikeInCounts(grl, si_pattern = "spike", ncores = 1)
```
# Filtering Spike-in Reads
We can also remove the spike-in reads from our data:
```{r}
removeSpikeInReads(grl[1:2], si_pattern = "spike", ncores = 1)
```
And if we wanted to isolate the spike-in reads, there is an analogous
`getSpikeInReads()` function.
# The Spike-in Normalization Factor
There are several methods by which to generate spike-in normalization factors,
but we advocate for a particular method, which generates units we call
<b>S</b>pike-in normalized <b>R</b>eads <b>P</b>er <b>M</b>illion mapped reads
in the negative <b>C</b>ontrol (<b>SRPMC</b>). The SRPMC normalization factor
for a given sample $i$ is defined as such:
$$SRPMC:\ NF_i = \frac{\sum reads_{spikein, control}}{\sum reads_{spikein, i}}
\cdot \frac{10^6}{\sum reads_{experimental, control}}$$
This expression effectively calculates Reads Per Million (RPM) normalization for
the negative control, and all other samples $i$ are scaled into equivalent units
according to the ratio of their spike-in reads. We provide a more explicit
derivation below.
## Derivation of SRPMC
The fundamental concept of spike-in normalization is that the ratio of
experimental reads to spike-in reads can be used to correct for global changes
in starting material. Let's call this ratio <u>R</u>eads <u>P</u>er
<u>S</u>pike-in read (<b>RPS</b>):
$$RPS = \frac{\sum reads_{experimental}}{\sum reads_{spikein}}$$
In isolation, this number only reflects the relative amounts of spike-in
material recovered and mapped. The meaningful information about changes in
material can only arise from making direct comparisons between samples. For any
sample, $i$, we can calculate the global change in signal as a proportion of the
material recovered from a negative control:
$$RelativeSignal_i = \frac{RPS_i}{RPS_{control}}$$
The usual purpose of spike-in normalization is to measure a biological
difference in total material (e.g. RNA) between samples, and the above ratio is
a direct measurement of this.
To generate normalization factors, we use the above ratio to adjust RPM (Reads
Per Million mapped reads) normalization factors, which we define below for
clarity:
$$RPM:\ NF_i = \frac{1}{\frac{\sum{reads_i}}{10^6}} = \frac{10^6}{\sum{reads_i}}$$
(Unless indicated, $reads$ refers to non-spike-in reads).
RPM normalization (i.e. read depth normalization) is the simplest and likely
most familiar form of normalization. For a basal (unperturbed) negative control,
RPM should produce the most portable metric of signal, given that we intend for
the negative control to demonstrate typical physiology, and we hope that this
state is reproducible. We therefore want to have our normalized signal in unit
terms of RPM in the negative control.
To accomplish this, we multiply the ratio of spike-in normalized reads between
the sample $i$ and the negative control to the RPM normalization factor for
sample $i$. This converts readcounts into units we summarize as <u>S</u>pike-in
normalized <u>R</u>eads <u>P</u>er <u>M</u>illion mapped reads in the negative
<u>C</u>ontrol (SRPMC):
$$SRPMC:\ NF_i = \frac{RPS_i}{RPS_{control}} \cdot \frac{10^6}{\sum{reads_i}}$$
Again, SRPMC results in the negative control being RPM (sequencing depth)
normalized, while all other samples are in equivalent, directly comparable
units. And we've effectively determined the relative scaling of those samples
based on the ratios of spike-in reads.
This becomes more apparent if we substitute the $RPS$ variables above.
$\sum{reads_i}$ cancels, and simplifying the fraction produces the original
formula:
$$SRPMC:\ NF_i = \frac{\sum reads_{spikein, control}}{\sum reads_{spikein, i}}
\cdot \frac{10^6}{\sum reads_{experimental, control}}$$
# Calculating Normalization Factors
We can calculate SRPMC normalization factors for each sample using the
`getSpikeInNFs()` function, using the same syntax we used to count the spike-in
reads.
However, we have to also identify the negative control, which is the sample that
will have the "reference" (RPM) normalization. We do this either using a regular
expression (`ctrl_pattern` argument) or by supplying the name(s) of the negative
controls (the `ctrl_names` argument).
The default method is "SRPMC", but there are other options, as well.
```{r}
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1", ncores = 1)
```
(The NFs are high because our dummy data contain only a few reads).
By default, normalization factors utilize __batch normalization__, such that in
any replicate (identified by the characters following "rep" in the sample
names), the negative control is RPM normalized, and the other conditions are
normalized to the within-replicate negative control (see the documentation for
further details).
Currently, batch normalization requires the sample names end with strings
matching the format `"_rep1"`, `"_rep2"`, etc. If sample names do not conform to
this pattern, you can rename them by using the `sample_names` argument.
# Normalizing Data
We can also use the `spikeInNormGRanges()` function to simultaneously find the
spike-in reads, calculate the spike-in normalization factors, filter out
spike-in reads, and normalize the readcounts:
```{r}
spikeInNormGRanges(grl, si_pattern = "spike", ctrl_pattern = "gr1", ncores = 1)
```
# Normalization by Sub-sampling
## Rationale
When viewing genomics data in a genome browser (or otherwise plotting signal for
a single gene), the sparsity of basepair resolution data can challenge our
visual perception.
Consider two datasets from identical samples, but where one is sequenced to a
higher depth. The two datasets can be normalized such that the signal counts are
equivalent, but the dataset with higher sequencing depth will have also
uncovered additional sites. When plotted in a genome browser, the total signal
within a region may be the same, but the more highly sequenced dataset will
cover more positions but have lower peaks, while the less sequenced dataset will
look sparse and spikey in comparison.
Below, we compare PRO-seq data derived from the same dataset over the same gene.
In one case, we randomly sample half of the reads over that gene, while in
another, we divide all the readcounts by 2.
```{r}
data("PROseq")
data("txs_dm6_chr4")
```
```{r}
# choose a single gene
gene_i <- txs_dm6_chr4[185]
reads.gene_i <- subsetByOverlaps(PROseq, gene_i)
# sample half the raw reads
set.seed(11)
sreads.gene_i <- subsampleGRanges(reads.gene_i, prop = 0.5, ncores = 1)
# downscale raw reads by a factor of 2
score(reads.gene_i) <- 0.5 * score(reads.gene_i)
```
```{r, collapse = TRUE}
sum(score(reads.gene_i))
sum(score(sreads.gene_i))
```
```{r, results = "hold"}
plot(x = 1:width(gene_i),
y = getCountsByPositions(sreads.gene_i, gene_i),
type = "h", ylim = c(0, 20),
main = "PRO-seq (down-sampled)",
xlab = "Distance from TSS", ylab = "Down-sampled PRO-seq Reads")
plot(x = 1:width(gene_i),
y = getCountsByPositions(reads.gene_i, gene_i),
type = "h", ylim = c(0, 20),
main = "PRO-seq (down-scaled)",
xlab = "Distance from TSS", ylab = "Down-scaled PRO-seq Reads")
```
The two plots above come from the same data, and contain the same quantity of
signal, but their profiles are notably distinct. Particularly when plotting many
samples over large regions within a genome browser, differences caused by
sequencing depth can be misleading. It can be challenging to estimate
differences if some datasets are "tall and spikey" while others are "short and
smooth".
Absent global changes in signal, the above scenario can be resolved beforehand
by equal sequencing depths, or by down-sampling to match readcounts.
However, matching raw readcounts is not a solution when significant biological
changes in total signal should be accounted for.
For instance, consider an example in which there is a true, two-fold biological
difference in transcription between two samples. If we could avoid all technical
artifacts and measure the transcription directly in each individual cell, we
would expect to uncover half the number of transcribing complexes in the lower
condition. Having equivalent sequencing depth across those two conditions is
effectively a technical artifact, and down-scaling the signal by multiplication
can cause the visual challenges observed above.
## Sub-sampling for Normalization
To address the above concerns, we've included a function `subsampleBySpikeIn()`
to randomly sample reads to match the normalized signal proportions between
datasets.
Internally, the function uses the `getSpikeInNFs()` function, but instead of
SRPMC normalization, using the option `method = "SNR"`, which calculates
normalization factors that downscale each dataset to match the dataset with the
least spike-in reads. From this, the number of "desired reads" is established
for each dataset, and subsequently that number of reads is randomly sampled.
```{r}
removeSpikeInReads(grl, si_pattern = "spike", ncores = 1)
getSpikeInNFs(grl, si_pattern = "spike", method = "SNR", batch_norm = FALSE,
ncores = 1)
subsampleBySpikeIn(grl, si_pattern = "spike", batch_norm = FALSE, ncores = 1)
```
Normalization by sub-sampling sacrifices information to reduce biases across
datasets.