/
preNivolumabOnNivolumab.Rmd
248 lines (198 loc) · 7.2 KB
/
preNivolumabOnNivolumab.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
---
title: "EDA, QC, and DESeq2 / edgeR analysis of 109 RNA-seq samples of immunotherapy treatment"
author: "Michael Love"
output:
html_document:
self_contained: no
---
# EDA, QC, and DESeq2 / edgeR analysis of 109 RNA-seq samples of immunotherapy treatment
```{r}
x <- read.csv("GSE91061_BMS038109Sample.hg19KnownGene.raw.csv.gz", row.names=1)
condition <- factor(sub(".+_(.+)_.+", "\\1", colnames(x)))
table(condition)
```
Always look at your data with a PCA plot before any hypothesis testing.
Here we see there are some extreme outliers that should be removed.
Perhaps these are failed samples, or otherwise very different than
the other 100 samples. Outliers like this deserve further investigation
with those who generated the libraries (to avoid the problem re-occuring),
but we know they will impair inference on the condition effect,
so we remove them at the beginning.
```{r pca1, message=FALSE, cache=TRUE}
library(DESeq2)
dds <- DESeqDataSetFromMatrix(x,
colData=data.frame(condition),
~condition)
vsd <- vst(dds, blind=FALSE)
plotPCA(vsd)
```
We can remove them via PC1:
```{r pcaForOutlier}
rv <- rowVars(assay(vsd))
pc <- prcomp(t(assay(vsd)[head(order(-rv),1000),]))
plot(pc$x[,1:2], col=condition)
idx <- pc$x[,1] < -25
sum(idx)
plot(pc$x[,1:2], col=idx+1, pch=20, asp=1)
```
```{r}
condition <- condition[!idx]
dds <- dds[,!idx]
```
For comparison below, we start with minimal filtering with edgeR.
This basically removes genes with very low counts across most samples.
```{r, message=FALSE}
library(edgeR)
y <- DGEList(counts=counts(dds), group=condition)
keep <- filterByExpr(y)
table(keep)
y <- y[keep,]
dds <- dds[keep,]
```
We can see there is still structure in the 2D PCA, which is not
related to the known covariate `condition`. This must be modeled
or else we will have spurious results.
```{r pca2, cache=TRUE}
vsd <- vst(dds, blind=FALSE)
plotPCA(vsd)
```
First we see what happens when we run simple DESeq2 analysis comparing
the two groups, without attempting to control for the technical variation.
Here we use `glmGamPoi` which is an efficient method for estimating
dispersion when we have many samples. This can speed up the analysis
by an order of magnitude. For details, see:
<https://doi.org/10.1093/bioinformatics/btaa1009>
```{r deseq2-simple, cache=TRUE}
system.time({
dds <- DESeq(dds, test="LRT", reduced=~1, fitType="glmGamPoi")
})
```
There are hundreds of DE genes. As we will see, these are the result
of uncontrolled confounding.
```{r}
res <- results(dds)
table(res$padj < .1)
```
Ignoring the technical variation is not appropriate, if there is
correlation between the technical variation and the condition.
Not including variables in the design formula that control for the
effect on the expression estimates will lead to invalid inference,
regardless of the method we choose.
We can estimate the technical variation with a number of methods,
including RUV, SVA, or PEER. Here we demonstrate usage of RUV. We will
use the `RUVg` method that takes empirically defined negative control
genes to estimate low rank technical variation in the data. We provide
it with genes that had a large p-value in the naive analysis. We
set `k=5` here for a first pass analysis, but some datasets may require
larger values of `k`. An iterative strategy is recommended, including
consideration of both postiive and negative control features.
For details on the RUV-Seq method, see:
<https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4404308/>
```{r, message=FALSE, cache=TRUE}
library(RUVSeq)
set <- newSeqExpressionSet(counts(dds))
set <- betweenLaneNormalization(set, which="upper")
not_sig <- rownames(res)[which(res$pvalue > .1)]
empirical <- rownames(set)[ rownames(set) %in% not_sig ]
set <- RUVg(set, empirical, k=5)
```
The factors of unwanted variation are labelled `W_1`, `W_2`, etc.
and are stored as metadata. We add the original condition variable:
```{r}
pdat <- pData(set)
pdat$condition <- condition
```
We can visualize how the factors of unwanted variation describe
the samples in the PC1 and PC2 space:
```{r pca3}
vsd$W1 <- pdat$W_1
vsd$W2 <- pdat$W_2
plotPCA(vsd, intgroup="W1")
plotPCA(vsd, intgroup="W2")
```
Adding the factors to the design, and performing a LRT is fairly
simple, first we show this with DESeq2:
```{r}
colData(dds) <- cbind(colData(dds), pdat[,1:5])
design(dds) <- ~W_1 + W_2 + W_3 + W_4 + W_5 + condition
```
```{r deseq2-controlling, cache=TRUE}
system.time({
dds <- DESeq(dds, test="LRT", reduced=~W_1 + W_2 + W_3 + W_4 + W_5,
fitType="glmGamPoi")
})
```
Controlling for technical variation in this case greatly reduces the number
of DE genes. It could also be the opposite case, that controlling for
technical variation increased the apparent number of DE genes.
That the number is reduced here indicates that some of the previous
results were likely due to confounding of technical variation with
the condition variable. Ignoring that confounding, again, will result
in invalid inference for all methods that look for shifts in the
expression values.
```{r}
res <- results(dds)
table(res$padj < .1)
```
```{r}
res_sig <- res[which(res$padj < .1),]
```
```{r maplot}
DESeq2::plotMA(res, ylim=c(-5,5))
```
We repeat the analysis with the edgeR quasi-likelihood test. For
details on this method see:
* <https://f1000research.com/articles/5-1438>
* <http://www.statsci.org/smyth/pubs/QuasiSeqPreprint.pdf>
```{r edger, cache=TRUE}
y <- calcNormFactors(y)
design <- model.matrix(~W_1 + W_2 + W_3 + W_4 + W_5 + condition, data=pdat)
y <- estimateDisp(y, design)
qlfit <- glmQLFit(y, design)
qlft <- glmQLFTest(qlfit)
```
edgeR when controlling for the technical variation, has no DE genes:
```{r}
tt <- topTags(qlft, n=nrow(y))[[1]]
sum(tt$FDR < .1)
```
Where do the DESeq2 significant genes fall in terms of test statistic
according to edgeR? This is mostly a check to make sure we can line
up the two tables, etc.
```{r edgerDESeq2Compare}
hist(tt$F, freq=FALSE)
F <- tt[rownames(res_sig),"F"]
lines(density(F[!is.na(F)]))
```
The order of the DESeq2 significant genes according to edgeR.
We can see they are all top ranked, so the methods agree in general
on the ranking of the top 20 genes, but DESeq2 is giving these
genes a lower p-value:
```{r}
match(rownames(res_sig), rownames(tt))
```
Another way to show that the methods are in agreement on the
top genes:
```{r}
table(rownames(res_sig) %in% head(rownames(tt), 20))
```
We can see that DESeq2 tends to have smaller p-values on this dataset
but the two methods agree on the ranking:
```{r edgerDESeq2pvalues}
res_top500 <- res[head(order(res$pvalue),500),]
plot(-log10(tt[rownames(res_top500),"PValue"]),
-log10(res_top500$pvalue), log="xy",
xlab="edgeR pvalue", ylab="DESeq2 pvalue")
abline(0,1,col="red")
```
For more details on controlling for technical variation in RNA-seq
experiments, see the following papers:
* RUVSeq - <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4404308/>
* svaseq - <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4245966/>
* PEER - <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3398141/>
* 't Hoen (2013) - <https://pubmed.ncbi.nlm.nih.gov/24037425/>
* SEQC collection - <https://www.nature.com/collections/wzcnyyrcsd>
# Session info
```{r}
sessionInfo()
```