preNivolumabOnNivolumab.Rmd

---
title: "EDA, QC, and DESeq2 / edgeR analysis of 109 RNA-seq samples of immunotherapy treatment"
author: "Michael Love"
output: 
  html_document:
    self_contained: no
---

# EDA, QC, and DESeq2 / edgeR analysis of 109 RNA-seq samples of immunotherapy treatment

```{r}
x <- read.csv("GSE91061_BMS038109Sample.hg19KnownGene.raw.csv.gz", row.names=1)
condition <- factor(sub(".+_(.+)_.+", "\\1", colnames(x)))
table(condition)
```

Always look at your data with a PCA plot before any hypothesis testing.
Here we see there are some extreme outliers that should be removed.
Perhaps these are failed samples, or otherwise very different than
the other 100 samples. Outliers like this deserve further investigation
with those who generated the libraries (to avoid the problem re-occuring), 
but we know they will impair inference on the condition effect, 
so we remove them at the beginning.

```{r pca1, message=FALSE, cache=TRUE}
library(DESeq2)
dds <- DESeqDataSetFromMatrix(x, 
                              colData=data.frame(condition), 
                              ~condition)
vsd <- vst(dds, blind=FALSE)
plotPCA(vsd)
```

We can remove them via PC1:

```{r pcaForOutlier}
rv <- rowVars(assay(vsd))
pc <- prcomp(t(assay(vsd)[head(order(-rv),1000),]))
plot(pc$x[,1:2], col=condition)
idx <- pc$x[,1] < -25
sum(idx)
plot(pc$x[,1:2], col=idx+1, pch=20, asp=1)
```

```{r}
condition <- condition[!idx]
dds <- dds[,!idx]
```

For comparison below, we start with minimal filtering with edgeR.
This basically removes genes with very low counts across most samples.

```{r, message=FALSE}
library(edgeR)
y <- DGEList(counts=counts(dds), group=condition)
keep <- filterByExpr(y)
table(keep)
y <- y[keep,]
dds <- dds[keep,]
```

We can see there is still structure in the 2D PCA, which is not
related to the known covariate `condition`. This must be modeled
or else we will have spurious results.

```{r pca2, cache=TRUE}
vsd <- vst(dds, blind=FALSE)
plotPCA(vsd)
```

First we see what happens when we run simple DESeq2 analysis comparing 
the two groups, without attempting to control for the technical variation.
Here we use `glmGamPoi` which is an efficient method for estimating 
dispersion when we have many samples. This can speed up the analysis
by an order of magnitude. For details, see: 

<https://doi.org/10.1093/bioinformatics/btaa1009>

```{r deseq2-simple, cache=TRUE}
system.time({
  dds <- DESeq(dds, test="LRT", reduced=~1, fitType="glmGamPoi")
})
```

There are hundreds of DE genes. As we will see, these are the result
of uncontrolled confounding.

```{r}
res <- results(dds)
table(res$padj < .1)
```

Ignoring the technical variation is not appropriate, if there is 
correlation between the technical variation and the condition.
Not including variables in the design formula that control for the
effect on the expression estimates will lead to invalid inference,
regardless of the method we choose.

We can estimate the technical variation with a number of methods, 
including RUV, SVA, or PEER. Here we demonstrate usage of RUV. We will
use the `RUVg` method that takes empirically defined negative control
genes to estimate low rank technical variation in the data. We provide
it with genes that had a large p-value in the naive analysis. We
set `k=5` here for a first pass analysis, but some datasets may require
larger values of `k`. An iterative strategy is recommended, including
consideration of both postiive and negative control features.

For details on the RUV-Seq method, see:

<https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4404308/>

```{r, message=FALSE, cache=TRUE}
library(RUVSeq)
set <- newSeqExpressionSet(counts(dds))
set <- betweenLaneNormalization(set, which="upper")
not_sig <- rownames(res)[which(res$pvalue > .1)]
empirical <- rownames(set)[ rownames(set) %in% not_sig ]
set <- RUVg(set, empirical, k=5)
```

The factors of unwanted variation are labelled `W_1`, `W_2`, etc.
and are stored as metadata. We add the original condition variable:

```{r}
pdat <- pData(set)
pdat$condition <- condition
```

We can visualize how the factors of unwanted variation describe
the samples in the PC1 and PC2 space:

```{r pca3}
vsd$W1 <- pdat$W_1
vsd$W2 <- pdat$W_2
plotPCA(vsd, intgroup="W1")
plotPCA(vsd, intgroup="W2")
```

Adding the factors to the design, and performing a LRT is fairly 
simple, first we show this with DESeq2:

```{r}
colData(dds) <- cbind(colData(dds), pdat[,1:5])
design(dds) <- ~W_1 + W_2 + W_3 + W_4 + W_5 + condition
```

```{r deseq2-controlling, cache=TRUE}
system.time({
  dds <- DESeq(dds, test="LRT", reduced=~W_1 + W_2 + W_3 + W_4 + W_5, 
               fitType="glmGamPoi")
})
```

Controlling for technical variation in this case greatly reduces the number
of DE genes. It could also be the opposite case, that controlling for
technical variation increased the apparent number of DE genes.

That the number is reduced here indicates that some of the previous
results were likely due to confounding of technical variation with
the condition variable. Ignoring that confounding, again, will result
in invalid inference for all methods that look for shifts in the 
expression values.

```{r}
res <- results(dds)
table(res$padj < .1)
```

```{r}
res_sig <- res[which(res$padj < .1),]
```

```{r maplot}
DESeq2::plotMA(res, ylim=c(-5,5))
```

We repeat the analysis with the edgeR quasi-likelihood test. For
details on this method see:

* <https://f1000research.com/articles/5-1438>
* <http://www.statsci.org/smyth/pubs/QuasiSeqPreprint.pdf>

```{r edger, cache=TRUE}
y <- calcNormFactors(y)
design <- model.matrix(~W_1 + W_2 + W_3 + W_4 + W_5 + condition, data=pdat)
y <- estimateDisp(y, design)
qlfit <- glmQLFit(y, design)
qlft <- glmQLFTest(qlfit)
```

edgeR when controlling for the technical variation, has no DE genes:

```{r}
tt <- topTags(qlft, n=nrow(y))[[1]]
sum(tt$FDR < .1)
```

Where do the DESeq2 significant genes fall in terms of test statistic
according to edgeR? This is mostly a check to make sure we can line
up the two tables, etc.

```{r edgerDESeq2Compare}
hist(tt$F, freq=FALSE)
F <- tt[rownames(res_sig),"F"]
lines(density(F[!is.na(F)]))
```

The order of the DESeq2 significant genes according to edgeR.
We can see they are all top ranked, so the methods agree in general
on the ranking of the top 20 genes, but DESeq2 is giving these
genes a lower p-value:

```{r}
match(rownames(res_sig), rownames(tt))
```

Another way to show that the methods are in agreement on the
top genes:

```{r}
table(rownames(res_sig) %in% head(rownames(tt), 20))
```

We can see that DESeq2 tends to have smaller p-values on this dataset
but the two methods agree on the ranking:

```{r edgerDESeq2pvalues}
res_top500 <- res[head(order(res$pvalue),500),]
plot(-log10(tt[rownames(res_top500),"PValue"]), 
     -log10(res_top500$pvalue), log="xy",
     xlab="edgeR pvalue", ylab="DESeq2 pvalue")
abline(0,1,col="red")
```

For more details on controlling for technical variation in RNA-seq
experiments, see the following papers:

* RUVSeq - <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4404308/>
* svaseq - <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4245966/>
* PEER - <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3398141/>
* 't Hoen (2013) - <https://pubmed.ncbi.nlm.nih.gov/24037425/>
* SEQC collection - <https://www.nature.com/collections/wzcnyyrcsd>

# Session info

```{r}
sessionInfo()
```