/
topic4_02_GSVA.Rmd
134 lines (99 loc) · 3.78 KB
/
topic4_02_GSVA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
---
title: "Topic 4-02: GSVA"
author: "Zuguang Gu <z.gu@dkfz.de>"
date: '`r Sys.Date()`'
output:
html_document:
toc: true
toc_float: true
vignette: >
%\VignetteIndexEntry{Topic 4-02: GSVA}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
## GSVA
We need an expression matrix as well as a collection of gene sets.
```{r}
lt = readRDS(system.file("extdata", "p53_expr.rds", package = "GSEAtraining"))
expr = lt$expr
condition = lt$condition
ln = strsplit(readLines(system.file("extdata", "c2.symbols.gmt", package = "GSEAtraining")), "\t")
gs = lapply(ln, function(x) x[-(1:2)])
names(gs) = sapply(ln, function(x) x[1])
```
Running **GSVA** analysis is simple. Just note the gene IDs in `expr` should match the gene IDs in `gs`.
The returned value `gs_mat` is a set-sample matrix that contains "single sample-based gene set variation scores".
```{r}
library(GSVA)
gs_mat = gsva(expr, gs, verbose = FALSE)
dim(gs_mat)
```
`gs_mat` can be used for downstream analysis. E.g. make heatmaps:
```{r, message = FALSE}
library(ComplexHeatmap)
Heatmap(gs_mat, top_annotation = HeatmapAnnotation(cond = condition),
column_split = condition)
```
Apply t-test on each row of `gs_mat` to test whether the gene set-level
profile has difference between the two conditions.
```{r, message=FALSE}
library(genefilter)
tdf = rowttests(gs_mat, factor(condition))
tdf$fdr = p.adjust(tdf$p.value, "BH")
```
How many gene sets are significant?
```{r}
tdf[tdf$fdr < 0.05, ]
```
## Compare to normal GSEA
As a comparison, we also perform a normal GSEA analysis. We use the t-value as the gene-level score:
```{r}
s = rowttests(expr, factor(condition))[, "statistic"]
names(s) = rownames(expr)
```
Here we also test to the c2 gene set collection, so we need to convert `gs` to the format **clusterProfiler** accepts:
```{r, message = FALSE}
map = data.frame(
gene_set = rep(names(gs), times = sapply(gs, length)),
gene = unlist(gs)
)
library(clusterProfiler)
tb = GSEA(geneList = sort(s, decreasing = TRUE), TERM2GENE = map, pvalueCutoff = 1)
head(tb)
```
Let's check the expression of genes in the p53Pathway:
```{r}
g = intersect(gs[["p53Pathway"]], rownames(expr))
mm = expr[g, ]
mm = t(scale(t(mm))) # z-score transformation
Heatmap(mm, name = "z-score", top_annotation = HeatmapAnnotation(cond = condition),
column_title = "p53Pathway",
right_annotation = rowAnnotation(bar = anno_barplot(s[g], axis_param = list(direction = "reverse"), width = unit(2, "cm"))),
column_split = condition) %v%
Heatmap(gs_mat["hivnefPathway", , drop = FALSE], name = "set variation")
```
Next we make pairwise scatterplot of p-values and statistics for both analysis.
```{r, fig.width = 10, fig.height = 5}
cn = intersect(rownames(tdf), tb$ID)
par(mfrow = c(1, 2))
plot(tdf[cn, "p.value"], tb[cn, "pvalue"],
xlab = "GSVA", ylab = "normal GSEA", main = "p-values")
plot(tdf[cn, "statistic"], tb[cn, "NES"],
xlab = "GSVA / t-values", ylab = "normal GSEA / NES")
abline(h = 0, lty = 2, col = "grey")
abline(v = 0, lty = 2, col = "grey")
```
This makes a problem here that one method generates a significant gene set
which can be completely insignificant under another method (left plot). The
direction of the general differentiation of a gene set can be reversed in
different methods.
```{r}
g = intersect(gs[["hivnefPathway"]], rownames(expr))
mm = expr[g, ]
mm = t(scale(t(mm))) # z-score transformation
Heatmap(mm, name = "z-score", top_annotation = HeatmapAnnotation(cond = condition),
right_annotation = rowAnnotation(bar = anno_barplot(s[g], axis_param = list(direction = "reverse"), width = unit(2, "cm"))),
column_split = condition) %v%
Heatmap(gs_mat["hivnefPathway", , drop = FALSE], name = "set variation")
```
**Conclusion**: It does not seem ssGSEA is better than GSEA. Use with caution.