/
topic2_03_clusterProfiler_ora.Rmd
260 lines (184 loc) · 6.29 KB
/
topic2_03_clusterProfiler_ora.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
---
title: "Topic 2-03: Use clusterProfiler for ORA"
author: "Zuguang Gu <z.gu@dkfz.de>"
date: '`r Sys.Date()`'
output:
html_document:
toc: true
toc_float: true
vignette: >
%\VignetteIndexEntry{Topic 2-03: Use clusterProfiler for ORA}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
Let's start with a vector of DE genes:
```{r}
lt = readRDS(system.file("extdata", "ora.rds", package = "GSEAtraining"))
diff_gene = lt$diff_gene
bg_gene = lt$bg_gene
head(diff_gene)
```
Load the **clusterProfiler** package.
```{r, message=FALSE}
library(clusterProfiler)
```
**clusterProfiler** supports several major gene ID types in the input, but it is suggested
to use Entrez IDs as the input because it is the "central gene ID type" in many databases/datasets.
We convert `diff_gene` to Entrez IDs. Note some genes may be lost
due to the conversion.
```{r}
library(GSEAtraining)
diff_gene = convert_to_entrez_id(diff_gene)
head(diff_gene)
length(diff_gene)
```
Next we perform ORA on different gene set sources.
## GO enrichment
```{r, message = FALSE}
library(org.Hs.eg.db)
tb = enrichGO(gene = diff_gene, ont = "BP", OrgDb = org.Hs.eg.db)
head(tb)
```
## KEGG enrichment
```{r, message = FALSE}
tb = enrichKEGG(gene = diff_gene, organism = "hsa")
head(tb)
```
## Reactome enrichment
```{r}
library(ReactomePA)
tb = enrichPathway(gene = diff_gene, organism = "human")
head(tb)
```
## DO enrichment
```{r}
library(DOSE)
tb = enrichDO(gene = diff_gene)
head(tb)
```
## MSigDB enrichment
There is no built-in function specific for MSigDB gene sets, but there is a universal function `enricher()` which accepts
manually-specified gene sets. The gene sets object is simply a two-column data frame:
- the first column is the gene set ID
- the second column is the gene ID
```{r}
library(msigdbr)
gene_sets = msigdbr(category = "H")
map = gene_sets[, c("gs_name", "entrez_gene")]
head(map)
tb = enricher(gene = diff_gene, TERM2GENE = map)
head(tb)
```
## Setting background
Note, in **clusterProfiler**, the background is intersect(all genes in the gene sets, user's background genes).
```{r}
bg_gene = convert_to_entrez_id(bg_gene)
bg_gene = sample(bg_gene, 10000)
tb = enrichGO(gene = diff_gene, ont = "BP", OrgDb = org.Hs.eg.db, universe = bg_gene)
head(tb)
```
<style>
blockquote {
padding: 10px 20px;
margin: 0 0 20px;
font-size: 14px;
background: #eee;
border-left: 6px #CCCCCC solid;
}
</style>
> **Warning:** It seems when setting `universe`, the input gene list `diff_gene` is not intersected to `universe`
> in the analysis. You can check the values in `GeneRatio` which is 777 for all DE genes, and this number is the same
> as when `universe` is not manually set. So be careful with this "improper" behaviour.
We can compare ORA with different backgrounds:
```{r}
tb1 = enrichGO(gene = diff_gene, ont = "BP", OrgDb = org.Hs.eg.db)
tb2 = enrichGO(gene = diff_gene, ont = "BP", OrgDb = org.Hs.eg.db, universe = bg_gene)
tb1 = tb1@result
tb2 = tb2@result
cn = intersect(tb1$ID, tb2$ID)
plot(tb1[cn, "pvalue"], tb2[cn, "pvalue"], log = "xy",
xlim = c(1e-25, 1), ylim = c(1e-25, 1),
xlab = "default background", ylab = "self-defined background")
```
Or overlap between the significant GO terms:
```{r}
library(eulerr)
plot(euler(list("default_background" = tb1$ID[tb1$p.adjust < 0.05],
"bg_gene" = tb2$ID[tb2$p.adjust < 0.05])),
quantities = TRUE)
```
## Example for other organism
1. GO enrichment
`enrichGO()` accepts an `OrgDb` object as the source of the GO gene sets.
We take pig as an example. `random_genes()` is from **GSEAtraining** package.
```{r}
library(org.Ss.eg.db)
diff_gene = random_genes(org.Ss.eg.db, 1000, "ENTREZID")
```
Here we set `pvalueCutoff = 1, qvalueCutoff = 1` because genes are random genes and
it is expected there won't be too much significant terms left.
```{r, message = FALSE}
tb = enrichGO(gene = diff_gene, ont = "BP", OrgDb = org.Ss.eg.db,
pvalueCutoff = 1, qvalueCutoff = 1)
head(tb)
```
2. KEGG enrichment
The KEGG code of a specific organism can be found at https://rest.kegg.jp/list/organism
```{r, message = FALSE}
tb = enrichKEGG(gene = diff_gene, organism = "ssc", pvalueCutoff = 1, qvalueCutoff = 1)
head(tb)
```
3. MSigDB
Use `msigdbr::msigdbr_species()` to see which organisms are supported.
```{r}
gene_sets = msigdbr(species = "pig", category = "H")
map = gene_sets[, c("gs_name", "entrez_gene")]
map$entrez_gene = as.character(map$entrez_gene)
tb = enricher(gene = diff_gene, TERM2GENE = map, pvalueCutoff = 1, qvalueCutoff = 1)
head(tb)
```
## Examples for less well-studied organisms
### When an `OrgDb` object is avaiable
You can use `enrichGO()` as long as you have an `OrgDb` object:
Taking dolphin as an example.
```{r}
library(AnnotationHub)
ah = AnnotationHub()
org_db = ah[["AH112418"]]
diff_gene = random_genes(org_db, 1000, "ENTREZID")
tb = enrichGO(diff_gene, ont = "BP", OrgDb = org_db,
pvalueCutoff = 1, qvalueCutoff = 1)
head(tb)
```
### enrichKEGG() also supports many other orgainsms
```r
enrichKEGG(gene = diff_gene, organism = ...)
```
### Manually construct the gene sets
We have introduced many ways to obtain gene sets for less well-studies organisms.
The only thing to do here is to convert gene sets to a two-column data frame where gene sets
are in the first column and genes are in the second column. Then use `enricher()`.
```r
enricher(diff_gene, TERM2GENE = ...)
```
## Look at the `tb` object
We used the same variable name `tb` for the object returned by the various enrichment functions. They are all in the same format. It looks like a table, but be careful, it is actually not:
```{r}
class(tb)
str(tb)
```
To convert it into a "real" table, DO NOT use `as.data.frame()` which only returns the significant terms, use `tb@result`.
Also some columns in `tb@result` are not in the proper format, e.g. `"GeneRatio"` and `"BgRatio"` where numbers should be in numeric mode while not characters. In **GSEAtraining**, there is a `add_more_columns()` which adds more columns to the `tb@result` table.
```{r}
tb2 = add_more_columns(tb)
head(tb2@result)
```
## Practice
### {.tabset}
#### Practice 1
Use the DE gene list in `webgestalt_example_gene_list.rds` to perform ORA on
GO/KEGG/MSigDB gene sets.
```{r}
genes = readRDS(system.file("extdata", "webgestalt_example_gene_list.rds",
package = "GSEAtraining"))
```