/
topic1_02_kegg.Rmd
267 lines (193 loc) · 8.11 KB
/
topic1_02_kegg.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
---
title: "Topic 1-02: Get pathways from KEGG"
author: "Zuguang Gu <z.gu@dkfz.de>"
date: '`r Sys.Date()`'
output:
html_document:
toc: true
toc_float: true
vignette: >
%\VignetteIndexEntry{Topic 1-02: Get pathways from KEGG}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
<style>
blockquote {
padding: 10px 20px;
margin: 0 0 20px;
font-size: 14px;
background: #eee;
border-left: 6px #CCCCCC solid;
}
</style>
KEGG is a comprehensive database which not only contains pathways but also a large variaty
of other information. The entrence of getting KEGG data programmatically is via the KEGG REST API.
Please note the restrictions of using KEGG data (https://www.kegg.jp/kegg/legal.html):
> ### Academic use of KEGG
> Academic users may freely use the KEGG website at https://www.kegg.jp/ or its mirror site at GenomeNet https://www.genome.jp/kegg/. Academic users who utilize KEGG for providing academic services are requested to obtain an academic service provider license, which is included in the KEGG FTP academic subscription. The KEGG FTP academic subscription, which is a paid service (see background information), may also be obtained to conveniently download the entire KEGG database.
>
> ### Non-academic use of KEGG
> Non-academic users must understand that KEGG is not a public database and non-academic use of KEGG generally requires a commercial license. There are two types of commercial licenses available: end user and business. The end user license includes access rights to the FTP site and the website, while the business license includes access rights to the FTP site only. Please contact Pathway Solutions for more details.
## Directly read from the KEGG API
KEGG provides its data via a REST API (https://rest.kegg.jp/). There are
several commands that can be used to retrieve specific types of data. The URL form of the request is
```
https://rest.kegg.jp/<operation>/<argument>[/<argument2[/<argument3> ...]]
```
To get the KEGG pathway gene sets, we will use the following two operators:
- `link`: get the mapping between two sources.
- `list`: get the details of a list of items.
The `link` operator returns the mapping between two sources of information. We can use
the following command to get the mappings between genes and pathways for human.
```{r}
df1 = read.table(url("https://rest.kegg.jp/link/pathway/hsa"), sep = "\t")
head(df1)
```
In the example, `url()` constructs a connection object that directly transfer
data from the remote URL. You can also first download the output from KEGG into
a local file, read it and later delete the temporary file. `url()` basically automates
such steps.
```{r, eval = FALSE}
temp = tempfile()
download.file("https://rest.kegg.jp/link/pathway/hsa", destfile = temp)
df1 = read.table(temp, sep = "\t")
file.remove(temp)
```
Also the URL `"https://rest.kegg.jp/link/hsa/pathway"` returns identical results which
only switches the two columns in the table.
In the output, the first column contains Entrez ID
(users may remove the `"hsa:"` prefix for downstream analysis) and the second
columncontains KEGG pathways IDs (users may remove the `"path:"` previx).
```{r}
df1[, 1] = gsub("hsa:", "", df1[, 1])
df1[, 2] = gsub("path:", "", df1[, 2])
head(df1)
```
To get the full name of pathways, use the `list` command:
```{r}
df2 = read.table(url("https://rest.kegg.jp/list/pathway/hsa"), sep = "\t")
head(df2)
df2[, 2] = gsub(" - Homo.*$", "", df2[, 2])
head(df2)
```
## KEGGREST
There is a **KEGGREST** package which implements a full interface to access KEGG
data in R. All the APIs from KEGG REST service are suppoted in **KEGGREST**. The
names of functions in **KEGGREST** are consistent to the operators in the KEGG
API. For example, to get the mapping between genes and pathways, the function
`keggLink()` can be used.
```{r}
library(KEGGREST)
gene2pathway = keggLink("pathway", "hsa")
head(gene2pathway)
```
This named vector is not commonly used for downstream gene set analysis. A more
used format is a data frame. We can simply converted `gene2pathway` to a data
frame.
```{r}
df3 = data.frame(
gene_id = gsub("hsa:", "", names(gene2pathway)),
pathway_id = gsub("path:", "", gene2pathway)
)
head(df3)
```
Similar as the `list` operator in KEGG API, `keggList()` function can be used to
retrieve pathway IDs and their names. The function returns a named vector.
```{r}
head(keggList("pathway", "hsa"))
```
### Highlight genes on the pathway image
<style>
.delete {
text-decoration: line-through;
}
</style>
Users may want to highlight some genes on the pathway image, e.g. to
highlight differentially expressed genes. <span class="delete">The function
`color.pathway.by.objects()` in **KEGGREST** accepts a vector of genes and assigns
them with a set of colors. In the following example, I randomly pick 10
genes from the pathway "hsa04911".</span> **It does not work any more because KEGG has changed its API.**
Let's manually color genes for a specific pathway on KEGG web site.
Open https://www.genome.jp/pathway/hsa04911.
```{r, comment = ""}
genes = names(gene2pathway[gene2pathway == "path:hsa04911"])
diff_genes = sample(genes, 10)
settings = data.frame(
genes = diff_genes,
colors = paste0(rep("orange", 10),",",rep("blue", 10)) # background,labels
)
write.table(settings, file = stdout(), row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")
```
How to set gene colors: https://www.genome.jp/kegg/webapp/color_gui.html
The package **ggkegg** also supports customizing KEGG graph.
## clusterProfiler
The second Bioconductor pacakge **clusterProfiler** has a simple function
`download_KEGG()` which accepts the prefix of a organism and returns a list of
two data frames, one for the mapping between genes and pathways and the other
for the full name of pathways.
```{r, warnings = FALSE, message = FALSE}
lt = clusterProfiler::download_KEGG("hsa")
head(lt$KEGGPATHID2EXTID)
head(lt$KEGGPATHID2NAME)
```
> Note: all the methods we introduced can not be used to retrieve the network data from KEGG.
## Practice
### {.tabset}
#### Practice 1
In the previous example, we obtained the mapping between genes and KEGG pathways for human, using the
organism code of `"hsa"`. Since human is very well-studied and everyone knows `"hsa"` is for human, what if
you are studying dogs? Try to find out the organism code for dog and get the gene-pathway mappings for dog.
Hint: use the `list` operator in KEGG API and search for dog.
#### Solution
You can get organism codes for all organisms by https://rest.kegg.jp/list/organism. Then you
will find the code for dog is `"cfa"`. Use one of the following three ways to
download the data.
```{r, eval = FALSE}
df = read.table(url("https://rest.kegg.jp/link/pathway/cfa"), sep = "\t")
pathway2gene = KEGGREST::keggLink("pathway", "cfa")
lt = clusterProfiler::download_KEGG("cfa")
```
### {.tabset}
#### Practice 2
Make the distribution of the numbers of genes in KEGG pathways (for human).
There is a pathway that contains far more genes than all other pathways. What is the pathway ID of it and what is its full name?
Go to the web page of this largest pathway to see its real structure.
#### Solution
```{r}
df1 = read.table(url("https://rest.kegg.jp/link/pathway/hsa"), sep = "\t")
n_gene = table(df1[, 2])
hist(n_gene)
```
Making the intervals smaller is better:
```{r}
hist(n_gene, nc = 100)
```
The largest pathway:
```{r}
which.max(n_gene)
```
Its name:
```{r}
df2 = read.table(url("https://rest.kegg.jp/list/pathway/hsa"), sep = "\t")
df2[df2[, 1] == "hsa01100", ]
```
And its url on KEGG: https://www.genome.jp/pathway/hsa01100.
### {.tabset}
#### Practice 3
Highlight the following genes "5527", "7534", "2033", "1111", "5885" on the pathway image of "Cell cycle". Choose colors you like.
#### Solution
```{r}
df2 = read.table(url("https://rest.kegg.jp/list/pathway/hsa"), sep = "\t")
df2[, 2] = gsub(" - Homo .*$", "", df2[, 2])
pathway_id = df2[df2[, 2] == "Cell cycle", 1]
pathway_id
```
Prepare the color table:
```
5527 red,blue
7534 blue,red
2033 yello,darkgreen
1111 grey,red
5885 purple,blue
```
Then go to https://www.genome.jp/pathway/`r pathway_id` and paste the color table there.