/
BioCor_2_advanced.Rmd
440 lines (364 loc) · 21.4 KB
/
BioCor_2_advanced.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
---
title: "Advanced usage of BioCor"
abstract: >
Describes how to use the BioCor package to answer several biological
questions and how to use functional similarities with other measures.
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('BioCor')`"
output:
BiocStyle::html_document:
fig_caption: true
code_folding: show
self_contained: yes
toc_float:
collapsed: true
toc_depth: 3
author:
- name: Lluís Revilla
affiliation:
- August Pi i Sunyer Biomedical Research Institute (IDIBAPS); Liver Unit, Hospital Clinic
email: lrevilla@clinic.cat
- name: Juan José Lozano
affiliation:
- Centro de Investigación Biomédica en Red de Enfermedades Hepaticas y Digestivas (CIBEREHD); Barcelona, Spain
- name: Pau Sancho-Bru
affiliation:
- August Pi i Sunyer Biomedical Research Institute (IDIBAPS); Liver Unit, Hospital Clinic
vignette: >
%\VignetteIndexEntry{Advanced usage of BioCor}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
opengraph:
image:
src: man/figures/logo.png
alt: "BioCor logo"
twitter:
creator: "@Lluis_Revilla"
card: summary
editor_options:
chunk_output_type: console
---
```{r setup, message=FALSE, warning=FALSE, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, warning = TRUE, fig.wide = TRUE,
cache = FALSE, crop = FALSE)
suppressPackageStartupMessages(library("org.Hs.eg.db"))
genesKegg <- base::as.list(org.Hs.eg.db::org.Hs.egPATH)
genesReact <- base::as.list(reactome.db::reactomeEXTID2PATHID)
# Remove genes and pathways which are not from human pathways
genesReact <- lapply(genesReact, function(x) {
grep("R-HSA-", x, value = TRUE)
})
genesReact <- genesReact[lengths(genesReact) >= 1]
library("BioCor")
```
# Introduction
In this vignette we assume that the reader is already familiarized with the `r BiocStyle::Biocpkg("BioCor", "BioCor_1_basics.html", "introduction vignette")`, but wants to know how can it help to answer other questions in other situations
We follow the same convention about names of the objects used `genesReact` and `genesKegg` are the list with information about the pathways the genes are involved in.
# Merging similarities {#merging}
If one calculates similarities with KEGG data and Reactome or other input for the same genes or clusters BioCor provides a couple of functions to merge them.
We can set a weight to each similarity input with `weighted.sum`, multiply them also using a weight for each similarity (with `weighted.prod`), doing the mean or just adding them up. Similarities allow us to apply a function to combine the matrices of a list. Here we use some of the genes used in the first vignette:
```{r merging}
kegg <- mgeneSim(c("672", "675", "10"), genesKegg)
react <- mgeneSim(c("672", "675", "10"), genesReact)
## We can sum it adding a weight to each origin
weighted.sum(c(kegg["672", "675"], react["672", "675"]), w = c(0.3, 0.7))
## Or if we want to perform for all the matrix
## A list of matrices to merge
sim <- list("kegg" = kegg, "react" = react)
similarities(sim, weighted.sum, w = c(0.3, 0.7))
similarities(sim, weighted.prod, w = c(0.3, 0.7))
similarities(sim, prod)
similarities(sim, mean)
```
This functions are similar to `weighted.mean`, except that first the multiplication by the weights is done and then the `NA`s are removed:
```{r weighted}
weighted.mean(c(1, NA), w = c(0.5, 0.5), na.rm = TRUE)
weighted.mean(c(1, 0.5, NA), w = c(0.5, 0.25, 0.25), na.rm = TRUE)
weighted.sum(c(1, NA), w = c(0.5, 0.5))
weighted.sum(c(1, 0.5, NA), w = c(0.5, 0.25, 0.25))
weighted.prod(c(1, NA), w = c(0.5, 0.5))
weighted.prod(c(1, 0.5, NA), w = c(0.5, 0.25, 0.25))
```
# Assessing a differential study
In this example we will use the functional similarities in a classical differential study.
## Obtaining data
We start using data from the [RNAseq workflow](https://bioconductor.org/help/workflows/rnaseqGene/#differential-expression-analysis) and following the analysis comparing treated and untreated:
```{r simulate, fig.cap="Volcano plot. The airway data", fig.wide = TRUE, code_folding = "hide"}
#| fig.alt = "Volcano plot (log2FC on the horitzonal axis and log(p-value) on the vertical axis) of the airway dataset."
suppressPackageStartupMessages(library("airway"))
data("airway")
library("DESeq2")
dds <- DESeqDataSet(airway, design = ~ cell + dex)
dds$dex <- relevel(dds$dex, "untrt")
dds <- DESeq(dds)
res <- results(dds, alpha = 0.05)
summary(res)
plot(res$log2FoldChange, -log10(res$padj),
pch = 16, xlab = "log2FC",
ylab = "-log10(p.ajd)", main = "Untreated vs treated"
)
logFC <- 2.5
abline(v = c(-logFC, logFC), h = -log10(0.05), col = "red")
```
As we can see here there are around 4000 genes differentially expressed genes, some of them differentially expressed above 2^2.5.
## Selecting differentially expressed genes
Usually in such a study one selects genes above certain logFC or fold change threshold, here we use the absolute value of `r logFC`:
```{r BioCor}
fc <- res[abs(res$log2FoldChange) >= logFC & !is.na(res$padj), ]
fc <- fc[fc$padj < 0.05, ]
# Convert Ids (used later)
genes <- select(org.Hs.eg.db,
keys = rownames(res), keytype = "ENSEMBL",
column = c("ENTREZID", "SYMBOL")
)
genesFC <- genes[genes$ENSEMBL %in% rownames(fc), ]
genesFC <- genesFC[!is.na(genesFC$ENTREZID), ]
genesSim <- genesFC$ENTREZID
names(genesSim) <- genesFC$SYMBOL
genesSim <- genesSim[!duplicated(genesSim)]
# Calculate the functional similarity
sim <- mgeneSim(genes = genesSim, info = genesReact, method = "BMA")
```
Once the similarities for the selected genes are calculated we can now visualize the effect of each method:
```{r pval1, fig.cap="Functional similarity of genes with logFC above 2,5. Similar genes cluster together.", fig.width=15, fig.height=20}
#| fig.alt = "First two dimensions of a multi dimensional scaling method based on the similarity of genes. Colored if the fold change is above 2.5 (red)."
nas <- apply(sim, 1, function(x) {
all(is.na(x))
})
sim <- sim[!nas, !nas]
MDSs <- cmdscale(1 - sim)
plot(MDSs, type = "n", main = "BMA similarity", xlab = "MDS1", ylab = "MDS2")
up <- genes[genes$ENSEMBL %in% rownames(fc)[fc$log2FoldChange >= logFC], "SYMBOL"]
text(MDSs, labels = rownames(MDSs), col = ifelse(rownames(MDSs) %in% up, "black", "red"))
abline(h = 0, v = 0)
legend("top", legend = c("Up-regulated", "Down-regulated"), fill = c("black", "red"))
```
This plot illustrate that some differentially expressed genes are quite similar according to their pathways. Suggesting that there might be a relationship between them. Furthermore, some up-regulated genes seem functionally related to down-regulated genes indicating a precise regulation of the pathways where they are involved.
Note that here we are only using `r nrow(MDSs)` genes from the original `r nrow(fc)`.
## Are differentially expressed genes selected by their functionality?
In the previous section we have seen that some differentially expressed genes are functionally related and that they have a high logFC value. Are genes differentially expressed more functional related than those which aren't differential expressed?
For simplicity we will use a subset of 400 genes represented again in a volcano plot and we will look for the functional similarities between those genes:
```{r setting, fig.cap="Volcano plot of the subset of 400 genes. This subset will be used in the following sections", code_folding = "hide"}
#| fig.alt = "Volcano plot of a subset of 400 genes. "
set.seed(250)
subRes <- res[!is.na(res$log2FoldChange), ]
subs <- sample.int(nrow(subRes), size = 400)
subRes <- subRes[subs, ]
g <- genes[genes$ENSEMBL %in% rownames(subRes), "ENTREZID"]
gS <- mgeneSim(g[g %in% names(genesReact)], genesReact, "BMA")
deg <- rownames(subRes[subRes$padj < 0.05 & !is.na(subRes$padj), ])
keep <- rownames(gS) %in% genes[genes$ENSEMBL %in% deg, "ENTREZID"]
plot(subRes$log2FoldChange, -log10(subRes$padj),
pch = 16, xlab = "log2FC",
ylab = "-log10(p.ajd)", main = "Untreated vs treated"
)
abline(v = c(-logFC, logFC), h = -log10(0.05), col = "red")
```
We can answer this by testing it empirically:
```{r cluster2, fig.cap="Distribution of scores between differentially expressed genes and those who aren't. The line indicates the mean score of the similarity between differentially expressed genes and those which aren't differentially expressed.", fig.wide = TRUE}
#| fig.alt = "Histogram of the scores of similarity between several genes. A vertical red line indicates the score of between those differentially expressed and those which aren't."
library("boot")
# The mean of genes differentially expressed
(scoreDEG <- mean(gS[!keep, keep], na.rm = TRUE))
b <- boot(data = gS, R = 1000, statistic = function(x, i) {
g <- !rownames(x) %in% rownames(x)[i]
mean(x[g, i], na.rm = TRUE)
})
(p.val <- (1 + sum(b$t > scoreDEG)) / 1001)
hist(b$t, main = "Distribution of scores", xlab = "Similarity score")
abline(v = scoreDEG, col = "red")
```
Comparing the genes differentially expressed and those who aren't doesn't show that they are non-randomly selected (p-value `r p.val`). The genes with a p-value below the threshold are not more closely functionally related than all the other genes[^1].
[^1]: From 400 genes there are `r nrow(gS)` with pathway information and only `r sum(keep)` where significantly differentially expressed in this subset.
## Are functionally related the selected differentially expressed genes?
We have seen that the genes differentially expressed aren't selected by their functionality. However they could be more functionally related than the other genes. Are the differentially expressed genes more functionally similar than it would be expected ?
```{r pval2, fig.cap="Distribution of the similarity within differentially expressed genes. The line indicates the mean funtional similarity whitin them.", fig.wide = TRUE}
#| fig.alt = "Distribution of similarity scores between expressed genes. A vertical red line indicates the real similarity between those differentially expressed genes."
(scoreW <- combineScores(gS[keep, keep], "avg"))
b <- boot(data = gS, R = 1000, statistic = function(x, i) {
mean(x[i, i], na.rm = TRUE)
})
(p.val <- (1 + sum(b$t > scoreW)) / 1001) # P-value
hist(b$t, main = "Distribution of scores", xlab = "Similarity score")
abline(v = scoreW, col = "red")
```
If we selected randomly the genes from our pool we would expect a score around `r scoreW` with a probability of `r p.val`. That means that the differentially expressed genes is highly different compared to the other genes if we use a significance threshold of 0.05 [^r].
[^r]: Remember that this is a small subset.
## Influence of the fold change in the functionally similarity of the genes
We have seen that the genes differentially expressed are not selected by functional similarity but they are functionally related. Now we would like to know if selecting a fold change threshold affects the functional similarity between them.
To know the relationship between the fold change and the similarity between genes we have several methods:
```{r logfc1, fig.cap="Similarity of genes along abs(logFC). Assessing the similarity of genes according to their absolute log2 fold change.", fig.wide = TRUE, code_folding = "hide"}
#| fig.alt = "A line plot on the X axis the absolute log fold change of the threshold used,
#| on the vertical axis the similarity score. Three lines, in red the similarity within
#| genes above the threshold, in black those below the threshold, in green between above
#| the threshold and below the threshold."
s <- seq(0, max(abs(subRes$log2FoldChange)) + 0.05, by = 0.05)
l <- sapply(s, function(x) {
deg <- rownames(subRes[abs(subRes$log2FoldChange) >= x, ])
keep <- rownames(gS) %in% genes[genes$ENSEMBL %in% deg, "ENTREZID"]
BetweenAbove <- mean(gS[keep, keep], na.rm = TRUE)
AboveBelow <- mean(gS[keep, !keep], na.rm = TRUE)
BetweenBelow <- mean(gS[!keep, !keep], na.rm = TRUE)
c(
"BetweenAbove" = BetweenAbove, "AboveBelow" = AboveBelow,
"BetweenBelow" = BetweenBelow
)
})
L <- as.data.frame(cbind(logfc = s, t(l)))
plot(L$logfc, L$BetweenAbove,
type = "l", xlab = "abs(log2) fold change",
ylab = "Similarity score",
main = "Similarity scores along logFC", col = "darkred"
)
lines(L$logfc, L$AboveBelow, col = "darkgreen")
lines(L$logfc, L$BetweenBelow, col = "black")
legend("topleft",
legend = c(
"Between genes above and below threshold",
"Whitin genes above threshold",
"Whitin genes below threshold"
),
fill = c("darkgreen", "darkred", "black")
)
```
The functional similarity of the genes above the threshold increases with a more restrictive threshold, indicating that a logFC threshold selects genes by their functionality, or in other words that genes differentially expressed tend to be of related pathways. The similarity between those genes above the threshold and below remains constant as well as within genes below the threshold.
```{r logfc2, fig.cap = "Functional similarity between the up-regulated and down-regulated genes.", code_folding = "hide"}
#| fig.alt = "Similarity score between genes up and down-regulated along the threshold of log fold change."
l <- sapply(s, function(x) {
# Names of genes up and down regulated
degUp <- rownames(subRes[subRes$log2FoldChange >= x, ])
degDown <- rownames(subRes[subRes$log2FoldChange <= -x, ])
# Translate to ids in gS
keepUp <- rownames(gS) %in% genes[genes$ENSEMBL %in% degUp, "ENTREZID"]
keepDown <- rownames(gS) %in% genes[genes$ENSEMBL %in% degDown, "ENTREZID"]
# Calculate the mean similarity between each subgrup
between <- mean(gS[keepUp, keepDown], na.rm = TRUE)
c("UpVsDown" = between)
})
L <- as.data.frame(cbind("logfc" = s, "UpVsDown" = l))
plot(L$logfc, L$UpVsDown,
type = "l",
xlab = "abs(log2) fold change threshold",
ylab = "Similarity score",
main = "Similarity scores along logFC"
)
legend("topright",
legend = "Up vs down regulated genes",
fill = "black"
)
```
The maximal functional similarity between genes up-regulated and down-regulated are at `r L[which.max(L$UpVsDown), "logfc"]` log2 fold change.
# Assessing a new pathway
Sometimes the top differentially expressed genes or some other key genes are selected as a signature or a potential new group of related genes. In those cases we can test how does the network of genes change if we add them. Here we create a new pathway named `deg`, and we see the effect on the functional similarity score for all the genes:
```{r newPathway, fig.wide=TRUE, eval=FALSE}
# Adding a new pathway "deg" to those genes
genesReact2 <- genesReact
diffGenes <- genes[genes$ENSEMBL %in% deg, "ENTREZID"]
genesReact2[diffGenes] <- sapply(genesReact[diffGenes], function(x) {
c(x, "deg")
})
plot(ecdf(mgeneSim(names(genesReact), genesReact)))
curve(ecdf(mgeneSim(names(genesReact2), genesReact2)), color = "red")
```
This would take lot of time, for a illustration purpose we reduce to some genes:
```{r newPathway2, fig.wide=TRUE, fig.cap="The effect of adding a new pathway to a functional similarity. In red the same network as in black but with the added pathway.", warning=FALSE, message=FALSE}
#| fig.alt = "Empirical cumulative distribution of the functional similarity
#| with the original data (colored in red) and with an added pathway
#| (in black)."
library("Hmisc")
genesReact2 <- genesReact
diffGenes <- genes[genes$ENSEMBL %in% deg, "ENTREZID"]
# Create the new pathway called deg
genesReact2[diffGenes] <- sapply(genesReact[diffGenes], function(x) {
c(x, "deg")
})
ids <- unique(genes[genes$ENSEMBL %in% rownames(subRes), "ENTREZID"])
Ecdf(c(mgeneSim(ids, genesReact, method = "BMA"),
mgeneSim(ids, genesReact2, method = "BMA")
),
group = c(rep("Reactome", length(ids)^2), rep("Modified", length(ids)^2)),
col = c("black", "red"), xlab = "Functional similarities",
main = "Empirical cumulative distribution")
```
# Merging sources of information
Sometimes we have several origin of information, either several databases, or information from other programs...
We can merge this in the single object required by the function in `BioCor` using the function `combineSources`[^2].
This functions helps to evaluate what happens when we add more pathway information. For instance here we add the information in Kegg and the information in Reactome and we visualize it using the same procedure as previously:
[^2]: See the help page of `combineSources`
```{r combineSource, fig.cap = "Comparison of functional similarity in different databases."}
#| fig.alt = "Comparing the functional similarity by looking at the
#| empirical cumulative distribution. Kegg in black, Reactome in blue,
#| and both mixed in red."
genesKegg <- as.list(org.Hs.egPATH)
gSK <- mgeneSim(rownames(gS), genesKegg)
mix <- combineSources(genesKegg, genesReact)
gSMix <- mgeneSim(rownames(gS), mix)
Ecdf(c(gS, gSK, gSMix),
group = c(
rep("Reactome", length(gS)), rep("Kegg", length(gSK)),
rep("Mix", length(gSMix))
),
col = c("black", "red", "blue"), xlab = "Functional similarities",
main = "Empirical cumulative distribution."
)
```
When mixed, there is a huge increase in the genes that share a pathway using the `max` method.
Observe in next figure (\@ref(fig:combineSource2)) how does the method affects to the results:
```{r combineSource2, fig.cap = "Comparison of functional similarity in different gene sets."}
#| fig.alt = "Empirical cumulative distribution of pathways according to
#| Kegg (black), Reactome (blue) and a mix of both (red)."
gSK2 <- mgeneSim(rownames(gS), genesKegg, method = "BMA")
gS2 <- mgeneSim(rownames(gS), genesReact, method = "BMA")
gSMix2 <- mgeneSim(rownames(gS), mix, method = "BMA")
Ecdf(c(gS2, gSK2, gSMix2),
group = c(
rep("Reactome", length(gS)), rep("Kegg", length(gSK)),
rep("Mix", length(gSMix))
),
col = c("black", "red", "blue"), xlab = "Functional similarities (BMA)", main = "Empirical cumulative distribution."
)
```
Now we can appreciate that most of the functional similarity is brought by Reactome database
# miRNA analysis
miRNAs are RNAs that interact with many genes and transcripts and is subject to change with location and time, thus defining the effect of an miRNA is difficult. In this section we try to answer how functionally similar are miRNA between them to provide a help for potentially closely related miRNA.
First we look for human miRNAs and prepare them for using them as the input for the cluster functions (Restricting to 10 miRNA with less than 150 genes).[^3]
[^3]:This preparation has been adapted from a [previous discussion](https://support.bioconductor.org/p/48138/#48236).
```{r miRNA1}
library("targetscan.Hs.eg.db")
## select human mirna
humanMirnaIdx <- grep("hsa", mappedkeys(targetscan.Hs.egMIRNA))
## select seed-based families for human mirna
humanMirna <- mappedkeys(targetscan.Hs.egMIRNA)[humanMirnaIdx]
## select targets of families
humanMirnaFamilies <- unlist(mget(humanMirna, targetscan.Hs.egMIRBASE2FAMILY))
humanMirnaTargets <- mget(humanMirnaFamilies, revmap(targetscan.Hs.egTARGETS))
names(humanMirnaTargets) <- humanMirna
# Restrict to miRNA with more than one target and less than 150
miRNAs <- sample(humanMirnaTargets[lengths(humanMirnaTargets) > 1 &
lengths(humanMirnaTargets) < 150], 10)
lengths(miRNAs)
```
Now we calculate the functional similarity of those miRNAs using a cluster approach.
```{r miRNA2}
cluster1 <- mclusterSim(miRNAs, genesReact, method = "BMA")
knitr::kable(round(cluster1, 2), caption = "The similarity between miRNA", format = "html")
```
So for instance `r m <- which(cluster1 == max(as.dist(cluster1)), arr.ind = TRUE); unique(rownames(m)[m[, 1] != m[, 2]])` are functionally related despite being from different families.
# Comparing with GO similarities
As suggested in the main vignette functional similarities can be compared to semantic similarities such as those based on GO. Here a comparison using the biological process from the gene ontologies is done:
```{r GOSemSim, fig.cap="Comparison of similarities. Functional similarities compared to biological process semantic similarity.",fig.wide=TRUE,include=TRUE}
#| fig.alt = "Functional similarities for the same genes compared between GO and Reactome annotation."
library("GOSemSim")
BP <- godata("org.Hs.eg.db", ont = "BP", computeIC = TRUE)
gsGO <- GOSemSim::mgeneSim(rownames(gS), semData = BP, measure = "Resnik", verbose = FALSE)
keep <- rownames(gS) %in% rownames(gsGO)
hist(as.dist(gS[keep, keep] - gsGO),
main = "Difference between functional similarity and biological process",
xlab = "Functional similarity - biological process similarity"
)
```
On this graphic we can observe that some genes have a large functional similarity and few biological similarity. They are present together in several pathways while they share few biological process, indicating that they might be key elements of the pathways they are in. On the other hand some other pairs of genes show higher biological process similarity than functional similarity, indicating specialization or compartmentalization of said genes.
# Session Info {.unnumbered}
```{r session, code_folding = "hide"}
sessionInfo()
```