-
Notifications
You must be signed in to change notification settings - Fork 2
/
R_RNA.Rmd
647 lines (525 loc) · 32.4 KB
/
R_RNA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
---
title: "R_RNA"
output:
html_document:
number_sections: yes
theme: cerulean
toc: yes
toc_depth: 5
toc_float: yes
pdf_document:
toc: yes
toc_depth: '5'
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=TRUE, fig.path='figures/', fig.width=8, fig.height=5 )
```
by adding `fig.path = 'figures/'` we put all of the figures created when we knit this document into a directory called `figures`
# Differential Expression Testing
Read the docs: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
Installs:
```{r}
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager") # calls the package from the source
# BiocManager::install("GSEAbase")
# BiocManager::install("clusterProfiler")
# install.packages("devtools")
# install.packages("RColorBrewer")
# install.packages("pheatmap")
# devtools::install_github("karthik/wesanderson")
# BiocManager::install("org.Sc.sgd.db")
# BiocManager::install("GOstats")
# BiocManager::install("edgeR")
# BiocManager::install("tximport")
# BiocManager::install("DESeq2")
# install.packages("treemap")
# install.packages("tidyverse")
# install.packages("ggplot2")
```
Load Libraries:
```{r, warning = FALSE, message = FALSE}
library(tximport)
library(DESeq2)
library(tidyverse)
library("GSEABase")
library(clusterProfiler)
library(RColorBrewer)
library(pheatmap)
library(wesanderson)
library(org.Sc.sgd.db)
library(GOstats)
library(edgeR)
library(treemap)
library(tidyverse)
library(ggplot2)
```
Import sample metadata:
```{r, warning = FALSE, message = FALSE, cache=TRUE}
# read in the file from url
samples <- read_csv("https://osf.io/cxp2w/download")
# look at the first 6 lines
samples
```
```{r}
samples <- samples %>% mutate(quant_file = str_sub(quant_file, 3))
samples
```
Import tx 2 gene file:
```{r}
#tx2gene_map <- read_tsv("https://osf.io/a75zm/download")
tx <- read.table("quant/ERR458493.qc.fq.gz_quant/quant.sf", header = TRUE)
tx2gene_map <- data.frame(tx$Name, tx$Name)
names(tx2gene_map) <- c("TXNAME", "GENEID")
txi <- tximport(files = samples$quant_file, type = "salmon", tx2gene = tx2gene_map)
colnames(txi$counts) <- samples$sample
```
Make DESeq2 object:
```{r}
dds <- DESeqDataSetFromTximport(txi = txi,
colData = samples,
design = ~condition)
dds$condition <- relevel(dds$condition, ref = "wt") # make wild-type the reference to which expression in treatment samples is compared to
```
Run DESeq2:
```{r, cache = TRUE}
dds <- DESeq(dds)
```
Check out results:
```{r}
res <- results(dds)
head(res)
```
Summarize results
```{r}
summary(res, alpha = 0.05) # default significance cut-off is 0.1, changing alpha to 0.05 changes the significance cut-off
```
**Why p-adj?** Multiple testing problems come up when doing multiple statistical tests simultaneously. Here, we are performing 9,474 individual statistical tests. We first need to think about the definifition p-value: the p-value is the probability of attaining the observed result by chance. So if we use a p-value cut off of 0.05, 5% of the time our results will represent chance outcomes rather than real effects. This is fine in one test. For that one test we are 95% confident that are results are real. But if we do 9,474 tests, 473 genes could be *significantly* differentially expressed, just by chance -- there is a false positive problem.
FDR (False Discovery Rate) adjusted p-values controls for the expected rate of false positives. The p-adj is almost always higher than the p-value. Controlling for multiple testing has the downside of increasing the possibility of false negatives.
Some researchers also set a cut-off for minimum expression and minimum log fold-changes.
ADD REFERENCES!!!
If you want to use a log fold-change cut off:
```{r}
res_lfcut <- results(dds, lfcThreshold = 1, altHypothesis = "greaterAbs")
summary(res_lfcut, alpha = 0.05)
```
Lets say you wanted the comparison to be in the other direction:
```{r}
res_rev <- results(dds, contrast=c("condition","wt", "snf2"))
head(res_rev)
```
# Visualizing RNA-seq results
## Normalization
**Count Data Transformations:**
for ranking and visualizations (e.g. PCA plots and heatmaps)
**rlog**: "transforms the count data to the log2 scale in a way which minimizes differences between samples for rows with small counts, and which normalizes with respect to library size. The rlog transformation produces a similar variance stabilizing effect as varianceStabilizingTransformation, though rlog is more robust in the case when the size factors vary widely. The transformation is useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis." -- from function documentation
This is computationally very time intensive.
```{r, cache=TRUE}
rld <- rlog(dds, blind=TRUE)
head(assay(rld), 3)
```
** Variance stabilizing transformation (so much faster than rlog):**
"This function calculates a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size. The rlog is less sensitive to size factors, which can be an issue when size factors vary widely. These transformations are useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis."" – from function documentation
```{r, cache = TRUE}
vsd <- vst(dds, blind = TRUE)
head(assay(vsd), 3)
```
## Ordination
rlog PCA:
```{r pca_rld}
data1 <- plotPCA(rld, returnData=TRUE)
data1$group<-gsub(" : ","_",as.character(data1$group))
percentVar1 <- round(100 * attr(data1, "percentVar"))
PCA<-ggplot(data1, aes(PC1, PC2, color = condition))+ theme_bw()+
geom_point(size=9, alpha = 0.8) + scale_colour_manual(values = c("#44aabb","#bbbbbb"))+
xlab(paste0("PC1: ",percentVar1[1],"% variance")) +
ylab(paste0("PC2: ",percentVar1[2],"% variance")) +
theme(text = element_text(size=20)) + ggtitle("rlog PCA")
PCA
#ggsave("figures/vsd_PCA.png", device="png") # to save the plot
```
variance stabilized PCA:
```{r pca_vst}
data1 <- plotPCA(vsd, returnData=TRUE)
data1$group<-gsub(" : ","_",as.character(data1$group))
percentVar1 <- round(100 * attr(data1, "percentVar"))
PCA<-ggplot(data1, aes(PC1, PC2, color = condition))+ theme_bw()+
geom_point(size=9, alpha = 0.8) + scale_colour_manual(values = c("#44aabb","#bbbbbb"))+
xlab(paste0("PC1: ",percentVar1[1],"% variance")) +
ylab(paste0("PC2: ",percentVar1[2],"% variance")) +
theme(text = element_text(size=20)) + ggtitle("vst PCA")
PCA
#ggsave("figures/vsd_PCA.png", device="png") # to save the plot
```
## HeatMaps
rlog HeatMap:
```{r heatmap_rld}
df <- as.data.frame(colData(rld)[,c("condition", "sample")])
mat_colors1<-list(sample = brewer.pal(12, "Paired")[0:6])
names(mat_colors1$sample)<- df$sample
mat_colors <- list(condition = brewer.pal(12, "Paired")[7:8])
names(mat_colors$condition) <- c("wt", "snf2")
genes <- order(res$padj)[1:1000]
pheatmap(assay(rld)[genes, ], cluster_rows=TRUE, show_rownames=FALSE,
cluster_cols=FALSE, annotation_col=df, annotation_colors = c(mat_colors1, mat_colors), fontsize = 12)
```
variance stabilized HeatMap:
```{r heatmap_vst}
df <- as.data.frame(colData(vsd)[,c("condition", "sample")])
pheatmap(assay(vsd)[genes, ], cluster_rows=TRUE, show_rownames=FALSE, show_colnames = FALSE,
cluster_cols=FALSE, annotation_col=df, annotation_colors = c(mat_colors1, mat_colors), fontsize = 12)
```
Another option for heat maps:
plot the difference from the mean normalized count across samples
(and optionally change default colors)
With Rlog transformed data:
```{r heatmap_rld_meandiff}
library(wesanderson)
pal <- wes_palette(name = "Zissou1", n=2000 , type= "continuous")
mat_colors1<-list(sample = wes_palette("IsleofDogs1", 6))
names(mat_colors1$sample)<- df$sample
mat_colors <- list(condition = wes_palette("Cavalcanti1")[4:5])
names(mat_colors$condition) <- c("wt", "snf2")
mat <- assay(rld)[genes, ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(rld)[,c("condition", "sample")])
pheatmap(mat, cluster_rows=TRUE, show_rownames=FALSE,
cluster_cols=FALSE, annotation_col=df, annotation_colors = c(mat_colors1, mat_colors), fontsize = 12, color = pal)
```
Same but with variance stabilizing function:
```{r heatmap_vst_meandiff}
mat <- assay(vsd)[genes, ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(vsd)[,c("condition", "sample")])
pheatmap(mat, cluster_rows=TRUE, show_rownames=FALSE, show_colnames = FALSE,
cluster_cols=FALSE, annotation_col=df, annotation_colors = c(mat_colors1, mat_colors), fontsize = 12, color = pal)
```
Heatmap of sample-to-sample distances
```{r heatmap_sampledistance}
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$type, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)
```
## MA plot
```{r}
DESeq2::plotMA(res, alpha = 0.05, ylim = c(-10,10))
#default alpha is 0.1
```
# Gene Set Enrichment Testing
If you remember, we had 774 significantly upregulated genes and 1194 significantly down regulated genes in this data set (this is pretty typical). That is a lot to try to make sense of. If you know you are interested in a specific gene or a specific pathway, you can look for that in your data, but if you are trying to figure out what is generally different between treatments, it helps to categaorize and summarize genes by what they do. Two common ways to do this are GO terms and KEGG pathways.
```{r}
summary(res, alpha = 0.05)
```
## Annotations
we need to line up our pfam and GOannotatations with the differential expression results
```{r}
#get dataframe of significantly differentially expressed genes
DEres <- as.data.frame(res)
DEres$trinity <- row.names(DEres)
sigDEgenes <- subset(DEres, padj < 0.05)
sigDEgenes$trinity <- row.names(sigDEgenes)
```
Import and wrangle Pfam anotations
```{r}
pfamAnnotation <- read_csv("Trinity.fasta.x.pfam-A.csv") %>% group_by(query_name) %>% filter(rank(full_evalue, ties.method="first")==1) %>% arrange(query_name) %>% separate(target_accession, c("Pfam", "version"), sep = "\\.") %>% dplyr::select(Pfam, target_name, description, query_name)
length(pfamAnnotation$Pfam)
```
4939 out of 9462 "genes" have good Pfam annotations
GO term to pfam mapping is available from: [http://current.geneontology.org/ontology/external2go/pfam2go](http://current.geneontology.org/ontology/external2go/pfam2go).
import and wrangle pfam2go mapping file:
```{r}
pfam2go <- read.delim("pfam2go4R.txt", header = FALSE) %>%
separate(V1, c('V1_1', 'V1_2'), sep = '>') %>%
separate(V1_1, c("Pfam", "name"), sep = " ") %>%
separate(V1_2, c("GO_desc", "GO"), sep = ";") %>% mutate(GO = str_sub(GO, 2)) %>% mutate(Pfam= str_sub(Pfam, 6))
```
Merger pfam annotations and pfam2GO mapping:
```{r}
pfamGO <- left_join(as_tibble(pfamAnnotation), as_tibble(pfam2go))
#how many unique pfam annotations with GO terms?
length(unique(pfamGO$Pfam))
```
2296 have GO terms -- 27%
import and wrangle trinity name to dammit rename file:
```{r}
namemap<-read_csv("Trinity.fasta.dammit.namemap.csv") %>% separate(original, c("trinity"), sep = " ") %>% rename(query_name = "renamed")
finalmapping<- right_join(namemap, pfamGO)
```
```{r}
#(format transcript to GO mapping for GOstats)
GOdf <- data.frame(finalmapping$trinity, finalmapping$GO)
GOdf$evidence <- "ISS"
names(GOdf) <- c("isoform", "GO", "evidence")
#reorder columns
GOdf <- GOdf[,c("GO","evidence","isoform")]
GOdf$GO <- as.character(GOdf$GO)
GOdf$isoform<- as.character(GOdf$isoform)
goframe=GOFrame(GOdf)
goAllFrame=GOAllFrame(goframe)
gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
universe <-namemap$trinity
```
```{r}
sigDEgenes <- subset(DEres, padj < 0.05)
sigAnnotated <- left_join(sigDEgenes, finalmapping)
#make list of upregulated genes
sigDEup <- sigDEgenes[sigDEgenes$log2FoldChange >0,]
uplist <- sigDEup$trinity
#make list of downregulated genes
sigDEdown<- sigDEgenes[sigDEgenes$log2FoldChange <0,]
downlist <- sigDEdown$trinity
```
## GO term enrichment
"A GO annotation is a statement about the function of a particular gene. GO annotations are created by associating a gene or gene product with a GO term. Together, these statements comprise a “snapshot” of current biological knowledge. Hence, GO annotations capture statements about how a gene functions at the molecular level, where in the cell it functions, and what biological processes (pathways, programs) it helps to carry out.
Different pieces of knowledge regarding gene function may be established to different degrees, which is why each GO annotation always refers to the evidence upon which it is based. All GO annotations are ultimately supported by the scientific literature, either directly or indirectly. In GO, the supporting evidence is presented in the form of a GO Evidence Codes and either a published reference or description of the methodology used to create the annotation. The GO evidence codes describe the type of evidence and reflect how far removed the annotated assertion is from direct experimental evidence, and whether this evidence was reviewed by an expert biocurator." -- http://geneontology.org/docs/go-annotations/
```{r}
#GOterm enrichment in up-regulated genes:
upregulated = hyperGTest(
GSEAGOHyperGParams(name = "snf2 upregged",
geneSetCollection=gsc,geneIds = uplist,
universeGeneIds=universe,ontology = "BP",pvalueCutoff = 0.05,conditional = FALSE,testDirection = "over"))
upregulated
htmlReport(upregulated, file="enrichedUPgostats.html")
upgoterms <-data.frame(summary(upregulated))
#write.csv(upgoterms,"upgo.csv")
```
```{r}
upgoterms$Term
```
**GOterm enrichment among significantly DOWNreguated genes:**
```{r, echo=FALSE}
downregulated = hyperGTest(
GSEAGOHyperGParams(name = "Symbiosis DownRegged",
geneSetCollection=gsc,geneIds = downlist,
universeGeneIds=universe,ontology = "BP",pvalueCutoff = 0.05,conditional = FALSE,testDirection = "over"))
downregulated
#htmlReport(downregulated, file = "enrichedDOWNgostats.html")
downgoterms <- data.frame(summary(downregulated))
#write.csv(downgoterms,"downgo.csv")
```
##ReviGO Plot to visualize GO term enrichment results
Results were imported into [ReviGO]("http://revigo.irb.hr/"), summarizes long lists of GO terms by removing redundant terms and clustering terms with similar meanings.
```{r, echo= FALSE}
Down4revi <-as.data.frame(downgoterms$GOBPID)
names(Down4revi) <- c("id")
Down4revi$updown <- 0
up4revi <-as.data.frame(upgoterms$GOBPID)
names(up4revi)<-c("id")
up4revi$updown <- 1
revi<-rbind(up4revi, Down4revi)
write.csv(revi, "revi.csv")
```
```{r}
revigo.names <- c("term_ID","description","frequency_%","plot_X","plot_Y","plot_size","value","uniqueness","dispensability");
revigo.data <- rbind(c("GO:0000726","non-recombinational repair", 0.050,-2.710, 3.637, 3.811, 1.0000,0.837,0.000),
c("GO:0008152","metabolic process",75.387, 0.614,-2.095, 6.986, 0.0000,0.997,0.000),
c("GO:0009117","nucleotide metabolic process", 4.166,-4.553,-3.763, 5.728, 0.0000,0.466,0.000),
c("GO:0034220","ion transmembrane transport", 3.528, 6.208,-2.001, 5.656, 1.0000,0.774,0.000),
c("GO:0051179","localization",18.495, 0.415,-0.039, 6.375, 1.0000,0.992,0.000),
c("GO:0065008","regulation of biological quality", 3.395, 1.278,-5.548, 5.639, 1.0000,0.970,0.000),
c("GO:0071840","cellular component organization or biogenesis", 8.568, 1.853,-3.991, 6.041, 1.0000,0.990,0.000),
c("GO:0007029","endoplasmic reticulum organization", 0.032, 4.009, 4.791, 3.617, 1.0000,0.941,0.021),
c("GO:0009056","catabolic process", 4.820, 2.523,-1.180, 5.791, 0.0000,0.964,0.022),
c("GO:0071554","cell wall organization or biogenesis", 0.950, 2.900,-2.495, 5.086, 1.0000,0.946,0.027),
c("GO:0009058","biosynthetic process",31.611, 2.398,-2.009, 6.608, 0.0000,0.956,0.033),
c("GO:0042737","drug catabolic process", 0.001, 0.782, 3.197, 2.286, 0.0000,0.900,0.042),
c("GO:0090407","organophosphate biosynthetic process", 4.110,-5.521, 2.779, 5.722, 1.0000,0.580,0.055),
c("GO:0017144","drug metabolic process", 0.058, 0.462,-1.061, 3.868, 0.0000,0.929,0.056),
c("GO:0046486","glycerolipid metabolic process", 0.593,-2.347,-7.356, 4.881, 1.0000,0.774,0.072),
c("GO:1901135","carbohydrate derivative metabolic process", 6.319, 1.524, 0.788, 5.909, 1.0000,0.913,0.075),
c("GO:0006733","oxidoreduction coenzyme metabolic process", 1.273, 0.639, 3.927, 5.213, 0.0000,0.865,0.079),
c("GO:0006091","generation of precursor metabolites and energy", 1.940, 2.105, 1.217, 5.396, 0.0000,0.903,0.084),
c("GO:0010467","gene expression",19.671,-0.620, 6.991, 6.402, 0.0000,0.881,0.093),
c("GO:0005975","carbohydrate metabolic process", 5.260, 0.741, 2.005, 5.829, 0.0000,0.902,0.098),
c("GO:0006000","fructose metabolic process", 0.022,-3.320,-6.041, 3.458, 1.0000,0.762,0.116),
c("GO:0006388","tRNA splicing, via endonucleolytic cleavage and ligation", 0.040,-2.601, 4.599, 3.706, 1.0000,0.828,0.157),
c("GO:0006112","energy reserve metabolic process", 0.168,-1.526,-7.706, 4.334, 0.0000,0.809,0.195),
c("GO:0016051","carbohydrate biosynthetic process", 1.079,-6.138,-3.253, 5.141, 0.0000,0.688,0.204),
c("GO:0072524","pyridine-containing compound metabolic process", 1.351,-5.667,-0.357, 5.239, 0.0000,0.764,0.228),
c("GO:0043603","cellular amide metabolic process", 6.879,-5.153, 0.910, 5.946, 0.0000,0.804,0.242),
c("GO:0032507","maintenance of protein location in cell", 0.057, 3.601,-5.348, 3.868, 1.0000,0.672,0.255),
c("GO:1901137","carbohydrate derivative biosynthetic process", 3.651,-5.366, 3.669, 5.671, 1.0000,0.764,0.255),
c("GO:1901564","organonitrogen compound metabolic process",17.886,-5.069, 1.674, 6.361, 0.0000,0.810,0.264),
c("GO:1901576","organic substance biosynthetic process",30.365,-5.821, 3.462, 6.591, 0.0000,0.780,0.275),
c("GO:0052803","imidazole-containing compound metabolic process", 0.420,-6.364,-0.441, 4.732, 0.0000,0.788,0.289),
c("GO:0006839","mitochondrial transport", 0.182, 6.220,-0.937, 4.369, 1.0000,0.874,0.296),
c("GO:0015931","nucleobase-containing compound transport", 0.198, 5.878,-1.374, 4.404, 1.0000,0.833,0.299),
c("GO:0055114","oxidation-reduction process",15.060,-3.403,-7.152, 6.286, 0.0000,0.769,0.310),
c("GO:0019538","protein metabolic process",18.489,-0.950, 6.815, 6.375, 0.0000,0.869,0.339),
c("GO:0072521","purine-containing compound metabolic process", 2.673,-5.665, 0.029, 5.535, 0.0000,0.744,0.357),
c("GO:0072522","purine-containing compound biosynthetic process", 1.502,-6.315, 0.777, 5.285, 0.0000,0.669,0.361),
c("GO:0071555","cell wall organization", 0.709, 4.128, 4.999, 4.959, 1.0000,0.915,0.375),
c("GO:0044281","small molecule metabolic process",15.138,-3.256,-6.959, 6.288, 0.0000,0.769,0.415),
c("GO:0006418","tRNA aminoacylation for protein translation", 1.099,-5.437,-1.496, 5.149, 0.0000,0.579,0.417),
c("GO:0019439","aromatic compound catabolic process", 1.164,-2.172, 0.819, 5.174, 0.0000,0.775,0.425),
c("GO:0044267","cellular protein metabolic process",14.293,-2.084, 5.969, 6.263, 0.0000,0.840,0.426),
c("GO:0010256","endomembrane system organization", 0.189, 3.448, 4.288, 4.385, 1.0000,0.933,0.434),
c("GO:0051236","establishment of RNA localization", 0.101, 6.361,-2.918, 4.111, 1.0000,0.836,0.472),
c("GO:0006090","pyruvate metabolic process", 0.817,-3.803,-6.807, 5.021, 0.0000,0.669,0.475),
c("GO:0042866","pyruvate biosynthetic process", 0.005,-5.049,-5.389, 2.780, 0.0000,0.737,0.488),
c("GO:0006487","protein N-linked glycosylation", 0.076,-4.680,-1.784, 3.992, 1.0000,0.754,0.493),
c("GO:0006403","RNA localization", 0.118, 5.789,-2.639, 4.179, 1.0000,0.856,0.497),
c("GO:0045229","external encapsulating structure organization", 0.795, 3.535, 4.653, 5.009, 1.0000,0.925,0.498),
c("GO:1901566","organonitrogen compound biosynthetic process",14.064,-6.317, 0.855, 6.256, 0.0000,0.682,0.506),
c("GO:0055086","nucleobase-containing small molecule metabolic process", 4.917,-4.920,-4.419, 5.800, 0.0000,0.608,0.518),
c("GO:0015849","organic acid transport", 1.024, 5.627,-3.839, 5.119, 1.0000,0.738,0.527),
c("GO:0071166","ribonucleoprotein complex localization", 0.097, 6.117,-3.030, 4.094, 1.0000,0.845,0.529),
c("GO:0006302","double-strand break repair", 0.211,-3.379, 4.862, 4.432, 1.0000,0.818,0.540),
c("GO:0019725","cellular homeostasis", 1.253,-0.668,-7.416, 5.206, 1.0000,0.765,0.552),
c("GO:0034645","cellular macromolecule biosynthetic process",19.291,-5.062, 3.332, 6.394, 0.0000,0.748,0.555),
c("GO:0009059","macromolecule biosynthetic process",19.548,-5.044, 4.271, 6.399, 0.0000,0.779,0.558),
c("GO:0006810","transport",17.616, 6.563,-2.324, 6.354, 1.0000,0.805,0.560),
c("GO:0006556","S-adenosylmethionine biosynthetic process", 0.050,-2.714, 5.972, 3.810, 0.0000,0.838,0.562),
c("GO:0015718","monocarboxylic acid transport", 0.155, 6.002,-1.867, 4.299, 1.0000,0.734,0.565),
c("GO:0046394","carboxylic acid biosynthetic process", 4.159,-5.321,-4.037, 5.727, 0.0000,0.550,0.575),
c("GO:0009072","aromatic amino acid family metabolic process", 0.719,-4.716,-4.801, 4.965, 0.0000,0.640,0.577),
c("GO:0006414","translational elongation", 0.777,-6.028, 1.758, 4.999, 0.0000,0.721,0.577),
c("GO:1901136","carbohydrate derivative catabolic process", 0.423,-0.138, 4.014, 4.735, 0.0000,0.814,0.582),
c("GO:0044271","cellular nitrogen compound biosynthetic process",22.502,-6.173, 1.273, 6.460, 0.0000,0.709,0.587),
c("GO:0006913","nucleocytoplasmic transport", 0.237, 5.489,-2.450, 4.483, 1.0000,0.813,0.587),
c("GO:0046500","S-adenosylmethionine metabolic process", 0.090, 1.208, 5.818, 4.062, 0.0000,0.880,0.591),
c("GO:0006082","organic acid metabolic process", 9.086,-3.977,-6.024, 6.067, 0.0000,0.606,0.592),
c("GO:0051235","maintenance of location", 0.129, 5.215,-4.639, 4.219, 1.0000,0.853,0.593),
c("GO:0006850","mitochondrial pyruvate transport", 0.015, 6.471,-1.564, 3.282, 1.0000,0.762,0.594),
c("GO:0046434","organophosphate catabolic process", 0.365,-1.444, 0.855, 4.671, 0.0000,0.662,0.596),
c("GO:0016043","cellular component organization", 7.239, 4.098, 5.023, 5.968, 1.0000,0.910,0.602),
c("GO:0050657","nucleic acid transport", 0.100, 6.703,-0.731, 4.108, 1.0000,0.829,0.604),
c("GO:0043038","amino acid activation", 1.124,-4.483,-4.976, 5.159, 0.0000,0.628,0.605),
c("GO:0006811","ion transport", 5.344, 6.702,-2.153, 5.836, 1.0000,0.828,0.609),
c("GO:0000041","transition metal ion transport", 0.344, 6.793,-1.351, 4.645, 1.0000,0.817,0.615),
c("GO:0046474","glycerophospholipid biosynthetic process", 0.266,-5.351,-3.787, 4.534, 1.0000,0.590,0.635),
c("GO:0016052","carbohydrate catabolic process", 1.078,-1.478,-2.051, 5.141, 0.0000,0.826,0.643),
c("GO:0006826","iron ion transport", 0.133, 6.716,-1.102, 4.233, 1.0000,0.820,0.654),
c("GO:0044283","small molecule biosynthetic process", 5.677,-5.610,-3.730, 5.862, 0.0000,0.590,0.654),
c("GO:0006643","membrane lipid metabolic process", 0.382,-2.290,-7.585, 4.690, 1.0000,0.783,0.657),
c("GO:0044249","cellular biosynthetic process",30.048,-6.423, 2.450, 6.586, 0.0000,0.762,0.658),
c("GO:0033365","protein localization to organelle", 0.609, 6.157,-2.805, 4.893, 1.0000,0.806,0.684),
c("GO:0051169","nuclear transport", 0.239, 5.501,-2.340, 4.486, 1.0000,0.829,0.684),
c("GO:0055085","transmembrane transport", 8.916, 6.686,-2.407, 6.058, 1.0000,0.819,0.684),
c("GO:0009312","oligosaccharide biosynthetic process", 0.248,-6.027,-3.322, 4.502, 0.0000,0.720,0.687),
c("GO:0044272","sulfur compound biosynthetic process", 1.235,-4.762, 4.724, 5.200, 0.0000,0.829,0.689),
c("GO:0000105","histidine biosynthetic process", 0.360,-5.721,-2.720, 4.664, 0.0000,0.617,0.691),
c("GO:1901605","alpha-amino acid metabolic process", 3.625,-4.749,-4.604, 5.668, 0.0000,0.580,0.698));
one.data <- data.frame(revigo.data);
names(one.data) <- revigo.names;
one.data <- one.data [(one.data$plot_X != "null" & one.data$plot_Y != "null"), ];
one.data$plot_X <- as.numeric( as.character(one.data$plot_X) );
one.data$plot_Y <- as.numeric( as.character(one.data$plot_Y) );
one.data$plot_size <- as.numeric( as.character(one.data$plot_size) );
one.data$frequency <- as.numeric( as.character(one.data$frequency) );
one.data$uniqueness <- as.numeric( as.character(one.data$uniqueness) );
one.data$dispensability <- as.numeric( as.character(one.data$dispensability) );
ex <- one.data [ one.data$dispensability < 0.15, ]
```
```{r}
one.data$value <- gsub('0', 'Down', one.data$value)
one.data$value <- gsub('1', 'Up', one.data$value)
reviGOplot <- ggplot( data = one.data ) +
geom_point( aes( plot_X, plot_Y, colour = value), alpha = I(0.6), size =7) +
scale_colour_manual(values =c("#3B9AB2", "red"), labels= c("Down", "Up")) +
geom_point( aes(plot_X, plot_Y), shape = 21, fill = "transparent", colour = I (alpha ("black", 0.6) ), size = 7) + scale_size_area() + scale_size( range=c(5, 30)) + theme_bw() + theme(legend.key = element_blank()) + theme(text = element_text(size=14)) + theme(legend.title=element_blank()) +labs (y = "Axis 2", x = "Axis 1") +geom_text( data = ex, aes(plot_X, plot_Y, label = description), colour = I(alpha("black", 0.85)), size = 3 )
#+geom_label_repel(data = ex, aes(plot_X, plot_Y, label = description), colour = I(alpha("black", 0.85)), size = 4, nudge_x = 0 , point.padding = 0.2, label.padding = 0.1)+ labs (y = "Axis 2", x = "Axis 1")
# need library ggrepel
# install.packages("ggrepel")
#library(ggrepel
reviGOplot
```
### treemap
```{r, warning=FALSE}
revigo.names <- c("term_ID","description","freqInDbPercent","value","uniqueness","dispensability","representative");
revigo.data <- rbind(c("GO:0000726","non-recombinational repair",0.050,1.0000,0.867,0.000,"non-recombinational repair"),
c("GO:0006302","double-strand break repair",0.211,1.0000,0.854,0.540,"non-recombinational repair"),
c("GO:0006388","tRNA splicing, via endonucleolytic cleavage and ligation",0.040,1.0000,0.864,0.157,"non-recombinational repair"),
c("GO:0072522","purine-containing compound biosynthetic process",1.502,1.0000,0.743,0.361,"non-recombinational repair"),
c("GO:0072521","purine-containing compound metabolic process",2.673,1.0000,0.825,0.153,"non-recombinational repair"),
c("GO:0034220","ion transmembrane transport",3.528,1.0000,0.609,0.000,"ion transmembrane transport"),
c("GO:0050657","nucleic acid transport",0.100,1.0000,0.699,0.604,"ion transmembrane transport"),
c("GO:0055085","transmembrane transport",8.916,1.0000,0.682,0.684,"ion transmembrane transport"),
c("GO:0032507","maintenance of protein location in cell",0.057,1.0000,0.557,0.255,"ion transmembrane transport"),
c("GO:0051235","maintenance of location",0.129,1.0000,0.740,0.593,"ion transmembrane transport"),
c("GO:0051236","establishment of RNA localization",0.101,1.0000,0.712,0.472,"ion transmembrane transport"),
c("GO:0000041","transition metal ion transport",0.344,1.0000,0.679,0.615,"ion transmembrane transport"),
c("GO:0006810","transport",17.616,1.0000,0.659,0.560,"ion transmembrane transport"),
c("GO:0006811","ion transport",5.344,1.0000,0.699,0.609,"ion transmembrane transport"),
c("GO:0006826","iron ion transport",0.133,1.0000,0.685,0.654,"ion transmembrane transport"),
c("GO:0071166","ribonucleoprotein complex localization",0.097,1.0000,0.727,0.529,"ion transmembrane transport"),
c("GO:0006850","mitochondrial pyruvate transport",0.015,1.0000,0.635,0.594,"ion transmembrane transport"),
c("GO:0006839","mitochondrial transport",0.182,1.0000,0.775,0.296,"ion transmembrane transport"),
c("GO:0033365","protein localization to organelle",0.609,1.0000,0.661,0.684,"ion transmembrane transport"),
c("GO:0019725","cellular homeostasis",1.253,1.0000,0.766,0.552,"ion transmembrane transport"),
c("GO:0015849","organic acid transport",1.024,1.0000,0.618,0.527,"ion transmembrane transport"),
c("GO:0006403","RNA localization",0.118,1.0000,0.744,0.497,"ion transmembrane transport"),
c("GO:0015931","nucleobase-containing compound transport",0.198,1.0000,0.706,0.299,"ion transmembrane transport"),
c("GO:0015718","monocarboxylic acid transport",0.155,1.0000,0.599,0.565,"ion transmembrane transport"),
c("GO:0006913","nucleocytoplasmic transport",0.237,1.0000,0.672,0.587,"ion transmembrane transport"),
c("GO:0051169","nuclear transport",0.239,1.0000,0.699,0.684,"ion transmembrane transport"),
c("GO:0051179","localization",18.495,1.0000,0.985,0.000,"localization"),
c("GO:0065008","regulation of biological quality",3.395,1.0000,0.946,0.000,"regulation of biological quality"),
c("GO:0071840","cellular component organization or biogenesis",8.568,1.0000,0.983,0.000,"cellular component organization or biogenesis"),
c("GO:0007029","endoplasmic reticulum organization",0.032,1.0000,0.925,0.021,"endoplasmic reticulum organization"),
c("GO:0071555","cell wall organization",0.709,1.0000,0.891,0.375,"endoplasmic reticulum organization"),
c("GO:0016043","cellular component organization",7.239,1.0000,0.897,0.602,"endoplasmic reticulum organization"),
c("GO:0045229","external encapsulating structure organization",0.795,1.0000,0.909,0.498,"endoplasmic reticulum organization"),
c("GO:0010256","endomembrane system organization",0.189,1.0000,0.917,0.434,"endoplasmic reticulum organization"),
c("GO:0071554","cell wall organization or biogenesis",0.950,1.0000,0.948,0.027,"cell wall organization or biogenesis"),
c("GO:0090407","organophosphate biosynthetic process",4.110,1.0000,0.611,0.055,"organophosphate biosynthesis"),
c("GO:1901137","carbohydrate derivative biosynthetic process",3.651,1.0000,0.794,0.255,"organophosphate biosynthesis"),
c("GO:0006487","protein N-linked glycosylation",0.076,1.0000,0.782,0.493,"organophosphate biosynthesis"),
c("GO:0017144","drug metabolic process",0.058,1.0000,0.938,0.056,"drug metabolism"),
c("GO:0046486","glycerolipid metabolic process",0.593,1.0000,0.798,0.072,"glycerolipid metabolism"),
c("GO:0006643","membrane lipid metabolic process",0.382,1.0000,0.805,0.657,"glycerolipid metabolism"),
c("GO:0046474","glycerophospholipid biosynthetic process",0.266,1.0000,0.624,0.635,"glycerolipid metabolism"),
c("GO:0006000","fructose metabolic process",0.022,1.0000,0.833,0.116,"glycerolipid metabolism"),
c("GO:1901135","carbohydrate derivative metabolic process",6.319,1.0000,0.934,0.075,"carbohydrate derivative metabolism"));
stuff <- data.frame(revigo.data);
names(stuff) <- revigo.names;
stuff$uniqueness <- as.numeric( as.character(stuff$uniqueness) );
stuff$freqInDbPercent <- as.numeric( as.character(stuff$freqInDbPercent) );
stuff$uniqueness <- as.numeric( as.character(stuff$uniqueness) );
stuff$dispensability <- as.numeric( as.character(stuff$dispensability) );
#colors
j1<- wes_palette("Darjeeling1")
Dj1<- wes_palette("Darjeeling2")
Cv <- wes_palette("Cavalcanti1")
Gb<- wes_palette("GrandBudapest1")
Gb2<- wes_palette("GrandBudapest2")
Br<- wes_palette("BottleRocket2")
Rm<- wes_palette("Rushmore1")
R2<-wes_palette("Royal2")
mr3<- wes_palette("Moonrise3")
wpcolor<- c( Dj1[1:4], "blue", j1[1:4], R2[3],Cv[1:3],Br[1:4], Rm[2:4], Gb2,mr3,Gb[2:4], "white" )
#pdf( file="revigo_treemap.pdf", width=16, height=9 ) # width and height are in inches
# check the tmPlot command documentation for all possible parameters - there are a lot more
tmPlot(
stuff,
index = c("representative","description"),
vSize = "uniqueness",
type = "categorical",
vColor = "representative",
title = "REVIGO Gene Ontology treemap",
inflate.labels = FALSE, # set this to TRUE for space-filling group labels - good for posters
palette = wpcolor,
lowerbound.cex.labels = 0, # try to draw as many labels as possible (still, some small squares may not get a label)
bg.labels = "#CCCCCCAA", # define background color of group labels
# "#CCCCCC00" is fully transparent, "#CCCCCCAA" is semi-transparent grey, NA is opaque
position.legend = "none"
)
#dev.off()
```
## GUI Options
Need EnsembleIDs (or sometimes another standard gene ID format) - easiest to use if you have mapped to an annotated published genome or transcriptome, especially when working with model organisms
David: https://david.ncifcrf.gov/gene2gene.jsp
GOrilla:
http://cbl-gorilla.cs.technion.ac.il/
## KEGG
KEGG annotation and pathway enrichment is another type of functional enrichment that can be more useful than GO terms.
https://www.kegg.jp/
some useful KEGG tools:
- function `kegga` in edgeR package (bioconductor)
- pathview package (bioconductor)
- iPath3, interactive: https://pathways.embl.de/