---
title: "GO-BayesC R2 Comparison to GO-TBLUP"
output:
  workflowr::wflow_html:
    toc: true
    latex_engine: "xelatex"
    code_folding: "hide"
editor_options:
  chunk_output_type: console
---
```{r 0-setup, include=FALSE, warning=FALSE}
library(dplyr)
library(data.table)
library(ggplot2)
library(cowplot)
library(qqman)
library(viridis)
library(scales)
library(tidyverse)
library(ggcorrplot)
library(melt)
library(reshape2)
#options
options(bitmapType = "cairo")
options(error = function() traceback(3))
#seed
set.seed(123)
#ggplot holder list
gg <- vector(mode='list', length=12)
```
```{r 0-functions}
#plotmaker functions----
# Violin + boxplot of prediction accuracy (`cor`) for each `method`.
# `psize` is currently unused; it is kept so existing positional calls still work.
ggMake3 <- function(data, sex, psize, custom.title, custom.Xlab, custom.Ylab){
  plothole <- ggplot(data, aes(y=cor, x=method, fill=method)) +
    labs(x=custom.Xlab, y=custom.Ylab, tag=sex, title=custom.title) +
    geom_violin(color = 'blue', width = 0.65) +
    geom_boxplot(color='#440154FF', width = 0.15) +
    theme_minimal() +
    scale_fill_viridis(discrete=TRUE, begin=0.2) +
    theme(axis.text.x = element_text(angle = 90),
          text=element_text(size=15),
          plot.tag = element_text(size=10),
          legend.position = 'none') +
    # diamond marks the per-method mean
    stat_summary(fun=mean, color='#440154FF', geom='point',
                 shape=18, size=3, show.legend=FALSE)
  return(plothole)
}

# Scatter of between-method correlation (`cor`) against the number of top terms
# included (`method`), colored by method pair (`index`).
methodCorMake <- function(data, sex, custom.title, custom.Xlab, custom.Ylab){
  plothole <- ggplot(data, aes(x=method, y=cor, color=index)) +
    labs(x=custom.Xlab, y=custom.Ylab, tag=sex, title=custom.title, color='Method Correlation') +
    geom_point() +
    theme_minimal() +
    # labels are applied in the order of the levels of `index`
    scale_color_viridis(discrete = TRUE, begin = 0.2,
                        labels=c('Bayes Unlimited : Bayes 0.8','Bayes 0.8 : TBLUP','Bayes Unlimited : TBLUP')) +
    theme(axis.text.x = element_text(angle = 90),
          text=element_text(size=15),
          plot.tag = element_text(size=10))
  return(plothole)
}
```
```{r plotMakes, warning=FALSE, eval=TRUE}
load('data/rmdTables/goRcomp/allTables.Rdata')
gg[[1]] <- ggMake3(fPreds, 'F', 1, 'Prediction Accuracy by Method in Females', 'Method', 'Prediction Accuracy')
gg[[2]] <- ggMake3(mPreds, 'M', 1, 'Prediction Accuracy by Method in Males', 'Method', 'Prediction Accuracy')
gg[[3]] <- methodCorMake(fCors, 'F', 'Correlation of Top GO Terms in Females', '# of terms included', 'Correlation')
gg[[4]] <- methodCorMake(mCors, 'M', 'Correlation of Top GO Terms in Males', '# of terms included', 'Correlation')
```
### Rundown
The goal of this assessment was to determine, retroactively, how effective our R2 selection was. We had previously generated data using GO terms to improve prediction accuracy for two methods:

+ GO-BayesC, a Bayesian penalized linear regression method
+ GO-TBLUP, a linear mixed model method

In our initial trials, we set the expected proportion of variance explained (POVE), R2, to 0.8 based on [parameter optimization findings](lasso-OP.html). The original method had a single fixed effect, while the GO-annotated method had two fixed effects in the model.

After testing back and forth, we settled on removing the R2 limitations completely to allow the model to generate R2 internally. In comparison, the original GO-BayesC had a hard limit on R2 alongside a relative limit between effects.

### Prediction Accuracy Distributions: Both Sexes
```{r allData, warning=FALSE}
plot_grid(gg[[1]], gg[[2]], ncol = 2)
```
In both females and males, the distributions of GO-BayesC with Unlimited R2 and GO-TBLUP are more similar to each other than either is to GO-BayesC with R2 = 0.8. GO-BayesC with Unlimited R2 (R2_Unlimited) has both the highest and the lowest prediction accuracies of any method in both sexes, followed by GO-TBLUP.
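
The range comparison above can be checked numerically with a per-method summary. The chunk below is a minimal sketch on made-up data, not the project's results: `toyPreds` stands in for `fPreds` or `mPreds`, which are assumed to have the `method` and `cor` columns that `ggMake3()` plots. The method labels other than `R2_Unlimited` and the object name `rangeByMethod` are hypothetical.

```{r accuracyRangeSketch, eval=FALSE}
library(dplyr)

# Toy stand-in for fPreds/mPreds: one row per GO term and method.
# Values are simulated for illustration only.
set.seed(1)
toyPreds <- data.frame(
  method = rep(c("R2_0.8", "R2_Unlimited", "TBLUP"), each = 50),
  cor    = c(rnorm(50, 0.20, 0.02), rnorm(50, 0.20, 0.06), rnorm(50, 0.20, 0.04))
)

# Minimum, mean, and maximum prediction accuracy per method,
# sorted so the method with the highest maximum comes first.
rangeByMethod <- toyPreds %>%
  group_by(method) %>%
  summarise(minCor = min(cor), meanCor = mean(cor), maxCor = max(cor),
            .groups = "drop") %>%
  arrange(desc(maxCor))
rangeByMethod
```

Swapping `toyPreds` for `fPreds` or `mPreds` should give the corresponding table for each sex, provided those tables have the assumed columns.
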
The following graphs show the correlation between methods over the top subsets of GO terms, after ordering terms by prediction accuracy under Bayes Unlimited. This assesses whether constraining R2 or letting the model generate R2 internally is more effective, by comparing the two cases to each other and to GO-TBLUP, an 'outgroup' that has been shown to select biologically relevant GO terms.
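
The procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the code that produced `fCors` and `mCors`: the wide table `toyAcc`, its column names, the cutoff spacing of 50 terms, and the use of Pearson correlation are all assumptions, and the data are simulated. The output columns (`method`, `index`, `cor`) mirror what `methodCorMake()` expects.

```{r topTermCorSketch, eval=FALSE}
# Toy stand-in: one row per GO term, one accuracy column per method
# (hypothetical column names, simulated values).
set.seed(1)
nTerms <- 500
shared <- rnorm(nTerms)
toyAcc <- data.frame(
  bayesUnlimited = shared + rnorm(nTerms, sd = 0.5),
  bayes08        = shared + rnorm(nTerms, sd = 0.5),
  tblup          = shared + rnorm(nTerms, sd = 0.5)
)

# Order terms from most to least predictive under Bayes Unlimited.
ranked <- toyAcc[order(toyAcc$bayesUnlimited, decreasing = TRUE), ]

# For each cutoff n, correlate each pair of methods over the top n terms.
cutoffs <- seq(50, nTerms, by = 50)
pairLabels <- c("Bayes Unlimited : Bayes 0.8",
                "Bayes 0.8 : TBLUP",
                "Bayes Unlimited : TBLUP")
toyCors <- do.call(rbind, lapply(cutoffs, function(n) {
  top <- ranked[seq_len(n), ]
  data.frame(
    method = n,  # number of top terms included (x axis in methodCorMake)
    index  = factor(pairLabels, levels = pairLabels),
    cor    = c(cor(top$bayesUnlimited, top$bayes08),
               cor(top$bayes08, top$tblup),
               cor(top$bayesUnlimited, top$tblup))
  )
}))
head(toyCors)
```

Fixing the factor levels of `index` keeps the pair order aligned with the legend labels hard-coded in `methodCorMake()`.
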
### Females
```{r fData, warning=FALSE}
plot_grid(gg[[3]], ncol=1)
```
As expected, the two Bayesian methods are most similar to each other across the interior section of the plot.

In females, the two Bayesian methods correlate with GO-TBLUP about equally for the top terms, so it is unclear which of them is closer to GO-TBLUP.
### Males
```{r mData, warning=FALSE}
plot_grid(gg[[4]], ncol=1)
```
There is a striking difference in males. For the top terms, the correlation between Bayes Unlimited and TBLUP exceeds the correlation between the two Bayesian methods by a slim margin. This effect deteriorates at around the top 250 terms, which suggests that the top terms are correlated while the inclusion of less predictive terms lowers the correlation coefficient.

Bayes Unlimited has a distinctly higher correlation with GO-TBLUP than Bayes with R2 = 0.8 does.
### Takeaways
Leaving BayesC to decide R2 internally improves the maximum prediction accuracy. The resulting term selection also correlates more strongly with the proven GO-TBLUP method, suggesting that the two models favor similar terms. This strengthens our gene enrichment analysis, as top terms are more consistent across models than previously described.