---
title: "GO-BayesC R2 Comparison to GO-TBLUP"
output:
  workflowr::wflow_html:
    toc: true
    latex_engine: "xelatex"
    code_folding: "hide"
editor_options:
  chunk_output_type: console
---

```{r 0-setup, include=FALSE, warning=FALSE}

library(dplyr)
library(data.table)
library(ggplot2)
library(cowplot)
library(qqman)
library(viridis)
library(scales)
library(tidyverse)
library(ggcorrplot)
# melt() is provided by reshape2 (and data.table); no separate 'melt' package is loaded
library(reshape2)

#options
options(bitmapType = "cairo")
options(error = function() traceback(3))

#seed
set.seed(123)

#ggplot holder list
gg <- vector(mode='list', length=12)

```

```{r 0-functions}
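#plotmaker functions----

# Violin + box plot of prediction accuracy (cor) per method for one sex.
# The mean of each method is marked with a diamond.
# psize is currently unused; it is kept so existing calls keep working.
ggMake3 <- function(data, sex, psize, custom.title, custom.Xlab, custom.Ylab){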
  plothole <- ggplot(data,aes(y=cor,x=method, fill=method))+
    labs(x=custom.Xlab,y=custom.Ylab, tag=sex, title=custom.title) +
    geom_violin(color = 'blue', width = 0.65) +
    geom_boxplot(color='#440154FF', width = 0.15) +
    theme_minimal()+
    scale_fill_viridis(discrete=TRUE, begin=0.2)+
    theme(axis.text.x = element_text(angle = 90),
          text=element_text(size=15),
          plot.tag = element_text(size=10),
          legend.position = 'none') +
    stat_summary(fun=mean, color='#440154FF', geom='point', 
                 shape=18, size=3, show.legend=FALSE)
  return(plothole)
}

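# Point plot of between-method correlation (cor), colored by method pair (index).
# x is mapped to the `method` column, labeled '# of terms included' in the calls below.
methodCorMake <- function(data, sex, custom.title, custom.Xlab, custom.Ylab){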
  plothole <- ggplot(data, aes(x=method, y=cor, color=index))+
    labs(x=custom.Xlab,y=custom.Ylab, tag=sex, title=custom.title, color='Method Correlation') +
    geom_point() +
    theme_minimal()+
    scale_color_viridis(discrete = TRUE, begin = 0.2,
                        labels=c('Bayes Unlimited : Bayes 0.8','Bayes 0.8 : TBLUP','Bayes Unlimited : TBLUP')) +
    theme(axis.text.x = element_text(angle = 90),
          text=element_text(size=15),
          plot.tag = element_text(size=10))
  return(plothole)
}

```

```{r plotMakes, warning=FALSE, eval=TRUE}

load('data/rmdTables/goRcomp/allTables.Rdata')

gg[[1]] <- ggMake3(fPreds, 'F', 1, 'Prediction Accuracy by Method in Females', 'Method', 'Prediction Accuracy')

gg[[2]] <- ggMake3(mPreds, 'M', 1, 'Prediction Accuracy by Method in Males', 'Method', 'Prediction Accuracy')

gg[[3]] <- methodCorMake(fCors, 'F', 'Correlation of Top GO Terms in Females', '# of terms included', 'Correlation')

gg[[4]] <- methodCorMake(mCors, 'M', 'Correlation of Top GO Terms in Males', '# of terms included', 'Correlation')

```


### Rundown

The goal of this assessment was to retrospectively evaluate our choice of R2. We had previously generated data using GO terms to improve prediction accuracy for two methods:

+ GO-BayesC, a Bayesian penalized linear regression method
+ GO-TBLUP, a linear mixed model method

In our initial trials, we set the expected proportion of variance explained (POVE), R2, to 0.8 based on [parameter optimization findings](lasso-OP.html). The original method had a single fixed effect, while the GO-annotated method had two fixed effects in the model.

After further testing, we settled on removing the R2 limitations completely so that the model generates R2 internally. In comparison, the original GO-BayesC had a hard limit on R2 alongside a relative limit between effects.

### Prediction Accuracy Distribution: Both Sexes

```{r allData, warning=FALSE}

plot_grid(gg[[1]], gg[[2]], ncol = 2)

```

In both females and males, the distributions of GO-BayesC with unlimited R2 (R2_Unlimited) and GO-TBLUP resemble each other more than either resembles GO-BayesC with R2 = 0.8. R2_Unlimited contains both the maximum and the minimum prediction accuracies across all methods in both sexes, followed by GO-TBLUP.
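The range comparison above can be checked numerically with a per-method summary. The sketch below is a minimal illustration on simulated data, not on `fPreds` or `mPreds`: it assumes a long-format table with a `method` column and a `cor` column (mirroring the aesthetics used in `ggMake3`). The method labels, the simulated values, and the names `preds` and `rangeTable` are hypothetical.

```r
# Hypothetical stand-in for fPreds / mPreds: one row per GO term and method.
# The values are simulated and say nothing about the real results.
set.seed(1)
preds <- data.frame(
  method = rep(c("R2_0.8", "R2_Unlimited", "TBLUP"), each = 100),
  cor    = c(rnorm(100, mean = 0.2, sd = 0.02),
             rnorm(100, mean = 0.2, sd = 0.06),
             rnorm(100, mean = 0.2, sd = 0.04))
)

# Minimum, mean, and maximum prediction accuracy per method (base R only).
# Rows are methods, columns are the three statistics.
rangeTable <- t(sapply(
  split(preds$cor, preds$method),
  function(x) c(min = min(x), mean = mean(x), max = max(x))
))

rangeTable
```

Applied to the real tables, the method with the smallest `min` and the largest `max` would be the one whose distribution spans all others.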

The following graphs were generated by ordering GO terms by prediction accuracy from Bayes Unlimited, taking the top subsets of that ordering, and computing the correlation between methods within each subset. This assesses the effect of constraining R2 versus letting the model generate R2 internally: the two cases are compared to each other and to GO-TBLUP, an 'outgroup' that has been shown to select biologically relevant GO terms.
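The procedure described above can be sketched roughly as follows. This is an illustration on simulated data, not the code that produced `fCors` and `mCors`. The function `topTermCors`, the wide-format columns `unlimited`, `r2_08`, and `tblup`, the pair labels, and the cutoffs are all assumptions. The output columns `method`, `index`, and `cor` are named to match the aesthetics used in `methodCorMake`.

```r
# Hypothetical sketch: correlation between methods within the top-k GO terms,
# where terms are ranked by Bayes Unlimited prediction accuracy.
topTermCors <- function(acc, cutoffs) {
  # Order terms from most to least predictive under Bayes Unlimited
  ranked <- acc[order(acc$unlimited, decreasing = TRUE), ]
  rows <- lapply(cutoffs, function(k) {
    top <- ranked[seq_len(k), ]  # the k top-ranked terms
    data.frame(
      method = k,  # number of terms included
      index  = c("unlimited_r2_08", "r2_08_tblup", "unlimited_tblup"),
      cor    = c(cor(top$unlimited, top$r2_08),
                 cor(top$r2_08, top$tblup),
                 cor(top$unlimited, top$tblup))
    )
  })
  do.call(rbind, rows)
}

# Simulated accuracies for 500 placeholder terms (not real results)
set.seed(1)
n <- 500
shared <- rnorm(n)
acc <- data.frame(
  term      = paste0("term_", seq_len(n)),
  unlimited = shared + rnorm(n, sd = 0.5),
  r2_08     = shared + rnorm(n, sd = 0.5),
  tblup     = shared + rnorm(n, sd = 0.5)
)

corTable <- topTermCors(acc, cutoffs = c(50, 100, 250, 500))
head(corTable)
```

Because the ranking comes from Bayes Unlimited alone, the top-k subsets are selected on one of the variables being correlated, which is worth keeping in mind when reading the correlations at small k.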

### Females

```{r fData, warning=FALSE}

plot_grid(gg[[3]], ncol=1)

```

As expected, the two Bayesian methods are most similar to each other across the interior section.

In females, it is unclear which Bayesian method has a higher correlation with GO-TBLUP: for the top terms, the two relationships are indistinguishable.


### Males

```{r mData, warning=FALSE}

plot_grid(gg[[4]], ncol=1)

```

The pattern differs in males. Among the top terms, the correlation between Bayes Unlimited and TBLUP exceeds the correlation between the two Bayesian methods by a slim margin. This effect fades around the top 250 terms. This suggests that the top terms are correlated, while the inclusion of less predictive terms decreases the correlation coefficient.

Bayes Unlimited has a distinctly higher correlation with GO-TBLUP than Bayes R2 = 0.8 does.

### Takeaways

Letting BayesC determine R2 internally improves maximum prediction accuracy. The resulting term selection also correlates more strongly with the established GO-TBLUP method, suggesting that the two models favor similar terms. This strengthens our gene enrichment analysis, as top terms are more consistent across models than previously described.