analysis/yield_link.Rmd

---
title: "Linking yield with NLR PAV"
author: "Philipp Bayer"
date: "2020-09-22"
output: workflowr::wflow_html
editor_options:
  chunk_output_type: console
---

```{r setup}
knitr::opts_chunk$set(warning = FALSE, message = FALSE) 
library(tidyverse)
library(patchwork)
library(ggsci)
library(dabestr)
library(cowplot)
library(ggsignif)
library(ggforce)

theme_set(theme_cowplot())
```


# Data loading

```{r}
npg_col = pal_npg("nrc")(9)
col_list <- c(`Wild-type`=npg_col[8],
   Landrace = npg_col[3],
  `Old cultivar`=npg_col[2],
  `Modern cultivar`=npg_col[4])

pav_table <- read_tsv('./data/soybean_pan_pav.matrix_gene.txt.gz')
```


```{r}
nbs <- read_tsv('./data/Lee.NBS.candidates.lst', col_names = c('Name', 'Class'))
nbs
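# Strip the '.t1' transcript suffix; fixed = TRUE matches the dot literally
# instead of treating it as a regex wildcard
nbs$Name <- gsub('.t1', '', nbs$Name, fixed = TRUE)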
nbs_pav_table <- pav_table %>% filter(Individual %in% nbs$Name)
```


```{r}
# Count the NLR genes present in each individual. The first column
# (`Individual`) holds the gene names; every other column is one individual,
# so the column sums give the per-individual NLR gene counts.
nbs_counts <- colSums(nbs_pav_table[-1])
nbs_res_tibb <- tibble(names = names(nbs_counts), presences = unname(nbs_counts))
```


```{r}
groups <- read_csv('./data/Table_of_cultivar_groups.csv')
groups <- groups %>% 
  mutate(`Group in violin table` = str_replace_all(`Group in violin table`, 'landrace', 'Landrace')) %>%
  mutate(`Group in violin table` = str_replace_all(`Group in violin table`, 'Old_cultivar', 'Old cultivar')) %>%
  mutate(`Group in violin table` = str_replace_all(`Group in violin table`, 'Modern_cultivar', 'Modern cultivar'))

groups$`Group in violin table` <-
  factor(
    groups$`Group in violin table`,
    levels = c('Wild-type',
               'Landrace',
               'Old cultivar',
               'Modern cultivar')
  )

nbs_joined_groups <-
  inner_join(nbs_res_tibb, groups, by = c('names' = 'Data-storage-ID'))
```

# Linking with yield

Can we link the trajectory of NLR genes with the trajectory of yield across the history of soybean breeding? Let's start with a simple regression for now.

## Yield

```{r yield_join}

yield <- read_tsv('./data/yield.txt')
yield_join <- inner_join(nbs_res_tibb, yield, by=c('names'='Line'))
```

```{r}
yield_join %>% ggplot(aes(x=presences, y=Yield)) + geom_hex() + geom_smooth() +
  xlab('NLR gene count')
```


## Protein

```{r protein_join}
protein <- read_tsv('./data/protein_phenotype.txt')
protein_join <- left_join(nbs_res_tibb, protein, by=c('names'='Line')) %>% filter(!is.na(Protein))
```

```{r}
protein_join %>% ggplot(aes(x=presences, y=Protein)) + geom_hex() + geom_smooth() +
  xlab('NLR gene count')
```


```{r}
summary(lm(Protein ~ presences, data = protein_join))
```

## Seed weight

Let's look at seed weight:

```{r seed_join}
seed_weight <- read_tsv('./data/Seed_weight_Phenotype.txt', col_names = c('names', 'wt'))
seed_join <- left_join(nbs_res_tibb, seed_weight, by = 'names') %>% filter(!is.na(wt))
```

```{r}
seed_join %>% filter(wt > 5) %>%  ggplot(aes(x=presences, y=wt)) + geom_hex() + geom_smooth() +
  ylab('Seed weight') +
  xlab('NLR gene count')
```


```{r}
summary(lm(wt ~ presences, data = seed_join))
```

## Oil content
And now let's look at the oil phenotype:

```{r oil_join}
oil <- read_tsv('./data/oil_phenotype.txt')
oil_join <- left_join(nbs_res_tibb, oil, by=c('names'='Line')) %>% filter(!is.na(Oil))
```


```{r}
oil_join %>%  ggplot(aes(x=presences, y=Oil)) + geom_hex() + geom_smooth() +
  xlab('NLR gene count')
```

```{r}
summary(lm(Oil ~ presences, data = oil_join))
```

OK, there are many, many outliers here. Clearly I'll have to do something fancier: for example, using the first two PCs as covariates might get rid of some of those outliers.
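
A minimal sketch of what that could look like, on simulated data rather than the real tables: compute principal components from a presence/absence matrix with `prcomp()` and add the first two as covariates in `lm()`. The matrix `toy_pav`, the data frame `toy`, the two subpopulations and all effect sizes are made up for illustration; which markers the PCs should be computed from in the real analysis is left open here.

```r
# Toy example only: simulated data, not the soybean tables used in this analysis.
set.seed(42)
n_ind  <- 200   # hypothetical number of individuals
n_gene <- 50    # hypothetical number of NLR genes

# Two hypothetical subpopulations with different gene presence frequencies
pop <- rep(c(0, 1), each = n_ind / 2)
toy_pav <- sapply(seq_len(n_gene), function(g) {
  rbinom(n_ind, size = 1, prob = ifelse(pop == 1, 0.8, 0.5))
})

toy <- data.frame(
  presences = rowSums(toy_pav),
  # Simulated trait: depends on the subpopulation only, not on the gene count
  Oil = 18 + 3 * pop + rnorm(n_ind)
)

# PCA on the individuals x genes matrix; keep the first two PCs
pcs <- prcomp(toy_pav, center = TRUE, scale. = FALSE)$x[, 1:2]
toy$PC1 <- pcs[, 1]
toy$PC2 <- pcs[, 2]

# Same model as above, without and with the PCs as covariates
naive_fit    <- lm(Oil ~ presences, data = toy)
adjusted_fit <- lm(Oil ~ presences + PC1 + PC2, data = toy)

summary(naive_fit)$coefficients
summary(adjusted_fit)$coefficients
```

In this toy setup the trait differs between subpopulations only, so the `presences` effect in the naive model comes from population structure; comparing the two coefficient tables shows how much of it survives once the PCs are in the model.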


# Boxplots per group

## Yield
```{r}
nbs_joined_groups %>% 
  filter(!is.na(`Group in violin table`)) %>% 
  inner_join(yield, by=c('names'='Line')) %>% 
  ggplot(aes(x=`Group in violin table`, y=Yield, fill = `Group in violin table`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = TRUE) +
  guides(fill = 'none') +
  ylab('Yield') +
  xlab('Accession group')
```

And let's check the dots:


```{r}
nbs_joined_groups %>% 
  filter(!is.na(`Group in violin table`)) %>% 
  inner_join(yield_join, by = 'names') %>% 
  ggplot(aes(y=presences.x, x=Yield, color=`Group in violin table`)) +
  geom_point() + 
  scale_color_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +  
  ylab('NLR gene count')

```

```{r}
nbs_joined_groups %>% 
  filter(!is.na(`Group in violin table`)) %>% 
  inner_join(yield_join, by = 'names') %>% 
  filter(`Group in violin table` != 'Landrace') %>% 
  ggplot(aes(x=presences.x, y=Yield, color=`Group in violin table`)) +
  geom_point() + 
  scale_color_manual(values = col_list) + 
  theme_minimal_hgrid() +
  geom_smooth() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +  
  xlab('NLR gene count')

```

## Protein

Protein vs. the four groups:

```{r}
nbs_joined_groups %>% 
  filter(!is.na(`Group in violin table`)) %>% 
  inner_join(protein, by=c('names'='Line')) %>% 
  ggplot(aes(x=`Group in violin table`, y=Protein, fill = `Group in violin table`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Wild-type', 'Landrace'),
                                 c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = TRUE) +
  guides(fill = 'none') +
  ylab('Protein') +
  xlab('Accession group')
```

## Seed weight
And seed weight:

```{r}
nbs_joined_groups %>% 
  filter(!is.na(`Group in violin table`)) %>% 
  inner_join(seed_join, by = c('names', 'presences')) %>% 
  ggplot(aes(x=`Group in violin table`, y=wt, fill = `Group in violin table`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Wild-type', 'Landrace'),
                                 c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = TRUE) +
  guides(fill = 'none') +
  ylab('Seed weight') +
  xlab('Accession group')
```

Wow, that's breeding!

## Oil content

And finally, oil content:

```{r}
nbs_joined_groups %>% 
  filter(!is.na(`Group in violin table`)) %>% 
  inner_join(oil_join, by = 'names') %>% 
  ggplot(aes(x=`Group in violin table`, y=Oil, fill = `Group in violin table`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Wild-type', 'Landrace'),
                                 c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = TRUE) +
  guides(fill = 'none') +
  ylab('Oil content') +
  xlab('Accession group')
```

Oh, a single star. That's p < 0.05!
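
For reference, `geom_signif()` runs a Wilcoxon rank-sum test (`wilcox.test`) by default, and `map_signif_level = TRUE` turns the p-value into stars using the cutoffs 0.001, 0.01 and 0.05. The sketch below reproduces that mapping by hand on simulated values; the vectors `old` and `modern` and the helper `p_to_stars()` are hypothetical and not part of this analysis.

```r
# Toy example only: simulated oil values, not the real phenotype table.
set.seed(1)
old    <- rnorm(40, mean = 19.0, sd = 1)  # hypothetical 'Old cultivar' values
modern <- rnorm(40, mean = 19.6, sd = 1)  # hypothetical 'Modern cultivar' values

# Hypothetical helper mirroring the star cutoffs used with map_signif_level = TRUE
p_to_stars <- function(p) {
  if (p < 0.001) {
    "***"
  } else if (p < 0.01) {
    "**"
  } else if (p < 0.05) {
    "*"
  } else {
    "NS."
  }
}

# Two-sided Wilcoxon rank-sum test, the default test in geom_signif()
p <- wilcox.test(old, modern)$p.value
p_to_stars(p)
```

Running `wilcox.test()` directly like this also gives the exact p-value, which the stars on the plot hide.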

Let's redo the above hexplot as a scatter plot, with the dots colored by group.

```{r}
nbs_joined_groups %>% 
  filter(!is.na(`Group in violin table`)) %>% 
  inner_join(oil_join, by = 'names') %>% 
  ggplot(aes(x=presences.x, y=Oil, color=`Group in violin table`)) +
  geom_point() + 
  scale_color_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +  
  xlab('NLR gene count')
```

Oh, so it's the wild-types that drag this out a lot.

Let's keep only the old and modern cultivars and see what it looks like:

```{r}
nbs_joined_groups %>% 
  filter(!is.na(`Group in violin table`)) %>% 
  inner_join(oil_join, by = 'names') %>% 
  filter(`Group in violin table` %in% c('Old cultivar', 'Modern cultivar')) %>% 
  ggplot(aes(x=presences.x, y=Oil, color=`Group in violin table`)) +
  geom_point() + 
  scale_color_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +  
  xlab('NLR gene count') +
  geom_smooth()
```

Let's remove that one outlier (oil content of 13 or lower):

```{r}
nbs_joined_groups %>% 
  filter(!is.na(`Group in violin table`)) %>% 
  inner_join(oil_join, by = 'names') %>% 
  filter(`Group in violin table` %in% c('Old cultivar', 'Modern cultivar')) %>% 
  filter(Oil > 13) %>% 
  ggplot(aes(x=presences.x, y=Oil, color=`Group in violin table`)) +
  geom_point() + 
  scale_color_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +  
  xlab('NLR gene count') +
  geom_smooth()
```

Does the oil content boxplot above change if we exclude that one outlier? I'd bet so.


```{r}
nbs_joined_groups %>% 
  filter(!is.na(`Group in violin table`)) %>% 
  inner_join(oil_join, by = 'names') %>% 
  filter(names != 'USB-393') %>% 
  ggplot(aes(x=`Group in violin table`, y=Oil, fill = `Group in violin table`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Wild-type', 'Landrace'),
                                 c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = TRUE) +
  guides(fill = 'none') +
  ylab('Oil content') +
  xlab('Accession group')
```

Nope, still significantly higher in modern cultivars!