Skip to content

Redundancy Analysis allowing to both estimate the percent of genomic variation explained by each variable as well as to detect the putative genomic signal under divergent selection

Notifications You must be signed in to change notification settings

laurabenestan/RDA_outlier

Repository files navigation

RDA

RDA analysis

RDA analysis adapted from the nice workflow of Brenna Forester available in popgennescent

Developed by Laura Benestan in Stephanie Manel's laboratory for the Reservebenefit project.

Concept and definition

We used of Redundancy Analysis (RDA) as a genotype-environment association (GEA) method to simultaneously assess the percent of genomic variation explained by environmental variables and to detect loci under selection (see the relevant paper of Forester et al. 2018. RDA is a two-step analysis in which genetic and environmental data are analyzed using multivariate linear regression. Then PCA of the fitted values is used to produce canonical axes, which are linear combinations of the predictors (Legendre & Legendre, 2012). Here, we performed the RDA on a individual-based sampling design rather than population-based, see Benestan et al. 2016.

Application to our dataset

Here, we apply RDA to genomic data from 276 individuals of the white seabream (Diplodus sargus) sampled across the Mediterranean sea.

Then, we analyse Results of the RDA at the full set of 18,512 single nucleotide polymorphism (SNP) markers. The goal of this study is to understand which environmental factor may drive the genomic divergence observed in this species. In this case, the data is individual-based, and inputs are allele counts (i.e. 0/1/2) for each locus and individual fish.

1. Preparing inputs files and loading libraries

Download R packages First, we install the necessary packages and then download the corresponding libraries.

library(psych)    
library(vegan)
library(adegenet)
library(dplyr)
library(fmsb)
library(gsl)

Obtain raw format from vcf file using VCFTOOLS and PlINK

Transform the vcf format into a tped plink format

vcftools --vcf yournameoffile.vcf --plink-tped --out yournameoffile

Then transform the plink format to a raw format by re-running the following command on your terminal.

plink --tped yournameoffile.tped --tfam yournameoffile.tfam --recodeA --out yournameoffile

2. Start to download the genomic and environment data in R

plink_diplodus <- read.table("yournameoffile.raw", header=TRUE, sep=" ", row.names=1)
dim(plink_diplodus)

Prepare genomic data.

We remove the names of the samples as we don't consider these values when we will fill the missing values. Yet other way to handle the missing values and before doing this step visualize patterns of missingness and perform map-independent imputations of missing genotypes using grur package

gen <- plink_diplodus[,6:37029]

We calculate the percent of missing data. Multivariate analysis are very sensitive to missing data and these missing needs to be filled.

sum(is.na(gen))# 553,232 NAs in the matrix (~3% missing data)

Missing values are then filled using the overall average of the allele frequency variation.

gen.imp <- apply(gen, 2, function(x) replace(x, is.na(x), as.numeric(names(which.max(table(x))))))
sum(is.na(gen.imp)) # No NAs

We download environmental data in R.

env <- read.table("276ind-24env-variables.txt",sep="\t",header=T) # for diplodus
str(env)

Prepare the environmental data.

We change the class of the labels.

env$labels <- as.character(env$labels)

We remove one of the highly correlated predictors by visually and quantitatively checking the correlation among predictors using the function called pairs.panels.

pairs.panels(env[,2:25], scale=T)

Correlations among our predictors.

We create a correlation matrix.

matrix_cor <- cor(env[,2:25])

We select the non-correlated variables.

env_selected <- select(env, salinity_surface_repro_min,chlo_surface_max,chlo_benthic_min) 

3. Run the Redundancy Analysis (RDA)

This step can take a while depending on your dataset.

diplodus.rda <- rda(gen.imp ~ ., data=pred, scale=T)

The R^2 inform about the percent of genomic variation that can be explained by one of the predictor. Here for instance, we found a adjusted R^2 of 0.0002, which means that our constrained ordination explains about 0.02% of the variation.

This low explanatory power is not surprising given that we expect that most of the SNPs in our dataset will not show a relationship with the environmental predictors (e.g., most SNPs will be neutral).

We calculate the adjusted R^2.

RsquareAdj(diplodus.rda)

We observed the eigenvalues for the constrained axes reflect the variance explained by each canonical axis.

summary(eigenvals(diplodus.rda, model = "constrained")
screeplot(diplodus.rda)

We load the species scores for the first three constrained axes.

load.rda <- scores(diplodus.rda, choices=c(1:3), display="species")  

Then we tested the significance of the RDA.

signif.full <- anova.cca(diplodus.rda, parallel=getOption("mc.cores")) # default is permutation=999
signif.full

4. Visualize the RDA outcomes in a nice graph

We plot the RDA results.

plot(diplodus.rda, scaling=3) 
plot(diplodus.rda, choices = c(1, 3), scaling=3)

We add the grouping information, here the Marine Protected Areas info.

mpa_group <- read.table("population-map-276ind-diplodus-mpa.txt",header=TRUE,sep="\t")
mpa_env <- cbind(env, mpa_group)
eco <- mpa_group$STRATA

We select a palette of colors to use for each MPA.

### 6 nice colors for the MPA
bg <- c("#ff7f00","#1f78b4","#ffff33","#a6cee3","#33a02c","#e31a1c")

We save the nice RDA plot.

pdf("RDA_outside_inside.pdf",width=10,height=10)
plot(diplodus.rda, type="n", scaling=3)
points(diplodus.rda, display="species", pch=20, cex=0.7, col="gray32", scaling=3)           # the SNPs
points(diplodus.rda, display="sites", cex=1.3, scaling=3, col = bg[as.numeric(eco)],pch =21) 
text(diplodus.rda, scaling=3, display="bp", col="#0868ac", cex=1)                           # the predictors
legend("bottomright", legend=levels(eco), bty="n", col="gray32", pch=21, cex=1, pt.bg=bg)
dev.off()

RDA visualisation

We extract individuals information and save it for publication needs.

rda_indv <- as.data.frame(scores(diplodus.rda, display=c("sites")))
write.table(rda_indv, "rda_diplodus_276ind.txt", quote=FALSE, row.names=TRUE)

About

Redundancy Analysis allowing to both estimate the percent of genomic variation explained by each variable as well as to detect the putative genomic signal under divergent selection

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages