RDA

RDA analysis

RDA analysis adapted from the nice workflow of Brenna Forester available in popgennescent

Developed by Laura Benestan in Stephanie Manel's laboratory for the Reservebenefit project.

Concept and definition

We used of Redundancy Analysis (RDA) as a genotype-environment association (GEA) method to simultaneously assess the percent of genomic variation explained by environmental variables and to detect loci under selection (see the relevant paper of Forester et al. 2018. RDA is a two-step analysis in which genetic and environmental data are analyzed using multivariate linear regression. Then PCA of the fitted values is used to produce canonical axes, which are linear combinations of the predictors (Legendre & Legendre, 2012). Here, we performed the RDA on a individual-based sampling design rather than population-based, see Benestan et al. 2016.

Application to our dataset

Here, we apply RDA to genomic data from 276 individuals of the white seabream (Diplodus sargus) sampled across the Mediterranean sea.

Then, we analyse Results of the RDA at the full set of 18,512 single nucleotide polymorphism (SNP) markers. The goal of this study is to understand which environmental factor may drive the genomic divergence observed in this species. In this case, the data is individual-based, and inputs are allele counts (i.e. 0/1/2) for each locus and individual fish.

1. Preparing inputs files and loading libraries

Download R packages First, we install the necessary packages and then download the corresponding libraries.

library(psych)    
library(vegan)
library(adegenet)
library(dplyr)
library(fmsb)
library(gsl)

Obtain raw format from vcf file using VCFTOOLS and PlINK

Transform the vcf format into a tped plink format

vcftools --vcf yournameoffile.vcf --plink-tped --out yournameoffile

Then transform the plink format to a raw format by re-running the following command on your terminal.

plink --tped yournameoffile.tped --tfam yournameoffile.tfam --recodeA --out yournameoffile

2. Start to download the genomic and environment data in R

plink_diplodus <- read.table("yournameoffile.raw", header=TRUE, sep=" ", row.names=1)
dim(plink_diplodus)

Prepare genomic data.

We remove the names of the samples as we don't consider these values when we will fill the missing values. Yet other way to handle the missing values and before doing this step visualize patterns of missingness and perform map-independent imputations of missing genotypes using grur package

gen <- plink_diplodus[,6:37029]

We calculate the percent of missing data. Multivariate analysis are very sensitive to missing data and these missing needs to be filled.

sum(is.na(gen))# 553,232 NAs in the matrix (~3% missing data)

Missing values are then filled using the overall average of the allele frequency variation.

gen.imp <- apply(gen, 2, function(x) replace(x, is.na(x), as.numeric(names(which.max(table(x))))))
sum(is.na(gen.imp)) # No NAs

We download environmental data in R.

env <- read.table("276ind-24env-variables.txt",sep="\t",header=T) # for diplodus
str(env)

Prepare the environmental data.

We change the class of the labels.

env$labels <- as.character(env$labels)

We remove one of the highly correlated predictors by visually and quantitatively checking the correlation among predictors using the function called pairs.panels.

pairs.panels(env[,2:25], scale=T)

We create a correlation matrix.

matrix_cor <- cor(env[,2:25])

We select the non-correlated variables.

env_selected <- select(env, salinity_surface_repro_min,chlo_surface_max,chlo_benthic_min)

3. Run the Redundancy Analysis (RDA)

This step can take a while depending on your dataset.

diplodus.rda <- rda(gen.imp ~ ., data=pred, scale=T)

The R^2 inform about the percent of genomic variation that can be explained by one of the predictor. Here for instance, we found a adjusted R^2 of 0.0002, which means that our constrained ordination explains about 0.02% of the variation.

This low explanatory power is not surprising given that we expect that most of the SNPs in our dataset will not show a relationship with the environmental predictors (e.g., most SNPs will be neutral).

We calculate the adjusted R^2.

RsquareAdj(diplodus.rda)

We observed the eigenvalues for the constrained axes reflect the variance explained by each canonical axis.

summary(eigenvals(diplodus.rda, model = "constrained")
screeplot(diplodus.rda)

We load the species scores for the first three constrained axes.

load.rda <- scores(diplodus.rda, choices=c(1:3), display="species")

Then we tested the significance of the RDA.

signif.full <- anova.cca(diplodus.rda, parallel=getOption("mc.cores")) # default is permutation=999
signif.full

4. Visualize the RDA outcomes in a nice graph

We plot the RDA results.

plot(diplodus.rda, scaling=3) 
plot(diplodus.rda, choices = c(1, 3), scaling=3)

We add the grouping information, here the Marine Protected Areas info.

mpa_group <- read.table("population-map-276ind-diplodus-mpa.txt",header=TRUE,sep="\t")
mpa_env <- cbind(env, mpa_group)
eco <- mpa_group$STRATA

We select a palette of colors to use for each MPA.

### 6 nice colors for the MPA
bg <- c("#ff7f00","#1f78b4","#ffff33","#a6cee3","#33a02c","#e31a1c")

We save the nice RDA plot.

pdf("RDA_outside_inside.pdf",width=10,height=10)
plot(diplodus.rda, type="n", scaling=3)
points(diplodus.rda, display="species", pch=20, cex=0.7, col="gray32", scaling=3)           # the SNPs
points(diplodus.rda, display="sites", cex=1.3, scaling=3, col = bg[as.numeric(eco)],pch =21) 
text(diplodus.rda, scaling=3, display="bp", col="#0868ac", cex=1)                           # the predictors
legend("bottomright", legend=levels(eco), bty="n", col="gray32", pch=21, cex=1, pt.bg=bg)
dev.off()

We extract individuals information and save it for publication needs.

rda_indv <- as.data.frame(scores(diplodus.rda, display=c("sites")))
write.table(rda_indv, "rda_diplodus_276ind.txt", quote=FALSE, row.names=TRUE)

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
276ind-24env-variables.txt		276ind-24env-variables.txt
Correlation_predictor.png		Correlation_predictor.png
RDA_diplodus.png		RDA_diplodus.png
README.md		README.md
population-map-276ind-diplodus-mpa.txt		population-map-276ind-diplodus-mpa.txt
rda_diplodus_276ind.txt		rda_diplodus_276ind.txt
script-rda-multiloci.R		script-rda-multiloci.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

276ind-24env-variables.txt

276ind-24env-variables.txt

Correlation_predictor.png

Correlation_predictor.png

RDA_diplodus.png

RDA_diplodus.png

README.md

README.md

population-map-276ind-diplodus-mpa.txt

population-map-276ind-diplodus-mpa.txt

rda_diplodus_276ind.txt

rda_diplodus_276ind.txt

script-rda-multiloci.R

script-rda-multiloci.R

Repository files navigation

RDA

RDA analysis

Concept and definition

Application to our dataset

1. Preparing inputs files and loading libraries

2. Start to download the genomic and environment data in R

3. Run the Redundancy Analysis (RDA)

4. Visualize the RDA outcomes in a nice graph

About

Releases

Packages

Languages

laurabenestan/RDA_outlier

Folders and files

Latest commit

History

Repository files navigation

RDA

RDA analysis

Concept and definition

Application to our dataset

1. Preparing inputs files and loading libraries

2. Start to download the genomic and environment data in R

3. Run the Redundancy Analysis (RDA)

4. Visualize the RDA outcomes in a nice graph

About

Resources

Stars

Watchers

Forks

Languages