Codes for the paper : "Seascape genomics provides evidence for thermal adaptation and current‐mediated population structure in American lobster (Homarus americanus)"

Laura Benestan

Québec, 2016

Published in Molecular Ecology, 2016, see Benestan et al

Description

Bayescan is a command-line program that aim to identifying putative candidate loci under natural selection from genetic data, using differences in allele frequencies between populations.

BayeScan is based on the Multinomial-Dirichlet model.

This program can define three categories of putative candidate loci:

under diversifying selection
under balancing selection
under neutrality

For each locus, BayeScan calculates a posterior probability (Posterior odds) - available through the parameter pr_odds - for the model including selection. These posterior probabilities indicate how more likely the model with selection is compared to the neutral model. For instance a pr_odds of 10 means that there is 1/10 probability for a marker to be under selection. This number would be too high when considering a dataset with up to 10,000 markers.

In the context multiple testing such as large number of markers (up to 10,000), run BAYESCAN with appropriate parameter as recommended in Whitlock and Lotterhos, 2015. To do so, you should consider the number of loci in your dataset. You can also consult the bayescan exercice to learn more about how to interpret Bayescan files and outputs.

1. Prepare dataset in the .geste format

Most of the Next Generation Sequencing (NGS) projects generate a VCF (.vcf) or a PLINK file (.tped and .tfam) after aligning the sequences in STACKS.

The first step is to prepare files in an appropriate .geste format for BAYESCAN. Convert the .vcf file to a .geste dataset with the function genomic_converter available in the elegant radiatorpackage in R (see Thierry Gosselin github page for more details).

lobster <- genomic_converter(
  data = "13688snps-562ind.recode.vcf", strata = "population_map_lobster.txt",
  output = c("bayescan"), filename="13688snps-562ind.geste")

2. Run Bayescan in a terminal

Preparing .geste file

BayeScan uses its own input file formats that have the extension .geste. To transform the vcf file (or other formats such as genepop and structure) in a .geste file, you can easily use PGDSpider2

To run Bayescan in bash, use the following command:

./BayeScan2.1_macos64bits -n 5000 -burn 50000 -pr_odds 10000

By running Bayescan, you should check three output files: *Verif.txt: Use this file to verify that BayeScan read the input files properly *.sel: Record the parameters estimated by the model *fst.txt: Use to identify outliers

3. Use R to identify outliers from Bayescan analyses

Using R to visualize the outputs First, download libraries

library(ggplot2)

Open the bayescan output file with the "_fst.txt" extension.

bayescan=read.table("bayescan-13688snps-562ind.g_fst.txt")

The first column of the bayescan object is a SNP ID. The next three (prob, log10(P0), and qval) are related to the test of local adaptation considering the logarithm of the posterior odds - log10(PO) - and the q-value for the model with selection. The fifth column gives the size of the locus-specific effect (alpha parameter). The last one provides the locus-specific FST averaged over all populations.

Download the list of SNPs in the right order. The .este format has SNPs in the same order than the vcf used to produce the .geste format. Tehrefore, you can use this command in bash to extract the third column containing the ID info of each SNPs in your vcf:

 grep -v "#" 13688snps-562ind.recode.vcf | cut -f 3 > 13688snps-562ind_id_vcf.txt

Then

SNPb=read.table("13688snps-562ind_id_vcf.txt",header=FALSE)

Merge the name of the outliers with the results from bayescan.

bayescan=cbind(SNPb, bayescan) 
colnames(bayescan)=c("SNP","PROB","LOG_PO","Q_VALUE","ALPHA","FST") 
write.table(bayescan, "13688snps-bayescan-results.txt", quote=FALSE, sep="\t", row.names=FALSE)

Change the value of the Q_VALUE column: 0 == 0.0001.

attach(bayescan)
class(bayescan$Q_VALUE)  
bayescan$Q_VALUE <- as.numeric(bayescan$Q_VALUE) 
bayescan[bayescan$Q_VALUE<=0.0001,"Q_VALUE"]=0.0001

Round the values.

bayescan$LOG_PO <- (round(bayescan$LOG_PO, 4)) 
bayescan$Q_VALUE <- (round(bayescan$Q_VALUE, 4)) 
bayescan$ALPHA <- (round(bayescan$ALPHA, 4)) 
bayescan$FST <- (round(bayescan$FST, 6))

Add a column for the type of selection grouping based on a Q-VALUE < 0.05 (you can also choose a Q-VALUE < 0.01).

bayescan$SELECTION <- ifelse(bayescan$ALPHA>=0&bayescan$Q_VALUE<=0.05,"diversifying",ifelse(bayescan$ALPHA>=0&bayescan$Q_VALUE>0.05,"neutral","balancing")) 
bayescan$SELECTION<- factor(bayescan$SELECTION)
levels(bayescan$SELECTION)

Save the results of the SNPs potentially under positive (divergent) and balancing selection (qvalue < 0.05).

positive <- bayescan[bayescan$SELECTION=="diversifying",] 
neutral <- bayescan[bayescan$SELECTION=="neutral",] 
balancing <- bayescan[bayescan$SELECTION=="balancing",]

Check the number of SNPs belonging to each category.

xtabs(data=bayescan, ~SELECTION)

Write the results of the SNPs potentially under selection (qvalue < 0.05).

write.table(neutral, "neutral.txt", row.names=F, quote=F)  
write.table(balancing, "balancing.txt", row.names=F, quote=F) 
write.table(positive, "positive.txt", row.names=F, quote=F)

4. Use R to visualize Bayescan results

Transformation Log of the Q value in order to create te ggplot graph.

range(bayescan$Q_VALUE) 
bayescan$LOG10_Q <- -log10(bayescan$Q_VALUE)

Create title for the ggplot graph.

x_title="Log(q-value)" 
y_title="Fst"

Make the ggplot graph.

graph_1<-ggplot(bayescan,aes(x=LOG10_Q,y=FST, label=bayescan$POS)) 
graph_1+geom_point(aes(fill=bayescan$SELECTION), pch=21, size=2)+ 
  #geom_text()+ 
  scale_fill_manual(name="Selection",values=c("white","red","orange"))+ 
  labs(x=x_title)+ 
  labs(y=y_title)+ 
  theme(axis.title=element_text(size=12, family="Helvetica",face="bold"), legend.position="none")+ 
  theme(axis.text.x=element_text(colour="black"))+ 
  theme(axis.text.y=element_text(colour="black",size=12))+ 
  theme(axis.text.x=element_text(colour="black",size=12))+ 
  theme(panel.border = element_rect(colour="black", fill=NA, size=3),  
        axis.title=element_text(size=18,colour="black",family="Helvetica",face="bold")) +
        theme_classic()

Save the file in a pdf format

ggsave("bayescan_13688_562ind.pdf", dpi=600, width=5, height=5) 
dev.off()

You can also simply use the function already available in Bayescan. First load the function in R.

source("plot_R.r")

Make a nice graph using this plot_bayescan function.

plot_bayescan("bayescan-13688snps-562ind.g_fst.txt")

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
13688snps-562ind_id_vcf.txt		13688snps-562ind_id_vcf.txt
Bayescan_Benestan_2016.png		Bayescan_Benestan_2016.png
README.md		README.md
balancing.txt		balancing.txt
bayescan-13688snps-562ind.g_fst.txt		bayescan-13688snps-562ind.g_fst.txt
bayescan.Rmd		bayescan.Rmd
bayescan.html		bayescan.html
bayescan.pdf		bayescan.pdf
bayescan_13688_562ind.pdf		bayescan_13688_562ind.pdf
filtered_3699snps_californicus.vcf		filtered_3699snps_californicus.vcf
list-13688snps.txt		list-13688snps.txt
neutral.txt		neutral.txt
plot_R.r		plot_R.r
population_map_sea_cucumber.txt		population_map_sea_cucumber.txt
positive.txt		positive.txt

laurabenestan/Bayescan

Folders and files

Latest commit

History

Repository files navigation

Codes for the paper : "Seascape genomics provides evidence for thermal adaptation and current‐mediated population structure in American lobster (Homarus americanus)"

Laura Benestan

Description

1. Prepare dataset in the .geste format

2. Run Bayescan in a terminal

3. Use R to identify outliers from Bayescan analyses

4. Use R to visualize Bayescan results

About

Resources

Stars

Watchers

Forks

Languages