This repository contains files and scripts from dadi v1.6.3 (Gutenkunst et al. 2009) that were modified to improve optimization, as well as custom models that account for semi-permeability (i.e by including variable effective migration rates across the genome), as introduced by Tine et al 2014. The repository also contains data from Rougemont et al. 2016 who applied these models to European lamprey ecotypes using both Folded and Unfolded JSFS. The approach implemented in dadi relies on the comparison of alternative demographic models of divergence to genome-wide polymorphism data.
This pipeline requires scipy
v0.13.3 or older versions.
The simulated annealing implemented in model_demo.py
is deprecated in newer version of scipy.
python2.7
scipy v0.13
or older
numpy
matplotlib
pylab
To extract population SFS using utility_scripts
:
vcftools
bgzip
tabix
Gutenkunst RN, Hernandez RD, Williams SH, Bustamante CD (2009) Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet 5: e1000695.
Tine M, Kuhl H, Gagnaire P-A et al. (2014) European sea bass genome and its variation provide insights into adaptation to euryhalinity and speciation. Nature Communications, 5:5770,doi:10.1038/ncomms6770
Rougemont, Q., Gagnaire, P.-A., Perrier, C., Genthon, C., Besnard, A.-L., Launey, S. and Evanno, G. (2016), Inferring the demographic history underlying parallel genomic divergence among pairs of parasitic and non-parasitic lamprey ecotypes. Mol Ecol. Accepted Author Manuscript. doi:10.1111/mec.13664
See also: https://popgensealab.wordpress.com/dadi-inference/
The modified version of dadi contain 4 improvements:
-
The possibility to explictly test alternative models of speciation such as models of divergence with continuous gene flow (IM models), models of secondary contact (SC models), models of strict alloptatric divergence (SI) or models of divergence with initial gene flow followed by isolation without gene flow (AM model). The models are implemented in the model_demo.py
-
The possibility to integrate semipermeability in the demographic models of divergence with gene flow (IM2m, AM2m) and secondary contacts (SC2m).
It is also possible to include various effect of linked selection (i.e. background selection, selective sweeps and hitchhiking effects) locally reducing Ne (IM2N, SI2N, AM2N, SC2N). These models can be combined to integrate both the effect of linked selection and of barriers to gene flow (IM2N2m, AM2N2mn SC2N2m).
Modelling variable migration rates across the genome is a simple way to capture the effect of genetic barriers to gene flow (due to genetic incompatibilities or local adaptation loci), that locally reduce gene flow and generate heterogeneous patterns of genomic divergence.
Here, two categories of loci are considered to take into account these variable rates of gene flow across the genome, those that are "neutrally" exchanged (occurring in proportion P) and those with reduced effective migration rates (occurring in proportion 1-P).
The same idea is used to define loci undergoing higher rates of drift.
Details of the models are available here: 02-modifs_v2/unfolded/modeledemo_new_models.py
It is also possible to model population expansion and contraction but I ussually do not use these models to avoid overfitting.
- An improvement of the method that search for the global optimum of the likelihood function.
Optimization is performed using a Simulated Annealing (SA) optimization procedure (implemented in the inference.py module) before BFGS optimization.
In practice, one hot and one cold SA optimization are performed before one additional round of BFGS optimization. See the script script_inference_anneal2_newton.py for further details.
Of course, this does not preclude to perform multiple runs to check for convergence in likelihood estimations among independent runs.
- The error rate for ancestral allelic state identification in unfolded JSFS is taken into account by a misorientation parameter (O).
These modifications were originally developed by Tine et al. (2014). Yoann Anciaux, Christelle Fraïsse, Pierre-Alexandre Gagnaire, and Nicolas Bierne (see Tine et al. 2014).
In dadi, when an outgroup is available, it is possible to determine the ancestral state of the mutations and then to count the number of derived mutations segregating in each population (See Gutenkunst et al. 2009 for more information). In the implementation by Tine et al., it is possible to infer the proportion of misoriented mutations, which is included as a model parameter (O). In Rougemont et al. (2016) the Sea lamprey outgroup (Petromyzon marinus) was a too distant to determine the ancestral state of all mutations, so the majority of mutations remained unoriented motivating the use of the folded JSFS.
The folded (J)SFS includes all SNPs without distinguishing ancestral and derived alleles. The folded SFS is therefore less informative with regards to the directionality of migration and drift effects. However, it is still informative enough to make inferences using simple demographic divergence models as the ones presented before. In particular, the folded JSFS can retain the signal of a secondary contact if the duration of divergence was long enough compared to the duration of secondary gene flow. Moreover, heterogeneous gene flow can be detected from an unfolded JSFS, using models incorporating variation in migration rate across the genome.
To circumvent the loss of information and validate the accuracy of parameters estimates from the folded JSFS, our strategy was to re-integrate fixed values for parameters estimates such as Ne1, Ne2, T, Tsc in the unfolded version of the JSFS and then to re-estimate migration rate parameters (M12,M21,Me12,Me21) as well as the proportion of loci displaying a reduced migration rate (1-P).
Models are defined here:
02-modifs_v2/unfolded/modeledemo_new_models.py
and here for the folded version:
02-modifs_v2/folded/modeledemo_new_models_folded.py
The unfolded prior are in :
02-modifs_v2/unfolded/script_inference_demo_new_models.py
and the folded priors are in:
02-modifs_v2/folded/script_inference_demo_new_models_folded.py
Then the scripts:
01-scripts/00.run_dadi_parallel_v2.sh
and 01-run_model_iteration_v2.sh
provides example for running all models in parallel. These should run easily everywhere if you have parallel module installed.
I used them on the ComputeCanada super computer(s) with examples of launcher provided in graham_cedar/
folder and colosse
folder