gimmeSAD - Joint modelling of abundance and genetic diversity.
gimmeSAD is an integrated model of population genetics and community ecology. It allows one to model abundance in an ecological community under a neutral process while simultaneously modelling neutral population genetic variation. Briefly, the typical workflow would look like this:
- Identify model parameter ranges for investigation
- Simulate many, many joint abundance/genetic diversity distributions for communities given random draws from the parameter ranges.
- Import observed abundance distributions and fasta data for all species in the target community and convert the data for downstream inference.
- Perform an ABC parameter estimation procedure to select the summary simulations that most closely resemble the observed data, and generate posterior distributions on the model parameters.
After installing conda for python2.7, you can create a new conda env and install gimmeSAD inside it:
conda create -n py27-gimmeSAD python=2.7 source activate py27-gimmeSAD conda install -c iovercast gimmesad
The gimmeSAD conda package is tested and known to install and run on a clean conda install, in a clean conda env on Ubuntu Linux 16.04. It should work on any modern linux distro, and it should also work on Mac os, but you may experience some turbulance with installing some dependencies in conda.
In the basic usage of
gimmeSAD the most fundamental arguments are the size of the local community (
-k, which is measured in number of individuals), and the colonization rate (
-c, which is the probability per timestep of a colonization event).
gimmeSAD -k 1000 -c 0.01
This command will run a model with community size 1000, and colonization rate 0.01 and store the results in a directory called
output. This simulation will run for 50000 time steps (very short). Duration of simulations can be controlled with the
-n flag to specify the number of steps (0 will simply run the simulation to equilibrium):
gimmeSAD -k 1000 -c 0.01 -n 0 -f
Additionally here we've introduced the
-f flag to force overwriting the output directory. Alternatively one might explicitly specify an output directory for each simulation with the
gimmeSAD -k 1000 -c 0.01 -n 0 -o new_sim1
At any time the
-h flag can be passed to the
gimmeSAD command to display copious usage information, as well as a few example command line uses.
usage: gimmeSAD [-h] [-a] [-c colrate] [-C colonizers] [-k K] [-m meta] [-n nsims] [-o outdir] [--empirical_dir empiricaldir] [--model model] [-p mode] [-r recording_period] [-i invasion_time] [-I invasiveness] [-e] [-b bottleneck] [--do_plots] [--curves] [--plot_models] [-q] [-v] [-f] [--version] optional arguments: -h, --help show this help message and exit -a Print species abundances in octaves -c colrate Set colonization rate -C colonizers Switch mode to clustered colonization and set the # of colonizers per event -k K Carrying capacity of the island (max # individuals in the local community -m meta Source metacommunity from file or generate uniform -n nsims Number of demographic events to simulate -o outdir Output directory for log/data files and pngs --empirical_dir empiricaldir Directory with the empirical data --model model Data model for writing output files. -p mode Select mode for prepopulating the island (volcanic/landbridge) -r recording_period Length of timeslice between samples for logging -i invasion_time Timestep of invasion of invasive species -I invasiveness Invasiveness of the invasive species -e Do exponential growth, otherwise constant -b bottleneck Strength of the bottleneck --do_plots Generate a bunch of plots and animations. Default is not to do plots. --curves Plot rank abundance as curves rather than points --plot_models Add expectations under different statistical models to the rank abundance plot -q, --quiet do not print to stderror or stdout. -v, --verbose print lots of debug information -f, --force force overwrite of existing data --version show program's version number and exit * Example command-line usage: gimmeSAD -n 100000 ## Run the simulation for 100000 steps gimmeSAD -o outdir ## Directory where tons of junk will get written gimmeSAD -r recording_period ## Num steps to sample and write out after gimmeSAD -m metacommunity_LS4.txt ## Source metacommunity from a file of ints ## the other option is to pass "uniform" ## which will generate a random metacommunity ## with all species at uniform frequency gimmeSAD -p volcanic ## Set the mode for prepopulating (or not) ## the local community. gimmeSAD -i 4 ## Set # of colonizing inds per colonization event ## if unset basic immigration is assumed (1 individual) gimmeSAD -c 0.001 ## Set colonization rate gimmeSAD -a ## Output abundances in octaves gimmeSAD -q ## Don't be so chatty gimmeSAD -v ## Be very chatty
Running simulations in parallel
Several scripts are provided in the
scripts directory to facilitate running multiple simulations in parallel, and to explore ranges of parameter space.
All simulations and analyses performed in the manuscript can be reproduced using the jupyter notebooks in the
Importing observed data
gimmeSAD allows to import any combination of community abundance (counts per species) and genetic variation in the form of fasta files per species. A script is provided for this conversion:
scripts/make_obs.py. Empirical data for the La Reunion spider community analysed in the manuscript is included as example data. The
make_obs.py script can take up to 3 arguments:
./make_obs.py -h usage: make_obs.py [-h] [-a abund_file] [-f fasta_files] [-o outfile] optional arguments: -h, --help show this help message and exit -a abund_file File with observed abundances. -f fasta_files Directory containing observed fasta files, one per species. -o outfile Directory containing observed fasta files, one per species.
Abundance data format
Abundance data should be formatted as a single file with abundance per species in a single line as a comma separated list. For example (from a terminal in the
$ cat ../empirical_data/abunds.txt 5493, 1042, 471, 387, 374, 343, 263, 239, 228, 210, ..., 6, 5, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1 $ ./make_obs.py -a ../empirical_data/abunds.txt -o abund_only.txt` $ cat abund_only.txt shannon 2.24620615577
Genetic data format
Genetic data should be formatted as a directory with one fasta file per species. NB: These fasta files must be aligned. If they aren't aligned per species then you'll get really wacky, inflated values of pi. Not good.
$ ls ../empirical_data/spider-fasta/ GL-01a.fasta GL-04.fasta GL-09.fasta <output_clipped> GL-39.fasta GL-44.fasta GL-48.fasta GL-53.fasta $ ./make_obs.py -f ../empirical_data/spider-fasta/ -o pis_only.txt $ cat pis_only.txt bin_0 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6 bin_7 bin_8 bin_9 8 26 8 5 1 2 1 3 1
The output format is a file that contains Shannon entropy (if abundances are provided) and the 1D-SGD constructed from pi values calculated from observed sequence data (if those are provided). The observed data for the spider community generates a file that looks like this:
$ ./make_obs.py -f ../empirical_data/spider-fasta/ -a ../empirical_data/abunds.txt -o full_obs.txt $ cat full_obs.txt shannon bin_0 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6 bin_7 bin_8 bin_9 2.24620615577 8 26 8 5 1 2 1 3 1 2
This output file can now be used in an ABC framework, examples of which can be found in the
Both forward time and backward time processes are stochastic, so we can only give a sketch of average runtimes. Simulation runtime will be heavily dependent on the
c parameters, with larger
c) increasing runtime. In typical use, one would perform thousands to millions of simulations under a range of parameter values, so parallelizing across multiple cores of a workstation is highly recommended.
Average runtime (across 1000 replicates) for different values of
|K||c = 0.005||c = 0.01|
|1000||~10 seconds||~3 seconds|
|5000||~1 minute||~30 seconds|
|10000||~5 minutes||~3 minutes|
The manuscript describing the simulation method and an approximate Bayesian computation framework for parameter estimation may be cited as:
Overcast I, Emerson BC, Hickerson MJ. An integrated model of population genetics and community ecology. J Biogeogr. 2019;00:1–14. https://doi.org/10.1111/jbi.13541