Skip to content
Example of tracking the ARG during forward simulations. Implementation is a Python package using fwdpp, fwdpy11, msprime, and pybind11.
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
benchmarks/cpp
fwdpy11_arg_example
practice
profiles
tests
COPYING
README.rst
SEEDS
TODO.md
benchmarking.py
bulk_evolve.py
compare.py
setup.py
test_evolve.py

README.rst

fwdpy11_arg_example

This project shows what is possible in terms of tracking the ancestral recombination graph (ARG) during a foward-time simulation implemented using fwdpp and msprime. The interface to the implementation is in terms of a Python package implemented using fwdpy11 and pybind11.

License

GPLv3 or later (see COPYING)

Overview

We define a C++ class called "ancestry_tracker", which stores nodes and edges as they appear forwards in time. These data structures, and their updating, are non-intrusive, meaning that they don't care about any of the fwdpp internals. Rather, we simply have to define a new "iterate a generation" function that uses both fwdpp machinery and updates an ancestry_tracker as appropriate.

Using pybind11, we make ancestry_trackers visible to Python as an AncestryTracker class. The Python class has access to the nodes and edges as NumPy structured arrays, which can be viewed "for free", meaning that no copy from C++ to Python is required to look at them.

We define an ArgSimplifier class to bridge the AncestryTracker and msprime. The __call__ operator of ArgSimplifier will accept an AncestryTracker as an argument and use the msprime API to simplify the input data into a set of trees.

Caution

This is a proof of principle implementation only. It has been tested that it gives the correct sample properties in distribution and that internal data structures are sane during simulation. The API is not guaranteed to be neither ideal nor idiomatic. Further, various safety checks are missing on the C++ side. We simply do not check for integer overflow or other possible things that may happen if garbage collection occurs too infrequently. These limitations, and others, will be dealt with (and tested for) when we port these ideas into fwdpy11.

Usage instructions

This section will get you to a point where you can run the C++ simulations described in the paper. These instructions have been confirmed to work in a clean conda environment using Python3. We strongly recommend that this package is compiled within a clean conda env.

First, install a Python3 environment for your user. It doesn't matter if you use a "full" installer or a "miniconda" installer. At this point, you may need to follow the conda installation instructions.

Note

Linux is the preferred environment here. If you insist on using Apple's OS X, then you must add a --gcc flag to the last command in the following code block. I (KRT) make no guarantees about success on OS X.

Instructions for conda on Linux:

conda create --name ftprime_ms python
source activate ftprime_ms
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install gcc fwdpy11==0.1.4 msprime==0.5.0 pybind11==2.2.1 pandas
git clone https://github.com/molpopgen/fwdpy11_arg_example
cd fwdpy11_arg_example
python setup.py build_ext -i

Test simulation

For example, let's run a simulation with the following parameters:

  • N=5e4 diploids
  • a region size of theta = rho = 1e4
  • deleterious mutation rate equal to one percent of the neutral mutation rate
  • simplify ("GC", or garbage-collect) every 1,000 generations
  • apply mutations to a sample of 100 individuals at the end of the simulation
  • write the timings to a file called timings.txt

The command line for the above is:

python benchmarking.py --popsize 50000 --theta 10000 --rho 10000 --pdel 0.01 --gc 1000 --nsam 100 --outfile1 timings.txt.gz \
--seed $RANDOM

Note

The output file is gzip compressed!

Please be mindful of running these simulations on machines with little RAM! In general, forward simulations are intended to be run on HPC-strength hardware. While tree sequence simplification results in very efficient run times, we are sometimes still using a substantial amount of RAM.

An example of the output is:

prepping    sorting appending       simplifying     fwd_sim_runtime N       theta   rho     simplify_interval
0.05370585599999733 4.384206619999995       0.2173980950000004      2.9446647440000016      5.174604999999977       1000    1000.0  1000.0  100

The fields are:

  • prepping: cumulative time spent preparing data for a copy from the C++ side to msprime
  • sorting: cumulative time spent sorting tables, which is a requirement for simplification
  • simplifying: cumulative time spent simplifying tables
  • fwd_sim_runtime: The total time spent simulating

The remaining four columns are the command-line parameters.

Source code overview

The package consists of a mix of C++ and Python code. All source code is in the fwdpy11_arg_example subdirectory of thie main repository.

C++ code

We define nodes and edges as simple structs, meaning that they are "C-like", consisting only of POD and no constructors or other C++ stuff. This simple design allows C++ vectors of these structs to be treated as NumPy record arrays visible fom Python without needing to make a copy.

  • node.hpp defines a node as a simple C-like struct.
  • edge.hpp defines and edge as a simple C-like struct.
  • ancestry_tracker.hpp defines a C++ struct/class called ancestry_tracker to accumulate nodes and edges during a simulation.
  • evolve_generation.hpp handles the details of updating a Wright-Fisher population with an ancestry_tracker.
  • handle_recombination.cc/.hpp handles the conversion of fwdpp's recombination breakpoints into types use to make edges.
  • wfarg.cc defines a Python module (called wfarg) implemented in C++ via pybind11. It exposes our C++ back-end to Python. The most important user-facing type defined is AncestryTracker, which wraps the C++ ancestry_tracker.

Python code

  • argsimplifier.py defines ArgSimplifier, which is the bridge between the C++ code to evolve a population and the msprime functionality to simplify the simulated nodes and edges.
  • evolve_arg.py defines a function that evolves a population while tracking its ancestry. It integrates concepts from fwdpy11 with the types defined in this package.
You can’t perform that action at this time.