\documentclass{article}
\usepackage{fullpage}
\usepackage{amsmath, amssymb}
\usepackage[hidelinks]{hyperref}
\usepackage[utf8]{inputenc}
\usepackage{natbib}
\usepackage{graphicx}
\usepackage{enumitem}
\usepackage{listings}
\bibliographystyle{plainnat}
% \usepackage[T1]{fontenc}
\include{macros}
\newcommand{\simupop}{\texttt{simuPOP}}
\newcommand{\fwdpp}{\texttt{fwdpp}}
\newcommand{\fwdpy}{\texttt{fwdpy11}}
\newcommand{\justc}{\texttt{C}}
\newcommand{\cpp}{\texttt{C++}}
\newcommand{\msprime}{\texttt{msprime}}
\newcommand{\ftprime}{\texttt{ftprime}}
\newcommand{\nodes}{\texttt{nodes}}
\newcommand{\edgesets}{\texttt{edgesets}}
\newcommand{\sites}{\texttt{sites}}
\newcommand{\mutations}{\texttt{mutations}}
\newcommand{\setdiff}{\smallsetminus}
\newcommand{\Nt}{\mathcal{N}} % node table
\newcommand{\Et}{\mathcal{E}} % edge table
\newcommand{\St}{\mathcal{S}} % site table
\newcommand{\Mt}{\mathcal{M}} % mutation table
\newcommand{\Al}{\mathcal{A}} % list of ancestors
\newcommand{\priorityq}[1]{\mbox{\textbf{PriorityQueue}}({#1})}
\newcommand{\listnew}[1]{\mbox{\textbf{List}}(#1)}
\newcommand{\pqpush}[2]{#1.\mbox{\textbf{push}}(#2)}
\newcommand{\pqpop}[1]{#1.\mbox{\textbf{pop}}()}
\newcommand{\listappend}[2]{#1.\mbox{\textbf{append}}(#2)}
\newcommand{\ancsegment}[1]{\mbox{\textbf{Segment}}(#1)}
\newcommand{\taddrow}[2]{#1.\mbox{\textbf{addrow}}(#2)}
\newcommand{\attrparent}[1]{#1.\mbox{parent}}
\newcommand{\attrchild}[1]{#1.\mbox{child}}
\newcommand{\attrleft}[1]{#1.\mbox{left}}
\newcommand{\attrright}[1]{#1.\mbox{right}}
\newcommand{\attrnode}[1]{#1.\mbox{node}}
% mbox to avoid hyphenating
\newcommand{\tsimplify}[1]{\mbox{\textbf{simplify}}(#1)}
\newcommand{\tsort}[1]{\mbox{\textbf{sort}}(#1)}
\newcommand{\tnodetable}[1]{\mbox{\textbf{NodeTable}}(#1)}
\newcommand{\tedgetable}[1]{\mbox{\textbf{EdgeTable}}(#1)}
\usepackage{color}
\newcommand{\krt}[1]{{\em \color{green} #1}}
\newcommand{\plr}[1]{{\em \color{blue} #1}}
\newcommand{\jk}[1]{{\em \color{red} #1}}
\newcommand{\jda}[1]{{\em \color{cyan} #1}}
\begin{document}
\title{Efficient pedigree recording for fast population genetics simulation}
\author{Jerome Kelleher,
Kevin R. Thornton,
Jaime Ashander, and
Peter L. Ralph}
\maketitle
\begin{abstract}
In this paper we describe how to
efficiently record the entire genetic history of a population
during a forwards-time, individual-based population genetics simulation.
This dramatically reduces the computational burden of tracking individual genomes
by simulating only those loci that may affect reproduction (those having non-neutral variants),
and recording history as a \emph{succinct tree sequence} as introduced in the software package \msprime,
on which neutral mutations can be quickly placed afterwards.
We implement the method in two popular forwards-time simulation frameworks,
which speeds up large, whole-genome simulations by one to two orders of magnitude.
In addition to speed, the method has several advantages:
(1) All marginal genealogies of the simulated individuals are recorded, rather than just genotypes.
(2) A population of $N$ individuals with $M$ polymorphic sites
can be stored in $O(N \log N + M)$ space, making it feasible to store a simulation's entire final generation (and its history)
rather than only a subset.
(3) A simulation can easily be initialized with a more efficient coalescent simulation of deep history.
To make these possible,
we introduce the \msprime{} Tables API, which allows for the efficient interchange
of tree sequences between modular program components,
and provide an algorithm to quickly simplify a tree sequence by removing history irrelevant to a given set of genomes.
\end{abstract}
% NOTE: I tend to use semantic linebreaks.
% Please excuse the ragged right margins in the source.
%%%%%%%%%%%%%%%%%%%%%%
\section*{Introduction}
Since the 1980s, coalescent theory has enabled computer simulation of the results of population genetics models
identical to those that would be produced by large, randomly mating populations over long periods of time,
without actually requiring simulation of so many generations or meioses.
Coalescent theory thus had three transformative effects on population genetics:
first, giving researchers better conceptual tools to describe \emph{gene trees} and thus bringing within-population trees into better focus;
second, producing analytical methods to estimate parameters of interest from genetic data (e.g., $\theta = 4N_e \mu$);
and finally, providing a computationally feasible method to produce computer simulations of population genetics processes.
However, these powerful advances came with substantial caveats:
the backwards-in-time processes that are described by coalescent theory
are only \emph{Markovian}, and thus feasible to work with,
thanks to the important assumptions of
(a) random mating,
(b) neutrality,
and (c) small sample size relative to the population size.
The first two assumptions can be side-stepped to a limited extent \citep{hudson1990gene, Neuhauser1997-nn},
but it remains a challenge to map results of coalescent models
onto species that are distributed across continuous geographical
space~\citep{barton2010new,kelleher2014coalescent}
and/or have large numbers of loci under various sorts of selection.
Furthermore, the details of a species' life history --
fecundity and mortality schedules, density-dependent effects on fitness, and demographic fluctuations --
are all absorbed into a single compound parameter, the coalescence rate.
The third assumption is no longer safe, either --
for example, a recent study~\citep{martin2017human}
simulated 600,000 samples of human chromosome 20 to examine biases in GWAS.
Several studies have now shown that in samples approaching the size of the population,
genealogical properties may be distorted relative to the coalescent expectation
\citep{wakeley2003gene,maruvka2011recovering,bhaskar2014distortion}.
These considerations, and increasing computational power, have led to a resurgence of
interest in large forwards-time, individual-based simulations.
For instance, \citet{harris2016genetic} used SLiM \citep{slim} to simulate ten thousand human exomes
to assess the impact of genetic load and Neanderthal introgression on human genetic diversity.
\cite{Sanjak2017-ko} used
\fwdpp{} \citep{fwdpp} to simulate a series of models of quantitative traits under mutation-selection balance with
population sizes of $2 \times 10^4$ diploids in stable populations and populations growing up to $\sim 5
\times 10^5$ individuals, using the output to explore the relationship between the genotype/phenotype model and GWAS
outcomes.
Modern computing power easily allows simulations of birth, death and reproduction
in a population having even hundreds of millions of individuals.
However, if our interest lies in the resulting genetic patterns of variation
-- and often, the point of such simulations is to compare to real data --
then such simulations must record each individual's genome.
As samples of most species' genomes harbor tens or hundreds of millions of variant sites,
carrying full genotypes for even modest numbers of individuals through a simulation
can quickly become prohibitive.
To make matters worse,
a population of size $N$ must be simulated across many multiples of $N$ generations
to produce stable genetic patterns \citep{wright1931evolution, wakeley2005coalescent}.
Because of this computational burden, even the fastest simulation frameworks such as
SLiM 2~\citep{haller2017flexible} and fwdpp~\citep{fwdpp}
can ``only'' simulate tens of megabases of sequence in tens of thousands of individuals
for tens of thousands of generations.
%% plr: not sure why we need to say these things here?
% and there are still near-order-of-magnitude performance differences between these implementations~\citep{haller2017flexible}.
In practice, current state-of-the-art simulation software may take on the order of
weeks to simulate models of large genomic regions without selection~\citep{fwdpp,Hernandez2015-wf},
and existing simulation engines differ in how efficiently they
calculate fitnesses in models with selection~\citep{fwdpp}.
These population and region sizes are still substantially short of whole genomes
(hundreds to thousands of megabases)
for many biological population sizes of interest.
However, it is thought that most genetic variation is selectively neutral (or nearly so).
By definition, neutral alleles carried by individuals in a population
do not affect the population process.
For this reason, if one records the entire genealogical history of a population over the course of a simulation,
simply laying down neutral mutations on top of that history afterwards
is equivalent to having generated them during the simulation:
it does not matter if we generate each generation's mutations during the simulation, or afterwards.
Precisely, we need to know the genealogical trees relating all sampled individuals
at each position along the genome.
Combined with ancestral genotypes and the origins of new mutations,
these trees completely specify the genomic sequence of any individual in the population at any time.
To obtain this information, we record from forward simulation the \emph{population pedigree} --
the complete history of parent-offspring relationships of an entire population
going back to a remote time -- and the genetic outcomes of each ancestral meiosis,
periodically discarding all information irrelevant to the genetic history
of the extant population.
The information in this embellished pedigree is stored as a \emph{succinct tree sequence}
(or, for brevity, ``tree sequence''),
which contains all the information necessary
to construct the genealogical tree that relates each individual to every other
at each position on the genome.
In this paper, we describe a storage method for \emph{succinct tree sequences}
(and hence, genome sequence) as well as an algorithm for simplifying these.
The data structure is \emph{succinct} in the sense that its space usage is close to optimal,
while still allowing efficient retrieval of information (see, e.g.,
\citet{gog2014theory}).
The storage method and simplification algorithm are both implemented
as improvements to the algorithmic tools and data structures in \msprime{}.
We also describe how these tools can efficiently record,
and later process, the embellished population pedigree from a forwards-time simulation.
While providing substantial savings in computational time and space, our methods provide much
more information than simply simulating the genomes -- the tree sequence
encodes all marginal genealogies of individuals living at the end of the simulation.
Although we were motivated by a need for more efficient genomic simulations,
these tools may prove more widely useful.
%%%%%%%%%%%%%%%%%%%%%
\section*{Results}
The strategy described above is only of interest if it is computationally feasible.
Therefore,
we begin by benchmarking the performance improvement achieved by this method,
implemented using the forwards-time simulation library \fwdpp{} \citep{fwdpp}
and tree sequence tools implemented in \msprime{}.
Then, we describe the conceptual and algorithmic foundations for the method:
(a) a format, implemented in the \msprime{} Python API,
for recording tree sequences efficiently in several \emph{tables};
(b) an algorithm to record these tables during a forwards-time simulation;
and (c) an algorithm to \emph{simplify} a tree sequence, i.e., remove redundant information.
Finally, we analyze the run time and space complexity of our general-purpose method.
%%%%%%
\subsection*{Simulation benchmarks}
To measure the performance gains from recording the pedigree we ran simulations
both with and without recording.
(Although we record more than just the parent--offspring relationships of the pedigree,
for brevity we refer to the method as ``pedigree recording''.)
All simulations used \fwdpp{} to implement a discrete-time Wright-Fisher population of $N$ diploid individuals,
simulated for $10N$ generations (details below).
Simulations without pedigree recording introduced neutral mutations at a rate
equal to the recombination rate,
so $\mu = r$, where $\mu$ and $r$ are the expected per-generation number of mutations per gamete
and recombination breakpoints per diploid, respectively.
Simulations with pedigree recording introduced neutral mutations at the same rate
retrospectively, as described below, resulting in statistically identical simulation results.
% Both mutations and recombinations occur as Poisson processes with respective means $\mu$ and $r$,
% and we used the infinitely-many sites mutation model \citep{Kimura1969-uc}.
We ran simulations with different values of $N$ and varied the size of the genomic region according to the scaled recombination
parameter $\rho = 4Nr$.
%% this was said already:
% When simulating neutral mutations, we kept the scaled mutation parameter, $\theta = 4N\mu$, equal to $\rho$.
Deleterious mutations were introduced at rate $\rho/100$ per generation, drawing scaled selection
coefficients ($2Ns$)
from a Gamma distribution with a mean of -5 and a shape parameter of 1. This distribution of fitness effects results in
thousands of weakly-deleterious mutations segregating in the population, many of which drift to intermediate
frequencies. The case of many mutations with selection is a non-trivial violation of exchangeability assumptions of the
coalescent \citep{Neuhauser1997-nn}. Therefore, these selected mutations must be explicitly tracked in our forward simulation
and the time savings due to pedigree recording come from not having to record \textit{neutral} mutations.
% When tracking neutral mutations (instead of the pedigree), run times increase dramatically with increasing region size ($4Nr = 4Nu$ for these simulations).
Pedigree tracking dramatically reduced runtimes, as shown in Figure~\ref{fig:runtimes_selection},
producing a speedup of up to $\approx 50$-fold relative to standard simulations that track neutral mutations
(Figure~\ref{fig:relative_speedup_selection}).
Pedigree tracking results in greater relative speedups for larger $N$,
and we observe increasing relative speedups as $4Nr$ increases for a given $N$
(Figure~\ref{fig:relative_speedup_selection}).
Importantly, runtimes are approximately linear in region size $\rho$ when pedigree tracking
(partially obscured by the log scale of the horizontal axis in Figure~\ref{fig:runtimes_selection}).
%% don't know what an 'edge' is yet
% The reason for this behavior is that, when tracking pedigrees, the simulations are generating a constant number of new edges on average
% each generation as the number of new recombination breakpoints is Poisson distributed with expectation $\rho$.
In a more limited set of neutral simulations we found the same qualitative behavior,
and a numerically larger speedup by using pedigree tracking (see Appendix~\ref{ss:timing_nosel}).
In our implementation, simulations with pedigree recording
used substantially more RAM than simple forward simulations (see Appendix~\ref{ss:memuse}).
This is unsurprising:
unsimplified tree sequences grow quickly, and so storing history can use arbitrarily much memory.
However, this is not a requirement of the method, only a straightforward consequence of a speed--memory tradeoff:
the amount of required memory is mostly determined by the interval between simplification steps,
but less frequent simplification reduces overall computation time (see Appendix~\ref{ss:gcinterval}).
In fact, our method could in some situations \emph{reduce} the amount of memory required,
if memory usage in the forwards simulation was dominated by the cost of maintaining neutral genetic variants.
\begin{figure}
\includegraphics[]{sims/rawspeed}
\caption{
\label{fig:runtimes_selection}
Total run time per single simulation replicate as a function of region length.
% measured as the scaled recombination parameter $\rho = 4Nr$.
Line color represents different diploid population sizes ($N$).
The left panel shows run times for standard simulations including neutral mutations.
The right panel shows run times of simulations that recorded the pedigree
and added neutral mutations afterwards.
The dashed line in the right panel shows results for an implementation using \fwdpy{}
where the pedigree simplification steps were handled in a separate thread of execution
and fitness calculations were parallelized.
Simulations with $N=5 \times 10^4$ timed out for region sizes larger than $10^3$.
}
\end{figure}
\begin{figure}
\includegraphics[]{sims/speedup}
\caption{
\label{fig:relative_speedup_selection}
Relative speedup of simulations due to pedigree recording.
Each line shows the ratio of total run times of standard simulations to
those of simulations with pedigree recording.
Data points are taken from Figure~\ref{fig:runtimes_selection}
for simulations that ran to completion in both cases.
}
\end{figure}
% Maybe an estimate of how long \emph{just} the pedigree recording and simplification takes,
% so that then we can say how fast the simulator would have to be to do $10^6$
% whole chromosomes for $10^7$ generations in a day.
%%%%%%
\subsection*{Tables for succinct tree sequences}
We now explain what we actually did to achieve a $50\times$ speedup.
The ``pedigree recording'' simulations above recorded information about each new individual
in a collection of tables that together define a \emph{succinct tree sequence}
(or, simply ``tree sequence'').
A {tree sequence} is an encoding for a sequence of correlated trees,
such as those describing the history of a sexual population.
Tree sequences are efficient because branches that are shared by adjacent trees are stored once,
rather than repeatedly for each tree.
The topology of a tree sequence is defined via its \emph{nodes} and \emph{edges},
while information about variants is recorded as \emph{sites} and \emph{mutations};
we give an example in Figure~\ref{fig:example_tree_sequence}.
This formulation is derived from the ``coalescence records'' encoding of tree
sequences~\citep{kelleher2016efficient}, normalised to remove redundancy
and generalised to include a more general class of tree topologies.
The idea of a tree sequence is closely related to the \emph{ancestral recombination graph},
or {ARG} \citep{griffiths1991two,griffiths1997ancestral},
which also describes the embellished pedigree.
The ARG has been the subject of substantial study
under the assumptions of coalescent
theory~\citep{wiuf1997number,wiuf1999ancestry,marjoram2006coalescent,wilton2015smc}.
However, the properties of the ARG as a computational structure have not
been studied and, despite several efforts to standardise a common
format~\citep{morin2006netgen,mcgill2013graphml}, % TODO check these refs against others in msprime paper
ARGs are rarely used in practice.
In contrast, the algorithmic properties of tree sequence
algorithms have been explored in detail~\citep{kelleher2016efficient},
contributing substantially to the efficiency of the \msprime{} coalescent simulator.
The \emph{nodes} of a tree sequence
correspond to the vertices in the individual genealogies along the sequence.
Each node refers to a specific, distinct ancestor,
and so has a unique ``time'',
thought of as the node's birth time, which determines the height of any vertices
the node is associated with.
(Note that since each node time is equal to the amount of time since the {birth} of the
corresponding parent, time is measured in clock time, not in meioses.)
The example of Figure~\ref{fig:example_tree_sequence} has five nodes:
nodes 0, 1 and 2 occur at time 0 and are the \emph{samples},
while nodes 3 and 4 represent those ancestors necessary to record their genealogy,
who were born one and two units of time in the past, respectively.
\begin{figure}
\begin{center}
\includegraphics[width=\textwidth]{example_tree_sequence}
\end{center}
\caption{
An example tree sequence with three samples over a chromosome of length 10.
The left-hand panel shows the tree sequence pictorially in two different ways:
(top) as a sequence of tree topologies
and (bottom) the spatial extent of the edges that define these topologies.
The right-hand panels show the specific encoding
of this tree sequence in the four tables (nodes, edges, sites and mutations)
defined by \msprime.
\label{fig:example_tree_sequence}
}
\end{figure}
The \emph{edges} define how nodes relate to each other over specific genomic intervals.
Each edge records
% is a tuple $(\ell, r, p, c)$, where
the endpoints $[\ell, r)$ of the half-open genomic interval defining the
spatial extent of the edge;
and the identities $p$ and $c$ of the parent and child nodes
of a single branch that occurs in all trees in this interval.
The spatial extent of the edges defining the topology of Figure~\ref{fig:example_tree_sequence}
are shown in the bottom left panel.
For example, the branch joining node 1 to node 3 appears in both trees,
and so is recorded as a single edge extending over the whole chromosome.
It is this method of capturing the shared structure between adjacent trees that makes the
tree sequence encoding compact and algorithmically efficient.
Recovering the sequence of trees from this information is straightforward:
each point along the genome at which the tree topology changes
is accompanied by the end of some {edges} and the beginning of others.
Since each {edge} records the genomic interval
over which a given node inherits from a particular ancestor,
to construct the tree at a certain point in the genome
we need only retrieve all edges overlapping that point
and construct the corresponding tree.
To modify the tree to reflect the genealogy at a nearby location,
we simply remove those edges whose intervals do not overlap that location,
and add those new edges whose intervals do.
Incidentally, this property that edges naturally encode \emph{differences}
between nearby trees (e.g., as ``subtree prune and regraft'' moves)
allows for efficient algorithms to compute statistics of the genome sequence that take advantage
of the highly correlated nature of nearby trees~\citep{kelleher2016efficient}.
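For concreteness, the following stand-alone Python sketch
(using ad hoc types of our own rather than the \msprime{} API, and a small hypothetical edge table)
rebuilds the parent array of the marginal tree at a given position
by collecting exactly those edges whose intervals overlap it:
\begin{lstlisting}[language=Python]
from collections import namedtuple

# An edge records one parent--child branch over the half-open interval [left, right).
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])

def tree_at(edges, num_nodes, x):
    """Return a list mapping each node to its parent (-1 for roots)
    in the marginal tree covering genomic position x."""
    parent = [-1] * num_nodes
    for e in edges:
        if e.left <= x < e.right:
            parent[e.child] = e.parent
    return parent

# A hypothetical edge table: samples 0-2, ancestors 3 and 4;
# the branch 4 -> 3 is only present on [0, 7).
edges = [
    Edge(0, 10, 3, 0), Edge(0, 10, 3, 1),
    Edge(0, 10, 4, 2), Edge(0, 7, 4, 3),
]
print(tree_at(edges, 5, 2.5))  # [3, 3, 4, 4, -1]
print(tree_at(edges, 5, 8.0))  # [3, 3, 4, -1, -1]
\end{lstlisting}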
Given the topology defined by the nodes and edges, \emph{sites} and \emph{mutations}
encode the sequence information for each sample in an efficient way. Each site
records two things: its position on the genome and an ancestral state.
For example,
in Figure~\ref{fig:example_tree_sequence} we have two sites, one at position
2.5 with ancestral state `A' and the other at position 7.5 with ancestral state `G'.
If no mutations occur at a given site, all nodes inherit the ancestral state.
Each mutation records three things: the site at which it occurs,
the first node to inherit the mutation, and the derived state.
Thus, all nodes below the mutation's node in the tree will inherit this state,
unless further mutations are encountered.
Three mutations are shown in Figure~\ref{fig:example_tree_sequence},
illustrated by red stars.
The first site, in the left-hand tree,
has a single mutation, which results in node $2$ inheriting the state `T'.
The second site, in the right hand tree, has two mutations:
one occurring over node $3$ changing the state to `C',
and a back mutation over node $1$ changing the state to `G'.
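Decoding genotypes from this information is equally simple.
The following stand-alone Python sketch (again with ad hoc structures of our own and a hypothetical marginal tree)
recovers the allele carried by a sample at one site
by walking towards the root and stopping at the first mutation encountered:
\begin{lstlisting}[language=Python]
def allele_at_site(parent, sample, ancestral_state, mutations):
    """Walk from `sample` towards the root of the marginal tree (the
    `parent` list maps node -> parent, -1 at the root); the nearest
    mutation on that path determines the allele.  `mutations` maps
    node -> derived state for this site."""
    u = sample
    while u != -1:
        if u in mutations:
            return mutations[u]
        u = parent[u]
    return ancestral_state

# A hypothetical marginal tree: samples 0 and 1 attach to node 3,
# sample 2 and node 3 attach to the root, node 4.
parent = [3, 3, 4, 4, -1]
# A site with ancestral state 'G', a mutation to 'C' above node 3 and
# a back mutation to 'G' above node 1, as in the pattern described above.
mutations = {3: "C", 1: "G"}
for s in (0, 1, 2):
    print(s, allele_at_site(parent, s, "G", mutations))
# 0 C / 1 G / 2 G
\end{lstlisting}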
This encoding of a sequence of trees and accompanying mutational information is
very concise. To illustrate this, we used \msprime{} to simulate 500,000 samples of a
$200$ megabase chromosome with human-like parameters: $N_e=10^4$ and per-base mutation and
recombination rates of $10^{-8}$ per generation. This resulted
in about 1 million distinct marginal trees and $1.1$ million infinite-sites
mutations. The HDF5 file encoding the node, edge, site and mutation tables (as
described above) for this simulation consumed 157MiB of storage space. Using
the \msprime\ Python API, the time required to load this file into memory was
around 1.5 seconds, and the time required to iterate over all 1 million trees
was 2.7 seconds. In contrast, recording the topological information in Newick
format would require around 20 TiB and storing the genotype information
in VCF would require about 1 TiB (giving a compression factor of 144,000 in
this instance).
Working with either the Newick or VCF encoding
of this dataset would likely require several
days of CPU time just to read the information into memory.
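The following sketch shows how an experiment of this kind can be set up through the \msprime{} Python API;
the \texttt{simulate}, \texttt{dump} and \texttt{load} calls are those available at the time of writing,
and the parameters here are scaled down so that the example runs in seconds rather than reproducing the benchmark above.
\begin{lstlisting}[language=Python]
import msprime

# Coalescent simulation with human-like per-base rates, but a short
# chromosome and a small sample so the example runs quickly.
ts = msprime.simulate(
    sample_size=1000, Ne=10**4, length=1e6,
    recombination_rate=1e-8, mutation_rate=1e-8, random_seed=42)
print(ts.num_trees, "marginal trees and", ts.num_sites, "sites")

# The succinct tree sequence can be written to and read back from disk,
ts.dump("example.trees")
ts2 = msprime.load("example.trees")

# and iterating over every marginal tree is fast.
print(sum(1 for tree in ts2.trees()), "trees visited")
\end{lstlisting}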
\paragraph{Validity of a set of tables}
Given a set of node and edge tables as described above,
there are only two requirements that ensure the tables
describe a valid tree sequence.
% (There are essentially no such requirements on the site and mutation tables.)
These are:
\begin{enumerate}
\item Offspring must be born after their parents (and hence, no loops).
\item The set of intervals on which each individual is a child must be disjoint.
\end{enumerate}
A pair of node and edge tables that satisfy these two requirements
is guaranteed to uniquely describe at each point on the genome
a collection of directed, acyclic graphs -- in other words, a forest of trees.
For some applications it is necessary to check that at every point
there is only a \emph{single} tree.
Checking this is more difficult, but is implemented in \msprime{}'s API.
For efficiency, \msprime{} makes several other sortedness requirements on the tables
that are not necessarily satisfied by tables emitted by a forwards-time simulation.
\msprime{}'s API includes tools to rectify this by first sorting % (using \texttt{sort\_tables})
and then using the \texttt{simplify} algorithm described below, which works on sorted tables
and is guaranteed to produce a valid, \msprime{}-ready tree sequence.
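As a minimal sketch of this workflow
(assuming the \texttt{sort\_tables} and \texttt{simplify\_tables} entry points of the Tables API, as named at the time of writing),
a hand-written pair of tables can be sorted, simplified and loaded as follows:
\begin{lstlisting}[language=Python]
import msprime

nodes = msprime.NodeTable()
edges = msprime.EdgeTable()

# A tiny hand-written history: nodes 0 and 1 are alive now (time 0) and
# both inherit the whole chromosome [0, 1) from node 2, born 1 unit ago.
nodes.add_row(flags=1, time=0)   # node 0, a sample
nodes.add_row(flags=1, time=0)   # node 1, a sample
nodes.add_row(flags=0, time=1)   # node 2, their common parent
edges.add_row(left=0, right=1, parent=2, child=1)
edges.add_row(left=0, right=1, parent=2, child=0)  # deliberately out of order

# Enforce the sortedness requirements, then remove any history that is
# irrelevant to the samples; both calls modify the tables in place.
msprime.sort_tables(nodes=nodes, edges=edges)
msprime.simplify_tables(samples=[0, 1], nodes=nodes, edges=edges)

# The tables now describe a valid, msprime-ready tree sequence.
ts = msprime.load_tables(nodes=nodes, edges=edges)
print(ts.num_trees, ts.sample_size)
\end{lstlisting}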
%%%%%%
\subsection*{The \msprime\ Tables API}
The facilities for working with succinct tree sequences are implemented as part
of the \msprime\ Python API, which provides a powerful platform for processing
tree topology and mutation data. The new portions of \msprime{} that we discuss
here are dedicated to tree sequence input and output using simple tables of
data, as described above, so we refer to this as the ``Tables API''.
The Tables API is primarily designed to facilitate efficient interchange of
data between programs or between different modules of the same program. We
adopted a `columnar' design, where all the values for a
particular column are stored in adjacent memory locations.
There are many advantages to columnar storage -- for example, since adjacent
values in memory are from the same column, they tend to compress well,
and suitable encodings can be chosen on a per-column basis~\citep{abadi2006integrating}.
A particular advantage of this approach is that it enables very
efficient copying of data, and in principle zero-copy data access
(where a data consumer reads directly from the memory of a producer).
Our implementation
% uses the NumPy C API~\citep{walt2011numpy} to efficiently copy
efficiently copies data from Python as a NumPy array \citep{walt2011numpy}
into the low-level C library used to manipulate tree sequences.
This architecture allows for data transfer rates of gigabytes per second
(impossible under any text-based approach), while retaining excellent portability.
NumPy's array interface provides a great deal of flexibility and efficiency,
and makes it straightforward to transfer data from sources
such as HDF5 \citep{hdf5} or Dask~\citep{dask}.
For small scale data and debugging purposes, a simple text based format is also supported.
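For example, a simulator that accumulates node and edge data in NumPy arrays
can hand entire columns to the tables in single calls rather than appending row by row
(the \texttt{set\_columns} methods and column dtypes below are those of the Tables API at the time of writing):
\begin{lstlisting}[language=Python]
import numpy as np
import msprime

# Columns accumulated elsewhere, one entry per node or edge.
node_flags = np.ones(3, dtype=np.uint32)   # flag every node as a sample
node_time = np.array([1.0, 0.0, 0.0])      # node 0 is one generation old
edge_left = np.array([0.0, 0.0])
edge_right = np.array([1.0, 1.0])
edge_parent = np.array([0, 0], dtype=np.int32)
edge_child = np.array([1, 2], dtype=np.int32)

nodes = msprime.NodeTable()
edges = msprime.EdgeTable()
# Bulk, columnar transfer: each column is copied in a single call.
nodes.set_columns(flags=node_flags, time=node_time)
edges.set_columns(left=edge_left, right=edge_right,
                  parent=edge_parent, child=edge_child)

# Columns can be read back out as NumPy arrays in the same way.
print(nodes.time, edges.parent)
\end{lstlisting}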
The \msprime\ Python Tables API provides a general purpose toolkit for importing
and processing succinct tree sequences. Interoperation with Python simulators
is then straightforward. The implementation we benchmark here uses
\texttt{pybind11} (\url{https://github.com/pybind/pybind11/}) to interface
with the \fwdpp{} \cpp{} API \citep{fwdpp}. No modifications were
required to the \fwdpp{} code base; rather, we simply need to keep track of parent/offspring labels,
and perform simple processing of the recombination breakpoints from each mating
event to generate node and edge data. This information is then periodically copied
to the \msprime\ Tables API, where it is sorted and simplified.
\paragraph{Flexibility.}
To demonstrate the flexibility provided by the Tables API and provide an
implementation that decouples forward simulation internals from transfer of data
to \msprime, we also implemented a version of the simulations described in
``Simulation benchmarks'' separately in Python, described in Appendix~\ref{ss:simupop}.
In this proof-of-concept implementation, the simulation engine (we use \simupop{}, \citet{simupop})
invokes callbacks at critical points of the simulation, and we infer nodes and edges
from the information that is provided. Rows are appended to the tables
one-by-one, and the tables are periodically sorted and simplified to control
memory usage.
Benchmarking results from this implementation are shown (alongside results from \fwdpp{})
for simulations without selection in Appendix~\ref{ss:timing_nosel}:
a relatively modest speedup of around $5 \times$ is achieved, likely due to increased overhead.
%%%%%%
\subsection*{Recording the pedigree in forwards time}
To record the genealogical history of a forwards time simulation,
we need to record two things for each new chromosome:
the birth time; and the endpoints and parental IDs of each distinctly inherited segment.
These are naturally stored as the \emph{nodes} and \emph{edges} of a tree sequence.
To demonstrate the idea, we write out in pseudocode how to run a neutral Wright--Fisher simulation
that records genealogical history in this way.
The simulation will run for $T$ generations,
and has $N$ haploid individuals, each carrying a single chromosome of length $L$.
For simplicity, we sample exactly one crossover per generation.
We use $\randomuniform(A)$ to denote an element of the set $A$ chosen uniformly at random
(and all such instances are independent).
Given a node table $\Nt$, the function $\taddrow{\Nt}{t}$
adds a new node to the table $\Nt$ with time $t$
and returns the ID of this new node.
Similarly, the function $\taddrow{\Et}{\ell, r, p, c}$
adds a new edge $(\ell\text{eft}, r\text{ight}, p\text{arent}, c\text{hild})$ to the edge table $\Et$.
The function $\tsimplify{P, \Nt, \Et}$ (described below) % in section \ref{ss:simplify}
simplifies the history stored in the tables $\Nt$ and $\Et$
to the minimal information required to represent the genealogies of the list of node IDs $P$;
after simplification the nodes appearing in $P$ are relabeled $(0, 1, \ldots, |P|-1)$.
A step-by-step explanation follows the pseudocode.
\begin{taocpalg}{W}{Forwards-time tree sequence}
{Simulates a randomly mating population of $N$ haploid individuals with
chromosome of length $L$ for $T$ generations, and returns the node
and edge tables ($\Nt$ and $\Et$) recording the simulated history.
The tables are simplified every $s$ generations, removing genealogical
information from $\Nt$ and $\Et$ irrelevant to the current population $P$.
}
\algstep{W1.}{Initialisation.}{
Set $\Nt \leftarrow \tnodetable{}$, $\mathcal{E}
\leftarrow \tedgetable{}$, $t \leftarrow T$, and $j \leftarrow 0$.
For $0 \leq k < N$, set $P_k \leftarrow \taddrow{\Nt}{T}$.
}
\algstep{W2.}{Generation loop head: new node.}{Set $u \leftarrow \taddrow{\Nt}{t}$ and $P'_j \leftarrow u$.
}
\algstep{W3.}{Choose parents.}{Set $a \leftarrow \randomuniform(\{0, \dots, N - 1\})$,
$b \leftarrow \randomuniform(\{0, \dots, N - 1\})$ and $x \leftarrow \randomuniform((0, L))$.
}
\algstep{W4.}{Record edges.}{
Call $\taddrow{\Et}{0, x, P_a, u}$ and $\taddrow{\Et}{x, L, P_b, u}$.
}
\algstep{W5.}{Individual loop.}{Set $j \leftarrow j + 1$. If $j < N$ go to \algref{W2}.
Otherwise, if $t\bmod s \neq 0$ go to \algref{W7}.
}
\algstep{W6.}{Simplify.}{Call $\tsimplify{P', \Nt, \Et}$, and set $P'_k
\leftarrow k $ for $0 \leq k < N$. } %% too cryptic: (Tables may need to be sorted.)
\algstep{W7.}{Generation loop.}{Set $t \leftarrow t - 1$. If $t = 0$ terminate.
Set $P \leftarrow P'$, $j \leftarrow 0$, and go to \algref{W2}.
}
\end{taocpalg}
We begin in~\algref{W1} by creating new node and edge tables, and setting
our population $P$ (a vector of $N$ node IDs) to the initial population.
This initial population is a set of $N$ nodes with birth time $T$ generations
ago. We also initialise our generation clock $t$ and individual index $j$.
Step~\algref{W2} replaces the $j^\text{th}$ individual (with node ID $P_j$)
by creating a new node with birth time $t$ (and ID $u$).
In step~\algref{W3} we determine the new node's ancestry
by choosing two indexes $a$ and $b$ uniformly,
giving us parental IDs $P_a$ and $P_b$,
and choose a chromosomal breakpoint $x$.
We record the effects of this event by storing two new edges: one recording that the parent of node $u$
from $0$ to $x$ is $P_a$, and another recording that the parent of $u$
from $x$ to $L$ is $P_b$. Step \algref{W5} then iterates these steps
for each of the $N$ individuals for each generation.
At the end of a generation, we then check
if we need to simplify (done every $s$ generations).
If simplification is required, we do this in step \algref{W6} by calling the simplify function
on the node and edge tables with the current set of population IDs $P'$ as the samples.
This updates the tables in-place to remove all redundant
information, and remaps the specified sample IDs to $0, \dots, N - 1$ in the updated tables.
Hence, we set our current population IDs to
$0, \dots, N - 1$ after simplify has completed.
Step \algref{W7} loops these steps until the required number of generations have been simulated.
% then completes the algorithm by looping over generations;
% we decrement our clock $t$, and terminate if $t = 0$.
% Otherwise, we update our current population and individual index and loop
% back to \algref{W2}.
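The following Python sketch expresses the same logic through the Tables API.
It is a loose translation of Algorithm~W rather than a definitive implementation,
and the \texttt{sort\_tables} and \texttt{simplify\_tables} calls are those of the \msprime{} API at the time of writing.
\begin{lstlisting}[language=Python]
import random
import msprime

def wright_fisher(N, T, L, simplify_interval):
    """Neutral haploid Wright-Fisher model with one crossover per birth,
    recording nodes and edges in the spirit of Algorithm W."""
    nodes = msprime.NodeTable()
    edges = msprime.EdgeTable()
    # Founders are born T generations ago (larger time = further in the past).
    P = [nodes.add_row(flags=1, time=T) for _ in range(N)]
    for t in range(T - 1, -1, -1):  # children of generation t+1 are born at time t
        Pp = []
        for _ in range(N):
            u = nodes.add_row(flags=1, time=t)
            a, b = random.randrange(N), random.randrange(N)
            x = random.uniform(0, L)
            # u inherits [0, x) from P[a] and [x, L) from P[b].
            edges.add_row(left=0, right=x, parent=P[a], child=u)
            edges.add_row(left=x, right=L, parent=P[b], child=u)
            Pp.append(u)
        if t % simplify_interval == 0:
            msprime.sort_tables(nodes=nodes, edges=edges)
            msprime.simplify_tables(samples=Pp, nodes=nodes, edges=edges)
            Pp = list(range(N))  # the samples are relabelled 0, ..., N-1
        P = Pp
    return nodes, edges

nodes, edges = wright_fisher(N=10, T=100, L=1.0, simplify_interval=10)
ts = msprime.load_tables(nodes=nodes, edges=edges)
print(ts.num_trees, "marginal trees for the final generation")
\end{lstlisting}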
\begin{figure}
\begin{center}
\includegraphics{wf-before-after}
\end{center}
\caption{An example of a marginal genealogy from a Wright-Fisher simulation
with $N=5$. \textbf{(A)} the original tree including all
intermediate nodes and dead-ends, and \textbf{(B)} the minimal tree
relating all of the currently-alive individuals (27--31).
\label{fig:wf-trees}
}
\end{figure}
This algorithm records only topological information about the simulated genealogies,
but it is straightforward to add mutational information.
Mutations that occur during the simulation can be recorded
by simply storing the node in which they first occur, the derived state,
and (if not already present) the genomic position of the site at which it occurs.
This allows selected mutations, that the forwards time simulation must generate,
to be recorded in the tree sequence.
Neutral mutations can be generated after the simulation has completed, thus
avoiding the cost of generating the many mutations that are lost in the population.
This is straightforward to do because we have access to the marginal genealogies.
Figure~\ref{fig:wf-trees} shows
an example of a marginal genealogy produced by a forwards-time Wright--Fisher
process like Algorithm~\algref{W}.
On the left is the tree showing all the edges output by the simulation,
while on the right
is the minimal tree representing the ancestry of the current set of samples.
Clearly there is a great deal of redundancy in the topological
information output by the simulation.
This redundancy comes from two sources.
First, there are a large number of nodes in the tree that have only one child.
In Algorithm~\algref{W} we do not attempt to identify coalescence events,
but simply record all parent-child
relationships in the history of the population.
As such, many of these edges
will record the simple passing of genealogical information from parent to child
and only some small subset will correspond to coalescences within the marginal
trees. The second source of redundancy in the (unsimplified) output of Algorithm~\algref{W}
is due to the fact that lineages die out: a large number of
individuals in the simulation leave no descendants in the present day population.
Node 26 in Figure~\ref{fig:wf-trees}a, for example, leaves no
descendants in the current population, and so the entire path tracing back to
its common ancestor with 27 is redundant.
%%%%%%
\subsection*{Tree sequence simplification}
\label{ss:simplify}
It is desirable for many reasons to remove redundant information from a tree sequence.
To formalize this:
suppose that we are only interested in a subset of the nodes of a tree sequence
(which we refer to as our `samples'),
and wish to reduce this input tree sequence
to the smallest one that still completely describes the history of the specified samples,
having the following properties:
\begin{enumerate}
\item All marginal trees must match the subtree
of the corresponding tree in the input tree sequence
that is induced by the samples.
\item Within the marginal trees, all non-sample vertices must have at least
two children (i.e., unary tree vertices are removed).
\item Any nodes and edges not ancestral to any of the sampled nodes are removed.
\item There are no adjacent redundant edges, i.e., pairs of edges
$(\ell, x, p, c)$ and $(x, r, p, c)$ which can be represented with a single edge
$(\ell, r, p, c)$.
\end{enumerate}
\begin{figure}
\begin{center}
\includegraphics{method_diagram}
\end{center}
\caption{
An example of tree sequence simplification.
\textbf{(A)} The augmented pedigree diagram on the left
relates the ten homologous chromosomes of five diploid individuals (BC, DE, FG, HI, and JK)
to each other and to a common ancestral chromosome (A);
dotted lines connect the two chromosomes of each individual,
and solid lines lead to the products of their meioses.
The corresponding tables (right) have 11 node records (one for each chromosome)
and 15 edge records (one for each distinctly inherited segment).
Blue numbers denote crossing over locations --
for instance, $D$ and $E$ were parents to $G$,
who inherited the left 70\% of the chromosome from $E$ and the remainder from $D$.
$B$, $C$, $D$, and $E$ inherit clonally from $A$.
\textbf{(B)} The five distinct trees
found across the chromosome (blue numbers denote locations on the chromosome).
Labels after simplification are shown in red.
\textbf{(C)} Tables recording the tree sequence after simplification
with nodes $J$ and $K$ as samples.
The mapping from labels in the forwards time simulation to nodes in the tree sequence
is shown in red.
% which allows additional records to be added as the simulation progresses.
\label{fig:method_diagram}
}
\end{figure}
Simplification is essential
for keeping the information recorded by forwards simulation manageable,
and is also useful for extracting subsets of a tree sequence representing a very large dataset.
% The tree sequences produced by forwards simulations
% record all of history for everyone alive at any time through the simulation.
% Simplification is essential to reduce this
% to a manageable quantity that still contains all
% the information that we are interested in.
% Simplification is also useful if we have a
% tree sequence representing a large dataset and wish to extract the
% information relevant to a subset of the samples.
Our approach to simplification is based on Hudson's algorithm for simulating
the coalescent with recombination~\citep{hudson1983properties},
paralleling the implementation in \msprime{} \citep{kelleher2016efficient};
an implementation in pseudocode is provided in Appendix~\ref{ss:simplify_algorithm}.
Conceptually, this works by
(a) beginning by painting the chromosome in each sample a distinct color;
(b) moving back through history,
copying the colors of each chromosome to the portions of its parental chromosomes
from which it was inherited;
(c) each time we would paint two colors in the same spot (a coalescence),
record that information as an edge and instead paint a brand-new color;
and
(d) once all colors have coalesced on a given segment,
stop propagating it.
This ``paint pot'' description misses some details --
for instance, we must ensure that all coalescing segments in a given individual
are assigned the \emph{same} new color --
but is reasonably close.
Figure~\ref{fig:method_diagram} shows an example tree sequence,
before and after simplification,
and Figure~\ref{fig:simplify_state} depicts the ``paint pot'' state of the algorithm
during the process of simplifying this tree sequence.
\begin{figure}
\begin{center}
\includegraphics{simplify-state-diagram}
\end{center}
\caption{
A depiction of the state of the simplification algorithm
at each point in time,
in the example of Figure~\ref{fig:method_diagram}A.
Ancestral material from J (red) and K (blue) is traced up through the pedigree
until the two lineages coalesce;
the smaller colored chromosomes on either side of each solid arrow show the bits inherited from each of the two parental chromosomes;
each time two colors overlap, a coalescence occurs, and two edges are output.
For instance, both J and K inherit from H between 0.5 and 0.9,
which resulted in the first two edges of the simplified table of Figure~\ref{fig:method_diagram}C.
\label{fig:simplify_state}
}
\end{figure}
More concretely,
the algorithm works by moving back through time,
processing each parent in the input tree sequence in chronological order.
The main state of the algorithm at each point in time is a set of ancestral lineages,
and each lineage is a linked list of ancestral segments.
An ancestral segment $(\ell, r, u)$ is found in a lineage
if the output node $u$ inherits the genomic interval $[\ell, r)$ from that lineage
(and so $u$ corresponds to a ``color'' in the description above).
% These segments are stored in a collection of linked lists,
% one list of segments for each ancestral lineage present at that time.
We also maintain a map from input nodes to lineages. % called $A_j$
Crucially, the time required to run the algorithm is
linear in the number of edges of the input tree sequence.
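As a toy illustration of this state (for exposition only; the pseudocode in Appendix~\ref{ss:simplify_algorithm}
uses more efficient structures), a lineage can be held as a list of ancestral segments,
and a coalescence corresponds to an overlap between the segments of two lineages merging in a common ancestor:
\begin{lstlisting}[language=Python]
from dataclasses import dataclass

@dataclass
class Segment:
    left: float    # half-open genomic interval [left, right) ...
    right: float
    node: int      # ... inherited by this output node (a "colour")

def overlaps(lineage_a, lineage_b):
    """Yield (left, right, node_a, node_b) for every overlap between the
    segments of two lineages; each overlap is a coalescence and would be
    recorded as a pair of output edges."""
    for a in lineage_a:
        for b in lineage_b:
            left, right = max(a.left, b.left), min(a.right, b.right)
            if left < right:
                yield left, right, a.node, b.node

# Two children pass ancestral material up to a shared parent; they overlap
# on [0.5, 0.9) (compare J and K inheriting from H between 0.5 and 0.9).
child_1 = [Segment(0.0, 0.9, node=0)]
child_2 = [Segment(0.5, 1.0, node=1)]
print(list(overlaps(child_1, child_2)))   # [(0.5, 0.9, 0, 1)]
\end{lstlisting}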
%%%%%%%%%%%%%%
\subsubsection*{Sequential simplification and prior history}
\label{ss:seq_simp}
Any simulation scheme that records data into tables,
as Algorithm~\algref{W} does,
has its genealogical history available at any time as a tree sequence.
This has two additional advantages:
First, simplification can be run periodically through the simulation,
if we take the set of samples to be the entire currently alive population.
This is important in practice as it keeps memory usage from growing linearly (and quickly) with time.
Second, the simulation can
be \emph{begun} with a tree sequence produced by some other method -- for
instance, by a coalescent simulation with \msprime,
providing an easy, efficient way to specify prior history.
A natural question is now: how often should simplification occur?
Limited testing (described in Appendix~\ref{ss:gcinterval})
found that different simplification intervals affect run times by approximately
25\%, with the lowest run time occurring when simplifying every $10^3$ generations.
Thus, there is a memory-versus-speed tradeoff
-- simplifying more often would keep fewer extinct nodes and edges in memory.
% As shown in the next section,
% there is no computational advantage to simplifying more often than is necessary
% to keep memory usage down.
%
% \krt{Empirically, I did find that simplifying more often than every 1,000 generations gave a run-time hit of about 25
% percent???}
%%%%%%%%%%%%
\subsubsection*{Estimates of run-time complexity}
The simulation results shown in Figures \ref{fig:runtimes_selection} and \ref{fig:relative_speedup_selection} show that
pedigree tracking greatly speeds up forward-time simulations. Further, the run times with pedigree tracking become
near-linear in $\rho$ (Figure \ref{fig:runtimes_selection}). In this section, we explore the run-time complexity using a
simplified Wright-Fisher model with recombination.
%% Memory reduction thanks to simplification
Consider a simulation of a Wright-Fisher population of $2N$ haploid individuals
using Algorithm~\algref{W} for $T$ generations,
with simplify interval $s = 1$ so that redundant information is removed after every generation.
Since every chromosome inherits material from both parents,
without simplification this would produce tables of
$2NT$ nodes and
$4NT$ edges.
Suppose that $T$ is large enough that all samples coalesce within the simulation with high probability
($T = 20N$, say).
After simplification, we are left with the tree sequence describing the history
of only the current generation of $2N$ individuals.
Ignoring the possibility of coalescent events involving more than two lineages, this tree sequence has $4N-2$ edges to describe the leftmost tree;
and each time the marginal tree changes along the sequence,
four edges end and four new edges begin
(except for changes affecting the root, which require fewer; see \citet{kelleher2016efficient}).
Coalescent theory tells us that
the expected total length of the edges of a marginal tree is approximately $4N\log(2N)$,
which is also equal to the mean number of ancestral recombination events that occur on a branch of the marginal tree,
since we have taken one crossover per individual per generation.
Not all such recombinations actually change the tree topology,
so four times this gives us an upper bound on the expected number of additional edges.
Similarly, not every new edge derives from a never-before-seen node,
but the number of nodes is at most equal to the number of edges plus the sample size.
With $T=20N$, this reduces the initial storage of $120 N^2$ items to
% 4N - 2 + 2N + 4N log(2N) * 4 * 2 = 6N + 32 N log(2N)
$6N + 32 N \log(2N)$;
at $N=10^4$, a factor of 3,700.
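Explicitly, at $N = 10^4$ this comparison is
\begin{align*}
120 N^2 = 1.2 \times 10^{10}
\qquad \text{versus} \qquad
6N + 32 N \log(2N) \approx 6 \times 10^{4} + 3.2 \times 10^{5} \times 9.9 \approx 3.2 \times 10^{6} ,
\end{align*}
since $\log(2N) = \log(2 \times 10^4) \approx 9.9$.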
%% Progressive memory reduction as a function of T
To get an idea of how required space depends on the length of time between simplification steps,
suppose instead that we have run a simulation of $2N$ individuals for $T$ generations,
begun with no prior history.
How many of the resulting $4NT$ edges are required after simplification?
As above, the expected number of edges is bounded by $2N-2$ plus four times the mean marginal tree length.
Roughly, the length of time for which a marginal tree has $k$ tips is
$2N/(k(k-1)) = 2N(1/(k-1) - 1/k)$ generations,
and so the time to go from $N$ to $n$ tips is $2N(1/(n-1) - 1/(N-1))$.
This implies that after running for time $T$ we expect to
have around $n(T)$ roots, where $n(T) \approx 1 + 2N/(T+2)$.
The total tree length over this time is
$2 N \sum_{k=n(T)+1}^{N-1} 1/k$, which
% is approximately
% \begin{align*}
% 2 N \log\left( \frac{N}{1 + 2 N/(T+2)} \right) .
% \end{align*}
leads to an upper bound on the number of edges of
\begin{align}
\label{eqn:edge_bound}
2 N \left( 1 + 4 \log\left( \frac{NT}{T + 2 N} \right)\right) .
\end{align}
These calculations hold up quite well in practice,
as shown in Figure~\ref{fig:simplify_complexity}.
% by comparison to
% the number of edges actually required for 50 independent Wright--Fisher simulations,
% as a function of time (obtained from Algorithm~\algref{W} with $s=1$)
% is shown in Figure~\ref{fig:simplify_complexity}, compared to this prediction.
\begin{figure}
\begin{center}
\includegraphics{sims/simplify-results}
\end{center}
\caption{
Time and space complexity of simplify.
\textbf{(A)}
Number of edges in the simplified tree sequence
for 10 replicate Wright--Fisher simulations with $N=50$ as a function
of number of generations.
Each line is one simulation, the heavy blue line gives the average,
and the dashed line is the upper bound of equation \eqref{eqn:edge_bound}.
\textbf{(B)}
Time required to simplify the first $k$ edges of a large (4.2GiB)
unsimplified tree sequence produced by a forwards-time simulation plotted
against $k$. The time scales linearly with the number of input edges.
\textbf{(C)}
Time required to simplify the tree sequence resulting from a coalescent
simulation of 500,000 samples of a 200 megabase human chromosome
to a random subsample of $n$ samples, plotted against $n$
(note the log scale; the time scales logarithmically with $n$).
\label{fig:simplify_complexity}
}
\end{figure}
%% Above was edges; same holds for memory usage savings due to mutations
If we were to add mutations in forwards time
under the infinite-sites model with total mutation rate per generation $\mu$,
there would also be around $2NT\mu$ mutations (and the same number of sites).
Since mutations fall as recombinations do on the marginal trees,
adding them after simplification results in only around $4N\mu\log(2N)$ mutations,
and the same savings as we obtained from edges above,
even without considering the
computational burden of propagating mutations forwards across generations.
%% simplify run time
How does the computation \emph{time} required for simplification scale?
Simply because it must process each edge,
% below per discussion
% https://github.com/petrelharp/ftprime_ms/pull/37#discussion_r156617283
the simplification algorithm is at least linear in the number of edges of the input tree sequence.
Empirically the algorithm is exactly linear,
as seen in Figure~\ref{fig:simplify_complexity}B,
which shows the time required to simplify increasingly large subsets of a large tree sequence.
When simplifying the result of a forwards-time simulation, the number of edges is the main contributing factor.
Suppose on the other hand we want to
simplify an already-minimal but large tree sequence with $N$ nodes
to a subsample of size $n$.
How does the required time scale with $n$?
In this case, the computation is dominated by the size of the output tree sequence,
which grows with $\log(n)$, as shown in Figure~\ref{fig:simplify_complexity}C.
%% the below assumes naive propagation of genotypes
% Our method stores genealogies, and so records substantially more information
% than would a method only recording genotypes.
% However, since the simplification algorithm requires computational effort,
% it is informative to compare time complexity of the algorithm
% to one that propagates neutral genotypes.
% A typical individual differs at around $2 N \mu$ sites from the population's consensus sequence,
% so propagating these to offspring by simple copying will take $4 N^2 \mu$ operations per generation.
% On the other hand,
% in our scheme we must store $13N$ quantities per generation (two edges and one node per individual).
% The simplification algorithm is linear in the number of edges of the input tree sequence
% (because it must process each edge),
% and so multiplies this by a constant factor.
% These considerations imply that propagating neutral genotypes for $T$ generations has time complexity $O(\mu T N^2)$,
% while our implementation is only $O(T N \log(2N))$.
%%%%%%%%%%%%%%%%%%%%%%
\section*{Discussion}
In this paper, we have shown that storing pedigrees
and associated recombination events
in a forwards-time simulation
not only makes available a great deal more information about the simulated population,
but also can speed up the simulation by orders of magnitude.
To make this feasible,
we have described how to efficiently store this information in numerical tables,
and have described a fundamental algorithm for simplification of tree sequences.
Conceptually, recording of genealogical and recombination events
can happen independently of the details of simulation;
for this reason, we provide a well-defined and well-tested API in Python
for use in other code bases (a C API is also planned).
The idea of storing genealogical information to speed up simulations is not new.
It was implemented completely in AnA-FiTS~\citep{aberer2013rapid},
but without the critical step of discarding irrelevant genealogical information.
A similar but much more limited method for
discarding this information also appears in \citet{padhukasahasram2008exploring}.
The tree sequences produced by default by this method
are very compact, storing genotype \emph{and} genealogical information
in a small fraction of the space taken by a compressed VCF file.
The format also allows highly efficient processing for downstream analysis.
Efficient processing is possible because many statistics of interest for population genetics
are naturally expressed in terms of tree topologies,
and so can be quickly computed from the trees underlying the tree sequence format.
For example, pairwise nucleotide diversity, $\pi$, is the average density of
differences between sequences in the sample.
To compute this directly from sequence data at $m$ sites in $n$ samples
requires computing allele frequencies, taking $O(nm)$ operations.
By using the locations of the mutations on the marginal trees,
and the fact that these are correlated,
sequential tree algorithms similar to those in~\citep{kelleher2016efficient}
can do this in roughly $O(n + m + t \log n)$ operations, where
$t$ is the number of distinct trees.
The \msprime{} API provides a method to compute $\pi$ among arbitrary subsets of the
samples in a tree sequence, which took about 0.7 seconds when
applied to an example simulation of 100 megabases of human-like
sequence for 200,000 samples (about 500K sites). The corresponding
numeric genotype matrix required about 95GiB of RAM, and
calculating $\pi$ took about 66 seconds with NumPy.
% This is done by running $ python msprime-examples.py benchmark-pi
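A sketch of that comparison at a much smaller scale is below;
the \texttt{get\_pairwise\_diversity} method and \texttt{variants} iterator are those of the \msprime{} Python API at the time of writing,
and the explicit loop over allele frequencies stands in for the genotype-matrix calculation.
\begin{lstlisting}[language=Python]
import numpy as np
import msprime

ts = msprime.simulate(sample_size=1000, Ne=10**4, length=1e6,
                      recombination_rate=1e-8, mutation_rate=1e-8,
                      random_seed=7)

# Tree-based computation provided by the API.
pi_trees = ts.get_pairwise_diversity()

# Naive computation from allele frequencies: O(nm) in samples and sites.
n = ts.sample_size
pi_naive = 0.0
for variant in ts.variants():
    p = np.mean(variant.genotypes)   # derived allele frequency at this site
    pi_naive += 2 * p * (1 - p) * n / (n - 1)

print(pi_trees, pi_naive)   # these should agree closely
\end{lstlisting}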
Another attractive feature of this set of tools
is that it makes it easy to incorporate \emph{prior history},
simply by seeding the simulation with a (relatively inexpensive) coalescent simulation.
This allows for incorporation of deep-time history beyond the reach of individual-based simulations.
This may not even negatively affect realism,
since geographic structure from times longer ago than the mixing
time of migration across the range has limited effect on modern genealogies
\citep{wilkins2004separation},
other than possibly changing effective population size \citep{barton2002neutral,cox2002stepping}.
Simulating very large populations and entire genomes will likely require parallelization.