\documentclass{article}
\usepackage{fullpage}
\usepackage{amsmath, amssymb}
\usepackage[hidelinks]{hyperref}
\usepackage[utf8]{inputenc}
\usepackage{natbib}
\usepackage{graphicx}
\usepackage{enumitem}
\usepackage{listings}
\bibliographystyle{plainnat}
% \usepackage[T1]{fontenc}
\include{macros}
\newcommand{\simupop}{\texttt{simuPOP}}
\newcommand{\fwdpp}{\texttt{fwdpp}}
\newcommand{\fwdpy}{\texttt{fwdpy11}}
\newcommand{\justc}{\texttt{C}}
\newcommand{\cpp}{\texttt{C++}}
\newcommand{\msprime}{\texttt{msprime}}
\newcommand{\ftprime}{\texttt{ftprime}}
\newcommand{\nodes}{\texttt{nodes}}
\newcommand{\edgesets}{\texttt{edgesets}}
\newcommand{\sites}{\texttt{sites}}
\newcommand{\mutations}{\texttt{mutations}}
\newcommand{\setdiff}{\smallsetminus}
\newcommand{\Nt}{\mathcal{N}} % node table
\newcommand{\Et}{\mathcal{E}} % edge table
\newcommand{\St}{\mathcal{S}} % site table
\newcommand{\Mt}{\mathcal{M}} % mutation table
\newcommand{\Al}{\mathcal{A}} % list of ancestors
\newcommand{\priorityq}[1]{\mbox{\textbf{PriorityQueue}}({#1})}
\newcommand{\listnew}[1]{\mbox{\textbf{List}}(#1)}
\newcommand{\pqpush}[2]{#1.\mbox{\textbf{push}}(#2)}
\newcommand{\pqpop}[1]{#1.\mbox{\textbf{pop}}()}
\newcommand{\listappend}[2]{#1.\mbox{\textbf{append}}(#2)}
\newcommand{\ancsegment}[1]{\mbox{\textbf{Segment}}(#1)}
\newcommand{\taddrow}[2]{#1.\mbox{\textbf{addrow}}(#2)}
\newcommand{\attrparent}[1]{#1.\mbox{parent}}
\newcommand{\attrchild}[1]{#1.\mbox{child}}
\newcommand{\attrleft}[1]{#1.\mbox{left}}
\newcommand{\attrright}[1]{#1.\mbox{right}}
\newcommand{\attrnode}[1]{#1.\mbox{node}}
% mbox to avoid hyphenating
\newcommand{\tsimplify}[1]{\mbox{\textbf{simplify}}(#1)}
\newcommand{\tsort}[1]{\mbox{\textbf{sort}}(#1)}
\newcommand{\tnodetable}[1]{\mbox{\textbf{NodeTable}}(#1)}
\newcommand{\tedgetable}[1]{\mbox{\textbf{EdgeTable}}(#1)}
\usepackage{color}
\newcommand{\krt}[1]{{\em \color{green} #1}}
\newcommand{\plr}[1]{{\em \color{blue} #1}}
\newcommand{\jk}[1]{{\em \color{red} #1}}
\newcommand{\jda}[1]{{\em \color{cyan} #1}}
\begin{document}
\title{Efficient pedigree recording for fast population genetics simulation}
\author{Jerome Kelleher,
Kevin R. Thornton,
Jaime Ashander, and
Peter L. Ralph}
\maketitle
\begin{abstract}
In this paper we describe how to
efficiently record the entire genetic history of a population
during a forwards-time, individual-based population genetics simulation.
This dramatically reduces the computational burden of tracking individual genomes
by simulating only those loci that may affect reproduction (those having non-neutral variants),
and recording history as a \emph{succinct tree sequence} as introduced in the software package \msprime,
on which neutral mutations can be quickly placed afterwards.
We implement the method in two popular forwards-time simulation frameworks,
which speeds up large, whole-genome simulations by one to two orders of magnitude.
In addition to speed, the method has several advantages:
(1) All marginal genealogies of the simulated individuals are recorded, rather than just genotypes.
(2) A population of $N$ individuals with $M$ polymorphic sites
can be stored in $O(N \log N + M)$ space, making it feasible to store a simulation's entire final generation (and its history)
rather than only a subset.
(3) A simulation can easily be initialized with a more efficient coalescent simulation of deep history.
To make these possible,
we introduce the \msprime{} Tables API, which allows for the efficient interchange
of tree sequences between modular program components,
and provide an algorithm to quickly simplify a tree sequence by removing history irrelevant to a given set of genomes.
\end{abstract}
% NOTE: I tend to use semantic linebreaks.
% Please excuse the ragged right margins in the source.
%%%%%%%%%%%%%%%%%%%%%%
\section*{Introduction}
Since the 1980s, coalescent theory has enabled computer simulation of the results of population genetics models
identical to those that would be produced by large, randomly mating populations over long periods of time,
without actually requiring simulation of so many generations or meioses.
Coalescent theory thus had three transformative effects on population genetics:
first, giving researchers better conceptual tools to describe \emph{gene trees} and thus bringing within-population trees into better focus;
second, producing analytical methods to estimate parameters of interest from genetic data (e.g., $\theta = 4N_e \mu$);
and finally, providing a computationally feasible method to produce computer simulations of population genetics processes.
However, these powerful advances came with substantial caveats:
the backwards-in-time processes that are described by coalescent theory
are only \emph{Markovian}, and thus feasible to work with,
thanks to the important assumptions of
(a) random mating,
(b) neutrality,
and (c) small sample size relative to the population size.
The first two assumptions can be side-stepped to a limited extent \citep{hudson1990gene, Neuhauser1997-nn},
but it remains a challenge to map results of coalescent models
onto species that are distributed across continuous geographical
space~\citep{barton2010new,kelleher2014coalescent}
and/or have large numbers of loci under various sorts of selection.
Furthermore, the details of a species' life history --
fecundity and mortality schedules, density-dependent effects on fitness, and demographic fluctuations --
are all absorbed into a single compound parameter, the coalescence rate.
The third assumption is no longer safe, either --
for example, a recent study~\citep{martin2017human}
simulated 600,000 samples of human chromosome 20 to examine biases in GWAS.
Several studies have now shown that in samples approaching the size of the population,
genealogical properties may be distorted relative to the coalescent expectation
\citep{wakeley2003gene,maruvka2011recovering,bhaskar2014distortion}.
These considerations, and increasing computational power, have led to a resurgence of
interest in large forwards-time, individual-based simulations.
For instance, \citet{harris2016genetic} used SLiM \citep{slim} to simulate ten thousand human exomes
to assess the impact of genetic load and Neanderthal introgression on human genetic diversity.
\cite{Sanjak2017-ko} used
\fwdpp{} \citep{fwdpp} to simulate a series of models of quantitative traits under mutation-selection balance with
population sizes of $2 \times 10^4$ diploids in stable populations and populations growing up to $\sim 5
\times 10^5$ individuals, using the output to explore the relationship between the genotype/phenotype model and GWAS
outcomes.
Modern computing power easily allows simulations of birth, death and reproduction
in a population having even hundreds of millions of individuals.
However, if our interest lies in the resulting genetic patterns of variation
-- and often, the point of such simulations is to compare to real data --
then such simulations must record each individual's genome.
As samples of most species' genomes harbor tens or hundreds of millions of variant sites,
carrying full genotypes for even modest numbers of individuals through a simulation
can quickly become prohibitive.
To make matters worse,
a population of size $N$ must be simulated across many multiples of $N$ generations
to produce stable genetic patterns \citep{wright1931evolution, wakeley2005coalescent}.
Because of this computational burden, even the fastest simulation frameworks such as
SLiM 2~\citep{haller2017flexible} and fwdpp~\citep{fwdpp}
can ``only'' simulate tens of megabases of sequence in tens of thousands of individuals
for tens of thousands of generations.
%% plr: not sure why we need to say these things here?
% and there are still near-order-of-magnitude performance differences between these implementations~\citep{haller2017flexible}.
In practice, current state-of-the-art simulation software may take on the order of
weeks to simulate models of large genomic regions without selection~\citep{fwdpp,Hernandez2015-wf},
and existing simulation engines differ in how efficiently they
calculate fitnesses in models with selection~\citep{fwdpp}.
These population and region sizes are still substantially short of whole genomes
(hundreds to thousands of megabases)
for many biological population sizes of interest.
However, it is thought that most genetic variation is selectively neutral (or nearly so).
By definition, neutral alleles carried by individuals in a population
do not affect the population process.
For this reason, if one records the entire genealogical history of a population over the course of a simulation,
simply laying down neutral mutations on top of that history afterwards
is equivalent to having generated them during the simulation:
it does not matter if we generate each generation's mutations during the simulation, or afterwards.
Precisely, we need to know the genealogical trees relating all sampled individuals
at each position along the genome.
Combined with ancestral genotypes and the origins of new mutations,
these trees completely specify the genomic sequence of any individual in the population at any time.
To obtain this information, we record from forward simulation the \emph{population pedigree} --
the complete history of parent-offspring relationships of an entire population
going back to a remote time -- and the genetic outcomes of each ancestral meiosis,
periodically discarding all information irrelevant to the genetic history
of the extant population.
The information in this embellished pedigree is stored as a \emph{succinct tree sequence}
(or, for brevity, ``tree sequence''),
which contains all the information necessary
to construct the genealogical tree that relates each individual to every other
at each position on the genome.
In this paper, we describe a storage method for \emph{succinct tree sequences}
(and hence, genome sequence) as well as an algorithm for simplifying these.
The data structure is \emph{succinct} in the sense that its space usage is close to optimal,
while still allowing efficient retrieval of information (see, e.g.,
\citet{gog2014theory}).
The storage method and simplification algorithm are both implemented
as improvements to the algorithmic tools and data structures in \msprime{}.
We also describe how these tools can efficiently record,
and later process, the embellished population pedigree from a forwards-time simulation.
While providing substantial savings in computational time and space, our methods provide much
more information than simply simulating the genomes -- the tree sequence
encodes all marginal genealogies of individuals living at the end of the simulation.
Although we were motivated by a need for more efficient genomic simulations,
these tools may prove more widely useful.
%%%%%%%%%%%%%%%%%%%%%
\section*{Results}
The strategy described above is only of interest if it is computationally feasible.
Therefore,
we begin by benchmarking the performance improvement achieved by this method,
implemented using the forwards-time simulation library \fwdpp{} \citep{fwdpp}
and tree sequence tools implemented in \msprime{}.
Then, we describe the conceptual and algorithmic foundations for the method:
(a) a format, implemented in the \msprime{} Python API,
for recording tree sequences efficiently in several \emph{tables};
(b) an algorithm to record these tables during a forwards-time simulation;
and (c) an algorithm to \emph{simplify} a tree sequence, i.e., remove redundant information.
Finally, we analyze the run time and space complexity of our general-purpose method.
%%%%%%
\subsection*{Simulation benchmarks}
To measure the performance gains from recording the pedigree we ran simulations
both with and without recording.
(Although we record more than just the parent--offspring relationships of the pedigree,
for brevity we refer to the method as ``pedigree recording''.)
All simulations used \fwdpp{} to implement a discrete-time Wright-Fisher population of $N$ diploid individuals,
simulated for $10N$ generations (details below).
Simulations without pedigree recording introduced neutral mutations at a rate
equal to the recombination rate,
so $\mu = r$, where $\mu$ and $r$ are the expected per-generation number of mutations per gamete
and recombination breakpoints per diploid, respectively.
Simulations with pedigree recording introduced neutral mutations at the same rate
retrospectively, as described below, resulting in statistically identical simulation results.
% Both mutations and recombinations occur as Poisson processes with respective means $\mu$ and $r$,
% and we used the infinitely-many sites mutation model \citep{Kimura1969-uc}.
We ran simulations with different values of $N$ and varied the size of the genomic region according to the scaled recombination
parameter $\rho = 4Nr$.
%% this was said already:
% When simulating neutral mutations, we kept the scaled mutation parameter, $\theta = 4N\mu$, equal to $\rho$.
Deleterious mutations were introduced at rate $\rho/100$ per generation, drawing scaled selection
coefficients ($2Ns$)
from a Gamma distribution with a mean of -5 and a shape parameter of 1. This distribution of fitness effects results in
thousands of weakly-deleterious mutations segregating in the population, many of which drift to intermediate
frequencies. The case of many mutations with selection is a non-trivial violation of exchangeability assumptions of the
coalescent \citep{Neuhauser1997-nn}. Therefore, these selected mutations must be explicitly tracked in our forward simulation
and the time savings due to pedigree recording come from not having to record \textit{neutral} mutations.
% When tracking neutral mutations (instead of the pedigree), run times increase dramatically with increasing region size ($4Nr = 4Nu$ for these simulations).
Pedigree tracking dramatically reduced runtimes, as shown in Figure~\ref{fig:runtimes_selection},
producing a speedup of up to $\approx 50$-fold relative to standard simulations that track neutral mutations
(Figure~\ref{fig:relative_speedup_selection}).
Pedigree tracking results in greater relative speedups for larger $N$,
and we observe increasing relative speedups as $4Nr$ increases for a given $N$
(Figure~\ref{fig:relative_speedup_selection}).
Importantly, runtimes are approximately linear in region size $\rho$ when pedigree tracking
(partially obscured by the log scale of the horizontal axis in Figure~\ref{fig:runtimes_selection}).
%% don't know what an 'edge' is yet
% The reason for this behavior is that, when tracking pedigrees, the simulations are generating a constant number of new edges on average
% each generation as the number of new recombination breakpoints is Poisson distributed with expectation $\rho$.
In a more limited set of neutral simulations we found the same qualitative behavior,
and a numerically larger speedup by using pedigree tracking (see Appendix~\ref{ss:timing_nosel}).
In our implementation, simulations with pedigree recording
used substantially more RAM than simple forward simulations (see Appendix~\ref{ss:memuse}).
This is unsurprising:
unsimplified tree sequences grow quickly, and so storing history can use arbitrarily much memory.
However, this is not a requirement of the method, only a straightforward consequence of a speed--memory tradeoff:
the amount of required memory is mostly determined by the interval between simplification steps,
but less frequent simplification reduces overall computation time (see Appendix~\ref{ss:gcinterval}).
In fact, our method could in some situations \emph{reduce} the amount of memory required,
if memory usage in the forwards simulation was dominated by the cost of maintaining neutral genetic variants.
\begin{figure}
\includegraphics[]{sims/rawspeed}
\caption{
\label{fig:runtimes_selection}
Total run time per single simulation replicate as a function of region length.
% measured as the scaled recombination parameter $\rho = 4Nr$.
Line color represents different diploid population sizes ($N$).
The left panel shows run times for standard simulations including neutral mutations.
The right panel shows run times of simulations that recorded the pedigree
and added neutral mutations afterwards.
The dashed line in the right panel shows results for an implementation using \fwdpy{}
where the pedigree simplification steps were handled in a separate thread of execution
and fitness calculations were parallelized.
Simulations with $N=5 \times 10^4$ timed out for region sizes larger than $10^3$.
}
\end{figure}
\begin{figure}
\includegraphics[]{sims/speedup}
\caption{
\label{fig:relative_speedup_selection}
Relative speedup of simulations due to pedigree recording.
Each line shows the ratio of total run times of standard simulations to
those of simulations with pedigree recording.
Data points are taken from Figure~\ref{fig:runtimes_selection}
for simulations that ran to completion in both cases.
}
\end{figure}
% Maybe an estimate of how long \emph{just} the pedigree recording and simplification takes,
% so that then we can say how fast the simulator would have to be to do $10^6$
% whole chromosomes for $10^7$ generations in a day.
%%%%%%
\subsection*{Tables for succinct tree sequences}
We now explain what we actually did to achieve a $50\times$ speedup.
The ``pedigree recording'' simulations above recorded information about each new individual
in a collection of tables that together define a \emph{succinct tree sequence}
(or, simply ``tree sequence'').
A {tree sequence} is an encoding for a sequence of correlated trees,
such as those describing the history of a sexual population.
Tree sequences are efficient because branches that are shared by adjacent trees are stored once,
rather than repeatedly for each tree.
The topology of a tree sequence is defined via its \emph{nodes} and \emph{edges},
while information about variants is recorded as \emph{sites} and \emph{mutations};
we give an example in Figure~\ref{fig:example_tree_sequence}.
This formulation is derived from the ``coalescence records'' encoding of tree
sequences~\citep{kelleher2016efficient}, normalised to remove redundancy
and generalised to include a more general class of tree topologies.
The idea of a tree sequence is closely related to the \emph{ancestral recombination graph},
or {ARG} \citep{griffiths1991two,griffiths1997ancestral},
which also describes the embellished pedigree.
The ARG has been the subject of substantial study
under the assumptions of coalescent
theory~\citep{wiuf1997number,wiuf1999ancestry,marjoram2006coalescent,wilton2015smc}.
However, the properties of the ARG as a computational structure have not
been studied and, despite several efforts to standardise a common
format~\citep{morin2006netgen,mcgill2013graphml}, % TODO check these refs against others in msprime paper
ARGs are rarely used in practice.
In contrast, the algorithmic properties of tree sequence
algorithms have been explored in detail~\citep{kelleher2016efficient},
contributing substantially to the efficiency of the \msprime{} coalescent simulator.
The \emph{nodes} of a tree sequence
correspond to the vertices in the individual genealogies along the sequence.
Each node refers to a specific, distinct ancestor,
and so has a unique ``time'',
thought of as the node's birth time, which determines the height of any vertices
the node is associated with.
(Note that since each node time is equal to the amount of time since the {birth} of the
corresponding parent, time is measured in clock time, not in meioses.)
The example of Figure~\ref{fig:example_tree_sequence} has five nodes:
nodes 0, 1 and 2 occur at time 0 and are the \emph{samples},
while nodes 3 and 4 represent those ancestors necessary to record their genealogy,
who were born one and two units of time in the past, respectively.
\begin{figure}
\begin{center}
\includegraphics[width=\textwidth]{example_tree_sequence}
\end{center}
\caption{
An example tree sequence with three samples over a chromosome of length 10.
The left-hand panel shows the tree sequence pictorially in two different ways:
(top) as a sequence of tree topologies
and (bottom) the spatial extent of the edges that define these topologies.
The right-hand panels show the specific encoding
of this tree sequence in the four tables (nodes, edges, sites and mutations)
defined by \msprime.
\label{fig:example_tree_sequence}
}
\end{figure}
The \emph{edges} define how nodes relate to each other over specific genomic intervals.
Each edge records
% is a tuple $(\ell, r, p, c)$, where
the endpoints $[\ell, r)$ of the half-open genomic interval defining the
spatial extent of the edge;
and the identities $p$ and $c$ of the parent and child nodes
of a single branch that occurs in all trees in this interval.
The spatial extent of the edges defining the topology of Figure~\ref{fig:example_tree_sequence}
are shown in the bottom left panel.
For example, the branch joining node 1 to node 3 appears in both trees,
and so is recorded as a single edge extending over the whole chromosome.
It is this method of capturing the shared structure between adjacent trees that makes the
tree sequence encoding compact and algorithmically efficient.
Recovering the sequence of trees from this information is straightforward:
each point along the genome at which the tree topology changes
is accompanied by the end of some {edges} and the beginning of others.
Since each {edge} records the genomic interval
over which a given node inherits from a particular ancestor,
to construct the tree at a certain point in the genome
we need only retrieve all edges overlapping that point
and construct the corresponding tree.
To modify the tree to reflect the genealogy at a nearby location,
we simply remove those edges whose intervals do not overlap that location,
and add those new edges whose intervals do.
Incidentally, this property that edges naturally encode \emph{differences}
between nearby trees (e.g., as ``subtree prune and regraft'' moves)
allows for efficient algorithms to compute statistics of the genome sequence that take advantage
of the highly correlated nature of nearby trees~\citep{kelleher2016efficient}.
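For concreteness, the following stand-alone Python sketch
(using ad hoc types of our own rather than the \msprime{} API, and a small hypothetical edge table)
rebuilds the parent array of the marginal tree at a given position
by collecting exactly those edges whose intervals overlap it:
\begin{lstlisting}[language=Python]
from collections import namedtuple

# An edge records one parent--child branch over the half-open interval [left, right).
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])

def tree_at(edges, num_nodes, x):
    """Return a list mapping each node to its parent (-1 for roots)
    in the marginal tree covering genomic position x."""
    parent = [-1] * num_nodes
    for e in edges:
        if e.left <= x < e.right:
            parent[e.child] = e.parent
    return parent

# A hypothetical edge table: samples 0-2, ancestors 3 and 4;
# the branch 4 -> 3 is only present on [0, 7).
edges = [
    Edge(0, 10, 3, 0), Edge(0, 10, 3, 1),
    Edge(0, 10, 4, 2), Edge(0, 7, 4, 3),
]
print(tree_at(edges, 5, 2.5))  # [3, 3, 4, 4, -1]
print(tree_at(edges, 5, 8.0))  # [3, 3, 4, -1, -1]
\end{lstlisting}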
Given the topology defined by the nodes and edges, \emph{sites} and \emph{mutations}
encode the sequence information for each sample in an efficient way. Each site
records two things: its position on the genome and an ancestral state.
For example,
in Figure~\ref{fig:example_tree_sequence} we have two sites, one at position
2.5 with ancestral state `A' and the other at position 7.5 with ancestral state `G'.
If no mutations occur at a given site, all nodes inherit the ancestral state.
Each mutation records three things: the site at which it occurs,
the first node to inherit the mutation, and the derived state.
Thus, all nodes below the mutation's node in the tree will inherit this state,
unless further mutations are encountered.
Three mutations are shown in Figure~\ref{fig:example_tree_sequence},
illustrated by red stars.
The first site, in the left-hand tree,
has a single mutation, which results in node $2$ inheriting the state `T'.
The second site, in the right hand tree, has two mutations:
one occurring over node $3$ changing the state to `C',
and a back mutation over node $1$ changing the state to `G'.
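Decoding genotypes from this information is equally simple.
The following stand-alone Python sketch (again with ad hoc structures of our own and a hypothetical marginal tree)
recovers the allele carried by a sample at one site
by walking towards the root and stopping at the first mutation encountered:
\begin{lstlisting}[language=Python]
def allele_at_site(parent, sample, ancestral_state, mutations):
    """Walk from `sample` towards the root of the marginal tree (the
    `parent` list maps node -> parent, -1 at the root); the nearest
    mutation on that path determines the allele.  `mutations` maps
    node -> derived state for this site."""
    u = sample
    while u != -1:
        if u in mutations:
            return mutations[u]
        u = parent[u]
    return ancestral_state

# A hypothetical marginal tree: samples 0 and 1 attach to node 3,
# sample 2 and node 3 attach to the root, node 4.
parent = [3, 3, 4, 4, -1]
# A site with ancestral state 'G', a mutation to 'C' above node 3 and
# a back mutation to 'G' above node 1, as in the pattern described above.
mutations = {3: "C", 1: "G"}
for s in (0, 1, 2):
    print(s, allele_at_site(parent, s, "G", mutations))
# 0 C / 1 G / 2 G
\end{lstlisting}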
This encoding of a sequence of trees and accompanying mutational information is
very concise. To illustrate this, we used \msprime{} to simulate 500,000 samples of a
$200$ megabase chromosome with human-like parameters: $N_e=10^4$ and per-base mutation and
recombination rates of $10^{-8}$ per generation. This resulted
in about 1 million distinct marginal trees and $1.1$ million infinite-sites
mutations. The HDF5 file encoding the node, edge, site and mutation tables (as
described above) for this simulation consumed 157MiB of storage space. Using
the \msprime\ Python API, the time required to load this file into memory was
around 1.5 seconds, and the time required to iterate over all 1 million trees
was 2.7 seconds. In contrast, recording the topological information in Newick
format would require around 20 TiB and storing the genotype information
in VCF would require about 1 TiB (giving a compression factor of 144,000 in
this instance).
Working with either the Newick or VCF encoding
of this dataset would likely require several
days of CPU time just to read the information into memory.
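The following sketch shows how an experiment of this kind can be set up through the \msprime{} Python API;
the \texttt{simulate}, \texttt{dump} and \texttt{load} calls are those available at the time of writing,
and the parameters here are scaled down so that the example runs in seconds rather than reproducing the benchmark above.
\begin{lstlisting}[language=Python]
import msprime

# Coalescent simulation with human-like per-base rates, but a short
# chromosome and a small sample so the example runs quickly.
ts = msprime.simulate(
    sample_size=1000, Ne=10**4, length=1e6,
    recombination_rate=1e-8, mutation_rate=1e-8, random_seed=42)
print(ts.num_trees, "marginal trees and", ts.num_sites, "sites")

# The succinct tree sequence can be written to and read back from disk,
ts.dump("example.trees")
ts2 = msprime.load("example.trees")

# and iterating over every marginal tree is fast.
print(sum(1 for tree in ts2.trees()), "trees visited")
\end{lstlisting}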
\paragraph{Validity of a set of tables}
Given a set of node and edge tables as described above,
there are only two requirements that ensure the tables
describe a valid tree sequence.
% (There are essentially no such requirements on the site and mutation tables.)
These are:
\begin{enumerate}
\item Offspring must be born after their parents (and hence, no loops).
\item The set of intervals on which each individual is a child must be disjoint.
\end{enumerate}
A pair of node and edge tables that satisfy these two requirements
is guaranteed to uniquely describe at each point on the genome
a collection of directed, acyclic graphs -- in other words, a forest of trees.
For some applications it is necessary to check that at every point
there is only a \emph{single} tree.
Checking this is more difficult, but is implemented in \msprime{}'s API.
For efficiency, \msprime{} makes several other sortedness requirements on the tables
that are not necessarily satisfied by tables emitted by a forwards-time simulation.
\msprime{}'s API includes tools to rectify this by first sorting % (using \texttt{sort\_tables})
and then using the \texttt{simplify} algorithm described below, which works on sorted tables
and is guaranteed to produce a valid, \msprime{}-ready tree sequence.
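As a minimal sketch of this workflow
(assuming the \texttt{sort\_tables} and \texttt{simplify\_tables} entry points of the Tables API, as named at the time of writing),
a hand-written pair of tables can be sorted, simplified and loaded as follows:
\begin{lstlisting}[language=Python]
import msprime

nodes = msprime.NodeTable()
edges = msprime.EdgeTable()

# A tiny hand-written history: nodes 0 and 1 are alive now (time 0) and
# both inherit the whole chromosome [0, 1) from node 2, born 1 unit ago.
nodes.add_row(flags=1, time=0)   # node 0, a sample
nodes.add_row(flags=1, time=0)   # node 1, a sample
nodes.add_row(flags=0, time=1)   # node 2, their common parent
edges.add_row(left=0, right=1, parent=2, child=1)
edges.add_row(left=0, right=1, parent=2, child=0)  # deliberately out of order

# Enforce the sortedness requirements, then remove any history that is
# irrelevant to the samples; both calls modify the tables in place.
msprime.sort_tables(nodes=nodes, edges=edges)
msprime.simplify_tables(samples=[0, 1], nodes=nodes, edges=edges)

# The tables now describe a valid, msprime-ready tree sequence.
ts = msprime.load_tables(nodes=nodes, edges=edges)
print(ts.num_trees, ts.sample_size)
\end{lstlisting}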
%%%%%%
\subsection*{The \msprime\ Tables API}
The facilities for working with succinct tree sequences are implemented as part
of the \msprime\ Python API, which provides a powerful platform for processing
tree topology and mutation data. The new portions of \msprime{} that we discuss
here are dedicated to tree sequence input and output using simple tables of
data, as described above, so we refer to this as the ``Tables API''.
The Tables API is primarily designed to facilitate efficient interchange of
data between programs or between different modules of the same program. We
adopted a `columnar' design, where all the values for a
particular column are stored in adjacent memory locations.
There are many advantages to columnar storage -- for example, since adjacent
values in memory are from the same column, they tend to compress well,
and suitable encodings can be chosen on a per-column basis~\citep{abadi2006integrating}.
A particular advantage of this approach is that it enables very
efficient copying of data, and in principle zero-copy data access
(where a data consumer reads directly from the memory of a producer).
Our implementation
% uses the NumPy C API~\citep{walt2011numpy} to efficiently copy
efficiently copies data from Python as a NumPy array \citep{walt2011numpy}
into the low-level C library used to manipulate tree sequences.
This architecture allows for data transfer rates of gigabytes per second
(impossible under any text-based approach), while retaining excellent portability.
NumPy's array interface provides a great deal of flexibility and efficiency,
and makes it straightforward to transfer data from sources
such as HDF5 \citep{hdf5} or Dask~\citep{dask}.
For small scale data and debugging purposes, a simple text based format is also supported.
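For example, a simulator that accumulates node and edge data in NumPy arrays
can hand entire columns to the tables in single calls rather than appending row by row
(the \texttt{set\_columns} methods and column dtypes below are those of the Tables API at the time of writing):
\begin{lstlisting}[language=Python]
import numpy as np
import msprime

# Columns accumulated elsewhere, one entry per node or edge.
node_flags = np.ones(3, dtype=np.uint32)   # flag every node as a sample
node_time = np.array([1.0, 0.0, 0.0])      # node 0 is one generation old
edge_left = np.array([0.0, 0.0])
edge_right = np.array([1.0, 1.0])
edge_parent = np.array([0, 0], dtype=np.int32)
edge_child = np.array([1, 2], dtype=np.int32)

nodes = msprime.NodeTable()
edges = msprime.EdgeTable()
# Bulk, columnar transfer: each column is copied in a single call.
nodes.set_columns(flags=node_flags, time=node_time)
edges.set_columns(left=edge_left, right=edge_right,
                  parent=edge_parent, child=edge_child)

# Columns can be read back out as NumPy arrays in the same way.
print(nodes.time, edges.parent)
\end{lstlisting}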
The \msprime\ Python Tables API provides a general purpose toolkit for importing
and processing succinct tree sequences. Interoperation with Python simulators
is then straightforward. The implementation we benchmark here uses
\texttt{pybind11} (\url{https://github.com/pybind/pybind11/}) to interface
with the \fwdpp{} \cpp{} API \citep{fwdpp}. No modifications were
required to the \fwdpp{} code base; rather, we simply need to keep track of parent/offspring labels,
and perform simple processing of the recombination breakpoints from each mating
event to generate node and edge data. This information is then periodically copied
to the \msprime\ Tables API, where it is sorted and simplified.
\paragraph{Flexibility.}
To demonstrate the flexibility provided by the Tables API and provide an
implementation that decouples forward simulation internals from transfer of data
to \msprime, we also implemented a version of the simulations described in
``Simulation benchmarks'' separately in Python, described in Appendix~\ref{ss:simupop}.
In this proof-of-concept implementation, the simulation engine (we use \simupop{}, \citet{simupop})
invokes callbacks at critical points of the simulation, and we infer nodes and edges
from the information that is provided. Rows are appended to the tables
one-by-one, and the tables are periodically sorted and simplified to control
memory usage.
Benchmarking results from this implementation are shown (alongside results from \fwdpp{})
for simulations without selection in Appendix~\ref{ss:timing_nosel}:
a relatively modest speedup of around $5 \times$ is achieved, likely due to increased overhead.
%%%%%%
\subsection*{Recording the pedigree in forwards time}
To record the genealogical history of a forwards time simulation,
we need to record two things for each new chromosome:
the birth time; and the endpoints and parental IDs of each distinctly inherited segment.
These are naturally stored as the \emph{nodes} and \emph{edges} of a tree sequence.
To demonstrate the idea, we write out in pseudocode how to run a neutral Wright--Fisher simulation
that records genealogical history in this way.
The simulation will run for $T$ generations,
and has $N$ haploid individuals, each carrying a single chromosome of length $L$.
For simplicity, we sample exactly one crossover per generation.
We use $\randomuniform(A)$ to denote an element of the set $A$ chosen uniformly at random
(and all such instances are independent).
Given a node table $\Nt$, the function $\taddrow{\Nt}{t}$
adds a new node to the table $\Nt$ with time $t$
and returns the ID of this new node.
Similarly, the function $\taddrow{\Et}{\ell, r, p, c}$
adds a new edge $(\ell\text{eft}, r\text{ight}, p\text{arent}, c\text{hild})$ to the edge table $\Et$.
The function $\tsimplify{P, \Nt, \Et}$ (described below) % in section \ref{ss:simplify}
simplifies the history stored in the tables $\Nt$ and $\Et$
to the minimal information required to represent the genealogies of the list of node IDs $P$;
after simplification the nodes appearing in $P$ are relabeled $(0, 1, \ldots, |P|-1)$.
A step-by-step explanation follows the pseudocode.
\begin{taocpalg}{W}{Forwards-time tree sequence}
{Simulates a randomly mating population of $N$ haploid individuals with
chromosome of length $L$ for $T$ generations, and returns the node
and edge tables ($\Nt$ and $\Et$) recording the simulated history.
The tables are simplified every $s$ generations, removing genealogical
information from $\Nt$ and $\Et$ irrelevant to the current population $P$.
}
\algstep{W1.}{Initialisation.}{
Set $\Nt \leftarrow \tnodetable{}$, $\mathcal{E}
\leftarrow \tedgetable{}$, $t \leftarrow T$, and $j \leftarrow 0$.
For $0 \leq k < N$, set $P_k \leftarrow \taddrow{\Nt}{T}$.
}
\algstep{W2.}{Generation loop head: new node.}{Set $u \leftarrow \taddrow{\Nt}{t}$ and $P'_j \leftarrow u$.
}
\algstep{W3.}{Choose parents.}{Set $a \leftarrow \randomuniform(\{0, \dots, N - 1\})$,
$b \leftarrow \randomuniform(\{0, \dots, N - 1\})$ and $x \leftarrow \randomuniform((0, L))$.
}
\algstep{W4.}{Record edges.}{
Call $\taddrow{\Et}{0, x, P_a, u}$ and $\taddrow{\Et}{x, L, P_b, u}$.
}
\algstep{W5.}{Individual loop.}{Set $j \leftarrow j + 1$. If $j < N$ go to \algref{W2}.
Otherwise, if $t\bmod s \neq 0$ go to \algref{W7}.
}
\algstep{W6.}{Simplify.}{Call $\tsimplify{P', \Nt, \Et}$, and set $P'_k
\leftarrow k $ for $0 \leq k < N$. } %% too cryptic: (Tables may need to be sorted.)
\algstep{W7.}{Generation loop.}{Set $t \leftarrow t - 1$. If $t = 0$ terminate.
Set $P \leftarrow P'$, $j \leftarrow 0$, and go to \algref{W2}.
}
\end{taocpalg}
We begin in~\algref{W1} by creating new node and edge tables, and setting
our population $P$ (a vector of $N$ node IDs) to the initial population.
This initial population is a set of $N$ nodes with birth time $T$ generations
ago. We also initialise our generation clock $t$ and individual index $j$.
Step~\algref{W2} replaces the $j^\text{th}$ individual (with node ID $P_j$)
by creating a new node with birth time $t$ (and ID $u$).
In step~\algref{W3} we determine the new node's ancestry
by choosing two indexes $a$ and $b$ uniformly,
giving us parental IDs $P_a$ and $P_b$,
and choose a chromosomal breakpoint $x$.
We record the effects of this event by storing two new edges: one recording that the parent of node $u$
from $0$ to $x$ is $P_a$, and another recording that the parent of $u$
from $x$ to $L$ is $P_b$. Step \algref{W5} then iterates these steps
for each of the $N$ individuals for each generation.
At the end of a generation, we then check
if we need to simplify (done every $s$ generations).
If simplification is required, we do this in step \algref{W6} by calling the simplify function
on the node and edge tables with the current set of population IDs $P'$ as the samples.
This updates the tables in-place to remove all redundant
information, and remaps the specified sample IDs to $0, \dots, N - 1$ in the updated tables.
Hence, we set our current population IDs to
$0, \dots, N - 1$ after simplify has completed.
Step \algref{W7} loops these steps until the required number of generations have been simulated.
% then completes the algorithm by looping over generations;
% we decrement our clock $t$, and terminate if $t = 0$.
% Otherwise, we update our current population and individual index and loop
% back to \algref{W2}.
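The following Python sketch expresses the same logic through the Tables API.
It is a loose translation of Algorithm~W rather than a definitive implementation,
and the \texttt{sort\_tables} and \texttt{simplify\_tables} calls are those of the \msprime{} API at the time of writing.
\begin{lstlisting}[language=Python]
import random
import msprime

def wright_fisher(N, T, L, simplify_interval):
    """Neutral haploid Wright-Fisher model with one crossover per birth,
    recording nodes and edges in the spirit of Algorithm W."""
    nodes = msprime.NodeTable()
    edges = msprime.EdgeTable()
    # Founders are born T generations ago (larger time = further in the past).
    P = [nodes.add_row(flags=1, time=T) for _ in range(N)]
    for t in range(T - 1, -1, -1):  # children of generation t+1 are born at time t
        Pp = []
        for _ in range(N):
            u = nodes.add_row(flags=1, time=t)
            a, b = random.randrange(N), random.randrange(N)
            x = random.uniform(0, L)
            # u inherits [0, x) from P[a] and [x, L) from P[b].
            edges.add_row(left=0, right=x, parent=P[a], child=u)
            edges.add_row(left=x, right=L, parent=P[b], child=u)
            Pp.append(u)
        if t % simplify_interval == 0:
            msprime.sort_tables(nodes=nodes, edges=edges)
            msprime.simplify_tables(samples=Pp, nodes=nodes, edges=edges)
            Pp = list(range(N))  # the samples are relabelled 0, ..., N-1
        P = Pp
    return nodes, edges

nodes, edges = wright_fisher(N=10, T=100, L=1.0, simplify_interval=10)
ts = msprime.load_tables(nodes=nodes, edges=edges)
print(ts.num_trees, "marginal trees for the final generation")
\end{lstlisting}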
\begin{figure}
\begin{center}
\includegraphics{wf-before-after}
\end{center}
\caption{An example of a marginal genealogy from a Wright-Fisher simulation
with $N=5$. \textbf{(A)} the original tree including all
intermediate nodes and dead-ends, and \textbf{(B)} the minimal tree
relating all of the currently-alive individuals (27--31).
\label{fig:wf-trees}
}
\end{figure}
This algorithm records only topological information about the simulated genealogies,
but it is straightforward to add mutational information.
Mutations that occur during the simulation can be recorded
by simply storing the node in which they first occur, the derived state,
and (if not already present) the genomic position of the site at which it occurs.
This allows selected mutations, that the forwards time simulation must generate,
to be recorded in the tree sequence.
Neutral mutations can be generated after the simulation has completed, thus
avoiding the cost of generating the many mutations that are lost in the population.
This is straightforward to do because we have access to the marginal genealogies.
Figure~\ref{fig:wf-trees} shows
an example of a marginal genealogy produced by a forwards-time Wright--Fisher
process like Algorithm~\algref{W}.
On the left is the tree showing all the edges output by the simulation,
while on the right
is the minimal tree representing the ancestry of the current set of samples.
Clearly there is a great deal of redundancy in the topological
information output by the simulation.
This redundancy comes from two sources.
First, there are a large number of nodes in the tree that have only one child.
In Algorithm~\algref{W} we do not attempt to identify coalescence events,
but simply record all parent-child
relationships in the history of the population.
As such, many of these edges
will record the simple passing of genealogical information from parent to child
and only some small subset will correspond to coalescences within the marginal
trees. The second source of redundancy in the (unsimplified) output of Algorithm~\algref{W}
is due to the fact that lineages die out: a large number of
individuals in the simulation leave no descendants in the present day population.
Node 26 in Figure~\ref{fig:wf-trees}a, for example, leaves no
descendants in the current population, and so the entire path tracing back to
its common ancestor with 27 is redundant.
%%%%%%
\subsection*{Tree sequence simplification}
\label{ss:simplify}
It is desirable for many reasons to remove redundant information from a tree sequence.
To formalize this:
suppose that we are only interested in a subset of the nodes of a tree sequence
(which we refer to as our `samples'),
and wish to reduce this input tree sequence
to the smallest one that still completely describes the history of the specified samples,
having the following properties:
\begin{enumerate}
\item All marginal trees must match the subtree
of the corresponding tree in the input tree sequence
that is induced by the samples.
\item Within the marginal trees, all non-sample vertices must have at least
two children (i.e., unary tree vertices are removed).
\item Any nodes and edges not ancestral to any of the sampled nodes are removed.
\item There are no adjacent redundant edges, i.e., pairs of edges
$(\ell, x, p, c)$ and $(x, r, p, c)$ which can be represented with a single edge
$(\ell, r, p, c)$.
\end{enumerate}
\begin{figure}
\begin{center}
\includegraphics{method_diagram}
\end{center}
\caption{
An example of tree sequence simplification.
\textbf{(A)} The augmented pedigree diagram on the left
relates the ten homologous chromosomes of five diploid individuals (BC, DE, FG, HI, and JK)
to each other and to a common ancestral chromosome (A);
dotted lines connect the two chromosomes of each individual,
and solid lines lead to the products of their meioses.
The corresponding tables (right) have 11 node records (one for each chromosome)
and 15 edge records (one for each distinctly inherited segment).
Blue numbers denote crossing over locations --
for instance, $D$ and $E$ were parents to $G$,
who inherited the left 70\% of the chromosome from $E$ and the remainder from $D$.
$B$, $C$, $D$, and $E$ inherit clonally from $A$.
\textbf{(B)} The five distinct trees
found across the chromosome (blue numbers denote locations on the chromosome).
Labels after simplification are shown in red.
\textbf{(C)} Tables recording the tree sequence after simplification
with nodes $J$ and $K$ as samples.
The mapping from labels in the forwards time simulation to nodes in the tree sequence
is shown in red.
% which allows additional records to be added as the simulation progresses.
\label{fig:method_diagram}
}
\end{figure}
Simplification is essential
for keeping the information recorded by forwards simulation manageable,
and is also useful for extracting subsets of a tree sequence representing a very large dataset.
% The tree sequences produced by forwards simulations
% record all of history for everyone alive at any time through the simulation.
% Simplification is essential to reduce this
% to a manageable quantity that still contains all
% the information that we are interested in.
% Simplification is also useful if we have a
% tree sequence representing a large dataset and wish to extract the
% information relevant to a subset of the samples.
Our approach to simplification is based on Hudson's algorithm for simulating
the coalescent with recombination~\citep{hudson1983properties},
paralleling the implementation in \msprime{} \citep{kelleher2016efficient};
an implementation in pseudocode is provided in Appendix~\ref{ss:simplify_algorithm}.
Conceptually, this works by
(a) beginning by painting the chromosome in each sample a distinct color;
(b) moving back through history,
copying the colors of each chromosome to the portions of its parental chromosomes
from which it was inherited;
(c) each time we would paint two colors in the same spot (a coalescence),
record that information as an edge and instead paint a brand-new color;
and
(d) once all colors have coalesced on a given segment,
stop propagating it.
This ``paint pot'' description misses some details --
for instance, we must ensure that all coalescing segments in a given individual
are assigned the \emph{same} new color --
but is reasonably close.
Figure~\ref{fig:method_diagram} shows an example tree sequence,
before and after simplification,
and Figure~\ref{fig:simplify_state} depicts the ``paint pot'' state of the algorithm
during the process of simplifying this tree sequence.
\begin{figure}
\begin{center}
\includegraphics{simplify-state-diagram}
\end{center}
\caption{
A depiction of the state of the simplification algorithm
at each point in time,
in the example of Figure~\ref{fig:method_diagram}A.
Ancestral material from J (red) and K (blue) is traced up through the pedigree
until the two lineages coalesce;
the smaller colored chromosomes on either side of each solid arrow show the bits inherited from each of the two parental chromosomes;
each time two colors overlap, a coalescence occurs, and two edges are output.
For instance, both J and K inherit from H between 0.5 and 0.9,
which resulted in the first two edges of the simplified table of Figure~\ref{fig:method_diagram}C.
\label{fig:simplify_state}
}
\end{figure}
More concretely,
the algorithm works by moving back through time,
processing each parent in the input tree sequence in chronological order.
The main state of the algorithm at each point in time is a set of ancestral lineages,
and each lineage is a linked list of ancestral segments.
An ancestral segment $(\ell, r, u)$ is found in a lineage
if the output node $u$ inherits the genomic interval $[\ell, r)$ from that lineage
(and so $u$ corresponds to a ``color'' in the description above).
% These segments are stored in a collection of linked lists,
% one list of segments for each ancestral lineage present at that time.
We also maintain a map from input nodes to lineages. % called $A_j$
Crucially, the time required to run the algorithm is
linear in the number of edges of the input tree sequence.
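As a toy illustration of this state (for exposition only; the pseudocode in Appendix~\ref{ss:simplify_algorithm}
uses more efficient structures), a lineage can be held as a list of ancestral segments,
and a coalescence corresponds to an overlap between the segments of two lineages merging in a common ancestor:
\begin{lstlisting}[language=Python]
from dataclasses import dataclass

@dataclass
class Segment:
    left: float    # half-open genomic interval [left, right) ...
    right: float
    node: int      # ... inherited by this output node (a "colour")

def overlaps(lineage_a, lineage_b):
    """Yield (left, right, node_a, node_b) for every overlap between the
    segments of two lineages; each overlap is a coalescence and would be
    recorded as a pair of output edges."""
    for a in lineage_a:
        for b in lineage_b:
            left, right = max(a.left, b.left), min(a.right, b.right)
            if left < right:
                yield left, right, a.node, b.node

# Two children pass ancestral material up to a shared parent; they overlap
# on [0.5, 0.9) (compare J and K inheriting from H between 0.5 and 0.9).
child_1 = [Segment(0.0, 0.9, node=0)]
child_2 = [Segment(0.5, 1.0, node=1)]
print(list(overlaps(child_1, child_2)))   # [(0.5, 0.9, 0, 1)]
\end{lstlisting}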
%%%%%%%%%%%%%%
\subsubsection*{Sequential simplification and prior history}
\label{ss:seq_simp}
Any simulation scheme that records data into tables,
as Algorithm~\algref{W} does,
has its genealogical history available at any time as a tree sequence.
This has two additional advantages:
First, simplification can be run periodically through the simulation,
if we take the set of samples to be the entire currently alive population.
This is important in practice as it keeps memory usage from growing linearly (and quickly) with time.
Second, the simulation can
be \emph{begun} with a tree sequence produced by some other method -- for
instance, by a coalescent simulation with \msprime,
providing an easy, efficient way to specify prior history.
A natural question is now: how often should simplification occur?
Limited testing (described in Appendix~\ref{ss:gcinterval})
found that different simplification intervals affect run times by approximately
25\%, with the lowest run time occurring when simplifying every $10^3$ generations.
Thus, there is a memory-versus-speed tradeoff
-- simplifying more often would keep fewer extinct nodes and edges in memory.
% As shown in the next section,
% there is no computational advantage to simplifying more often than is necessary
% to keep memory usage down.
%
% \krt{Empirically, I did find that simplifying more often than every 1,000 generations gave a run-time hit of about 25
% percent???}
%%%%%%%%%%%%
\subsubsection*{Estimates of run-time complexity}
The simulation results shown in Figures \ref{fig:runtimes_selection} and \ref{fig:relative_speedup_selection} show that
pedigree tracking greatly speeds up forward-time simulations. Further, the run times with pedigree tracking become
near-linear in $\rho$ (Figure \ref{fig:runtimes_selection}). In this section, we explore the run-time complexity using a
simplified Wright-Fisher model with recombination.
%% Memory reduction thanks to simplification
Consider a simulation of a Wright-Fisher population of $2N$ haploid individuals
using Algorithm~\algref{W} for $T$ generations,
with simplify interval $s = 1$ so that redundant information is removed after every generation.
Since every chromosome inherits material from both parents,
without simplification this would produce tables of
$2NT$ nodes and
$4NT$ edges.
Suppose that $T$ is large enough that all samples coalesce within the simulation with high probability
($T = 20N$, say).
After simplification, we are left with the tree sequence describing the history
of only the current generation of $2N$ individuals.
Ignoring the possibility of coalescent events involving more than two lineages, this tree sequence has $4N-2$ edges to describe the leftmost tree;
and each time the marginal tree changes along the sequence,
four edges end and four new edges begin
(except for changes affecting the root, which require fewer; see \citet{kelleher2016efficient}).
Coalescent theory tells us that
the expected total length of the edges of a marginal tree is approximately $4N\log(2N)$,
which is also equal to the mean number of ancestral recombination events that occur on a branch of the marginal tree,
since we have taken one crossover per individual per generation.
Not all such recombinations actually change the tree topology,
so four times this gives us an upper bound on the expected number of additional edges.
Similarly, not every new edge derives from a never-before-seen node,
but the number of nodes is at most equal to the number of edges plus the sample size.
With $T=20N$, this reduces the initial storage of $120 N^2$ items to
% 4N - 2 + 2N + 4N log(2N) * 4 * 2 = 6N + 32 N log(2N)
$6N + 32 N \log(2N)$;
at $N=10^4$, a factor of 3,700.
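Explicitly, at $N = 10^4$ this comparison is
\begin{align*}
120 N^2 = 1.2 \times 10^{10}
\qquad \text{versus} \qquad
6N + 32 N \log(2N) \approx 6 \times 10^{4} + 3.2 \times 10^{5} \times 9.9 \approx 3.2 \times 10^{6} ,
\end{align*}
since $\log(2N) = \log(2 \times 10^4) \approx 9.9$.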
%% Progressive memory reduction as a function of T
To get an idea of how required space depends on the length of time between simplification steps,
suppose instead that we have run a simulation of $2N$ individuals for $T$ generations,
begun with no prior history.
How many of the resulting $4NT$ edges are required after simplification?
As above, the expected number of edges is bounded by $2N-2$ plus four times the mean marginal tree length.
Roughly, the length of time for which a marginal tree has $k$ tips is
$2N/(k(k-1)) = 2N(1/(k-1) - 1/k)$ generations,
and so the time to go from $N$ to $n$ tips is $2N(1/(n-1) - 1/(N-1))$.
This implies that after running for time $T$ we expect to
have around $n(T)$ roots, where $n(T) \approx 1 + 2N/(T+2)$.
The total tree length over this time is
$2 N \sum_{k=n(T)+1}^{N-1} 1/k$, which
% is approximately
% \begin{align*}
% 2 N \log\left( \frac{N}{1 + 2 N/(T+2)} \right) .
% \end{align*}
leads to an upper bound on the number of edges of
\begin{align}
\label{eqn:edge_bound}
2 N \left( 1 + 4 \log\left( \frac{NT}{T + 2 N} \right)\right) .
\end{align}
These calculations hold up quite well in practice,
as shown in Figure~\ref{fig:simplify_complexity}.
% by comparison to
% the number of edges actually required for 50 independent Wright--Fisher simulations,
% as a function of time (obtained from Algorithm~\algref{W} with $s=1$)
% is shown in Figure~\ref{fig:simplify_complexity}, compared to this prediction.
\begin{figure}
\begin{center}
\includegraphics{sims/simplify-results}
\end{center}
\caption{
Time and space complexity of simplify.
\textbf{(A)}
Number of edges in the simplified tree sequence
for 10 replicate Wright--Fisher simulations with $N=50$ as a function
of number of generations.
Each line is one simulation, the heavy blue line gives the average,
and the dashed line is the upper bound of equation \eqref{eqn:edge_bound}.
\textbf{(B)}
Time required to simplify the first $k$ edges of a large (4.2GiB)
unsimplified tree sequence produced by a forwards-time simulation plotted
against $k$. The time scales linearly with the number of input edges.
\textbf{(C)}
Time required to simplify the tree sequence resulting from a coalescent
simulation of 500,000 samples of a 200 megabase human chromosome
to a random subsample of $n$ samples, plotted against $n$
(note the log scale; the time scales logarithmically with $n$).
\label{fig:simplify_complexity}
}
\end{figure}
%% Above was edges; same holds for memory usage savings due to mutations
If we were to add mutations in forwards time
under the infinite-sites model with total mutation rate per generation $\mu$,
there would also be around $2NT\mu$ mutations (and the same number of sites).
Since mutations fall as recombinations do on the marginal trees,
adding them after simplification results in only around $4N\mu\log(2N)$ mutations,
and the same savings as we obtained from edges above,
even without considering the
computational burden of propagating mutations forwards across generations.
%% simplify run time
How does the computation \emph{time} required for simplification scale?
Simply because it must process each edge,
% below per discussion
% https://github.com/petrelharp/ftprime_ms/pull/37#discussion_r156617283
the simplification algorithm is at least linear in the number of edges of the input tree sequence.
Empirically the algorithm is exactly linear,
as seen in Figure~\ref{fig:simplify_complexity}B,
which shows the time required to simplify increasingly large subsets of a large tree sequence.
When simplifying the result of a forwards-time simulation, the number of edges is the main contributing factor.
Suppose on the other hand we want to
simplify an already-minimal but large tree sequence with $N$ nodes
to a subsample of size $n$.
How does the required time scale with $n$?
In this case, the computation is dominated by the size of the output tree sequence,
which grows with $\log(n)$, as shown in Figure~\ref{fig:simplify_complexity}C.
%% the below assumes naive propagation of genotypes
% Our method stores genealogies, and so records substantially more information
% than would a method only recording genotypes.
% However, since the simplification algorithm requires computational effort,
% it is informative to compare time complexity of the algorithm
% to one that propagates neutral genotypes.
% A typical individual differs at around $2 N \mu$ sites from the population's consensus sequence,
% so propagating these to offspring by simple copying will take $4 N^2 \mu$ operations per generation.
% On the other hand,
% in our scheme we must store $13N$ quantities per generation (two edges and one node per individual).
% The simplification algorithm is linear in the number of edges of the input tree sequence
% (because it must process each edge),
% and so multiplies this by a constant factor.
% These considerations imply that propagating neutral genotypes for $T$ generations has time complexity $O(\mu T N^2)$,
% while our implementation is only $O(T N \log(2N))$.
%%%%%%%%%%%%%%%%%%%%%%
\section*{Discussion}
In this paper, we have shown that storing pedigrees
and associated recombination events
in a forwards-time simulation
not only makes available a great deal more information about the simulated population,
but also can speed up the simulation by orders of magnitude.
To make this feasible,
we have described how to efficiently store this information in numerical tables,
and have described a fundamental algorithm for simplification of tree sequences.
Conceptually, recording of genealogical and recombination events
can happen independently of the details of simulation;
for this reason, we provide a well-defined and well-tested API in Python
for use in other code bases (a C API is also planned).
The idea of storing genealogical information to speed up simulations is not new.
It was implemented completely in AnA-FiTS~\citep{aberer2013rapid},
but without the critical step of discarding irrelevant genealogical information.
A similar but much more limited method for
discarding this information also appears in \citet{padhukasahasram2008exploring}.
The tree sequences produced by default by this method
are very compact, storing genotype \emph{and} genealogical information
in a small fraction of the space taken by a compressed VCF file.
The format also allows highly efficient processing for downstream analysis.
Efficient processing is possible because many statistics of interest for population genetics
are naturally expressed in terms of tree topologies,
and so can be quickly computed from the trees underlying the tree sequence format.
For example, pairwise nucleotide diversity, $\pi$, is the average density of
differences between sequences in the sample.
To compute this directly from sequence data at $m$ sites in $n$ samples
requires computing allele frequencies, taking $O(nm)$ operations.
By using the locations of the mutations on the marginal trees,
and the fact that these are correlated,
sequential tree algorithms similar to those in~\citep{kelleher2016efficient}
can do this in roughly $O(n + m + t \log n)$ operations, where
$t$ is the number of distinct trees.
The \msprime{} API provides a method to compute $\pi$ among arbitrary subsets of the
samples in a tree sequence, which took about 0.7 seconds when
applied to an example simulation of 100 megabases of human-like
sequence for 200,000 samples (about 500K sites). The corresponding
numeric genotype matrix required about 95GiB of RAM, and
calculating $\pi$ took about 66 seconds with NumPy.
% This is done by running $ python msprime-examples.py benchmark-pi
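A sketch of that comparison at a much smaller scale is below;
the \texttt{get\_pairwise\_diversity} method and \texttt{variants} iterator are those of the \msprime{} Python API at the time of writing,
and the explicit loop over allele frequencies stands in for the genotype-matrix calculation.
\begin{lstlisting}[language=Python]
import numpy as np
import msprime

ts = msprime.simulate(sample_size=1000, Ne=10**4, length=1e6,
                      recombination_rate=1e-8, mutation_rate=1e-8,
                      random_seed=7)

# Tree-based computation provided by the API.
pi_trees = ts.get_pairwise_diversity()

# Naive computation from allele frequencies: O(nm) in samples and sites.
n = ts.sample_size
pi_naive = 0.0
for variant in ts.variants():
    p = np.mean(variant.genotypes)   # derived allele frequency at this site
    pi_naive += 2 * p * (1 - p) * n / (n - 1)

print(pi_trees, pi_naive)   # these should agree closely
\end{lstlisting}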
Another attractive feature of this set of tools
is that it makes it easy to incorporate \emph{prior history},
simply by seeding the simulation with a (relatively inexpensive) coalescent simulation.
This allows for incorporation of deep-time history beyond the reach of individual-based simulations.
This may not even negatively affect realism,
since geographic structure from times longer ago than the mixing
time of migration across the range has limited effect on modern genealogies
\citep{wilkins2004separation},
other than possibly changing effective population size \citep{barton2002neutral,cox2002stepping}.
Simulating very large populations and entire genomes will likely require parallelization.