# Tree-Sequence Tutorial

Before starting the tutorial, you will need to install `msprime`. According to the documentation, one of the most straightforward ways to do this is via Anaconda:

```bash
# Create a new anaconda environment and install msprime.
conda create --name msprime-env
conda activate msprime-env
conda install -c conda-forge msprime
```

In [1]:
# Import packages.
import msprime
import numpy as np

# Print versions.
print("msprime", msprime.__version__)
print("numpy", np.__version__)

msprime 1.3.3
numpy 2.2.0


## Overview with a Single Tree

Let's start by simulating a tree-sequence with a single tree. To do so, we will ignore recombination (for now).

In [2]:
# Perform an ancestry simulation for three samples with an effective population size of 10,000.
ts = msprime.sim_ancestry(
    samples=[msprime.SampleSet(3, ploidy=1)],  # Sample haploids.
    population_size=1e4,
    random_seed=42,
)
# Show the tree-sequence.
print(ts)

╔═══════════════════════════╗
║TreeSequence               ║
╠═══════════════╤═══════════╣
║Trees          │          1║
╟───────────────┼───────────╢
║Sequence Length│          1║
╟───────────────┼───────────╢
║Time Units     │generations║
╟───────────────┼───────────╢
║Sample Nodes   │          3║
╟───────────────┼───────────╢
║Total Size     │    1.8 KiB║
╚═══════════════╧═══════════╝
╔═══════════╤════╤═════════╤════════════╗
║Table      │Rows│Size     │Has Metadata║
╠═══════════╪════╪═════════╪════════════╣
║Edges      │   4│136 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Individuals│   3│108 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Migrations │   0│  8 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Mutations  │   0│ 16 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Nodes      │   5│148 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Populations│   1│224 Bytes│         Yes║
╟───────────┼────┼────

The code above simulates a tree-sequence for three haploid samples drawn from a population with a diploid effective population size of 10,000. Since there is no recombination, the tree-sequence contains only a single tree. It might be helpful to visualize it:

In [3]:
# Print the tree as a text output.
print(ts.draw_text())

7181.42┊   4   ┊
       ┊  ┏┻━┓ ┊
3128.45┊  3  ┃ ┊
       ┊ ┏┻┓ ┃ ┊
0.00   ┊ 0 1 2 ┊
       0       1



Because we only have one tree, we can extract it directly; we will cover iterating over trees later.

In [4]:
# Extract the first and only tree from the tree-sequence.
tree = ts.first()
# Print a summary.
print(tree)

╔══════════════════════════════╗
║Tree                          ║
╠═══════════════════╤══════════╣
║Index              │         0║
╟───────────────────┼──────────╢
║Interval           │    0-1(1)║
╟───────────────────┼──────────╢
║Roots              │         1║
╟───────────────────┼──────────╢
║Nodes              │         5║
╟───────────────────┼──────────╢
║Sites              │         0║
╟───────────────────┼──────────╢
║Mutations          │         0║
╟───────────────────┼──────────╢
║Total Branch Length│17,491.302║
╚═══════════════════╧══════════╝



### Nodes and Edges

The node table records the information associated with each node, where a node represents an ancestral haploid genome. **NOTE:** The `id` column is only shown for display and is not actually stored in the table.
- `flags` column has a value of 1 if the node is a sample node and 0 otherwise.
- `population` column records the id of the population the node belongs to, or -1 if the node is not assigned to a population.
- `individual` column records the id of the individual the node belongs to, or -1 if the node is not associated with an individual.
- `time` column records the birth time of the node.
- `metadata` column contains any associated metadata for the node.

In [5]:
# View the node table.
print(ts.tables.nodes)

╔══╤═════╤══════════╤══════════╤══════════════╤════════╗
║id│flags│population│individual│time          │metadata║
╠══╪═════╪══════════╪══════════╪══════════════╪════════╣
║0 │    1│         0│         0│    0.00000000│        ║
║1 │    1│         0│         1│    0.00000000│        ║
║2 │    1│         0│         2│    0.00000000│        ║
║3 │    0│         0│        -1│3,128.45388481│        ║
║4 │    0│         0│        -1│7,181.42391743│        ║
╚══╧═════╧══════════╧══════════╧══════════════╧════════╝
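
To make the column meanings concrete, the rows printed above can be mirrored in plain Python. This is only an illustrative sketch: the `NodeRow` tuple and the `is_sample` helper are hypothetical names introduced here, not part of `msprime`. The values are copied from the table, and the two constants assume the conventions described above (a flag value of 1 marks a sample, and -1 means "not set").

```python
from typing import NamedTuple

# Assumed conventions, taken from the column descriptions above.
SAMPLE_FLAG = 1  # flag value that marks a sample node
NULL_ID = -1     # value used when no individual (or population) is recorded


class NodeRow(NamedTuple):
    """Hypothetical stand-in for one row of the node table."""

    flags: int
    population: int
    individual: int
    time: float


# Rows copied from the printed node table; the list index plays the role of `id`.
nodes = [
    NodeRow(1, 0, 0, 0.0),
    NodeRow(1, 0, 1, 0.0),
    NodeRow(1, 0, 2, 0.0),
    NodeRow(0, 0, NULL_ID, 3128.45388481),
    NodeRow(0, 0, NULL_ID, 7181.42391743),
]


def is_sample(row: NodeRow) -> bool:
    """Return True if the sample flag is set for this row."""
    return bool(row.flags & SAMPLE_FLAG)


sample_ids = [i for i, row in enumerate(nodes) if is_sample(row)]
ancestor_ids = [i for i, row in enumerate(nodes) if not is_sample(row)]
print("sample nodes:", sample_ids)
print("ancestral nodes:", ancestor_ids)
```

In this table the ancestral nodes are exactly the rows whose `individual` is -1, since only the sampled genomes were assigned to individuals.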



The edge table records the parent-child relationship between a pair of nodes over a genomic interval. **NOTE:** The `id` column is only shown for display and is not actually stored in the table.
- `left` column records the left coordinate (inclusive) of the half-open genomic interval `[left, right)` over which the `child` node inherits from the given `parent` node.
- `right` column records the right coordinate (exclusive) of the half-open genomic interval `[left, right)` over which the `child` node inherits from the given `parent` node.
- `parent` column records the node id of the `parent` node.
- `child` column records the node id of the `child` node.
- `metadata` column contains any associated metadata for that edge.

In [6]:
# View the edge table.
print(ts.tables.edges)

╔══╤════╤═════╤══════╤═════╤════════╗
║id│left│right│parent│child│metadata║
╠══╪════╪═════╪══════╪═════╪════════╣
║0 │   0│    1│     3│    0│        ║
║1 │   0│    1│     3│    1│        ║
║2 │   0│    1│     4│    2│        ║
║3 │   0│    1│     4│    3│        ║
╚══╧════╧═════╧══════╧═════╧════════╝
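
The two tables together are enough to rebuild the tree by hand. The sketch below copies the rows printed above into plain Python (the variable names are illustrative, not `msprime` API) and derives each branch length as the parent's time minus the child's time. The sum should agree with the `Total Branch Length` reported in the tree summary earlier.

```python
# Node times copied from the node table (list index = node id).
node_time = [0.0, 0.0, 0.0, 3128.45388481, 7181.42391743]

# Edge rows copied from the edge table: (left, right, parent, child).
edges = [
    (0, 1, 3, 0),
    (0, 1, 3, 1),
    (0, 1, 4, 2),
    (0, 1, 4, 3),
]

# Map each child to its parent; the root is the only node without a parent.
parent_of = {child: parent for _, _, parent, child in edges}
root = next(u for u in range(len(node_time)) if u not in parent_of)

# A branch length is the time of the parent minus the time of the child.
branch_length = {
    child: node_time[parent] - node_time[child]
    for child, parent in parent_of.items()
}
total_branch_length = sum(branch_length.values())

print("root:", root)
print("total branch length:", round(total_branch_length, 3))
```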



### Tree Traversal

Often one may want to traverse the tree to extract information from its nodes, and there are several orders in which the nodes can be visited. The most efficient traversal depends on the specific analysis, but here is a quick overview of three of the traversal orders available for the trees that `msprime` returns. **NOTE:** Subtrees are sorted by node id such that the "left" subtree corresponds to the child node with the smaller node id and the "right" subtree corresponds to the child node with the larger node id.

- **Preorder**: The root node is visited first, followed by its left subtree, and then its right subtree. 
- **Inorder**: The left subtree is visited first, followed by the root, and then the right subtree.
- **Postorder**: The left subtree is visited first, then the right subtree, and finally the root.

In [7]:
# For every traversal order.
for trav_order in ["preorder", "inorder", "postorder"]:
    # Print the traversal order.
    print(f"{trav_order}:\t", list(tree.nodes(order=trav_order)))

preorder:	 [4, 2, 3, 0, 1]
inorder:	 [2, 4, 0, 3, 1]
postorder:	 [2, 0, 1, 3, 4]


Let's quickly recap the traversal methods.

**Preorder (root $\rightarrow$ left subtree $\rightarrow$ right subtree)**
1. Start at the root, node **4**.
2. Visit the left subtree, node **2**.
3. Visit the right subtree:
   - Node **3** (root of the right subtree).
   - Node **0** (left child of node 3).
   - Node **1** (right child of node 3).

**Inorder (left subtree $\rightarrow$ root $\rightarrow$ right subtree)**
1. Start with the left subtree of the root, node **2**.
2. Visit the root, node **4**.
3. Visit the right subtree:
   - Node **0** (left child of node 3).
   - Node **3** (root of the right subtree).
   - Node **1** (right child of node 3).

**Postorder (left subtree $\rightarrow$ right subtree $\rightarrow$ root)**

1. Start with the left subtree of the root, node **2**.
2. Visit the right subtree:
   - Node **0** (left child of node 3).
   - Node **1** (right child of node 3).
   - Node **3** (root of the right subtree).
3. Visit the root, node **4**.
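
The three orders can also be written out as short recursive functions. This is a minimal sketch for a binary tree, with the topology above stored in a hypothetical `children` dictionary; it is meant to illustrate the definitions, not to reproduce how the library implements its traversals.

```python
# Topology of the tree above: internal node -> (left child, right child).
children = {4: (2, 3), 3: (0, 1)}


def preorder(node):
    """Visit the node, then its left subtree, then its right subtree."""
    if node not in children:
        return [node]
    left, right = children[node]
    return [node] + preorder(left) + preorder(right)


def inorder(node):
    """Visit the left subtree, then the node, then the right subtree."""
    if node not in children:
        return [node]
    left, right = children[node]
    return inorder(left) + [node] + inorder(right)


def postorder(node):
    """Visit the left subtree, then the right subtree, then the node."""
    if node not in children:
        return [node]
    left, right = children[node]
    return postorder(left) + postorder(right) + [node]


for name, traverse in [("preorder", preorder), ("inorder", inorder), ("postorder", postorder)]:
    print(f"{name}:\t", traverse(4))
```

Starting each function at the root, node 4, gives the same three orderings as the output of the traversal cell above.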


### Node Attributes

While traversing a tree, you may need information about its nodes. For instance, you may want to know whether a node is an internal or leaf node, the child nodes of an internal node, the parent of a node, the time of a node, the length of the branch connecting a node to its parent, or just the id of the edge connecting a node to its parent. Note that if a traversal order is not specified, the default is a preorder traversal.

In [8]:
# Traverse the tree in preorder.
for node in tree.nodes():
    # If the node is a leaf.
    if tree.is_leaf(node):
        # Find the parent node.
        parent = tree.parent(node)
        # Find the time of the node.
        time = tree.time(node)
        # Find the branch length between the node and its parent.
        branch_length = tree.branch_length(node)
        # Find the edge id of the edge between the node and its parent.
        edge_id = tree.edge(node)
        # Print a summary.
        print(
            f"Node {node} is a leaf node at time {time} whose parent is node {parent} and branch length {branch_length} (edge {edge_id})."
        )
    # Else, the node is an internal node.
    else:
        # Find the children of the node.
        left_child, right_child = tree.children(node)
        # Find the time of the node.
        time = tree.time(node)
        # Find the branch length between the left child and the node.
        left_branch_length = tree.branch_length(left_child)
        # Find the branch length between the right child and the node.
        right_branch_length = tree.branch_length(right_child)
        # Find the edge id of the edge between the left child and the node.
        left_edge_id = tree.edge(left_child)
        # Find the edge id of the edge between the right child and the node.
        right_edge_id = tree.edge(right_child)
        # Print a summary.
        print(
            f"Node {node} is an internal node at time {time} whose left child is node {left_child} with a branch length of {left_branch_length} (edge {left_edge_id}) and whose right child is node {right_child} with a branch length of {right_branch_length} (edge {right_edge_id})."
        )

Node 4 is an internal node at time 7181.423917434264 whose left child is node 2 with a branch length of 7181.423917434264 (edge 2) and whose right child is node 3 with a branch length of 4052.970032622001 (edge 3).
Node 2 is a leaf node at time 0.0 whose parent is node 4 and branch length 7181.423917434264 (edge 2).
Node 3 is an internal node at time 3128.453884812263 whose left child is node 0 with a branch length of 3128.453884812263 (edge 0) and whose right child is node 1 with a branch length of 3128.453884812263 (edge 1).
Node 0 is a leaf node at time 0.0 whose parent is node 3 and branch length 3128.453884812263 (edge 0).
Node 1 is a leaf node at time 0.0 whose parent is node 3 and branch length 3128.453884812263 (edge 1).


## Tree-Sequences

So far, we have ignored recombination. However, recombination is important because it creates multiple trees along the genome that are spatially autocorrelated. Let's re-run our previous example with recombination to generate a sequence of trees. **NOTE:** Previously we did not need to define a sequence length since there was no recombination; with recombination, you will need to specify one.

In [9]:
# Perform an ancestry simulation for three samples over a 10kb sequence with an effective population size of 10,000 and a recombination rate of 1e-8.
ts = msprime.sim_ancestry(
    samples=[msprime.SampleSet(3, ploidy=1)],  # Sample haploids.
    population_size=1e4,
    recombination_rate=1e-8,
    sequence_length=1e4,
    random_seed=42,
)
# Show the tree-sequence.
print(ts)

╔═══════════════════════════╗
║TreeSequence               ║
╠═══════════════╤═══════════╣
║Trees          │          3║
╟───────────────┼───────────╢
║Sequence Length│      10000║
╟───────────────┼───────────╢
║Time Units     │generations║
╟───────────────┼───────────╢
║Sample Nodes   │          3║
╟───────────────┼───────────╢
║Total Size     │    2.1 KiB║
╚═══════════════╧═══════════╝
╔═══════════╤════╤═════════╤════════════╗
║Table      │Rows│Size     │Has Metadata║
╠═══════════╪════╪═════════╪════════════╣
║Edges      │  10│328 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Individuals│   3│108 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Migrations │   0│  8 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Mutations  │   0│ 16 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Nodes      │   7│204 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Populations│   1│224 Bytes│         Yes║
╟───────────┼────┼────

From the output we can see that there are three trees in this tree-sequence, with seven unique nodes and ten unique edges. Now let's visualize the trees and verify that the tree-sequence summary is correct.

In [10]:
# Print the tree as a text output.
print(ts.draw_text())
# Print the number of nodes in the tree-sequence.
print(f"There are {ts.num_nodes} nodes in the tree-sequence.")
# Print the number of edges in the tree-sequence.
print(f"There are {ts.num_edges} edges in the tree-sequence.")

28162.10┊   6   ┊       ┊       ┊  
        ┊ ┏━┻┓  ┊       ┊       ┊  
7114.83 ┊ ┃  ┃  ┊       ┊   5   ┊  
        ┊ ┃  ┃  ┊       ┊ ┏━┻┓  ┊  
4407.97 ┊ ┃  4  ┊   4   ┊ ┃  4  ┊  
        ┊ ┃ ┏┻┓ ┊  ┏┻━┓ ┊ ┃ ┏┻┓ ┊  
4056.86 ┊ ┃ ┃ ┃ ┊  3  ┃ ┊ ┃ ┃ ┃ ┊  
        ┊ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┊ ┃ ┃ ┃ ┊  
0.00    ┊ 0 1 2 ┊ 0 2 1 ┊ 0 1 2 ┊  
        0     1912    8521    10000

There are 7 nodes in the tree-sequence.
There are 10 edges in the tree-sequence.


Now let's view the node and edge tables for this tree sequence.

In [11]:
# Print the node table.
print(ts.tables.nodes)
# Print the edge table.
print(ts.tables.edges)

╔══╤═════╤══════════╤══════════╤═══════════════╤════════╗
║id│flags│population│individual│time           │metadata║
╠══╪═════╪══════════╪══════════╪═══════════════╪════════╣
║0 │    1│         0│         0│     0.00000000│        ║
║1 │    1│         0│         1│     0.00000000│        ║
║2 │    1│         0│         2│     0.00000000│        ║
║3 │    0│         0│        -1│ 4,056.86242079│        ║
║4 │    0│         0│        -1│ 4,407.97125394│        ║
║5 │    0│         0│        -1│ 7,114.82581114│        ║
║6 │    0│         0│        -1│28,162.10275058│        ║
╚══╧═════╧══════════╧══════════╧═══════════════╧════════╝

╔══╤═════╤══════╤══════╤═════╤════════╗
║id│left │right │parent│child│metadata║
╠══╪═════╪══════╪══════╪═════╪════════╣
║0 │1,912│ 8,521│     3│    0│        ║
║1 │1,912│ 8,521│     3│    2│        ║
║2 │    0│10,000│     4│    1│        ║
║3 │    0│ 1,912│     4│    2│        ║
║4 │8,521│10,000│     4│    2│        ║
║5 │1,912│ 8,521│     4│    3│        ║
║

Note that, unlike the previous single-tree example, the edges in this tree-sequence span different genomic intervals.
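
One way to see how edge intervals define the trees is to work directly with the `left` and `right` coordinates. The sketch below uses only the six edge rows visible in the output above (the remaining rows are not shown), so it is an illustration rather than a full reconstruction; `edges_at` is a hypothetical helper.

```python
# The six visible edge rows: (left, right, parent, child).
visible_edges = [
    (1912, 8521, 3, 0),
    (1912, 8521, 3, 2),
    (0, 10000, 4, 1),
    (0, 1912, 4, 2),
    (8521, 10000, 4, 2),
    (1912, 8521, 4, 3),
]

# Every distinct edge endpoint is a position where the tree can change.
breakpoints = sorted(
    {x for left, right, _, _ in visible_edges for x in (left, right)}
)
tree_intervals = list(zip(breakpoints[:-1], breakpoints[1:]))


def edges_at(position):
    """Return (parent, child) pairs whose half-open interval covers `position`."""
    return [
        (parent, child)
        for left, right, parent, child in visible_edges
        if left <= position < right
    ]


print("tree intervals:", tree_intervals)
print("edges at 5000:", edges_at(5000))
print("edges at 8521:", edges_at(8521))
```

The endpoints recover the three intervals shown under the drawn trees. Position 8521 falls in the third interval rather than the second because `right` is exclusive, so the edges ending at 8521 no longer apply there.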

### Iterating Over Trees

Unlike in our first example without recombination, each tree in a tree-sequence covers a half-open genomic interval `[left, right)` delimited by recombination breakpoints. Let's now iterate through every tree in the tree-sequence and record each tree's index, genomic interval, genomic span, most recent common ancestor (MRCA), time to the most recent common ancestor (TMRCA), and total branch length.

In [12]:
# Iterate over the trees in the tree-sequence.
for tree in ts.trees():
    # Find the tree's index.
    index = tree.index
    # Find the tree's genomic interval.
    interval = tree.interval
    # Find the tree's genomic span.
    span = tree.span
    # Find the tree's MRCA node.
    mrca = tree.root
    # Find the tree's TMRCA.
    tmrca = tree.time(mrca)
    # Find the tree's total branch length.
    total_branch_length = tree.total_branch_length
    # Print a summary.
    print(
        f"Tree {index} is at the genomic interval [{interval.left}, {interval.right}) with span {span}bp whose MRCA node is {mrca} at time {tmrca} with total branch length {total_branch_length}."
    )

Tree 0 is at the genomic interval [0.0, 1912.0) with span 1912.0bp whose MRCA node is 6 at time 28162.10275058316 with total branch length 60732.176755101726.
Tree 1 is at the genomic interval [1912.0, 8521.0) with span 6609.0bp whose MRCA node is 4 at time 4407.971253935411 with total branch length 12872.804928661822.
Tree 2 is at the genomic interval [8521.0, 10000.0) with span 1479.0bp whose MRCA node is 5 at time 7114.825811141256 with total branch length 18637.622876217923.


Now let's combine what we have learned thus far by iterating over the trees in the tree-sequence, traversing each tree to find the first coalescent event (the most recent one, looking backward in time), and recording the time of that event and the two lineages involved.

In [13]:
# Iterate over the trees in the tree-sequence.
for tree in ts.trees():
    # Find the tree's index.
    index = tree.index
    # Initialize variables.
    first_coal_node = None
    first_coal_time = float("inf")
    # Traverse the tree in preorder.
    for node in tree.nodes():
        # If the node is an internal node.
        if tree.is_internal(node):
            # Find the time of the node.
            time = tree.time(node)
            # Determine if this is the most recent coalescent event seen so far.
            if time < first_coal_time:
                # Update the node of the first coalescent event.
                first_coal_node = node
                # Update the time of the first coalescent event.
                first_coal_time = time
    # Find the lineages that coalesce first.
    left_child, right_child = tree.children(first_coal_node)
    # Print a summary.
    print(
        f"Tree {index}: Leaf nodes {left_child} and {right_child} coalesce first at time {first_coal_time}."
    )

Tree 0: Leaf nodes 1 and 2 coalesce first at time 4407.971253935411.
Tree 1: Leaf nodes 0 and 2 coalesce first at time 4056.862420790999.
Tree 2: Leaf nodes 1 and 2 coalesce first at time 4407.971253935411.
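
The same search can be expressed more compactly: looking backward in time, the first coalescent event is simply the internal node with the smallest time. The sketch below checks this on the middle tree using values copied from the tables above; the two dictionaries are hypothetical stand-ins for the tree object.

```python
# Middle tree, interval [1912, 8521): internal node -> children (from the edge table).
children = {3: (0, 2), 4: (1, 3)}
# Times of the internal nodes, copied from the node table.
node_time = {3: 4056.86242079, 4: 4407.97125394}

# The most recent coalescent event is the internal node with the minimum time.
first_coal_node = min(children, key=node_time.get)
first_coal_time = node_time[first_coal_node]
left_child, right_child = children[first_coal_node]

print(
    f"Nodes {left_child} and {right_child} coalesce first at time {first_coal_time}."
)
```

With a real tree object, the same idea would apply to whatever collection of internal nodes and times the traversal provides.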


## Demography

Up to now, we have simulated a single population with a constant size. Now that we are more familiar with tree-sequences, let's generate a `Demography` object in `msprime` for a single population with a constant population size of 10,000.

In [14]:
# Initialize the demography object.
one_pop_const_demo = msprime.Demography()
# Initialize the population size of 10,000.
one_pop_const_demo.add_population(name="A", initial_size=1e4)
# View the demographic history.
print(one_pop_const_demo.debug())

DemographyDebugger
╠════════════════════════════════╗
║ Epoch[0]: [0, inf) generations ║
╠════════════════════════════════╝
╟    Populations (total=1 active=1)
║    ┌───────────────────────────────────────┐
║    │   │     start│       end│growth_rate  │
║    ├───────────────────────────────────────┤
║    │  A│   10000.0│   10000.0│ 0           │
║    └───────────────────────────────────────┘



Now, let's define a demographic history with two populations that diverged from an ancestral population 15,000 generations ago, all with a constant population size of 10,000.

In [15]:
# Initialize the demography object.
two_pop_const_demo = msprime.Demography()
# Initialize the populations and their associated sizes of 10,000.
two_pop_const_demo.add_population(name="A", initial_size=1e4)
two_pop_const_demo.add_population(name="B", initial_size=1e4)
two_pop_const_demo.add_population(name="C", initial_size=1e4)
# Initialize the time when populations A and B split from C 15,000 generations ago.
two_pop_const_demo.add_population_split(time=1.5e4, derived=["A", "B"], ancestral="C")
# View the demographic history.
print(two_pop_const_demo.debug())

DemographyDebugger
╠════════════════════════════════════╗
║ Epoch[0]: [0, 1.5e+04) generations ║
╠════════════════════════════════════╝
╟    Populations (total=3 active=2)
║    ┌───────────────────────────────────────────────┐
║    │   │     start│       end│growth_rate  │ A │ B │
║    ├───────────────────────────────────────────────┤
║    │  A│   10000.0│   10000.0│ 0           │ 0 │ 0 │
║    │  B│   10000.0│   10000.0│ 0           │ 0 │ 0 │
║    └───────────────────────────────────────────────┘
╟    Events @ generation 1.5e+04
║    ┌─────────────────────────────────────────────────────────────────────────────────┐
║    │     time│type        │parameters       │effect                                  │
║    ├─────────────────────────────────────────────────────────────────────────────────┤
║    │  1.5e+04│Population  │derived=[A, B],  │Moves all lineages from derived         │
║    │         │Split       │ancestral=C      │populations 'A' and 'B' to the          │
║    │         │    

Next, we'll set up a scenario where population "A" experienced a bottleneck 10,000 generations ago, reducing its population size from 10,000 to 5,000. 

In [16]:
# Initialize the demography object.
two_pop_bott_demo = msprime.Demography()
# Initialize the populations and their associated sizes.
two_pop_bott_demo.add_population(name="A", initial_size=5e3)
two_pop_bott_demo.add_population_parameters_change(
    time=1e4, population="A", initial_size=1e4
)
two_pop_bott_demo.add_population(name="B", initial_size=1e4)
two_pop_bott_demo.add_population(name="C", initial_size=1e4)
# Initialize the time when populations A and B split from C 15,000 generations ago.
two_pop_bott_demo.add_population_split(time=1.5e4, derived=["A", "B"], ancestral="C")
# View the demographic history.
print(two_pop_bott_demo.debug())

DemographyDebugger
╠══════════════════════════════════╗
║ Epoch[0]: [0, 1e+04) generations ║
╠══════════════════════════════════╝
╟    Populations (total=3 active=2)
║    ┌───────────────────────────────────────────────┐
║    │   │     start│       end│growth_rate  │ A │ B │
║    ├───────────────────────────────────────────────┤
║    │  A│    5000.0│    5000.0│ 0           │ 0 │ 0 │
║    │  B│   10000.0│   10000.0│ 0           │ 0 │ 0 │
║    └───────────────────────────────────────────────┘
╟    Events @ generation 1e+04
║    ┌───────────────────────────────────────────────────────────────────────────────────┐
║    │   time│type        │parameters            │effect                                 │
║    ├───────────────────────────────────────────────────────────────────────────────────┤
║    │  1e+04│Population  │population=A,         │initial_size → 1e+04 for population A  │
║    │       │parameter   │initial_size=10000.0  │                                       │
║    │       │chan
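
Because `msprime` measures time in generations before the present, the bottleneck in the cell above is written backward in time: population A starts at its present-day size of 5,000, and the size change at generation 10,000 restores the older size of 10,000. The sketch below spells out the resulting size history in plain Python. `size_of_a` is a hypothetical helper written for this tutorial, not an `msprime` function, and it assumes that lineages from A are in the ancestral population C from generation 15,000 onward.

```python
def size_of_a(generations_ago):
    """Size of the population carrying A's lineages at a given time in the past."""
    if generations_ago < 1e4:
        return 5e3  # present-day (post-bottleneck) size of A
    if generations_ago < 1.5e4:
        return 1e4  # size of A before the bottleneck
    return 1e4  # ancestral population C, reached at the split


for t in (0, 9_999, 10_000, 14_999, 15_000):
    print(f"{t:>6} generations ago: {size_of_a(t):,.0f}")
```

Each boundary is treated as the start of the older epoch, matching the half-open epochs such as `[0, 1e+04)` reported by the debugger.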

Let's simulate a 1Mb segment, sampling two lineages per population under each of these demographic models.


In [17]:
# Simulate a tree-sequence under the one population model.
ts_one_pop_const_demo = msprime.sim_ancestry(
    samples=[msprime.SampleSet(2, ploidy=1, population="A")],  # Sample haploids.
    demography=one_pop_const_demo,
    recombination_rate=1e-8,
    sequence_length=1e6,
    random_seed=42,
)
# Show the tree-sequence.
print(ts_one_pop_const_demo)

╔═══════════════════════════╗
║TreeSequence               ║
╠═══════════════╤═══════════╣
║Trees          │        261║
╟───────────────┼───────────╢
║Sequence Length│    1000000║
╟───────────────┼───────────╢
║Time Units     │generations║
╟───────────────┼───────────╢
║Sample Nodes   │          2║
╟───────────────┼───────────╢
║Total Size     │   27.6 KiB║
╚═══════════════╧═══════════╝
╔═══════════╤════╤═════════╤════════════╗
║Table      │Rows│Size     │Has Metadata║
╠═══════════╪════╪═════════╪════════════╣
║Edges      │ 522│ 16.3 KiB│          No║
╟───────────┼────┼─────────┼────────────╢
║Individuals│   2│ 80 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Migrations │   0│  8 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Mutations  │   0│ 16 Bytes│          No║
╟───────────┼────┼─────────┼────────────╢
║Nodes      │ 200│  5.5 KiB│          No║
╟───────────┼────┼─────────┼────────────╢
║Populations│   1│220 Bytes│         Yes║
╟───────────┼────┼────

In [18]:
# Simulate a tree-sequence under the two population model.
ts_two_pop_const_demo = msprime.sim_ancestry(
    samples=[
        msprime.SampleSet(2, ploidy=1, population="A"),
        msprime.SampleSet(2, ploidy=1, population="B"),
    ],  # Sample haploids.
    demography=two_pop_const_demo,
    recombination_rate=1e-8,
    sequence_length=1e6,
    random_seed=42,
)
# Show the tree-sequence.
print(ts_two_pop_const_demo)

╔═══════════════════════════╗
║TreeSequence               ║
╠═══════════════╤═══════════╣
║Trees          │        729║
╟───────────────┼───────────╢
║Sequence Length│    1000000║
╟───────────────┼───────────╢
║Time Units     │generations║
╟───────────────┼───────────╢
║Sample Nodes   │          4║
╟───────────────┼───────────╢
║Total Size     │   94.6 KiB║
╚═══════════════╧═══════════╝
╔═══════════╤═════╤═════════╤════════════╗
║Table      │Rows │Size     │Has Metadata║
╠═══════════╪═════╪═════════╪════════════╣
║Edges      │2,017│ 63.0 KiB│          No║
╟───────────┼─────┼─────────┼────────────╢
║Individuals│    4│136 Bytes│          No║
╟───────────┼─────┼─────────┼────────────╢
║Migrations │    0│  8 Bytes│          No║
╟───────────┼─────┼─────────┼────────────╢
║Mutations  │    0│ 16 Bytes│          No║
╟───────────┼─────┼─────────┼────────────╢
║Nodes      │  489│ 13.4 KiB│          No║
╟───────────┼─────┼─────────┼────────────╢
║Populations│    3│294 Bytes│         Yes║
╟───────

In [19]:
# Simulate a tree-sequence under the two population model with a bottleneck event in population A.
ts_two_pop_bott_demo = msprime.sim_ancestry(
    samples=[
        msprime.SampleSet(2, ploidy=1, population="A"),
        msprime.SampleSet(2, ploidy=1, population="B"),
    ],  # Sample haploids.
    demography=two_pop_bott_demo,
    recombination_rate=1e-8,
    sequence_length=1e6,
    random_seed=42,
)
# Show the tree-sequence.
print(ts_two_pop_bott_demo)

╔═══════════════════════════╗
║TreeSequence               ║
╠═══════════════╤═══════════╣
║Trees          │        647║
╟───────────────┼───────────╢
║Sequence Length│    1000000║
╟───────────────┼───────────╢
║Time Units     │generations║
╟───────────────┼───────────╢
║Sample Nodes   │          4║
╟───────────────┼───────────╢
║Total Size     │   83.9 KiB║
╚═══════════════╧═══════════╝
╔═══════════╤═════╤═════════╤════════════╗
║Table      │Rows │Size     │Has Metadata║
╠═══════════╪═════╪═════════╪════════════╣
║Edges      │1,765│ 55.2 KiB│          No║
╟───────────┼─────┼─────────┼────────────╢
║Individuals│    4│136 Bytes│          No║
╟───────────┼─────┼─────────┼────────────╢
║Migrations │    0│  8 Bytes│          No║
╟───────────┼─────┼─────────┼────────────╢
║Mutations  │    0│ 16 Bytes│          No║
╟───────────┼─────┼─────────┼────────────╢
║Nodes      │  455│ 12.4 KiB│          No║
╟───────────┼─────┼─────────┼────────────╢
║Populations│    3│294 Bytes│         Yes║
╟───────

Now let's compute the mean pairwise TMRCA for samples within and between populations under each demographic model, averaging over trees in proportion to their genomic span. **NOTE:** Population A's id is 0 and B's id is 1.
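
Written out, the quantity accumulated in the cells below is a span-weighted average. The symbols are introduced here only for illustration: $T_i$ is the pairwise TMRCA in tree $i$, $s_i$ is that tree's span, and $L$ is the sequence length.

$$
\bar{T} = \sum_{i} \frac{s_i}{L} \, T_i, \qquad \text{where} \quad \sum_{i} s_i = L
$$

A tree that covers more of the genome therefore contributes proportionally more to the average.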


In [20]:
# Initialize the sequence length.
seq_len = 1e6
# Extract the sample indices.
idx_1, idx_2 = ts_one_pop_const_demo.samples(0)
# Initialize the TMRCA.
tmrca_one_pop_const_demo = 0
# Iterate over the trees in the tree-sequence.
for tree in ts_one_pop_const_demo.trees():
    # Find the tree's genomic span.
    span = tree.span
    # Determine the tree's weight.
    tree_weight = span / seq_len
    # Find the TMRCA.
    tmrca = tree.tmrca(idx_1, idx_2)
    # Update the TMRCA.
    tmrca_one_pop_const_demo += tmrca * tree_weight
# Print a summary.
print(
    f"The TMRCA of the two samples in the one population model is {tmrca_one_pop_const_demo}."
)

The TMRCA of the two samples in the one population model is 19043.508095066813.


In [21]:
# Initialize the sequence length.
seq_len = 1e6
# Extract the sample indices.
pop_a1, pop_a2 = ts_two_pop_const_demo.samples(0)
pop_b1, pop_b2 = ts_two_pop_const_demo.samples(1)
# Initialize the TMRCAs.
tmrca_two_pop_const_demo = {
    "A": {(pop_a1, pop_a2): 0},
    "B": {(pop_b1, pop_b2): 0},
    "A-B": {
        (pop_a1, pop_b1): 0,
        (pop_a1, pop_b2): 0,
        (pop_a2, pop_b1): 0,
        (pop_a2, pop_b2): 0,
    },
}
# Iterate over the trees in the tree-sequence.
for tree in ts_two_pop_const_demo.trees():
    # Find the tree's genomic span.
    span = tree.span
    # Determine the tree's weight.
    tree_weight = span / seq_len
    # Find the TMRCAs.
    tmrca_two_pop_const_demo["A"][(pop_a1, pop_a2)] += (
        tree.tmrca(pop_a1, pop_a2) * tree_weight
    )
    tmrca_two_pop_const_demo["B"][(pop_b1, pop_b2)] += (
        tree.tmrca(pop_b1, pop_b2) * tree_weight
    )
    tmrca_two_pop_const_demo["A-B"][(pop_a1, pop_b1)] += (
        tree.tmrca(pop_a1, pop_b1) * tree_weight
    )
    tmrca_two_pop_const_demo["A-B"][(pop_a1, pop_b2)] += (
        tree.tmrca(pop_a1, pop_b2) * tree_weight
    )
    tmrca_two_pop_const_demo["A-B"][(pop_a2, pop_b1)] += (
        tree.tmrca(pop_a2, pop_b1) * tree_weight
    )
    tmrca_two_pop_const_demo["A-B"][(pop_a2, pop_b2)] += (
        tree.tmrca(pop_a2, pop_b2) * tree_weight
    )
# Print a summary.
print(
    f"The TMRCA of the two samples in population A is {tmrca_two_pop_const_demo['A'][(pop_a1, pop_a2)]}."
)
print(
    f"The TMRCA of the two samples in population B is {tmrca_two_pop_const_demo['B'][(pop_b1, pop_b2)]}."
)
print(
    f"The mean TMRCA between two samples in populations A and B is {sum([value for value in tmrca_two_pop_const_demo['A-B'].values()]) / 4}."
)

The TMRCA of the two samples in population A is 21177.942974200236.
The TMRCA of the two samples in population B is 19231.466887275474.
The mean TMRCA between two samples in populations A and B is 35293.538128770604.


In [22]:
# Initialize the sequence length.
seq_len = 1e6
# Extract the sample indices.
pop_a1, pop_a2 = ts_two_pop_bott_demo.samples(0)
pop_b1, pop_b2 = ts_two_pop_bott_demo.samples(1)
# Initialize the TMRCAs.
tmrca_two_pop_bott_demo = {
    "A": {(pop_a1, pop_a2): 0},
    "B": {(pop_b1, pop_b2): 0},
    "A-B": {
        (pop_a1, pop_b1): 0,
        (pop_a1, pop_b2): 0,
        (pop_a2, pop_b1): 0,
        (pop_a2, pop_b2): 0,
    },
}
# Iterate over the trees in the tree-sequence.
for tree in ts_two_pop_bott_demo.trees():
    # Find the tree's genomic span.
    span = tree.span
    # Determine the tree's weight.
    tree_weight = span / seq_len
    # Find the TMRCAs.
    tmrca_two_pop_bott_demo["A"][(pop_a1, pop_a2)] += (
        tree.tmrca(pop_a1, pop_a2) * tree_weight
    )
    tmrca_two_pop_bott_demo["B"][(pop_b1, pop_b2)] += (
        tree.tmrca(pop_b1, pop_b2) * tree_weight
    )
    tmrca_two_pop_bott_demo["A-B"][(pop_a1, pop_b1)] += (
        tree.tmrca(pop_a1, pop_b1) * tree_weight
    )
    tmrca_two_pop_bott_demo["A-B"][(pop_a1, pop_b2)] += (
        tree.tmrca(pop_a1, pop_b2) * tree_weight
    )
    tmrca_two_pop_bott_demo["A-B"][(pop_a2, pop_b1)] += (
        tree.tmrca(pop_a2, pop_b1) * tree_weight
    )
    tmrca_two_pop_bott_demo["A-B"][(pop_a2, pop_b2)] += (
        tree.tmrca(pop_a2, pop_b2) * tree_weight
    )
# Print a summary.
print(
    f"The TMRCA of the two samples in population A is {tmrca_two_pop_bott_demo['A'][(pop_a1, pop_a2)]}."
)
print(
    f"The TMRCA of the two samples in population B is {tmrca_two_pop_bott_demo['B'][(pop_b1, pop_b2)]}."
)
print(
    f"The mean TMRCA between two samples in populations A and B is {sum([value for value in tmrca_two_pop_bott_demo['A-B'].values()]) / 4}."
)

The TMRCA of the two samples in population A is 10488.205749724497.
The TMRCA of the two samples in population B is 19872.234884219277.
The mean TMRCA between two samples in populations A and B is 35490.82504041501.


How and why do the within- and between-population TMRCA estimates differ among the demographic models?
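
As a starting point for answering this, the simulated values can be compared with rough theoretical expectations. The sketch below assumes the standard coalescent result that two lineages in a population of diploid size $N$ coalesce at rate $1/(2N)$ per generation, and that lineages in different populations cannot coalesce before the split. `expected_pair_tmrca` is a hypothetical helper, and a single 1Mb simulation is expected to scatter around these values rather than match them.

```python
import math


def expected_pair_tmrca(epochs):
    """Expected TMRCA of two lineages under piecewise-constant population sizes.

    `epochs` is a list of (duration, diploid_size) pairs ordered backward in
    time; the last duration should be math.inf.
    """
    expected = 0.0
    survival = 1.0  # probability that the pair has not yet coalesced
    for duration, size in epochs:
        scale = 2 * size  # mean waiting time to coalescence at this size
        p_no_coal = math.exp(-duration / scale)
        expected += survival * scale * (1 - p_no_coal)
        survival *= p_no_coal
    return expected


split_time = 1.5e4

# One population (or either population in the constant-size split model).
within_constant = expected_pair_tmrca([(math.inf, 1e4)])
# Population A with the bottleneck: size 5,000 for 10,000 generations, then 10,000.
within_bottleneck = expected_pair_tmrca([(1e4, 5e3), (math.inf, 1e4)])
# Between A and B: no coalescence before the split, then the size of C.
between = split_time + expected_pair_tmrca([(math.inf, 1e4)])

print(f"within, constant size: {within_constant:,.0f}")
print(f"within A, bottleneck:  {within_bottleneck:,.0f}")
print(f"between A and B:       {between:,.0f}")
```

Under these assumptions the expectations are about 20,000 generations within a constant-size population, about 13,700 within the bottlenecked population A, and 35,000 between A and B, which is the same ordering seen in the simulated values above.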