<a href="https://colab.research.google.com/gist/mmore500/a2e88e7c239935c362ec59c6b5a3f7b5/reconstruction-quality-experiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Procedure:

For each experimental replicate of each treatment:
- Navigate to <https://colab.research.google.com/gist/mmore500/a2e88e7c239935c362ec59c6b5a3f7b5> to open a fresh copy of the experiment notebook. **Open a fresh notebook copy for each treatment.**
- Click the filename at the top left of the Colab page (`a2e88e7c239935c362ec59c6b5a3f7b5`) and rename it according to the template
  - `evo=island{num_islands}-niche{num_niches}-ngen{num_generations}-popsize{population_size}-tournsize{tournament_size}+instrument={"steady"|"tilted"}-{"old"|"new"}-bits{annotation_size_bits}-diff{differentia_width}+replicate={replicate}+ext=.ipynb`.
  - For example, `evo=island1-niche1-ngen10000-popsize1024-tournsize2+instrument=steady-old-bits64-diff1+replicate=0+ext=.ipynb`.
- Configure variables in the "Configure Experiment" section.
- On the top menu, click `Runtime > Restart session and run all` if available, otherwise `Runtime > Run all`.
- Wait for the final cell to finish executing.
- Record configured variables and results from "Evaluate Reconstruction" section in [results spreadsheet](https://docs.google.com/spreadsheets/d/1ZhS4NDTDyBiwmwtWrZO5L06MGB3lhmp2-5ZzClhEwPU/edit?usp=sharing).
- On the top menu, click `File > Download > Download .ipynb`.
- Upload the `.ipynb` file to the treatment directory at <https://osf.io/n4b2g/>. The directory is named the same as the notebook, excluding `+replicate={replicate}+ext=.ipynb`.
  - The treatment directory should contain one notebook per replicate of that treatment.
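
The naming template above can also be assembled programmatically, which avoids typos when renaming by hand. The following is a minimal sketch; the helper name `make_notebook_filename` and its `behavior`/`implementation` arguments are hypothetical and not part of the notebook.

```python
def make_notebook_filename(
    num_islands, num_niches, num_generations, population_size,
    tournament_size, behavior, implementation,
    annotation_size_bits, differentia_width, replicate,
):
    """Build a notebook filename following the procedure's template."""
    # hypothetical argument names; allowed values come from the template
    assert behavior in ("steady", "tilted")
    assert implementation in ("old", "new")
    evo = (
        f"island{num_islands}-niche{num_niches}-ngen{num_generations}"
        f"-popsize{population_size}-tournsize{tournament_size}"
    )
    instrument = (
        f"{behavior}-{implementation}"
        f"-bits{annotation_size_bits}-diff{differentia_width}"
    )
    return (
        f"evo={evo}+instrument={instrument}"
        f"+replicate={replicate}+ext=.ipynb"
    )


print(make_notebook_filename(1, 1, 10000, 1024, 2, "steady", "old", 64, 1, 0))
```

Dropping the trailing `+replicate={replicate}+ext=.ipynb` from the result gives the treatment directory name.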


## Set Up Environment

In [1]:
!python3 -m pip install \
    "alifedata_phyloinformatics_convert==0.15.1" \
    "biopython==1.83" \
    "dendropy==4.6.1" \
    "git+https://github.com/mmore500/hstrat-surface-concept.git@v0.1.0#egg=hsurf" \
    "hstrat==1.9.1" \
    "matplotlib==3.8.2" \
    "pandas==1.5.3" \
    "tqdist==1.0" \
    "tqdm==4.66.1" \
    "typing_extensions>=4.9.0" \
    "watermark==2.4.3"

Collecting hsurf
  Cloning https://github.com/mmore500/hstrat-surface-concept.git (to revision v0.1.0) to /tmp/pip-install-hv_l4nh1/hsurf_5bc4e309f5644ab8b9fd3e5590f51ab1
  Running command git clone --filter=blob:none --quiet https://github.com/mmore500/hstrat-surface-concept.git /tmp/pip-install-hv_l4nh1/hsurf_5bc4e309f5644ab8b9fd3e5590f51ab1
  Running command git checkout -q 0873dcb9281393fc952925fe35debec6a05c5f1e
  Resolved https://github.com/mmore500/hstrat-surface-concept.git to commit 0873dcb9281393fc952925fe35debec6a05c5f1e
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting hstrat==1.9.1
  Downloading hstrat-1.9.1-py2.py3-none-any.whl (548 kB)

In [2]:
from collections import Counter
import typing

import alifedata_phyloinformatics_convert as apc
from Bio import Phylo
import dendropy as dp
from hstrat import hstrat
from hstrat import _auxiliary_lib as hstrat_aux
from hsurf import hsurf
from matplotlib import pyplot as plt
import pandas as pd
import tqdist
from tqdm import tqdm

## Configure Experiment

Configure instrumentation. **Edit me**

In [3]:
# TODO Uncomment one...
# annotation_size_bits = 64
annotation_size_bits = 256
# annotation_size_bits = 1024
assert annotation_size_bits.bit_count() == 1, "must be power of 2 (1, 2, 4, 8, etc.)"

# TODO Uncomment one...
differentia_width_bits = 1
# differentia_width_bits = 8
assert differentia_width_bits.bit_count() == 1, "must be power of 2 (1, 2, 4, 8, etc.)"

# TODO Uncomment one...
# stratum_retention_algo = hstrat.depth_proportional_resolution_tapered_algo  # old impl/steady behavior
stratum_retention_algo = hstrat.recency_proportional_resolution_curbed_algo  # old impl/tilted behavior
# stratum_retention_algo = hsurf.stratum_retention_interop_steady_algo  # new impl/steady behavior
# stratum_retention_algo = hsurf.stratum_retention_interop_tilted_sticky_algo  # new impl/tilted behavior

Configure evolutionary scale. **Edit me**

In [4]:
# TODO Uncomment one...
# population_size = 1024  # default condition
population_size = 65536  # alternate condition
assert population_size.bit_count() == 1, "must be power of 2 (1, 2, 4, 8, etc.)"

# TODO Uncomment one...
num_generations = 10000  # default condition
# num_generations = 100000  # alternate condition


Configure evolutionary conditions. **Edit me**

In [5]:
# TODO Uncomment one...
num_islands = 1  # default condition
# num_islands = 64  # alternate condition
assert num_islands.bit_count() == 1, "must be power of 2 (1, 2, 4, 8, etc.)"

# TODO Uncomment one...
num_niches = 1  # default condition
# num_niches = 8  # alternate condition
assert num_niches.bit_count() == 1, "must be power of 2 (1, 2, 4, 8, etc.)"

# TODO Uncomment one...
tournament_size = 2  # default condition
# tournament_size = 1  # alternate condition
# tournament_size = 8  # alternate condition


Configure experimental replicate. **Edit me**

In [6]:
replicate = 0 # TODO set to a number, 0 through 19

Set up random number generator. (Do not edit.)

In [7]:
seed = hash(
  (
      replicate,
      population_size,
      num_generations,
      num_islands,
      num_niches,
      tournament_size,
  )
) % 2 ** 32

seed

3263141568
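
The cell above derives the seed with the built-in `hash()` over a tuple of integers. Integer hashes are not salted per session, but the resulting value may still vary with interpreter version or platform word size. As a hedged alternative, the sketch below derives a 32-bit seed from a SHA-256 digest. The helper name `derive_seed` is hypothetical, and it produces different seeds than the notebook's cell, so it is not a drop-in replacement when reproducing recorded results.

```python
import hashlib


def derive_seed(*params: int) -> int:
    """Map integer parameters to a 32-bit seed, stable across platforms."""
    # serialize the parameters unambiguously, e.g. b"0,65536,10000,1,1,2"
    payload = ",".join(str(p) for p in params).encode("ascii")
    digest = hashlib.sha256(payload).digest()
    # keep the first four bytes of the digest as an unsigned 32-bit integer
    return int.from_bytes(digest[:4], "big")


# replicate, population_size, num_generations,
# num_islands, num_niches, tournament_size
print(derive_seed(0, 65536, 10000, 1, 1, 2))
```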

In [8]:
from hstrat._auxiliary_lib import seed_random

seed_random(seed)


Parametrize instrumentation. (Do not edit.)

In [9]:
annotation_capacity_strata = annotation_size_bits // differentia_width_bits
assert annotation_capacity_strata.bit_count() == 1, "must be power of 2 (1, 2, 4, 8, etc.)"
print(f"{annotation_capacity_strata=}")

parametrized_policy = stratum_retention_algo.Policy(
  parameterizer=hstrat.PropertyAtMostParameterizer(
    target_value=annotation_capacity_strata,
    policy_evaluator=hstrat.NumStrataRetainedUpperBoundEvaluator(
      at_num_strata_deposited=num_generations,
    ),
    param_lower_bound=2,
    param_upper_bound=1024,
  ),
)

print(f"{parametrized_policy=}")
print(f"num strata retained upper bound {parametrized_policy.CalcNumStrataRetainedUpperBound(num_generations)}")


annotation_capacity_strata=256
parametrized_policy=depth_proportional_resolution_tapered_algo.Policy(policy_spec=depth_proportional_resolution_tapered_algo.PolicySpec(depth_proportional_resolution=127))
num strata retained upper bound 255
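
The capacity arithmetic in the cell above is plain integer division of the annotation size by the differentia width. The sketch below tabulates it for the options listed in the "Configure Experiment" section. The helper names are hypothetical, and the power-of-two check uses a bit trick that also works on Python versions before 3.10, where `int.bit_count()` is unavailable.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; equivalent to n.bit_count() == 1 for n > 0."""
    return n > 0 and n & (n - 1) == 0


def capacity_in_strata(annotation_size_bits: int, differentia_width_bits: int) -> int:
    """Number of differentia that fit in an annotation of the given size."""
    assert is_power_of_two(annotation_size_bits)
    assert is_power_of_two(differentia_width_bits)
    return annotation_size_bits // differentia_width_bits


for bits in (64, 256, 1024):
    for width in (1, 8):
        strata = capacity_in_strata(bits, width)
        print(f"bits={bits:4d} width={width} -> {strata} strata")
```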


## Setup

Helper functions.

In [10]:
def calc_tqdist_distance(
    x: pd.DataFrame,
    y: pd.DataFrame,
    progress_wrap: typing.Callable = lambda x: x,
) -> typing.Dict[str, float]:
    """Calculate dissimilarity between two trees.

    Returns quartet and triplet distances, used to measure how accurate
    tree reconstructions are.
    """
    tree_a = apc.RosettaTree(x).as_dendropy
    tree_b = apc.RosettaTree(y).as_dendropy

    # must suppress root unifurcations or tqdist barfs
    # see https://github.com/uym2/tripVote/issues/15
    tree_a.unassign_taxa(exclude_leaves=True)
    tree_a.suppress_unifurcations()
    tree_b.unassign_taxa(exclude_leaves=True)
    tree_b.suppress_unifurcations()

    tree_a_taxon_labels = [
        leaf.taxon.label for leaf in progress_wrap(tree_a.leaf_node_iter())
    ]
    tree_b_taxon_labels = [
        leaf.taxon.label for leaf in progress_wrap(tree_b.leaf_node_iter())
    ]
    # zip(strict=True) raises ValueError if the leaf counts differ;
    # all() only serves to exhaust the iterator
    all(
        progress_wrap(
            zip(tree_a.leaf_node_iter(), tree_b.leaf_node_iter(), strict=True),
        ),
    )
    assert sorted(tree_a_taxon_labels) == sorted(tree_b_taxon_labels)
    assert sorted(tree_a_taxon_labels) == sorted(
        x.loc[hstrat_aux.alifestd_find_leaf_ids(x), "taxon_label"],
    )
    assert sorted(tree_a_taxon_labels) == sorted(
        y.loc[hstrat_aux.alifestd_find_leaf_ids(y), "taxon_label"],
    )
    for taxon_label in progress_wrap(tree_a_taxon_labels):
        assert taxon_label
        assert taxon_label.strip()

    newick_a = tree_a.as_string(schema="newick").strip()
    newick_b = tree_b.as_string(schema="newick").strip()

    return {
        "quartet_distance": tqdist.quartet_distance(newick_a, newick_b),
        "quartet_distanc_rawe": tqdist.quartet_distance_raw(newick_a, newick_b),
        "triplet_distance": tqdist.triplet_distance(newick_a, newick_b),
        "triplet_distance_raw": tqdist.triplet_distance_raw(newick_a, newick_b),
    }
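
To make the triplet distance concrete, here is a small pure-Python sketch that compares two rooted trees over all leaf triplets. It is an illustration only, not how `tqdist` computes the metric. Trees are assumed to be dicts mapping each node to its parent, with the root mapping to itself, and any mismatch (including resolved versus unresolved) counts as a difference.

```python
from itertools import combinations


def path_to_root(parent, node):
    """List of nodes from `node` up to the root (root maps to itself)."""
    path = [node]
    while parent[node] != node:
        node = parent[node]
        path.append(node)
    return path


def mrca_depth(parent, a, b):
    """Depth (steps below the root) of the most recent common ancestor."""
    path_a = path_to_root(parent, a)
    ancestors_b = set(path_to_root(parent, b))
    for steps_up, node in enumerate(path_a):
        if node in ancestors_b:
            return len(path_a) - 1 - steps_up


def triplet_topology(parent, a, b, c):
    """Return the most closely related pair, or None if unresolved."""
    depths = {
        (a, b): mrca_depth(parent, a, b),
        (a, c): mrca_depth(parent, a, c),
        (b, c): mrca_depth(parent, b, c),
    }
    deepest = max(depths.values())
    closest = [pair for pair, depth in depths.items() if depth == deepest]
    return closest[0] if len(closest) == 1 else None


def toy_triplet_distance(parent_x, parent_y, leaves):
    """Fraction of leaf triplets whose topology differs between trees."""
    triplets = list(combinations(sorted(leaves), 3))
    mismatches = sum(
        triplet_topology(parent_x, *t) != triplet_topology(parent_y, *t)
        for t in triplets
    )
    return mismatches / len(triplets)


true_tree = {"r": "r", "x": "r", "a": "x", "b": "x", "c": "r", "d": "r"}
star_tree = {"r": "r", "a": "r", "b": "r", "c": "r", "d": "r"}
print(toy_triplet_distance(true_tree, star_tree, "abcd"))  # 0.5
```

Of the four triplets here, the two containing both `a` and `b` are resolved in the true tree but not in the star tree, so half of the triplets disagree.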


## Generate Phylogeny

Use a simple evolutionary simulation to generate a phylogenetic history on which to test the reconstruction process.

In [18]:
true_phylogeny_df = hstrat.evolve_fitness_trait_population(
    num_islands=num_islands,
    num_niches=num_niches,
    num_generations=num_generations,
    population_size=population_size,
    tournament_size=tournament_size,
    progress_wrap=tqdm,
)

100%|██████████| 10000/10000 [01:48<00:00, 92.26it/s]
100%|██████████| 207964/207964 [00:03<00:00, 61087.26it/s]


Unnamed: 0,id,ancestor_list,loc,trait,origin_time,island,niche,taxon_label,is_leaf
0,0,[None],0,,0,0,0,,False
1,1,[0],1313,0.000000,1,0,0,,False
2,2,[1],54869,0.798812,2,0,0,,False
3,3,[2],35052,1.337494,3,0,0,,False
4,4,[3],29099,3.632604,4,0,0,,False
...,...,...,...,...,...,...,...,...,...
207959,207959,[112940],65531,11398.186523,10001,0,0,65531,True
207960,207960,[107091],65532,11394.820312,10001,0,0,65532,True
207961,207961,[124357],65533,11396.279297,10001,0,0,65533,True
207962,207962,[123057],65534,11393.061523,10001,0,0,65534,True
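
For intuition about how such a table arises, the sketch below runs a toy tournament-selection loop and records one row per organism with `id`, `ancestor_list`, and `origin_time` columns. It is a simplified stand-in under stated assumptions (single island, single niche, Gaussian trait mutation), not the implementation behind `hstrat.evolve_fitness_trait_population`. The function name `evolve_toy_phylogeny` is hypothetical.

```python
import random


def evolve_toy_phylogeny(population_size, num_generations, tournament_size, seed=0):
    """Simulate tournament selection, recording every organism's parent."""
    rng = random.Random(seed)
    # the root record is the common ancestor of the initial population
    records = [{"id": 0, "ancestor_list": "[none]", "origin_time": 0}]
    # each organism is a (record id, fitness trait) pair
    population = [(0, 0.0)] * population_size
    for generation in range(1, num_generations + 1):
        next_population = []
        for _slot in range(population_size):
            # pick contestants at random; the fittest becomes the parent
            contestants = [
                rng.choice(population) for _pick in range(tournament_size)
            ]
            parent_id, parent_trait = max(contestants, key=lambda org: org[1])
            child_id = len(records)
            records.append({
                "id": child_id,
                "ancestor_list": f"[{parent_id}]",
                "origin_time": generation,
            })
            # assumed mutation model: Gaussian perturbation of the trait
            next_population.append((child_id, parent_trait + rng.gauss(0, 1)))
        population = next_population
    return records


records = evolve_toy_phylogeny(population_size=4, num_generations=3, tournament_size=2)
print(len(records))  # 13: one root plus 4 organisms for each of 3 generations
```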


In [19]:
true_phylogeny_df["taxon_label"] = true_phylogeny_df["loc"].astype(str)
true_phylogeny_df = hstrat_aux.alifestd_mark_leaves(true_phylogeny_df, mutate=True)
true_phylogeny_df.loc[
    ~true_phylogeny_df["is_leaf"], "taxon_label"
] = ""
true_phylogeny_df

Unnamed: 0,id,ancestor_list,loc,trait,origin_time,island,niche,taxon_label,is_leaf
0,0,[None],0,,0,0,0,,False
1,1,[0],1313,0.000000,1,0,0,,False
2,2,[1],54869,0.798812,2,0,0,,False
3,3,[2],35052,1.337494,3,0,0,,False
4,4,[3],29099,3.632604,4,0,0,,False
...,...,...,...,...,...,...,...,...,...
207959,207959,[112940],65531,11398.186523,10001,0,0,65531,True
207960,207960,[107091],65532,11394.820312,10001,0,0,65532,True
207961,207961,[124357],65533,11396.279297,10001,0,0,65533,True
207962,207962,[123057],65534,11393.061523,10001,0,0,65534,True


In [12]:
true_phylogeny_df = hstrat_aux.alifestd_to_working_format(
  hstrat_aux.alifestd_collapse_unifurcations(true_phylogeny_df, mutate=True),
  mutate=True,
).reset_index(drop=True)
true_phylogeny_df

Unnamed: 0,id,ancestor_list,loc,trait,origin_time,island,niche,taxon_label,ancestor_id
0,0,[none],0,,0,0,0,0,0
1,1,[0],35431,11211.383789,9832,0,0,35431,0
2,2,[1],35310,11242.329102,9858,0,0,35310,1
3,3,[2],42517,11243.378906,9859,0,0,42517,2
4,4,[2],49638,11246.254883,9862,0,0,49638,2
...,...,...,...,...,...,...,...,...,...
110193,110193,[31649],65531,11401.050781,10001,0,0,65531,31649
110194,110194,[43418],65532,11397.212891,10001,0,0,65532,43418
110195,110195,[27437],65533,11399.445312,10001,0,0,65533,27437
110196,110196,[32142],65534,11395.846680,10001,0,0,65534,32142
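
Collapsing unifurcations removes nodes that have exactly one child, reattaching that child to the nearest retained ancestor. The sketch below shows the idea on a toy tree stored as a child-to-parent dict (root mapping to itself). It assumes the root is always kept, which may differ from what `alifestd_collapse_unifurcations` does.

```python
from collections import Counter


def collapse_unifurcations(parent):
    """Drop non-root nodes with exactly one child from a parent mapping."""
    num_children = Counter(p for child, p in parent.items() if child != p)

    def keep(node):
        # assumption: the root is retained even if it has a single child
        return parent[node] == node or num_children[node] != 1

    collapsed = {}
    for node in parent:
        if not keep(node):
            continue
        # walk upward past any dropped nodes to find the new parent
        ancestor = parent[node]
        while not keep(ancestor):
            ancestor = parent[ancestor]
        collapsed[node] = ancestor
    return collapsed


# chain r -> a -> b, where b splits into leaves c and d
tree = {"r": "r", "a": "r", "b": "a", "c": "b", "d": "b"}
print(collapse_unifurcations(tree))
# {'r': 'r', 'b': 'r', 'c': 'b', 'd': 'b'}
```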


## Generate Reconstruction

Generate genome annotations as if tracking phylogeny in a distributed environment.
Then run the reconstruction process to estimate the true phylogeny from the generated annotations.

In [13]:
extant_annotations = hstrat.descend_template_phylogeny_alifestd(
    true_phylogeny_df,
    seed_column=hstrat.HereditaryStratigraphicColumn(parametrized_policy),
    extant_ids=hstrat_aux.alifestd_find_leaf_ids(true_phylogeny_df),
    progress_wrap=tqdm,
)

len(extant_annotations)

100%|██████████| 110198/110198 [00:12<00:00, 8766.51it/s] 
100%|██████████| 65536/65536 [00:25<00:00, 2619.71it/s]


65536
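
The idea behind these annotations can be sketched without the library. Each generation appends a random "differentia" to an inherited record, and two records are compared to find where they first disagree. This toy keeps every stratum, whereas the retention policies configured earlier prune strata to fit the annotation capacity. The function names are hypothetical and do not correspond to `hstrat` API calls.

```python
import random


def descend(annotation, rng, differentia_bits=64):
    """Return a child's annotation: the parent's strata plus one new one."""
    return annotation + [rng.getrandbits(differentia_bits)]


def last_common_generation(annotation_a, annotation_b):
    """Index of the last stratum shared before the first mismatch."""
    last_common = -1
    for generation, (x, y) in enumerate(zip(annotation_a, annotation_b)):
        if x != y:
            break
        last_common = generation
    return last_common


rng = random.Random(1)
ancestor = []
for _ in range(5):  # five generations of shared ancestry
    ancestor = descend(ancestor, rng)

lineage_a, lineage_b = ancestor, ancestor
for _ in range(3):  # three generations of independent descent
    lineage_a = descend(lineage_a, rng)
    lineage_b = descend(lineage_b, rng)

print(last_common_generation(lineage_a, lineage_b))
```

With narrow differentia (for example 1 bit), independently generated strata can match by chance, so the estimate can land later than the true divergence. That trade-off is what the `differentia_width_bits` setting controls.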

In [14]:
reconstructed_phylogeny_df = hstrat.build_tree(
  extant_annotations,
  progress_wrap=tqdm,
  version_pin=hstrat.__version__,
  taxon_labels=true_phylogeny_df.loc[
      hstrat_aux.alifestd_find_leaf_ids(true_phylogeny_df),
      "taxon_label",
  ],
)
reconstructed_phylogeny_df

100%|██████████| 65536/65536 [01:27<00:00, 753.03it/s]
65536it [00:01, 57463.20it/s]
132211it [00:01, 73557.97it/s]
132211it [00:03, 41161.99it/s]


Unnamed: 0,id,ancestor_list,origin_time,taxon_label,ancestor_id
0,0,[none],0.0,Root,0
248,248,[0],9839.5,Inner+r=9824+d=NkI3l5X27iz+uid=Bv6lHKT64s9c4z3...,0
249,249,[248],9871.5,Inner+r=9856+d=1dFc4yggBA+uid=DCNEcvW1h5aq1hGj...,248
252,252,[249],9903.5,Inner+r=9888+d=G0SbfjEYqqS+uid=COkD_ZNDO2DGHkB...,249
261,261,[248],9903.5,Inner+r=9888+d=KQeB_rWCy0n+uid=D80m2t_NLPHrZOF...,248
...,...,...,...,...,...
132206,132206,[1137],10001.0,27237,1137
132207,132207,[1137],10001.0,34504,1137
132208,132208,[1137],10001.0,44508,1137
132209,132209,[1137],10001.0,50809,1137


In [56]:
reconstructed_phylogeny_df = hstrat_aux.alifestd_collapse_unifurcations(reconstructed_phylogeny_df, mutate=True)
reconstructed_phylogeny_df

Unnamed: 0,id,ancestor_list,origin_time,taxon_label,ancestor_id
0,0,[none],0.0,Root,0
248,248,[0],9839.5,Inner+r=9824+d=PMIry_IlJJr+uid=Bv6lHKT64s9c4z3...,0
249,249,[248],9871.5,Inner+r=9856+d=N901Fc-CEiP+uid=DCNEcvW1h5aq1hG...,248
252,252,[249],9903.5,Inner+r=9888+d=L6OSbRkRjp3+uid=COkD_ZNDO2DGHkB...,249
261,261,[248],9903.5,Inner+r=9888+d=NyOxTPtmf9p+uid=D80m2t_NLPHrZOF...,248
...,...,...,...,...,...
132206,132206,[1137],10001.0,27237,1137
132207,132207,[1137],10001.0,34504,1137
132208,132208,[1137],10001.0,44508,1137
132209,132209,[1137],10001.0,50809,1137


## Evaluate Reconstruction

Reconstruction quality data. Record these values in the results spreadsheet.

In [40]:
num_true_inner_nodes = hstrat_aux.alifestd_count_inner_nodes(true_phylogeny_df)
num_reconstructed_inner_nodes = hstrat_aux.alifestd_count_inner_nodes(reconstructed_phylogeny_df)
f"{num_true_inner_nodes=} {num_reconstructed_inner_nodes=}"

'num_true_inner_nodes=44662 num_reconstructed_inner_nodes=792'

In [41]:
num_true_polytomies = hstrat_aux.alifestd_count_polytomies(true_phylogeny_df)
num_reconstructed_polytomies = hstrat_aux.alifestd_count_polytomies(reconstructed_phylogeny_df)
f"{num_true_polytomies=} {num_reconstructed_polytomies=}"

'num_true_polytomies=14818 num_reconstructed_polytomies=728'
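
As a rough illustration of what these counts measure, the sketch below tallies inner nodes and polytomies on a toy child-to-parent dict. It assumes a polytomy is any node with more than two children and that the root counts as an inner node. Whether the `hstrat_aux` helpers use exactly these definitions is not confirmed here.

```python
from collections import Counter


def count_inner_nodes_and_polytomies(parent):
    """Return (number of inner nodes, number of polytomies)."""
    # count children per node; the root maps to itself and is skipped
    num_children = Counter(p for child, p in parent.items() if child != p)
    inner_nodes = len(num_children)  # every node with at least one child
    polytomies = sum(1 for n in num_children.values() if n > 2)
    return inner_nodes, polytomies


# root r has four children (a polytomy); x has two (a bifurcation)
tree = {"r": "r", "a": "r", "b": "r", "c": "r", "x": "r", "d": "x", "e": "x"}
print(count_inner_nodes_and_polytomies(tree))  # (2, 1)
```

A reconstruction with few inner nodes, most of them polytomies, indicates that many true branching events could not be resolved.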

In [42]:
true_polytomic_index = hstrat_aux.alifestd_calc_polytomic_index(true_phylogeny_df)
reconstructed_polytomic_index = hstrat_aux.alifestd_calc_polytomic_index(reconstructed_phylogeny_df)
f"{true_polytomic_index=} {reconstructed_polytomic_index=}"

'true_polytomic_index=20874 reconstructed_polytomic_index=64744'

In [36]:
num_true_leaf_nodes = hstrat_aux.alifestd_count_leaf_nodes(true_phylogeny_df)
num_reconstructed_leaf_nodes = hstrat_aux.alifestd_count_leaf_nodes(reconstructed_phylogeny_df)
f"{num_true_leaf_nodes=} {num_reconstructed_leaf_nodes=}"

'true_num_leaves=65536 reconstructed_num_leaves=65536'

In [72]:
distances = calc_tqdist_distance(
  true_phylogeny_df,
  reconstructed_phylogeny_df,
  progress_wrap=tqdm,
)
f"{distances=}"

65536it [00:00, 226416.60it/s]
65536it [00:00, 417473.49it/s]
65536it [00:00, 199041.21it/s]
100%|██████████| 65536/65536 [00:00<00:00, 1474864.69it/s]


"distances={'quartet_distance': 8226.607410651844, 'quartet_distanc_rawe': 1.897344982357028e+17, 'triplet_distance': 0.34829207134452494, 'triplet_distance_raw': 0.34829207134452494}"

In [22]:
sampled_triplet_distance_strict = hstrat_aux.alifestd_estimate_triplet_distance_asexual(
    true_phylogeny_df,
    reconstructed_phylogeny_df,
    taxon_label_key="taxon_label",
    confidence=0.8,
    precision=0.05,
    strict=True,
    progress_wrap=tqdm,
    mutate=True,
)
f"{sampled_triplet_distance_strict=}"

434it [01:38,  4.42it/s]


'sampled_triplet_distance_strict=0.7908045977011494'

In [23]:
sampled_triplet_distance_lax = hstrat_aux.alifestd_estimate_triplet_distance_asexual(
    true_phylogeny_df,
    reconstructed_phylogeny_df,
    taxon_label_key="taxon_label",
    confidence=0.8,
    precision=0.05,
    strict=False,
    progress_wrap=tqdm,
    mutate=True,
)
f"{sampled_triplet_distance_lax=}"

638it [02:04,  5.14it/s]


'sampled_triplet_distance_lax=0.4194053208137715'

## Visualize Phylogeny & Reconstruction

For validating results.

Topology only (no time).

In [None]:
true_phylogeny_tree = apc.alife_dataframe_to_biopython_tree(
  hstrat_aux.alifestd_collapse_unifurcations(true_phylogeny_df),
  setup_branch_lengths=False,
)
reconstructed_phylogeny_tree = apc.alife_dataframe_to_biopython_tree(
  hstrat_aux.alifestd_collapse_unifurcations(reconstructed_phylogeny_df),
  setup_branch_lengths=False,
)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

ax1.set_title("True Tree")
Phylo.draw(true_phylogeny_tree, do_show=False, axes=ax1)

ax2.set_title("Reconstructed Tree")
Phylo.draw(reconstructed_phylogeny_tree, do_show=False, axes=ax2)

plt.tight_layout()
plt.show()

Scaled by time.

In [None]:
true_phylogeny_tree = apc.alife_dataframe_to_biopython_tree(
  hstrat_aux.alifestd_collapse_unifurcations(true_phylogeny_df),
  setup_branch_lengths=True,
)
reconstructed_phylogeny_tree = apc.alife_dataframe_to_biopython_tree(
  hstrat_aux.alifestd_collapse_unifurcations(reconstructed_phylogeny_df),
  setup_branch_lengths=True,
)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

ax1.set_title("True Tree")
ax1.set_xscale("log")
Phylo.draw(true_phylogeny_tree, do_show=False, axes=ax1)

ax2.set_title("Reconstructed Tree")
ax2.set_xscale("log")
Phylo.draw(reconstructed_phylogeny_tree, do_show=False, axes=ax2)

plt.tight_layout()
plt.show()

## Reproducibility Information

For future reference if reproducing experiments.

In [None]:
print(
  f"""# instrumentation
{annotation_size_bits=}
{differentia_width_bits=}
{stratum_retention_algo.PolicySpec.GetAlgoTitle()=}

# evolutionary scale
{population_size=}
{num_generations=}

# evolutionary conditions
{num_islands=}
{num_niches=}
{tournament_size=}
"""
)
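
If a machine-readable record is wanted alongside the printed summary, the same settings could be serialized as JSON. This is a minimal sketch; the helper `snapshot_config` is hypothetical, and the values shown mirror the configuration cells earlier in this notebook.

```python
import json


def snapshot_config(**config) -> str:
    """Serialize configuration values to a stable, sorted JSON string."""
    return json.dumps(config, indent=2, sort_keys=True)


print(
    snapshot_config(
        annotation_size_bits=256,
        differentia_width_bits=1,
        population_size=65536,
        num_generations=10000,
        num_islands=1,
        num_niches=1,
        tournament_size=2,
        replicate=0,
    )
)
```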

In [None]:
import datetime
datetime.datetime.now().isoformat()

In [None]:
%load_ext watermark
%watermark

In [None]:
!pip freeze

In [None]:
locals()