# Exploratory Demo of All-types Functional Site Methods


*Goal:* infer the prevalence of *all* functional genome sites that cannot be identified through individual knockouts (including both additive and epistatic sites).

**Outline:**
- Generate sample genome with both additive and redundant (epistatic) sites
- Inspect ground truth site counts in sample genome
- Generate repeat skeletonizations of sample genome (i.e., knockouts where no remaining sites can be removed without observing a fitness effect)
- Use mark-recapture statistics over "captures" of functional sites within skeletons to infer overall prevalence of functional sites


## Preliminaries


In [None]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm

from pylib.analyze_agnostic import assay_agnostic_naive
from pylib.analyze_epistasis import (
    describe_skeletons,
    skeletonize_naive,
)
from pylib.modelsys_explicit import (
    GenomeExplicit,
    CalcKnockoutEffectsAdditive,
    CalcKnockoutEffectsEpistasis,
    create_additive_array,
    create_epistasis_matrix_disjoint,
    describe_additive_array,
    describe_epistasis_matrix,
)


Method implementations are organized as external Python source files within the local `pylib` directory.


In [None]:
np.random.seed(1234)


Ensure reproducibility.


## Create Sample Genome


Create a genome with 10,000 distinct sites.

Let 4% of sites have an additive knockout fitness effect below the detectability threshold.
Effect sizes are distributed uniformly between 0 and 0.7, relative to the detectability threshold of 1.0.

Add 40 epistatic sets, each with 4 sites.
Fitness consequences of magnitudes between 0.7 and 1.6 occur when all sites within an epistatic set are knocked out.

Overlap is allowed: an individual site may have both additive and epistatic effects.
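
To make the fitness model above concrete, here is a minimal toy sketch. It assumes that additive effects of knocked-out sites sum, that an epistatic set contributes its effect only when every member is knocked out, and that a knockout is detectable when the total reaches the threshold of 1.0. The names `additive`, `epistatic_sets`, `knockout_effect`, and `is_detectable` are hypothetical and are not part of `pylib`, whose internals may differ.

```python
import numpy as np

num_sites = 10
threshold = 1.0  # detectability threshold described above

# Hypothetical additive effects; each is individually below the threshold.
additive = np.zeros(num_sites)
additive[[1, 5, 9]] = [0.6, 0.5, 0.3]

# Hypothetical epistatic set: (member sites, effect when ALL are knocked out).
epistatic_sets = [(np.array([2, 3, 5, 7]), 1.2)]


def knockout_effect(knockout_mask):
    """Total fitness effect of knocking out the sites flagged True."""
    total = additive[knockout_mask].sum()
    for members, effect in epistatic_sets:
        if knockout_mask[members].all():  # set fires only if fully knocked out
            total += effect
    return total


def is_detectable(knockout_mask):
    """Assumed rule: detectable when the total effect reaches the threshold."""
    return knockout_effect(knockout_mask) >= threshold


def mask(*sites):
    """Boolean knockout mask with the given sites set to True."""
    out = np.zeros(num_sites, dtype=bool)
    out[list(sites)] = True
    return out


print(is_detectable(mask(1)))           # single additive site
print(is_detectable(mask(1, 5)))        # two additive sites together
print(is_detectable(mask(2, 3, 7)))     # incomplete epistatic set
print(is_detectable(mask(2, 3, 5, 7)))  # complete epistatic set
```

In this toy, no single-site knockout is detectable, which is why such sites cannot be found through individual knockouts alone.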


In [None]:
num_sites = 10000
distn = lambda x: np.random.rand(x) * 0.7
additive_array = create_additive_array(num_sites, 0.04, distn)
epistasis_matrix = create_epistasis_matrix_disjoint(num_sites, 40, 4)
genome = GenomeExplicit(
    [
        CalcKnockoutEffectsAdditive(additive_array),
        CalcKnockoutEffectsEpistasis(epistasis_matrix, effect_size=(0.7, 1.6)),
    ],
)


## Inspect Sample Genome


Create DataFrame with rows describing content of each genome site.


In [None]:
dfa = describe_additive_array(additive_array)
dfb = describe_epistasis_matrix(epistasis_matrix)
df_genome = pd.merge(dfa, dfb, on="site")
df_genome["site type"] = (
    df_genome["additive site"].astype(int)
    + df_genome["epistasis site"].astype(int) * 2
).map(
    {
        0: "neutral",
        1: "additive",
        2: "epistasis",
        3: "both",
    }
)

df_genome


How many of each kind of site are in the genome?


In [None]:
sns.displot(df_genome["site type"])
plt.yscale("log")
df_genome["site type"].value_counts()


How many functional (i.e., non-neutral) sites are there?


In [None]:
num_functional_sites = (df_genome["site type"] != "neutral").sum()
num_functional_sites


## Perform Skeletonizations


"Skeletons" are minimal sets of genome sites that maintain wild-type fitness.
Skeletons can be generated by sequentially removing sites from the genome, until no further sites can be removed without detectably reducing fitness.
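
The sequential-removal procedure can be sketched as follows. This is an illustration, not the `skeletonize_naive` implementation: it assumes the knockout test takes a boolean mask (True = knocked out) and returns True when fitness is detectably reduced, and that knocking out more sites never hides an effect. Under that second assumption, a single pass in random order yields a minimal skeleton. The names `skeletonize_sketch`, `toy_test_knockout`, and `effects` are hypothetical.

```python
import numpy as np


def skeletonize_sketch(num_sites, test_knockout, rng):
    """Return a boolean knockout mask (True = removed) for one skeleton."""
    knocked_out = np.zeros(num_sites, dtype=bool)
    for site in rng.permutation(num_sites):  # random removal order
        knocked_out[site] = True  # tentatively remove the site
        if test_knockout(knocked_out):  # detectable fitness loss?
            knocked_out[site] = False  # site is needed; restore it
    return knocked_out


# Toy genome: additive effects only, detectable at a total of 1.0 or more.
effects = np.array([0.6, 0.5, 0.0, 0.0, 0.3, 0.0])


def toy_test_knockout(knocked_out):
    return effects[knocked_out].sum() >= 1.0


rng = np.random.default_rng(1)
skeleton = skeletonize_sketch(len(effects), toy_test_knockout, rng)
retained = np.flatnonzero(~skeleton)  # sites kept in the skeleton
print(retained)
```

Because the removal order is random, repeated runs can retain different functional sites, which is what makes repeat skeletonizations informative.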

Sample 5 skeletons.


In [None]:
num_skeletonizations = 5
skeletons = np.vstack(
    [
        skeletonize_naive(num_sites, genome.test_knockout)
        for _ in tqdm(range(num_skeletonizations))
    ],
)


Here is an example skeleton.


In [None]:
# convert from knockout true to retained true
retained_sites = ~skeletons[0].astype(bool)
sns.rugplot(
    np.flatnonzero(retained_sites),
    height=0.5,
)
retained_sites


## Describe Skeletons


Tabulate information across skeletons on a site-by-site basis.


In [None]:
df_skeletons = describe_skeletons(skeletons, genome.test_knockout)

df_skeletons


How many unique sites show up in any skeleton?
(i.e., the number of sites with direct evidence of functionality)


In [None]:
np.any(
    (~skeletons.astype(bool)),
    axis=0,
).sum()


## Estimate Number of Functional Sites


The skeletonization process can actually be interpreted as a mark-recapture experiment.
Just like field researchers counting rabbits, we can estimate the total population of functional sites from the rate at which we "re-capture" specimens.
(Here, "re-capture" means that a site is included in more than one skeleton.)

Note that the estimator must account for bias in capture probability (aka "trap shyness").
This implementation uses a nonparametric jackknife estimator due to Burnham and Overton (see source code for details).
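
For intuition, the first- and second-order Burnham and Overton jackknife estimators can be computed directly from capture frequencies, as in the sketch below. The function name `jackknife_estimates` and the toy capture matrix are hypothetical; which order or variant `assay_agnostic_naive` actually applies is not shown here, so treat this as a sketch rather than a reproduction of that code.

```python
import numpy as np


def jackknife_estimates(retained):
    """First- and second-order jackknife estimates of population size.

    `retained` is a boolean array of shape (num_skeletons, num_sites),
    True where a site was kept in ("captured by") a skeleton.
    """
    retained = np.asarray(retained, dtype=bool)
    k = retained.shape[0]  # number of capture occasions (skeletons)
    if k < 2:
        raise ValueError("need at least two skeletons")
    captures = retained.sum(axis=0)  # times each site was captured
    observed = int((captures > 0).sum())  # distinct sites seen at least once
    f1 = int((captures == 1).sum())  # sites captured exactly once
    f2 = int((captures == 2).sum())  # sites captured exactly twice
    first = observed + f1 * (k - 1) / k
    second = (
        observed
        + f1 * (2 * k - 3) / k
        - f2 * (k - 2) ** 2 / (k * (k - 1))
    )
    return first, second


# Toy capture history: 3 skeletons over 6 sites.
toy_retained = np.array(
    [
        [1, 1, 0, 0, 1, 0],
        [1, 0, 1, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
    ],
    dtype=bool,
)
first, second = jackknife_estimates(toy_retained)
print(first, second)
```

Both estimates exceed the count of sites observed directly whenever some sites appear in only one skeleton, since singletons suggest that further sites went uncaptured.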


In [None]:
assay_agnostic_naive(df_skeletons)


For comparison, the actual number of functional sites is:


In [None]:
num_functional_sites
