# Lab 5: Statistical Models of Genetic Inheritance

## Overview

This lab explores the statistical models of genetic inheritance used in Bonsai v3. These models form the mathematical foundation for relationship inference from IBD segments. 

Key topics include:

1. Probability distributions for IBD segment counts and lengths
2. Statistical modeling of different genetic relationships
3. Likelihood functions for relationship inference
4. Handling the stochastic nature of genetic inheritance
5. Bayesian approaches to pedigree reconstruction

By the end of this lab, you'll understand the mathematical foundations that allow Bonsai v3 to infer relationships from genetic data.

In [None]:
# 🧬 Google Colab Setup - Run this cell first!
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from scipy import stats  # used below for Poisson and normal distributions
from IPython.display import display, HTML, Markdown

def is_colab():
    '''Check if running in Google Colab'''
    try:
        import google.colab
        return True
    except ImportError:
        return False

if is_colab():
    print("🔬 Setting up Google Colab environment...")
    
    # Install dependencies
    print("📦 Installing packages...")
    !pip install -q pysam biopython scikit-allel networkx pygraphviz seaborn plotly
    !apt-get update -qq && apt-get install -qq samtools bcftools tabix graphviz-dev
    
    # Create directories
    !mkdir -p /content/class_data /content/results
    
    # Download essential class data
    print("📥 Downloading class data...")
    S3_BASE = "https://computational-genetic-genealogy.s3.us-east-2.amazonaws.com/class_data/"
    data_files = [
        "pedigree.fam", "pedigree.def", 
        "merged_opensnps_autosomes_ped_sim.seg",
        "merged_opensnps_autosomes_ped_sim-everyone.fam",
        "ped_sim_run2.seg", "ped_sim_run2-everyone.fam"
    ]
    
    for file in data_files:
        !wget -q -O /content/class_data/{file} {S3_BASE}{file}
        print(f"  ✅ {file}")
    
    # Define utility functions
    def setup_environment():
        return "/content/class_data", "/content/results"
    
    def save_results(dataframe, filename, description="results"):
        os.makedirs("/content/results", exist_ok=True)
        full_path = f"/content/results/{filename}"
        dataframe.to_csv(full_path, index=False)
        display(HTML(f'''
        <div style="padding: 10px; background-color: #e3f2fd; border-left: 4px solid #2196f3; margin: 10px 0;">
            <p><strong>💾 Results saved!</strong> To download: 
            <code>from google.colab import files; files.download('{full_path}')</code></p>
        </div>
        '''))
        return full_path
    
    def save_plot(plt, filename, description="plot"):
        os.makedirs("/content/results", exist_ok=True)
        full_path = f"/content/results/{filename}"
        plt.savefig(full_path, dpi=300, bbox_inches='tight')
        plt.show()
        display(HTML(f'''
        <div style="padding: 10px; background-color: #e8f5e8; border-left: 4px solid #4caf50; margin: 10px 0;">
            <p><strong>📊 Plot saved!</strong> To download: 
            <code>from google.colab import files; files.download('{full_path}')</code></p>
        </div>
        '''))
        return full_path
    
    print("✅ Colab setup complete! Ready to explore genetic genealogy.")
    
else:
    print("🏠 Local environment detected")
    def setup_environment():
        return "class_data", "results"
    def save_results(df, filename, description=""):
        os.makedirs("results", exist_ok=True)
        path = f"results/{filename}"
        df.to_csv(path, index=False)
        return path
    def save_plot(plt, filename, description=""):
        os.makedirs("results", exist_ok=True)
        path = f"results/{filename}"
        plt.savefig(path, dpi=300, bbox_inches='tight')
        plt.show()
        return path

# Set up paths and configure visualization
DATA_DIR, RESULTS_DIR = setup_environment()
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [None]:
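# Setup Bonsai module paths
def is_jupyterlite():
    '''Check if running in JupyterLite (its Pyodide kernel reports the emscripten platform)'''
    return sys.platform == "emscripten"

if not is_jupyterlite():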
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
import importlib
import inspect

def view_function_source(module_name, function_name):
    """Display the source code of a function"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Get the function
        func = getattr(module, function_name)
        
        # Get the source code
        source = inspect.getsource(func)
        
        # Print the source code
        from IPython.display import display, Markdown
        display(Markdown(f"```python\n{source}\n```"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except AttributeError:
        print(f"Function {function_name} not found in module {module_name}")
    except Exception as e:
        print(f"Error processing function {function_name}: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [None]:
try:
    from utils.bonsaitree.bonsaitree import v3
    print("✅ Successfully imported Bonsai v3 module")
    
    # Check if the moments and likelihoods modules are available
    try:
        from utils.bonsaitree.bonsaitree.v3 import moments, likelihoods
        print("✅ Successfully imported Bonsai v3 moments and likelihoods modules")
    except ImportError as e:
        print(f"❌ Failed to import Bonsai v3 moments/likelihoods modules: {e}")
except ImportError as e:
    print(f"❌ Failed to import Bonsai v3 module: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")
    print("Make sure you've properly set up your environment with the Bonsai repository.")

## Part 1: Mathematical Foundations of IBD Sharing

Let's start by exploring the mathematical foundations of IBD sharing. Identity-by-Descent (IBD) segments are regions of DNA that two individuals have inherited from a common ancestor.

The key statistical property of IBD segments is that both their number and their length follow well-defined probability distributions. Let's examine these distributions and how they vary with relationship type.

### 1.1 Poisson Distribution for Segment Counts

The number of IBD segments shared between two relatives is modeled as a Poisson random variable. Its parameter $\lambda$ (the expected number of segments) depends on the relationship type.

For relatives separated by meiotic distance $d$, the expected number of segments above threshold $t$ is:

$$\lambda = \frac{a(rd+c)}{2^{d-1}} \cdot p(t)$$

Where:
- $a$ is the number of common ancestors (1 for half-relationships, 2 for full relationships)
- $r$ is the expected number of recombination events per meiosis (approximately 34 for the human genome, i.e. a genome length of about 3400 cM divided by 100)
- $c$ is the number of chromosomes (22 autosomes)
- $d$ is the meiotic distance between individuals
- $p(t)$ is the probability a segment exceeds detection threshold $t$, calculated as $e^{-dt/100}$
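
As a worked example with the default values used in this lab ($r = 34$, $c = 22$, $t = 7$ cM), full first cousins have $a = 2$ and $d = 4$, so:

$$\lambda = \frac{2\,(34 \cdot 4 + 22)}{2^{3}} \cdot e^{-4 \cdot 7/100} = 39.5 \cdot e^{-0.28} \approx 39.5 \cdot 0.756 \approx 29.9$$

That is, roughly 30 segments longer than 7 cM are expected between full first cousins under this model.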

Let's implement this formula and visualize the Poisson distributions for different relationships:

In [None]:
def calculate_expected_segments(relation_tuple, min_seg_len=7, genome_length=3400):
    """Calculate expected number of IBD segments for a relationship.
    
    Args:
        relation_tuple: (up, down, num_ancs) tuple representing the relationship
        min_seg_len: Minimum segment length in cM
        genome_length: Total genome length in cM
        
    Returns:
        Expected number of segments (lambda parameter for Poisson distribution)
    """
    up_meioses = relation_tuple[0]      # Generations up to common ancestor
    down_meioses = relation_tuple[1]    # Generations down from common ancestor
    num_ancestors = relation_tuple[2]   # Number of common ancestors
    
    # Total meiotic distance
    meiotic_distance = up_meioses + down_meioses
    
    # Number of recombination events per meiosis
    r = genome_length / 100  # approx 34 for human genome
    
    # Number of chromosomes
    c = 22
    
    # Expected segments before filtering for minimum length
    expected_segments = num_ancestors * (r * meiotic_distance + c) / (2 ** (meiotic_distance - 1))
    
    # Probability a segment exceeds the minimum threshold
    p_obs = np.exp(-meiotic_distance * min_seg_len / 100)
    
    # Final expected segment count
    lambda_val = expected_segments * p_obs
    
    return lambda_val

# Define relationships to model
relationships = [
    ((0, 1, 1), "Parent-Child", "#1f77b4"),      # Parent-child
    ((1, 1, 2), "Full Siblings", "#ff7f0e"),     # Full siblings
    ((1, 1, 1), "Half Siblings", "#2ca02c"),     # Half siblings
    ((2, 2, 2), "First Cousins", "#d62728"),     # First cousins
    ((3, 3, 2), "Second Cousins", "#9467bd"),    # Second cousins
    ((4, 4, 2), "Third Cousins", "#8c564b")      # Third cousins
]

# Set up the plot
plt.figure(figsize=(12, 8))

# Plot Poisson distributions for segment counts
x = np.arange(0, 100, 1)  # Range of segment counts to plot

for rel_tuple, rel_name, color in relationships:
    # Calculate expected segments (lambda)
    lambda_val = calculate_expected_segments(rel_tuple)
    
    # Generate Poisson PMF
    pmf = stats.poisson.pmf(x, lambda_val)
    
    # Plot distribution
    plt.plot(x, pmf, '-', linewidth=2, color=color, label=f"{rel_name} (λ={lambda_val:.1f})")

plt.title("Poisson Distributions for IBD Segment Counts by Relationship")
plt.xlabel("Number of IBD Segments")
plt.ylabel("Probability")
plt.legend()
plt.grid(alpha=0.3)
plt.xlim(0, 50)
plt.show()

### 1.2 Exponential Distribution for Segment Lengths

The lengths of IBD segments follow an exponential distribution. For a relationship with meiotic distance $d$, the probability density function for segment length $x$ given that it exceeds the threshold $t$ is:

$$f(x|x>t) = \frac{d}{100} \cdot e^{-d(x-t)/100}$$

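Without the threshold, the mean length of an IBD segment at meiotic distance $d$ is $\frac{100}{d}$ centimorgans. Because the exponential distribution is memoryless, the mean length of the segments that exceed the threshold is $t + \frac{100}{d}$ centimorgans.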

Let's implement this formula and visualize the distribution of segment lengths for different relationships:

In [None]:
def segment_length_pdf(x, meiotic_distance, min_seg_len=7):
    """Calculate the probability density for a segment of length x.
    
    Args:
        x: Segment length in cM
        meiotic_distance: Total number of meioses between individuals
        min_seg_len: Minimum detectable segment length in cM
        
    Returns:
        Probability density at length x
    """
    if x < min_seg_len:
        return 0
    
    # Rate parameter
    rate = meiotic_distance / 100
    
    # Conditional PDF given x > min_seg_len
    return rate * np.exp(-rate * (x - min_seg_len))

# Plot exponential distributions for segment lengths
plt.figure(figsize=(12, 8))

# Range of segment lengths to plot
x = np.linspace(7, 200, 1000)

for rel_tuple, rel_name, color in relationships:
    # Calculate meiotic distance
    meiotic_distance = rel_tuple[0] + rel_tuple[1]
    
    # Mean of the untruncated exponential (100/d); the plotted density is conditional
    # on x > min_seg_len, so its own mean is min_seg_len + 100/d
    expected_length = 100 / meiotic_distance if meiotic_distance > 0 else float('inf')
    
    # Generate PDF values
    pdf_values = [segment_length_pdf(xi, meiotic_distance) for xi in x]
    
    # Plot distribution
    plt.plot(x, pdf_values, '-', linewidth=2, color=color, 
             label=f"{rel_name} (untruncated mean={expected_length:.1f} cM)")

plt.title("Exponential Distributions for IBD Segment Lengths by Relationship")
plt.xlabel("Segment Length (cM)")
plt.ylabel("Probability Density")
plt.legend()
plt.grid(alpha=0.3)
plt.xlim(0, 100)
plt.show()
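
The difference between the untruncated mean $100/d$ and the mean of segments that survive the detection threshold can be checked by simulation. The sketch below is illustrative only: the function name `simulate_detected_lengths`, the sample size, and the seed are assumptions made for this example and are not part of Bonsai v3.

```python
import numpy as np

def simulate_detected_lengths(meiotic_distance, min_seg_len=7.0, n=200_000, seed=0):
    """Draw exponential segment lengths (mean 100/d cM) and keep those above the threshold."""
    rng = np.random.default_rng(seed)
    lengths = rng.exponential(scale=100.0 / meiotic_distance, size=n)
    return lengths[lengths > min_seg_len]

d, t = 4, 7.0  # first cousins, 7 cM threshold
n_draws = 200_000
detected = simulate_detected_lengths(d, t, n=n_draws)

# Fraction surviving the threshold should be close to exp(-d*t/100)
print(f"Fraction detected:    {detected.size / n_draws:.3f} (theory {np.exp(-d * t / 100):.3f})")
# Mean of surviving segments should be close to t + 100/d
print(f"Mean detected length: {detected.mean():.1f} cM (theory {t + 100 / d:.1f} cM)")
```

The simulated fraction corresponds to $p(t)$ from section 1.1, and the simulated mean to the threshold plus $100/d$.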

### 1.3 Distribution of Total IBD Sharing

The total amount of IBD shared between two relatives is the sum of the lengths of their shared segments. Its expectation is the expected number of segments multiplied by the expected length of a detected segment. For a relationship with meiotic distance $d$ and detection threshold $t$ (`min_seg_len` in the code), the expected total IBD sharing in centimorgans is:

$$E[\text{Total IBD}] = \lambda \cdot \Big(\frac{100}{d} + t\Big)$$

Where $\lambda$ is the expected number of segments above the threshold.

Let's implement this formula and calculate expected total IBD sharing for different relationships:

In [None]:
def calculate_expected_total_ibd(relation_tuple, min_seg_len=7, genome_length=3400):
    """Calculate expected total IBD sharing for a relationship.
    
    Args:
        relation_tuple: (up, down, num_ancs) tuple representing the relationship
        min_seg_len: Minimum segment length in cM
        genome_length: Total genome length in cM
        
    Returns:
        (Expected total IBD, Standard deviation)
    """
    up_meioses = relation_tuple[0]      # Generations up to common ancestor
    down_meioses = relation_tuple[1]    # Generations down from common ancestor
    
    # Total meiotic distance
    meiotic_distance = up_meioses + down_meioses
    
    # Handle special case for meiotic distance of 0 (self)
    if meiotic_distance == 0:
        return genome_length, 0
    
    # Calculate expected number of segments
    lambda_val = calculate_expected_segments(relation_tuple, min_seg_len, genome_length)
    
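    # Expected length of a detected segment (threshold plus mean excess 100/d)
    expected_length = 100 / meiotic_distance + min_seg_len
    
    # Expected total IBD
    expected_total = lambda_val * expected_length
    
    # Standard deviation of total IBD
    # Simplified formula: reflects only Poisson variation in the segment count
    # (std = sqrt(lambda)) and treats every segment as having the mean length
    std_dev = np.sqrt(lambda_val) * expected_length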
    
    return expected_total, std_dev

# Calculate expected total IBD for different relationships
total_ibd_data = []

for rel_tuple, rel_name, color in relationships:
    expected_total, std_dev = calculate_expected_total_ibd(rel_tuple)
    total_ibd_data.append({
        'relationship': rel_name,
        'expected_total_ibd': expected_total,
        'std_dev': std_dev,
        'coefficient_of_variation': std_dev / expected_total if expected_total > 0 else 0,
        'color': color
    })

# Convert to DataFrame for display
total_ibd_df = pd.DataFrame(total_ibd_data)
display(total_ibd_df[['relationship', 'expected_total_ibd', 'std_dev', 'coefficient_of_variation']])

# Visualize total IBD distributions
plt.figure(figsize=(12, 8))

# Range of total IBD values to plot
x_range = np.linspace(0, 3500, 1000)

for i, row in total_ibd_df.iterrows():
    # Generate normal distribution with mean and std_dev
    mean = row['expected_total_ibd']
    std = row['std_dev']
    
    # Normal approximation is reasonable for total IBD
    y = stats.norm.pdf(x_range, mean, std)
    
    # Plot distribution
    plt.plot(x_range, y, linewidth=2, color=row['color'], 
             label=f"{row['relationship']} ({mean:.1f} ± {std:.1f} cM)")

plt.title("Distribution of Total IBD Sharing by Relationship")
plt.xlabel("Total IBD (cM)")
plt.ylabel("Probability Density")
plt.legend()
plt.grid(alpha=0.3)
plt.xlim(0, 3500)
plt.show()
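
The normal curves above use a simplified standard deviation that reflects only variation in the segment count. Because total IBD is the sum of a Poisson number of random lengths (a compound Poisson sum), its variance is $\lambda \, E[X^2]$, where $X$ is the length of a detected segment. The sketch below compares the two under this lab's assumptions (segment length equals the threshold plus an exponential with mean $100/d$). The helper names `total_ibd_std` and `simulate_total_ibd` are hypothetical.

```python
import numpy as np

def total_ibd_std(lam, meiotic_distance, min_seg_len=7.0):
    """Return (simplified_std, compound_poisson_std) for total IBD in cM."""
    excess_mean = 100.0 / meiotic_distance          # mean length beyond the threshold
    mean_len = min_seg_len + excess_mean            # E[X]
    second_moment = excess_mean**2 + mean_len**2    # E[X^2] = Var(X) + E[X]^2
    simplified = np.sqrt(lam) * mean_len            # count variation only
    compound = np.sqrt(lam * second_moment)         # count and length variation
    return simplified, compound

def simulate_total_ibd(lam, meiotic_distance, min_seg_len=7.0, n_pairs=20_000, seed=0):
    """Simulate total IBD (cM) for n_pairs independent pairs of relatives."""
    rng = np.random.default_rng(seed)
    counts = rng.poisson(lam, size=n_pairs)
    scale = 100.0 / meiotic_distance
    # Each segment length is the threshold plus an exponential excess
    return np.array([(min_seg_len + rng.exponential(scale, size=k)).sum() for k in counts])

lam, d = 29.85, 4  # approximate first-cousin values from section 1.1
simplified, compound = total_ibd_std(lam, d)
simulated = simulate_total_ibd(lam, d).std()
print(f"simplified: {simplified:.1f} cM, compound Poisson: {compound:.1f} cM, "
      f"simulated: {simulated:.1f} cM")
```

Under these assumptions the simplified formula understates the spread, so the curves above are somewhat narrower than a simulation of the same model would suggest.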

## Part 2: The Three-Parameter Model in Bonsai v3

Bonsai v3 uses a three-parameter model to represent the statistical properties of IBD sharing. Let's explore this model and how it's used for relationship inference.

### 2.1 Relationship Representation in Bonsai v3

Bonsai v3 represents relationships as tuples $(up, down, num\_ancs)$ where:

- $up$: Generations up from the first individual to their common ancestor with the second individual
- $down$: Generations down from the common ancestor to the second individual
- $num\_ancs$: Number of common ancestors (1 for half-relationships, 2 for full relationships)

Let's explore how this representation maps to familiar relationship types:

In [None]:
# Create a comprehensive mapping of relationship tuples to descriptions
def describe_relationship(rel_tuple):
    """Convert a relationship tuple to a human-readable description."""
    up, down, num_ancs = rel_tuple
    
    if up == 0 and down == 0 and num_ancs == 2:
        return "Self"
    elif up == 0 and down == 1 and num_ancs == 1:
        return "Parent"
    elif up == 1 and down == 0 and num_ancs == 1:
        return "Child"
    elif up == 1 and down == 1 and num_ancs == 2:
        return "Full Sibling"
    elif up == 1 and down == 1 and num_ancs == 1:
        return "Half Sibling"
    elif up == 0 and down == 2 and num_ancs == 1:
        return "Grandparent"
    elif up == 2 and down == 0 and num_ancs == 1:
        return "Grandchild"
    # One shared ancestor makes these half relationships; full aunt/uncle is (1, 2, 2)
    elif up == 1 and down == 2 and num_ancs == 1:
        return "Half Aunt/Uncle"
    elif up == 2 and down == 1 and num_ancs == 1:
        return "Half Niece/Nephew"
    elif up == 2 and down == 2 and num_ancs == 2:
        return "Full First Cousin"
    elif up == 2 and down == 2 and num_ancs == 1:
        return "Half First Cousin"
    elif up == 3 and down == 3 and num_ancs == 2:
        return "Full Second Cousin"
    elif up == 3 and down == 3 and num_ancs == 1:
        return "Half Second Cousin"
    elif up == 4 and down == 4 and num_ancs == 2:
        return "Full Third Cousin"
    elif up == 4 and down == 4 and num_ancs == 1:
        return "Half Third Cousin"
    # (1, 3) and (3, 1) are great-aunt/uncle relationships;
    # first cousins once removed would be (2, 3) or (3, 2)
    elif up == 1 and down == 3 and num_ancs == 1:
        return "Half Great-Aunt/Uncle"
    elif up == 3 and down == 1 and num_ancs == 1:
        return "Half Great-Niece/Nephew"
    else:
        return f"Complex Relationship (up={up}, down={down}, num_ancs={num_ancs})"

# Generate a comprehensive list of relationship tuples
relationship_tuples = []

# Add standard relationships
standard_tuples = [
    (0, 0, 2),  # Self
    (0, 1, 1),  # Parent
    (1, 0, 1),  # Child
    (1, 1, 2),  # Full Sibling
    (1, 1, 1),  # Half Sibling
    (0, 2, 1),  # Grandparent
    (2, 0, 1),  # Grandchild
    (1, 2, 1),  # Half Aunt/Uncle (one shared ancestor)
    (2, 1, 1),  # Half Niece/Nephew (one shared ancestor)
    (2, 2, 2),  # Full First Cousin
    (2, 2, 1),  # Half First Cousin
    (3, 3, 2),  # Full Second Cousin
    (3, 3, 1),  # Half Second Cousin
    (4, 4, 2),  # Full Third Cousin
    (4, 4, 1),  # Half Third Cousin
    (1, 3, 1),  # Half Great-Aunt/Uncle
    (3, 1, 1),  # Half Great-Niece/Nephew
]

for rel_tuple in standard_tuples:
    relationship_tuples.append({
        'up': rel_tuple[0],
        'down': rel_tuple[1],
        'num_ancs': rel_tuple[2],
        'relationship': describe_relationship(rel_tuple),
        'meiotic_distance': rel_tuple[0] + rel_tuple[1],
    })

# Convert to DataFrame and display
relationship_df = pd.DataFrame(relationship_tuples)
display(relationship_df)
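
Because `up` is measured from the first individual and `down` toward the second, swapping the two individuals swaps `up` and `down` while leaving `num_ancs` unchanged. This is why Parent `(0, 1, 1)` and Child `(1, 0, 1)` appear as separate rows with the same meiotic distance. The helpers below are a hypothetical illustration of that symmetry, not Bonsai v3 functions.

```python
def reverse_relationship(rel_tuple):
    """Return the same relationship seen from the second individual's side."""
    up, down, num_ancs = rel_tuple
    return (down, up, num_ancs)

def meiotic_distance(rel_tuple):
    """Total number of meioses separating the two individuals."""
    return rel_tuple[0] + rel_tuple[1]

parent = (0, 1, 1)
child = reverse_relationship(parent)
print(child)                                                # (1, 0, 1)
print(meiotic_distance(parent) == meiotic_distance(child))  # True
```

Symmetric relationships such as full first cousins `(2, 2, 2)` are unchanged by the reversal.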

### 2.2 The Three Parameters: Lambda, Mean Length, and Background IBD

In Bonsai v3, each relationship type is characterized by three key parameters:

1. **Lambda (λ)**: Expected number of IBD segments above threshold
2. **Mean Segment Length**: Expected average length of IBD segments
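3. **Background IBD**: Model for chance IBD sharing and false positives. The simplified code in this lab does not model background IBD; as its third quantity it reports the expected IBD2 proportion instead.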

Let's explore how these parameters are calculated for different relationships:

In [None]:
def analytical_ibise_distribution(relation_tuple, genome_length=3400, min_seg_len=7):
    """Calculate the IBD distribution parameters for a relationship.
    
    Args:
        relation_tuple: (up, down, num_ancs) tuple representing the relationship
        genome_length: Total genome length in cM
        min_seg_len: Minimum segment length in cM
        
    Returns:
        Tuple of (k_mean, k_std, T_mean, T_std, ibd2_mean, ibd2_std)
        k_mean: Expected number of segments
        k_std: Standard deviation of segment count
        T_mean: Expected total IBD length
        T_std: Standard deviation of total IBD length
        ibd2_mean: Expected IBD2 proportion
        ibd2_std: Standard deviation of IBD2 proportion
    """
    up_meioses = relation_tuple[0]     # Generations up to common ancestor
    down_meioses = relation_tuple[1]   # Generations down from common ancestor
    num_ancestors = relation_tuple[2]  # Number of common ancestors
    
    # Handle special case of self-comparison
    if up_meioses == 0 and down_meioses == 0 and num_ancestors == 2:
        return (0, 0, genome_length, 0, 1.0, 0.0)
    
    # Expected IBD2 proportion
    ibd2_mean = 0.0
    ibd2_std = 0.0
    
    # Special case for full siblings, which have IBD2 regions
    if up_meioses == 1 and down_meioses == 1 and num_ancestors == 2:
        ibd2_mean = 0.25  # 25% of the genome is expected to be IBD2 for full siblings
        ibd2_std = 0.05   # Approximated standard deviation
    
    # Number of recombinations per meiosis
    num_recs_per_gen = genome_length / 100  # ~34 for human genome
    
    # Total meiotic distance
    g = up_meioses + down_meioses
    
    # Expected mean number of segments pre-filtering
    expected_mean_k = (num_ancestors * (num_recs_per_gen * g + 22) * (1 / (2 ** (g - 1))))
    
    # Probability a segment exceeds the minimum threshold
    p_obs = np.exp(-g * min_seg_len / 100)
    
    # Final expected segment count and variance
    k_mean = expected_mean_k * p_obs
    k_std = np.sqrt(k_mean)  # For Poisson distribution, std = sqrt(mean)
    
    # Expected length of a detected segment: for exponential lengths truncated
    # at min_seg_len this is min_seg_len + 100/g (consistent with section 1.3)
    mean_seg_len = 100 / g + min_seg_len
    
    # Expected total IBD length = expected segment count * expected segment length
    T_mean = mean_seg_len * k_mean
    
    # Standard deviation of total length
    # Simplified: reflects only Poisson variation in the segment count
    T_std = mean_seg_len * np.sqrt(k_mean)
    
    return (k_mean, k_std, T_mean, T_std, ibd2_mean, ibd2_std)

# Calculate the three parameters for each relationship type
parameter_data = []

for _, row in relationship_df.iterrows():
    rel_tuple = (row['up'], row['down'], row['num_ancs'])
    k_mean, k_std, T_mean, T_std, ibd2_mean, ibd2_std = analytical_ibise_distribution(rel_tuple)
    
    parameter_data.append({
        'relationship': row['relationship'],
        'meiotic_distance': row['meiotic_distance'],
        'lambda': k_mean,
        'mean_segment_length': 100 / row['meiotic_distance'] if row['meiotic_distance'] > 0 else np.inf,
        'expected_total_ibd': T_mean,
        'ibd2_proportion': ibd2_mean
    })

# Convert to DataFrame and display
parameter_df = pd.DataFrame(parameter_data)
display(parameter_df)
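
The value `ibd2_mean = 0.25` used for full siblings follows from Mendelian segregation. Assuming the parents are unrelated, at any locus two full siblings receive the same paternal haplotype with probability $\frac{1}{2}$ and, independently, the same maternal haplotype with probability $\frac{1}{2}$, so:

$$P(\text{IBD2}) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}, \qquad P(\text{IBD1}) = 2 \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{2}, \qquad P(\text{IBD0}) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}$$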

### 2.3 Visualizing the Three-Parameter Model

Let's visualize how these three parameters vary across different relationship types:

In [None]:
# Create a visualization of the three parameters
plt.figure(figsize=(15, 12))

# Plot Lambda (Expected Segment Count)
plt.subplot(3, 1, 1)
bars1 = plt.bar(parameter_df['relationship'], parameter_df['lambda'], color='#1f77b4')
plt.xticks(rotation=90)
plt.ylabel('Expected Segment Count (λ)')
plt.title('Parameter 1: Lambda (Expected Number of IBD Segments)')
plt.grid(axis='y', alpha=0.3)
for bar in bars1:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
             f'{height:.1f}', ha='center', va='bottom')

# Plot Mean Segment Length
plt.subplot(3, 1, 2)
# Filter out infinite values (self)
filtered_df = parameter_df[parameter_df['mean_segment_length'] < 1000]
bars2 = plt.bar(filtered_df['relationship'], filtered_df['mean_segment_length'], color='#ff7f0e')
plt.xticks(rotation=90)
plt.ylabel('Mean Segment Length (cM)')
plt.title('Parameter 2: Expected Mean Segment Length')
plt.grid(axis='y', alpha=0.3)
for bar in bars2:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
             f'{height:.1f}', ha='center', va='bottom')

# Plot IBD2 Proportion
plt.subplot(3, 1, 3)
bars3 = plt.bar(parameter_df['relationship'], parameter_df['ibd2_proportion'] * 100, color='#2ca02c')
plt.xticks(rotation=90)
plt.ylabel('IBD2 Proportion (%)')
plt.title('Parameter 3: Expected IBD2 Proportion')
plt.grid(axis='y', alpha=0.3)
for bar in bars3:
    height = bar.get_height()
    if height > 0:
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                 f'{height:.1f}%', ha='center', va='bottom')

plt.tight_layout()
plt.show()

## Part 3: Likelihood Functions for Relationship Inference

Bonsai v3 uses likelihood functions to quantify how well observed IBD statistics match the expected patterns for different relationships. Let's implement simplified versions of these likelihood functions.

### 3.1 Likelihood Function for Observed Segment Counts

The likelihood of observing $k$ segments given relationship $r$ is based on the Poisson distribution:

$$P(k|r) = \frac{\lambda^k e^{-\lambda}}{k!}$$

Where $\lambda$ is the expected number of segments for relationship $r$.

Let's implement this function:

In [None]:
def segment_count_likelihood(observed_count, relation_tuple, min_seg_len=7, genome_length=3400):
    """Calculate the log-likelihood of observing a specific segment count given a relationship.
    
    Args:
        observed_count: Number of observed IBD segments
        relation_tuple: (up, down, num_ancs) tuple representing the relationship
        min_seg_len: Minimum segment length in cM
        genome_length: Total genome length in cM
        
    Returns:
        Log-likelihood of the observed count
    """
    # Get expected segment count (lambda)
    lambda_val = calculate_expected_segments(relation_tuple, min_seg_len, genome_length)
    
    # Handle special case of self-comparison
    if relation_tuple[0] == 0 and relation_tuple[1] == 0 and relation_tuple[2] == 2:
        # Self should have no segments (all DNA is identical)
        return 0.0 if observed_count == 0 else float('-inf')
    
    # Calculate log-likelihood using Poisson PMF
    log_likelihood = stats.poisson.logpmf(observed_count, lambda_val)
    
    return log_likelihood

# Test the likelihood function with different segment counts
test_counts = [5, 10, 20, 30, 40, 50]
test_relationships = [
    ((0, 1, 1), "Parent-Child"),        # Parent-child
    ((1, 1, 2), "Full Siblings"),       # Full siblings
    ((1, 1, 1), "Half Siblings"),       # Half siblings
    ((2, 2, 2), "First Cousins"),       # First cousins
]

# Calculate log-likelihoods for each combination
likelihood_data = []

for count in test_counts:
    row_data = {'observed_count': count}
    
    for rel_tuple, rel_name in test_relationships:
        log_ll = segment_count_likelihood(count, rel_tuple)
        row_data[rel_name] = log_ll
    
    likelihood_data.append(row_data)

# Convert to DataFrame and display
likelihood_df = pd.DataFrame(likelihood_data)
display(likelihood_df)

# Visualize the likelihood function
plt.figure(figsize=(12, 8))

counts = np.arange(0, 100, 1)

for rel_tuple, rel_name in test_relationships:
    log_likelihoods = [segment_count_likelihood(c, rel_tuple) for c in counts]
    
    # Convert to regular likelihoods for better visualization
    likelihoods = np.exp(log_likelihoods)
    
    plt.plot(counts, likelihoods, '-', linewidth=2, label=rel_name)

plt.title("Likelihood of Observed Segment Counts by Relationship")
plt.xlabel("Number of Observed IBD Segments")
plt.ylabel("Likelihood")
plt.legend()
plt.grid(alpha=0.3)
plt.xlim(0, 75)
plt.show()
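
Taking logarithms of the Poisson formula in section 3.1 gives $\log P(k|r) = k \log \lambda - \lambda - \log k!$. The sketch below uses only the standard library to compare two candidate relationships for one observed count. The function name `poisson_log_pmf` and the rounded $\lambda$ values (computed from the formula in section 1.1 with a 7 cM threshold) are assumptions for illustration.

```python
import math

def poisson_log_pmf(k, lam):
    """Log of the Poisson PMF: k*log(lam) - lam - log(k!)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

observed_count = 30
lambda_first_cousins = 29.85  # a=2, d=4
lambda_half_siblings = 39.12  # a=1, d=2

ll_cousins = poisson_log_pmf(observed_count, lambda_first_cousins)
ll_half_sibs = poisson_log_pmf(observed_count, lambda_half_siblings)

# Positive values favor first cousins over half siblings
log_ratio = ll_cousins - ll_half_sibs
print(f"log-likelihood ratio (first cousins vs half siblings): {log_ratio:.2f}")
```

Working with the difference of log-likelihoods avoids the very small probabilities that arise when $k$ is far from $\lambda$.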

### 3.2 Likelihood Function for Total IBD Sharing

The likelihood of observing total IBD sharing $T$ given relationship $r$ is based on a normal approximation:

$$P(T|r) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(T-\mu)^2}{2\sigma^2}}$$

Where $\mu$ is the expected total IBD and $\sigma$ is the standard deviation for relationship $r$.

Let's implement this function:

In [None]:
def total_ibd_likelihood(observed_total, relation_tuple, min_seg_len=7, genome_length=3400):
    """Calculate the log-likelihood of observing a specific total IBD sharing given a relationship.
    
    Args:
        observed_total: Observed total IBD sharing in cM
        relation_tuple: (up, down, num_ancs) tuple representing the relationship
        min_seg_len: Minimum segment length in cM
        genome_length: Total genome length in cM
        
    Returns:
        Log-likelihood of the observed total IBD
    """
    # Calculate expected total IBD and standard deviation
    expected_total, std_dev = calculate_expected_total_ibd(relation_tuple, min_seg_len, genome_length)
    
    # Handle special case of self-comparison
    if relation_tuple[0] == 0 and relation_tuple[1] == 0 and relation_tuple[2] == 2:
        # Self should have total IBD equal to genome length
        return 0.0 if abs(observed_total - genome_length) < 1 else float('-inf')
    
    # Handle case of very small std_dev to avoid division by zero
    if std_dev < 1:
        std_dev = 1
    
    # Calculate log-likelihood using normal PDF
    log_likelihood = stats.norm.logpdf(observed_total, expected_total, std_dev)
    
    return log_likelihood

# Test the likelihood function with different total IBD values
test_totals = [50, 100, 500, 1000, 1500, 2000, 3000]

# Calculate log-likelihoods for each combination
total_likelihood_data = []

for total in test_totals:
    row_data = {'observed_total': total}
    
    for rel_tuple, rel_name in test_relationships:
        log_ll = total_ibd_likelihood(total, rel_tuple)
        row_data[rel_name] = log_ll
    
    total_likelihood_data.append(row_data)

# Convert to DataFrame and display
total_likelihood_df = pd.DataFrame(total_likelihood_data)
display(total_likelihood_df)

# Visualize the likelihood function
plt.figure(figsize=(12, 8))

totals = np.linspace(0, 3500, 1000)

for rel_tuple, rel_name in test_relationships:
    log_likelihoods = [total_ibd_likelihood(t, rel_tuple) for t in totals]
    
    # Convert to regular likelihoods for better visualization
    likelihoods = np.exp(log_likelihoods)
    
    plt.plot(totals, likelihoods, '-', linewidth=2, label=rel_name)

plt.title("Likelihood of Observed Total IBD Sharing by Relationship")
plt.xlabel("Observed Total IBD (cM)")
plt.ylabel("Likelihood")
plt.legend()
plt.grid(alpha=0.3)
plt.show()
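
Log-likelihoods for competing relationships are easier to compare once converted to normalized weights. Under a uniform prior over the candidate relationships (an assumption made here for illustration, not something this lab states about Bonsai v3), those weights are posterior probabilities. Subtracting the maximum before exponentiating avoids numerical underflow. `normalize_log_likelihoods` is a hypothetical helper, and the input values are made up for the example.

```python
import numpy as np

def normalize_log_likelihoods(log_likelihoods):
    """Convert log-likelihoods to weights that sum to 1 (uniform prior assumed)."""
    ll = np.asarray(log_likelihoods, dtype=float)
    finite = np.isfinite(ll)
    if not finite.any():
        raise ValueError("At least one log-likelihood must be finite")
    # Shift by the largest finite value so the largest weight is exp(0) = 1
    shifted = ll - ll[finite].max()
    weights = np.exp(shifted)  # exp(-inf) evaluates to 0.0
    return weights / weights.sum()

# Illustrative log-likelihoods for three candidate relationships;
# -inf marks a relationship that is impossible given the data
example_ll = [-2.6, -3.8, float("-inf")]
probs = normalize_log_likelihoods(example_ll)
print(probs)
```

The ordering of candidates is unchanged by normalization; only the scale becomes easier to interpret.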

### 3.3 Combined Likelihood Function

Bonsai v3 combines the likelihoods from segment count, total IBD, and IBD2 proportion to create a comprehensive likelihood function for relationship inference. Let's implement a simplified version:

In [None]:
def combined_likelihood(observed_count, observed_total, observed_ibd2, relation_tuple, min_seg_len=7, genome_length=3400):
    """Calculate the combined log-likelihood for observed IBD statistics given a relationship.
    
    Args:
        observed_count: Observed number of IBD segments
        observed_total: Observed total IBD sharing in cM
        observed_ibd2: Observed IBD2 proportion (0-1)
        relation_tuple: (up, down, num_ancs) tuple representing the relationship
        min_seg_len: Minimum segment length in cM
        genome_length: Total genome length in cM
        
    Returns:
        Combined log-likelihood
    """
    # Calculate likelihood for segment count
    count_ll = segment_count_likelihood(observed_count, relation_tuple, min_seg_len, genome_length)
    
    # Calculate likelihood for total IBD
    total_ll = total_ibd_likelihood(observed_total, relation_tuple, min_seg_len, genome_length)
    
    # Calculate likelihood for IBD2 proportion
    ibd2_ll = 0.0  # Default for most relationships (no IBD2)
    
    # Special case for full siblings which have IBD2
    if relation_tuple[0] == 1 and relation_tuple[1] == 1 and relation_tuple[2] == 2:
        # Expected IBD2 proportion is 0.25 for full siblings
        expected_ibd2 = 0.25
        ibd2_std = 0.05  # Approximated standard deviation
        ibd2_ll = stats.norm.logpdf(observed_ibd2, expected_ibd2, ibd2_std)
    elif relation_tuple[0] == 0 and relation_tuple[1] == 0 and relation_tuple[2] == 2:
        # Self should have IBD2 = 1.0
        ibd2_ll = 0.0 if abs(observed_ibd2 - 1.0) < 0.01 else float('-inf')
    else:
        # Other relationships should have no IBD2
        ibd2_ll = 0.0 if observed_ibd2 < 0.01 else stats.norm.logpdf(observed_ibd2, 0, 0.01)
    
    # Combine likelihoods (sum in log space = product in original space)
    combined_ll = count_ll + total_ll + ibd2_ll
    
    return combined_ll

# Function to infer relationship from observed IBD statistics
def infer_relationship(observed_count, observed_total, observed_ibd2, max_degree=4):
    """Infer the most likely relationship from observed IBD statistics.
    
    Args:
        observed_count: Observed number of IBD segments
        observed_total: Observed total IBD sharing in cM
        observed_ibd2: Observed IBD2 proportion (0-1)
        max_degree: Maximum number of generations up or down from the common ancestor to consider
        
    Returns:
        List of (relationship tuple, log-likelihood) sorted by likelihood
    """
    # Generate all possible relationship tuples up to max_degree
    relationship_tuples = []
    
    # Add self relationship
    relationship_tuples.append((0, 0, 2))
    
    # Add direct lineage relationships
    for d in range(1, max_degree + 1):
        relationship_tuples.append((0, d, 1))  # Ancestor
        relationship_tuples.append((d, 0, 1))  # Descendant
    
    # Add non-direct lineage relationships; up and down are each capped at
    # max_degree, so the total number of meioses is at most 2 * max_degree
    for up in range(1, max_degree + 1):
        for down in range(1, max_degree + 1):
            relationship_tuples.append((up, down, 2))  # Full relationship
            relationship_tuples.append((up, down, 1))  # Half relationship
    
    # Calculate likelihood for each relationship
    likelihoods = []
    
    for rel_tuple in relationship_tuples:
        log_ll = combined_likelihood(observed_count, observed_total, observed_ibd2, rel_tuple)
        likelihoods.append((rel_tuple, log_ll))
    
    # Sort by likelihood (highest first)
    likelihoods.sort(key=lambda x: x[1], reverse=True)
    
    return likelihoods

# Test the relationship inference with some example observations
test_observations = [
    (35, 3400, 0.0, "Expected Parent-Child"),  # Parent-child
    (40, 2500, 0.25, "Expected Full Siblings"),  # Full siblings
    (25, 1700, 0.0, "Expected Half Siblings"),  # Half siblings
    (15, 800, 0.0, "Expected First Cousins"),  # First cousins
    (8, 400, 0.0, "Expected Second Cousins"),  # Second cousins
]

# Infer relationships for test observations
for obs_count, obs_total, obs_ibd2, label in test_observations:
    print(f"\n{label}:")
    print(f"Observed: {obs_count} segments, {obs_total} cM total, {obs_ibd2*100:.1f}% IBD2")
    
    likelihoods = infer_relationship(obs_count, obs_total, obs_ibd2)
    
    print("Top 3 most likely relationships:")
    for i, (rel_tuple, log_ll) in enumerate(likelihoods[:3]):
        rel_desc = describe_relationship(rel_tuple)
        print(f"{i+1}. {rel_desc} ({rel_tuple}): Log-likelihood = {log_ll:.2f}")
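
Raw log-likelihoods are hard to compare by eye. One common way to read a ranked list like the one above is to normalize it into probabilities that sum to 1. The sketch below assumes a uniform prior over the candidate relationships, which is an assumption made here for illustration, and `normalize_log_likelihoods` is a hypothetical helper, not part of Bonsai v3. The example values are made up.

```python
import numpy as np

def normalize_log_likelihoods(log_likelihoods):
    """Convert log-likelihoods to probabilities, assuming a uniform prior."""
    ll = np.asarray(log_likelihoods, dtype=float)
    top = ll.max()
    if not np.isfinite(top):
        raise ValueError("need at least one finite log-likelihood")
    # Subtract the maximum before exponentiating to avoid underflow;
    # entries equal to -inf become exactly 0 after np.exp.
    weights = np.exp(ll - top)
    return weights / weights.sum()

# Hypothetical log-likelihoods for three candidate relationships
example_ll = [-12.3, -14.3, float('-inf')]
posterior = normalize_log_likelihoods(example_ll)
print(posterior)
```

A gap of 2 in log-likelihood corresponds to a likelihood ratio of about 7.4, so two candidates that look close on the log scale can still be well separated.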

## Part 4: Stochasticity and Uncertainty in IBD Sharing

One of the key challenges in relationship inference from IBD segments is the stochastic nature of genetic inheritance. Even for a fixed relationship type, there can be significant variation in observed IBD statistics due to random chance.

Let's simulate this stochasticity to understand its impact on relationship inference.

In [None]:
def simulate_ibd_statistics(relation_tuple, num_simulations=1000, min_seg_len=7, genome_length=3400):
    """Simulate IBD statistics for a given relationship, accounting for stochasticity.
    
    Args:
        relation_tuple: (up, down, num_ancs) tuple representing the relationship
        num_simulations: Number of simulations to run
        min_seg_len: Minimum segment length in cM
        genome_length: Total genome length in cM
        
    Returns:
        DataFrame with simulated statistics
    """
    # Get distribution parameters
    k_mean, k_std, T_mean, T_std, ibd2_mean, ibd2_std = analytical_ibise_distribution(
        relation_tuple, genome_length, min_seg_len)
    
    # Handle special case of self-comparison
    if relation_tuple[0] == 0 and relation_tuple[1] == 0 and relation_tuple[2] == 2:
        return pd.DataFrame([
            {'segment_count': 0, 'total_ibd': genome_length, 'ibd2_proportion': 1.0}
        ] * num_simulations)
    
    # Simulate IBD statistics
    simulations = []
    
    for _ in range(num_simulations):
        # Simulate segment count (Poisson distribution)
        segment_count = max(0, int(np.random.poisson(k_mean)))
        
        # Simulate total IBD (Normal distribution, but ensure it's positive)
        total_ibd = max(0, np.random.normal(T_mean, T_std))
        
        # Simulate IBD2 proportion (Normal distribution, but bound between 0 and 1)
        if ibd2_mean > 0 and ibd2_std > 0:
            ibd2_proportion = np.clip(np.random.normal(ibd2_mean, ibd2_std), 0, 1)
        else:
            ibd2_proportion = 0.0
        
        simulations.append({
            'segment_count': segment_count,
            'total_ibd': total_ibd,
            'ibd2_proportion': ibd2_proportion
        })
    
    return pd.DataFrame(simulations)

# Simulate statistics for different relationships
simulations = {}

for rel_tuple, rel_name in test_relationships:
    simulations[rel_name] = simulate_ibd_statistics(rel_tuple, num_simulations=1000)

# Visualize the simulations
plt.figure(figsize=(15, 12))

# Plot segment count distributions
plt.subplot(3, 1, 1)
for rel_name, sim_df in simulations.items():
    sns.kdeplot(sim_df['segment_count'], label=rel_name)
plt.title('Segment Count Distributions')
plt.xlabel('Number of IBD Segments')
plt.ylabel('Density')
plt.legend()
plt.grid(alpha=0.3)

# Plot total IBD distributions
plt.subplot(3, 1, 2)
for rel_name, sim_df in simulations.items():
    sns.kdeplot(sim_df['total_ibd'], label=rel_name)
plt.title('Total IBD Distributions')
plt.xlabel('Total IBD (cM)')
plt.ylabel('Density')
plt.legend()
plt.grid(alpha=0.3)

# Plot IBD2 proportion distributions
plt.subplot(3, 1, 3)
for rel_name, sim_df in simulations.items():
    if sim_df['ibd2_proportion'].mean() > 0:
        sns.kdeplot(sim_df['ibd2_proportion'], label=rel_name)
plt.title('IBD2 Proportion Distributions')
plt.xlabel('IBD2 Proportion')
plt.ylabel('Density')
plt.legend()
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()
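
The simulation above draws segment count and total IBD independently of each other. In a generative view the two are coupled, because the total is the sum of the individual segment lengths. The sketch below shows one way to couple them with a compound Poisson model. It assumes Poisson counts and exponentially distributed lengths above the detection threshold. The function name `simulate_compound_ibd` and the parameter values are hypothetical and are not taken from Bonsai v3.

```python
import numpy as np

def simulate_compound_ibd(lam, mean_excess_len, min_seg_len=7.0,
                          num_simulations=1000, seed=0):
    """Draw (segment_count, total_ibd) pairs from a compound Poisson model.

    Illustrative assumptions: counts are Poisson(lam), and each detected
    segment has length min_seg_len plus an Exponential(mean_excess_len) draw.
    """
    rng = np.random.default_rng(seed)
    counts = rng.poisson(lam, size=num_simulations)
    # Sum the segment lengths for each simulated pair (0.0 when count is 0)
    totals = np.array([
        (min_seg_len + rng.exponential(mean_excess_len, size=k)).sum()
        for k in counts
    ])
    return counts, totals

# Hypothetical parameter values chosen only for illustration
counts, totals = simulate_compound_ibd(lam=20.0, mean_excess_len=25.0)
print(f"Correlation between count and total: {np.corrcoef(counts, totals)[0, 1]:.2f}")
```

Under this model, pairs with more segments also tend to have more total IBD, a dependence that independent draws cannot reproduce.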

### 4.1 Impact of Stochasticity on Relationship Inference

Let's examine how this stochasticity affects our ability to accurately infer relationships:

In [None]:
from collections import defaultdict  # needed for the confusion counts below

def assess_relationship_inference_accuracy(simulations):
    """Assess accuracy of relationship inference using simulated data.
    
    Args:
        simulations: Dictionary of {relationship_name: simulation_dataframe}
        
    Returns:
        DataFrame with inference accuracy statistics
    """
    # Map relationship names to tuples
    relationship_map = {
        "Parent-Child": (0, 1, 1),
        "Full Siblings": (1, 1, 2),
        "Half Siblings": (1, 1, 1),
        "First Cousins": (2, 2, 2)
    }
    
    accuracy_data = []
    
    # For each true relationship
    for true_rel_name, sim_df in simulations.items():
        # Count correct inferences
        correct_count = 0
        confusions = defaultdict(int)
        
        # Sample 100 simulations for faster computation
        sampled_df = sim_df.sample(100) if len(sim_df) > 100 else sim_df
        
        for i, row in sampled_df.iterrows():
            # Infer relationship
            likelihoods = infer_relationship(
                row['segment_count'], row['total_ibd'], row['ibd2_proportion'])
            
            # Get top inferred relationship
            inferred_rel_tuple, _ = likelihoods[0]
            inferred_rel_name = describe_relationship(inferred_rel_tuple)
            
            # Check if correct
            if inferred_rel_tuple == relationship_map.get(true_rel_name):
                correct_count += 1
            else:
                confusions[inferred_rel_name] += 1
        
        # Calculate accuracy
        accuracy = correct_count / len(sampled_df)
        
        # Find top confusion
        top_confusion = ""
        if confusions:
            top_confusion = max(confusions.items(), key=lambda x: x[1])[0]
        
        accuracy_data.append({
            'true_relationship': true_rel_name,
            'accuracy': accuracy,
            'top_confusion': top_confusion,
            'confusion_rate': max(confusions.values()) / len(sampled_df) if confusions else 0
        })
    
    return pd.DataFrame(accuracy_data)

# Assess inference accuracy
accuracy_df = assess_relationship_inference_accuracy(simulations)
display(accuracy_df)

# Visualize accuracy
plt.figure(figsize=(10, 6))
bars = plt.bar(accuracy_df['true_relationship'], accuracy_df['accuracy'] * 100)
plt.xlabel('True Relationship')
plt.ylabel('Inference Accuracy (%)')
plt.title('Relationship Inference Accuracy')
plt.grid(axis='y', alpha=0.3)

# Add accuracy labels
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 1,
             f'{height:.1f}%', ha='center', va='bottom')

# Add confusion information
for i, row in accuracy_df.iterrows():
    if row['confusion_rate'] > 0:
        plt.text(i, row['accuracy'] * 100 / 2,
                 f"Confused with\n{row['top_confusion']}\n({row['confusion_rate']*100:.1f}%)",
                 ha='center', va='center', color='white', fontweight='bold')

plt.ylim(0, 105)
plt.show()
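
The accuracy table above keeps only the single most frequent confusion for each true relationship. A full confusion matrix shows every pairing at once. The sketch below builds one with `pd.crosstab`. The label lists are made-up placeholders for illustration, not results from the simulations in this lab.

```python
import pandas as pd

# Hypothetical (true, inferred) labels, for illustration only
true_labels = ["Half Siblings"] * 4 + ["First Cousins"] * 4
inferred_labels = [
    "Half Siblings", "Half Siblings", "Half Siblings", "First Cousins",
    "First Cousins", "First Cousins", "Half Siblings", "Half Siblings",
]

# Rows are true relationships, columns are inferred relationships;
# normalize="index" turns each row into proportions that sum to 1
confusion = pd.crosstab(
    pd.Series(true_labels, name="true"),
    pd.Series(inferred_labels, name="inferred"),
    normalize="index",
)
print(confusion)
```

The diagonal entries are the per-relationship accuracies, and the off-diagonal entries show which relationships are mistaken for which.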

## Summary

In this lab, we've explored the statistical models of genetic inheritance used in Bonsai v3. Key takeaways include:

1. **Mathematical Foundations**: IBD segment counts are modeled with a Poisson distribution and segment lengths with an exponential distribution. The parameters of both distributions depend on the relationship type.

2. **Three-Parameter Model**: Bonsai v3 characterizes relationships using three key parameters: expected segment count (λ), mean segment length, and IBD2 proportion.

3. **Likelihood Functions**: Relationship inference is done by computing likelihoods of observed IBD statistics under different relationship models, combining evidence from segment counts, total IBD, and IBD2 proportion.

4. **Stochasticity**: Because genetic inheritance is random, IBD sharing varies between pairs even for the same relationship type. When the distributions for two relationship types overlap, inference between them becomes ambiguous.

5. **Inference Accuracy**: The accuracy of relationship inference varies by relationship type, with closer relationships generally being easier to distinguish than more distant ones.

These statistical foundations allow Bonsai v3 to make principled inferences about relationships from observed IBD patterns, which is the core of its pedigree reconstruction capabilities.

In [None]:
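# Convert this notebook to PDF using poetry
# Note: the filename below refers to the Lab 4 notebook; substitute this notebook's filename to convert Lab 5
!poetry run jupyter nbconvert --to pdf Lab04_IBD_Statistics_Extraction.ipynb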

# Note: PDF conversion requires LaTeX to be installed on your system
# If you encounter errors, you may need to install it:
# On Ubuntu/Debian: sudo apt-get install texlive-xetex
# On macOS with Homebrew: brew install texlive