# Lab 13: Mathematical Foundations of Bonsai

Building on the exploration of IBD segments in Lab 12, this lab covers the mathematical principles behind the Bonsai algorithm: the probabilistic framework, the likelihood functions, and the optimization techniques that drive Bonsai's pedigree reconstruction.

**Learning Objectives**:
- Master the probabilistic framework that underpins the Bonsai algorithm
- Understand how likelihood functions quantify the probability of observed IBD patterns
- Analyze the mathematical models for different relationship types
- Explore IBD moment calculations and their role in pedigree inference
- Examine Bonsai's optimization algorithms for finding maximum likelihood pedigrees
- Implement and interpret key mathematical components of the Bonsai algorithm

## Environment Setup

In [None]:
import os
import sys
import math
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from scipy.stats import poisson, expon, norm, multivariate_normal
from collections import defaultdict, deque
from pathlib import Path
from IPython.display import display, HTML
from dotenv import load_dotenv

## 1. The Probabilistic Framework of Bonsai

At its core, Bonsai is a statistical algorithm that uses probabilistic inference to reconstruct pedigrees from genetic data. The algorithm employs a Bayesian approach, balancing prior knowledge about relationship structures with the observed evidence from IBD segment patterns.

### Bayesian Inference in Pedigree Reconstruction

Bonsai uses Bayesian inference to reason about unobserved structures (pedigrees) based on observed data (IBD segments). The Bayesian framework is expressed as:

$$P(\text{Pedigree} | \text{IBD data}) \propto P(\text{IBD data} | \text{Pedigree}) \times P(\text{Pedigree})$$

where:
- $P(\text{Pedigree} | \text{IBD data})$ is the posterior probability of a pedigree given the IBD data
- $P(\text{IBD data} | \text{Pedigree})$ is the likelihood of observing the IBD data given a particular pedigree
- $P(\text{Pedigree})$ is the prior probability of the pedigree

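Bonsai seeks the pedigree that maximizes this posterior probability. Under a uniform prior, $P(\text{Pedigree})$ is the same constant for every pedigree, so maximizing the posterior is equivalent to maximizing the likelihood, or equivalently the log-likelihood.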

In [None]:
def log_posterior(pedigree, ibd_data, prior_weight=1.0):
    """Calculate the log posterior probability of a pedigree given IBD data.
    
    Args:
        pedigree: A representation of the pedigree structure
        ibd_data: Observed IBD segment data
        prior_weight: Weight to apply to the prior (default: 1.0)
        
    Returns:
        Log posterior probability
    """
    # Calculate log likelihood
    log_likelihood = calculate_log_likelihood(pedigree, ibd_data)
    
    # Calculate log prior
    log_prior = calculate_log_prior(pedigree) * prior_weight
    
    # Combine to get log posterior
    return log_likelihood + log_prior

def calculate_log_likelihood(pedigree, ibd_data):
    """Calculate the log likelihood of observing the IBD data given the pedigree.
    
    This is a simplified implementation for demonstration purposes.
    """
    # Placeholder for actual likelihood calculation
    # In a real implementation, this would use the mathematical models
    # described in the following sections
    return 0.0

def calculate_log_prior(pedigree):
    """Calculate the log prior probability of a pedigree.
    
    This is a simplified implementation for demonstration purposes.
    """
    # For simplicity, we'll use a uniform prior (log prior = 0)
    # In a real implementation, this might incorporate domain knowledge
    # such as demographic patterns, typical family structures, etc.
    return 0.0
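
To make the proportionality in Bayes' rule concrete, the sketch below normalizes log-posterior scores over a small set of candidate pedigrees. The candidate labels and the numeric scores are made up for illustration, and `normalize_log_posteriors` is a hypothetical helper, not part of Bonsai.

```python
import math

def normalize_log_posteriors(log_likelihoods, log_priors):
    """Turn per-candidate log-likelihoods and log-priors into posterior probabilities.

    Both arguments are dicts keyed by a candidate label (hypothetical labels here).
    """
    # Unnormalized log posterior: log P(data | pedigree) + log P(pedigree)
    log_post = {k: log_likelihoods[k] + log_priors[k] for k in log_likelihoods}

    # Log-sum-exp trick: subtract the maximum before exponentiating so that
    # very negative log values do not underflow to zero
    max_lp = max(log_post.values())
    log_norm = max_lp + math.log(sum(math.exp(v - max_lp) for v in log_post.values()))

    return {k: math.exp(v - log_norm) for k, v in log_post.items()}

# Made-up scores for three candidate pedigrees
log_likelihoods = {"half-siblings": -1040.0, "grandparent": -1041.0, "first-cousins": -1060.0}
uniform_log_prior = {k: math.log(1 / 3) for k in log_likelihoods}

posterior = normalize_log_posteriors(log_likelihoods, uniform_log_prior)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

Because the prior here is uniform, the ranking of candidates is the same as the ranking by log-likelihood alone; a non-uniform prior would shift the scores before normalization.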

## 2. Likelihood Functions: Quantifying the Evidence

The likelihood function measures how well a proposed pedigree explains the observed IBD data. In Bonsai, this function is built from probabilistic models of IBD segment inheritance.

### IBD Likelihood Models

For a pair of individuals, the likelihood of their IBD sharing given a specific relationship can be expressed as:

$$L(r | \text{IBD}) = P(\text{IBD} | r)$$

where:
- $r$ is the relationship type (e.g., parent-child, siblings, cousins)
- IBD represents the observed IBD segments between the individuals
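
As a concrete instance of $P(\text{IBD} | r)$, the expected IBD-state proportions for full siblings can be derived by enumeration at a single locus. The sketch assumes unrelated, non-inbred parents and Mendelian transmission; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def sibling_ibd_state_probabilities():
    """Enumerate which parental haplotype each of two full siblings inherits at one locus."""
    counts = {0: 0, 1: 0, 2: 0}
    # Each sibling receives haplotype 0 or 1 from the father and 0 or 1 from
    # the mother; all 16 combinations are equally likely
    for s1_pat, s1_mat, s2_pat, s2_mat in product((0, 1), repeat=4):
        # Number of parental haplotypes the two siblings share at this locus
        shared = int(s1_pat == s2_pat) + int(s1_mat == s2_mat)
        counts[shared] += 1
    total = sum(counts.values())
    return {f"IBD{k}": Fraction(v, total) for k, v in counts.items()}

probs = sibling_ibd_state_probabilities()
for state, p in probs.items():
    print(state, p)
```

The enumeration gives 1/4, 1/2, and 1/4 for IBD0, IBD1, and IBD2, which are the genome-wide averages a full-sibling model expects.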

Let's implement some basic likelihood models for different relationship types.

In [None]:
def parent_child_likelihood(segments):
    """Calculate likelihood of a parent-child relationship.
    
    Args:
        segments: List of IBD segments with attributes: length, type (IBD1 or IBD2)
        
    Returns:
        Log-likelihood of the relationship
    """
    # For parent-child:
    # - Expect IBD1 across entire genome (~3500 cM)
    # - No IBD2 segments expected
    
    # Calculate total IBD1 length
    ibd1_length = sum(seg['length'] for seg in segments if seg['type'] == 'IBD1')
    
    # Count IBD2 segments
    ibd2_count = sum(1 for seg in segments if seg['type'] == 'IBD2')
    
    # Calculate coverage (proportion of genome covered by IBD1)
    genome_length = 3500  # cM
    coverage = min(1.0, ibd1_length / genome_length)
    
    # High likelihood if coverage is close to 100% and no IBD2
    if coverage > 0.95 and ibd2_count == 0:
        return math.log(0.99)  # Log-likelihood close to 0 (the maximum)
    else:
        # Penalize based on how far the data are from the expectation
        penalty = ((1.0 - coverage) ** 2) * 10 + (ibd2_count ** 2)
        return -penalty  # Heuristic score used in place of a true log-likelihood

def sibling_likelihood(segments):
    """Calculate likelihood of a full sibling relationship.
    
    Args:
        segments: List of IBD segments with attributes: length, type (IBD1 or IBD2)
        
    Returns:
        Log-likelihood of the relationship
    """
    # For full siblings:
    # - Expect ~25% IBD0, ~50% IBD1, ~25% IBD2
    # - IBD1 and IBD2 segments together cover ~75% of the genome (~2625 of 3500 cM)
    
    # Calculate IBD lengths
    ibd1_length = sum(seg['length'] for seg in segments if seg['type'] == 'IBD1')
    ibd2_length = sum(seg['length'] for seg in segments if seg['type'] == 'IBD2')
    total_length = ibd1_length + 2 * ibd2_length  # IBD2 counts double
    
    # Calculate proportions
    genome_length = 3500  # cM
    ibd0_prop = max(0, 1 - (ibd1_length + ibd2_length) / genome_length)
    ibd1_prop = ibd1_length / genome_length
    ibd2_prop = ibd2_length / genome_length
    
    # Expected proportions for siblings
    expected_ibd0 = 0.25
    expected_ibd1 = 0.50
    expected_ibd2 = 0.25
    
    # Calculate squared error from expected proportions
    error = ((ibd0_prop - expected_ibd0) ** 2 + 
             (ibd1_prop - expected_ibd1) ** 2 + 
             (ibd2_prop - expected_ibd2) ** 2)
    
    # Convert to log-likelihood (higher is better)
    return -10 * error

def distant_relationship_likelihood(segments, meioses):
    """Calculate likelihood of a distant relationship with specified meioses.
    
    Args:
        segments: List of IBD segments with attributes: length, type (IBD1 or IBD2)
        meioses: Number of meioses separating the individuals
        
    Returns:
        Log-likelihood of the relationship
    """
    # For distant relationships:
    # - Expect segment count and length to follow theoretical distributions
    # - Almost all segments should be IBD1 (very rare IBD2)
    
    # Filter to include only segments above minimum threshold
    min_cm = 7  # Minimum segment length in cM
    filtered_segments = [seg for seg in segments 
                         if seg['type'] == 'IBD1' and seg['length'] >= min_cm]
    
    # Calculate observed moments
    segment_count = len(filtered_segments)
    total_length = sum(seg['length'] for seg in filtered_segments)
    
    # Calculate expected moments based on relationship
    relatedness = 2 ** (-meioses)
    expected_count = calculate_expected_segments(relatedness, min_cm)
    expected_length = calculate_expected_length(relatedness, min_cm)
    
    # Use Poisson distribution for segment count likelihood
    count_log_like = poisson.logpmf(segment_count, expected_count)
    
    # Approximate the total-length likelihood with a normal distribution
    # (a simplification; a Gamma model would respect the positive support)
    length_factor = 0.0
    if segment_count > 0 and expected_count > 0:
        # Relative spread shrinks as more segments are expected
        length_log_like = norm.logpdf(total_length, 
                                      expected_length, 
                                      expected_length / math.sqrt(expected_count))
        length_factor = length_log_like
    
    # Combine the likelihoods (weighting count more heavily)
    return count_log_like * 0.7 + length_factor * 0.3

# Helper functions for calculating expected moments
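def calculate_expected_segments(relatedness, min_cm=7):
    """Calculate expected number of IBD segments for a given relatedness."""
    # Unrelated pairs have no expected sharing (and log(0) is undefined)
    if relatedness <= 0:
        return 0.0
    
    # Based on theoretical model
    r = -math.log(relatedness) / math.log(2)  # meioses
    genome_length = 3500  # cM
    
    # Expected number of segments above min_cm
    # This is a simplified approximation; min_cm is converted from cM to
    # Morgans because the crossover rate is about 1 per Morgan per meiosis
    expected = (genome_length / 100) * relatedness * math.exp(-r * min_cm / 100) * 22
    return expected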

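def calculate_expected_length(relatedness, min_cm=7):
    """Calculate expected total length of IBD segments."""
    # Unrelated pairs have no expected sharing (and log(0) is undefined)
    if relatedness <= 0:
        return 0.0
    
    # Based on theoretical model
    r = -math.log(relatedness) / math.log(2)  # meioses
    genome_length = 3500  # cM
    
    # Expected total length in segments above min_cm
    # This is a simplified approximation; min_cm is converted from cM to
    # Morgans because the crossover rate is about 1 per Morgan per meiosis
    rate = r / 100  # per cM
    expected = genome_length * relatedness * (1 + rate * min_cm) * math.exp(-rate * min_cm)
    return expected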
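
The expected-moment helpers depend on how a minimum length threshold interacts with the segment length distribution. The sketch below works through that interaction under a common approximation, stated here as an assumption: segment lengths between relatives separated by `meioses` meioses are exponentially distributed with mean `100 / meioses` cM. The function names are hypothetical and this is not Bonsai's exact model.

```python
import math

def fraction_of_segments_above(min_cm, meioses):
    """P(segment length >= min_cm) for exponential lengths with mean 100/meioses cM."""
    rate_per_cm = meioses / 100.0
    return math.exp(-rate_per_cm * min_cm)

def fraction_of_length_above(min_cm, meioses):
    """Share of the total IBD length that lies in segments of at least min_cm."""
    rate_per_cm = meioses / 100.0
    # For an exponential with rate lam:
    # E[L; L >= t] / E[L] = (1 + lam * t) * exp(-lam * t)
    return (1.0 + rate_per_cm * min_cm) * math.exp(-rate_per_cm * min_cm)

print("meioses  mean_cM  frac_segments>=7cM  frac_length>=7cM")
for m in (2, 4, 6, 8, 10):
    print(f"{m:>7}  {100 / m:>7.1f}  "
          f"{fraction_of_segments_above(7, m):>18.3f}  "
          f"{fraction_of_length_above(7, m):>16.3f}")
```

Under this assumption, a fixed 7 cM threshold discards a growing share of segments as the relationship becomes more distant, while the share of total length retained falls more slowly because the discarded segments are the short ones.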

### Comparing Likelihood Models for Different Relationships

Let's compare how these likelihood models behave for different relationship scenarios. We'll create synthetic IBD data for various relationships and see how well our models can distinguish them.

In [None]:
def generate_synthetic_ibd(relationship_type, noise_level=0.1):
    """Generate synthetic IBD data for a specific relationship type.
    
    Args:
        relationship_type: One of 'parent-child', 'siblings', 'grandparent', 
                          'half-siblings', 'first-cousins', 'second-cousins'
        noise_level: Level of noise/randomness to add (0.0-1.0)
        
    Returns:
        List of synthetic IBD segments
    """
    genome_length = 3500  # cM
    segments = []
    
    if relationship_type == 'parent-child':
        # Parent-child: ~100% IBD1, no IBD2
        # Create segments covering the entire genome with some fragmentation
        remaining = genome_length
        while remaining > 0:
            # Create segments of varying sizes but maintain total coverage
            seg_length = min(remaining, random.uniform(50, 200))
            segments.append({
                'type': 'IBD1',
                'length': seg_length * (1.0 + random.uniform(-noise_level, noise_level))
            })
            remaining -= seg_length
            
    elif relationship_type == 'siblings':
        # Siblings: ~25% IBD0, ~50% IBD1, ~25% IBD2
        ibd0_target = 0.25 * genome_length
        ibd1_target = 0.50 * genome_length
        ibd2_target = 0.25 * genome_length
        
        # Add noise to targets
        ibd0_target *= (1.0 + random.uniform(-noise_level, noise_level))
        ibd1_target *= (1.0 + random.uniform(-noise_level, noise_level))
        ibd2_target *= (1.0 + random.uniform(-noise_level, noise_level))
        
        # Create IBD1 segments
        ibd1_remaining = ibd1_target
        while ibd1_remaining > 10:  # Minimum segment size
            seg_length = min(ibd1_remaining, random.uniform(10, 100))
            segments.append({
                'type': 'IBD1',
                'length': seg_length
            })
            ibd1_remaining -= seg_length
            
        # Create IBD2 segments
        ibd2_remaining = ibd2_target
        while ibd2_remaining > 10:  # Minimum segment size
            seg_length = min(ibd2_remaining, random.uniform(10, 100))
            segments.append({
                'type': 'IBD2',
                'length': seg_length
            })
            ibd2_remaining -= seg_length
            
    elif relationship_type in ['half-siblings', 'grandparent']:
        # Half-siblings/Grandparent: ~50% IBD0, ~50% IBD1, no IBD2
        ibd1_target = 0.50 * genome_length * (1.0 + random.uniform(-noise_level, noise_level))
        
        # Create IBD1 segments
        ibd1_remaining = ibd1_target
        while ibd1_remaining > 10:  # Minimum segment size
            seg_length = min(ibd1_remaining, random.uniform(10, 100))
            segments.append({
                'type': 'IBD1',
                'length': seg_length
            })
            ibd1_remaining -= seg_length
            
    elif relationship_type == 'first-cousins':
        # First cousins: coefficient of relatedness ~12.5%, i.e. IBD1 over ~25% of the genome
        ibd1_target = 0.25 * genome_length * (1.0 + random.uniform(-noise_level, noise_level))
        
        # Create IBD1 segments
        ibd1_remaining = ibd1_target
        while ibd1_remaining > 7:  # Minimum segment size
            seg_length = min(ibd1_remaining, random.uniform(7, 50))
            segments.append({
                'type': 'IBD1',
                'length': seg_length
            })
            ibd1_remaining -= seg_length
            
    elif relationship_type == 'second-cousins':
        # Second cousins: coefficient of relatedness ~3.125%, i.e. IBD1 over ~6.25% of the genome
        ibd1_target = 0.0625 * genome_length * (1.0 + random.uniform(-noise_level, noise_level))
        
        # Create IBD1 segments
        ibd1_remaining = ibd1_target
        while ibd1_remaining > 7:  # Minimum segment size
            seg_length = min(ibd1_remaining, random.uniform(7, 30))
            segments.append({
                'type': 'IBD1',
                'length': seg_length
            })
            ibd1_remaining -= seg_length
    
    # Sort segments by length (largest first)
    return sorted(segments, key=lambda x: x['length'], reverse=True)

In [None]:
# Generate synthetic IBD data for different relationships
relationship_types = ['parent-child', 'siblings', 'half-siblings', 'first-cousins', 'second-cousins']
synthetic_data = {}

for rel_type in relationship_types:
    synthetic_data[rel_type] = generate_synthetic_ibd(rel_type)
    
    # Print summary statistics
    total_ibd1 = sum(seg['length'] for seg in synthetic_data[rel_type] if seg['type'] == 'IBD1')
    total_ibd2 = sum(seg['length'] for seg in synthetic_data[rel_type] if seg['type'] == 'IBD2')
    ibd1_count = sum(1 for seg in synthetic_data[rel_type] if seg['type'] == 'IBD1')
    ibd2_count = sum(1 for seg in synthetic_data[rel_type] if seg['type'] == 'IBD2')
    
    print(f"{rel_type}:")
    print(f"  IBD1: {total_ibd1:.1f} cM in {ibd1_count} segments")
    print(f"  IBD2: {total_ibd2:.1f} cM in {ibd2_count} segments")
    print(f"  Total: {total_ibd1 + total_ibd2:.1f} cM")
    print()

In [None]:
# Calculate likelihoods for each relationship type using each model
results = []

for true_rel in relationship_types:
    data = synthetic_data[true_rel]
    
    # Calculate likelihoods
    pc_like = parent_child_likelihood(data)
    sib_like = sibling_likelihood(data)
    
    # Distant relationship likelihoods
    hs_like = distant_relationship_likelihood(data, 2)  # Half-siblings: 2 meioses
    fc_like = distant_relationship_likelihood(data, 4)  # First cousins: 4 meioses
    sc_like = distant_relationship_likelihood(data, 6)  # Second cousins: 6 meioses
    
    results.append({
        'True Relationship': true_rel,
        'Parent-Child Log-Likelihood': pc_like,
        'Sibling Log-Likelihood': sib_like,
        'Half-Sibling Log-Likelihood': hs_like,
        'First-Cousin Log-Likelihood': fc_like,
        'Second-Cousin Log-Likelihood': sc_like
    })

# Convert to DataFrame for easier analysis
likelihood_df = pd.DataFrame(results)
likelihood_df

In [None]:
# Identify the highest likelihood model for each relationship
likelihood_columns = [
    'Parent-Child Log-Likelihood',
    'Sibling Log-Likelihood',
    'Half-Sibling Log-Likelihood',
    'First-Cousin Log-Likelihood',
    'Second-Cousin Log-Likelihood'
]

# Find max likelihood for each row
likelihood_df['Max Likelihood'] = likelihood_df[likelihood_columns].max(axis=1)
likelihood_df['Predicted Relationship'] = likelihood_df[likelihood_columns].idxmax(axis=1)

# Clean up the predicted relationship string (column names end in " Log-Likelihood")
likelihood_df['Predicted Relationship'] = likelihood_df['Predicted Relationship'].str.replace(' Log-Likelihood', '', regex=False)

# Display results
likelihood_df[['True Relationship', 'Predicted Relationship', 'Max Likelihood']]

### IBD Moments Model

A key innovation in Bonsai is the use of "IBD moments" to summarize the IBD sharing between individuals. Let's implement a function to calculate these moments from IBD segment data.

In [None]:
def calculate_ibd_moments(segment_list, min_length=7):
    """Calculate IBD moments from a list of segments.
    
    Args:
        segment_list: List of IBD segments with 'length' attribute
        min_length: Minimum segment length to consider
    
    Returns:
        Dictionary with the segment count ("first_moment"), the total length
        ("second_moment"), and the sum of squared lengths ("third_moment")
    """
    # Filter segments by minimum length
    filtered_segments = [seg for seg in segment_list if seg['length'] >= min_length]
    
    # First moment: number of segments
    first_moment = len(filtered_segments)
    
    # Second moment: total length of segments
    second_moment = sum(seg['length'] for seg in filtered_segments)
    
    # Third moment (optional): sum of squared lengths
    third_moment = sum(seg['length']**2 for seg in filtered_segments)
    
    return {
        "first_moment": first_moment,
        "second_moment": second_moment,
        "third_moment": third_moment
    }

# Calculate moments for each relationship type
moments_results = []

for rel_type, data in synthetic_data.items():
    moments = calculate_ibd_moments(data)
    moments_results.append({
        'Relationship': rel_type,
        'Segment Count': moments['first_moment'],
        'Total Length (cM)': moments['second_moment'],
        'Mean Segment Length': moments['second_moment'] / max(1, moments['first_moment'])
    })

moments_df = pd.DataFrame(moments_results)
moments_df
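
The minimum length threshold has a direct effect on these summaries. The self-contained sketch below recomputes the count, total length, and mean length for one hypothetical list of segment lengths at several thresholds; `summarize_segments` and the lengths are illustrative only.

```python
def summarize_segments(lengths_cm, min_length):
    """Return (count, total length, mean length) of segments at or above min_length."""
    kept = [length for length in lengths_cm if length >= min_length]
    count = len(kept)
    total = sum(kept)
    mean = total / count if count else 0.0
    return count, total, mean

# Hypothetical segment lengths in cM for one pair of individuals
lengths_cm = [42.0, 18.5, 9.0, 7.0, 6.5, 3.0]

for threshold in (0, 7, 10, 20):
    count, total, mean = summarize_segments(lengths_cm, threshold)
    print(f"min {threshold:>2} cM: count={count}, total={total:.1f} cM, mean={mean:.1f} cM")
```

Raising the threshold lowers the count faster than the total length, so the mean segment length rises. Observed and expected moments therefore need to be computed with the same threshold to be comparable.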

In [None]:
# Visualize the moments by relationship type
plt.figure(figsize=(12, 6))

# Sort by total length for better visualization
moments_df = moments_df.sort_values('Total Length (cM)', ascending=False)

# Plot segment count and total length
ax = plt.subplot(1, 2, 1)
moments_df.plot(x='Relationship', y='Segment Count', kind='bar', ax=ax)
plt.title('First Moment: Segment Count')
plt.ylabel('Number of Segments (>7cM)')
plt.xticks(rotation=45)

ax = plt.subplot(1, 2, 2)
moments_df.plot(x='Relationship', y='Total Length (cM)', kind='bar', ax=ax)
plt.title('Second Moment: Total IBD Length')
plt.ylabel('Total Length (cM)')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## 3. The Up-Node Dictionary: Encoding Pedigree Structures

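A key data structure in Bonsai is the "up-node dictionary," which encodes the pedigree in a form that supports efficient likelihood calculations and structural modifications. In the simplified version used in this lab, each individual ID maps to a dictionary whose keys are that individual's parents, so an individual with no known parents maps to an empty dictionary.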
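
As a small illustration of the structure, the sketch below writes one hypothetical pedigree as a literal and derives two views from it. It assumes the conventions of this lab's code: each parent edge is stored with the value 1, and negative IDs mark latent (unobserved) individuals. The helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical pedigree: 1 and 2 are the parents of 3 and 4; -1 is a latent
# (unobserved) parent of 5, whose other parent is 3
up_node_dict = {
    1: {},
    2: {},
    -1: {},
    3: {1: 1, 2: 1},
    4: {1: 1, 2: 1},
    5: {3: 1, -1: 1},
}

def invert_to_children(up_dict):
    """Build the child-direction view: parent ID -> sorted list of child IDs."""
    children = defaultdict(list)
    for child, parents in up_dict.items():
        for parent in parents:
            children[parent].append(child)
    return {parent: sorted(kids) for parent, kids in children.items()}

def founders(up_dict):
    """Individuals with no recorded parents."""
    return sorted(ind for ind, parents in up_dict.items() if not parents)

print(invert_to_children(up_node_dict))
print(founders(up_node_dict))
```

Storing only the upward direction keeps structural edits local: changing an individual's parents touches one entry, and the downward view can be rebuilt when needed.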

In [None]:
def create_empty_up_node_dict(individuals):
    """Create an empty up-node dictionary for a set of individuals.
    
    Args:
        individuals: List of individual IDs
        
    Returns:
        Empty up-node dictionary
    """
    up_node_dict = {}
    for ind in individuals:
        up_node_dict[ind] = {}  # Empty dictionary indicates no parents
    return up_node_dict

def add_relationship(up_node_dict, child, parent1, parent2=None):
    """Add a parent-child relationship to the up-node dictionary.
    
    Args:
        up_node_dict: The up-node dictionary to modify
        child: ID of the child
        parent1: ID of the first parent
        parent2: ID of the second parent (optional)
        
    Returns:
        Modified up-node dictionary
    """
    if child not in up_node_dict:
        up_node_dict[child] = {}
    
    # Add first parent
    up_node_dict[child][parent1] = 1
    
    # Add second parent if provided
    if parent2 is not None:
        up_node_dict[child][parent2] = 1
    
    # Make sure parents exist in the dictionary
    if parent1 not in up_node_dict:
        up_node_dict[parent1] = {}
    if parent2 is not None and parent2 not in up_node_dict:
        up_node_dict[parent2] = {}
    
    return up_node_dict

def visualize_pedigree(up_node_dict):
    """Create a visualization of the pedigree from an up-node dictionary.
    
    Args:
        up_node_dict: Up-node dictionary representing the pedigree
    """
    # Create a directed graph
    G = nx.DiGraph()
    
    # Add nodes and edges
    for child, parents in up_node_dict.items():
        G.add_node(child)
        for parent in parents:
            G.add_node(parent)
            G.add_edge(parent, child)  # Direction from parent to child
    
    # Set node colors: green for real individuals (positive IDs), gray for latent (negative IDs)
    node_colors = ['lightgreen' if isinstance(node, int) and node > 0 else 'lightgray' 
                   for node in G.nodes()]
    
    # Calculate layout; graphviz_layout needs the optional pygraphviz package,
    # so fall back to a spring layout when it is unavailable
    try:
        pos = nx.nx_agraph.graphviz_layout(G, prog='dot')
    except ImportError:
        pos = nx.spring_layout(G)
    
    # Draw the graph
    plt.figure(figsize=(12, 8))
    nx.draw(G, pos, with_labels=True, node_color=node_colors, 
            node_size=700, font_size=10, arrows=True)
    plt.title('Pedigree Structure')
    plt.show()

In [None]:
# Create a simple example pedigree
individuals = [1000, 1001, 1002, 1003, 1004, 1005]
up_node_dict = create_empty_up_node_dict(individuals)

# Add relationships
up_node_dict = add_relationship(up_node_dict, 1003, 1001, 1002)  # 1003 has parents 1001 and 1002
up_node_dict = add_relationship(up_node_dict, 1004, 1001, 1002)  # 1004 has the same parents
up_node_dict = add_relationship(up_node_dict, 1005, 1000, 1003)  # 1005 has parents 1000 and 1003

# Visualize the pedigree
visualize_pedigree(up_node_dict)
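
A dictionary of this shape can also encode structures that are not valid pedigrees, for example an individual with three parents or an individual who is their own ancestor. The sketch below is one possible consistency check; `validate_up_node_dict` is a hypothetical helper, not a Bonsai function.

```python
def validate_up_node_dict(up_dict):
    """Return a list of problems found; an empty list means no problems were detected."""
    problems = []
    for child, parents in up_dict.items():
        if len(parents) > 2:
            problems.append(f"{child} has more than two parents")
        for parent in parents:
            if parent not in up_dict:
                problems.append(f"parent {parent} of {child} has no entry")

    # Depth-first search upward; reaching a node that is still being
    # explored means some individual is its own ancestor
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNSEEN for node in up_dict}

    def has_cycle_from(node):
        state[node] = IN_PROGRESS
        for parent in up_dict.get(node, {}):
            parent_state = state.get(parent, DONE)  # missing entries have no parents
            if parent_state == IN_PROGRESS:
                return True
            if parent_state == UNSEEN and has_cycle_from(parent):
                return True
        state[node] = DONE
        return False

    for node in up_dict:
        if state[node] == UNSEEN and has_cycle_from(node):
            problems.append("cycle detected: an individual is its own ancestor")
            break
    return problems

valid = {1: {}, 2: {}, 3: {1: 1, 2: 1}}
cyclic = {1: {2: 1}, 2: {1: 1}}
print(validate_up_node_dict(valid))
print(validate_up_node_dict(cyclic))
```

The recursive search is adequate for small teaching examples; very deep pedigrees would call for an iterative traversal to stay within Python's recursion limit.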

### Calculating Genetic Relationships Using the Up-Node Dictionary

One of the key operations in Bonsai is calculating the genetic relationship coefficient between individuals based on the pedigree structure. Let's implement this calculation using the up-node dictionary.
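
The calculation rests on a path-counting rule: every chain that links the two individuals through a common ancestor contributes $0.5^{n_1 + n_2}$, where $n_1$ and $n_2$ are the numbers of meioses on each side. The sketch below applies the rule by hand to a few standard relationships, assuming non-inbred ancestors; `coefficient_from_paths` is a hypothetical helper.

```python
from fractions import Fraction

def coefficient_from_paths(path_lengths):
    """Sum 0.5**(n1 + n2) over (n1, n2) pairs, one pair per connecting chain."""
    return sum(Fraction(1, 2) ** (n1 + n2) for n1, n2 in path_lengths)

# (n1, n2) = meioses from each individual up to the common ancestor
examples = {
    "parent-child": [(0, 1)],                 # the parent is the common ancestor
    "full siblings": [(1, 1), (1, 1)],        # two shared parents
    "half siblings": [(1, 1)],                # one shared parent
    "grandparent-grandchild": [(0, 2)],
    "first cousins": [(2, 2), (2, 2)],        # two shared grandparents
    "second cousins": [(3, 3), (3, 3)],       # two shared great-grandparents
}

coefficients = {name: coefficient_from_paths(paths) for name, paths in examples.items()}
for name, coef in coefficients.items():
    print(f"{name}: {coef}")
```

Parent-child and full-sibling pairs both come out at 1/2, and half siblings and grandparent-grandchild pairs both come out at 1/4, so the coefficient alone cannot separate those relationships.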

In [None]:
def get_genetic_paths(up_node_dict, individual, path=None, paths=None, ancestor=None):
    """Find all paths from an individual to their ancestors.
    
    Args:
        up_node_dict: Up-node dictionary representing the pedigree
        individual: ID of the individual to trace
        path: Current path being explored (for recursion)
        paths: Dictionary of collected paths (for recursion)
        ancestor: Current ancestor being considered (for recursion)
        
    Returns:
        Dictionary mapping ancestor IDs to lists of paths
    """
    if path is None:
        path = []
    if paths is None:
        paths = {individual: [[]]}  # Start with self path
    
    # If this individual has no parents, return current paths
    if individual not in up_node_dict or not up_node_dict[individual]:
        return paths
    
    # Process each parent
    for parent in up_node_dict[individual]:
        # Create a new path for this parent
        new_path = path + [parent]
        
        # Add this path to the parent's paths
        if parent not in paths:
            paths[parent] = []
        paths[parent].append(new_path)
        
        # Recursively process this parent's ancestors
        get_genetic_paths(up_node_dict, parent, new_path, paths, parent)
    
    return paths

def calculate_relationship_coefficient(up_node_dict, id1, id2):
    """Calculate the relationship coefficient between two individuals.
    
    Args:
        up_node_dict: Up-node dictionary representing the pedigree
        id1: ID of the first individual
        id2: ID of the second individual
        
    Returns:
        Relationship coefficient (proportion of shared genetic material)
    """
    if id1 == id2:
        return 1.0  # Self-relationship is 1.0
    
    # Direct parent-child relationship check
    if id1 in up_node_dict.get(id2, {}) or id2 in up_node_dict.get(id1, {}):
        return 0.5  # Parent-child share 50%
    
    # Get genetic paths to ancestors for each individual
    paths1 = get_genetic_paths(up_node_dict, id1)
    paths2 = get_genetic_paths(up_node_dict, id2)
    
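    # Find common ancestors and calculate contributions
    relatedness = 0.0
    common_ancestors = set(paths1.keys()) & set(paths2.keys())
    
    for ancestor in common_ancestors:
        # The common "ancestor" may be id1 or id2 itself (empty path), which
        # covers direct lines such as grandparent-grandchild
        for path1 in paths1[ancestor]:
            for path2 in paths2[ancestor]:
                # Individuals on each side strictly below the common ancestor
                below1 = set(([id1] + path1)[:-1])
                below2 = set(([id2] + path2)[:-1])
                if below1 & below2:
                    continue  # Paths merge below this ancestor; counted there instead
                
                # Each valid pair of paths contributes 0.5^(total meioses)
                contribution = 0.5**(len(path1) + len(path2))
                relatedness += contribution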
    
    return relatedness

In [None]:
# Calculate and display relationship coefficients for all pairs
relationship_results = []

for id1 in individuals:
    for id2 in individuals:
        if id1 < id2:  # Avoid duplicates and self-relationships
            coef = calculate_relationship_coefficient(up_node_dict, id1, id2)
            relationship_name = "Unknown"
            
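            # Map coefficient to relationship name; full siblings and
            # parent-child pairs both have a theoretical coefficient of 0.5
            is_direct = id1 in up_node_dict.get(id2, {}) or id2 in up_node_dict.get(id1, {})
            if math.isclose(coef, 0.5):
                relationship_name = "Parent-Child" if is_direct else "Full Siblings"
            elif math.isclose(coef, 0.25):
                relationship_name = "Grandparent, Half-Sibling, or Avuncular"
            elif math.isclose(coef, 0.125):
                relationship_name = "First Cousin or Great-Grandparent"
            elif coef == 0:
                relationship_name = "Unrelated"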
            
            relationship_results.append({
                'Individual 1': id1,
                'Individual 2': id2,
                'Relationship Coefficient': coef,
                'Relationship': relationship_name
            })

rel_df = pd.DataFrame(relationship_results)
rel_df

## 4. Optimization Algorithms in Bonsai

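Bonsai searches over pedigree structures for the one that maximizes the likelihood of the observed IBD data. As a simplified stand-in for that search, the cells below use simulated annealing: propose a random change to the pedigree, always accept it if the log-likelihood improves, and otherwise accept it with a probability that shrinks as the temperature cools.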

In [None]:
def calculate_pedigree_likelihood(up_node_dict, ibd_segments, min_cm=7):
    """Calculate the likelihood of a pedigree given IBD segment data.
    
    Args:
        up_node_dict: Up-node dictionary representing the pedigree
        ibd_segments: Dictionary mapping pairs of individuals to their IBD segments
        min_cm: Minimum segment length to consider
        
    Returns:
        Log-likelihood of the pedigree
    """
    # This is a simplified placeholder implementation
    total_log_likelihood = 0.0
    
    # Process each pair of individuals
    for (id1, id2), segments in ibd_segments.items():
        # Skip if either individual is not in the pedigree
        if id1 not in up_node_dict or id2 not in up_node_dict:
            continue
            
        # Calculate expected relationship coefficient
        expected_coef = calculate_relationship_coefficient(up_node_dict, id1, id2)
        
        # Calculate observed moments
        moments = calculate_ibd_moments(segments, min_cm)
        
        # Skip pairs with no IBD sharing above threshold
        if moments['first_moment'] == 0:
            continue
            
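        # Calculate expected moments; unrelated pairs (coefficient 0) have no
        # expected sharing, and this guard keeps log(0) out of the helpers
        if expected_coef > 0:
            expected_count = calculate_expected_segments(expected_coef, min_cm)
            expected_length = calculate_expected_length(expected_coef, min_cm)
        else:
            expected_count = expected_length = 0.0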
        
        # Calculate likelihood using Poisson model for segment count
        count_log_like = poisson.logpmf(moments['first_moment'], expected_count) if expected_count > 0 else 0
        
        # Use a normal approximation for total length
        length_log_like = 0
        if expected_count > 0 and moments['first_moment'] > 0:
            length_log_like = norm.logpdf(moments['second_moment'], 
                                         expected_length, 
                                         expected_length / math.sqrt(expected_count))
        
        # Combine likelihoods
        pair_log_like = count_log_like * 0.7 + length_log_like * 0.3
        total_log_likelihood += pair_log_like
    
    return total_log_likelihood

def propose_pedigree_modification(up_node_dict, ids=None):
    """Propose a modification to the pedigree structure.
    
    Args:
        up_node_dict: Current up-node dictionary
        ids: List of individual IDs to consider (if None, uses all IDs)
        
    Returns:
        Modified up-node dictionary
    """
    # Create a deep copy to avoid modifying the original
    new_dict = {}
    for ind, parents in up_node_dict.items():
        new_dict[ind] = parents.copy()
    
    # If no IDs provided, use all real individuals (positive integer IDs)
    if ids is None:
        ids = [ind_id for ind_id in up_node_dict if isinstance(ind_id, int) and ind_id > 0]
    
    # Choose a random individual
    if not ids:
        return new_dict  # No individuals to modify
        
    ind = random.choice(ids)
    
    # Choose a modification type
    mod_type = random.choice(['add_parent', 'remove_parent', 'swap_parent'])
    
    if mod_type == 'add_parent':
        # Add a parent to the individual
        if len(new_dict[ind]) < 2:  # Can only add if fewer than 2 parents
            # Create a new latent parent (negative ID)
            new_parent = -random.randint(1, 1000)
            while new_parent in new_dict:  # Ensure unique ID
                new_parent = -random.randint(1, 1000)
                
            # Add the parent
            new_dict[ind][new_parent] = 1
            new_dict[new_parent] = {}  # Initialize parent with no ancestors
    
    elif mod_type == 'remove_parent':
        # Remove a parent if any exist
        if new_dict[ind]:
            parent = random.choice(list(new_dict[ind].keys()))
            del new_dict[ind][parent]
    
    elif mod_type == 'swap_parent':
        # Replace a parent with a new latent parent
        if new_dict[ind]:
            parent = random.choice(list(new_dict[ind].keys()))
            
            # Create a new latent parent
            new_parent = -random.randint(1, 1000)
            while new_parent in new_dict:  # Ensure unique ID
                new_parent = -random.randint(1, 1000)
                
            # Replace the parent
            del new_dict[ind][parent]
            new_dict[ind][new_parent] = 1
            new_dict[new_parent] = {}  # Initialize parent with no ancestors
    
    return new_dict

def build_pedigree_with_optimization(individuals, ibd_segments, min_cm=7):
    """Build a pedigree using optimization techniques.
    
    Args:
        individuals: List of individual IDs
        ibd_segments: Dictionary mapping pairs of individuals to their IBD segments
        min_cm: Minimum segment length to consider
        
    Returns:
        Tuple of (best pedigree, best likelihood)
    """
    # Initialize with empty pedigree
    pedigree = create_empty_up_node_dict(individuals)
    
    # Calculate initial likelihood
    current_likelihood = calculate_pedigree_likelihood(pedigree, ibd_segments, min_cm)
    best_pedigree = {k: v.copy() for k, v in pedigree.items()}
    best_likelihood = current_likelihood
    
    # Optimization parameters
    temperature = 1.0
    cooling_rate = 0.99
    iterations = 100  # Reduced for demonstration
    
    # Track progress
    likelihoods = [current_likelihood]
    
    for i in range(iterations):
        # Propose a modification to the pedigree
        new_pedigree = propose_pedigree_modification(pedigree)
        
        # Calculate new likelihood
        new_likelihood = calculate_pedigree_likelihood(new_pedigree, ibd_segments, min_cm)
        
        # Accept or reject based on likelihood and temperature
        if new_likelihood > current_likelihood:
            # Always accept improvements
            accept = True
        else:
            # Metropolis criterion: accept a worse solution with probability
            # exp(delta / temperature), where delta <= 0
            delta = new_likelihood - current_likelihood
            accept_probability = math.exp(delta / temperature)
            accept = random.random() < accept_probability
        
        if accept:
            pedigree = new_pedigree
            current_likelihood = new_likelihood
            
            # Update best pedigree if improved
            if current_likelihood > best_likelihood:
                best_pedigree = {k: v.copy() for k, v in pedigree.items()}
                best_likelihood = current_likelihood
        
        # Cool the temperature
        temperature *= cooling_rate
        
        # Track progress
        likelihoods.append(current_likelihood)
        
        # Occasionally print progress
        if (i+1) % 10 == 0:
            print(f"Iteration {i+1}: Current log-likelihood = {current_likelihood:.2f}, Best = {best_likelihood:.2f}")
    
    # Plot optimization progress
    plt.figure(figsize=(10, 5))
    plt.plot(likelihoods)
    plt.title('Optimization Progress')
    plt.xlabel('Iteration')
    plt.ylabel('Log Likelihood')
    plt.grid(True, alpha=0.3)
    plt.show()
    
    return best_pedigree, best_likelihood
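
The geometric cooling schedule above determines how willing the search is to accept a worse pedigree as it progresses. The sketch below works through that arithmetic for the parameters used in the function (initial temperature 1.0, cooling rate 0.99); `temperature_at` and `worse_move_acceptance` are hypothetical helper names introduced only for illustration. After 100 iterations the temperature has fallen to about 0.37, so a proposal that lowers the log-likelihood by one unit becomes noticeably less likely to be accepted than at the start.

```python
import math

def temperature_at(iteration, initial_temperature=1.0, cooling_rate=0.99):
    """Temperature after `iteration` geometric cooling steps."""
    return initial_temperature * cooling_rate ** iteration

def worse_move_acceptance(delta, iteration, **schedule):
    """Probability of accepting a move with log-likelihood change delta <= 0."""
    return math.exp(delta / temperature_at(iteration, **schedule))

# Acceptance probability for a one-unit drop in log-likelihood over time
for iteration in (0, 50, 100):
    t = temperature_at(iteration)
    p = worse_move_acceptance(-1.0, iteration)
    print(f"iteration {iteration:3d}: T = {t:.3f}, P(accept) = {p:.3f}")
```

A slower cooling rate (closer to 1) keeps the search exploratory for longer, at the cost of needing more iterations to settle.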

### Preparing IBD Data for Optimization

In [None]:
# Build synthetic data in the format expected by the optimizer:
# a dictionary mapping (id1, id2) pairs to a list of segment dicts

# First, generate a more complete set of synthetic relationships
synthetic_relationships = {
    (1000, 1003): generate_synthetic_ibd('parent-child'),
    (1001, 1003): generate_synthetic_ibd('parent-child'),
    (1001, 1004): generate_synthetic_ibd('parent-child'),
    (1002, 1004): generate_synthetic_ibd('parent-child'),
    (1003, 1004): generate_synthetic_ibd('siblings'),
    (1000, 1004): generate_synthetic_ibd('half-siblings'),
    (1000, 1005): generate_synthetic_ibd('first-cousins'),
    (1002, 1005): generate_synthetic_ibd('second-cousins')
}

# Display the data structure
for (id1, id2), segments in list(synthetic_relationships.items())[:2]:  # Show first two for brevity
    print(f"Relationship between {id1} and {id2}:")
    print(f"  Number of segments: {len(segments)}")
    print(f"  Total IBD length: {sum(seg['length'] for seg in segments):.1f} cM")
    print(f"  First few segments: {segments[:2]}")
    print()
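
Because the optimizer takes a `min_cm` threshold, it can help to check how much IBD each pair retains after filtering before running it. The sketch below assumes, as in the display loop above, that each segment is a dictionary with a `'length'` key in cM, and that segments shorter than `min_cm` are simply dropped. `summarize_ibd` is a hypothetical helper, and the toy values are made up for illustration.

```python
def summarize_ibd(ibd_segments, min_cm=7):
    """Per-pair count and total length (cM) of segments at least min_cm long."""
    summary = {}
    for pair, segments in ibd_segments.items():
        kept = [seg["length"] for seg in segments if seg["length"] >= min_cm]
        summary[pair] = {"n_segments": len(kept), "total_cm": sum(kept)}
    return summary

# Toy input in the same shape: (id1, id2) -> list of segment dicts
toy_segments = {
    (1, 2): [{"length": 50.0}, {"length": 5.0}, {"length": 12.5}],
    (1, 3): [{"length": 3.0}],
}

toy_summary = summarize_ibd(toy_segments, min_cm=7)
for pair, stats in toy_summary.items():
    print(pair, stats)
```

A pair whose segments all fall below the threshold ends up with zero retained IBD, so it contributes to the likelihood as if the two individuals were unrelated.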

In [None]:
# Run the optimization to reconstruct the pedigree
# Note: This is a simplified demonstration; actual Bonsai optimization is more complex
import signal

class TimeoutException(Exception):
    pass

def timeout_handler(signum, frame):
    raise TimeoutException("Timed out!")

# SIGALRM exists only on Unix-like systems; skip the timeout elsewhere (e.g. Windows)
use_alarm = hasattr(signal, "SIGALRM")

try:
    # We'll use a timeout to avoid running too long in the notebook
    if use_alarm:
        signal.signal(signal.SIGALRM, timeout_handler)
        signal.alarm(300)  # 5 minute timeout
    
    try:
        # Run the optimization
        inferred_pedigree, final_likelihood = build_pedigree_with_optimization(
            individuals, synthetic_relationships, min_cm=7
        )
    finally:
        if use_alarm:
            signal.alarm(0)  # Cancel the alarm, even if the optimization failed
    
    # Visualize the inferred pedigree
    print("\nInferred Pedigree:")
    visualize_pedigree(inferred_pedigree)
    
    # Compare with the true pedigree
    print("\nTrue Pedigree:")
    visualize_pedigree(up_node_dict)
    
except TimeoutException:
    print("Optimization timed out. This is expected in the notebook demonstration.")
    print("For a full analysis, consider running the optimization with more carefully selected parameters.")
except Exception as e:
    print(f"Error during optimization: {e}")
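
The alarm-based timeout above depends on `signal.SIGALRM`, which is available only on Unix-like systems. As a portable alternative, an iterative optimizer can check a wall-clock deadline between iterations. The following is a minimal sketch, not part of Bonsai or of this lab's code: `run_with_time_budget` and its `step` callback are hypothetical names, and the approach assumes each iteration is short relative to the budget, since a single long step cannot be interrupted.

```python
import time

def run_with_time_budget(step, max_iterations, budget_seconds):
    """Call step(i) up to max_iterations times, stopping once the budget is spent.

    Returns the number of iterations that were completed.
    """
    deadline = time.monotonic() + budget_seconds
    completed = 0
    for i in range(max_iterations):
        if time.monotonic() >= deadline:
            break  # Out of time: stop before starting another iteration
        step(i)
        completed += 1
    return completed

# Example: a trivial step that records which iterations ran
visited = []
n_done = run_with_time_budget(visited.append, max_iterations=5, budget_seconds=10.0)
print(f"Completed {n_done} iterations: {visited}")
```

`time.monotonic()` is used rather than `time.time()` because it is not affected by system clock adjustments.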

## 5. Mathematical Extensions and Improvements

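Additional information about the individuals can improve the accuracy of Bonsai's reconstructions. Here we implement one such extension, age constraints, as a hard filter: any pedigree in which a genotyped parent is not at least `min_parent_age` years older than their genotyped child is assigned a log-likelihood of $-\infty$. Other kinds of relationship information could be incorporated in the same way.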

In [None]:
def incorporate_age_constraints(up_node_dict, ages, min_parent_age=12):
    """Check if a pedigree satisfies age constraints.
    
    Args:
        up_node_dict: Up-node dictionary representing the pedigree
        ages: Dictionary mapping individual IDs to ages
        min_parent_age: Minimum age difference between parent and child
        
    Returns:
        True if all constraints are satisfied, False otherwise
    """
    for child, parents in up_node_dict.items():
        if child < 0 or not parents:  # Skip inferred individuals or those without parents
            continue
            
        child_age = ages.get(child)
        if child_age is None:
            continue
            
        for parent in parents:
            if parent < 0:  # Skip inferred parents
                continue
                
            parent_age = ages.get(parent)
            if parent_age is None:
                continue
                
            # Violation if the parent is less than min_parent_age years older than the child
            if parent_age - child_age < min_parent_age:
                return False  # Age constraint violated
    
    return True  # All constraints satisfied

def calculate_pedigree_likelihood_with_constraints(up_node_dict, ibd_segments, ages=None, min_cm=7):
    """Calculate pedigree likelihood with additional constraints.
    
    Args:
        up_node_dict: Up-node dictionary representing the pedigree
        ibd_segments: Dictionary mapping pairs of individuals to their IBD segments
        ages: Dictionary mapping individual IDs to ages (optional)
        min_cm: Minimum segment length to consider
        
    Returns:
        Log-likelihood of the pedigree
    """
    # Check age constraints if ages provided
    if ages is not None and not incorporate_age_constraints(up_node_dict, ages):
        return float('-inf')  # Invalid pedigree due to age constraints
    
    # Otherwise, calculate likelihood as before
    return calculate_pedigree_likelihood(up_node_dict, ibd_segments, min_cm)
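
Returning $-\infty$ for an invalid pedigree acts as a hard constraint under the acceptance rule used in the annealing loop earlier: a proposal with log-likelihood $-\infty$ is accepted with probability $e^{-\infty} = 0$. The sketch below illustrates this with a hypothetical helper, `metropolis_accept_prob`, that mirrors that rule and is not part of the lab's code. It treats the case where both values are $-\infty$ (whose difference is NaN) as a rejection, which is an assumption made here for clarity.

```python
import math

NEG_INF = float("-inf")

def metropolis_accept_prob(current_ll, proposed_ll, temperature):
    """Acceptance probability under the Metropolis rule (hypothetical helper)."""
    if proposed_ll > current_ll:
        return 1.0  # Improvements are always accepted
    delta = proposed_ll - current_ll
    if math.isnan(delta):
        return 0.0  # Both values are -inf: treat as a rejection
    return math.exp(delta / temperature)

# A constraint-violating proposal is never accepted from a valid state
print(metropolis_accept_prob(-120.0, NEG_INF, temperature=1.0))  # 0.0

# A valid proposal is always accepted when the current state is invalid
print(metropolis_accept_prob(NEG_INF, -120.0, temperature=1.0))  # 1.0
```

Because the rejection holds at every temperature, the search stays within age-consistent pedigrees once it has reached one.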

In [None]:
# Example of using age constraints
# Assign ages to individuals
ages = {
    1000: 70,
    1001: 65,
    1002: 68,
    1003: 40,
    1004: 38,
    1005: 15
}

# Check if our example pedigree satisfies age constraints
age_valid = incorporate_age_constraints(up_node_dict, ages)
print(f"Pedigree satisfies age constraints: {age_valid}")

# Create an invalid pedigree for demonstration
invalid_pedigree = create_empty_up_node_dict(individuals)
invalid_pedigree = add_relationship(invalid_pedigree, 1001, 1003)  # Invalid: makes 1003 (age 40) a parent of 1001 (age 65)

# Check if the invalid pedigree satisfies age constraints
age_valid = incorporate_age_constraints(invalid_pedigree, ages)
print(f"Invalid pedigree satisfies age constraints: {age_valid}")

## 6. Exercises

Complete the following exercises to deepen your understanding of the mathematical foundations of Bonsai.

### Exercise 1: Expected IBD Segment Distribution

Implement a function to calculate the expected distribution of IBD segment lengths for a given relationship coefficient. Plot this distribution for various relationship types.

In [None]:
# Your code for Exercise 1


### Exercise 2: Likelihood Sensitivity Analysis

Analyze how the likelihood function responds to changes in IBD segment patterns. Create a series of synthetic IBD patterns with varying degrees of noise and observe how the likelihood changes.

In [None]:
# Your code for Exercise 2


### Exercise 3: Improved Relationship Coefficient Calculation

Enhance the `calculate_relationship_coefficient` function to handle more complex pedigree structures, such as inbreeding. Test your implementation on examples with known inbreeding coefficients.

In [None]:
# Your code for Exercise 3


### Exercise 4: Optimization Algorithm Comparison

Implement an alternative optimization algorithm (e.g., genetic algorithm, hill climbing) and compare its performance to the simulated annealing approach.

In [None]:
# Your code for Exercise 4


### Exercise 5: Age-Constrained Pedigree Reconstruction

Enhance the pedigree reconstruction algorithm to incorporate age constraints during the optimization process, not just as a validation step afterward.

In [None]:
# Your code for Exercise 5


## Conclusion

In this lab, we explored the mathematical foundations of the Bonsai algorithm for pedigree reconstruction. We implemented key components of the algorithm, including likelihood functions, the up-node dictionary data structure, and optimization techniques. We also examined how additional constraints, such as age information, can be incorporated to improve reconstruction accuracy.

Key takeaways:
- Bonsai uses a Bayesian framework to find the most likely pedigree given observed IBD segment patterns
- Different relationship types have characteristic likelihood models based on theoretical expectations
- The up-node dictionary provides an efficient representation of pedigree structures
- Optimization algorithms like simulated annealing help search the vast space of possible pedigrees
- Additional constraints and information can be incorporated to improve reconstruction accuracy

In the next lab, we will explore the data structures used in Bonsai in more detail, focusing on how they enable efficient pedigree manipulation and analysis.