# Lab 13: Mathematical Foundations of Bonsai

Building on the IBD segments explored in Lab 12, this lab covers the mathematical principles behind the Bonsai algorithm: the probabilistic framework, the likelihood functions, and the optimization techniques that drive Bonsai's pedigree reconstruction.

**Learning Objectives**:
- Master the probabilistic framework that underpins the Bonsai algorithm
- Understand how likelihood functions quantify the probability of observed IBD patterns
- Analyze the mathematical models for different relationship types
- Explore IBD moment calculations and their role in pedigree inference
- Examine Bonsai's optimization algorithms for finding maximum likelihood pedigrees
- Implement and interpret key mathematical components of the Bonsai algorithm

## Environment Setup

In [None]:
!poetry install --no-root

In [None]:
import os
import sys
import math
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from scipy.stats import poisson, expon, norm, multivariate_normal
from collections import defaultdict, deque
from pathlib import Path
from IPython.display import display, HTML
from dotenv import load_dotenv
import logging

In [None]:
# Environment setup code removed for JupyterLite compatibility
# In JupyterLite, files are accessed directly from the files directory


In [None]:
def configure_logging(log_filename, log_file_debug_level="INFO", console_debug_level="INFO"):
    """
    Configure logging for both file and console handlers.

    Args:
        log_filename (str): Path to the log file where logs will be written.
        log_file_debug_level (str): Logging level for the file handler.
        console_debug_level (str): Logging level for the console handler.
    """
    # Create a root logger
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)  # Capture all messages at the root level

    # Convert level names to numeric levels
    file_level = getattr(logging, log_file_debug_level.upper(), logging.INFO)
    console_level = getattr(logging, console_debug_level.upper(), logging.INFO)

    # File handler: Logs messages at file_level and above to the file
    file_handler = logging.FileHandler(log_filename)
    file_handler.setLevel(file_level)
    file_formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    file_handler.setFormatter(file_formatter)

    # Console handler: Logs messages at console_level and above to the console
    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setLevel(console_level)
    console_formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    console_handler.setFormatter(console_formatter)

    # Add handlers to the root logger
    logger.addHandler(file_handler)
    logger.addHandler(console_handler)
    
def clear_logger():
    """Remove all handlers from the root logger."""
    logger = logging.getLogger()
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)

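# Define logs directory.
# `working_directory` was defined by the environment setup code removed above,
# so fall back to the current directory when it is not already set.
if "working_directory" not in globals():
    working_directory = os.getcwd()
logs_directory = os.path.join(working_directory, "logs")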

# Ensure the logs directory exists
if not os.path.exists(logs_directory):
    os.makedirs(logs_directory)
        
log_filename = os.path.join(logs_directory, "lab13_log.txt")
print(f"The Lab 13 log file is located at {log_filename}.")

# Check if the file exists; if not, create it
if not os.path.exists(log_filename):
    with open(log_filename, 'w') as file:
        pass  # The file is now created.
    
clear_logger() # Clear the logger before reconfiguring it
configure_logging(log_filename, log_file_debug_level="INFO", console_debug_level="INFO")

## 1. The Probabilistic Framework of Bonsai

At its core, Bonsai is a statistical algorithm designed to infer pedigree structures from genetic data, primarily relying on patterns of Identity-by-Descent (IBD) sharing between individuals. It uses a probabilistic approach to evaluate how well different potential pedigrees explain the observed genetic evidence.

### Likelihood-Based Pedigree Inference

The central component of Bonsai's statistical framework is the **likelihood function**. This function quantifies the probability of observing the actual IBD data *given* a specific, hypothesized pedigree structure. It is expressed as:

$$L(\text{Pedigree} | \text{IBD data}) = P(\text{IBD data} | \text{Pedigree})$$

where:
- $L(\text{Pedigree} | \text{IBD data})$ is the likelihood *of* the proposed pedigree structure, given the observed IBD data.
- $P(\text{IBD data} | \text{Pedigree})$ is the probability of the observed IBD segment patterns occurring if the proposed pedigree were the true underlying structure.

Bonsai's primary goal is to find the pedigree structure ($\hat{\mathcal{P}}$) that **maximizes this likelihood function**:

$$\hat{\mathcal{P}} = \arg\max_{\mathcal{P}} P(\text{IBD data} | \mathcal{P})$$

This process is known as **Maximum Likelihood Estimation (MLE)**. Finding this optimal pedigree involves:
1.  Developing accurate models for $P(\text{IBD data} | \text{Pedigree})$ that reflect genetic principles (e.g., recombination, segregation) for different relationship types within the pedigree.
2.  Employing sophisticated **optimization algorithms** to search the vast space of possible pedigree structures and identify the one yielding the highest likelihood score.
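
As a minimal sketch of the arg-max step above, the snippet below scores a few relationship hypotheses for a single pair against one observed statistic (total IBD length) and keeps the highest-scoring one. The Gaussian model, the `HYPOTHESES` table, and the function names are illustrative assumptions, not Bonsai's fitted parameters or API.

```python
import math

# Hypothetical (mean, std) of total IBD length in cM for each hypothesis.
# These numbers are illustrative assumptions, not Bonsai parameters.
HYPOTHESES = {
    "parent-child": (3500.0, 100.0),
    "half-sibling": (1750.0, 200.0),
    "first-cousin": (875.0, 150.0),
    "unrelated": (15.0, 10.0),
}

def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution N(mu, sigma^2) at x."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sigma ** 2)

def max_likelihood_hypothesis(observed_cm):
    """Return (best_name, scores); scores maps each hypothesis to its log-likelihood."""
    scores = {
        name: gaussian_logpdf(observed_cm, mu, sigma)
        for name, (mu, sigma) in HYPOTHESES.items()
    }
    best = max(scores, key=scores.get)  # arg max over the candidate set
    return best, scores

best, scores = max_likelihood_hypothesis(1700.0)
print(best)
```

With an observed 1700 cM, the half-sibling hypothesis scores highest under these assumed parameters. A real search differs mainly in scale: the candidates are whole pedigrees rather than four labels, so they cannot be enumerated exhaustively.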

While the core is MLE, Bonsai often incorporates additional information and constraints:
*   **Structural Constraints:** It only considers pedigrees consistent with fundamental biological rules (e.g., Mendelian inheritance, individuals having at most two parents, no temporal paradoxes). These act as implicit hard priors, defining the valid search space.
*   **Data Integration:** Other data, such as individual ages or sexes, can be integrated into the likelihood model (e.g., evaluating $P(\text{IBD data}, \text{Ages} | \text{Pedigree})$), effectively penalizing pedigrees that imply biologically improbable age differences between relatives.
*   **Search Heuristics:** The optimization strategy itself might use heuristics that implicitly guide the search, sometimes favoring simpler or more plausible structures.
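
The structural constraints above can be checked before a candidate pedigree is ever scored. The following sketch assumes the up-node dictionary format used later in this notebook (`{child: {parent: 1, ...}}`) and tests only two rules: at most two parents per individual, and no individual being their own ancestor. The function name `is_valid_pedigree` is hypothetical, and the recursive search is intended for small pedigrees only.

```python
def is_valid_pedigree(pedigree):
    """Check two structural rules on an up-node dictionary {child: {parent: 1, ...}}.

    Rule 1: every individual has at most two parents.
    Rule 2: nobody is their own ancestor (the parent graph has no cycles).
    """
    if any(len(parents) > 2 for parents in pedigree.values()):
        return False

    # Depth-first search with three node states to detect cycles.
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def has_cycle(node):
        state[node] = IN_PROGRESS
        for parent in pedigree.get(node, {}):
            parent_state = state.get(parent, UNSEEN)
            if parent_state == IN_PROGRESS:
                return True  # reached a node that is still on the current path
            if parent_state == UNSEEN and has_cycle(parent):
                return True
        state[node] = DONE
        return False

    return not any(
        state.get(node, UNSEEN) == UNSEEN and has_cycle(node) for node in pedigree
    )

print(is_valid_pedigree({101: {201: 1, 202: 1}, 201: {}, 202: {}}))
```

Rejecting invalid structures up front corresponds to the "hard prior" view: such pedigrees receive zero probability regardless of how well they would fit the IBD data.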

The rest of this notebook models the likelihood components and the optimization search.

In [None]:
# This implementation is inspired by the actual Bonsai codebase in bonsaitree/v3/likelihoods.py
# It's structured to be runnable in this notebook while demonstrating the key components
# of Bonsai's likelihood-based approach to pedigree inference.

def calculate_pedigree_likelihood(pedigree, ibd_data, age_data=None):
    """
    Calculate the total log-likelihood of a pedigree given IBD and age data.
    
    This is the core scoring function that Bonsai optimizes - it evaluates how well
    a proposed pedigree structure explains the observed IBD patterns. Follows the
    approach described in Jewett et al. (2021) AJHG paper.
    
    Args:
        pedigree: Dictionary representing pedigree structure (up-node dictionary)
        ibd_data: Dictionary mapping (id1,id2) pairs to their IBD segments
        age_data: Optional dictionary mapping individual IDs to ages
        
    Returns:
        Total log-likelihood score for the pedigree
    """
    # 1. Initialize total log-likelihood
    total_log_likelihood = 0.0
    
    # 2. Get all genotyped individuals (in Bonsai, these are positive integer IDs)
    genotyped_ids = [ind for ind in pedigree if isinstance(ind, int) and ind > 0]  # avoid shadowing built-in id()
    
    # 3. Iterate through all unique pairs of individuals
    from itertools import combinations
    for id1, id2 in combinations(genotyped_ids, 2):
        # Skip pairs with no IBD data
        if (id1, id2) not in ibd_data and (id2, id1) not in ibd_data:
            continue
        
        # 4. Determine the relationship implied by the pedigree structure
        relationship = determine_relationship(pedigree, id1, id2)
        
        # 5. Calculate the genetic component of likelihood (based on IBD patterns)
        ibd_segments = get_pair_ibd_segments(ibd_data, id1, id2)
        genetic_likelihood = calculate_genetic_likelihood(relationship, ibd_segments)
        
        # 6. Calculate the age component of likelihood (if age data is available)
        age_likelihood = 0.0
        if age_data and id1 in age_data and id2 in age_data:
            age_diff = age_data[id1] - age_data[id2]
            age_likelihood = calculate_age_likelihood(relationship, age_diff)
        
        # 7. Add this pair's contribution to the total score
        pair_likelihood = genetic_likelihood + age_likelihood
        total_log_likelihood += pair_likelihood
    
    return total_log_likelihood

def determine_relationship(pedigree, id1, id2):
    """
    Determine the relationship between two individuals in a pedigree.
    
    Bonsai typically represents relationships as tuples (up, down, num_ancs) where:
    - up: generations from id1 up to common ancestor
    - down: generations from common ancestor down to id2  
    - num_ancs: number of common ancestors (1 or 2)
    
    For example:
    - Parent-child: (0,1,1) or (1,0,1)
    - Full siblings: (1,1,2)
    - Half-siblings: (1,1,1)
    - First cousins: (2,2,2)
    
    Returns:
        Relationship tuple or None if unrelated
    """
    # In a real implementation, this would trace paths through the pedigree
    # to find common ancestors and calculate meiotic distances
    
    # For this simplified version, we'll check a few basic relationships
    
    # Check for self-relationship
    if id1 == id2:
        return (0, 0, 1)  # Self
    
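    # Check for direct parent-child relationship.
    # up = generations from id1 to the common ancestor, down = generations from it to id2,
    # so a parent id1 is its own "ancestor" (up=0) one generation above id2 (down=1).
    if id1 in pedigree.get(id2, {}):
        return (0, 1, 1)  # id1 is parent of id2
    if id2 in pedigree.get(id1, {}):
        return (1, 0, 1)  # id2 is parent of id1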
    
    # Check for sibling relationship (same parents)
    parents1 = set(pedigree.get(id1, {}).keys())
    parents2 = set(pedigree.get(id2, {}).keys())
    
    common_parents = parents1.intersection(parents2)
    if len(common_parents) > 0:
        if len(common_parents) == 2:
            return (1, 1, 2)  # Full siblings (share 2 parents)
        else:
            return (1, 1, 1)  # Half siblings (share 1 parent)
    
    # For more distant relationships, we'd need to trace through the pedigree
    # In this simplified version, we'll return None (unrelated) for any other case
    return None

def calculate_genetic_likelihood(relationship, ibd_segments):
    """
    Calculate the genetic component of likelihood based on IBD patterns.
    
    In Bonsai, this uses models based on:
    1. Number of segments (count)
    2. Total length of IBD sharing (cM)
    3. Type of IBD (IBD1 vs IBD2) 
    
    Different relationships have distinct expected patterns, expressed here
    as the fraction of the genome length covered by each IBD type, e.g.:
    - Parent-child: ~100% IBD1, no IBD2
    - Full siblings: ~50% IBD1, ~25% IBD2
    - Half siblings: ~50% IBD1, no IBD2
    - First cousins: ~25% IBD1, no IBD2

    Returns:
        Log-likelihood score
    """
    # This is a simplified implementation for demonstration
    # Real Bonsai uses more sophisticated statistical models

    if not ibd_segments:
        # No IBD is a good fit for unrelated pairs and is penalized otherwise
        return 0.0 if relationship is None else -10.0

    # Calculate total IBD1 and IBD2 sharing
    total_ibd1 = sum(seg['length'] for seg in ibd_segments if seg['type'] == 'IBD1')
    total_ibd2 = sum(seg['length'] for seg in ibd_segments if seg['type'] == 'IBD2')

    # Genome length in centiMorgans (cM)
    genome_length = 3500.0

    # Proportion of genome sharing IBD1 and IBD2
    prop_ibd1 = total_ibd1 / genome_length
    prop_ibd2 = total_ibd2 / genome_length

    # If relationship is unknown (None), check if consistent with unrelated
    if relationship is None:
        # Unrelated pairs should have minimal IBD sharing
        if prop_ibd1 < 0.01 and prop_ibd2 < 0.001:
            return 0.0  # Good fit for unrelated
        else:
            return -5.0  # Too much IBD for unrelated individuals

    # Expected proportions of the genome in IBD1 / IBD2 for each relationship type
    # (these match the mu_T3 values used in the next cell: 1750 cM and 875 cM of 3500 cM)
    expected_ibd1 = 0.0
    expected_ibd2 = 0.0

    if relationship == (0, 1, 1) or relationship == (1, 0, 1):  # Parent-child
        expected_ibd1 = 1.0
        expected_ibd2 = 0.0
    elif relationship == (1, 1, 2):  # Full siblings
        expected_ibd1 = 0.5
        expected_ibd2 = 0.25
    elif relationship == (1, 1, 1):  # Half siblings
        expected_ibd1 = 0.5
        expected_ibd2 = 0.0
    elif relationship == (2, 2, 2):  # First cousins
        expected_ibd1 = 0.25
        expected_ibd2 = 0.0
    
    # Calculate squared error from expected proportions
    error = ((prop_ibd1 - expected_ibd1) ** 2 + 
             (prop_ibd2 - expected_ibd2) ** 2)
    
    # Convert to log-likelihood (higher is better)
    return -10.0 * error

def calculate_age_likelihood(relationship, age_diff):
    """
    Calculate the age component of likelihood based on age difference.
    
    In Bonsai, different relationships have expected age difference distributions:
    - Parent-child: ~20-40 years (parent older)
    - Siblings: ~0-10 years difference
    - Grandparent-grandchild: ~40-80 years (grandparent older)
    
    Returns:
        Log-likelihood score
    """
    # This is a simplified implementation for demonstration
    import math
    
    # If no relationship is specified (unrelated), any age difference is plausible
    if relationship is None:
        return 0.0
    
    # Expected mean and standard deviation for different relationships
    mean_age_diff = 0.0
    std_age_diff = 10.0  # Default
    
    if relationship == (0, 1, 1):  # Parent-child (parent is id1)
        mean_age_diff = 30.0
        std_age_diff = 8.0
    elif relationship == (1, 0, 1):  # Child-parent (child is id1)
        mean_age_diff = -30.0
        std_age_diff = 8.0
    elif relationship == (1, 1, 2) or relationship == (1, 1, 1):  # Siblings
        mean_age_diff = 0.0
        std_age_diff = 5.0
    elif relationship == (2, 0, 1):  # Grandchild-grandparent
        mean_age_diff = -55.0
        std_age_diff = 12.0
    elif relationship == (0, 2, 1):  # Grandparent-grandchild
        mean_age_diff = 55.0
        std_age_diff = 12.0
    
    # Calculate log-likelihood using Gaussian model
    # log pdf = -ln(σ·√(2π)) - (x-μ)²/(2σ²)
    log_likelihood = -math.log(std_age_diff * math.sqrt(2 * math.pi))
    log_likelihood -= ((age_diff - mean_age_diff) ** 2) / (2 * std_age_diff ** 2)
    
    return log_likelihood

def get_pair_ibd_segments(ibd_data, id1, id2):
    """Helper function to get IBD segments for a pair, handling order."""
    if (id1, id2) in ibd_data:
        return ibd_data[(id1, id2)]
    elif (id2, id1) in ibd_data:
        return ibd_data[(id2, id1)]
    else:
        return []

# Example usage (to demonstrate the algorithm)
if __name__ == "__main__":
    # In a Jupyter notebook, code here will be executed when the cell is run
    
    # Example pedigree (up-node dictionary)
    pedigree = {
        101: {201: 1, 202: 1},  # Individual 101 has parents 201 and 202
        102: {201: 1, 202: 1},  # Individual 102 has the same parents (full sibling of 101)
        103: {201: 1, 203: 1},  # Individual 103 is half-sibling to 101 and 102
        201: {},  # No parents specified
        202: {},
        203: {}
    }
    
    # Example IBD data
    ibd_data = {
        (101, 102): [  # Full siblings
            {'type': 'IBD1', 'length': 1700},
            {'type': 'IBD2', 'length': 900}
        ],
        (101, 103): [  # Half siblings (IBD1 over roughly half of the 3500 cM genome)
            {'type': 'IBD1', 'length': 1750}
        ]
    }
    
    # Example age data
    age_data = {
        101: 25,
        102: 27,
        103: 30,
        201: 55,
        202: 53,
        203: 58
    }
    
    # Calculate likelihood
    likelihood = calculate_pedigree_likelihood(pedigree, ibd_data, age_data)
    print(f"Pedigree log-likelihood: {likelihood:.2f}")
    
    # In a real implementation, an optimization algorithm would search for the
    # pedigree structure that maximizes this likelihood value

In [None]:
import math
import random
import itertools
import pprint
import numpy as np # Using numpy for easier stats generation
from scipy.stats import norm # Using scipy for logpdf calculation

# --- 1. Constants: Expected Statistics for Relationships ---
# Simplified means (mu) and standard deviations (sigma) for IBD stats and age diff
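# IBD Stats: T3/C3 = total length (cM) and count of all IBD segments (IBD1 or IBD2; called "IBD3" in this demo),
#            T2/C2 = total length (cM) and count of IBD2 segments only, so T2 <= T3 and C2 <= C3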
# Age Diff: Mean difference (id1_age - id2_age), Std Dev

RELATIONSHIP_PARAMS = {
    # (up, down, nc): {mu_T3, sigma_T3, mu_C3, sigma_C3, mu_T2, sigma_T2, mu_C2, sigma_C2, mu_Age, sigma_Age}
    (0, 0, 1):   {'mu_T3': 3500, 'sigma_T3': 50, 'mu_C3': 60, 'sigma_C3': 5,  # Self (approx)
                  'mu_T2': 3500, 'sigma_T2': 50, 'mu_C2': 60, 'sigma_C2': 5,
                  'mu_Age': 0,    'sigma_Age': 0.1}, # Need small sigma for age match
    (1, 0, 1):   {'mu_T3': 3500, 'sigma_T3': 100, 'mu_C3': 60, 'sigma_C3': 10, # Child-Parent
                  'mu_T2': 0,    'sigma_T2': 10,  'mu_C2': 0,  'sigma_C2': 1,
                  'mu_Age': -30,  'sigma_Age': 7},
    (0, 1, 1):   {'mu_T3': 3500, 'sigma_T3': 100, 'mu_C3': 60, 'sigma_C3': 10, # Parent-Child
                  'mu_T2': 0,    'sigma_T2': 10,  'mu_C2': 0,  'sigma_C2': 1,
                  'mu_Age': 30,   'sigma_Age': 7},
    (1, 1, 2):   {'mu_T3': 2600, 'sigma_T3': 250, 'mu_C3': 45, 'sigma_C3': 8,  # Full Siblings
                  'mu_T2': 800,  'sigma_T2': 200, 'mu_C2': 15, 'sigma_C2': 5,
                  'mu_Age': 0,    'sigma_Age': 7},
    (1, 1, 1):   {'mu_T3': 1750, 'sigma_T3': 200, 'mu_C3': 30, 'sigma_C3': 7,  # Half Siblings / Grandparent
                  'mu_T2': 0,    'sigma_T2': 20,  'mu_C2': 0,  'sigma_C2': 2,
                  'mu_Age': 0,    'sigma_Age': 9}, # Wider age range than full sibs generally
    (2, 0, 1):   {'mu_T3': 1750, 'sigma_T3': 200, 'mu_C3': 30, 'sigma_C3': 7,  # Grandchild-Grandparent
                  'mu_T2': 0,    'sigma_T2': 20,  'mu_C2': 0,  'sigma_C2': 2,
                  'mu_Age': -55,  'sigma_Age': 10},
    (0, 2, 1):   {'mu_T3': 1750, 'sigma_T3': 200, 'mu_C3': 30, 'sigma_C3': 7,  # Grandparent-Grandchild
                  'mu_T2': 0,    'sigma_T2': 20,  'mu_C2': 0,  'sigma_C2': 2,
                  'mu_Age': 55,   'sigma_Age': 10},
    (2, 2, 2):   {'mu_T3': 875,  'sigma_T3': 150, 'mu_C3': 20, 'sigma_C3': 6,  # First Cousins
                  'mu_T2': 0,    'sigma_T2': 30,  'mu_C2': 0,  'sigma_C2': 3,
                  'mu_Age': 0,    'sigma_Age': 12},
    None:        {'mu_T3': 15,   'sigma_T3': 10,  'mu_C3': 1,  'sigma_C3': 1,  # Unrelated (allowing small background)
                  'mu_T2': 0,    'sigma_T2': 5,   'mu_C2': 0,  'sigma_C2': 0.5,
                  'mu_Age': 0,    'sigma_Age': 25}, # Wide age variation
}
# Add relationship type string for clarity later
for r, p in RELATIONSHIP_PARAMS.items():
    if r == (0,0,1): p['name'] = 'Self'
    elif r == (1,0,1) or r == (0,1,1): p['name'] = 'Parent/Child'
    elif r == (1,1,2): p['name'] = 'Full Sib'
    elif r == (1,1,1): p['name'] = 'Half Sib'
    elif r == (2,0,1) or r == (0,2,1): p['name'] = 'Grandparent/Child'
    elif r == (2,2,2): p['name'] = 'First Cousin'
    elif r is None: p['name'] = 'Unrelated'
    else: p['name'] = str(r) # Default for others

# --- 2. IBD Simulation ---

class IBDStats:
    """Simple class to hold IBD summary stats."""
    def __init__(self, t3, c3, t2, c2):
        self.t3 = t3
        self.c3 = c3
        self.t2 = t2
        self.c2 = c2
    def __repr__(self):
        return f"IBD(T3={self.t3:.1f}, C3={self.c3}, T2={self.t2:.1f}, C2={self.c2})"

def simulate_ibd_stats(relationship_tuple):
    """Simulate IBD stats based on expected parameters for a relationship."""
    params = RELATIONSHIP_PARAMS.get(relationship_tuple, RELATIONSHIP_PARAMS[None]) # Default to unrelated

    # Simulate using normal distribution, ensuring non-negativity
    t3 = max(0, random.gauss(params['mu_T3'], params['sigma_T3']))
    c3 = max(0, int(round(random.gauss(params['mu_C3'], params['sigma_C3']))))
    t2 = max(0, random.gauss(params['mu_T2'], params['sigma_T2']))
    c2 = max(0, int(round(random.gauss(params['mu_C2'], params['sigma_C2']))))

    # Ensure basic consistency (e.g., T2 <= T3, C2 <= C3)
    t2 = min(t2, t3)
    c2 = min(c2, c3)
    if c3 == 0: t3 = 0
    if c2 == 0: t2 = 0

    return IBDStats(t3, c3, t2, c2)

# --- 3. Relationship Determination (Simplified) ---

def determine_relationship(pedigree, id1, id2):
    """
    Simplified relationship determination from pedigree dictionary.
    Handles self, parent/child, siblings. Returns None otherwise.
    """
    if id1 == id2: return (0, 0, 1)
    parents1 = set(pedigree.get(id1, {}).keys())
    parents2 = set(pedigree.get(id2, {}).keys())

    if id1 in parents2: return (0, 1, 1) # id1 is parent of id2
    if id2 in parents1: return (1, 0, 1) # id2 is parent of id1

    common_parents = parents1.intersection(parents2)
    if len(common_parents) == 2 and len(parents1) == 2 and len(parents2) == 2:
        return (1, 1, 2) # Full Siblings
    elif len(common_parents) == 1:
        # Could be half-sib or other complex case, simplified to half-sib
        return (1, 1, 1) # Half Siblings

    # --- Add grandparent check for illustration ---
    grandparents1 = set()
    for p1 in parents1:
        grandparents1.update(pedigree.get(p1, {}).keys())
    if id2 in grandparents1: return (2, 0, 1) # id2 is grandparent of id1

    grandparents2 = set()
    for p2 in parents2:
        grandparents2.update(pedigree.get(p2, {}).keys())
    if id1 in grandparents2: return (0, 2, 1) # id1 is grandparent of id2
    # --- End grandparent check ---

    # NOTE: This does NOT find cousins or more complex relationships.
    return None # Assume unrelated otherwise

# --- 4. Likelihood Calculation Functions ---

def calculate_pairwise_loglik(observed_ibd: IBDStats, age1: float | None, age2: float | None, relationship_tuple):
    """Calculates the composite pairwise log-likelihood."""
    expected_params = RELATIONSHIP_PARAMS.get(relationship_tuple, RELATIONSHIP_PARAMS[None])

    def _gauss_logpdf(x, mu, sigma):
        # norm.logpdf requires scale > 0, so sigma == 0 is handled explicitly
        # as a point mass at mu.
        if sigma > 0:
            return norm.logpdf(x, loc=mu, scale=sigma)
        return 0.0 if x == mu else -np.inf

    # Genetic component: sum of Gaussian log-densities over the four IBD statistics
    log_lik_g = 0.0
    for stat in ('T3', 'C3', 'T2', 'C2'):
        observed = getattr(observed_ibd, stat.lower())
        log_lik_g += _gauss_logpdf(observed, expected_params[f'mu_{stat}'], expected_params[f'sigma_{stat}'])

    # Age component: Gaussian log-density of the age difference, if both ages are known
    log_lik_a = 0.0
    if age1 is not None and age2 is not None:
        log_lik_a += _gauss_logpdf(age1 - age2, expected_params['mu_Age'], expected_params['sigma_Age'])

    # Check for -inf results which indicate impossibility under the model
    if np.isinf(log_lik_g) or np.isinf(log_lik_a):
        return -np.inf

    return log_lik_g + log_lik_a


def calculate_pedigree_log_likelihood(pedigree, all_pairwise_ibd_stats, all_ages):
    """Calculates the total pedigree log-likelihood."""
    total_log_likelihood = 0.0
    # Consider only individuals present in the pedigree keys/values
    ped_individuals = set(pedigree.keys())
    for parents_dict in pedigree.values():
        ped_individuals.update(parents_dict.keys())

    processed_pairs = set() # To avoid calculating twice if data access is symmetric

    for id1, id2 in itertools.combinations(ped_individuals, 2):
         # Ensure we process each pair only once
        pair_key = tuple(sorted((id1, id2)))
        if pair_key in processed_pairs:
            continue

        # Determine relationship based on the *proposed* pedigree structure
        relationship_tuple = determine_relationship(pedigree, id1, id2)

        # Get the *observed* (simulated) IBD data for this pair
        observed_ibd = all_pairwise_ibd_stats.get(pair_key)
        # If no IBD was simulated (maybe truly unrelated), create stats obj with zeros
        if observed_ibd is None:
            observed_ibd = IBDStats(0, 0, 0, 0)

        # Get ages
        age1 = all_ages.get(id1)
        age2 = all_ages.get(id2)

        # Calculate pairwise likelihood using observed data and relationship from pedigree
        pair_ll = calculate_pairwise_loglik(observed_ibd, age1, age2, relationship_tuple)
        total_log_likelihood += pair_ll
        processed_pairs.add(pair_key)

    return total_log_likelihood


# --- 5. Demonstration ---

if __name__ == "__main__":

    print("--- Setting up True Pedigree and Data ---")
    # Define a slightly more complex "true" pedigree
    true_pedigree = {
        # Gen 3
        101: {201: 1, 202: 1},
        102: {201: 1, 202: 1}, # Full sib to 101
        103: {201: 1, 203: 1}, # Half sib to 101, 102 (common father 201)
        104: {204: 1, 205: 1}, # Unrelated family line
        # Gen 2
        201: {301: 1, 302: 1}, # Father of 101, 102, 103
        202: {303: 1, 304: 1}, # Mother of 101, 102
        203: {305: 1, 306: 1}, # Mother of 103
        204: {307: 1, 308: 1}, # Parent of 104
        205: {309: 1, 310: 1}, # Parent of 104
        # Gen 1 (Founders)
        301: {}, 302: {}, 303: {}, 304: {}, 305: {}, 306: {},
        307: {}, 308: {}, 309: {}, 310: {},
    }

    # Define ages consistent with the pedigree
    ages = {
        101: 25, 102: 27, 103: 30, 104: 26, # Gen 3
        201: 55, 202: 53, 203: 58, 204: 54, 205: 56, # Gen 2
        301: 80, 302: 79, 303: 78, 304: 81, 305: 82, 306: 77, # Gen 1
        307: 80, 308: 79, 309: 78, 310: 81,
    }

    # Get all individuals involved
    all_individuals = set(true_pedigree.keys())
    for parents_dict in true_pedigree.values():
        all_individuals.update(parents_dict.keys())
    all_individuals = sorted(list(all_individuals))

    print("True Pedigree Structure:")
    pprint.pprint(true_pedigree)
    print("\nAges:")
    pprint.pprint(ages)

    print("\n--- Simulating Pairwise IBD Data based on True Pedigree ---")
    # Simulate IBD data for ALL pairs based on their TRUE relationship
    simulated_pairwise_ibd = {}
    for id1, id2 in itertools.combinations(all_individuals, 2):
        true_relationship = determine_relationship(true_pedigree, id1, id2)
        sim_ibd = simulate_ibd_stats(true_relationship)
        # Store only if significant IBD simulated (or based on relationship)
        # Here we store all for completeness in calculation, even zeros
        simulated_pairwise_ibd[tuple(sorted((id1, id2)))] = sim_ibd
        # Print some examples
        if true_relationship is not None and random.random() < 0.1: # Print ~10% of related pairs
             rel_name = RELATIONSHIP_PARAMS.get(true_relationship, {}).get('name', str(true_relationship))
             print(f"  Simulated IBD for ({id1}, {id2}) (True Rel: {rel_name}): {sim_ibd}")
        elif true_relationship is None and random.random() < 0.01: # Print ~1% of unrelated pairs
             print(f"  Simulated IBD for ({id1}, {id2}) (True Rel: Unrelated): {sim_ibd}")


    print("\n--- Evaluating Pedigree Likelihoods ---")

    # 1. Evaluate the TRUE pedigree
    ll_true = calculate_pedigree_log_likelihood(true_pedigree, simulated_pairwise_ibd, ages)
    print(f"Log-Likelihood of TRUE Pedigree: {ll_true:.2f}")

    # 2. Evaluate an INCORRECT pedigree (e.g., give 103 the same two parents as 101/102)
    incorrect_pedigree_1 = true_pedigree.copy() # Shallow copy is okay for this change
    incorrect_pedigree_1[103] = {201: 1, 202: 1} # Incorrect: Makes 103 full sib to 101/102
                                                # (child of 202 instead of 203).
    ll_incorrect_1 = calculate_pedigree_log_likelihood(incorrect_pedigree_1, simulated_pairwise_ibd, ages)
    print(f"Log-Likelihood of INCORRECT Pedigree 1 (103 Full Sib): {ll_incorrect_1:.2f}")

    # 3. Evaluate another INCORRECT pedigree (e.g., make unrelated line related)
    incorrect_pedigree_2 = true_pedigree.copy()
    incorrect_pedigree_2[104] = {201: 1, 205: 1} # Incorrect: Makes 104 child of 201 (half-sib to 101/102/103)
                                                # instead of child of 204/205.
    ll_incorrect_2 = calculate_pedigree_log_likelihood(incorrect_pedigree_2, simulated_pairwise_ibd, ages)
    print(f"Log-Likelihood of INCORRECT Pedigree 2 (104 Half Sib): {ll_incorrect_2:.2f}")

    print("\n--- Conclusion ---")
    print("The pedigree structure yielding the HIGHEST (least negative) log-likelihood")
    print("is considered the best fit to the observed genetic and age data.")
    print("Ideally, the score for the 'TRUE Pedigree' should be significantly higher")
    print("than the scores for the incorrect structures.")

## Discussion of the Runnable Likelihood Demonstration

This code cell provides a runnable demonstration of the core principle behind Bonsai's evaluation of pedigrees: **calculating a pedigree's log-likelihood by summing pairwise contributions.** It illustrates how different proposed pedigree structures are scored against observed genetic (IBD) and demographic (age) data.
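
Written out, the score computed by this cell is a sum over pairs. This form rests on a simplifying assumption made by the demonstration (not necessarily by Bonsai itself): pairs contribute independently, and the genetic and age components are independent given the relationship. The symbols $R_{ij}(\mathcal{P})$, the relationship that pedigree $\mathcal{P}$ implies for individuals $i$ and $j$, and $\Delta a_{ij}$, their age difference, are introduced here only for illustration.

$$\log L(\mathcal{P}) = \sum_{i<j} \Big[ \log P\big(\text{IBD}_{ij} \mid R_{ij}(\mathcal{P})\big) + \log P\big(\Delta a_{ij} \mid R_{ij}(\mathcal{P})\big) \Big]$$

Each term in the demonstration is a Gaussian log-density with relationship-specific mean $\mu$ and standard deviation $\sigma$:

$$\log \mathcal{N}(x; \mu, \sigma) = -\log\big(\sigma\sqrt{2\pi}\big) - \frac{(x-\mu)^2}{2\sigma^2}$$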

**Code Breakdown:**

1.  **Constants (`RELATIONSHIP_PARAMS`):** This dictionary defines *simplified expectations* (mean `mu` and standard deviation `sigma`) for several IBD summary statistics and for age differences, for common relationship types. The IBD statistics are the total length and segment count of all IBD segments (`T3`, `C3`) and of IBD2 segments only (`T2`, `C2`).
    *   *Note:* In a real Bonsai implementation (Jewett et al. 2021), these parameters are derived empirically from vast datasets and simulations, not hardcoded simple values. The structure here, however, represents the *kind* of information needed.

2.  **IBD Simulation (`simulate_ibd_stats`):** This function generates *plausible "observed" IBD data* for a specific relationship. It uses the defined constants and Gaussian sampling to create `IBDStats` objects (T3, C3, T2, C2). This simulates the output you might get from an IBD detection tool applied to real genetic data for pairs with known relationships.

3.  **Relationship Determination (`determine_relationship`):** Given a *candidate* pedigree structure (represented as an up-node dictionary), this function determines the specific genealogical relationship `(up, down, num_ancs)` between two individuals *within that structure*.
    *   *Note:* The provided version is **highly simplified** and only correctly identifies self, parent/child, full/half-siblings, and grandparent relationships based on direct links. A real implementation needs complex graph traversal algorithms to find the Most Recent Common Ancestor(s) (MRCAs) and calculate meiotic paths for *all* possible relationships (cousins, avuncular, etc.).

4.  **Pairwise Likelihood (`calculate_pairwise_loglik`):** This is the core calculation for a single pair. It takes:
    *   The *observed* (simulated) IBD statistics for the pair.
    *   The *observed* ages for the pair.
    *   The *relationship implied by the pedigree being tested*.
    It then calculates a log-likelihood score based on how well the observed IBD and age difference match the *expected* values (from `RELATIONSHIP_PARAMS`) for that specific relationship, using Gaussian probability density functions (via `scipy.stats.norm.logpdf`). It sums the genetic and age components.

5.  **Pedigree Likelihood (`calculate_pedigree_log_likelihood`):** This function orchestrates the overall scoring. It iterates through all unique pairs of individuals present in the *candidate* pedigree, determines their relationship *within that structure* using `determine_relationship`, fetches the *observed* (simulated) data for that pair, calculates their pairwise log-likelihood using `calculate_pairwise_loglik`, and sums these scores to get the total log-likelihood for the entire pedigree.

6.  **Demonstration (`if __name__ == "__main__":`):**
    *   Sets up a "true" pedigree and corresponding ages.
    *   Generates simulated IBD data for all pairs based on their *true* relationships.
    *   Calculates the log-likelihood score for the `true_pedigree` using the simulated data.
    *   Creates two `incorrect_pedigree` structures.
    *   Calculates the log-likelihood scores for these incorrect pedigrees using the *same* simulated data.
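
The scoring described in items 4 and 5 can be written compactly. This is a sketch of the simplified Gaussian model used in this lab, not the likelihood of a production Bonsai implementation, and the symbols are notation introduced here for illustration. For a pair $(i, j)$ evaluated under relationship $r$:

$$\ell_{ij}(r) = \sum_{s} \log \mathcal{N}\!\left(x_s^{(ij)} \mid \mu_s(r), \sigma_s(r)\right) + \log \mathcal{N}\!\left(\Delta a_{ij} \mid \mu_{\text{age}}(r), \sigma_{\text{age}}(r)\right)$$

where $s$ ranges over the IBD summary statistics (total lengths and counts), $x_s^{(ij)}$ is the observed value of statistic $s$ for the pair, and $\Delta a_{ij}$ is the observed age difference. The score of a candidate pedigree $\mathcal{P}$ is then the sum over all unique pairs, each evaluated under the relationship $r_{ij}(\mathcal{P})$ that the pedigree implies:

$$\log L(\mathcal{P}) = \sum_{i < j} \ell_{ij}\big(r_{ij}(\mathcal{P})\big)$$

Summing over pairs treats the pairs as independent, which is an approximation.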

**Connecting to Theory (ERSA, Distributions, etc.):**

This demonstration uses simplified Gaussian models for IBD statistics for ease of implementation. However, it illustrates the *principle* used in more sophisticated methods:

*   **Likelihood Principle:** The core idea is $P(\text{Data} | \text{Model})$, where the "Data" is the observed IBD/age for a pair, and the "Model" is the specific relationship implied by the pedigree.
*   **IBD Models:** More rigorous methods model IBD segment properties based on genetic theory:
    *   **Segment Lengths:** Often modeled using the **Exponential distribution** (for segments arising from a single meiosis path) or the **Gamma distribution** (for total length, which is a sum of segment lengths). The mean segment length is inversely proportional to the number of meioses separating the individuals: roughly $100/m$ cM for a path of $m$ meioses.
    *   **Segment Counts:** Often modeled using the **Poisson distribution** (as in the ERSA method by Huff et al. 2011) or the more flexible **Negative Binomial distribution** (as discussed in Jewett et al. 2024 preprint) to account for overdispersion. The expected count decreases rapidly with distance.
*   **This Notebook's Context:** Ideally, explorations of the Exponential distribution for segment lengths and the Poisson distribution for counts would precede this demonstration to provide theoretical grounding. This script uses Gaussian approximations for simplicity but embodies the same goal: quantifying the match between observed IBD and the expectation under a given relationship. The use of summary statistics (Total Length, Count) aligns with methods like ERSA and with the inputs to the Kirkpatrick/Huff Bonsai; the 23andMe Bonsai (Jewett et al. 2021) likewise uses these statistics (fitted empirically) in its likelihood calculation.
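
The segment-length and segment-count models in the list above can be sketched in a few lines. This is a minimal illustration, not code from Bonsai or ERSA: the function names are hypothetical, the 3500 cM genome length matches the value used elsewhere in this lab, and the expected-count expression $a(rm + c)/2^{m-1}$ (with $a$ common ancestors, $r$ recombinations per meiosis, $c$ chromosomes, and $m$ meioses) is the ERSA-style approximation as it is commonly stated, so it should be treated as an assumption here.

```python
import math
from scipy.stats import expon, poisson

GENOME_CM = 3500.0    # assumed genome length, as elsewhere in this lab
N_CHROMOSOMES = 22    # autosomes

def mean_segment_length_cm(meioses):
    """Mean IBD segment length for a path of `meioses` meioses (exponential model)."""
    return 100.0 / meioses

def prob_segment_exceeds(meioses, min_cm):
    """P(segment length > min_cm) under the exponential model."""
    return expon.sf(min_cm, scale=mean_segment_length_cm(meioses))

def expected_segment_count(meioses, n_ancestors=1, min_cm=7.0):
    """ERSA-style expected number of detectable segments (assumed form)."""
    recomb_per_meiosis = GENOME_CM / 100.0
    potential = n_ancestors * (recomb_per_meiosis * meioses + N_CHROMOSOMES) / 2 ** (meioses - 1)
    return potential * prob_segment_exceeds(meioses, min_cm)

examples = [("half-siblings", 2, 1), ("first cousins", 4, 2), ("second cousins", 6, 2)]
for label, meioses, n_anc in examples:
    lam = expected_segment_count(meioses, n_anc)
    k = round(lam)  # an observed count close to the expectation
    print(f"{label:>14}: mean length {mean_segment_length_cm(meioses):5.1f} cM, "
          f"expected count {lam:5.1f}, log P(count={k}) = {poisson.logpmf(k, lam):.2f}")
```

The expected count falls quickly with the number of meioses because both the halving term and the threshold survival probability shrink together.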

**Interpreting the Output:**

The script calculates and prints the total log-likelihood scores for the true pedigree and two incorrect variations, using the *same* underlying simulated IBD/age data (which was generated based on the true pedigree).

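*   **Log-Likelihood Values:** These scores are typically negative here. Strictly speaking, they are sums of log *densities* rather than log probabilities, so they are not bounded above by zero (a Gaussian log-density is positive when `sigma` is small), but with cM-scale standard deviations each term is well below zero. Only relative comparisons matter: **higher values (less negative) indicate a better fit.**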
*   **Expected Result:** The `Log-Likelihood of TRUE Pedigree` should be significantly higher (less negative) than the scores for the `INCORRECT Pedigree` examples.
*   **Why?** When evaluating the true pedigree, the `determine_relationship` function finds the correct relationship for each pair. The `calculate_pairwise_loglik` then compares the observed (simulated) data against the *correct* expected parameters, leading to a relatively good fit and a higher likelihood score. When evaluating an incorrect pedigree, `determine_relationship` will find *incorrect* relationships for some pairs. `calculate_pairwise_loglik` then compares the observed data (generated under the *true* relationship) against the expected parameters for the *wrong* relationship, leading to a poor fit and a lower (more negative) likelihood score for those pairs, dragging down the total pedigree score.
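
A single-pair example makes the comparison concrete. The means and standard deviations below are hypothetical stand-ins, not the values in `RELATIONSHIP_PARAMS`, and the observed total is made up for illustration.

```python
from scipy.stats import norm

# Hypothetical (mean, sd) of total shared cM; NOT the lab's RELATIONSHIP_PARAMS
ILLUSTRATIVE_PARAMS = {
    "half-siblings": (1750.0, 150.0),
    "first-cousins": (875.0, 120.0),
}

observed_total_cm = 1700.0  # made-up observation for a pair that is truly half-siblings

# Score the same observation under each candidate relationship
pair_scores = {
    rel: float(norm.logpdf(observed_total_cm, loc=mu, scale=sigma))
    for rel, (mu, sigma) in ILLUSTRATIVE_PARAMS.items()
}
best_fit = max(pair_scores, key=pair_scores.get)

for rel, score in pair_scores.items():
    print(f"{rel:>14}: log-likelihood = {score:.2f}")
print("Best-fitting relationship:", best_fit)

# Log-densities are not bounded by zero: a narrow Gaussian at its mean scores above 0
narrow_density_score = float(norm.logpdf(0.0, loc=0.0, scale=0.1))
print(f"Narrow Gaussian at its mean: {narrow_density_score:.2f}")
```

The wrong relationship is penalized because the observation sits several standard deviations from its expected mean, which is the effect that drags down the score of an incorrect pedigree.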

**Overall Significance:**

This demonstration highlights how the log-likelihood score acts as an objective function. It provides a quantitative measure of how well any given pedigree structure explains the observed data. Optimization algorithms used in Bonsai (like Simulated Annealing or Integer Linear Programming) explore the vast space of possible pedigrees, using this log-likelihood calculation repeatedly to guide their search towards the structure(s) that best fit the evidence.
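
The role of the score as an objective function can be sketched with a toy search. Everything here is an illustrative assumption: the expectations in `EXPECTED_CM`, the observed values, and the helper names are hypothetical, the relationship classifier only recognizes a direct parent-child link, and exhaustively scoring three hand-written candidates stands in for a real optimizer that would propose candidates itself.

```python
from itertools import combinations
from scipy.stats import norm

# Hypothetical (mean, sd) of total shared cM for the two relationships this toy knows
EXPECTED_CM = {"parent-child": (3500.0, 50.0), "unrelated": (0.0, 50.0)}

def implied_relationship(up_nodes, a, b):
    """Toy classifier: a direct parent-child link, or else 'unrelated'."""
    if b in up_nodes.get(a, {}) or a in up_nodes.get(b, {}):
        return "parent-child"
    return "unrelated"

def pedigree_log_likelihood(up_nodes, observed_cm):
    """Sum of pairwise Gaussian log-likelihoods over all unique pairs."""
    total = 0.0
    for a, b in combinations(sorted(up_nodes), 2):
        mu, sigma = EXPECTED_CM[implied_relationship(up_nodes, a, b)]
        total += float(norm.logpdf(observed_cm[(a, b)], loc=mu, scale=sigma))
    return total

# Made-up observations: individuals 1 and 2 share almost the whole genome
observed_cm = {(1, 2): 3480.0, (1, 3): 12.0, (2, 3): 5.0}

# Candidate pedigrees in {child: {parent: 1}} form
candidates = {
    "2 is child of 1": {1: {}, 2: {1: 1}, 3: {}},
    "3 is child of 1": {1: {}, 2: {}, 3: {1: 1}},
    "no links": {1: {}, 2: {}, 3: {}},
}

scores = {name: pedigree_log_likelihood(ped, observed_cm) for name, ped in candidates.items()}
best_name = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name:>16}: {score:.1f}")
print("Selected pedigree:", best_name)
```

The same data are scored under every candidate; only the implied relationships change, so the candidate whose structure matches the data wins.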

## 2. Likelihood Functions: Quantifying the Evidence

The likelihood function measures how well a proposed pedigree explains the observed IBD data. In Bonsai, this function is built from probabilistic models of IBD segment inheritance.

### IBD Likelihood Models

For a pair of individuals, the likelihood of their IBD sharing given a specific relationship can be expressed as:

$$L(r | \text{IBD}) = P(\text{IBD} | r)$$

where:
- $r$ is the relationship type (e.g., parent-child, siblings, cousins)
- IBD represents the observed IBD segments between the individuals

Before implementing likelihood models, let's first look at the theoretical IBD sharing expectations that those models are built around.

In [None]:
# --- Illustration: Expected IBD Proportions and Total Sharing ---
import matplotlib.pyplot as plt
import pandas as pd

# Theoretical expectations
relationships = ['Parent-Child', 'Full Siblings', 'Half-Siblings/\nGrandparent', 'First Cousins', 'Unrelated']
expected_ibd0 = [0.0, 0.25, 0.50, 0.75, 1.0]
expected_ibd1 = [1.0, 0.50, 0.50, 0.25, 0.0]
expected_ibd2 = [0.0, 0.25, 0.0,  0.0,  0.0]
expected_total_cm = [3500, 2600, 1750, 875, 0] # Approximate total cM shared

exp_df = pd.DataFrame({
    'Relationship': relationships,
    'IBD0 (%)': [p * 100 for p in expected_ibd0],
    'IBD1 (%)': [p * 100 for p in expected_ibd1],
    'IBD2 (%)': [p * 100 for p in expected_ibd2],
    'Total Shared (cM)': expected_total_cm
})

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot IBD Proportions
exp_df.plot(x='Relationship', y=['IBD0 (%)', 'IBD1 (%)', 'IBD2 (%)'], kind='bar',
            stacked=True, ax=axes[0], colormap='viridis')
axes[0].set_title('Expected IBD State Proportions')
axes[0].set_ylabel('Genome Proportion (%)')
axes[0].tick_params(axis='x', rotation=45)
axes[0].legend(title='IBD State')

# Plot Total Sharing
exp_df.plot(x='Relationship', y='Total Shared (cM)', kind='bar', ax=axes[1], color='skyblue')
axes[1].set_title('Expected Total Shared IBD')
axes[1].set_ylabel('Approx. Centimorgans (cM)')
axes[1].tick_params(axis='x', rotation=45)
axes[1].get_legend().remove()

plt.suptitle('Theoretical IBD Sharing Expectations by Relationship', fontsize=14)
plt.tight_layout(rect=[0, 0, 1, 0.96]) # Adjust layout to prevent title overlap
plt.show()

print("Note: These are theoretical averages. Actual sharing varies due to randomness of recombination.")

### Comparing Likelihood Models for Different Relationships

Let's compare how these likelihood models behave for different relationship scenarios. We'll create synthetic IBD data for various relationships and see how well our models can distinguish them.

In [None]:
def generate_synthetic_ibd(relationship_type, noise_level=0.1):
    """Generate synthetic IBD data for a specific relationship type.
    
    Args:
        relationship_type: One of 'parent-child', 'siblings', 'grandparent', 
                          'half-siblings', 'first-cousins', 'second-cousins'
        noise_level: Level of noise/randomness to add (0.0-1.0)
        
    Returns:
        List of synthetic IBD segments
    """
    genome_length = 3500  # cM
    segments = []
    
    if relationship_type == 'parent-child':
        # Parent-child: ~100% IBD1, no IBD2
        # Create segments covering the entire genome with some fragmentation
        remaining = genome_length
        while remaining > 0:
            # Create segments of varying sizes but maintain total coverage
            seg_length = min(remaining, random.uniform(50, 200))
            segments.append({
                'type': 'IBD1',
                'length': seg_length * (1.0 + random.uniform(-noise_level, noise_level))
            })
            remaining -= seg_length
            
    elif relationship_type == 'siblings':
        # Siblings: ~25% IBD0, ~50% IBD1, ~25% IBD2
        ibd0_target = 0.25 * genome_length
        ibd1_target = 0.50 * genome_length
        ibd2_target = 0.25 * genome_length
        
        # Add noise to targets
        ibd0_target *= (1.0 + random.uniform(-noise_level, noise_level))
        ibd1_target *= (1.0 + random.uniform(-noise_level, noise_level))
        ibd2_target *= (1.0 + random.uniform(-noise_level, noise_level))
        
        # Create IBD1 segments
        ibd1_remaining = ibd1_target
        while ibd1_remaining > 10:  # Minimum segment size
            seg_length = min(ibd1_remaining, random.uniform(10, 100))
            segments.append({
                'type': 'IBD1',
                'length': seg_length
            })
            ibd1_remaining -= seg_length
            
        # Create IBD2 segments
        ibd2_remaining = ibd2_target
        while ibd2_remaining > 10:  # Minimum segment size
            seg_length = min(ibd2_remaining, random.uniform(10, 100))
            segments.append({
                'type': 'IBD2',
                'length': seg_length
            })
            ibd2_remaining -= seg_length
            
    elif relationship_type in ['half-siblings', 'grandparent']:
        # Half-siblings/Grandparent: ~50% IBD0, ~50% IBD1, no IBD2
        ibd1_target = 0.50 * genome_length * (1.0 + random.uniform(-noise_level, noise_level))
        
        # Create IBD1 segments
        ibd1_remaining = ibd1_target
        while ibd1_remaining > 10:  # Minimum segment size
            seg_length = min(ibd1_remaining, random.uniform(10, 100))
            segments.append({
                'type': 'IBD1',
                'length': seg_length
            })
            ibd1_remaining -= seg_length
            
    elif relationship_type == 'first-cousins':
        # First cousins: ~12.5% IBD1
        ibd1_target = 0.125 * genome_length * (1.0 + random.uniform(-noise_level, noise_level))
        
        # Create IBD1 segments
        ibd1_remaining = ibd1_target
        while ibd1_remaining > 7:  # Minimum segment size
            seg_length = min(ibd1_remaining, random.uniform(7, 50))
            segments.append({
                'type': 'IBD1',
                'length': seg_length
            })
            ibd1_remaining -= seg_length
            
    elif relationship_type == 'second-cousins':
        # Second cousins: ~3.125% IBD1
        ibd1_target = 0.03125 * genome_length * (1.0 + random.uniform(-noise_level, noise_level))
        
        # Create IBD1 segments
        ibd1_remaining = ibd1_target
        while ibd1_remaining > 7:  # Minimum segment size
            seg_length = min(ibd1_remaining, random.uniform(7, 30))
            segments.append({
                'type': 'IBD1',
                'length': seg_length
            })
            ibd1_remaining -= seg_length
    
    # Sort segments by length (largest first)
    return sorted(segments, key=lambda x: x['length'], reverse=True)

In [None]:
# Generate synthetic IBD data for different relationships
relationship_types = ['parent-child', 'siblings', 'half-siblings', 'first-cousins', 'second-cousins']
synthetic_data = {}

for rel_type in relationship_types:
    synthetic_data[rel_type] = generate_synthetic_ibd(rel_type)
    
    # Print summary statistics
    total_ibd1 = sum(seg['length'] for seg in synthetic_data[rel_type] if seg['type'] == 'IBD1')
    total_ibd2 = sum(seg['length'] for seg in synthetic_data[rel_type] if seg['type'] == 'IBD2')
    ibd1_count = sum(1 for seg in synthetic_data[rel_type] if seg['type'] == 'IBD1')
    ibd2_count = sum(1 for seg in synthetic_data[rel_type] if seg['type'] == 'IBD2')
    
    print(f"{rel_type}:")
    print(f"  IBD1: {total_ibd1:.1f} cM in {ibd1_count} segments")
    print(f"  IBD2: {total_ibd2:.1f} cM in {ibd2_count} segments")
    print(f"  Total: {total_ibd1 + total_ibd2:.1f} cM")
    print()

In [None]:
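# Calculate likelihoods for each relationship type using each model.
# Note: parent_child_likelihood, sibling_likelihood, and
# distant_relationship_likelihood are defined in the cell further below
# ("Simplified Likelihood Function Definitions"); run that cell first.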
results = []

for true_rel in relationship_types:
    data = synthetic_data[true_rel]
    
    # Calculate likelihoods
    pc_like = parent_child_likelihood(data)
    sib_like = sibling_likelihood(data)
    
    # Distant relationship likelihoods
    hs_like = distant_relationship_likelihood(data, 2)  # Half-siblings: 2 meioses
    fc_like = distant_relationship_likelihood(data, 4)  # First cousins: 4 meioses
    sc_like = distant_relationship_likelihood(data, 6)  # Second cousins: 6 meioses
    
    results.append({
        'True Relationship': true_rel,
        'Parent-Child Log-Likelihood': pc_like,
        'Sibling Log-Likelihood': sib_like,
        'Half-Sibling Log-Likelihood': hs_like,
        'First-Cousin Log-Likelihood': fc_like,
        'Second-Cousin Log-Likelihood': sc_like
    })

# Convert to DataFrame for easier analysis
likelihood_df = pd.DataFrame(results)
likelihood_df

In [None]:
# Identify the highest likelihood model for each relationship
likelihood_columns = [
    'Parent-Child Log-Likelihood',
    'Sibling Log-Likelihood',
    'Half-Sibling Log-Likelihood',
    'First-Cousin Log-Likelihood',
    'Second-Cousin Log-Likelihood'
]

# Find max likelihood for each row
likelihood_df['Max Likelihood'] = likelihood_df[likelihood_columns].max(axis=1)
likelihood_df['Predicted Relationship'] = likelihood_df[likelihood_columns].idxmax(axis=1)

# Clean up the predicted relationship string (column names end in ' Log-Likelihood')
likelihood_df['Predicted Relationship'] = likelihood_df['Predicted Relationship'].str.replace(' Log-Likelihood', '', regex=False)

# Display results
likelihood_df[['True Relationship', 'Predicted Relationship', 'Max Likelihood']]

In [None]:
# --- Imports for Likelihood Calculations ---
import math
import numpy as np
from scipy.stats import poisson, norm
# Note: Ensure 'synthetic_data' dictionary is defined in a previous cell

# --- Simplified Likelihood Function Definitions ---

def parent_child_likelihood(segments):
    """Calculate likelihood of a parent-child relationship (simplified model)."""
    # For parent-child: Expect ~100% IBD1, no IBD2
    ibd1_length = sum(seg.get('length', 0) for seg in segments if seg.get('type') == 'IBD1')
    ibd2_count = sum(1 for seg in segments if seg.get('type') == 'IBD2')
    genome_length = 3500.0
    coverage = min(1.0, ibd1_length / genome_length if genome_length > 0 else 0)

    # Heuristic scoring based on closeness to expectation
    if coverage > 0.95 and ibd2_count == 0:
        # Very likely parent-child: high IBD1 coverage and no IBD2.
        # Return a fixed, near-zero heuristic score (high log-likelihood).
        return -0.1
    else:
        # Penalize deviations from expected pattern
        # Penalty for low coverage OR presence of IBD2
        coverage_penalty = ((1.0 - coverage) ** 2) * 50 # Penalize low coverage heavily
        ibd2_penalty = (ibd2_count ** 2) * 5         # Penalize IBD2 segments
        # Return a lower log-likelihood (more negative)
        # Ensure it's clearly distinct from the "good" case
        return -(coverage_penalty + ibd2_penalty + 1.0) # Base penalty + deviation penalties

def sibling_likelihood(segments):
    """Calculate likelihood of a full sibling relationship (simplified model)."""
    # For full siblings: Expect ~25% IBD0, ~50% IBD1, ~25% IBD2
    ibd1_length = sum(seg.get('length', 0) for seg in segments if seg.get('type') == 'IBD1')
    ibd2_length = sum(seg.get('length', 0) for seg in segments if seg.get('type') == 'IBD2')
    genome_length = 3500.0

    # Calculate observed proportions
    # Ensure genome_length is positive to avoid division by zero
    prop_ibd1 = ibd1_length / genome_length if genome_length > 0 else 0
    prop_ibd2 = ibd2_length / genome_length if genome_length > 0 else 0
    # IBD0 is the remaining fraction of the genome
    prop_ibd0 = max(0.0, 1.0 - (prop_ibd1 + prop_ibd2))

    # Expected proportions for full siblings
    expected_ibd0 = 0.25
    expected_ibd1 = 0.50
    expected_ibd2 = 0.25

    # Calculate squared error from expected proportions
    error = ((prop_ibd0 - expected_ibd0) ** 2 +
             (prop_ibd1 - expected_ibd1) ** 2 +
             (prop_ibd2 - expected_ibd2) ** 2)

    # Convert error to a log-likelihood score (higher error = lower likelihood)
    # Scale the error to make differences more apparent
    return -20.0 * error # Larger negative value for larger error

# --- Helper functions for distant likelihood ---
def calculate_expected_segments(relatedness, min_cm=7):
    """Calculate expected number of IBD segments (>min_cm) (simplified)."""
    if relatedness <= 0 or relatedness > 1: return 0.1 # Handle invalid input
    # Avoid log(0) for unrelated
    if relatedness < 1e-9: return 0.1
    try:
        meioses = -math.log2(relatedness)
    except ValueError:
        return 0.1 # Should not happen with checks above

    genome_length_cm = 3500.0
    chromosomes = 22
    # Heuristic, not a rigorous derivation (it is especially rough for
    # meioses = 1 or 2); it only aims to show the expected count falling
    # with distance. exp(-m * min_cm / 100) is the probability that an
    # exponentially distributed segment (mean 100/m cM) exceeds min_cm.
    decay_factor = math.exp(-(meioses / 100.0) * min_cm)
    expected = (genome_length_cm / 100.0) * (meioses + chromosomes) * relatedness * decay_factor
    return max(0.1, expected) # Ensure lambda > 0 for Poisson

def calculate_expected_length(relatedness, min_cm=7):
    """Calculate expected total length of IBD segments (>min_cm) (simplified)."""
    if relatedness <= 0 or relatedness > 1: return 0.1 # Handle invalid input
    if relatedness < 1e-9: return 0.1
    try:
        meioses = -math.log2(relatedness)
    except ValueError:
        return 0.1

    genome_length_cm = 3500.0
    total_expected_uncond = genome_length_cm * relatedness
    # Correct for the minimum length threshold. If segment lengths are
    # exponential with rate m/100 per cM, the fraction of total length carried
    # by segments longer than min_cm is exp(-m*min_cm/100) * (1 + m*min_cm/100).
    decay_factor = math.exp(-(meioses / 100.0) * min_cm)  # P(segment > min_cm)
    length_increase_factor = 1 + (meioses / 100.0) * min_cm  # longer mean length above the threshold
    expected = total_expected_uncond * decay_factor * length_increase_factor
    return max(0.1, expected) # Ensure positive length

def distant_relationship_likelihood(segments, meioses):
    """Calculate likelihood for distant relations using Poisson/Normal approx."""
    if meioses <= 0: # Invalid input
        return -float('inf')

    min_cm = 7
    # Ensure segments is iterable and dicts have keys
    if not hasattr(segments, '__iter__'): segments = []
    filtered_segments = [seg for seg in segments
                         if isinstance(seg, dict) and seg.get('type') == 'IBD1' and seg.get('length', 0) >= min_cm]

    segment_count = len(filtered_segments)
    total_length = sum(seg['length'] for seg in filtered_segments)

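    # Relatedness for a single common ancestor separated by `meioses` meioses.
    # Pairs with two common ancestors (e.g., full first cousins) have twice
    # this value, so this simplified model understates their expected sharing.
    relatedness = 2.0 ** (-meioses)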

    # Get expected values using helper functions
    expected_count = calculate_expected_segments(relatedness, min_cm)
    expected_length = calculate_expected_length(relatedness, min_cm)

    # --- Calculate Poisson log-likelihood for count ---
    # Handle edge cases for poisson.logpmf(k, mu)
    # If mu=0, logpmf is 0 only if k=0, else -inf.
    # If mu>0, calculation is standard.
    count_log_like = 0.0
    if expected_count < 1e-9: # Treat as zero
        count_log_like = 0.0 if segment_count == 0 else -float('inf')
    else:
        # poisson.logpmf requires k>=0 integer, mu>=0
        if segment_count >= 0:
             try:
                 count_log_like = poisson.logpmf(segment_count, expected_count)
             except (ValueError, TypeError): # k might not be integer if data is weird
                 count_log_like = -float('inf')
        else:
             count_log_like = -float('inf') # Negative count is impossible

    # --- Calculate Normal log-likelihood for total length ---
    length_log_like = 0.0
    # Only calculate if there are segments observed AND expected
    # Also need expected_length > 0 and expected_count > 0 for scale calculation
    if segment_count > 0 and expected_count > 1e-9 and expected_length > 1e-9:
        # Calculate scale (std dev) - proportional to expected_length / sqrt(expected_count)
        # Add epsilon to avoid division by zero if expected_count is tiny but positive
        scale = expected_length / math.sqrt(expected_count + 1e-9)
        if scale > 1e-9: # Ensure scale is reasonably positive
            try:
                length_log_like = norm.logpdf(total_length, loc=expected_length, scale=scale)
            except (ValueError, TypeError):
                length_log_like = -float('inf')
        else: # If scale is effectively zero, treat as delta function
             length_log_like = 0.0 if abs(total_length - expected_length) < 1e-6 else -float('inf')
    elif segment_count == 0 and expected_count < 1e-9:
        # If no segments observed and none expected, this is consistent.
        # Length component likelihood is neutral (log(1)=0) or based on P(Count=0)
        # which is already handled by count_log_like. So set length_log_like = 0.
        length_log_like = 0.0
    elif segment_count > 0 and expected_count < 1e-9:
        # Observed segments but none expected - this is inconsistent via count_log_like already
        length_log_like = 0.0 # Let count_log_like dominate
    elif segment_count == 0 and expected_count > 1e-9:
         # No segments observed, but some expected. Count likelihood handles this.
         length_log_like = 0.0 # Let count_log_like dominate

    # --- Combine likelihoods (weighted) ---
    # Check for NaN/Inf before weighting to avoid issues like inf * 0
    if np.isinf(count_log_like) or np.isnan(count_log_like):
        w_count_ll = -10000.0 # Assign large penalty if count is impossible/undefined
    else:
        w_count_ll = count_log_like * 0.7

    if np.isinf(length_log_like) or np.isnan(length_log_like):
        w_length_ll = -10000.0 # Assign large penalty if length is impossible/undefined
    else:
        w_length_ll = length_log_like * 0.3

    # Final combined log-likelihood
    final_ll = w_count_ll + w_length_ll

    # Ensure it doesn't become positive infinity if both somehow were +inf
    if np.isinf(final_ll) and final_ll > 0:
        final_ll = -float('inf')

    # Handle case where both components are valid but sum is NaN (shouldn't happen with checks)
    if np.isnan(final_ll):
        final_ll = -float('inf')

    return final_ll

# --- Demonstration using Synthetic Data ---
# (Assumes the cell that creates 'synthetic_data' has already been run)

print("\n--- Demonstrating Simplified Likelihood Functions ---")

# Check if synthetic_data exists in the current scope
if 'synthetic_data' in locals() and isinstance(synthetic_data, dict):
    # Get example data, providing empty list if key is missing
    pc_data = synthetic_data.get('parent-child', [])
    sib_data = synthetic_data.get('siblings', [])
    fc_data = synthetic_data.get('first-cousins', [])
    sc_data = synthetic_data.get('second-cousins', [])

    # Calculate likelihoods for matching data/models
    ll_pc_as_pc = parent_child_likelihood(pc_data)
    ll_sib_as_sib = sibling_likelihood(sib_data)
    ll_fc_as_fc = distant_relationship_likelihood(fc_data, 4) # 4 meioses for 1st cousins
    ll_sc_as_sc = distant_relationship_likelihood(sc_data, 6) # 6 meioses for 2nd cousins

    print(f"LogLik(Parent-Child Data | Parent-Child Model) : {ll_pc_as_pc:.2f}")
    print(f"LogLik(Sibling Data       | Sibling Model)        : {ll_sib_as_sib:.2f}")
    print(f"LogLik(1st Cousin Data    | Distant Model, m=4)   : {ll_fc_as_fc:.2f}")
    print(f"LogLik(2nd Cousin Data    | Distant Model, m=6)   : {ll_sc_as_sc:.2f}")

    # Calculate likelihoods for mismatching data/models
    ll_pc_as_sib = sibling_likelihood(pc_data)
    ll_sib_as_pc = parent_child_likelihood(sib_data)
    ll_sib_as_fc = distant_relationship_likelihood(sib_data, 4)
    ll_fc_as_sib = sibling_likelihood(fc_data)

    print(f"\n--- Examples of Mismatches ---")
    print(f"LogLik(Parent-Child Data | Sibling Model)        : {ll_pc_as_sib:.2f} (Expected lower than {ll_pc_as_pc:.2f})")
    print(f"LogLik(Sibling Data       | Parent-Child Model)   : {ll_sib_as_pc:.2f} (Expected lower than {ll_sib_as_sib:.2f})")
    print(f"LogLik(Sibling Data       | Distant Model, m=4)   : {ll_sib_as_fc:.2f} (Expected lower than {ll_sib_as_sib:.2f})")
    print(f"LogLik(1st Cousin Data    | Sibling Model)        : {ll_fc_as_sib:.2f} (Expected lower than {ll_fc_as_fc:.2f})")


    print("\nNote: Higher (less negative) log-likelihood indicates a better fit between")
    print("the data and the assumed relationship model.")
    print("These functions use simplified models for demonstration purposes.")
    print("Real likelihood models are more complex and often empirically derived.")

else:
    print("Error: 'synthetic_data' dictionary not found.")
    print("Please ensure the cell that generates 'synthetic_data' has been run successfully.")

### IBD Moments Model

A key innovation in Bonsai is the use of "IBD moments" to summarize the IBD sharing between individuals. Let's implement a function to calculate these moments from IBD segment data.

In [None]:
def calculate_ibd_moments(segment_list, min_length=7):
    """Calculate IBD moments from a list of segments.
    
    Args:
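        segment_list: List of IBD segment dicts, each with a 'length' key (cM)
        min_length: Minimum segment length (cM) to consider
    
    Returns:
        Dictionary with "first_moment" (segment count), "second_moment"
        (total length), and "third_moment" (sum of squared lengths).
        Note: these labels follow this notebook's convention; as power sums
        of the segment lengths they correspond to orders 0, 1, and 2.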
    """
    # Filter segments by minimum length
    filtered_segments = [seg for seg in segment_list if seg['length'] >= min_length]
    
    # First moment: number of segments
    first_moment = len(filtered_segments)
    
    # Second moment: total length of segments
    second_moment = sum(seg['length'] for seg in filtered_segments)
    
    # Third moment (optional): sum of squared lengths
    third_moment = sum(seg['length']**2 for seg in filtered_segments)
    
    return {
        "first_moment": first_moment,
        "second_moment": second_moment,
        "third_moment": third_moment
    }

# Calculate moments for each relationship type
moments_results = []

for rel_type, data in synthetic_data.items():
    moments = calculate_ibd_moments(data)
    moments_results.append({
        'Relationship': rel_type,
        'Segment Count': moments['first_moment'],
        'Total Length (cM)': moments['second_moment'],
        'Mean Segment Length': moments['second_moment'] / max(1, moments['first_moment'])
    })

moments_df = pd.DataFrame(moments_results)
moments_df

In [None]:
# Visualize the moments by relationship type
plt.figure(figsize=(12, 6))

# Sort by total length for better visualization
moments_df = moments_df.sort_values('Total Length (cM)', ascending=False)

# Plot segment count and total length
ax = plt.subplot(1, 2, 1)
moments_df.plot(x='Relationship', y='Segment Count', kind='bar', ax=ax)
plt.title('First Moment: Segment Count')
plt.ylabel('Number of Segments (>= 7 cM)')
plt.xticks(rotation=45)

ax = plt.subplot(1, 2, 2)
moments_df.plot(x='Relationship', y='Total Length (cM)', kind='bar', ax=ax)
plt.title('Second Moment: Total IBD Length')
plt.ylabel('Total Length (cM)')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## 3. The Up-Node Dictionary: Encoding Pedigree Structures

A key data structure in Bonsai is the "up-node dictionary," which encodes the pedigree structure in a way that facilitates efficient likelihood calculations and structural modifications.
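
To make the encoding concrete, here is what such a dictionary might look like for a nuclear family, written out by hand. The layout `{child: {parent: 1}}` mirrors what the helper functions in this section build; reading the value `1` as "one generation up" is an assumption made for this sketch, and `invert_up_nodes` is a hypothetical helper, not part of Bonsai.

```python
# Two founders (1001, 1002) and their two children (1003, 1004)
up_nodes = {
    1001: {},                    # no recorded parents: a founder
    1002: {},
    1003: {1001: 1, 1002: 1},    # child -> {parent: generations up}
    1004: {1001: 1, 1002: 1},
}

def invert_up_nodes(up_nodes):
    """Build the reverse mapping {parent: {child: 1}} from an up-node dictionary."""
    down = {node: {} for node in up_nodes}
    for child, parents in up_nodes.items():
        for parent, degree in parents.items():
            down.setdefault(parent, {})[child] = degree
    return down

down_nodes = invert_up_nodes(up_nodes)
founders = [node for node, parents in up_nodes.items() if not parents]

print("Children of 1001:", sorted(down_nodes[1001]))
print("Founders:", founders)
```

Storing only upward links keeps structural edits local: attaching or detaching a parent changes a single entry, and the downward view can be rebuilt whenever it is needed.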

In [None]:
def create_empty_up_node_dict(individuals):
    """Create an empty up-node dictionary for a set of individuals.
    
    Args:
        individuals: List of individual IDs
        
    Returns:
        Empty up-node dictionary
    """
    up_node_dict = {}
    for ind in individuals:
        up_node_dict[ind] = {}  # Empty dictionary indicates no parents
    return up_node_dict

def add_relationship(up_node_dict, child, parent1, parent2=None):
    """Add a parent-child relationship to the up-node dictionary.
    
    Args:
        up_node_dict: The up-node dictionary to modify
        child: ID of the child
        parent1: ID of the first parent
        parent2: ID of the second parent (optional)
        
    Returns:
        Modified up-node dictionary
    """
    if child not in up_node_dict:
        up_node_dict[child] = {}
    
    # Add first parent
    up_node_dict[child][parent1] = 1
    
    # Add second parent if provided
    if parent2 is not None:
        up_node_dict[child][parent2] = 1
    
    # Make sure parents exist in the dictionary
    if parent1 not in up_node_dict:
        up_node_dict[parent1] = {}
    if parent2 is not None and parent2 not in up_node_dict:
        up_node_dict[parent2] = {}
    
    return up_node_dict

def visualize_pedigree(up_node_dict):
    """Create a visualization of the pedigree from an up-node dictionary.
    
    Args:
        up_node_dict: Up-node dictionary representing the pedigree
    """
    # Create a directed graph
    G = nx.DiGraph()
    
    # Add nodes and edges
    for child, parents in up_node_dict.items():
        G.add_node(child)
        for parent in parents:
            G.add_node(parent)
            G.add_edge(parent, child)  # Direction from parent to child
    
    # Set node colors: light green for positive integer IDs (real individuals),
    # light gray for all other nodes (e.g., latent ancestors with negative IDs)
    node_colors = ['lightgreen' if isinstance(node, int) and node > 0 else 'lightgray' 
                   for node in G.nodes()]
    
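    # Calculate layout: use Graphviz 'dot' when pygraphviz is installed,
    # otherwise fall back to a spring layout
    try:
        pos = nx.nx_agraph.graphviz_layout(G, prog='dot')
    except ImportError:
        pos = nx.spring_layout(G)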
    
    # Draw the graph
    plt.figure(figsize=(12, 8))
    nx.draw(G, pos, with_labels=True, node_color=node_colors, 
            node_size=700, font_size=10, arrows=True)
    plt.title('Pedigree Structure')
    plt.show()

In [None]:
# Create a simple example pedigree
individuals = [1000, 1001, 1002, 1003, 1004, 1005]
up_node_dict = create_empty_up_node_dict(individuals)

# Add relationships
up_node_dict = add_relationship(up_node_dict, 1003, 1001, 1002)  # 1003 has parents 1001 and 1002
up_node_dict = add_relationship(up_node_dict, 1004, 1001, 1002)  # 1004 has the same parents
up_node_dict = add_relationship(up_node_dict, 1005, 1000, 1003)  # 1005 has parents 1000 and 1003

# Visualize the pedigree
visualize_pedigree(up_node_dict)

### Calculating Genetic Relationships Using the Up-Node Dictionary

One of the key operations in Bonsai is calculating the genetic relationship coefficient between individuals based on the pedigree structure. Let's implement this calculation using the up-node dictionary.
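
As a worked check of the path-counting idea, consider a pedigree without inbreeding. The notation here is introduced for illustration: $n_i$ and $n_j$ are the numbers of meioses from individuals $i$ and $j$ up to a common ancestor $A$, and the sum runs over every common ancestor and every pair of paths that meet only at $A$:

$$r(i, j) = \sum_{A} \sum_{\text{paths}} \left(\tfrac{1}{2}\right)^{n_i + n_j}$$

In the example pedigree above, 1003 and 1004 share both parents (1001 and 1002), each one meiosis away from either sibling:

$$r(1003, 1004) = 2 \times \left(\tfrac{1}{2}\right)^{1 + 1} = \tfrac{1}{2}$$

Individual 1005 is a child of 1003, so 1004 and 1005 are related through the same two ancestors, with one meiosis on the 1004 side and two on the 1005 side:

$$r(1004, 1005) = 2 \times \left(\tfrac{1}{2}\right)^{1 + 2} = \tfrac{1}{4}$$

The same reasoning gives $2 \times (1/2)^{4} = 1/8$ for full first cousins, who share two grandparents at two meioses each.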

In [None]:
def get_genetic_paths(up_node_dict, individual, path=None, paths=None, ancestor=None):
    """Find all paths from an individual to their ancestors.
    
    Args:
        up_node_dict: Up-node dictionary representing the pedigree
        individual: ID of the individual to trace
        path: Current path being explored (for recursion)
        paths: Dictionary of collected paths (for recursion)
        ancestor: Current ancestor being considered (for recursion)
        
    Returns:
        Dictionary mapping ancestor IDs to lists of paths
    """
    if path is None:
        path = []
    if paths is None:
        paths = {individual: [[]]}  # Start with self path
    
    # If this individual has no parents, return current paths
    if individual not in up_node_dict or not up_node_dict[individual]:
        return paths
    
    # Process each parent
    for parent in up_node_dict[individual]:
        # Create a new path for this parent
        new_path = path + [parent]
        
        # Add this path to the parent's paths
        if parent not in paths:
            paths[parent] = []
        paths[parent].append(new_path)
        
        # Recursively process this parent's ancestors
        get_genetic_paths(up_node_dict, parent, new_path, paths, parent)
    
    return paths

def calculate_relationship_coefficient(up_node_dict, id1, id2):
    """Calculate the relationship coefficient between two individuals.
    
    Args:
        up_node_dict: Up-node dictionary representing the pedigree
        id1: ID of the first individual
        id2: ID of the second individual
        
    Returns:
        Relationship coefficient (proportion of shared genetic material)
    """
    if id1 == id2:
        return 1.0  # Self-relationship is 1.0
    
    # Direct parent-child relationship check
    if id1 in up_node_dict.get(id2, {}) or id2 in up_node_dict.get(id1, {}):
        return 0.5  # Parent-child share 50%
    
    # Get genetic paths to ancestors for each individual
    paths1 = get_genetic_paths(up_node_dict, id1)
    paths2 = get_genetic_paths(up_node_dict, id2)
    
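    # Find common ancestors and sum the contribution of every connecting path
    relatedness = 0.0
    common_ancestors = set(paths1.keys()) & set(paths2.keys())
    
    for ancestor in common_ancestors:
        # The ancestor may be id1 or id2 itself (e.g., grandparent-grandchild);
        # the empty self-path then has length 0 on that side
        for path1 in paths1[ancestor]:
            for path2 in paths2[ancestor]:
                # Skip path pairs that meet below this ancestor; they are already
                # counted through the more recent shared individual
                side1 = ({id1} | set(path1)) - {ancestor}
                side2 = ({id2} | set(path2)) - {ancestor}
                if side1 & side2:
                    continue
                # Each connecting path contributes 0.5^(total number of parent-child steps)
                relatedness += 0.5 ** (len(path1) + len(path2))
    
    return relatedness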

In [None]:
# Calculate and display relationship coefficients for all pairs
relationship_results = []

for id1 in individuals:
    for id2 in individuals:
        if id1 < id2:  # Avoid duplicates and self-relationships
            coef = calculate_relationship_coefficient(up_node_dict, id1, id2)
            relationship_name = "Unknown"
            
            # Map coefficient to relationship name. Several relationship types share
            # the same coefficient, so each label lists the possibilities.
            if math.isclose(coef, 0.5):
                relationship_name = "Parent-Child or Full Siblings"
            elif math.isclose(coef, 0.25):
                relationship_name = "Grandparent, Half-Sibling, or Aunt/Uncle"
            elif math.isclose(coef, 0.125):
                relationship_name = "First Cousin or Great-Grandparent"
            
            relationship_results.append({
                'Individual 1': id1,
                'Individual 2': id2,
                'Relationship Coefficient': coef,
                'Relationship': relationship_name
            })

rel_df = pd.DataFrame(relationship_results)
rel_df

## 4. Optimization Algorithms in Bonsai

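Bonsai searches the space of possible pedigrees for the structure that maximizes the likelihood of the observed IBD data. Let's implement a simplified version of this search: a likelihood function that scores a candidate pedigree, a proposal function that makes a small random change, and a simulated annealing loop that decides which changes to keep.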
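
The accept-or-reject step at the heart of simulated annealing can be sketched in isolation. The helper name `acceptance_probability` and the example log-likelihood values are hypothetical, chosen only for illustration; this is a minimal sketch of the Metropolis-style rule, not Bonsai's actual search procedure.

```python
import math

def acceptance_probability(current_ll, proposed_ll, temperature):
    """Metropolis-style acceptance probability when maximizing a log-likelihood.

    Hypothetical helper for illustration only.
    """
    if proposed_ll >= current_ll:
        return 1.0  # Improvements (or ties) are always accepted
    # Worse proposals are accepted with a probability that shrinks as the
    # drop in log-likelihood grows or as the temperature falls
    return math.exp((proposed_ll - current_ll) / temperature)

# The same drop of 2 log-likelihood units becomes less acceptable as the search cools
for temperature in (1.0, 0.5, 0.1):
    p = acceptance_probability(-10.0, -12.0, temperature)
    print(f"T={temperature}: accept with probability {p:.4f}")
```

Early in the search, a high temperature lets the pedigree move through worse configurations and escape local optima; as the temperature decays, the search behaves more and more like greedy hill climbing.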

In [None]:
def calculate_pedigree_likelihood(up_node_dict, ibd_segments, min_cm=7):
    """Calculate the likelihood of a pedigree given IBD segment data.
    
    Args:
        up_node_dict: Up-node dictionary representing the pedigree
        ibd_segments: Dictionary mapping pairs of individuals to their IBD segments
        min_cm: Minimum segment length to consider
        
    Returns:
        Log-likelihood of the pedigree
    """
    # This is a simplified placeholder implementation
    total_log_likelihood = 0.0
    
    # Process each pair of individuals
    for (id1, id2), segments in ibd_segments.items():
        # Skip if either individual is not in the pedigree
        if id1 not in up_node_dict or id2 not in up_node_dict:
            continue
            
        # Calculate expected relationship coefficient
        expected_coef = calculate_relationship_coefficient(up_node_dict, id1, id2)
        
        # Calculate observed moments
        moments = calculate_ibd_moments(segments, min_cm)
        
        # Skip pairs with no IBD sharing above threshold
        if moments['first_moment'] == 0:
            continue
            
        # Calculate expected moments
        expected_count = calculate_expected_segments(expected_coef, min_cm)
        expected_length = calculate_expected_length(expected_coef, min_cm)
        
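        # Calculate likelihood using Poisson model for segment count.
        # The rate is floored at a tiny positive value so that observed sharing between
        # a pair the pedigree treats as unrelated is penalized rather than scored as 0
        count_log_like = poisson.logpmf(moments['first_moment'], max(expected_count, 1e-6))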
        
        # Use a normal approximation for total length
        length_log_like = 0
        if expected_count > 0 and moments['first_moment'] > 0:
            length_log_like = norm.logpdf(moments['second_moment'], 
                                         expected_length, 
                                         expected_length / math.sqrt(expected_count))
        
        # Combine the two terms with fixed heuristic weights (0.7 for count, 0.3 for length)
        pair_log_like = count_log_like * 0.7 + length_log_like * 0.3
        total_log_likelihood += pair_log_like
    
    return total_log_likelihood

def propose_pedigree_modification(up_node_dict, ids=None):
    """Propose a modification to the pedigree structure.
    
    Args:
        up_node_dict: Current up-node dictionary
        ids: List of individual IDs to consider (if None, uses all IDs)
        
    Returns:
        Modified up-node dictionary
    """
    # Create a deep copy to avoid modifying the original
    new_dict = {}
    for ind, parents in up_node_dict.items():
        new_dict[ind] = parents.copy()
    
    # If no IDs provided, use all real (positive-ID) individuals in the dictionary
    if ids is None:
        ids = [node for node in up_node_dict.keys() if isinstance(node, int) and node > 0]
    
    # Choose a random individual
    if not ids:
        return new_dict  # No individuals to modify
        
    ind = random.choice(ids)
    
    # Choose a modification type
    mod_type = random.choice(['add_parent', 'remove_parent', 'swap_parent'])
    
    if mod_type == 'add_parent':
        # Add a parent to the individual
        if len(new_dict[ind]) < 2:  # Can only add if fewer than 2 parents
            # Create a new latent parent (negative ID)
            new_parent = -random.randint(1, 1000)
            while new_parent in new_dict:  # Ensure unique ID
                new_parent = -random.randint(1, 1000)
                
            # Add the parent
            new_dict[ind][new_parent] = 1
            new_dict[new_parent] = {}  # Initialize parent with no ancestors
    
    elif mod_type == 'remove_parent':
        # Remove a parent if any exist
        if new_dict[ind]:
            parent = random.choice(list(new_dict[ind].keys()))
            del new_dict[ind][parent]
    
    elif mod_type == 'swap_parent':
        # Replace a parent with a new latent parent
        if new_dict[ind]:
            parent = random.choice(list(new_dict[ind].keys()))
            
            # Create a new latent parent
            new_parent = -random.randint(1, 1000)
            while new_parent in new_dict:  # Ensure unique ID
                new_parent = -random.randint(1, 1000)
                
            # Replace the parent
            del new_dict[ind][parent]
            new_dict[ind][new_parent] = 1
            new_dict[new_parent] = {}  # Initialize parent with no ancestors
    
    return new_dict

def build_pedigree_with_optimization(individuals, ibd_segments, min_cm=7):
    """Build a pedigree with a simplified simulated annealing search.
    
    Args:
        individuals: List of individual IDs
        ibd_segments: Dictionary mapping pairs of individuals to their IBD segments
        min_cm: Minimum segment length to consider
        
    Returns:
        Tuple of (best pedigree, best log-likelihood)
    """
    # Initialize with empty pedigree
    pedigree = create_empty_up_node_dict(individuals)
    
    # Calculate initial likelihood
    current_likelihood = calculate_pedigree_likelihood(pedigree, ibd_segments, min_cm)
    best_pedigree = {k: v.copy() for k, v in pedigree.items()}
    best_likelihood = current_likelihood
    
    # Optimization parameters
    temperature = 1.0
    cooling_rate = 0.99
    iterations = 100  # Reduced for demonstration
    
    # Track progress
    likelihoods = [current_likelihood]
    
    for i in range(iterations):
        # Propose a modification to the pedigree
        new_pedigree = propose_pedigree_modification(pedigree)
        
        # Calculate new likelihood
        new_likelihood = calculate_pedigree_likelihood(new_pedigree, ibd_segments, min_cm)
        
        # Accept or reject based on likelihood and temperature
        if new_likelihood > current_likelihood:
            # Always accept improvements
            accept = True
        else:
            # Sometimes accept worse solutions based on temperature
            delta = new_likelihood - current_likelihood
            accept_probability = math.exp(delta / temperature)
            accept = random.random() < accept_probability
        
        if accept:
            pedigree = new_pedigree
            current_likelihood = new_likelihood
            
            # Update best pedigree if improved
            if current_likelihood > best_likelihood:
                best_pedigree = {k: v.copy() for k, v in pedigree.items()}
                best_likelihood = current_likelihood
        
        # Cool the temperature
        temperature *= cooling_rate
        
        # Track progress
        likelihoods.append(current_likelihood)
        
        # Occasionally print progress
        if (i + 1) % 10 == 0:
            print(f"Iteration {i+1}: Current log-likelihood = {current_likelihood:.2f}, Best = {best_likelihood:.2f}")
    
    # Plot optimization progress
    plt.figure(figsize=(10, 5))
    plt.plot(likelihoods)
    plt.title('Optimization Progress')
    plt.xlabel('Iteration')
    plt.ylabel('Log Likelihood')
    plt.grid(True, alpha=0.3)
    plt.show()
    
    return best_pedigree, best_likelihood

### Preparing IBD Data for Optimization
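
The optimizer takes a dictionary keyed by `(id1, id2)` tuples, where each value is a list of segment dictionaries. The sketch below shows that shape with hand-written values. Only the `'length'` key (in cM) is relied on by the display code in this section; the other keys (`'chromosome'`, `'start'`, `'end'`), the helper `summarize_pair`, and the `>=` threshold comparison are assumptions made for illustration.

```python
# Hypothetical hand-written segments; only 'length' (cM) is relied on
example_ibd = {
    (1, 2): [
        {'chromosome': 1, 'start': 10.0, 'end': 45.5, 'length': 35.5},
        {'chromosome': 3, 'start': 0.0, 'end': 12.0, 'length': 12.0},
        {'chromosome': 7, 'start': 80.0, 'end': 85.0, 'length': 5.0},
    ],
}

def summarize_pair(segments, min_cm=7):
    """Return (segment count, total length in cM) for segments at or above min_cm."""
    kept = [seg['length'] for seg in segments if seg['length'] >= min_cm]
    return len(kept), sum(kept)

count, total = summarize_pair(example_ibd[(1, 2)])
print(f"{count} segments above threshold, {total:.1f} cM total")
```

The 5 cM segment falls below the 7 cM threshold, so only two segments contribute to the summary.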

In [None]:
# Convert synthetic data into the format expected by optimization
# In this case, a dictionary mapping (id1, id2) to list of segments

# First, generate synthetic IBD for pairs that are consistent with the example pedigree above
# (1003 and 1004 are children of 1001 and 1002; 1005 is the child of 1000 and 1003).
# 'half-siblings' stands in for grandparent-grandchild, which has the same expected sharing (25%).
synthetic_relationships = {
    (1001, 1003): generate_synthetic_ibd('parent-child'),
    (1002, 1003): generate_synthetic_ibd('parent-child'),
    (1001, 1004): generate_synthetic_ibd('parent-child'),
    (1002, 1004): generate_synthetic_ibd('parent-child'),
    (1003, 1004): generate_synthetic_ibd('siblings'),
    (1000, 1005): generate_synthetic_ibd('parent-child'),
    (1003, 1005): generate_synthetic_ibd('parent-child'),
    (1002, 1005): generate_synthetic_ibd('half-siblings')  # grandparent-grandchild proxy
}

# Display the data structure
for (id1, id2), segments in list(synthetic_relationships.items())[:2]:  # Show first two for brevity
    print(f"Relationship between {id1} and {id2}:")
    print(f"  Number of segments: {len(segments)}")
    print(f"  Total IBD length: {sum(seg['length'] for seg in segments):.1f} cM")
    print(f"  First few segments: {segments[:2]}")
    print()

In [None]:
# Run the optimization to reconstruct the pedigree
# Note: This is a simplified demonstration; actual Bonsai optimization is more complex
import signal

class TimeoutException(Exception):
    pass

def timeout_handler(signum, frame):
    raise TimeoutException("Timed out!")

# SIGALRM is only available on Unix-like systems; elsewhere (e.g., Windows or
# JupyterLite) the optimization simply runs without a timeout
use_alarm = hasattr(signal, "SIGALRM")

try:
    if use_alarm:
        signal.signal(signal.SIGALRM, timeout_handler)
        signal.alarm(300)  # 5 minute timeout

    # Run the optimization
    inferred_pedigree, final_likelihood = build_pedigree_with_optimization(
        individuals, synthetic_relationships, min_cm=7
    )

    # Visualize the inferred pedigree
    print("\nInferred Pedigree:")
    visualize_pedigree(inferred_pedigree)

    # Compare with the true pedigree
    print("\nTrue Pedigree:")
    visualize_pedigree(up_node_dict)

except TimeoutException:
    print("Optimization timed out. This is expected in the notebook demonstration.")
    print("For a full analysis, consider running the optimization with more carefully selected parameters.")
except Exception as e:
    print(f"Error during optimization: {e}")
finally:
    if use_alarm:
        signal.alarm(0)  # Cancel any pending alarm, even after an error

## 5. Mathematical Extensions and Improvements

Let's explore some mathematical extensions that can improve Bonsai's performance, such as handling age constraints and incorporating additional relationship information.
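
A hard age constraint is easier to debug when the offending pairs are reported instead of a single pass/fail flag. The sketch below uses a hypothetical helper, `find_age_violations`, together with a made-up toy pedigree and ages. It assumes the up-node dictionary shape used in this lab (`{child: {parent: 1}}`) and treats an age gap of exactly `min_parent_age` as valid.

```python
def find_age_violations(up_node_dict, ages, min_parent_age=12):
    """Return (parent, child, age_gap) for each pair that breaks the minimum gap.

    Hypothetical helper for illustration only.
    """
    violations = []
    for child, parents in up_node_dict.items():
        for parent in parents:
            if child not in ages or parent not in ages:
                continue  # Latent or un-aged individuals cannot be checked
            gap = ages[parent] - ages[child]
            if gap < min_parent_age:
                violations.append((parent, child, gap))
    return violations

# Toy lineage 1 -> 2 -> 3 -> 4, where 2 is only 5 years older than 3
toy_pedigree = {1: {}, 2: {1: 1}, 3: {2: 1}, 4: {3: 1}}
toy_ages = {1: 80, 2: 50, 3: 45, 4: 20}
print(find_age_violations(toy_pedigree, toy_ages))  # [(2, 3, 5)]
```

An empty list means the pedigree passes; any entry identifies a specific parent-child edge that the optimizer should not be allowed to propose.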

In [None]:
def incorporate_age_constraints(up_node_dict, ages, min_parent_age=12):
    """Check if a pedigree satisfies age constraints.
    
    Args:
        up_node_dict: Up-node dictionary representing the pedigree
        ages: Dictionary mapping individual IDs to ages
        min_parent_age: Minimum age difference between parent and child
        
    Returns:
        True if all constraints are satisfied, False otherwise
    """
    for child, parents in up_node_dict.items():
        if child < 0 or not parents:  # Skip inferred individuals or those without parents
            continue
            
        child_age = ages.get(child)
        if child_age is None:
            continue
            
        for parent in parents:
            if parent < 0:  # Skip inferred parents
                continue
                
            parent_age = ages.get(parent)
            if parent_age is None:
                continue
                
            # The parent must be older than the child by at least min_parent_age years
            if parent_age - child_age < min_parent_age:
                return False  # Age constraint violated
    
    return True  # All constraints satisfied

def calculate_pedigree_likelihood_with_constraints(up_node_dict, ibd_segments, ages=None, min_cm=7):
    """Calculate pedigree likelihood with additional constraints.
    
    Args:
        up_node_dict: Up-node dictionary representing the pedigree
        ibd_segments: Dictionary mapping pairs of individuals to their IBD segments
        ages: Dictionary mapping individual IDs to ages (optional)
        min_cm: Minimum segment length to consider
        
    Returns:
        Log-likelihood of the pedigree
    """
    # Check age constraints if ages provided
    if ages is not None and not incorporate_age_constraints(up_node_dict, ages):
        return float('-inf')  # Invalid pedigree due to age constraints
    
    # Otherwise, calculate likelihood as before
    return calculate_pedigree_likelihood(up_node_dict, ibd_segments, min_cm)

In [None]:
# Example of using age constraints
# Assign ages to individuals
ages = {
    1000: 70,
    1001: 65,
    1002: 68,
    1003: 40,
    1004: 38,
    1005: 15
}

# Check if our example pedigree satisfies age constraints
age_valid = incorporate_age_constraints(up_node_dict, ages)
print(f"Pedigree satisfies age constraints: {age_valid}")

# Create an invalid pedigree for demonstration
invalid_pedigree = create_empty_up_node_dict(individuals)
invalid_pedigree = add_relationship(invalid_pedigree, 1001, 1003)  # Invalid: 1003 (age 40) is younger than its supposed child 1001 (age 65)

# Check if the invalid pedigree satisfies age constraints
age_valid = incorporate_age_constraints(invalid_pedigree, ages)
print(f"Invalid pedigree satisfies age constraints: {age_valid}")

## Conclusion

In this lab, we explored the mathematical foundations of the Bonsai algorithm for pedigree reconstruction. We implemented key components of the algorithm, including likelihood functions, the up-node dictionary data structure, and optimization techniques. We also examined how additional constraints, such as age information, can be incorporated to improve reconstruction accuracy.

Key takeaways:
- Bonsai uses a likelihood-based framework to find the most likely pedigree given observed IBD segment patterns
- Different relationship types have characteristic likelihood models based on theoretical expectations
- The up-node dictionary provides an efficient representation of pedigree structures
- Optimization algorithms like simulated annealing help search the vast space of possible pedigrees
- Additional constraints and information can be incorporated to improve reconstruction accuracy

In the next lab, we will explore the data structures used in Bonsai in more detail, focusing on how they enable efficient pedigree manipulation and analysis.