# Lab 16: Bonsai Architecture and Implementation

In this lab, we will explore the architectural design and implementation details of the Bonsai algorithm. Building upon our understanding of IBD segments and pedigree reconstruction concepts from previous labs, we'll examine how Bonsai's modular components work together to build accurate pedigree structures from genetic data.

## Why This Matters

Understanding the architecture of Bonsai is crucial for computational genetic genealogists who want to:
- Modify the algorithm for specific research needs
- Optimize performance for large-scale pedigree reconstruction
- Extend the algorithm with new features or relationship models
- Debug issues in pedigree reconstruction
- Implement similar algorithms for related applications

**Learning Objectives**:
- Understand Bonsai's overall modular architecture
- Examine the core components and their interactions
- Analyze the implementation of key algorithms
- Explore optimization strategies for performance and memory usage
- Investigate extension mechanisms for customizing Bonsai
- Practice modifying components to address specific use cases

## Environment Setup

In [None]:
import os
from collections import Counter, defaultdict
import logging
import sys
from pathlib import Path
import subprocess
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
import IPython
import pandas as pd
import numpy as np
import networkx as nx
import json
import pygraphviz as pgv
import matplotlib.image as mpimg
import time
import random
from dotenv import load_dotenv

## 1. Bonsai's Overall Architecture

Before diving into specific components, let's understand the overall architecture of Bonsai. The algorithm follows a modular design pattern that separates concerns and allows for flexibility in implementation.

### 1.1 Architectural Overview

Bonsai's architecture consists of several interconnected modules, each with a specific responsibility:

1. **Input Processing Module**: Handles reading and validating input data (IBD segments, bioinfo)
2. **Relationship Model Module**: Defines genetic relationship patterns and likelihoods
3. **Up-Node Dictionary Module**: Represents the pedigree structure as a dictionary of nodes
4. **Optimization Engine**: Searches for the optimal pedigree structure
5. **Constraint Handler**: Ensures biological and logical constraints are met
6. **Output Generator**: Formats and returns the final pedigree structure

Let's visualize this architecture:

In [None]:
# Create a directed graph for the Bonsai architecture
G = nx.DiGraph()

# Add nodes for each component
components = [
    "Input Data", "Input Processing", "Relationship Model", 
    "Up-Node Dictionary", "Optimization Engine", "Constraint Handler",
    "Output Generator", "Final Pedigree"
]

for component in components:
    G.add_node(component)

# Add edges to show data flow
edges = [
    ("Input Data", "Input Processing"),
    ("Input Processing", "Relationship Model"),
    ("Input Processing", "Up-Node Dictionary"),
    ("Relationship Model", "Optimization Engine"),
    ("Up-Node Dictionary", "Optimization Engine"),
    ("Optimization Engine", "Constraint Handler"),
    ("Constraint Handler", "Optimization Engine"),
    ("Optimization Engine", "Up-Node Dictionary"),
    ("Up-Node Dictionary", "Output Generator"),
    ("Output Generator", "Final Pedigree")
]

G.add_edges_from(edges)

# Create a hierarchical layout
pos = nx.nx_agraph.graphviz_layout(G, prog="dot")

plt.figure(figsize=(12, 8))
nx.draw(G, pos, with_labels=True, node_color="lightblue", node_size=3000, 
        font_size=10, font_weight="bold", arrows=True, arrowsize=20)
plt.title("Bonsai Algorithm Architecture", fontsize=16)
plt.show()

### 1.2 Module Interactions

Let's explore how these modules interact with a simplified code example that traces the flow of data through the Bonsai algorithm:

In [None]:
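# Import Bonsai module
import os
import sys

# NOTE: utils_directory is not defined earlier in this notebook. The environment
# variable name and fallback path below are assumptions; point this at the
# directory that contains the bonsaitree package in your own setup.
utils_directory = os.getenv("PROJECT_UTILS_DIR", os.path.join(os.getcwd(), "utils"))
sys.path.append(utils_directory)
from bonsaitree.bonsaitree.v3 import bonsai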

# Examine the main entry point function
print("Bonsai's main entry point function:")
print(bonsai.build_pedigree.__doc__)

### 1.3 The Build Pedigree Function

The `build_pedigree` function is the main entry point for the Bonsai algorithm. Let's create a simplified version to understand its overall structure and flow:

In [None]:
def simplified_build_pedigree(bio_info, unphased_ibd_seg_list, min_seg_len=7):
    """Simplified version of Bonsai's build_pedigree function for educational purposes.
    
    Args:
        bio_info: List of dictionaries with information about individuals
        unphased_ibd_seg_list: List of IBD segments
        min_seg_len: Minimum segment length to consider
        
    Returns:
        List of (pedigree, log_likelihood) tuples
    """
    # 1. Input Processing
    print("1. Input Processing")
    print(f"  Processing {len(bio_info)} individuals")
    print(f"  Processing {len(unphased_ibd_seg_list)} IBD segments")
    print(f"  Using minimum segment length of {min_seg_len} cM")
    
    # 2. Initialize Up-Node Dictionary
    print("\n2. Initialize Up-Node Dictionary")
    up_dict = {}
    # In the real implementation, this would create initial nodes for each individual
    for i, person in enumerate(bio_info):
        genotype_id = person['genotype_id']
        up_dict[genotype_id] = {}
        if i < 3:  # Only print the first few for demonstration
            print(f"  Added individual {genotype_id} to up_dict")
    if len(bio_info) > 3:
        print("  ...")
    
    # 3. Initialize Relationship Model
    print("\n3. Initialize Relationship Model")
    # In the real implementation, this would create models for different relationship types
    print("  Initializing models for: Parent-Child, Siblings, Cousins, etc.")
    
    # 4. Run Optimization Engine
    print("\n4. Run Optimization Engine")
    print("  Starting pedigree search...")
    print("  Iteratively building and refining pedigree structure")
    print("  Applying constraints during optimization")
    
    # 5. Generate Output
    print("\n5. Generate Output")
    print("  Finalizing pedigree structure")
    print("  Calculating final log likelihood")
    
    # Return a simplified result
    mock_pedigree = {i: {} for i in range(1, 4)}  # Mock pedigree with 3 individuals
    mock_log_like = -100.5  # Mock log likelihood
    
    return [(mock_pedigree, mock_log_like)]

# Create some dummy data for demonstration
dummy_bio_info = [
    {'genotype_id': 1, 'age': 30, 'sex': 'M'},
    {'genotype_id': 2, 'age': 60, 'sex': 'F'},
    {'genotype_id': 3, 'age': 35, 'sex': 'F'},
    {'genotype_id': 4, 'age': 10, 'sex': 'M'}
]

dummy_ibd_seg_list = [
    [1, 2, '1', 1000000, 5000000, False, 10.5],
    [1, 3, '2', 2000000, 7000000, False, 15.2],
    [2, 3, '3', 3000000, 9000000, False, 8.7]
]

# Run the simplified function
result = simplified_build_pedigree(dummy_bio_info, dummy_ibd_seg_list)
print("\nResult:", result)

## 2. Core Components

Now that we understand the overall architecture, let's examine each of the core components in more detail.

### 2.1 Input Processing Module

The Input Processing Module is responsible for reading, validating, and preparing the input data for use by the rest of the algorithm. The two main inputs to Bonsai are:

1. **IBD Segments**: Detected identity-by-descent segments between pairs of individuals
2. **Bioinfo**: Biological information about individuals (age, sex, etc.)

Let's look at how Bonsai processes these inputs:

In [None]:
def process_ibd_segments(unphased_ibd_seg_list, min_seg_len=7):
    """Process and filter IBD segments.
    
    Args:
        unphased_ibd_seg_list: List of IBD segments in format [id1, id2, chr, start, end, is_full, length]
        min_seg_len: Minimum segment length in centiMorgans
        
    Returns:
        Dictionary mapping pairs of individuals to their shared IBD segments
    """
    # Create a dictionary to store segments between pairs
    ibd_dict = defaultdict(list)
    
    # Count of filtered segments
    total_segments = len(unphased_ibd_seg_list)
    filtered_segments = 0
    
    # Process each segment
    for segment in unphased_ibd_seg_list:
        id1, id2, chrom, start, end, is_full, length = segment
        
        # Filter segments by length
        if length < min_seg_len:
            filtered_segments += 1
            continue
        
        # Store the segment (ensure consistent ordering of pairs)
        pair = tuple(sorted([id1, id2]))
        ibd_dict[pair].append({
            'chromosome': chrom,
            'start': start,
            'end': end,
            'is_ibd2': is_full,
            'length_cm': length
        })
    
    print(f"Processed {total_segments} segments")
    print(f"Filtered out {filtered_segments} segments below {min_seg_len} cM")
    print(f"Retained {total_segments - filtered_segments} segments")
    print(f"Found IBD sharing between {len(ibd_dict)} pairs of individuals")
    
    return ibd_dict

def process_bioinfo(bio_info):
    """Process biological information for individuals.
    
    Args:
        bio_info: List of dictionaries with information about individuals
        
    Returns:
        Dictionary mapping individual IDs to their biological information
    """
    bioinfo_dict = {}
    
    for person in bio_info:
        genotype_id = person['genotype_id']
        
        # Extract required fields
        processed_info = {
            'sex': person.get('sex', 'U'),  # Default to Unknown if not provided
            'age': person.get('age', None),
            'birth_year': person.get('birth_year', None)
        }
        
        # Validate sex information
        if processed_info['sex'] not in ['M', 'F', 'U']:
            print(f"Warning: Invalid sex '{processed_info['sex']}' for individual {genotype_id}. Setting to Unknown.")
            processed_info['sex'] = 'U'
            
        bioinfo_dict[genotype_id] = processed_info
    
    print(f"Processed bioinfo for {len(bioinfo_dict)} individuals")
    return bioinfo_dict

# Test with our dummy data
ibd_dict = process_ibd_segments(dummy_ibd_seg_list, min_seg_len=7)
bioinfo_dict = process_bioinfo(dummy_bio_info)

# Display a sample of the processed data
print("\nSample of IBD data:")
for pair, segments in list(ibd_dict.items())[:2]:  # Show first 2 pairs
    print(f"  {pair}: {segments}")
    
print("\nSample of bioinfo data:")
for person_id, info in list(bioinfo_dict.items())[:2]:  # Show first 2 individuals
    print(f"  {person_id}: {info}")
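
The functions above filter segments by length but do not check that each record is well formed. The following is a minimal validation sketch, assuming the 7-field segment layout used in this lab; the function name `validate_ibd_segment` and the specific checks are illustrative and are not part of Bonsai's API.

```python
def validate_ibd_segment(segment):
    """Return a list of problems found in one segment (empty list if valid).

    Assumes the layout used in this lab:
    [id1, id2, chromosome, start, end, is_full, length_cm].
    """
    if len(segment) != 7:
        return [f"expected 7 fields, got {len(segment)}"]

    id1, id2, chrom, start, end, is_full, length_cm = segment
    problems = []
    if id1 == id2:
        problems.append("segment pairs an individual with itself")
    if start >= end:
        problems.append("start position is not before end position")
    if length_cm <= 0:
        problems.append("genetic length must be positive")
    if not isinstance(is_full, bool):
        problems.append("is_full flag must be a boolean")
    return problems


# Hypothetical records: one valid, two deliberately malformed
segments = [
    [1, 2, '1', 1000000, 5000000, False, 10.5],   # valid
    [3, 3, '2', 2000000, 7000000, False, 15.2],   # self-pair
    [2, 3, '3', 9000000, 3000000, False, -1.0],   # reversed coordinates, bad length
]
for seg in segments:
    print(seg[:2], validate_ibd_segment(seg) or "ok")
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a large input file in a single pass.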

### 2.2 Relationship Model Module

The Relationship Model Module defines the genetic relationship patterns and calculates likelihoods for different relationship hypotheses. This is one of the most critical components of Bonsai, as it determines how genetic evidence is evaluated.

Let's implement a simplified version of the relationship model:

In [None]:
class RelationshipModel:
    """Models genetic relationships and calculates likelihoods based on IBD sharing."""
    
    def __init__(self):
        """Initialize relationship model parameters."""
        # Define expected sharing for common relationships
        # Format: (total_cm, num_segments)
        self.expected_sharing = {
            'parent-child': (3500, 40),
            'full-siblings': (2500, 35),
            'half-siblings': (1700, 25),
            'grandparent': (1700, 25),
            'aunt-uncle': (1700, 25),
            'first-cousins': (850, 15),
            'second-cousins': (212, 5),
            'unrelated': (0, 0)
        }
        
        # Standard deviations for expected sharing
        self.sharing_sd = {
            'total_cm': 0.1,  # As fraction of expected value
            'num_segments': 0.2  # As fraction of expected value
        }
    
    def calculate_relationship_likelihood(self, pair, segments, relationship_type):
        """Calculate likelihood of a specific relationship given IBD sharing.
        
        Args:
            pair: Tuple of individual IDs
            segments: List of IBD segments between the pair
            relationship_type: String indicating relationship type
            
        Returns:
            Log likelihood of the relationship
        """
        if relationship_type not in self.expected_sharing:
            raise ValueError(f"Unknown relationship type: {relationship_type}")
        
        # Calculate observed sharing
        total_cm = sum(seg['length_cm'] for seg in segments)
        num_segments = len(segments)
        
        # Get expected sharing for this relationship
        expected_cm, expected_segments = self.expected_sharing[relationship_type]
        
        # Calculate standard deviations. Fall back to 1 when the expectation is 0
        # (as for 'unrelated') so the density stays well defined.
        sd_cm = expected_cm * self.sharing_sd['total_cm'] if expected_cm > 0 else 1
        sd_segments = expected_segments * self.sharing_sd['num_segments'] if expected_segments > 0 else 1
        
        # Calculate log likelihood using normal distribution
        from scipy.stats import norm
        
        # Log likelihood for total cM. The 'unrelated' hypothesis is scored like
        # every other one: returning a constant 0 for it would make it win every
        # comparison, because the other log-densities are negative.
        ll_cm = norm.logpdf(total_cm, expected_cm, sd_cm)
        
        # Log likelihood for number of segments
        ll_segments = norm.logpdf(num_segments, expected_segments, sd_segments)
        
        # Combined log likelihood
        log_likelihood = ll_cm + ll_segments
        
        return log_likelihood
    
    def find_most_likely_relationship(self, pair, segments):
        """Find the most likely relationship type given IBD sharing.
        
        Args:
            pair: Tuple of individual IDs
            segments: List of IBD segments between the pair
            
        Returns:
            Tuple of (relationship_type, log_likelihood)
        """
        best_relationship = None
        best_likelihood = float('-inf')
        
        for relationship in self.expected_sharing.keys():
            likelihood = self.calculate_relationship_likelihood(pair, segments, relationship)
            
            if likelihood > best_likelihood:
                best_likelihood = likelihood
                best_relationship = relationship
        
        return (best_relationship, best_likelihood)

# Create a relationship model
relationship_model = RelationshipModel()

# Test with a sample pair and their segments
sample_pair = next(iter(ibd_dict.keys()))
sample_segments = ibd_dict[sample_pair]

print(f"Testing relationship model with pair {sample_pair} who share {len(sample_segments)} segments")
print(f"Total shared cM: {sum(seg['length_cm'] for seg in sample_segments):.2f}")

# Calculate likelihoods for different relationships
print("\nRelationship likelihoods:")
for relationship in relationship_model.expected_sharing.keys():
    likelihood = relationship_model.calculate_relationship_likelihood(sample_pair, sample_segments, relationship)
    print(f"  {relationship}: {likelihood:.2f}")

# Find most likely relationship
most_likely = relationship_model.find_most_likely_relationship(sample_pair, sample_segments)
print(f"\nMost likely relationship: {most_likely[0]} (log likelihood: {most_likely[1]:.2f})")
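
To make the scoring concrete, the simplified model sums two Gaussian log-densities, each of the form $\log \mathcal{N}(x; \mu, \sigma) = -\frac{(x-\mu)^2}{2\sigma^2} - \log\left(\sigma\sqrt{2\pi}\right)$. The sketch below writes that formula out with only the standard library and applies it to a single 10.5 cM segment under the `'second-cousins'` expectation of (212 cM, 5 segments). The helper name `normal_logpdf` is an assumption made for illustration, and the numbers only illustrate this lab's toy model, not Bonsai's actual likelihoods.

```python
import math

def normal_logpdf(x, mean, sd):
    """Log-density of a normal distribution, written out explicitly."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

# Expected sharing for 'second-cousins' in the simplified model above
expected_cm, expected_segments = 212, 5
sd_cm = expected_cm * 0.1              # 10% of the expected total cM
sd_segments = expected_segments * 0.2  # 20% of the expected segment count

# Observed sharing: one segment of 10.5 cM
ll_cm = normal_logpdf(10.5, expected_cm, sd_cm)
ll_segments = normal_logpdf(1, expected_segments, sd_segments)

print(f"ll_cm={ll_cm:.2f}, ll_segments={ll_segments:.2f}, total={ll_cm + ll_segments:.2f}")
```

Because the observed total is many standard deviations below the expected 212 cM, the total-cM term dominates the penalty for this hypothesis.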

### 2.3 Up-Node Dictionary Module

The Up-Node Dictionary is the core data structure that represents the pedigree in Bonsai. It's called an "Up-Node" dictionary because it represents individuals and their upward connections in the pedigree (i.e., to parents, grandparents, etc.).

Let's explore this data structure:

In [ ]:
class UpNodeDictionary:
    """Represents a pedigree structure using an Up-Node Dictionary."""
    
    def __init__(self, bioinfo_dict):
        """Initialize an empty pedigree with known individuals.
        
        Args:
            bioinfo_dict: Dictionary mapping individual IDs to biological information
        """
        self.up_dict = {}
        self.next_internal_id = -1  # Negative IDs for inferred ancestors
        
        # Add known individuals to the pedigree
        for individual_id, bio_info in bioinfo_dict.items():
            self.up_dict[individual_id] = {
                'father': None,
                'mother': None,
                'sex': bio_info['sex'],
                'age': bio_info.get('age'),
                'type': 'known'  # Known individual, not inferred
            }
    
    def get_next_internal_id(self):
        """Get the next available internal ID for inferred ancestors."""
        internal_id = self.next_internal_id
        self.next_internal_id -= 1
        return internal_id
    
    def add_parent(self, child_id, parent_sex):
        """Add a parent to an individual.
        
        Args:
            child_id: ID of the child
            parent_sex: Sex of the parent ('M' for father, 'F' for mother)
            
        Returns:
            ID of the added parent
        """
        if child_id not in self.up_dict:
            raise ValueError(f"Individual {child_id} not found in pedigree")
        
        # Create a new parent node
        parent_id = self.get_next_internal_id()
        self.up_dict[parent_id] = {
            'father': None,
            'mother': None,
            'sex': parent_sex,
            'age': None,  # Unknown age for inferred ancestors
            'type': 'inferred'  # Inferred ancestor
        }
        
        # Update the child's parent
        if parent_sex == 'M':
            self.up_dict[child_id]['father'] = parent_id
        elif parent_sex == 'F':
            self.up_dict[child_id]['mother'] = parent_id
        else:
            raise ValueError(f"Invalid parent sex: {parent_sex}. Must be 'M' or 'F'.")
        
        return parent_id
    
    def add_relationship(self, id1, id2, relationship):
        """Add a relationship between two individuals by adding common ancestors.
        
        Args:
            id1: ID of the first individual
            id2: ID of the second individual
            relationship: Relationship type ('full-siblings', 'half-siblings', or 'first-cousins')
            
        Returns:
            IDs of added ancestors
        """
        if relationship == 'full-siblings':
            # Add common father and mother
            father_id = self.add_parent(id1, 'M')
            mother_id = self.add_parent(id1, 'F')
            
            # Connect second individual to the same parents
            self.up_dict[id2]['father'] = father_id
            self.up_dict[id2]['mother'] = mother_id
            
            return [father_id, mother_id]
            
        elif relationship == 'half-siblings':
            # Add common parent (arbitrarily choose father)
            father_id = self.add_parent(id1, 'M')
            
            # Connect second individual to the same father
            self.up_dict[id2]['father'] = father_id
            
            # Add separate mothers
            mother1_id = self.add_parent(id1, 'F')
            mother2_id = self.add_parent(id2, 'F')
            
            return [father_id, mother1_id, mother2_id]
            
        elif relationship == 'first-cousins':
            # Create grandparents
            grandfather_id = self.get_next_internal_id()
            grandmother_id = self.get_next_internal_id()
            
            # Create parents for id1
            parent1_id = self.add_parent(id1, 'M')  # Father
            self.up_dict[parent1_id]['father'] = grandfather_id
            self.up_dict[parent1_id]['mother'] = grandmother_id
            
            # Create parents for id2
            parent2_id = self.add_parent(id2, 'F')  # Mother (different from id1's parent)
            self.up_dict[parent2_id]['father'] = grandfather_id
            self.up_dict[parent2_id]['mother'] = grandmother_id
            
            # Add grandparents to the dictionary
            self.up_dict[grandfather_id] = {
                'father': None, 'mother': None, 'sex': 'M', 'age': None, 'type': 'inferred'
            }
            self.up_dict[grandmother_id] = {
                'father': None, 'mother': None, 'sex': 'F', 'age': None, 'type': 'inferred'
            }
            
            return [grandfather_id, grandmother_id, parent1_id, parent2_id]
            
        else:
            raise ValueError(f"Unsupported relationship type: {relationship}")
    
    def visualize_pedigree(self):
        """Visualize the pedigree as a graph."""
        G = nx.DiGraph()
        
        # Add nodes
        for individual_id, info in self.up_dict.items():
            node_type = info['type']
            sex = info['sex']
            
            # Choose node shape and color based on sex and type
            if sex == 'M':
                node_shape = 's'  # Square for males
                node_color = 'lightblue' if node_type == 'known' else 'skyblue'
            elif sex == 'F':
                node_shape = 'o'  # Circle for females
                node_color = 'pink' if node_type == 'known' else 'lightpink'
            else:
                node_shape = 'd'  # Diamond for unknown
                node_color = 'lightgray' if node_type == 'known' else 'gray'
            
            G.add_node(individual_id, shape=node_shape, color=node_color, type=node_type)
            
            # Add edges to parents
            if info['father'] is not None:
                G.add_edge(info['father'], individual_id)
            if info['mother'] is not None:
                G.add_edge(info['mother'], individual_id)
        
        # Create positions using a hierarchical layout
        pos = nx.nx_agraph.graphviz_layout(G, prog="dot")
        
        plt.figure(figsize=(12, 8))
        
        # Draw nodes by sex
        node_shapes = nx.get_node_attributes(G, 'shape')
        node_colors = nx.get_node_attributes(G, 'color')
        node_types = nx.get_node_attributes(G, 'type')
        
        # Draw males (squares)
        male_nodes = [n for n, s in node_shapes.items() if s == 's']
        male_colors = [node_colors[n] for n in male_nodes]
        nx.draw_networkx_nodes(G, pos, nodelist=male_nodes, node_color=male_colors,
                              node_shape='s', node_size=500)
        
        # Draw females (circles)
        female_nodes = [n for n, s in node_shapes.items() if s == 'o']
        female_colors = [node_colors[n] for n in female_nodes]
        nx.draw_networkx_nodes(G, pos, nodelist=female_nodes, node_color=female_colors,
                              node_shape='o', node_size=500)
        
        # Draw unknown (diamonds)
        unknown_nodes = [n for n, s in node_shapes.items() if s == 'd']
        unknown_colors = [node_colors[n] for n in unknown_nodes]
        nx.draw_networkx_nodes(G, pos, nodelist=unknown_nodes, node_color=unknown_colors,
                              node_shape='d', node_size=500)
        
        # Draw edges
        nx.draw_networkx_edges(G, pos, arrows=False)
        
        # Draw labels
        labels = {}
        for node in G.nodes():
            if node_types[node] == 'known':
                labels[node] = str(node)
            else:
                labels[node] = f"A{abs(node)}"
        nx.draw_networkx_labels(G, pos, labels=labels)
        
        plt.title("Pedigree Visualization")
        plt.axis('off')
        plt.show()
        
        return G

# Create an Up-Node Dictionary with our sample data
up_node_dict = UpNodeDictionary(bioinfo_dict)

# Display the initial state
print("Initial Up-Node Dictionary:")
for person_id, info in up_node_dict.up_dict.items():
    print(f"  {person_id}: {info}")

# Add some relationships
print("\nAdding full-sibling relationship between 1 and 3...")
added_nodes = up_node_dict.add_relationship(1, 3, 'full-siblings')
print(f"Added nodes: {added_nodes}")

print("\nAdding first-cousin relationship between 2 and 4...")
added_nodes = up_node_dict.add_relationship(2, 4, 'first-cousins')
print(f"Added nodes: {added_nodes}")

# Visualize the pedigree
print("\nVisualizing the pedigree...")
up_node_dict.visualize_pedigree()
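
The architecture overview lists a Constraint Handler, and `add_relationship` above does not itself check whether the links it creates are biologically plausible. The following is a rough sketch of what such checks could look like on a dictionary shaped like `up_dict`. The function name `check_constraints` and the particular rules (parent sex matches role, parent older than child when both ages are known) are assumptions for illustration, not Bonsai's actual constraint logic.

```python
def check_constraints(up_dict):
    """Return a list of human-readable constraint violations in an up-node dict."""
    violations = []
    for child, info in up_dict.items():
        for role, required_sex in (('father', 'M'), ('mother', 'F')):
            parent = info.get(role)
            if parent is None:
                continue
            parent_info = up_dict.get(parent)
            if parent_info is None:
                violations.append(f"{child}: {role} {parent} is missing from the pedigree")
                continue
            # Unknown sex ('U') is tolerated; a known, mismatched sex is not
            if parent_info.get('sex') not in (required_sex, 'U'):
                violations.append(f"{child}: {role} {parent} has sex {parent_info.get('sex')}")
            # Ages are only compared when both are known
            child_age, parent_age = info.get('age'), parent_info.get('age')
            if child_age is not None and parent_age is not None and parent_age <= child_age:
                violations.append(f"{child}: {role} {parent} is not older than the child")
    return violations


# Hypothetical pedigree with two deliberate problems
toy = {
    1: {'father': 2, 'mother': 3, 'sex': 'M', 'age': 30},
    2: {'father': None, 'mother': None, 'sex': 'F', 'age': 60},  # wrong sex for a father
    3: {'father': None, 'mother': None, 'sex': 'F', 'age': 25},  # younger than her child
}
for problem in check_constraints(toy):
    print(problem)
```

A search procedure could call a check like this on each candidate pedigree and discard candidates that return a non-empty list.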

In [ ]:
def calculate_pedigree_likelihood(up_dict, ibd_dict, relationship_model):
    """Calculate the overall likelihood of a pedigree given observed IBD sharing.
    
    Args:
        up_dict: Up-Node Dictionary representing the pedigree
        ibd_dict: Dictionary of IBD segments between pairs
        relationship_model: Relationship model for likelihood calculations
        
    Returns:
        Log likelihood of the pedigree
    """
    total_log_likelihood = 0
    
    # Process each pair with observed IBD segments
    for pair, segments in ibd_dict.items():
        id1, id2 = pair
        
        # Infer the relationship type from the pedigree
        relationship_type = infer_relationship_from_pedigree(up_dict.up_dict, id1, id2)
        
        # Calculate likelihood of this relationship given the observed segments
        if relationship_type is not None:
            try:
                log_likelihood = relationship_model.calculate_relationship_likelihood(pair, segments, relationship_type)
                total_log_likelihood += log_likelihood
                print(f"Pair {pair}: Inferred {relationship_type}, Log likelihood: {log_likelihood:.2f}")
            except ValueError:
                # If the relationship type is not supported by the model
                print(f"Warning: Relationship type '{relationship_type}' not supported by the model")
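        else:
            # Pairs whose relationship cannot be inferred contribute nothing to the
            # total, so this simplified score does not penalize leaving them unconnected
            print(f"Warning: Could not infer relationship between {id1} and {id2}")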
    
    return total_log_likelihood

def infer_relationship_from_pedigree(up_dict, id1, id2):
    """Infer the relationship type between two individuals in a pedigree.
    
    Args:
        up_dict: Up-Node Dictionary representing the pedigree
        id1: ID of first individual
        id2: ID of second individual
        
    Returns:
        String indicating the relationship type, or None if it couldn't be determined
    """
    # Check for direct parent-child relationship
    if (up_dict[id1].get('father') == id2 or up_dict[id1].get('mother') == id2 or
        up_dict[id2].get('father') == id1 or up_dict[id2].get('mother') == id1):
        return 'parent-child'
    
    # Check for full siblings
    if (up_dict[id1].get('father') is not None and up_dict[id2].get('father') is not None and
        up_dict[id1].get('mother') is not None and up_dict[id2].get('mother') is not None and
        up_dict[id1].get('father') == up_dict[id2].get('father') and
        up_dict[id1].get('mother') == up_dict[id2].get('mother')):
        return 'full-siblings'
    
    # Check for half siblings (same father)
    if (up_dict[id1].get('father') is not None and up_dict[id2].get('father') is not None and
        up_dict[id1].get('father') == up_dict[id2].get('father') and
        up_dict[id1].get('mother') != up_dict[id2].get('mother')):
        return 'half-siblings'
    
    # Check for half siblings (same mother)
    if (up_dict[id1].get('mother') is not None and up_dict[id2].get('mother') is not None and
        up_dict[id1].get('mother') == up_dict[id2].get('mother') and
        up_dict[id1].get('father') != up_dict[id2].get('father')):
        return 'half-siblings'
    
    # For simplicity, we'll stop at these basic relationships
    # In a real implementation, we would trace through the pedigree to find grandparent-grandchild,
    # first cousin, etc. relationships
    
    return None

# Calculate the likelihood of our sample pedigree
print("Calculating pedigree likelihood...")
pedigree_likelihood = calculate_pedigree_likelihood(up_node_dict, ibd_dict, relationship_model)
print(f"Total log likelihood of the pedigree: {pedigree_likelihood:.2f}")
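
`infer_relationship_from_pedigree` stops at parent-child and sibling relationships. One way to extend it, sketched below under simplifying assumptions, is to collect each individual's ancestors with their generational depth and classify the pair by its closest common ancestors. The helper names `ancestor_depths` and `classify_by_common_ancestors` are hypothetical, the lookup table covers only a few relationship types, and the approach assumes a pedigree without inbreeding loops.

```python
def ancestor_depths(up_dict, person, depth=0, found=None):
    """Recursively collect {ancestor_id: generations above person}."""
    if found is None:
        found = {}
    for role in ('father', 'mother'):
        parent = up_dict.get(person, {}).get(role)
        if parent is not None and parent not in found:
            found[parent] = depth + 1
            ancestor_depths(up_dict, parent, depth + 1, found)
    return found


def classify_by_common_ancestors(up_dict, id1, id2):
    """Classify a pair by the depth and number of its closest common ancestors."""
    a1, a2 = ancestor_depths(up_dict, id1), ancestor_depths(up_dict, id2)

    # Lineal relationships: one individual is an ancestor of the other
    if id2 in a1 or id1 in a2:
        depth = a1.get(id2) or a2.get(id1)
        return {1: 'parent-child', 2: 'grandparent'}.get(depth)

    common = set(a1) & set(a2)
    if not common:
        return None
    best = min(a1[c] + a2[c] for c in common)
    closest = [c for c in common if a1[c] + a2[c] == best]
    d1, d2 = sorted((a1[closest[0]], a2[closest[0]]))

    # Key: (smaller depth, larger depth, number of closest common ancestors)
    table = {
        (1, 1, 2): 'full-siblings',
        (1, 1, 1): 'half-siblings',
        (1, 2, 2): 'aunt-uncle',
        (2, 2, 2): 'first-cousins',
        (3, 3, 2): 'second-cousins',
    }
    return table.get((d1, d2, len(closest)))


# Hypothetical pedigree: 2 and 4 are children of siblings -3 and -4
toy = {
    2: {'father': -3, 'mother': None},
    4: {'father': None, 'mother': -4},
    -3: {'father': -1, 'mother': -2},
    -4: {'father': -1, 'mother': -2},
    -1: {'father': None, 'mother': None},
    -2: {'father': None, 'mother': None},
}
print(classify_by_common_ancestors(toy, 2, 4))
print(classify_by_common_ancestors(toy, 2, -4))
```

The returned labels match the keys of `expected_sharing` in the relationship model, so a result could be passed straight to `calculate_relationship_likelihood`.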

## 3. Optimization Engine

The Optimization Engine is responsible for searching through the space of possible pedigrees to find the structure that best explains the observed IBD sharing. This search is the heart of the Bonsai algorithm.

### 3.1 Search Algorithms

Bonsai uses several search algorithms to explore the space of possible pedigrees:

1. **Greedy Search**: Iteratively adds the most likely relationship at each step
2. **Simulated Annealing**: Uses randomization to avoid getting stuck in local optima
3. **Markov Chain Monte Carlo (MCMC)**: Samples from the space of possible pedigrees

Let's implement a simplified version of the greedy search algorithm:

In [ ]:
def greedy_pedigree_search(bioinfo_dict, ibd_dict, relationship_model, max_iterations=10):
    """Perform a greedy search for the best pedigree structure.
    
    Args:
        bioinfo_dict: Dictionary of biological information for individuals
        ibd_dict: Dictionary of IBD segments between pairs
        relationship_model: Relationship model for likelihood calculations
        max_iterations: Maximum number of iterations to perform
        
    Returns:
        Tuple of (best_pedigree, best_likelihood)
    """
    # Initialize an empty pedigree
    current_pedigree = UpNodeDictionary(bioinfo_dict)
    current_likelihood = calculate_pedigree_likelihood(current_pedigree, ibd_dict, relationship_model)
    
    print(f"Initial pedigree likelihood: {current_likelihood:.2f}")
    
    # Keep track of pairs that have been processed
    processed_pairs = set()
    
    # Perform greedy search
    for iteration in range(max_iterations):
        print(f"\nIteration {iteration + 1}/{max_iterations}")
        
        best_move = None
        best_move_likelihood = current_likelihood
        
        # Try adding each possible relationship
        for pair, segments in ibd_dict.items():
            if pair in processed_pairs:
                continue
                
            id1, id2 = pair
            
            # Find the most likely relationship for this pair
            relationship, _ = relationship_model.find_most_likely_relationship(pair, segments)
            
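            # Create an independent copy of the current pedigree. Each entry is
            # itself a dict, so a plain .copy() would share those inner dicts and
            # let a trial move modify current_pedigree. The ID counter is copied
            # too, so new ancestor IDs do not collide with existing ones.
            test_pedigree = UpNodeDictionary(bioinfo_dict)
            test_pedigree.up_dict = {k: dict(v) for k, v in current_pedigree.up_dict.items()}
            test_pedigree.next_internal_id = current_pedigree.next_internal_id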
            
            # Try adding this relationship
            try:
                test_pedigree.add_relationship(id1, id2, relationship)
                test_likelihood = calculate_pedigree_likelihood(test_pedigree, ibd_dict, relationship_model)
                
                print(f"Trying {relationship} between {id1} and {id2}: Likelihood {test_likelihood:.2f}")
                
                # If this move improves the likelihood, remember it
                if test_likelihood > best_move_likelihood:
                    best_move = (id1, id2, relationship)
                    best_move_likelihood = test_likelihood
            except ValueError as e:
                print(f"Error adding {relationship} between {id1} and {id2}: {e}")
        
        # If we found a better move, apply it
        if best_move is not None:
            id1, id2, relationship = best_move
            print(f"Applying best move: {relationship} between {id1} and {id2}")
            current_pedigree.add_relationship(id1, id2, relationship)
            current_likelihood = best_move_likelihood
            processed_pairs.add((id1, id2))
            processed_pairs.add((id2, id1))  # Add both orderings of the pair
        else:
            print("No improving moves found. Stopping search.")
            break
    
    print(f"\nFinal pedigree likelihood: {current_likelihood:.2f}")
    return current_pedigree, current_likelihood

# Run the greedy search algorithm on our sample data
# Note: This could take a while, so we'll limit to just a few iterations
print("Running greedy pedigree search...")
best_pedigree, best_likelihood = greedy_pedigree_search(
    bioinfo_dict, ibd_dict, relationship_model, max_iterations=3)

print("\nVisualizing the best pedigree found...")
best_pedigree.visualize_pedigree()

### 3.2 Simulated Annealing

The greedy search algorithm can get stuck in local optima. Simulated annealing addresses this by occasionally accepting moves that decrease the likelihood, with a probability that depends on a "temperature" parameter that decreases over time.
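
A common way to realize this idea is the Metropolis acceptance criterion combined with geometric cooling: a move that lowers the log likelihood by $\Delta$ is accepted with probability $e^{-\Delta / T}$, and the temperature is multiplied by a constant factor each iteration. The sketch below illustrates that rule in isolation. The function names `accept_move` and `temperature_at` and the parameter values are assumptions chosen for illustration, and the lab's own implementation may differ in its details.

```python
import math
import random

def accept_move(current_ll, proposed_ll, temperature, rng=random):
    """Metropolis-style acceptance rule applied to log likelihoods."""
    if proposed_ll >= current_ll:
        return True   # Improving (or equal) moves are always accepted
    if temperature <= 0:
        return False  # At zero temperature the search is purely greedy
    # Worse moves are accepted with probability exp(delta / T), where delta < 0
    return rng.random() < math.exp((proposed_ll - current_ll) / temperature)


def temperature_at(iteration, initial_temp=10.0, cooling_rate=0.95):
    """Geometric cooling schedule: T_k = T_0 * rate^k."""
    return initial_temp * cooling_rate ** iteration


# Probability of accepting a move that is 5 log-units worse, at three stages
for iteration in (0, 50, 100):
    t = temperature_at(iteration)
    p = math.exp(-5.0 / t)
    print(f"iteration {iteration}: T={t:.3f}, P(accept)={p:.4f}")
```

Early in the schedule worse moves are accepted fairly often, which lets the search leave a local optimum; as the temperature falls, the rule approaches the greedy behavior of the previous section.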

Here's a simplified implementation of simulated annealing for pedigree reconstruction:

In [ ]:
def simulated_annealing_pedigree_search(bioinfo_dict, ibd_dict, relationship_model, 
                                  max_iterations=100, initial_temp=10.0, cooling_rate=0.95):
    """Perform a simulated annealing search for the best pedigree structure.
    
    Args:
        bioinfo_dict: Dictionary of biological information for individuals
        ibd_dict: Dictionary of IBD segments between pairs
        relationship_model: Relationship model for likelihood calculations
        max_iterations: Maximum number of iterations to perform
        initial_temp: Initial temperature for simulated annealing
        cooling_rate: Rate at which temperature decreases
        
    Returns:
        Tuple of (best_pedigree, best_likelihood)
    """
    # Initialize an empty pedigree
    current_pedigree = UpNodeDictionary(bioinfo_dict)
    current_likelihood = calculate_pedigree_likelihood(current_pedigree, ibd_dict, relationship_model)
    
    # Keep track of the best pedigree seen so far
    best_pedigree = current_pedigree
    best_likelihood = current_likelihood
    
    # Initialize temperature
    temperature = initial_temp
    
    print(f"Initial pedigree likelihood: {current_likelihood:.2f}")
    print(f"Initial temperature: {temperature:.2f}")
    
    # Perform simulated annealing
    for iteration in range(max_iterations):
        # Generate a random move: add a relationship between a random pair
        pairs = list(ibd_dict.keys())
        if not pairs:
            print("No pairs to process. Stopping search.")
            break
            
        pair = random.choice(pairs)
        id1, id2 = pair
        
        # Find a random relationship to try (in a real implementation, we would use
        # the relationship model to guide this choice)
        relationship = random.choice(['full-siblings', 'half-siblings', 'first-cousins'])
        
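        # Create an independent copy of the current pedigree. A deep copy is used
        # because a shallow dict.copy() would share the nested per-individual entries,
        # so a rejected move could still modify current_pedigree.
        import copy
        test_pedigree = UpNodeDictionary(bioinfo_dict)
        test_pedigree.up_dict = copy.deepcopy(current_pedigree.up_dict)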
        
        # Try applying the move
        try:
            test_pedigree.add_relationship(id1, id2, relationship)
            test_likelihood = calculate_pedigree_likelihood(test_pedigree, ibd_dict, relationship_model)
            
            # Calculate change in likelihood
            delta_likelihood = test_likelihood - current_likelihood
            
            # Decide whether to accept the move
            if delta_likelihood > 0:
                # Always accept improving moves
                current_pedigree = test_pedigree
                current_likelihood = test_likelihood
                print(f"Iteration {iteration + 1}/{max_iterations}: Accepted improving move. Likelihood: {current_likelihood:.2f}")
                
                # Update best pedigree if this is the best seen so far
                if current_likelihood > best_likelihood:
                    best_pedigree = current_pedigree
                    best_likelihood = current_likelihood
            else:
                # Accept worsening moves with a probability that depends on temperature
                acceptance_probability = np.exp(delta_likelihood / temperature)
                if random.random() < acceptance_probability:
                    current_pedigree = test_pedigree
                    current_likelihood = test_likelihood
                    print(f"Iteration {iteration + 1}/{max_iterations}: Accepted worsening move with probability {acceptance_probability:.4f}. Likelihood: {current_likelihood:.2f}")
                else:
                    print(f"Iteration {iteration + 1}/{max_iterations}: Rejected worsening move with probability {1 - acceptance_probability:.4f}")
        except ValueError as e:
            print(f"Iteration {iteration + 1}/{max_iterations}: Error adding {relationship} between {id1} and {id2}: {e}")
        
        # Cool the temperature
        temperature *= cooling_rate
        
        # Print status every 10 iterations
        if (iteration + 1) % 10 == 0:
            print(f"Iteration {iteration + 1}/{max_iterations}: Temperature: {temperature:.4f}, Current likelihood: {current_likelihood:.2f}, Best likelihood: {best_likelihood:.2f}")
    
    print(f"\nFinal best pedigree likelihood: {best_likelihood:.2f}")
    return best_pedigree, best_likelihood

# For demonstration, we'll just describe the algorithm without running it
print("Simulated Annealing Pedigree Search:")
print("  1. Start with an initial pedigree and temperature")
print("  2. Iteratively propose random changes to the pedigree")
print("  3. Always accept changes that improve the likelihood")
print("  4. Sometimes accept changes that worsen the likelihood, with probability dependent on temperature")
print("  5. Gradually decrease the temperature over time")
print("  6. Return the best pedigree found during the search")
print("\nNote: We won't run this algorithm here due to time constraints, but in a real application,")
print("      it would be used to escape local optima that the greedy search might get stuck in.")

## 4. Constraint Handler

The Constraint Handler ensures that the pedigree respects biological and logical constraints. Examples of constraints include:

1. **Biological constraints**:
   - Age constraints (parents must be older than children)
   - Reproductive age limits
   - Sex-specific constraints (e.g., only females can be mothers)

2. **Logical constraints**:
   - No cycles in the pedigree (a person cannot be their own ancestor)
   - No duplicate parents (a person cannot have more than one biological father or more than one biological mother)
   - Consistency with known relationships

Let's implement a simplified constraint handler:

In [ ]:
class ConstraintHandler:
    """Handles biological and logical constraints for pedigree structures."""
    
    def __init__(self, bioinfo_dict):
        """Initialize constraint handler with biological information.
        
        Args:
            bioinfo_dict: Dictionary mapping individual IDs to biological information
        """
        self.bioinfo_dict = bioinfo_dict
        self.min_reproductive_age = 15  # Minimum age to have children
        self.max_reproductive_age = {
            'F': 50,  # Maximum age for females to have children
            'M': 80   # Maximum age for males to have children
        }
        self.min_parent_child_age_diff = 15  # Minimum age difference between parent and child
    
    def check_constraints(self, up_dict):
        """Check if a pedigree satisfies all constraints.
        
        Args:
            up_dict: Up-Node Dictionary representing the pedigree
            
        Returns:
            Tuple of (is_valid, list_of_violations)
        """
        violations = []
        
        # Check for cycles
        if self._has_cycles(up_dict):
            violations.append("Pedigree contains cycles")
        
        # Check sex and age constraints for each individual's recorded parents
        for individual_id, info in up_dict.items():
            father = info.get('father')
            mother = info.get('mother')
            
            # Each parent is checked independently, so a violation is still
            # reported when only one parent is recorded
            if father is not None and father in up_dict and up_dict[father].get('sex') != 'M':
                violations.append(f"Individual {father} is not male but is set as father")
            
            if mother is not None and mother in up_dict and up_dict[mother].get('sex') != 'F':
                violations.append(f"Individual {mother} is not female but is set as mother")
            
            # Check age constraints
            if self._has_age_violations(up_dict, individual_id):
                violations.append(f"Age constraints violated for individual {individual_id}")
        
        return len(violations) == 0, violations
    
    def _has_cycles(self, up_dict):
        """Check if the pedigree contains cycles.
        
        Args:
            up_dict: Up-Node Dictionary representing the pedigree
            
        Returns:
            Boolean indicating whether cycles exist
        """
        # Create a directed graph
        G = nx.DiGraph()
        
        # Add nodes and edges
        for individual_id, info in up_dict.items():
            G.add_node(individual_id)
            
            # Add edges to parents
            father = info.get('father')
            mother = info.get('mother')
            
            if father is not None:
                G.add_edge(individual_id, father)
            
            if mother is not None:
                G.add_edge(individual_id, mother)
        
        # Check for cycles
        try:
            nx.find_cycle(G)
            return True  # Cycle found
        except nx.NetworkXNoCycle:
            return False  # No cycle found
    
    def _has_age_violations(self, up_dict, individual_id):
        """Check for age-related constraint violations.
        
        Args:
            up_dict: Up-Node Dictionary representing the pedigree
            individual_id: ID of the individual to check
            
        Returns:
            Boolean indicating whether age constraints are violated
        """
        # Skip if this is not a known individual or age information is missing
        if individual_id not in self.bioinfo_dict or 'age' not in self.bioinfo_dict[individual_id]:
            return False
        
        individual_age = self.bioinfo_dict[individual_id].get('age')
        
        # Skip if age is unknown
        if individual_age is None:
            return False
        
        # Check parents' ages
        father_id = up_dict[individual_id].get('father')
        mother_id = up_dict[individual_id].get('mother')
        
        if father_id is not None and father_id in self.bioinfo_dict:
            father_age = self.bioinfo_dict[father_id].get('age')
            
            if father_age is not None:
                # Father must be older than child
                if father_age <= individual_age:
                    return True
                
                # Father must be at least min_parent_child_age_diff years older
                if father_age - individual_age < self.min_parent_child_age_diff:
                    return True
                
                # Father must have been of reproductive age when child was born
                father_age_at_birth = father_age - individual_age
                if father_age_at_birth < self.min_reproductive_age or father_age_at_birth > self.max_reproductive_age['M']:
                    return True
        
        if mother_id is not None and mother_id in self.bioinfo_dict:
            mother_age = self.bioinfo_dict[mother_id].get('age')
            
            if mother_age is not None:
                # Mother must be older than child
                if mother_age <= individual_age:
                    return True
                
                # Mother must be at least min_parent_child_age_diff years older
                if mother_age - individual_age < self.min_parent_child_age_diff:
                    return True
                
                # Mother must have been of reproductive age when child was born
                mother_age_at_birth = mother_age - individual_age
                if mother_age_at_birth < self.min_reproductive_age or mother_age_at_birth > self.max_reproductive_age['F']:
                    return True
        
        return False
    
    def enforce_constraints(self, up_dict):
        """Modify a pedigree to enforce constraints.
        
        Args:
            up_dict: Up-Node Dictionary representing the pedigree
            
        Returns:
            Modified Up-Node Dictionary
        """
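        # Create a deep copy so the caller's pedigree is not modified; a shallow
        # up_dict.copy() would share the nested per-individual dictionaries
        import copy
        enforced_up_dict = copy.deepcopy(up_dict)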
        
        # Check for violations and remove problematic relationships
        for individual_id, info in up_dict.items():
            father = info.get('father')
            mother = info.get('mother')
            
            # Fix sex constraints
            if father is not None and father in up_dict and up_dict[father].get('sex') != 'M':
                enforced_up_dict[individual_id]['father'] = None
            
            if mother is not None and mother in up_dict and up_dict[mother].get('sex') != 'F':
                enforced_up_dict[individual_id]['mother'] = None
            
            # Fix age constraints
            if self._has_age_violations(up_dict, individual_id):
                # If age constraints are violated, remove parent relationships
                enforced_up_dict[individual_id]['father'] = None
                enforced_up_dict[individual_id]['mother'] = None
        
        # If cycles remain after the fixes above, we would need to break them.
        # This is a complex problem that would require a more sophisticated algorithm.
        # For simplicity, we'll just remove all relationships if cycles are detected
        if self._has_cycles(enforced_up_dict):
            for individual_id in enforced_up_dict:
                enforced_up_dict[individual_id]['father'] = None
                enforced_up_dict[individual_id]['mother'] = None
        
        return enforced_up_dict

# Create a constraint handler for our sample data
constraint_handler = ConstraintHandler(bioinfo_dict)

# Check if our pedigree satisfies constraints
is_valid, violations = constraint_handler.check_constraints(up_node_dict.up_dict)

print(f"Is the pedigree valid? {is_valid}")
if not is_valid:
    print("Constraint violations:")
    for violation in violations:
        print(f"  - {violation}")
    
    print("\nEnforcing constraints...")
    enforced_up_dict = constraint_handler.enforce_constraints(up_node_dict.up_dict)
    
    # Create a new pedigree with the enforced up_dict
    enforced_pedigree = UpNodeDictionary(bioinfo_dict)
    enforced_pedigree.up_dict = enforced_up_dict
    
    # Check if the enforced pedigree satisfies constraints
    is_valid, violations = constraint_handler.check_constraints(enforced_pedigree.up_dict)
    print(f"Is the enforced pedigree valid? {is_valid}")
    
    # Visualize the enforced pedigree
    if not is_valid:
        print("Constraint violations still exist:")
        for violation in violations:
            print(f"  - {violation}")
    else:
        print("Constraints successfully enforced.")
        enforced_pedigree.visualize_pedigree()

## 5. Performance Optimization

Bonsai's architecture is designed with performance in mind, as it needs to handle large-scale pedigree reconstruction with potentially thousands of individuals. Let's explore some of the key optimization strategies used in Bonsai:

### 5.1 Memory Management

Pedigree reconstruction with large datasets can be memory-intensive. Bonsai implements several strategies to minimize memory usage:

1. **Sparse representation**: The Up-Node Dictionary only stores non-null relationships
2. **On-demand computation**: Some values are computed as needed rather than stored
3. **Data filtering**: IBD segments below a certain threshold are filtered out early
4. **Incremental processing**: The pedigree is built incrementally, focusing on the most important relationships first
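
As a minimal sketch of the data filtering strategy, assume IBD segments are stored as in this lab (a dictionary mapping a pair of IDs to a list of segment dictionaries with a `length_cm` field). The helper name `filter_short_segments` and the 7 cM default threshold are illustrative assumptions, not Bonsai's actual API or settings.

```python
def filter_short_segments(ibd_dict, min_length_cm=7.0):
    """Return a new dict that keeps only segments of at least min_length_cm.

    Pairs left with no segments are dropped entirely, so the result stays sparse.
    """
    filtered = {}
    for pair, segments in ibd_dict.items():
        kept = [seg for seg in segments if seg['length_cm'] >= min_length_cm]
        if kept:
            filtered[pair] = kept
    return filtered

# Toy input: one pair keeps a single long segment, the other pair disappears
toy_ibd = {
    (1, 2): [{'length_cm': 42.0}, {'length_cm': 3.5}],
    (1, 3): [{'length_cm': 2.0}],
}
filtered_ibd = filter_short_segments(toy_ibd)
print(filtered_ibd)
```

Filtering before any likelihood calculation means later stages never allocate memory for pairs that carry no usable signal.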

Here's a simple example of memory usage benchmarking:

In [ ]:
def generate_synthetic_data(num_individuals, num_segments_per_pair, min_segment_len=7):
    """Generate synthetic data for benchmarking.
    
    Args:
        num_individuals: Number of individuals to generate
        num_segments_per_pair: Average number of IBD segments per pair
        min_segment_len: Minimum segment length in cM
        
    Returns:
        Tuple of (bioinfo_dict, ibd_dict)
    """
    import random
    from collections import defaultdict
    
    # Generate synthetic bioinfo
    bioinfo_dict = {}
    for i in range(1, num_individuals + 1):
        bioinfo_dict[i] = {
            'sex': random.choice(['M', 'F']),
            'age': random.randint(20, 80),
            'birth_year': None
        }
    
    # Generate synthetic IBD segments
    ibd_dict = defaultdict(list)
    for i in range(1, num_individuals + 1):
        for j in range(i+1, num_individuals + 1):
            if random.random() < 0.3:  # Only generate segments for 30% of pairs
                num_segments = max(1, int(random.gauss(num_segments_per_pair, 2)))
                for _ in range(num_segments):
                    chromosome = random.randint(1, 22)
                    start = random.randint(1000000, 200000000)
                    end = start + random.randint(1000000, 10000000)
                    length = min_segment_len + random.expovariate(1/20)  # Shifted exponential: mean is min_segment_len + 20 cM
                    is_ibd2 = random.random() < 0.05  # 5% chance of IBD2
                    
                    ibd_dict[(i, j)].append({
                        'chromosome': str(chromosome),
                        'start': start,
                        'end': end,
                        'is_ibd2': is_ibd2,
                        'length_cm': length
                    })
    
    return bioinfo_dict, ibd_dict

def measure_memory_usage(func):
    """Run func and report its wall-clock time and tracemalloc memory usage."""
    import tracemalloc
    import time
    
    tracemalloc.start()
    start_time = time.time()
    
    result = func()
    
    end_time = time.time()
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    
    print(f"Execution time: {end_time - start_time:.2f} seconds")
    print(f"Current memory usage: {current / 10**6:.2f} MB")
    print(f"Peak memory usage: {peak / 10**6:.2f} MB")
    
    return result

# Define benchmark scenarios
def benchmark_small():
    """Benchmark with small dataset (10 individuals)."""
    bioinfo_dict, ibd_dict = generate_synthetic_data(10, 5)
    pedigree = UpNodeDictionary(bioinfo_dict)
    return pedigree, len(bioinfo_dict), sum(len(segments) for segments in ibd_dict.values())

def benchmark_medium():
    """Benchmark with medium dataset (100 individuals)."""
    bioinfo_dict, ibd_dict = generate_synthetic_data(100, 5)
    pedigree = UpNodeDictionary(bioinfo_dict)
    return pedigree, len(bioinfo_dict), sum(len(segments) for segments in ibd_dict.values())

# Uncomment to define the large benchmark - it might take a while
# def benchmark_large():
#     """Benchmark with large dataset (1000 individuals)."""
#     bioinfo_dict, ibd_dict = generate_synthetic_data(1000, 5)
#     pedigree = UpNodeDictionary(bioinfo_dict)
#     return pedigree, len(bioinfo_dict), sum(len(segments) for segments in ibd_dict.values())

# Run benchmarks
print("Running small benchmark...")
small_result = measure_memory_usage(benchmark_small)
print(f"Created pedigree with {small_result[1]} individuals and {small_result[2]} segments")

print("\nRunning medium benchmark...")
medium_result = measure_memory_usage(benchmark_medium)
print(f"Created pedigree with {medium_result[1]} individuals and {medium_result[2]} segments")

# Uncomment to run the large benchmark - it might take a while
# print("\nRunning large benchmark...")
# large_result = measure_memory_usage(benchmark_large)
# print(f"Created pedigree with {large_result[1]} individuals and {large_result[2]} segments")

### 5.2 Computational Optimizations

In addition to memory optimizations, Bonsai employs several strategies to reduce computational complexity:

1. **Parallel processing**: Independent likelihood calculations can be parallelized
2. **Caching**: Frequently accessed values are cached to avoid redundant computation
3. **Heuristic pruning**: The search space is pruned using heuristics to focus on promising pedigree structures
4. **Incremental likelihood calculation**: When making small changes to a pedigree, only affected likelihoods are recalculated
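
Because each pair's likelihood depends only on that pair's segments, the per-pair calculations can be distributed across workers. The sketch below illustrates the pattern with the standard library's `concurrent.futures`. The `score_pair` function is a hypothetical stand-in for a real likelihood calculation, and a thread pool is used only to keep the example portable. For CPU-bound Python code, a process pool is usually the better fit.

```python
from concurrent.futures import ThreadPoolExecutor

def score_pair(item):
    """Hypothetical stand-in for a per-pair log-likelihood calculation."""
    pair, segments = item
    total_cm = sum(seg['length_cm'] for seg in segments)
    return pair, -abs(total_cm - 100.0)  # toy score, not a real likelihood

def score_all_pairs(ibd_dict, max_workers=4):
    """Score every pair independently and collect the results in a dict."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and yields (pair, score) tuples
        return dict(pool.map(score_pair, ibd_dict.items()))

toy_ibd = {
    (1, 2): [{'length_cm': 60.0}, {'length_cm': 30.0}],
    (1, 3): [{'length_cm': 120.0}],
}
pair_scores = score_all_pairs(toy_ibd)
total_score = sum(pair_scores.values())
print(pair_scores, total_score)
```

The per-pair results are combined only at the end, so no shared state needs locking while the workers run.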

Let's implement a simple example of caching and incremental likelihood calculation:

In [ ]:
class OptimizedRelationshipModel(RelationshipModel):
    """Optimized version of the RelationshipModel with caching."""
    
    def __init__(self):
        """Initialize relationship model with caching."""
        super().__init__()
        self.likelihood_cache = {}  # Cache for likelihood calculations
    
    def calculate_relationship_likelihood(self, pair, segments, relationship_type):
        """Calculate likelihood of a specific relationship with caching.
        
        Args:
            pair: Tuple of individual IDs
            segments: List of IBD segments between the pair
            relationship_type: String indicating relationship type
            
        Returns:
            Log likelihood of the relationship
        """
        # Summarize the observed sharing once; these values feed both the cache key
        # and the likelihood calculation
        total_cm = sum(seg['length_cm'] for seg in segments)
        num_segments = len(segments)
        
        # Create a cache key
        cache_key = (pair, relationship_type, num_segments, total_cm)
        
        # Check if result is in cache
        if cache_key in self.likelihood_cache:
            return self.likelihood_cache[cache_key]
        
        # Calculate likelihood (same as in base class)
        if relationship_type not in self.expected_sharing:
            raise ValueError(f"Unknown relationship type: {relationship_type}")
        
        # Get expected sharing for this relationship
        expected_cm, expected_segments = self.expected_sharing[relationship_type]
        
        # Calculate standard deviations
        sd_cm = expected_cm * self.sharing_sd['total_cm'] if expected_cm > 0 else 1
        sd_segments = expected_segments * self.sharing_sd['num_segments'] if expected_segments > 0 else 1
        
        # Calculate log likelihood using normal distribution
        from scipy.stats import norm
        
        # Log likelihood for total cM
        ll_cm = norm.logpdf(total_cm, expected_cm, sd_cm) if expected_cm > 0 else 0
        
        # Log likelihood for number of segments
        ll_segments = norm.logpdf(num_segments, expected_segments, sd_segments) if expected_segments > 0 else 0
        
        # Combined log likelihood
        log_likelihood = ll_cm + ll_segments
        
        # Store in cache
        self.likelihood_cache[cache_key] = log_likelihood
        
        return log_likelihood

class OptimizedPedigreeLikelihood:
    """Calculates pedigree likelihoods with optimizations for incremental updates."""
    
    def __init__(self, up_dict, ibd_dict, relationship_model):
        """Initialize with a pedigree and relationship model.
        
        Args:
            up_dict: UpNodeDictionary object representing the pedigree (its `up_dict` attribute is read)
            ibd_dict: Dictionary of IBD segments between pairs
            relationship_model: Relationship model for likelihood calculations
        """
        self.up_dict = up_dict
        self.ibd_dict = ibd_dict
        self.relationship_model = relationship_model
        self.pair_likelihoods = {}
        
        # Calculate initial likelihoods for all pairs
        self.calculate_all_likelihoods()
    
    def calculate_all_likelihoods(self):
        """Calculate likelihoods for all pairs."""
        self.pair_likelihoods = {}
        
        # Every pair counts as "affected" on the first pass, so reuse the
        # incremental update path instead of duplicating its logic
        self.update_likelihood_for_pairs(list(self.ibd_dict.keys()))
    
    def get_total_likelihood(self):
        """Get total log likelihood of the pedigree."""
        return sum(self.pair_likelihoods.values())
    
    def update_likelihood_for_pairs(self, affected_pairs):
        """Update likelihoods only for affected pairs.
        
        Args:
            affected_pairs: List of pairs whose relationships have changed
        """
        for pair in affected_pairs:
            if pair in self.ibd_dict:
                id1, id2 = pair
                segments = self.ibd_dict[pair]
                relationship_type = infer_relationship_from_pedigree(self.up_dict.up_dict, id1, id2)
                
                if relationship_type is not None:
                    try:
                        log_likelihood = self.relationship_model.calculate_relationship_likelihood(
                            pair, segments, relationship_type
                        )
                        self.pair_likelihoods[pair] = log_likelihood
                    except ValueError:
                        self.pair_likelihoods[pair] = float('-inf')
                else:
                    # Default to unrelated if no relationship is inferred
                    try:
                        log_likelihood = self.relationship_model.calculate_relationship_likelihood(
                            pair, segments, 'unrelated'
                        )
                        self.pair_likelihoods[pair] = log_likelihood
                    except ValueError:
                        self.pair_likelihoods[pair] = float('-inf')
    
    def simulate_pedigree_change(self, test_up_dict, affected_pairs):
        """Simulate a change to the pedigree and calculate new likelihood.
        
        Args:
            test_up_dict: Test UpNodeDictionary object with the proposed change
            affected_pairs: List of pairs whose relationships would be affected
            
        Returns:
            Log likelihood of the test pedigree
        """
        # Save current state
        original_up_dict = self.up_dict
        original_likelihoods = self.pair_likelihoods.copy()
        
        # Apply test changes
        self.up_dict = test_up_dict
        self.update_likelihood_for_pairs(affected_pairs)
        
        # Calculate new likelihood
        new_likelihood = self.get_total_likelihood()
        
        # Restore original state
        self.up_dict = original_up_dict
        self.pair_likelihoods = original_likelihoods
        
        return new_likelihood

# Create an optimized relationship model
optimized_model = OptimizedRelationshipModel()

# Calculate relationships with and without optimization to compare
print("Calculating relationship likelihoods without optimization...")
start_time = time.time()
for pair, segments in ibd_dict.items():
    for relationship in optimized_model.expected_sharing.keys():
        relationship_model.calculate_relationship_likelihood(pair, segments, relationship)
end_time = time.time()
print(f"Time without optimization: {end_time - start_time:.4f} seconds")

print("\nCalculating relationship likelihoods with optimization (first run)...")
start_time = time.time()
for pair, segments in ibd_dict.items():
    for relationship in optimized_model.expected_sharing.keys():
        optimized_model.calculate_relationship_likelihood(pair, segments, relationship)
end_time = time.time()
print(f"Time with optimization (first run): {end_time - start_time:.4f} seconds")

print("\nCalculating relationship likelihoods with optimization (second run)...")
start_time = time.time()
for pair, segments in ibd_dict.items():
    for relationship in optimized_model.expected_sharing.keys():
        optimized_model.calculate_relationship_likelihood(pair, segments, relationship)
end_time = time.time()
print(f"Time with optimization (second run): {end_time - start_time:.4f} seconds")
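
The timing comparison above exercises only the cache. The incremental idea behind `OptimizedPedigreeLikelihood` can be illustrated with a toy sketch in which each pair contributes an independent score and a change refreshes only the affected pairs. The scoring table and the pair assignments below are hypothetical placeholders, and the sketch keeps a running total rather than re-summing all pairs.

```python
def pair_score(pair, assignment):
    """Hypothetical per-pair score; stands in for a relationship log likelihood."""
    scores = {'unrelated': -5.0, 'half-siblings': -2.0, 'full-siblings': -1.0}
    return scores[assignment[pair]]

def full_total(assignment):
    """Recompute the total from scratch (the expensive path)."""
    return sum(pair_score(pair, assignment) for pair in assignment)

assignment = {(1, 2): 'unrelated', (1, 3): 'unrelated', (2, 3): 'unrelated'}
pair_scores = {pair: pair_score(pair, assignment) for pair in assignment}
total = sum(pair_scores.values())

# Change one relationship, then refresh only the pair that was affected
assignment[(1, 2)] = 'full-siblings'
for pair in [(1, 2)]:
    new_score = pair_score(pair, assignment)
    total += new_score - pair_scores[pair]
    pair_scores[pair] = new_score

# The incremental total matches a full recomputation
assert total == full_total(assignment)
print(total)
```

The saving grows with pedigree size: a single move typically affects a small number of pairs, while a full recomputation touches every pair with IBD data.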

## 6. Extension Mechanisms

Bonsai's modular architecture allows for easy extension and customization. Here are some of the key extension points:

### 6.1 Custom Relationship Models

One of the most common extensions is to create custom relationship models for specific populations or research questions. For example, you might want to create a model that accounts for endogamy (intermarriage within a small community) or population-specific recombination rates.

Here's an example of how to create a custom relationship model for a population with higher endogamy:

In [ ]:
class EndogamousPopulationModel(RelationshipModel):
    """Relationship model for a population with higher endogamy.
    
    In endogamous populations, individuals share more IBD segments than expected
    because of multiple distant relationships.
    """
    
    def __init__(self, endogamy_factor=1.5):
        """Initialize model with endogamy factor.
        
        Args:
            endogamy_factor: Factor by which to increase expected IBD sharing
        """
        super().__init__()
        self.endogamy_factor = endogamy_factor
        
        # Adjust expected sharing for endogamy
        # In endogamous populations, we expect more IBD sharing.
        # Iterate over a snapshot because the dictionary is updated inside the loop
        for relationship, (expected_cm, expected_segments) in list(self.expected_sharing.items()):
            if relationship not in ('parent-child', 'full-siblings'):
                # This simplified model leaves parent-child and full-sibling expectations unchanged
                self.expected_sharing[relationship] = (
                    expected_cm * endogamy_factor,
                    expected_segments * endogamy_factor
                )
    
    def calculate_relationship_likelihood(self, pair, segments, relationship_type):
        """Calculate likelihood of a specific relationship in an endogamous population.
        
        Args:
            pair: Tuple of individual IDs
            segments: List of IBD segments between the pair
            relationship_type: String indicating relationship type
            
        Returns:
            Log likelihood of the relationship
        """
        # Calculate likelihood as in base class
        base_likelihood = super().calculate_relationship_likelihood(pair, segments, relationship_type)
        
        # For distant relationships, add a small constant bonus to account for
        # potential multiple distant relationships (background IBD sharing)
        if relationship_type in ['unrelated', 'second-cousins', 'first-cousins']:
            return base_likelihood + 0.1 * np.log(self.endogamy_factor)
        
        # For close relationships, use standard calculation
        return base_likelihood

class AshkenaziJewishModel(EndogamousPopulationModel):
    """Relationship model specifically for Ashkenazi Jewish populations."""
    
    def __init__(self):
        """Initialize with Ashkenazi-specific parameters."""
        # Higher endogamy factor for Ashkenazi populations
        super().__init__(endogamy_factor=2.0)
        
        # Adjust expected sharing based on Ashkenazi-specific studies
        self.expected_sharing.update({
            'first-cousins': (950, 17),  # Higher than standard
            'second-cousins': (275, 6),  # Higher than standard
            'unrelated': (30, 1)  # Background IBD sharing in Ashkenazi populations
        })

# Test the endogamous population model
endogamous_model = EndogamousPopulationModel(endogamy_factor=1.5)
ashkenazi_model = AshkenaziJewishModel()

# Compare expected sharing across models
print("Expected sharing by relationship type:")
print("Standard Model:")
for relationship, (expected_cm, expected_segments) in relationship_model.expected_sharing.items():
    print(f"  {relationship}: {expected_cm:.1f} cM, {expected_segments:.1f} segments")

print("\nEndogamous Model (factor=1.5):")
for relationship, (expected_cm, expected_segments) in endogamous_model.expected_sharing.items():
    print(f"  {relationship}: {expected_cm:.1f} cM, {expected_segments:.1f} segments")

print("\nAshkenazi Jewish Model:")
for relationship, (expected_cm, expected_segments) in ashkenazi_model.expected_sharing.items():
    print(f"  {relationship}: {expected_cm:.1f} cM, {expected_segments:.1f} segments")
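
To get a feel for how large the endogamy adjustment is, the standalone sketch below evaluates the bonus term `0.1 * np.log(endogamy_factor)` used in `calculate_relationship_likelihood` for a few factors. It relies only on that formula; the helper name `endogamy_bonus` and its `weight` parameter are hypothetical and exist only for this illustration.

```python
import math

def endogamy_bonus(endogamy_factor, weight=0.1):
    """Log-likelihood bonus for distant relationships (hypothetical helper)."""
    return weight * math.log(endogamy_factor)

for factor in (1.0, 1.5, 2.0):
    bonus = endogamy_bonus(factor)
    # exp(bonus) is the equivalent multiplier on the (non-log) likelihood scale
    print(f"factor={factor:.1f}: bonus={bonus:.4f}, likelihood multiplier={math.exp(bonus):.3f}")
```

With a factor of 2.0 the likelihood is multiplied by 2^0.1, roughly 1.07, so the adjustment is modest. Because the same bonus is added for all three distant relationship types, it favors them over close relationships without changing how they rank against each other.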

### 6.2 Integration with External Tools

Bonsai can be integrated with external tools for IBD detection, visualization, and analysis. The example below passes a pedigree to Graphviz (through the `pygraphviz` bindings) and renders it as an image file:

In [ ]:
def convert_pedigree_to_graphviz(up_dict):
    """Convert a pedigree to a Graphviz graph that can be rendered to an image.
    
    Args:
        up_dict: Up-Node Dictionary mapping each individual ID to a dict with
            optional 'sex', 'age', 'father', and 'mother' entries
        
    Returns:
        pygraphviz.AGraph containing the pedigree
    """
    # Create a new Graphviz graph
    G = pgv.AGraph(strict=False, directed=True)
    
    # Set graph attributes for better visualization
    G.graph_attr['rankdir'] = 'TB'  # Top to bottom: edges run parent -> child, so parents sit above children
    G.graph_attr['splines'] = 'ortho'  # Orthogonal lines
    G.graph_attr['nodesep'] = '0.5'
    G.graph_attr['ranksep'] = '0.5'
    
    # Add nodes
    for individual_id, info in up_dict.items():
        # Node attributes based on sex; unknown or missing sex gets a neutral style
        shape, fillcolor = {
            'M': ('box', 'lightblue'),
            'F': ('ellipse', 'pink'),
        }.get(info.get('sex'), ('diamond', 'lightgray'))
        node_attrs = {
            'shape': shape,
            'style': 'filled',
            'fillcolor': fillcolor,
            'width': '1.0',
            'height': '0.6',
            'fontsize': '10'
        }
        
        # Add age to label if available
        label = str(individual_id)
        if info.get('age') is not None:
            label += f" ({info.get('age')})"
        
        node_attrs['label'] = label
        
        # Add the node
        G.add_node(individual_id, **node_attrs)
    
    # Add edges (parent-child relationships)
    for individual_id, info in up_dict.items():
        father = info.get('father')
        mother = info.get('mother')
        
        if father is not None:
            G.add_edge(father, individual_id, color='blue')
        
        if mother is not None:
            G.add_edge(mother, individual_id, color='red')
    
    return G

def save_pedigree_visualization(up_dict, output_path, format='png'):
    """Save a visualization of the pedigree to a file.
    
    Args:
        up_dict: Up-Node Dictionary representing the pedigree
        output_path: Path to save the visualization
        format: Output format (png, pdf, svg, etc.)
    """
    # Convert pedigree to Graphviz
    G = convert_pedigree_to_graphviz(up_dict)
    
    # Save to file
    G.draw(output_path, prog='dot', format=format)
    
    print(f"Pedigree visualization saved to {output_path}")
    
    # Show raster output inline; vector formats such as pdf and svg are only saved to disk
    if format in ['png', 'jpg', 'jpeg']:
        plt.figure(figsize=(12, 8))
        img = mpimg.imread(output_path)
        plt.imshow(img)
        plt.axis('off')
        plt.title("Pedigree Visualization")
        plt.show()

# Create a visualization of our pedigree
output_path = os.path.join(results_directory, "pedigree_visualization.png")
save_pedigree_visualization(up_node_dict.up_dict, output_path)

print("\nBonsai can integrate with various tools for:")
print("  1. IBD Detection: IBIS, Refined-IBD, hap-IBD, GERMLINE, etc.")
print("  2. Visualization: Graphviz, D3.js, Cytoscape.js, etc.")
print("  3. Analysis: R, Python pandas, NetworkX, etc.")
print("  4. Data Management: PostgreSQL, MongoDB, etc.")
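
The analysis side of this integration can be sketched with NetworkX. The example below assumes the same layout that `convert_pedigree_to_graphviz` reads (each individual maps to a dict with optional `'sex'`, `'father'`, and `'mother'` entries). The function name `pedigree_to_networkx` and the toy IDs are made up for illustration.

```python
import networkx as nx

def pedigree_to_networkx(up_dict):
    """Build a directed parent -> child graph from a father/mother style dict."""
    graph = nx.DiGraph()
    for individual_id, info in up_dict.items():
        graph.add_node(individual_id, sex=info.get('sex'))
        for role in ('father', 'mother'):
            parent = info.get(role)
            if parent is not None:
                graph.add_edge(parent, individual_id, role=role)
    return graph

# Toy pedigree (made-up IDs): paternal grandparents, a parent pair, one child
toy_up_dict = {
    'gf': {'sex': 'M'},
    'gm': {'sex': 'F'},
    'dad': {'sex': 'M', 'father': 'gf', 'mother': 'gm'},
    'mom': {'sex': 'F'},
    'kid': {'sex': 'F', 'father': 'dad', 'mother': 'mom'},
}

toy_graph = pedigree_to_networkx(toy_up_dict)
founders = sorted(n for n, degree in toy_graph.in_degree() if degree == 0)

print("Is a DAG:", nx.is_directed_acyclic_graph(toy_graph))
print("Ancestors of kid:", sorted(nx.ancestors(toy_graph, 'kid')))
print("Founders:", founders)
```

Once the pedigree is a `DiGraph`, standard graph queries such as acyclicity checks, ancestor sets, and founder detection come for free, which is often simpler than writing custom traversals over the dictionary.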

## 7. Exercises

Now that we've explored the architecture and implementation of Bonsai, let's complete some exercises to deepen our understanding.

### Exercise 1: Implement a New Constraint

Create a new constraint that rejects pedigrees in which any individual has more than a configurable number of children (e.g., at most 10). Implement it as a method of the `ConstraintHandler` class.

In [ ]:
# Exercise 1: Your solution here
# Hint: Implement a method to count children and check against a threshold
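
One possible starting point for the counting step is sketched below. It is a standalone function, not the `ConstraintHandler` method the exercise asks for, and it assumes a dict where each individual maps to optional `'father'` and `'mother'` entries. The function name and the toy data are hypothetical.

```python
from collections import Counter

def find_individuals_exceeding_max_children(up_dict, max_children=10):
    """Return {parent_id: child_count} for parents with more than max_children."""
    child_counts = Counter()
    for info in up_dict.values():
        # A set avoids double counting if father and mother carry the same ID
        parents = {info.get('father'), info.get('mother')} - {None}
        for parent in parents:
            child_counts[parent] += 1
    return {parent: count for parent, count in child_counts.items() if count > max_children}

# Toy data: 'p' has three children, 'q' has one
toy_up_dict = {
    'p': {},
    'q': {},
    'c1': {'father': 'p', 'mother': 'q'},
    'c2': {'father': 'p'},
    'c3': {'father': 'p'},
}

print(find_individuals_exceeding_max_children(toy_up_dict, max_children=2))
```

An empty result means the constraint is satisfied. Wiring this check into `ConstraintHandler` and deciding how a violation should affect the search remain part of the exercise.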

### Exercise 2: Create a Custom Relationship Model

Create a custom relationship model for a specific population or research question (e.g., a population with high consanguinity, or a model that takes into account the length distribution of IBD segments).

In [ ]:
# Exercise 2: Your solution here
# Hint: Extend the RelationshipModel class with customized parameters

### Exercise 3: Implement Additional Optimization Strategies

Implement an additional optimization strategy for Bonsai, such as parallel processing, pruning the search space using heuristics, or improving caching.

In [ ]:
# Exercise 3: Your solution here
# Hint: Consider using Python's multiprocessing package for parallel processing
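
For the caching option, a minimal sketch using the standard library's `functools.lru_cache` is shown below. The likelihood function is a toy Gaussian stand-in for an expensive model call, and the expected values and the 25% spread are placeholders chosen for illustration, not parameters taken from Bonsai.

```python
import math
from functools import lru_cache

# Placeholder expected total sharing in cM (illustrative numbers only)
EXPECTED_CM = {'first-cousins': 850.0, 'second-cousins': 212.5}

@lru_cache(maxsize=None)
def cached_log_likelihood(total_cm, relationship_type):
    """Toy Gaussian log-likelihood; stands in for an expensive model evaluation."""
    mean = EXPECTED_CM[relationship_type]
    sd = 0.25 * mean  # assumed spread, not a fitted value
    z = (total_cm - mean) / sd
    return -0.5 * z ** 2 - math.log(sd * math.sqrt(2 * math.pi))

first = cached_log_likelihood(800.0, 'first-cousins')   # computed
second = cached_log_likelihood(800.0, 'first-cousins')  # served from the cache
print(cached_log_likelihood.cache_info())
```

`lru_cache` requires hashable arguments, which is why the sketch passes a precomputed total rather than a list of segments. A segment list would have to be summarized or converted to a tuple before it can serve as a cache key.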

### Exercise 4: Extend the Up-Node Dictionary

Extend the Up-Node Dictionary to include additional information, such as:
- Confidence scores for inferred relationships
- Alternative relationship hypotheses
- Separate tracking of half-identical regions (HIRs, i.e. IBD1) and fully identical regions (FIRs, i.e. IBD2)

In [ ]:
# Exercise 4: Your solution here
# Hint: Modify the UpNodeDictionary class to store additional information

### Exercise 5: Integration with External Tools

Design and implement a function to export Bonsai pedigrees to a standard format (e.g., GEDCOM, GraphML) that can be imported into other genealogy software.

In [ ]:
# Exercise 5: Your solution here
# Hint: Create functions to convert the Up-Node Dictionary to GEDCOM or GraphML format
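
As a starting point for the GEDCOM route, the sketch below emits a minimal subset of GEDCOM 5.5.1 record types (`INDI`, `FAM`, and the tags linking them). It assumes each individual maps to optional `'sex'`, `'father'`, and `'mother'` entries, and the function name is hypothetical. A real export would need a fuller header, properly formatted names, and validation against the target software.

```python
from collections import defaultdict

def up_dict_to_gedcom_lines(up_dict):
    """Return GEDCOM-style lines for a father/mother style dict (hypothetical helper)."""
    # Give every person an @I<n>@ cross-reference, including parents that are not keys
    indi_xref = {}
    for person in up_dict:
        indi_xref.setdefault(person, f"@I{len(indi_xref) + 1}@")
    for info in up_dict.values():
        for parent in (info.get('father'), info.get('mother')):
            if parent is not None:
                indi_xref.setdefault(parent, f"@I{len(indi_xref) + 1}@")

    # Group children by (father, mother); each couple becomes one FAM record
    families = defaultdict(list)
    for child, info in up_dict.items():
        couple = (info.get('father'), info.get('mother'))
        if couple != (None, None):
            families[couple].append(child)
    fam_xref = {couple: f"@F{i}@" for i, couple in enumerate(families, start=1)}

    lines = ["0 HEAD", "1 GEDC", "2 VERS 5.5.1", "2 FORM LINEAGE-LINKED", "1 CHAR UTF-8"]
    for person, xref in indi_xref.items():
        info = up_dict.get(person, {})
        lines.append(f"0 {xref} INDI")
        lines.append(f"1 NAME {person}")
        if info.get('sex') in ('M', 'F'):
            lines.append(f"1 SEX {info['sex']}")
        # FAMC links a child to its parents' family; FAMS links a parent to their own
        couple = (info.get('father'), info.get('mother'))
        if couple in fam_xref:
            lines.append(f"1 FAMC {fam_xref[couple]}")
        for spouse_couple, fam in fam_xref.items():
            if person in spouse_couple:
                lines.append(f"1 FAMS {fam}")
    for (father, mother), children in families.items():
        lines.append(f"0 {fam_xref[(father, mother)]} FAM")
        if father is not None:
            lines.append(f"1 HUSB {indi_xref[father]}")
        if mother is not None:
            lines.append(f"1 WIFE {indi_xref[mother]}")
        lines.extend(f"1 CHIL {indi_xref[child]}" for child in children)
    lines.append("0 TRLR")
    return lines

toy_up_dict = {
    'dad': {'sex': 'M'},
    'mom': {'sex': 'F'},
    'kid': {'sex': 'F', 'father': 'dad', 'mother': 'mom'},
}
gedcom_lines = up_dict_to_gedcom_lines(toy_up_dict)
print("\n".join(gedcom_lines))
```

GEDCOM models a pedigree through family records rather than direct parent pointers, so the main design step is grouping children by parent pair before writing anything out.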

## Conclusion

In this lab, we've explored the architectural design and implementation details of the Bonsai algorithm. We've examined the core components of Bonsai, including the Input Processing Module, Relationship Model, Up-Node Dictionary, Optimization Engine, and Constraint Handler. We've also implemented simplified versions of these components to understand how they work together to build pedigree structures from genetic data.

Key takeaways:
- Bonsai follows a modular architecture that separates concerns and allows for easy extension
- The Up-Node Dictionary is the core data structure that represents the pedigree
- The Optimization Engine searches for the optimal pedigree structure
- The Constraint Handler ensures that the pedigree respects biological and logical constraints
- Performance optimization is crucial for handling large-scale pedigree reconstruction
- Bonsai can be customized and extended to address specific research questions

In the next lab, we'll explore advanced topics in pedigree reconstruction, including dealing with missing data, handling complex relationships, and evaluating pedigree quality.