# Lab 14: Optimizing Small Pedigree Configurations

## Overview

In this lab, we'll explore techniques for optimizing small pedigree configurations in Bonsai v3. Building on our understanding of small pedigree structures from Lab 13, we'll focus on methods to evaluate and optimize these structures to best explain observed genetic data. We'll examine how Bonsai systematically explores alternative configurations to find the most likely pedigree that explains IBD sharing patterns.

In [ ]:
# Standard imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown
import inspect
import importlib
import copy
import random
import math
import itertools
from collections import defaultdict

sys.path.append(os.path.dirname(os.getcwd()))

# Cross-compatibility setup
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite, save_results, save_plot

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [ ]:
# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [ ]:
# Helper functions for exploring modules
def display_module_classes(module_name):
    """Display classes and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all classes
        classes = inspect.getmembers(module, inspect.isclass)
        
        # Filter classes defined in this module (not imported)
        classes = [(name, cls) for name, cls in classes if cls.__module__ == module_name]
        
        # Print info for each class
        for name, cls in classes:
            print(f"\n## {name}")
            
            # Get docstring
            doc = inspect.getdoc(cls)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
            
            # Get methods
            methods = inspect.getmembers(cls, inspect.isfunction)
            if methods:
                print("\nMethods:")
                for method_name, method in methods:
                    if not method_name.startswith('_'):  # Skip private methods
                        print(f"- {method_name}")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def display_module_functions(module_name):
    """Display functions and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all functions
        functions = inspect.getmembers(module, inspect.isfunction)
        
        # Filter functions defined in this module (not imported)
        functions = [(name, func) for name, func in functions if func.__module__ == module_name]
        
        # Print info for each function
        for name, func in functions:
            if name.startswith('_'):  # Skip private functions
                continue
                
            print(f"\n## {name}")
            
            # Get signature
            sig = inspect.signature(func)
            print(f"Signature: {name}{sig}")
            
            # Get docstring
            doc = inspect.getdoc(func)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def view_source(obj):
    """Display the source code of an object (function or class)"""
    try:
        source = inspect.getsource(obj)
        display(Markdown(f"```python\n{source}\n```"))
    except Exception as e:
        print(f"Error retrieving source: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [ ]:
try:
    from utils.bonsaitree.bonsaitree import v3
    print("✅ Successfully imported Bonsai v3 module")
except ImportError as e:
    print(f"❌ Failed to import Bonsai v3 module: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")
    print("Make sure you've properly set up your environment with the Bonsai repository.")

## Lab 14: Optimizing Small Pedigree Configurations

In this lab, we'll delve into the techniques Bonsai v3 uses to optimize small pedigree configurations. This optimization process is crucial for finding the most likely pedigree structure that explains observed genetic data. We'll explore:

1. How to evaluate different pedigree configurations based on genetic data
2. Methods to systematically search through the space of possible configurations
3. Algorithms for finding the optimal configuration that maximizes likelihood
4. Techniques for handling ambiguous or competing hypotheses

This process builds on our previous understanding of constructing small pedigree structures and will serve as a foundation for scaling to larger pedigrees in future labs.
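
As a rough illustration of the first three items above, the sketch below scores a few candidate configurations and keeps the highest-scoring one. The function `toy_score`, the candidate names, and the `observed_pairs` list are hypothetical placeholders rather than part of Bonsai v3; a realistic score would be a log-likelihood computed from IBD data.

```python
def toy_score(up_node_dict, observed_pairs):
    """Hypothetical score: +1 for each observed related pair that the candidate
    explains as parent-child or as sharing a parent, -1 otherwise."""
    score = 0
    for i, j in observed_pairs:
        parents_i = set(up_node_dict.get(i, {}))
        parents_j = set(up_node_dict.get(j, {}))
        related = bool(parents_i & parents_j) or i in parents_j or j in parents_i
        score += 1 if related else -1
    return score

# Candidate configurations as up-node dictionaries (child -> {parent: 1})
candidates = {
    "unrelated": {1: {}, 2: {}, 3: {}},
    "parent_child": {1: {}, 2: {1: 1}, 3: {1: 1}},
    "siblings_only": {1: {-1: 1, -2: 1}, 2: {-1: 1, -2: 1}, 3: {}},
}

# Pairs assumed (for this toy example) to share detectable IBD
observed_pairs = [(1, 2), (1, 3), (2, 3)]

scores = {name: toy_score(ped, observed_pairs) for name, ped in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores)
print("Best candidate:", best_name)
```

Exhaustive scoring like this is only practical for a handful of candidates, which is one reason the search strategy matters as pedigrees grow.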

## Part 1: Evaluating Pedigree Configurations

The first step in optimizing pedigree configurations is being able to evaluate how well different configurations explain the observed genetic data. Let's explore how Bonsai v3 approaches this evaluation process.
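
The cells in this part represent a pedigree as an up-node dictionary that maps each individual to a dictionary of their parents, with negative IDs used for ungenotyped individuals. A minimal sketch of that structure, using a made-up two-child family, might look like this:

```python
# Up-node dictionary: child -> {parent: 1}; founders map to an empty dict.
# Positive IDs are genotyped individuals, negative IDs are ungenotyped.
up_node_dict = {
    1: {-1: 1, -2: 1},
    2: {-1: 1, -2: 1},
    -1: {},
    -2: {},
}

# Collect every ID that appears as a key or as a parent
all_ids = set(up_node_dict)
for parents in up_node_dict.values():
    all_ids.update(parents)

# Founders have no recorded parents
founders = sorted(i for i in all_ids if not up_node_dict.get(i))
genotyped = sorted(i for i in all_ids if i > 0)

print("Founders:", founders)
print("Genotyped:", genotyped)
```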

In [ ]:
# Import key functions from Bonsai v3 (local execution only; in JupyterLite the
# simplified fallbacks defined at the end of this cell are used instead)
if not is_jupyterlite():
    from utils.bonsaitree.bonsaitree.v3.pedigrees import (
        get_possible_connection_point_set,
        get_partner_id_set,
        add_parent,
        get_min_id,
        reverse_node_dict,
        get_simple_rel_tuple,
        get_unconnected_indivs_in_pedigree,
        get_descendants,
        get_ancestors
    )

    from utils.bonsaitree.bonsaitree.v3.connections import (
        get_likelihood_of_relationship
    )

# For visualization
def visualize_pedigree(up_node_dict, title="Pedigree", highlight_nodes=None, individual_metadata=None):
    """Visualize a pedigree from an up_node_dict using networkx.
    
    Args:
        up_node_dict: Dictionary mapping individuals to their parents
        title: Title for the visualization
        highlight_nodes: Set of nodes to highlight
        individual_metadata: Dictionary mapping individuals to their metadata (age, sex, etc.)
    """
    # Create a directed graph (edges point from child to parent)
    G = nx.DiGraph()
    
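    # Collect all IDs from keys and values, then add them as nodes so that
    # isolated individuals are drawn and node order matches the color map below
    all_ids = set(up_node_dict.keys())
    for parents in up_node_dict.values():
        all_ids.update(parents.keys())
    G.add_nodes_from(all_ids)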
    
    # Create node labels
    node_labels = {}
    for node_id in all_ids:
        label = str(node_id)
        if individual_metadata and node_id in individual_metadata:
            metadata = individual_metadata[node_id]
            if 'sex' in metadata and metadata['sex']:
                label += f" ({metadata['sex']})"
            if 'age' in metadata and metadata['age'] is not None:
                label += f"\nAge: {metadata['age']}"
        node_labels[node_id] = label
    
    # Create a color map - blue for males, pink for females, gray for unknown
    highlight_nodes = highlight_nodes or set()
    color_map = []
    for node_id in all_ids:
        if node_id in highlight_nodes:
            color_map.append('red')
        elif individual_metadata and node_id in individual_metadata and 'sex' in individual_metadata[node_id]:
            if individual_metadata[node_id]['sex'] == 'M':
                color_map.append('lightblue')
            elif individual_metadata[node_id]['sex'] == 'F':
                color_map.append('pink')
            else:
                color_map.append('lightgray')
        else:
            color_map.append('lightgray')
    
    # Add edges (from child to parent)
    edges = []
    for child, parents in up_node_dict.items():
        for parent in parents:
            edges.append((child, parent))
    
    G.add_edges_from(edges)
    
    # Create plot
    plt.figure(figsize=(10, 6))
    plt.title(title)
    
    # Layout: force-directed, so parents are not guaranteed to appear above children.
    # A fixed seed keeps the layout reproducible.
    pos = nx.spring_layout(G, seed=42)
    
    # Draw nodes
    nx.draw(G, pos, with_labels=True, labels=node_labels, node_color=color_map, 
            node_size=800, font_weight='bold')
    
    # Draw edges
    nx.draw_networkx_edges(G, pos, width=1.0, alpha=0.5, arrows=True)
    
    plt.tight_layout()
    plt.show()

# For JupyterLite compatibility, provide simplified implementations
if is_jupyterlite():
    def reverse_node_dict(dct):
        """Reverse a node dict. If it's a down dict make it an up dict and vice versa."""
        rev_dct = {}
        for i, info in dct.items():
            for a, d in info.items():
                if a not in rev_dct:
                    rev_dct[a] = {}
                rev_dct[a][i] = d
        return rev_dct
    
    def get_min_id(dct):
        """Get the minimal ID in a node dict."""
        all_ids = set(dct.keys())
        for parents in dct.values():
            all_ids.update(parents.keys())
        min_id = min(all_ids) if all_ids else 0
        return min(-1, min_id)  # ensure ID is negative
    
    def add_parent(node, up_dct, min_id=None):
        """Add an ungenotyped parent to node in up_dct."""
        import copy
        up_dct = copy.deepcopy(up_dct)
        
        if node not in up_dct:
            raise ValueError(f"Node {node} is not in up dct.")
            
        pid_dict = up_dct[node]
        if len(pid_dict) >= 2:
            return up_dct, None
            
        if min_id is None:
            min_id = get_min_id(up_dct)
            
        new_pid = min_id - 1
        up_dct[node][new_pid] = 1
        up_dct[new_pid] = {}
        
        return up_dct, new_pid
    
    def get_partner_id_set(node, up_dct):
        """Find the set of partners of node in pedigree up_dct."""
        down_dct = reverse_node_dict(up_dct)
        child_id_set = {c for c, d in down_dct.get(node, {}).items() if d == 1}
        partner_id_set = set()
        for cid in child_id_set:
            pids = {p for p, d in up_dct.get(cid, {}).items() if d == 1}
            partner_id_set |= pids
        partner_id_set -= {node}
        return partner_id_set
    
    def get_simple_rel_tuple(up_node_dict, i, j):
        """Get relationship tuple (up, down, num_ancs) between individuals i and j."""
        if i == j:
            return (0, 0, 2)
        
        # Simple implementation for JupyterLite - this would be more complex in reality
        if j in up_node_dict.get(i, {}):
            return (1, 0, 1)  # i is child of j
        elif i in up_node_dict.get(j, {}):
            return (0, 1, 1)  # i is parent of j
        
        # Check for siblings/cousins (simplified)
        i_parents = set(up_node_dict.get(i, {}).keys())
        j_parents = set(up_node_dict.get(j, {}).keys())
        common_parents = i_parents.intersection(j_parents)
        
        if common_parents:
            if len(common_parents) == 2:
                return (1, 1, 2)  # Full siblings
            else:
                return (1, 1, 1)  # Half siblings
        
        # Default - no relationship found
        return None
    
    def get_possible_connection_point_set(ped):
        """Find all possible points through which a pedigree can be connected to another pedigree."""
        point_set = set()
        all_ids = set(ped.keys())
        for parents in ped.values():
            all_ids.update(parents.keys())
            
        for a in all_ids:
            parent_to_deg = ped.get(a, {})
            if len(parent_to_deg) < 2:
                point_set.add((a, None, 1))  # Can connect upward
                
            partners = get_partner_id_set(a, ped)
            point_set.add((a, None, 0))  # Can connect downward
            for partner in partners:
                if (partner, a, 0) not in point_set:
                    point_set.add((a, partner, 0))
                point_set.add((a, partner, None))
                
            point_set.add((a, None, None))  # Can replace node
            
        return point_set
    
    def get_unconnected_indivs_in_pedigree(up_node_dict):
        """Get all individuals in the pedigree not connected to each other."""
        # Get all IDs in the pedigree
        all_ids = set(up_node_dict.keys())
        for parents in up_node_dict.values():
            all_ids.update(parents.keys())
        
        # For a simplified version, we'll just return individual nodes
        # This is not accurate for real pedigrees but works for JupyterLite demo
        result = []
        for id_val in all_ids:
            result.append([id_val])
        
        return result
    
    def get_descendants(node, up_node_dict):
        """Get all descendants of node in pedigree up_node_dict."""
        descendants = set()
        down_dict = reverse_node_dict(up_node_dict)
        
        # BFS to find all descendants
        queue = [node]
        while queue:
            current = queue.pop(0)
            for child in down_dict.get(current, {}):
                if child not in descendants:
                    descendants.add(child)
                    queue.append(child)
        
        return descendants
    
    def get_ancestors(node, up_node_dict):
        """Get all ancestors of node in pedigree up_node_dict."""
        ancestors = set()
        
        # BFS to find all ancestors
        queue = [node]
        while queue:
            current = queue.pop(0)
            for parent in up_node_dict.get(current, {}):
                if parent not in ancestors:
                    ancestors.add(parent)
                    queue.append(parent)
        
        return ancestors
    
    def get_likelihood_of_relationship(pedigree, i, j, ibd_data):
        """Calculate likelihood of relationship between i and j based on IBD data."""
        # Simple implementation for JupyterLite
        rel_tuple = get_simple_rel_tuple(pedigree, i, j)
        if rel_tuple is None:
            return -100  # Highly unlikely if no relationship is found
        
        up, down, num_ancs = rel_tuple
        degree = up + down
        
        # Simplified likelihood calculation based on degree
        # In reality, this would be much more complex
        if (i, j) in ibd_data or (j, i) in ibd_data:
            pair = (i, j) if (i, j) in ibd_data else (j, i)
            segments = ibd_data[pair]
            total_cm = sum(seg["length_cm"] for seg in segments)
            
            # Different relationships have different expected amounts of IBD
            if degree == 0:  # Self
                expected_total_cm = 3400
            elif degree == 1:  # Parent-child
                expected_total_cm = 3400 / 2
            elif degree == 2 and num_ancs == 2:  # Full siblings
                expected_total_cm = 2550
            elif degree == 2 and num_ancs == 1:  # Half siblings/grandparents
                expected_total_cm = 1700
            # For degree >= 3 this toy table ignores num_ancs
            elif degree == 3:  # e.g. great-grandparent, half-avuncular
                expected_total_cm = 850
            elif degree == 4:  # e.g. half first cousins
                expected_total_cm = 425
            elif degree == 5:  # e.g. half first cousins once removed
                expected_total_cm = 212.5
            elif degree == 6:  # e.g. half second cousins
                expected_total_cm = 106.25
            else:  # More distant
                expected_total_cm = 53.125
            
            # Calculate log-likelihood
            std_dev = expected_total_cm * 0.2  # 20% variation
            if std_dev > 0:
                log_likelihood = -0.5 * ((total_cm - expected_total_cm) / std_dev) ** 2 - math.log(std_dev)
                return log_likelihood
        
        # Default value for no IBD data
        return -10

### 1.1 Generating IBD Data

Before we can evaluate and optimize pedigree configurations, we need genetic data to work with. In real-world scenarios, this would be IBD (Identity by Descent) segments detected between individuals. Let's create a simulation function to generate IBD data for our examples:
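
The simulation in this lab stores IBD data as a dictionary keyed by pairs of IDs, where each value is a list of segment dictionaries with `chromosome`, `start_cm`, `end_cm`, and `length_cm` fields. The sketch below shows that layout with made-up numbers and a hypothetical helper, `total_shared_cm`, that looks up a pair regardless of the order of the two IDs.

```python
# Made-up example data in the same layout the simulation produces
ibd_data = {
    (1, 2): [
        {"chromosome": 1, "start_cm": 10.0, "end_cm": 95.0, "length_cm": 85.0},
        {"chromosome": 7, "start_cm": 0.0, "end_cm": 40.0, "length_cm": 40.0},
    ],
    (1, 3): [],  # pair was compared, but no segments were found
}

def total_shared_cm(ibd_data, i, j):
    """Hypothetical helper: total IBD length in cM for the pair (i, j).

    Checks both key orders and returns 0.0 when the pair is absent."""
    segments = ibd_data.get((i, j)) or ibd_data.get((j, i)) or []
    return float(sum(seg["length_cm"] for seg in segments))

print(total_shared_cm(ibd_data, 2, 1))
print(total_shared_cm(ibd_data, 1, 3))
print(total_shared_cm(ibd_data, 2, 3))
```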

In [ ]:
def simulate_ibd_segments(rel_tuple, num_segments=None, noise_level=0.2):
    """Simulate IBD segments for a given relationship tuple.
    
    Args:
        rel_tuple: (up, down, num_ancs) tuple representing the relationship
        num_segments: Number of segments to simulate. If None, the count is chosen
            so that the simulated total is close to the expected total
        noise_level: Level of noise to add to segment lengths
        
    Returns:
        segments: List of simulated IBD segments
    """
    if rel_tuple is None:
        return []  # No relationship, no IBD
    
    up, down, num_ancs = rel_tuple
    degree = up + down  # Total number of meioses separating the pair
    
    # Toy model: expected total IBD and average segment length per relationship.
    # For degree >= 3 the table ignores num_ancs.
    if degree == 0:  # Self
        expected_total_cm = 3400  # Entire genome
        avg_segment_cm = 340
    elif degree == 1:  # Parent-child
        expected_total_cm = 3400 / 2  # Half the genome
        avg_segment_cm = 170
    elif degree == 2 and num_ancs == 2:  # Full siblings
        expected_total_cm = 2550  # ~75% of the genome
        avg_segment_cm = 85
    elif degree == 2 and num_ancs == 1:  # Half siblings/grandparents
        expected_total_cm = 1700  # ~50% of the genome
        avg_segment_cm = 42.5
    elif degree == 3:  # e.g. great-grandparent, half-avuncular
        expected_total_cm = 850  # ~25% of the genome
        avg_segment_cm = 21.25
    elif degree == 4:  # e.g. half first cousins
        expected_total_cm = 425  # ~12.5% of the genome
        avg_segment_cm = 10.6
    elif degree == 5:  # e.g. half first cousins once removed
        expected_total_cm = 212.5  # ~6.25% of the genome
        avg_segment_cm = 5.3
    elif degree == 6:  # e.g. half second cousins
        expected_total_cm = 106.25  # ~3.125% of the genome
        avg_segment_cm = 5.3 / 2
    else:  # More distant
        expected_total_cm = 53.125  # ~1.5625% of the genome
        avg_segment_cm = 5.3 / 4
    
    # Pick a segment count that reproduces the expected total on average
    if num_segments is None:
        num_segments = max(1, round(expected_total_cm / avg_segment_cm))
    
    # Generate simulated segments
    segments = []
    chromosomes = list(range(1, 23))  # Chromosomes 1-22
    
    for _ in range(num_segments):
        # Select a random chromosome
        chromosome = random.choice(chromosomes)
        
        # Generate a segment length with some noise
        segment_cm = avg_segment_cm * (1 + noise_level * (random.random() - 0.5))
        
        # Toy chromosome length; segments are not clipped to it, so a very long
        # segment starts at 0 and may extend past max_pos
        max_pos = 100 + 20 * chromosome
        start_cm = random.uniform(0, max(0.0, max_pos - segment_cm))
        end_cm = start_cm + segment_cm
        
        segments.append({
            "chromosome": chromosome,
            "start_cm": start_cm,
            "end_cm": end_cm,
            "length_cm": segment_cm
        })
    
    return segments

def simulate_pedigree_ibd(pedigree, genotyped_ids=None):
    """Simulate IBD segments for all genotyped individuals in a pedigree.
    
    Args:
        pedigree: Up-node dictionary representing the pedigree
        genotyped_ids: List of IDs to simulate (defaults to all positive IDs)
        
    Returns:
        ibd_data: Dictionary mapping pairs of IDs to their simulated IBD segments
    """
    # If no genotyped IDs are provided, use all positive IDs
    if genotyped_ids is None:
        all_ids = set(pedigree.keys()).union(*[set(parents.keys()) for parents in pedigree.values()])
        genotyped_ids = [i for i in all_ids if i > 0]
    
    # Create a dictionary to store IBD segments for each pair
    ibd_data = {}
    
    # For each pair of genotyped individuals
    for i in range(len(genotyped_ids)):
        for j in range(i + 1, len(genotyped_ids)):
            id1 = genotyped_ids[i]
            id2 = genotyped_ids[j]
            
            # Get the relationship tuple
            rel_tuple = get_simple_rel_tuple(pedigree, id1, id2)
            
            # Simulate IBD segments based on the relationship
            segments = simulate_ibd_segments(rel_tuple)
            
            # Store the segments
            pair_key = (id1, id2)
            ibd_data[pair_key] = segments
    
    return ibd_data

### 1.2 Creating Sample Pedigree Configurations

Let's create some sample pedigree configurations to work with. We'll define several alternative configurations to demonstrate how Bonsai evaluates and optimizes them.
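
When building configurations by hand, it can help to check that each one is a structurally valid pedigree before scoring it. The sketch below is one possible check, not a Bonsai v3 function: the name `check_up_node_dict` is hypothetical, and it only tests two conditions (no more than two parents per individual, and nobody is their own ancestor).

```python
def check_up_node_dict(up_node_dict):
    """Hypothetical validity check; returns a list of problem descriptions."""
    problems = []

    # Rule 1: an individual has at most two parents
    for child, parents in up_node_dict.items():
        if len(parents) > 2:
            problems.append(f"{child} has {len(parents)} parents")

    # Rule 2: nobody is their own ancestor (no cycles)
    def collect_ancestors(node, seen):
        for parent in up_node_dict.get(node, {}):
            if parent not in seen:
                seen.add(parent)
                collect_ancestors(parent, seen)
        return seen

    for node in up_node_dict:
        if node in collect_ancestors(node, set()):
            problems.append(f"{node} is its own ancestor")

    return problems

valid = {1: {-1: 1, -2: 1}, 2: {-1: 1, -2: 1}, -1: {}, -2: {}}
cyclic = {1: {2: 1}, 2: {1: 1}}
too_many = {1: {-1: 1, -2: 1, -3: 1}, -1: {}, -2: {}, -3: {}}

print(check_up_node_dict(valid))
print(check_up_node_dict(cyclic))
print(check_up_node_dict(too_many))
```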

In [ ]:
# Helper functions to create various pedigree structures
def create_parent_child_unit(child_id, parent1_id=None, parent2_id=None):
    """Create a simple parent-child unit."""
    pedigree = {child_id: {}}
    
    # Create first parent if not provided
    if parent1_id is None:
        min_id = get_min_id(pedigree)
        parent1_id = min_id - 1
    
    # Add first parent
    pedigree[child_id][parent1_id] = 1
    if parent1_id not in pedigree:
        pedigree[parent1_id] = {}
    
    # Add second parent if needed
    if parent2_id is not None:
        pedigree[child_id][parent2_id] = 1
        if parent2_id not in pedigree:
            pedigree[parent2_id] = {}
    elif parent2_id is None and parent1_id is not None:
        # Add an ungenotyped second parent
        pedigree, new_parent_id = add_parent(child_id, pedigree)
        if new_parent_id not in pedigree:
            pedigree[new_parent_id] = {}
    
    return pedigree

def create_sibling_group(child_ids, parent1_id=None, parent2_id=None):
    """Create a sibling group with the specified children and parents."""
    # Initialize an empty pedigree
    pedigree = {}
    
    # Create entries for each child
    for child_id in child_ids:
        pedigree[child_id] = {}
    
    # Create parents if not provided, making sure the two generated IDs differ
    if parent1_id is None:
        parent1_id = get_min_id(pedigree) - 1
    
    if parent2_id is None:
        # parent1_id is not in the pedigree yet, so account for it explicitly
        parent2_id = min(get_min_id(pedigree), parent1_id) - 1
    
    # Add parents to each child
    for child_id in child_ids:
        pedigree[child_id][parent1_id] = 1
        pedigree[child_id][parent2_id] = 1
    
    # Add parent entries to the pedigree
    pedigree[parent1_id] = {}
    pedigree[parent2_id] = {}
    
    return pedigree

def create_half_sibling_structure(child1_id, child2_id, common_parent_id=None, parent1_id=None, parent2_id=None):
    """Create a half-sibling structure with two children sharing one parent."""
    # Initialize an empty pedigree
    pedigree = {child1_id: {}, child2_id: {}}
    
    # Create common parent if not provided
    if common_parent_id is None:
        min_id = get_min_id(pedigree)
        common_parent_id = min_id - 1
    
    # Add common parent to both children
    pedigree[child1_id][common_parent_id] = 1
    pedigree[child2_id][common_parent_id] = 1
    pedigree[common_parent_id] = {}
    
    # Create and add second parent for first child if needed
    if parent1_id is None:
        min_id = get_min_id(pedigree)
        parent1_id = min_id - 1
    pedigree[child1_id][parent1_id] = 1
    pedigree[parent1_id] = {}
    
    # Create and add second parent for second child if needed
    if parent2_id is None:
        min_id = get_min_id(pedigree)
        parent2_id = min_id - 1
    pedigree[child2_id][parent2_id] = 1
    pedigree[parent2_id] = {}
    
    return pedigree

def connect_pedigrees(pedigree1, pedigree2, connection_point):
    """Connect two pedigrees using the specified connection point."""
    import copy
    combined_pedigree = copy.deepcopy(pedigree1)
    pedigree2_copy = copy.deepcopy(pedigree2)
    
    # Extract connection information
    id1, id2, direction = connection_point
    
    # Collect all IDs from both pedigrees
    all_ids1 = set(pedigree1.keys()).union(*[set(parents.keys()) for parents in pedigree1.values()])
    all_ids2 = set(pedigree2.keys()).union(*[set(parents.keys()) for parents in pedigree2.values()])
    # Next free ID for new ungenotyped individuals (always negative)
    min_id = min(min(all_ids1 | all_ids2) - 1, -1)
    
    # Adjust IDs in pedigree2 to avoid conflicts.
    # Assumption: a shared genotyped (positive) ID refers to the same person and is
    # kept, while a colliding ungenotyped (negative) ID gets a fresh negative ID.
    id_map = {}
    for old_id in sorted(all_ids2):
        if old_id in all_ids1 and old_id < 0:
            id_map[old_id] = min_id
            min_id -= 1
    
    # Apply the ID mapping to pedigree2
    if id_map:
        remapped_pedigree2 = {}
        for node, parents in pedigree2_copy.items():
            new_node = id_map.get(node, node)
            remapped_parents = {id_map.get(p, p): d for p, d in parents.items()}
            remapped_pedigree2[new_node] = remapped_parents
        pedigree2_copy = remapped_pedigree2
    
    # Connect based on direction
    if direction == 0:  # Connect downward (add as child)
        # Create a new individual as child of id1 (and id2 if provided)
        connector_id = min_id
        min_id -= 1
        
        # Add the connector as child of id1 (and id2 if provided)
        combined_pedigree[connector_id] = {id1: 1}
        if id2 is not None:
            combined_pedigree[connector_id][id2] = 1
        
        # Make the connector the parent of all founders in pedigree2
        # Get founders (nodes with no parents) in pedigree2
        founders2 = [node for node, parents in pedigree2_copy.items() if not parents]
        for founder in founders2:
            pedigree2_copy[founder][connector_id] = 1
    
    elif direction == 1:  # Connect upward (add as parent)
        # Get founders in pedigree2
        founders2 = [node for node, parents in pedigree2_copy.items() if not parents]
        
        if len(founders2) == 1:  # If pedigree2 has a single founder, connect directly
            combined_pedigree[id1][founders2[0]] = 1
        else:  # Otherwise, create a connector individual
            connector_id = min_id
            min_id -= 1
            
            # Add the connector as parent of id1
            combined_pedigree[id1][connector_id] = 1
            combined_pedigree[connector_id] = {}
            
            # Make the founders of pedigree2 parents of the connector
            for founder in founders2:
                combined_pedigree[connector_id][founder] = 1
    
    else:  # Replace/lateral connection
        # We'll implement this as replacing id1 with a founder from pedigree2
        founders2 = [node for node, parents in pedigree2_copy.items() if not parents]
        if founders2:  # If there are founders in pedigree2
            replaced_founder = founders2[0]
            
            # Replace all occurrences of id1 with replaced_founder
            for node, parents in combined_pedigree.items():
                if id1 in parents:
                    degree = parents.pop(id1)
                    parents[replaced_founder] = degree
            
            # Handle the connections of id1
            if id1 in combined_pedigree:
                # Transfer the parents of id1 to replaced_founder
                if replaced_founder not in combined_pedigree:
                    combined_pedigree[replaced_founder] = {}
                combined_pedigree[replaced_founder].update(combined_pedigree[id1])
                del combined_pedigree[id1]  # Remove id1 from the pedigree
    
    # Merge the modified pedigree2 into the combined pedigree
    for node, parents in pedigree2_copy.items():
        if node not in combined_pedigree:
            combined_pedigree[node] = parents
        else:
            combined_pedigree[node].update(parents)
    
    return combined_pedigree

In [ ]:
# Create several sample pedigree configurations we'll optimize

# Configuration 1: Three individuals (1, 2, 3) with no relationships between them
config1 = {1: {}, 2: {}, 3: {}}

# Configuration 2: Parent-child relationship (1 is parent of 2 and 3)
config2 = {
    1: {},
    2: {1: 1},
    3: {1: 1}
}

# Configuration 3: Full sibling relationship (1 and 2 are siblings, 3 is unrelated)
config3 = create_sibling_group([1, 2], parent1_id=-1, parent2_id=-2)
config3[3] = {}

# Configuration 4: Half-sibling relationship (1 and 2 share one parent, 3 is unrelated)
config4 = create_half_sibling_structure(1, 2, common_parent_id=-1, parent1_id=-2, parent2_id=-3)
config4[3] = {}

# Configuration 5: Complex structure (1, 2, and 6 are full siblings; 7 is their
# half-sibling through parent 4; 3 is the child of 7, so a half-nephew of 1 and 2)
# First create a sibling group for 1 and 2 with parents 4 and 5
sibling_group = create_sibling_group([1, 2], parent1_id=4, parent2_id=5)
# Add 6 (parents 4 and 5) and 7 (parents 4 and 8), who share parent 4
half_sibling = create_half_sibling_structure(6, 7, common_parent_id=4, parent1_id=5, parent2_id=8)
# Make 3 the child of 7
parent_child = create_parent_child_unit(3, parent1_id=7)
# Combine the pedigrees
config5 = sibling_group
for id_val, parents in half_sibling.items():
    if id_val in config5:
        config5[id_val].update(parents)
    else:
        config5[id_val] = parents
for id_val, parents in parent_child.items():
    if id_val in config5:
        config5[id_val].update(parents)
    else:
        config5[id_val] = parents

# Create metadata for visualization
metadata = {
    1: {"sex": "M", "age": 40},
    2: {"sex": "F", "age": 38},
    3: {"sex": "M", "age": 25},
    4: {"sex": "M", "age": 70},
    5: {"sex": "F", "age": 68},
    6: {"sex": "M", "age": 42},
    7: {"sex": "F", "age": 39},
    8: {"sex": "M", "age": 72}
}

# Visualize each configuration.
# visualize_pedigree creates and shows its own figure, so each configuration
# is drawn in a separate figure rather than in a shared subplot grid.
configs_to_plot = [
    (config1, "Config 1: Unrelated"),
    (config2, "Config 2: Parent-Child"),
    (config3, "Config 3: Full Siblings"),
    (config4, "Config 4: Half-Siblings"),
    (config5, "Config 5: Complex Structure"),
]
for config, title in configs_to_plot:
    visualize_pedigree(config, title, individual_metadata=metadata)

### 1.3 Evaluating Pedigree Configurations

Next, we generate IBD data from a ground-truth pedigree and then implement a function that scores how well a given configuration explains that data. This scoring step is what the optimization process relies on to compare configurations.
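
The simplified fallback likelihood earlier in this lab scores a pair by treating the total shared cM as roughly Gaussian around an expected value, with a standard deviation of 20% of that value. The sketch below isolates that calculation; the function name `gaussian_log_likelihood` and the observed value are made up for illustration, and the constant term of the Gaussian density is omitted because it does not affect comparisons between hypotheses.

```python
import math

def gaussian_log_likelihood(observed_cm, expected_cm, rel_sd=0.2):
    """Log-likelihood (up to a constant) of observed total cM under a Gaussian
    centred on expected_cm with standard deviation rel_sd * expected_cm."""
    sd = expected_cm * rel_sd
    z = (observed_cm - expected_cm) / sd
    return -0.5 * z * z - math.log(sd)

# Made-up observation compared under two hypotheses from the toy table
observed = 1650.0
ll_expected_1700 = gaussian_log_likelihood(observed, 1700.0)
ll_expected_850 = gaussian_log_likelihood(observed, 850.0)

print(ll_expected_1700 > ll_expected_850)
```

Summing such per-pair terms over every pair with IBD data gives a single score for a whole configuration, under the simplifying assumption that pairs are independent.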

In [ ]:
# Now let's create a "true" pedigree and generate IBD data from it
# We'll use configuration 5 (complex structure) as our ground truth
true_pedigree = config5

# Simulate IBD data based on this true pedigree
# Only consider genotyped individuals (those with positive IDs)
genotyped_ids = [1, 2, 3, 4, 5, 6, 7, 8]
true_ibd_data = simulate_pedigree_ibd(true_pedigree, genotyped_ids)

# Display the IBD sharing patterns
ibd_summary = []
for (id1, id2), segments in true_ibd_data.items():
    total_cm = sum(seg["length_cm"] for seg in segments)
    num_segments = len(segments)
    
    # Get relationship information
    rel_tuple = get_simple_rel_tuple(true_pedigree, id1, id2)
    relationship = "Unknown"
    if rel_tuple:
        up, down, num_ancs = rel_tuple
        degree = up + down
        if degree == 0:
            relationship = "Self"
        elif degree == 1:
            relationship = "Parent-Child"
        elif degree == 2 and num_ancs == 2:
            relationship = "Full Siblings"
        elif degree == 2 and num_ancs == 1:
            relationship = "Half Siblings/Grandparent"
        elif degree == 3:
            relationship = "Aunt-Uncle/Great-Grandparent"
        elif degree == 4:
            relationship = "First Cousins/Great-Aunt-Uncle"
        else:
            relationship = f"Degree {degree} (Distant Relative)"
    
    ibd_summary.append({
        "Individual 1": id1,
        "Individual 2": id2,
        "Relationship": relationship,
        "Total cM": total_cm,
        "Num Segments": num_segments
    })

# Display the IBD summary
ibd_df = pd.DataFrame(ibd_summary).sort_values(by="Total cM", ascending=False)
display(ibd_df)

In [ ]:
def evaluate_pedigree(pedigree, ibd_data):
    """Evaluate how well a pedigree explains a set of IBD data.
    
    Args:
        pedigree: Up-node dictionary representing the pedigree
        ibd_data: Dictionary mapping pairs of IDs to their IBD segments
        
    Returns:
        log_likelihood: Log-likelihood score (higher is better)
        relationship_consistency: Percentage of relationships consistent with IBD
    """
    log_likelihood = 0.0
    consistent_relationships = 0
    total_relationships = 0
    
    # For each pair with IBD data
    for (id1, id2), segments in ibd_data.items():
        total_relationships += 1
        
        # Get the expected relationship from the pedigree
        rel_tuple = get_simple_rel_tuple(pedigree, id1, id2)
        
        # If no relationship is found in the pedigree but IBD is observed, penalize
        if rel_tuple is None:
            log_likelihood -= 10.0  # Arbitrary penalty
            continue
            
        # Calculate expected and observed IBD sharing
        up, down, num_ancs = rel_tuple
        degree = up + down
        
        # Calculate expected IBD sharing based on the relationship degree
        if degree == 0:  # Self
            expected_total_cm = 3400
        elif degree == 1:  # Parent-child
            expected_total_cm = 3400 / 2
        elif degree == 2 and num_ancs == 2:  # Full siblings
            expected_total_cm = 2550
        elif degree == 2 and num_ancs == 1:  # Half siblings/grandparents
            expected_total_cm = 1700
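        # The values below halve with each additional degree and assume a single
        # common ancestor; pairs with two common ancestors share about twice as much.
        elif degree == 3:  # e.g. great-grandparent, half aunt/uncle
            expected_total_cm = 850
        elif degree == 4:  # e.g. half first cousins
            expected_total_cm = 425
        elif degree == 5:  # e.g. half first cousins once removed
            expected_total_cm = 212.5
        elif degree == 6:  # e.g. half second cousins
            expected_total_cm = 106.25
        else:  # More distant
            expected_total_cm = 53.125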
        
        # Calculate observed IBD sharing
        observed_total_cm = sum(seg["length_cm"] for seg in segments)
        
        # Add contribution to log-likelihood
        std_dev = expected_total_cm * 0.2  # 20% variation
        if std_dev > 0:
            log_likelihood += -0.5 * ((observed_total_cm - expected_total_cm) / std_dev) ** 2 - math.log(std_dev)
        
        # Check if the relationship is consistent with the IBD
        # A simple check: is the observed IBD within 50% of the expected?
        if 0.5 * expected_total_cm <= observed_total_cm <= 1.5 * expected_total_cm:
            consistent_relationships += 1
    
    # Calculate relationship consistency as a percentage
    relationship_consistency = 100 * consistent_relationships / total_relationships if total_relationships > 0 else 0
    
    return log_likelihood, relationship_consistency

# If we're not in JupyterLite, let's also show the actual function from Bonsai v3
if not is_jupyterlite():
    try:
        # Get the source code for get_likelihood_of_relationship
        print("Here's the source code for get_likelihood_of_relationship from Bonsai v3:\n")
        view_source(get_likelihood_of_relationship)
    except Exception as e:
        print(f"Couldn't display source code: {e}")

In [ ]:
# Now let's evaluate each of our sample configurations against the true IBD data
configurations = {
    "Config 1: Unrelated": config1,
    "Config 2: Parent-Child": config2,
    "Config 3: Full Siblings": config3,
    "Config 4: Half-Siblings": config4,
    "Config 5: Complex (True)": config5
}

evaluation_results = []
for config_name, pedigree in configurations.items():
    ll, consistency = evaluate_pedigree(pedigree, true_ibd_data)
    evaluation_results.append({
        "Configuration": config_name,
        "Log-Likelihood": ll,
        "Consistency %": consistency,
        "Is True Pedigree": config_name == "Config 5: Complex (True)"
    })

# Display evaluation results sorted by log-likelihood
eval_df = pd.DataFrame(evaluation_results).sort_values(by="Log-Likelihood", ascending=False)
display(eval_df)

# Highlight the true pedigree
true_config = eval_df[eval_df["Is True Pedigree"]].iloc[0]
print(f"\nThe true pedigree (Config 5) has a log-likelihood of {true_config['Log-Likelihood']:.2f} and consistency of {true_config['Consistency %']:.1f}%.")
print(f"It ranks #{eval_df.index.get_loc(true_config.name) + 1} among all configurations.")

# Visualize the best configuration
best_config_name = eval_df.iloc[0]["Configuration"]
best_config = configurations[best_config_name]
visualize_pedigree(best_config, f"Best Configuration: {best_config_name}", individual_metadata=metadata)

## Part 2: Systematic Search of Pedigree Configurations

After evaluating individual pedigree configurations, the next step is to systematically search through the space of possible configurations to find the optimal one. Let's explore how Bonsai v3 approaches this search process.

### 2.1 Generating Alternative Configurations

The first step in systematic search is to generate alternative pedigree configurations to evaluate. Let's explore how to generate these alternatives:
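
Any move that adds a parent link has to keep the pedigree acyclic and biologically valid. A minimal sketch of such a validity check follows, assuming the up-node dictionary format `{child: {parent: degree}}` used in this lab. The helpers `ancestor_set` and `can_add_parent` are hypothetical names written for this illustration and are not part of Bonsai.

```python
def ancestor_set(up_dict, node):
    """Return every ancestor of `node` by walking parent links upward."""
    ancestors = set()
    stack = list(up_dict.get(node, {}))
    while stack:
        parent = stack.pop()
        if parent not in ancestors:
            ancestors.add(parent)
            stack.extend(up_dict.get(parent, {}))
    return ancestors

def can_add_parent(up_dict, child, parent, max_parents=2):
    """True if `parent` can be placed above `child` without a cycle or a third parent."""
    if child == parent or parent in up_dict.get(child, {}):
        return False
    if len(up_dict.get(child, {})) >= max_parents:
        return False
    # A cycle would appear if `child` is already an ancestor of `parent`.
    return child not in ancestor_set(up_dict, parent)

pedigree = {3: {1: 1, 2: 1}, 4: {3: 1}, 1: {}, 2: {}}
print(can_add_parent(pedigree, 1, 4))  # 4 descends from 1, so this would close a loop
print(can_add_parent(pedigree, 4, 5))  # 5 is new and 4 has only one parent
print(can_add_parent(pedigree, 3, 5))  # 3 already has two parents
```

Filtering candidate moves with a check like this keeps every generated alternative a valid pedigree before it is scored.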

In [ ]:
def generate_alternative_configurations(base_pedigree, individuals, max_configs=5):
    """Generate alternative pedigree configurations by modifying the base pedigree.
    
    Args:
        base_pedigree: Starting pedigree configuration
        individuals: List of individuals to consider
        max_configs: Maximum number of alternative configurations to generate
        
    Returns:
        list: Alternative pedigree configurations
    """
    alternatives = []
    
    # Make a copy of the base pedigree to avoid modifying the original
    base_copy = copy.deepcopy(base_pedigree)
    
    # 1. Add parent-child relationships
    for i in individuals:
        for j in individuals:
            if i != j:
                # Create a new configuration where i is a child of j
                new_config = copy.deepcopy(base_copy)
                
                # Don't create cycles (j can't be a descendant of i)
                if j not in get_descendants(i, new_config):
                    # Add j as a parent of i
                    if j not in new_config.get(i, {}):
                        if i not in new_config:
                            new_config[i] = {}
                        new_config[i][j] = 1
                        
                        # Add entry for j if it doesn't exist
                        if j not in new_config:
                            new_config[j] = {}
                        
                        alternatives.append(("Parent-Child", new_config))
                        if len(alternatives) >= max_configs:
                            return [alt[1] for alt in alternatives]
    
    # 2. Add sibling relationships
    for i, j in itertools.combinations(individuals, 2):
        # Create new configuration where i and j are siblings
        new_config = copy.deepcopy(base_copy)
        
        # Create a new parent for both
        min_id = get_min_id(new_config)
        parent_id = min_id - 1
        
        # Add the parent to both individuals
        if i not in new_config:
            new_config[i] = {}
        if j not in new_config:
            new_config[j] = {}
        
        new_config[i][parent_id] = 1
        new_config[j][parent_id] = 1
        new_config[parent_id] = {}
        
        alternatives.append(("Siblings", new_config))
        if len(alternatives) >= max_configs:
            return [alt[1] for alt in alternatives]
    
    # 3. Add grandparent relationships
    for i in individuals:
        for j in individuals:
            if i != j:
                # Create a new configuration where i is a grandparent of j
                new_config = copy.deepcopy(base_copy)
                
                # Don't create cycles (i can't be a descendant of j), and skip pairs
                # where i is already an ancestor of j
                if i not in get_descendants(j, new_config) and i not in get_ancestors(j, new_config):
                    # Create a connector individual (i's child, j's parent)
                    min_id = get_min_id(new_config)
                    connector_id = min_id - 1
                    
                    # Add connector as child of i
                    if i not in new_config:
                        new_config[i] = {}
                    if connector_id not in new_config:
                        new_config[connector_id] = {}
                    
                    new_config[connector_id][i] = 1
                    
                    # Add connector as parent of j
                    if j not in new_config:
                        new_config[j] = {}
                    
                    new_config[j][connector_id] = 1
                    
                    alternatives.append(("Grandparent", new_config))
                    if len(alternatives) >= max_configs:
                        return [alt[1] for alt in alternatives]
    
    # 4. Remove relationships (if any exist)
    for ind_id, parents in base_copy.items():
        for parent_id in list(parents.keys()):
            # Create a new configuration with this relationship removed
            new_config = copy.deepcopy(base_copy)
            new_config[ind_id].pop(parent_id, None)
            
            alternatives.append(("Remove-Relationship", new_config))
            if len(alternatives) >= max_configs:
                return [alt[1] for alt in alternatives]
    
    return [alt[1] for alt in alternatives]

# For demonstration, let's create alternatives for a simple configuration
simple_config = {1: {}, 2: {}, 3: {}}  # Three unconnected individuals
alternatives = generate_alternative_configurations(simple_config, [1, 2, 3], max_configs=4)

# Visualize the alternatives
plt.figure(figsize=(15, 10))

plt.subplot(2, 3, 1)
visualize_pedigree(simple_config, "Original Configuration", individual_metadata=metadata)

for i, alt in enumerate(alternatives):
    plt.subplot(2, 3, i+2)
    visualize_pedigree(alt, f"Alternative {i+1}", individual_metadata=metadata)

plt.tight_layout()
plt.show()

### 2.2 Implementing a Greedy Search Algorithm

Now let's implement a greedy search algorithm to find the optimal pedigree configuration:
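
Greedy search moves to the best-scoring neighbor at each step and stops as soon as no neighbor improves the score, which means it can stall on a local optimum. The toy sketch below illustrates that behavior on a one-dimensional landscape. The function `hill_climb` and the `landscape` values are hypothetical and unrelated to pedigree scores.

```python
def hill_climb(start, neighbors, score, max_iterations=100):
    """Greedy ascent: move to the best-scoring neighbor until none improves."""
    current, current_score = start, score(start)
    path = [current]
    for _ in range(max_iterations):
        candidates = neighbors(current)
        if not candidates:
            break
        best = max(candidates, key=score)
        if score(best) <= current_score:
            break  # local optimum: no neighbor is strictly better
        current, current_score = best, score(best)
        path.append(current)
    return current, path

# Toy landscape with a local peak at x=2 and the global peak at x=8.
landscape = [0, 1, 3, 2, 1, 2, 4, 6, 9, 5]

def score(x):
    return landscape[x]

def neighbors(x):
    return [n for n in (x - 1, x + 1) if 0 <= n < len(landscape)]

print(hill_climb(0, neighbors, score))  # stalls on the local peak
print(hill_climb(5, neighbors, score))  # reaches the global peak
```

The result depends on the starting point and on which neighbors are generated, which is why the choice of candidate moves matters as much as the scoring function.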

In [ ]:
def greedy_search(initial_pedigree, individuals, ibd_data, max_iterations=10, max_configs_per_iter=5):
    """Perform greedy search to find the optimal pedigree configuration.
    
    Args:
        initial_pedigree: Starting pedigree configuration
        individuals: List of individuals to consider
        ibd_data: Dictionary of IBD segments between pairs of individuals
        max_iterations: Maximum number of search iterations
        max_configs_per_iter: Maximum number of configurations to consider per iteration
        
    Returns:
        best_pedigree: The best pedigree configuration found
        history: List of (pedigree, log_likelihood, consistency) tuples for each iteration
    """
    current_pedigree = copy.deepcopy(initial_pedigree)
    current_ll, current_consistency = evaluate_pedigree(current_pedigree, ibd_data)
    
    # Keep track of the search history
    history = [(current_pedigree, current_ll, current_consistency)]
    
    for iteration in range(max_iterations):
        print(f"Iteration {iteration+1}/{max_iterations}: Current log-likelihood = {current_ll:.2f}, consistency = {current_consistency:.1f}%")
        
        # Generate alternative configurations
        alternatives = generate_alternative_configurations(current_pedigree, individuals, max_configs=max_configs_per_iter)
        
        # Evaluate each alternative
        best_alt = None
        best_alt_ll = current_ll
        best_alt_consistency = current_consistency
        
        for i, alt in enumerate(alternatives):
            alt_ll, alt_consistency = evaluate_pedigree(alt, ibd_data)
            print(f"  Alternative {i+1}: log-likelihood = {alt_ll:.2f}, consistency = {alt_consistency:.1f}%")
            
            # Check if this alternative is better
            if alt_ll > best_alt_ll:
                best_alt = alt
                best_alt_ll = alt_ll
                best_alt_consistency = alt_consistency
        
        # If no improvement is found, stop the search
        if best_alt is None or best_alt_ll <= current_ll:
            print("No improvement found. Stopping search.")
            break
        
        # Update the current configuration
        current_pedigree = best_alt
        current_ll = best_alt_ll
        current_consistency = best_alt_consistency
        
        # Add to history
        history.append((current_pedigree, current_ll, current_consistency))
    
    return current_pedigree, history

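# Run a small test example
# Note: generate_alternative_configurations returns candidates in a fixed order
# (parent-child moves first), so a small max_configs_per_iter restricts the search
# to those moves; raise it to let sibling and grandparent moves be considered.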
test_individuals = [1, 2, 3]
test_initial_pedigree = {1: {}, 2: {}, 3: {}}  # Start with unrelated individuals
test_ibd_data = simulate_pedigree_ibd(config3, test_individuals)  # Simulate IBD for siblings

best_pedigree, history = greedy_search(test_initial_pedigree, test_individuals, test_ibd_data, 
                                       max_iterations=3, max_configs_per_iter=3)

# Visualize the progression of configurations
plt.figure(figsize=(15, 10))

# First, show the initial configuration
plt.subplot(2, 2, 1)
visualize_pedigree(history[0][0], f"Initial: LL={history[0][1]:.2f}, Cons={history[0][2]:.1f}%", 
                   individual_metadata=metadata)

# Then show intermediate steps and final configuration
for i in range(1, min(3, len(history))):
    plt.subplot(2, 2, i+1)
    visualize_pedigree(history[i][0], f"Step {i}: LL={history[i][1]:.2f}, Cons={history[i][2]:.1f}%", 
                       individual_metadata=metadata)

# Show the final "true" configuration for comparison
plt.subplot(2, 2, 4)
visualize_pedigree(config3, "True Configuration (Siblings)", individual_metadata=metadata)

plt.tight_layout()
plt.show()

### 2.3 Using Bonsai's Actual Optimization Functions

Now let's examine how Bonsai v3 itself handles optimization of pedigree structures. The cells below try to import and display the relevant functions from the Bonsai codebase, and fall back to our simplified implementations when a module or function is not available.
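
Because module layouts differ between installations, it helps to probe for functions rather than assume they exist. The sketch below shows one way to do that with the standard `importlib` and `inspect` modules. The helper `find_functions` is a hypothetical name, and the demonstration uses the standard-library `json` module so it runs without Bonsai installed.

```python
import importlib
import inspect

def find_functions(module_name, keywords):
    """Return names of Python-level functions in a module whose names contain any keyword.

    Returns None when the module cannot be imported.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return [name for name, _ in inspect.getmembers(module, inspect.isfunction)
            if any(kw in name.lower() for kw in keywords)]

# Demonstrated on the standard library so the example runs anywhere.
print(find_functions("json", ["dump"]))
print(find_functions("no_such_module_for_demo", ["optimize"]))
```

Note that `inspect.isfunction` only matches functions written in Python, so functions implemented in compiled extensions would not be listed.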

In [ ]:
# Let's look at the actual optimization functions in Bonsai v3
if not is_jupyterlite():
    try:
        # Import optimization functions from Bonsai v3
        from utils.bonsaitree.bonsaitree.v3.optimize import (
            optimize_pedigree,
            evaluate_pedigree_structure,
            generate_alternative_structures
        )
        
        print("Successfully imported Bonsai v3 optimization functions.")
        
        # Examine the optimize_pedigree function
        print("\nOptimize Pedigree function from Bonsai v3:")
        view_source(optimize_pedigree)
        
        # Examine the evaluate_pedigree_structure function
        print("\nEvaluate Pedigree Structure function from Bonsai v3:")
        view_source(evaluate_pedigree_structure)
        
        # Examine the generate_alternative_structures function
        print("\nGenerate Alternative Structures function from Bonsai v3:")
        view_source(generate_alternative_structures)
        
    except ImportError as e:
        print(f"Could not import Bonsai v3 optimization functions: {e}")
        print("Using simplified implementations instead.")
    except AttributeError as e:
        print(f"Could not find specific optimization functions in Bonsai v3: {e}")
        print("Using simplified implementations instead.")
else:
    print("Running in JupyterLite - using simplified implementations instead of actual Bonsai functions.")

In [ ]:
# Now let's see what optimization-related functions are available in the Bonsai v3 modules
if not is_jupyterlite():
    try:
        # Check for optimization-related functions in v3.pedigrees
        print("Optimization-related functions in v3.pedigrees:")
        functions = [name for name, func in inspect.getmembers(v3.pedigrees, inspect.isfunction) 
                     if any(kw in name.lower() for kw in ['optimize', 'improve', 'alternative', 'evaluate'])]
        for func_name in functions:
            print(f"- {func_name}")
            
        # Check for optimization-related functions in v3.connections
        print("\nOptimization-related functions in v3.connections:")
        functions = [name for name, func in inspect.getmembers(v3.connections, inspect.isfunction) 
                     if any(kw in name.lower() for kw in ['optimize', 'improve', 'alternative', 'evaluate'])]
        for func_name in functions:
            print(f"- {func_name}")
        
        # Look for any module named 'optimize' or similar
        print("\nChecking for optimization modules in Bonsai v3:")
        
        utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
        bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
        v3_dir = os.path.join(bonsaitree_dir, 'bonsaitree', 'v3')
        
        if os.path.exists(v3_dir):
            modules = [f for f in os.listdir(v3_dir) 
                      if os.path.isfile(os.path.join(v3_dir, f)) 
                      and f.endswith('.py') 
                      and any(kw in f.lower() for kw in ['optim', 'improve', 'search'])]
            for module in modules:
                print(f"Found module: {module}")
        
    except Exception as e:
        print(f"Error exploring optimization functions: {e}")
else:
    print("Running in JupyterLite - skipping exploration of actual Bonsai modules.")

In [ ]:
# Let's examine the specific optimization functions in Bonsai v3
if not is_jupyterlite():
    try:
        # Look for functions in pedigrees module that might be related to configuration optimization
        print("Exploring key pedigree optimization-related functions in Bonsai v3:")
        
        # Check for get_up_dict_with_new_connection function
        if hasattr(v3.pedigrees, 'get_up_dict_with_new_connection'):
            print("\nFunction: get_up_dict_with_new_connection")
            view_source(v3.pedigrees.get_up_dict_with_new_connection)
        
        # Check for get_alternative_connections function
        if hasattr(v3.pedigrees, 'get_alternative_connections'):
            print("\nFunction: get_alternative_connections")
            view_source(v3.pedigrees.get_alternative_connections)
        
        # Check for evaluate_up_dict function
        if hasattr(v3.pedigrees, 'evaluate_up_dict'):
            print("\nFunction: evaluate_up_dict")
            view_source(v3.pedigrees.evaluate_up_dict)
        
        # Look for functions in connections module
        print("\nExploring key connection optimization functions in Bonsai v3:")
        
        # Check for get_best_connection function
        if hasattr(v3.connections, 'get_best_connection'):
            print("\nFunction: get_best_connection")
            view_source(v3.connections.get_best_connection)
            
        # Check for evaluate_connections function
        if hasattr(v3.connections, 'evaluate_connections'):
            print("\nFunction: evaluate_connections")
            view_source(v3.connections.evaluate_connections)
            
    except Exception as e:
        print(f"Error examining optimization functions: {e}")
else:
    print("Running in JupyterLite - skipping examination of Bonsai functions.")

## Part 3: Applying Optimization Techniques

Now let's apply what we've learned to optimize pedigree configurations using both simplified approaches and, where possible, actual Bonsai v3 functions.

### 3.1 Handling Ambiguous Cases and Multiple Hypotheses

When working with small pedigrees, multiple configurations may be equally plausible based on genetic data. Let's explore techniques for handling these ambiguous cases:
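
Total shared cM alone cannot separate relationships that have the same expected total, but the number and length of segments carry additional information. The sketch below summarizes a segment list in those terms, assuming the segment dictionaries carry a `length_cm` key as elsewhere in this lab. The helper `summarize_segments` and the two example lists are hypothetical.

```python
def summarize_segments(segments):
    """Return total cM, segment count, and mean segment length for one pair."""
    lengths = [seg["length_cm"] for seg in segments]
    total = sum(lengths)
    count = len(lengths)
    mean = total / count if count else 0.0
    return {"total_cm": total, "num_segments": count, "mean_cm": mean}

# Hypothetical segment lists with the same total but different structure.
many_short = [{"length_cm": 40.0} for _ in range(20)]
few_long = [{"length_cm": 80.0} for _ in range(10)]

print(summarize_segments(many_short))
print(summarize_segments(few_long))
```

Both lists have the same total, so a score based only on total cM treats them identically, while the segment count and mean length differ by a factor of two.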

In [ ]:
# Create an ambiguous case: half-siblings vs. grandparent-grandchild
# Generate two different configurations that can have similar IBD patterns
# Configuration A: Half-siblings (1 and 2 share one parent)
config_half_siblings = create_half_sibling_structure(1, 2, common_parent_id=-1, parent1_id=-2, parent2_id=-3)

# Configuration B: Grandparent-grandchild (1 is grandparent of 2)
config_grandparent = {1: {}}
min_id = get_min_id(config_grandparent)
connector_id = min_id - 1
config_grandparent[connector_id] = {1: 1}
config_grandparent[2] = {connector_id: 1}

# Create metadata for visualization
ambiguous_metadata = {
    1: {"sex": "M", "age": 65},
    2: {"sex": "F", "age": 20},
    -1: {"sex": "M", "age": 45},
    -2: {"sex": "F", "age": 40},
    -3: {"sex": "M", "age": 42},
    connector_id: {"sex": "F", "age": 42}
}

# Visualize both configurations
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
visualize_pedigree(config_half_siblings, "Hypothesis A: Half-Siblings", individual_metadata=ambiguous_metadata)

plt.subplot(1, 2, 2)
visualize_pedigree(config_grandparent, "Hypothesis B: Grandparent-Grandchild", individual_metadata=ambiguous_metadata)

plt.tight_layout()
plt.show()

# Simulate IBD data that is ambiguous between these configurations
# Both datasets have broadly similar totals (roughly 700-800 cM on average with the ranges below)
# Half-sibling pattern: more segments, each shorter
# Grandparent-grandchild pattern: fewer segments, each longer
def create_ambiguous_ibd_data():
    # Create similar amount of shared DNA with different patterns
    half_sib_segments = []
    grandparent_segments = []
    
    # Half-siblings: more segments, shorter length
    for _ in range(25):
        chromosome = random.randint(1, 22)
        segment_cm = random.uniform(15, 40)  # Shorter segments
        start_cm = random.uniform(0, 100)
        end_cm = start_cm + segment_cm
        
        half_sib_segments.append({
            "chromosome": chromosome,
            "start_cm": start_cm,
            "end_cm": end_cm,
            "length_cm": segment_cm
        })
    
    # Grandparent-grandchild: fewer segments, longer length
    for _ in range(10):
        chromosome = random.randint(1, 22)
        segment_cm = random.uniform(60, 100)  # Longer segments
        start_cm = random.uniform(0, 100)
        end_cm = start_cm + segment_cm
        
        grandparent_segments.append({
            "chromosome": chromosome,
            "start_cm": start_cm,
            "end_cm": end_cm,
            "length_cm": segment_cm
        })
    
    return {(1, 2): half_sib_segments}, {(1, 2): grandparent_segments}

# Create two different IBD datasets
half_sib_ibd, grandparent_ibd = create_ambiguous_ibd_data()

# Evaluate both configurations against both IBD datasets
half_sib_on_half_ibd_ll, half_sib_on_half_ibd_cons = evaluate_pedigree(config_half_siblings, half_sib_ibd)
half_sib_on_grand_ibd_ll, half_sib_on_grand_ibd_cons = evaluate_pedigree(config_half_siblings, grandparent_ibd)
grand_on_half_ibd_ll, grand_on_half_ibd_cons = evaluate_pedigree(config_grandparent, half_sib_ibd)
grand_on_grand_ibd_ll, grand_on_grand_ibd_cons = evaluate_pedigree(config_grandparent, grandparent_ibd)

# Create a DataFrame to compare results
results = pd.DataFrame([
    {"Configuration": "Half-Siblings", "IBD Pattern": "Half-Sibling Pattern", 
     "Log-Likelihood": half_sib_on_half_ibd_ll, "Consistency %": half_sib_on_half_ibd_cons},
    {"Configuration": "Half-Siblings", "IBD Pattern": "Grandparent Pattern", 
     "Log-Likelihood": half_sib_on_grand_ibd_ll, "Consistency %": half_sib_on_grand_ibd_cons},
    {"Configuration": "Grandparent", "IBD Pattern": "Half-Sibling Pattern", 
     "Log-Likelihood": grand_on_half_ibd_ll, "Consistency %": grand_on_half_ibd_cons},
    {"Configuration": "Grandparent", "IBD Pattern": "Grandparent Pattern", 
     "Log-Likelihood": grand_on_grand_ibd_ll, "Consistency %": grand_on_grand_ibd_cons}
])

# Display the results
display(results)

# Analyze the ambiguity
print("Analysis of ambiguity:")
print("1. When evaluating with half-sibling IBD pattern:")
print(f"   - Half-sibling configuration: LL = {half_sib_on_half_ibd_ll:.2f}")
print(f"   - Grandparent configuration: LL = {grand_on_half_ibd_ll:.2f}")
print(f"   - Difference: {abs(half_sib_on_half_ibd_ll - grand_on_half_ibd_ll):.2f}")
print()
print("2. When evaluating with grandparent IBD pattern:")
print(f"   - Half-sibling configuration: LL = {half_sib_on_grand_ibd_ll:.2f}")
print(f"   - Grandparent configuration: LL = {grand_on_grand_ibd_ll:.2f}")
print(f"   - Difference: {abs(half_sib_on_grand_ibd_ll - grand_on_grand_ibd_ll):.2f}")

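# Comment on the implications
# Note: evaluate_pedigree scores only total cM, and both hypotheses are degree-2
# relationships through one common ancestor, so their scores are expected to coincide.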
if abs(half_sib_on_half_ibd_ll - grand_on_half_ibd_ll) < 3 or abs(half_sib_on_grand_ibd_ll - grand_on_grand_ibd_ll) < 3:
    print("\nImplication: The log-likelihood difference is small, indicating ambiguity between these relationships.")
    print("This demonstrates why Bonsai sometimes needs to maintain multiple hypotheses for small pedigrees.")
    print("Additional information (such as ages) would be needed to resolve this ambiguity.")

### 3.2 Additional Constraints for Optimization

In real-world pedigree optimization, additional constraints beyond genetic data can help resolve ambiguities. Let's incorporate some of these constraints:
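
As a small illustration of one such constraint, the sketch below counts parent-child links that violate a minimum age gap. It assumes the up-node dictionary format used in this lab and a plain `{id: age}` mapping. The helper `count_age_violations`, the 15-year gap, and the example ages are hypothetical choices for this illustration.

```python
def count_age_violations(up_dict, ages, min_gap=15):
    """Count parent-child links where the parent is not at least `min_gap` years older."""
    checked = 0
    violations = []
    for child, parents in up_dict.items():
        for parent in parents:
            if child in ages and parent in ages:
                checked += 1
                if ages[parent] < ages[child] + min_gap:
                    violations.append((parent, child))
    return checked, violations

ages = {1: 65, 2: 20, -1: 45}
half_sibling_like = {1: {-1: 1}, 2: {-1: 1}, -1: {}}
print(count_age_violations(half_sibling_like, ages))
```

Here the shared parent is younger than individual 1, so a half-sibling arrangement for these two people is implausible even if the genetic data alone allows it.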

In [ ]:
def evaluate_pedigree_with_constraints(pedigree, ibd_data, individual_metadata=None):
    """Evaluate a pedigree with additional constraints like age and sex.
    
    Args:
        pedigree: Up-node dictionary representing the pedigree
        ibd_data: Dictionary mapping pairs of IDs to their IBD segments
        individual_metadata: Dictionary mapping individuals to their metadata (age, sex, etc.)
        
    Returns:
        log_likelihood: Log-likelihood score (higher is better)
        consistency: Average of the IBD relationship consistency and the constraint
            consistency, in percent (IBD consistency alone if no metadata is given)
    """
    # Start with basic genetic evaluation
    log_likelihood, relationship_consistency = evaluate_pedigree(pedigree, ibd_data)
    
    # If no metadata is provided, return basic evaluation
    if not individual_metadata:
        return log_likelihood, relationship_consistency
    
    # Count constraints and violations
    total_constraints = 0
    satisfied_constraints = 0
    
    # Check age constraints
    down_dict = reverse_node_dict(pedigree)
    
    for child, parents in pedigree.items():
        if child in individual_metadata and 'age' in individual_metadata[child]:
            child_age = individual_metadata[child]['age']
            
            # Check each parent
            for parent in parents:
                if parent in individual_metadata and 'age' in individual_metadata[parent]:
                    parent_age = individual_metadata[parent]['age']
                    total_constraints += 1
                    
                    # Parent should be older than child (by at least ~15 years)
                    if parent_age >= child_age + 15:
                        satisfied_constraints += 1
                    else:
                        # Apply penalty to log-likelihood for age violation
                        log_likelihood -= 5.0
    
    # Check sex constraints: the two biological parents of a child must differ in sex.
    # Collect each parent pair once so that a violation is not penalized repeatedly.
    spouse_pairs = set()
    for parent, children in down_dict.items():
        for child in children:
            child_parents = list(pedigree.get(child, {}).keys())
            for sp1, sp2 in itertools.combinations(child_parents, 2):
                spouse_pairs.add(tuple(sorted([sp1, sp2])))
    
    for sp1, sp2 in spouse_pairs:
        sex1 = individual_metadata.get(sp1, {}).get('sex')
        sex2 = individual_metadata.get(sp2, {}).get('sex')
        if sex1 is None or sex2 is None:
            continue  # No sex information for this pair
        
        total_constraints += 1
        if sex1 != sex2 or sex1 == '?' or sex2 == '?':
            satisfied_constraints += 1
        else:
            # Apply penalty for sex constraint violation
            log_likelihood -= 10.0
    
    # Calculate constraint consistency
    constraint_consistency = 100 * satisfied_constraints / total_constraints if total_constraints > 0 else 100
    
    # Combine both types of consistency
    overall_consistency = (relationship_consistency + constraint_consistency) / 2
    
    return log_likelihood, overall_consistency

# Now let's evaluate our ambiguous cases with age and sex constraints
# Evaluate both configurations against both IBD datasets, but now with constraints
half_sib_on_half_ibd_ll_c, half_sib_on_half_ibd_cons_c = evaluate_pedigree_with_constraints(
    config_half_siblings, half_sib_ibd, ambiguous_metadata)
half_sib_on_grand_ibd_ll_c, half_sib_on_grand_ibd_cons_c = evaluate_pedigree_with_constraints(
    config_half_siblings, grandparent_ibd, ambiguous_metadata)
grand_on_half_ibd_ll_c, grand_on_half_ibd_cons_c = evaluate_pedigree_with_constraints(
    config_grandparent, half_sib_ibd, ambiguous_metadata)
grand_on_grand_ibd_ll_c, grand_on_grand_ibd_cons_c = evaluate_pedigree_with_constraints(
    config_grandparent, grandparent_ibd, ambiguous_metadata)

# Create a DataFrame to compare results with constraints
results_with_constraints = pd.DataFrame([
    {"Configuration": "Half-Siblings", "IBD Pattern": "Half-Sibling Pattern", 
     "Log-Likelihood": half_sib_on_half_ibd_ll_c, "Consistency %": half_sib_on_half_ibd_cons_c},
    {"Configuration": "Half-Siblings", "IBD Pattern": "Grandparent Pattern", 
     "Log-Likelihood": half_sib_on_grand_ibd_ll_c, "Consistency %": half_sib_on_grand_ibd_cons_c},
    {"Configuration": "Grandparent", "IBD Pattern": "Half-Sibling Pattern", 
     "Log-Likelihood": grand_on_half_ibd_ll_c, "Consistency %": grand_on_half_ibd_cons_c},
    {"Configuration": "Grandparent", "IBD Pattern": "Grandparent Pattern", 
     "Log-Likelihood": grand_on_grand_ibd_ll_c, "Consistency %": grand_on_grand_ibd_cons_c}
])

# Display the results with constraints
print("Evaluation with Age and Sex Constraints:")
display(results_with_constraints)

# Compare to previous results without constraints
print("\nImpact of Adding Constraints:")
comparison = pd.DataFrame([
    {"Configuration": "Half-Siblings on Half-Sibling IBD", 
     "Without Constraints": half_sib_on_half_ibd_ll, 
     "With Constraints": half_sib_on_half_ibd_ll_c,
     "Difference": half_sib_on_half_ibd_ll_c - half_sib_on_half_ibd_ll},
    {"Configuration": "Half-Siblings on Grandparent IBD", 
     "Without Constraints": half_sib_on_grand_ibd_ll, 
     "With Constraints": half_sib_on_grand_ibd_ll_c,
     "Difference": half_sib_on_grand_ibd_ll_c - half_sib_on_grand_ibd_ll},
    {"Configuration": "Grandparent on Half-Sibling IBD", 
     "Without Constraints": grand_on_half_ibd_ll, 
     "With Constraints": grand_on_half_ibd_ll_c,
     "Difference": grand_on_half_ibd_ll_c - grand_on_half_ibd_ll},
    {"Configuration": "Grandparent on Grandparent IBD", 
     "Without Constraints": grand_on_grand_ibd_ll, 
     "With Constraints": grand_on_grand_ibd_ll_c,
     "Difference": grand_on_grand_ibd_ll_c - grand_on_grand_ibd_ll}
])
display(comparison)
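
To read a results table like `results_with_constraints` at a glance, one option is to pick the highest-scoring configuration for each IBD pattern. The sketch below does this on a small stand-in table with the same column names. The log-likelihood values are placeholders chosen for illustration, not outputs of this lab, and `best_per_pattern` is a hypothetical helper rather than part of Bonsai.

```python
import pandas as pd

def best_per_pattern(results):
    """Return the row with the highest log-likelihood for each IBD pattern."""
    # idxmax returns the index label of the best-scoring row within each group
    best_idx = results.groupby("IBD Pattern")["Log-Likelihood"].idxmax()
    return results.loc[best_idx].reset_index(drop=True)

# Placeholder values for illustration only (not results from this lab)
toy_results = pd.DataFrame([
    {"Configuration": "Half-Siblings", "IBD Pattern": "Half-Sibling Pattern", "Log-Likelihood": -10.0},
    {"Configuration": "Grandparent", "IBD Pattern": "Half-Sibling Pattern", "Log-Likelihood": -14.0},
    {"Configuration": "Half-Siblings", "IBD Pattern": "Grandparent Pattern", "Log-Likelihood": -13.0},
    {"Configuration": "Grandparent", "IBD Pattern": "Grandparent Pattern", "Log-Likelihood": -9.0},
])

best = best_per_pattern(toy_results)
print(best[["IBD Pattern", "Configuration", "Log-Likelihood"]])
```

The same helper could be applied to the real table once the evaluation cells above have run, since it only relies on the `IBD Pattern` and `Log-Likelihood` columns.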

### 3.3 Using Bonsai's Approach to Resolve Multiple Hypotheses

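Let's explore how Bonsai v3 resolves multiple competing hypotheses in pedigree reconstruction. We first search the Bonsai v3 modules for functions whose names suggest hypothesis comparison, then implement a simplified resolver that keeps every hypothesis whose log-likelihood falls within a threshold of the best one: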

In [ ]:
# Check for specific methods in Bonsai v3 for handling multiple hypotheses
if not is_jupyterlite():
    try:
        # Look for functions that handle multiple hypotheses or configurations
        # Check in pedigrees module
        print("Searching for multiple hypothesis handling in Bonsai v3...")
        
        # Look for relevant functions in v3.connections or v3.pedigrees
        from utils.bonsaitree.bonsaitree.v3 import connections, pedigrees
        
        # Search for hypothesis-related functions in connections module
        hypo_funcs_conn = [name for name, func in inspect.getmembers(connections, inspect.isfunction) 
                          if any(kw in name.lower() for kw in ['hypothesis', 'alternative', 'compare', 'rank'])]
        if hypo_funcs_conn:
            print("Functions in v3.connections that may handle multiple hypotheses:")
            for func_name in hypo_funcs_conn:
                print(f"- {func_name}")
                
        # Search for hypothesis-related functions in pedigrees module
        hypo_funcs_ped = [name for name, func in inspect.getmembers(pedigrees, inspect.isfunction) 
                         if any(kw in name.lower() for kw in ['hypothesis', 'alternative', 'compare', 'rank'])]
        if hypo_funcs_ped:
            print("Functions in v3.pedigrees that may handle multiple hypotheses:")
            for func_name in hypo_funcs_ped:
                print(f"- {func_name}")
        
        # If no explicit hypothesis-handling functions were found, fall back to
        # functions whose names suggest scoring or comparison
        if not hypo_funcs_conn and not hypo_funcs_ped:
            print("No explicit hypothesis handling functions found.")
            
            rank_funcs_conn = [name for name, func in inspect.getmembers(connections, inspect.isfunction) 
                              if any(kw in name.lower() for kw in ['compare', 'evaluate', 'likelihood', 'rank'])]
            if rank_funcs_conn:
                print("However, these functions might be used for comparison:")
                for func_name in rank_funcs_conn[:3]:  # Show the first 3 matches
                    print(f"- {func_name}")
    
    except Exception as e:
        print(f"Error exploring hypothesis functions: {e}")
else:
    print("Running in JupyterLite - skipping exploration of Bonsai functions.")
    
# Implement a simplified version of handling multiple hypotheses based on Bonsai's approach
def resolve_multiple_hypotheses(hypotheses, ibd_data, individual_metadata=None, 
                               min_likelihood_diff=3.0, max_hypotheses=3):
    """
    Resolve multiple competing hypotheses based on IBD data and metadata.
    
    Args:
        hypotheses: List of (name, pedigree) tuples 
        ibd_data: Dictionary mapping pairs of IDs to their IBD segments
        individual_metadata: Dictionary mapping individuals to their metadata (age, sex, etc.)
        min_likelihood_diff: Minimum log-likelihood difference to consider a hypothesis clearly better
        max_hypotheses: Maximum number of hypotheses to maintain
        
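    Returns:
        kept_results: List of (name, pedigree, log_likelihood, consistency) tuples, sorted by
            descending log-likelihood and limited to the best hypothesis plus any others
            within min_likelihood_diff of it (at most max_hypotheses entries)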
    """
    results = []
    
    # Evaluate each hypothesis
    for name, pedigree in hypotheses:
        if individual_metadata:
            ll, cons = evaluate_pedigree_with_constraints(pedigree, ibd_data, individual_metadata)
        else:
            ll, cons = evaluate_pedigree(pedigree, ibd_data)
        
        results.append((name, pedigree, ll, cons))
    
    # Nothing to rank if no hypotheses were supplied
    if not results:
        return []
    
    # Sort by log-likelihood (descending)
    ranked_results = sorted(results, key=lambda x: x[2], reverse=True)
    best_ll = ranked_results[0][2]
    
    # Always keep the best hypothesis, then add close competitors
    kept_results = [ranked_results[0]]
    
    for candidate in ranked_results[1:]:
        # Keep this hypothesis if it's close enough to the best and there is room
        if best_ll - candidate[2] < min_likelihood_diff and len(kept_results) < max_hypotheses:
            kept_results.append(candidate)
    
    return kept_results

# Apply this to our ambiguous case
hypotheses = [
    ("Half-Siblings", config_half_siblings),
    ("Grandparent-Grandchild", config_grandparent)
]

# Resolve with half-sibling IBD pattern
print("Resolving hypotheses with half-sibling IBD pattern:")
half_sib_resolution = resolve_multiple_hypotheses(hypotheses, half_sib_ibd, ambiguous_metadata)
for name, _, ll, cons in half_sib_resolution:
    print(f"- {name}: Log-Likelihood = {ll:.2f}, Consistency = {cons:.1f}%")

# Resolve with grandparent IBD pattern
print("\nResolving hypotheses with grandparent IBD pattern:")
grand_resolution = resolve_multiple_hypotheses(hypotheses, grandparent_ibd, ambiguous_metadata)
for name, _, ll, cons in grand_resolution:
    print(f"- {name}: Log-Likelihood = {ll:.2f}, Consistency = {cons:.1f}%")

# Now let's visualize the best hypothesis for each IBD pattern
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
best_half_sib_hypo = half_sib_resolution[0]
visualize_pedigree(best_half_sib_hypo[1], 
                  f"Best for Half-Sib IBD: {best_half_sib_hypo[0]}\nLL={best_half_sib_hypo[2]:.2f}", 
                  individual_metadata=ambiguous_metadata)

plt.subplot(1, 2, 2)
best_grand_hypo = grand_resolution[0]
visualize_pedigree(best_grand_hypo[1], 
                  f"Best for Grandparent IBD: {best_grand_hypo[0]}\nLL={best_grand_hypo[2]:.2f}", 
                  individual_metadata=ambiguous_metadata)

plt.tight_layout()
plt.show()
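
The pruning rule inside `resolve_multiple_hypotheses` can also be studied in isolation on precomputed scores. The sketch below is a minimal stand-alone version: `prune_hypotheses` and `relative_weights` are hypothetical helpers, the scores are placeholders, and the weight calculation assumes natural-log likelihoods with equal prior weight on each hypothesis. Under those assumptions, a gap of 3.0 corresponds to a likelihood ratio of `exp(3)`, roughly 20 to 1.

```python
import math

def prune_hypotheses(scored, min_likelihood_diff=3.0, max_hypotheses=3):
    """Keep the best hypothesis plus any within min_likelihood_diff of it."""
    if not scored:
        return []
    # Rank (name, log_likelihood) pairs from best to worst
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    best_ll = ranked[0][1]
    close = [item for item in ranked if best_ll - item[1] < min_likelihood_diff]
    # The best hypothesis is always retained, even if max_hypotheses < 1
    return close[:max(1, max_hypotheses)]

def relative_weights(scored):
    """Normalize exp(log-likelihood) across hypotheses (equal priors assumed)."""
    max_ll = max(ll for _, ll in scored)
    # Subtract the maximum before exponentiating to avoid numerical underflow
    unnorm = {name: math.exp(ll - max_ll) for name, ll in scored}
    total = sum(unnorm.values())
    return {name: w / total for name, w in unnorm.items()}

# Placeholder scores for illustration only (not results from this lab)
toy_scores = [("A", -10.0), ("B", -11.5), ("C", -20.0)]
kept = prune_hypotheses(toy_scores)
weights = relative_weights(kept)
print(kept)
print(weights)
```

Here hypothesis C is dropped because it trails the best score by more than the threshold, while A and B are both retained as plausible explanations. Reporting relative weights alongside the ranking is one way to make clear how strongly the data favor the top hypothesis.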

## Summary

In this lab, we've explored techniques for optimizing small pedigree configurations in Bonsai v3. Key takeaways include:

1. **Evaluation Methods**: We evaluated how well different pedigree configurations explain IBD data, using both simplified implementations and, where available, Bonsai v3's actual functions.

2. **Systematic Search**: We implemented a greedy search algorithm to systematically explore the space of possible pedigree configurations and find a high-likelihood one. Because the search is greedy, it is not guaranteed to reach the global optimum.

3. **Multiple Hypotheses**: We learned how to handle ambiguous cases where multiple configurations might explain the genetic data equally well, and how to incorporate additional constraints like age and sex to resolve these ambiguities.

4. **Bonsai's Approach**: We explored Bonsai v3's approach to pedigree optimization, including its specialized functions for generating and evaluating alternative configurations.

These techniques form the foundation for scaling up to larger pedigrees in real-world applications, which we'll explore in future labs. The ability to optimize small pedigree configurations efficiently is crucial for constructing accurate family trees from genetic data, especially when dealing with ambiguous relationships and incomplete information.

In [ ]:
# Convert this notebook to PDF using poetry
!poetry run jupyter nbconvert --to pdf Lab14_Optimizing_Pedigrees.ipynb

# Note: PDF conversion requires LaTeX to be installed on your system
# If you encounter errors, you may need to install it:
# On Ubuntu/Debian: sudo apt-get install texlive-xetex
# On macOS with Homebrew: brew install texlive