# Lab 15: The combine_up_dicts() Algorithm

## Overview

In this lab, we'll explore the `combine_up_dicts()` algorithm, which is at the heart of Bonsai v3's ability to scale from small pedigree structures to larger, more complex family networks. This algorithm systematically merges smaller pedigrees based on genetic evidence, using an iterative, likelihood-based approach to find the most plausible connections between family units.

In [None]:
# Standard imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown
import inspect
import importlib
import copy
import random
import math
from collections import defaultdict

sys.path.append(os.path.dirname(os.getcwd()))

# Cross-compatibility setup
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite, save_results, save_plot

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [None]:
# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
def display_module_classes(module_name):
    """Display classes and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all classes
        classes = inspect.getmembers(module, inspect.isclass)
        
        # Filter classes defined in this module (not imported)
        classes = [(name, cls) for name, cls in classes if cls.__module__ == module_name]
        
        # Print info for each class
        for name, cls in classes:
            print(f"\n## {name}")
            
            # Get docstring
            doc = inspect.getdoc(cls)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
            
            # Get methods
            methods = inspect.getmembers(cls, inspect.isfunction)
            if methods:
                print("\nMethods:")
                for method_name, method in methods:
                    if not method_name.startswith('_'):  # Skip private methods
                        print(f"- {method_name}")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def display_module_functions(module_name):
    """Display functions and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all functions
        functions = inspect.getmembers(module, inspect.isfunction)
        
        # Filter functions defined in this module (not imported)
        functions = [(name, func) for name, func in functions if func.__module__ == module_name]
        
        # Print info for each function
        for name, func in functions:
            if name.startswith('_'):  # Skip private functions
                continue
                
            print(f"\n## {name}")
            
            # Get signature
            sig = inspect.signature(func)
            print(f"Signature: {name}{sig}")
            
            # Get docstring
            doc = inspect.getdoc(func)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def view_source(obj):
    """Display the source code of an object (function or class)"""
    try:
        source = inspect.getsource(obj)
        display(Markdown(f"```python\n{source}\n```"))
    except Exception as e:
        print(f"Error retrieving source: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [None]:
try:
    from utils.bonsaitree.bonsaitree import v3
    print("✅ Successfully imported Bonsai v3 module")
except ImportError as e:
    print(f"❌ Failed to import Bonsai v3 module: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")
    print("Make sure you've properly set up your environment with the Bonsai repository.")

## Lab 15: The combine_up_dicts() Algorithm

In this lab, we'll explore the `combine_up_dicts()` algorithm, which is the central mechanism in Bonsai v3 for scaling from small pedigree structures to larger, more complex family networks. This algorithm is essential for reconstructing large pedigrees that would be computationally infeasible to optimize directly.

The `combine_up_dicts()` function implements a bottom-up, incremental approach to pedigree reconstruction, where:

1. Small, high-confidence pedigree units (often single individuals or family units) are identified first
2. These units are progressively merged based on genetic evidence, starting with the most closely related units
3. Multiple alternative pedigree configurations are maintained and evaluated at each step
4. The process continues until all units are connected or no more reliable connections can be made

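The number of candidate pedigrees grows combinatorially with the number of individuals, so searching all of them at once is impractical. Merging a few units at a time and keeping only the top-scoring candidates keeps the search tractable while still exploring multiple hypotheses about how individuals are related.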

## Part 1: Understanding the combine_up_dicts() Function

Let's start by examining the `combine_up_dicts()` function in the Bonsai v3 codebase. This function orchestrates the entire process of merging small pedigrees into larger structures.

In [None]:
# Import the combine_up_dicts function from Bonsai v3
if not is_jupyterlite():
    from utils.bonsaitree.bonsaitree.v3.connections import combine_up_dicts
    
    # Display the source code of the function
    print("Source code for combine_up_dicts:")
    view_source(combine_up_dicts)
else:
    print("Cannot display source code in JupyterLite environment.")
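Before going further, it helps to fix the pedigree representation. Judging from the simplified code used in the rest of this lab, an up-node dictionary maps each individual to a dictionary of its parents (each with value 1), and negative IDs stand for ungenotyped individuals. The sketch below works under those assumptions; the helper names `parents_of` and `all_ids` are hypothetical and not part of Bonsai.

```python
# Two genotyped full siblings (1 and 2) sharing two ungenotyped parents (-1 and -2).
up_dict = {
    1: {-1: 1, -2: 1},
    2: {-1: 1, -2: 1},
}

def parents_of(up_dict, node):
    """Return the set of parents recorded for a node (empty if none)."""
    return set(up_dict.get(node, {}))

def all_ids(up_dict):
    """Return every ID that appears as a child or as a parent."""
    ids = set(up_dict)
    for parents in up_dict.values():
        ids.update(parents)
    return ids

# Positive IDs are genotyped, negative IDs are ungenotyped (assumed convention)
genotyped = sorted(i for i in all_ids(up_dict) if i > 0)
ungenotyped = sorted(i for i in all_ids(up_dict) if i < 0)
print(parents_of(up_dict, 1), genotyped, ungenotyped)
```

A pedigree holding a single unconnected individual is then simply `{1: {}}`.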

### 1.1 The Core Algorithm Structure

The `combine_up_dicts()` function implements an iterative algorithm for incrementally combining pedigrees. Let's break down its key components:

1. **Initialization**: Setting up data structures to track pedigrees and their relationships
   - `idx_to_up_dict_ll_list`: Maps pedigree indices to lists of (pedigree, likelihood) pairs
   - `id_to_idx`: Maps individual IDs to the pedigree they belong to
   - `idx_to_id_set`: Maps pedigree indices to sets of individual IDs contained in each pedigree

2. **Main Iteration Loop**: Repeatedly finding and merging the closest pedigrees until only one remains or no more merges are possible

3. **Finding Closest Pedigrees**: Identifying which pedigrees share the most IBD and should be merged next

4. **Merging Pedigrees**: Combining two pedigrees through their most likely connection points

5. **Evaluating Combinations**: Calculating likelihoods for different ways of connecting pedigrees

6. **Maintaining Multiple Hypotheses**: Keeping track of multiple candidate pedigrees at each step

Let's implement a simplified version of this algorithm to see how it works:

In [None]:
def simplified_combine_up_dicts(id_to_up_dct, id_to_shared_ibd, id_to_info=None, n_keep=3):
    """
    Simplified implementation of combine_up_dicts algorithm.
    
    Args:
        id_to_up_dct: Dict mapping each ID to its starting pedigree, an up-node
            dictionary that contains the ID itself as a key (e.g. {1: {1: {}}})
        id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
        id_to_info: Dict mapping IDs to their demographic information
        n_keep: Number of top pedigrees to keep at each step
        
    Returns:
        List of (pedigree, likelihood) pairs for the final pedigrees
    """
    id_to_info = id_to_info or {}
    
    # Initialize pedigree tracking structures
    idx_to_up_dict_ll_list = {}  # Maps pedigree indices to (pedigree, likelihood) lists
    id_to_idx = {}               # Maps individual IDs to their pedigree index
    idx_to_id_set = {}           # Maps pedigree indices to sets of contained IDs
    
    # Initialize with individual pedigrees
    for i, (id_val, up_dict) in enumerate(id_to_up_dct.items()):
        idx_to_up_dict_ll_list[i] = [(up_dict, 0.0)]  # Initial likelihood = 0
        id_to_idx[id_val] = i
        idx_to_id_set[i] = {id_val}
    
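    # Calculate total IBD sharing for each pair.
    # Keys are normalized to (min, max) because the helper functions
    # look pairs up in that order.
    id_pair_to_cm = {}
    for (id1, id2), segments in id_to_shared_ibd.items():
        total_cm = sum(seg.get('length_cm', 0) for seg in segments)
        pair = (min(id1, id2), max(id1, id2))
        id_pair_to_cm[pair] = id_pair_to_cm.get(pair, 0) + total_cm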
    
    # Main iteration loop
    while len(idx_to_up_dict_ll_list) > 1:
        # Find closest pedigrees to merge
        closest_pair, max_ibd = find_closest_pedigrees(idx_to_id_set, id_pair_to_cm)
        
        if closest_pair is None or max_ibd < 20:  # Minimum IBD threshold
            break  # No more pedigrees to merge
            
        idx1, idx2 = closest_pair
        
        # Get pedigrees to merge
        ped_ll_list1 = idx_to_up_dict_ll_list[idx1]
        ped_ll_list2 = idx_to_up_dict_ll_list[idx2]
        
        # Try all combinations of pedigrees
        all_combined = []
        for ped1, ll1 in ped_ll_list1:
            for ped2, ll2 in ped_ll_list2:
                # Find ways to combine these pedigrees
                combined_peds = combine_pedigrees_simple(ped1, ped2, id_pair_to_cm)
                
                for combined_ped, new_ll in combined_peds:
                    # Add the new pedigree with combined likelihood
                    all_combined.append((combined_ped, ll1 + ll2 + new_ll))
        
        # Sort by likelihood and keep top n_keep
        all_combined.sort(key=lambda x: x[1], reverse=True)
        kept_combined = all_combined[:n_keep]
        
        # Create new index for merged pedigree
        new_idx = max(idx_to_up_dict_ll_list.keys()) + 1
        
        # Store combined pedigrees under new index
        idx_to_up_dict_ll_list[new_idx] = kept_combined
        
        # Update id_to_idx and idx_to_id_set
        id_set1 = idx_to_id_set[idx1]
        id_set2 = idx_to_id_set[idx2]
        merged_id_set = id_set1.union(id_set2)
        
        for id_val in merged_id_set:
            id_to_idx[id_val] = new_idx
            
        idx_to_id_set[new_idx] = merged_id_set
        
        # Remove old pedigree records
        del idx_to_up_dict_ll_list[idx1]
        del idx_to_up_dict_ll_list[idx2]
        del idx_to_id_set[idx1]
        del idx_to_id_set[idx2]
    
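    # Return the candidate list for the largest remaining pedigree.
    # If the loop stopped early, several disconnected pedigrees may remain,
    # and the first dictionary entry could be an unmerged singleton.
    if idx_to_up_dict_ll_list:
        final_idx = max(idx_to_id_set, key=lambda idx: len(idx_to_id_set[idx]))
        return idx_to_up_dict_ll_list[final_idx]
    else:
        return []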

def find_closest_pedigrees(idx_to_id_set, id_pair_to_cm):
    """
    Find the pair of pedigrees that share the most IBD.
    
    Args:
        idx_to_id_set: Dict mapping pedigree indices to sets of contained IDs
        id_pair_to_cm: Dict mapping ID pairs to their total shared cM
        
    Returns:
        closest_pair: Tuple of (idx1, idx2) for the closest pedigrees
        max_ibd: Total IBD sharing between the closest pedigrees
    """
    # Calculate IBD sharing between pedigrees
    ped_pair_to_ibd = defaultdict(float)
    
    # For each pair of pedigrees
    pedigree_indices = list(idx_to_id_set.keys())
    for i in range(len(pedigree_indices)):
        idx1 = pedigree_indices[i]
        id_set1 = idx_to_id_set[idx1]
        
        for j in range(i + 1, len(pedigree_indices)):
            idx2 = pedigree_indices[j]
            id_set2 = idx_to_id_set[idx2]
            
            # Calculate total IBD between individuals in different pedigrees
            for id1 in id_set1:
                for id2 in id_set2:
                    pair = (min(id1, id2), max(id1, id2))
                    if pair in id_pair_to_cm:
                        ped_pair_to_ibd[(idx1, idx2)] += id_pair_to_cm[pair]
    
    # Find the pair with maximum IBD
    if not ped_pair_to_ibd:
        return None, 0
        
    max_pair = max(ped_pair_to_ibd.items(), key=lambda x: x[1])
    return max_pair[0], max_pair[1]

def combine_pedigrees_simple(ped1, ped2, id_pair_to_cm):
    """
    Simple implementation of pedigree combination.
    
    Args:
        ped1, ped2: Pedigrees to combine (up-node dictionaries)
        id_pair_to_cm: Dict mapping ID pairs to their total shared cM
        
    Returns:
        List of (combined_pedigree, likelihood) pairs
    """
    # Simplified implementation - just merge the pedigrees without proper connections
    combined = copy.deepcopy(ped1)
    
    # Add all nodes from ped2 not already in combined
    for node, parents in ped2.items():
        if node not in combined:
            combined[node] = parents.copy()
        else:
            # Merge parents
            for parent, deg in parents.items():
                combined[node][parent] = deg
    
    # Calculate a simple likelihood based on total IBD
    ids1 = set(ped1.keys())
    ids2 = set(ped2.keys())
    
    total_ibd = 0
    for id1 in ids1:
        for id2 in ids2:
            pair = (min(id1, id2), max(id1, id2))
            if pair in id_pair_to_cm:
                total_ibd += id_pair_to_cm[pair]
    
    # Simple likelihood score based on total IBD
    likelihood = math.log(1 + total_ibd)
    
    return [(combined, likelihood)]
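The pruning step in the main loop (add the log-likelihoods, sort, keep the top `n_keep`) is a small beam search. The standalone sketch below isolates it, using string labels in place of pedigrees and made-up scores; `prune_candidates` and `connection_ll` are hypothetical names chosen for illustration.

```python
import heapq
from itertools import product

def prune_candidates(cands1, cands2, connection_ll, n_keep=3):
    """Score every pairing of candidates and keep the n_keep best.

    cands1, cands2: lists of (label, log_likelihood) pairs
    connection_ll: function (label1, label2) -> log-likelihood of joining them
    """
    combined = [
        ((lab1, lab2), ll1 + ll2 + connection_ll(lab1, lab2))
        for (lab1, ll1), (lab2, ll2) in product(cands1, cands2)
    ]
    # nlargest returns the entries sorted from best to worst score
    return heapq.nlargest(n_keep, combined, key=lambda item: item[1])

cands1 = [("A1", -1.0), ("A2", -2.5)]
cands2 = [("B1", -0.5), ("B2", -3.0)]

# Made-up connection scores for each pairing
scores = {
    ("A1", "B1"): -4.0, ("A1", "B2"): -1.0,
    ("A2", "B1"): -0.5, ("A2", "B2"): -6.0,
}
kept = prune_candidates(cands1, cands2, lambda a, b: scores[(a, b)], n_keep=2)
print(kept)
```

The best pairing here combines `A2` with `B1` even though `A1` scored higher on its own, which is why keeping more than one candidate per pedigree can matter.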

This simplified implementation captures the essence of the `combine_up_dicts()` algorithm, but lacks many of the sophisticated features of the real Bonsai v3 implementation, such as:

- Advanced likelihood calculations based on proper relationship inference
- Sophisticated mechanisms for finding the optimal way to connect pedigrees
- Age and sex constraint enforcement
- Handling of genotyped vs. ungenotyped individuals
- Comprehensive error handling
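
As one illustration of what is missing, an age and sex check could reject implausible parent assignments before any likelihood is computed. The sketch below is not Bonsai's implementation; the `id_to_info` layout (`age` and `sex` keys), the `min_parent_gap` value, and the function name are all assumptions made for the example.

```python
def violates_constraints(up_dict, id_to_info, min_parent_gap=12):
    """Return a list of human-readable constraint violations in a pedigree.

    up_dict: {child: {parent: 1}}
    id_to_info: {id: {"age": int, "sex": "M" or "F"}} (hypothetical layout)
    min_parent_gap: assumed minimum age difference between parent and child
    """
    problems = []
    for child, parents in up_dict.items():
        child_age = id_to_info.get(child, {}).get("age")
        for parent in parents:
            parent_age = id_to_info.get(parent, {}).get("age")
            # Only check ages when both are known
            if child_age is not None and parent_age is not None:
                if parent_age - child_age < min_parent_gap:
                    problems.append(f"{parent} is too young to be a parent of {child}")
        # Two known parents with the same recorded sex are inconsistent
        sexes = [id_to_info[p]["sex"] for p in parents
                 if id_to_info.get(p, {}).get("sex") is not None]
        if len(sexes) == 2 and sexes[0] == sexes[1]:
            problems.append(f"both parents of {child} have sex {sexes[0]}")
    return problems

info = {
    1: {"age": 30, "sex": "F"},
    2: {"age": 35, "sex": "M"},
    3: {"age": 60, "sex": "M"},
}
ok = violates_constraints({1: {3: 1}}, info)
bad = violates_constraints({1: {2: 1, 3: 1}}, info)
print(ok, bad)
```

IDs with no entry in `id_to_info` (for example ungenotyped ancestors) are simply skipped, so the check only fires when the data actually contradicts the proposed pedigree.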

Next, let's import the key functions that `combine_up_dicts()` relies on and examine how they work together in the real Bonsai implementation.

In [None]:
# Import key functions used by combine_up_dicts
if not is_jupyterlite():
    try:
        from utils.bonsaitree.bonsaitree.v3.connections import (
            combine_pedigrees,
            find_closest_pedigrees,
            get_connecting_points_degs_and_log_likes
        )
        
        # Display the source code of these functions
        print("1. find_closest_pedigrees:")
        view_source(find_closest_pedigrees)
        
        print("\n2. combine_pedigrees:")
        view_source(combine_pedigrees)
        
        print("\n3. get_connecting_points_degs_and_log_likes:")
        view_source(get_connecting_points_degs_and_log_likes)
    except (ImportError, AttributeError) as e:
        print(f"Could not import functions: {e}")
else:
    print("Cannot display source code in JupyterLite environment.")

## Part 2: Tracking and Merging Pedigrees

Now that we've examined the core algorithm, let's focus on two critical aspects of `combine_up_dicts()`:

1. How it tracks and manages pedigrees during the merging process
2. How it determines the optimal way to merge pedigrees

### 2.1 Pedigree Tracking Data Structures

The `combine_up_dicts()` function uses several data structures to track pedigrees and their relationships during the merging process. Let's implement simplified versions of these data structures to see how they work:

In [None]:
def initialize_pedigree_tracking(id_to_up_dct):
    """
    Initialize pedigree tracking data structures.
    
    Args:
        id_to_up_dct: Dict mapping IDs to their pedigrees
        
    Returns:
        idx_to_up_dict_ll_list: Maps pedigree indices to (pedigree, likelihood) lists
        id_to_idx: Maps individual IDs to their pedigree index
        idx_to_id_set: Maps pedigree indices to sets of contained IDs
    """
    idx_to_up_dict_ll_list = {}
    id_to_idx = {}
    idx_to_id_set = {}
    
    # Initialize with individual pedigrees
    for i, (id_val, up_dict) in enumerate(id_to_up_dct.items()):
        idx_to_up_dict_ll_list[i] = [(up_dict, 0.0)]  # Initial likelihood = 0
        id_to_idx[id_val] = i
        idx_to_id_set[i] = {id_val}
    
    return idx_to_up_dict_ll_list, id_to_idx, idx_to_id_set

def update_pedigree_tracking(idx_to_up_dict_ll_list, id_to_idx, idx_to_id_set, 
                             idx1, idx2, kept_combined):
    """
    Update pedigree tracking data structures after merging two pedigrees.
    
    Args:
        idx_to_up_dict_ll_list: Maps pedigree indices to (pedigree, likelihood) lists
        id_to_idx: Maps individual IDs to their pedigree index
        idx_to_id_set: Maps pedigree indices to sets of contained IDs
        idx1, idx2: Indices of the pedigrees being merged
        kept_combined: List of (pedigree, likelihood) pairs for the merged pedigree
        
    Returns:
        Updated tracking data structures
    """
    # Create a copy to avoid modifying the originals
    idx_to_up_dict_ll_list = idx_to_up_dict_ll_list.copy()
    id_to_idx = id_to_idx.copy()
    idx_to_id_set = idx_to_id_set.copy()
    
    # Create new index for merged pedigree
    new_idx = max(idx_to_up_dict_ll_list.keys()) + 1 if idx_to_up_dict_ll_list else 0
    
    # Store combined pedigrees under new index
    idx_to_up_dict_ll_list[new_idx] = kept_combined
    
    # Update id_to_idx and idx_to_id_set
    id_set1 = idx_to_id_set[idx1]
    id_set2 = idx_to_id_set[idx2]
    merged_id_set = id_set1.union(id_set2)
    
    for id_val in merged_id_set:
        id_to_idx[id_val] = new_idx
        
    idx_to_id_set[new_idx] = merged_id_set
    
    # Remove old pedigree records
    del idx_to_up_dict_ll_list[idx1]
    del idx_to_up_dict_ll_list[idx2]
    del idx_to_id_set[idx1]
    del idx_to_id_set[idx2]
    
    return idx_to_up_dict_ll_list, id_to_idx, idx_to_id_set

# Let's demonstrate how these functions work with a simple example
def demonstrate_pedigree_tracking():
    # Create some simple pedigrees
    id_to_up_dct = {
        1: {1: {}},  # Pedigree containing only individual 1
        2: {2: {}},  # Pedigree containing only individual 2
        3: {3: {}}   # Pedigree containing only individual 3
    }
    
    # Initialize tracking structures
    idx_to_up_dict_ll_list, id_to_idx, idx_to_id_set = initialize_pedigree_tracking(id_to_up_dct)
    
    print("Initial tracking structures:")
    print(f"id_to_idx: {id_to_idx}")
    print(f"idx_to_id_set: {idx_to_id_set}")
    print(f"idx_to_up_dict_ll_list keys: {list(idx_to_up_dict_ll_list.keys())}")
    
    # Merge pedigrees 0 and 1
    # In reality, this would be based on IBD sharing, but for simplicity we'll just merge them directly
    merged_pedigree = {1: {}, 2: {}}
    kept_combined = [(merged_pedigree, 10.0)]  # Merged pedigree with likelihood score
    
    # Update tracking structures
    idx_to_up_dict_ll_list, id_to_idx, idx_to_id_set = update_pedigree_tracking(
        idx_to_up_dict_ll_list, id_to_idx, idx_to_id_set, 0, 1, kept_combined)
    
    print("\nTracking structures after merging pedigrees 0 and 1:")
    print(f"id_to_idx: {id_to_idx}")
    print(f"idx_to_id_set: {idx_to_id_set}")
    print(f"idx_to_up_dict_ll_list keys: {list(idx_to_up_dict_ll_list.keys())}")
    
    # Merge the combined pedigree with pedigree 2
    final_pedigree = {1: {}, 2: {}, 3: {}}
    final_combined = [(final_pedigree, 15.0)]
    
    # Update tracking structures again
    idx_to_up_dict_ll_list, id_to_idx, idx_to_id_set = update_pedigree_tracking(
        idx_to_up_dict_ll_list, id_to_idx, idx_to_id_set, 3, 2, final_combined)
    
    print("\nTracking structures after merging all pedigrees:")
    print(f"id_to_idx: {id_to_idx}")
    print(f"idx_to_id_set: {idx_to_id_set}")
    print(f"idx_to_up_dict_ll_list keys: {list(idx_to_up_dict_ll_list.keys())}")
    print(f"Final pedigree likelihood: {idx_to_up_dict_ll_list[4][0][1]}")

# Run the demonstration
demonstrate_pedigree_tracking()
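
The three tracking structures must stay in sync after every merge. A small consistency check, written against the structure layout used above, makes the invariants explicit. The function name `check_tracking` is hypothetical.

```python
def check_tracking(idx_to_up_dict_ll_list, id_to_idx, idx_to_id_set):
    """Raise AssertionError if the tracking structures disagree."""
    # Both index-keyed dicts must describe the same set of live pedigrees
    assert set(idx_to_up_dict_ll_list) == set(idx_to_id_set)
    # Every individual must point at a live pedigree that contains it
    for id_val, idx in id_to_idx.items():
        assert idx in idx_to_id_set, f"ID {id_val} points at removed pedigree {idx}"
        assert id_val in idx_to_id_set[idx]
    # ID sets must be disjoint: their sizes sum to the number of tracked IDs
    assert sum(len(s) for s in idx_to_id_set.values()) == len(id_to_idx)
    # Every live pedigree needs at least one candidate
    assert all(idx_to_up_dict_ll_list[idx] for idx in idx_to_up_dict_ll_list)
    return True

# State after merging pedigrees 0 and 1 into new index 3
state_ok = check_tracking(
    {2: [({3: {}}, 0.0)], 3: [({1: {}, 2: {}}, 10.0)]},
    {1: 3, 2: 3, 3: 2},
    {2: {3}, 3: {1, 2}},
)
print(state_ok)
```

Calling a check like this after each update is a cheap way to catch a stale `id_to_idx` entry, such as an ID that still points at a deleted pedigree index.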

### 2.2 Finding Optimal Ways to Merge Pedigrees

A critical aspect of the `combine_up_dicts()` algorithm is determining the optimal way to merge pedigrees. This involves:

1. Identifying which pedigrees should be merged next based on IBD sharing
2. Finding potential connection points in each pedigree
3. Evaluating different relationship configurations for connecting these points
4. Selecting the connections that maximize the overall likelihood

Let's implement simplified versions of these steps:

In [None]:
def find_optimal_merge(ped1, ped2, id_pair_to_cm):
    """
    Find the optimal way to merge two pedigrees based on IBD sharing.
    
    Args:
        ped1, ped2: Pedigrees to merge (up-node dictionaries)
        id_pair_to_cm: Dict mapping ID pairs to their total shared cM
        
    Returns:
        best_merged: The best merged pedigree
        likelihood: The likelihood of the merged pedigree
    """
    # Get all IDs in each pedigree
    ids1 = set(ped1.keys())
    ids2 = set(ped2.keys())
    
    # Find pairs of individuals that share IBD across pedigrees
    connecting_pairs = []
    for id1 in ids1:
        for id2 in ids2:
            pair = (min(id1, id2), max(id1, id2))
            if pair in id_pair_to_cm and id_pair_to_cm[pair] > 20:  # Minimum threshold
                connecting_pairs.append((id1, id2, id_pair_to_cm[pair]))
    
    # Sort by amount of IBD sharing (descending)
    connecting_pairs.sort(key=lambda x: x[2], reverse=True)
    
    # If no connecting pairs, just merge without connections
    if not connecting_pairs:
        merged = merge_pedigrees_simple(ped1, ped2)
        return merged, 0.0
    
    # Try different relationship configurations for the top connecting pair.
    # Each tuple is (name, up, down, num_ancs): generations up from id1 to the
    # common ancestor(s), generations down to id2, and number of common ancestors.
    id1, id2, shared_cm = connecting_pairs[0]
    relationship_configs = [
        ("parent-child", 0, 1, 1),    # id1 is parent of id2
        ("parent-child", 1, 0, 1),    # id2 is parent of id1
        ("siblings", 1, 1, 2),        # id1 and id2 are full siblings
        ("half-siblings", 1, 1, 1),   # id1 and id2 are half siblings
        ("cousins", 2, 2, 2),         # id1 and id2 are first cousins
        ("unrelated", None, None, None)  # No direct relationship
    ]
    
    # Evaluate each relationship configuration
    best_merged = None
    best_likelihood = float('-inf')
    
    for name, up, down, num_ancs in relationship_configs:
        # Skip relationship if it doesn't match the IBD amount
        if not is_plausible_relationship(name, shared_cm):
            continue
            
        # Merge the pedigrees with this relationship configuration
        merged = merge_with_relationship(ped1, ped2, id1, id2, up, down, num_ancs)
        
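        # Calculate likelihood of the merged pedigree.
        # Note: this toy score depends only on which IDs are present, so every
        # plausible configuration ties and the first one tried is kept.
        likelihood = calculate_merged_likelihood(merged, id_pair_to_cm)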
        
        # Update best if this is better
        if likelihood > best_likelihood:
            best_merged = merged
            best_likelihood = likelihood
    
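    # Fall back to an unconnected merge if no configuration was plausible
    # (for example, shared_cm falls in a gap between the ranges checked above).
    if best_merged is None:
        return merge_pedigrees_simple(ped1, ped2), 0.0

    return best_merged, best_likelihood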

def is_plausible_relationship(relationship, shared_cm):
    """
    Check if a relationship is plausible given the amount of shared DNA.
    
    Args:
        relationship: String describing the relationship
        shared_cm: Amount of shared DNA in centimorgans
        
    Returns:
        is_plausible: Whether the relationship is plausible
    """
    # Illustrative cM windows for this toy example (not empirical ranges).
    # Trailing comments give the expected fraction of the genome shared.
    ranges = {
        "parent-child": (1700, 3400),  # 50%
        "siblings": (1700, 2800),      # ~50% on average (full siblings)
        "half-siblings": (700, 1800),  # ~25% on average
        "cousins": (200, 900),         # ~12.5% on average (first cousins)
        "unrelated": (0, 100)          # Very little sharing
    }
    
    if relationship in ranges:
        min_cm, max_cm = ranges[relationship]
        return min_cm <= shared_cm <= max_cm
    else:
        return True  # Unknown relationship, assume plausible

def merge_with_relationship(ped1, ped2, id1, id2, up, down, num_ancs):
    """
    Merge pedigrees with a specific relationship between id1 and id2.
    
    Args:
        ped1, ped2: Pedigrees to merge
        id1, id2: IDs to connect
        up, down, num_ancs: Relationship parameters
        
    Returns:
        merged: Merged pedigree
    """
    # Create copies to avoid modifying originals
    ped1 = copy.deepcopy(ped1)
    ped2 = copy.deepcopy(ped2)
    
    # Implement different relationship types
    if up == 0 and down == 1:  # id1 is parent of id2
        if id2 not in ped2:
            ped2[id2] = {}
        ped2[id2][id1] = 1  # Add id1 as parent of id2
        
    elif up == 1 and down == 0:  # id2 is parent of id1
        if id1 not in ped1:
            ped1[id1] = {}
        ped1[id1][id2] = 1  # Add id2 as parent of id1
        
    elif up == 1 and down == 1:  # siblings or half-siblings
        # Create a common parent with a fresh negative ID
        # (negative IDs mark ungenotyped individuals)
        used_ids = set(ped1) | set(ped2)
        for parents in list(ped1.values()) + list(ped2.values()):
            used_ids.update(parents)
        parent_id = min(min(used_ids, default=0), 0) - 1
        
        # Add parent to both individuals
        ped1.setdefault(id1, {})[parent_id] = 1
        ped2.setdefault(id2, {})[parent_id] = 1
        
        # For full siblings, add a second parent
        if num_ancs == 2:
            parent2_id = parent_id - 1
            ped1[id1][parent2_id] = 1
            ped2[id2][parent2_id] = 1
    
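    # Merge the modified pedigrees. Configurations not handled above
    # (cousins, unrelated) add no connecting edges in this simplified version.
    merged = merge_pedigrees_simple(ped1, ped2)
    return merged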

def merge_pedigrees_simple(ped1, ped2):
    """
    Merge two pedigrees without adding new relationships.
    
    Args:
        ped1, ped2: Pedigrees to merge
        
    Returns:
        merged: Merged pedigree
    """
    merged = copy.deepcopy(ped1)
    
    # Add all nodes from ped2
    for node, parents in ped2.items():
        if node not in merged:
            merged[node] = parents.copy()
        else:
            # Merge parents
            for parent, deg in parents.items():
                merged[node][parent] = deg
    
    return merged

def calculate_merged_likelihood(merged_ped, id_pair_to_cm):
    """
    Calculate likelihood of a merged pedigree.
    
    Args:
        merged_ped: Merged pedigree
        id_pair_to_cm: Dict mapping ID pairs to their shared cM
        
    Returns:
        likelihood: Log-likelihood of the pedigree
    """
    # Simple implementation - just sum the log of shared IBD amounts
    likelihood = 0.0
    
    # Get all pairs of IDs in the pedigree
    ids = list(merged_ped.keys())
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            id1, id2 = ids[i], ids[j]
            pair = (min(id1, id2), max(id1, id2))
            
            if pair in id_pair_to_cm:
                # Add contribution to likelihood
                shared_cm = id_pair_to_cm[pair]
                likelihood += math.log(1 + shared_cm)
    
    return likelihood
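
The `(up, down, num_ancs)` tuples in `relationship_configs` determine the expected fraction of the genome two people share: each common ancestor contributes `(1/2) ** (up + down)`, which is the standard coefficient of relationship. The short sketch below (the function name is hypothetical) computes the expected fractions behind the relationship names used above.

```python
def expected_shared_fraction(up, down, num_ancs):
    """Expected fraction of the genome shared through num_ancs common ancestors.

    up: generations from the first person up to the common ancestor(s)
    down: generations from the common ancestor(s) down to the second person
    """
    return num_ancs * 0.5 ** (up + down)

configs = {
    "parent-child": (0, 1, 1),
    "siblings": (1, 1, 2),
    "half-siblings": (1, 1, 1),
    "cousins": (2, 2, 2),
}
fractions = {name: expected_shared_fraction(*cfg) for name, cfg in configs.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.3f}")
```

Parent-child and full siblings both come out at 0.5, so total sharing alone cannot separate them. That is one reason the real implementation also considers segment patterns rather than only total cM.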

Let's create a small example to demonstrate how these functions work together to find the optimal way to merge pedigrees:

In [None]:
def demonstrate_optimal_merge():
    # Create two simple pedigrees
    ped1 = {1: {}}  # Individual 1
    ped2 = {2: {}}  # Individual 2
    
    # Create IBD sharing data: 1800 cM falls inside the parent-child range
    # used by is_plausible_relationship (it also fits siblings and half-siblings)
    id_pair_to_cm = {(1, 2): 1800}
    
    # Find the optimal way to merge these pedigrees
    merged, likelihood = find_optimal_merge(ped1, ped2, id_pair_to_cm)
    
    print("Merge example 1: Parent-child relationship")
    print(f"Pedigree 1: {ped1}")
    print(f"Pedigree 2: {ped2}")
    print(f"IBD sharing: {id_pair_to_cm}")
    print(f"Merged pedigree: {merged}")
    print(f"Likelihood: {likelihood}")
    
    # Create another example with half-sibling relationship
    ped3 = {3: {}}  # Individual 3
    ped4 = {4: {}}  # Individual 4
    
    # Create IBD sharing data: 900 cM falls inside the half-sibling range
    # used by is_plausible_relationship (it also fits cousins)
    id_pair_to_cm = {(3, 4): 900}
    
    # Find the optimal way to merge these pedigrees
    merged2, likelihood2 = find_optimal_merge(ped3, ped4, id_pair_to_cm)
    
    print("\nMerge example 2: Half-sibling relationship")
    print(f"Pedigree 3: {ped3}")
    print(f"Pedigree 4: {ped4}")
    print(f"IBD sharing: {id_pair_to_cm}")
    print(f"Merged pedigree: {merged2}")
    print(f"Likelihood: {likelihood2}")

# Run the demonstration
demonstrate_optimal_merge()

## Part 3: Putting It All Together - The Incremental Process

Now let's run the simplified `combine_up_dicts()` implementation from Part 1 end to end on simulated IBD data and visualize the result:

In [None]:
def simulate_ibd_data(num_individuals=5, density=0.3):
    """
    Simulate IBD data for a set of individuals.
    
    Args:
        num_individuals: Number of individuals to simulate
        density: Probability of IBD sharing between pairs
        
    Returns:
        id_to_up_dct: Dict mapping IDs to their pedigrees
        id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
    """
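    # Create individual pedigrees. Each up-node dictionary must contain the
    # individual itself as a key, otherwise the merged pedigree would be empty.
    id_to_up_dct = {i: {i: {}} for i in range(1, num_individuals + 1)}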
    
    # Simulate IBD sharing
    id_to_shared_ibd = {}
    for i in range(1, num_individuals + 1):
        for j in range(i + 1, num_individuals + 1):
            # Randomly decide if this pair shares IBD
            if random.random() < density:
                # Simulate amount of sharing (random value between 50 and 2000 cM)
                shared_cm = random.uniform(50, 2000)
                
                # Create a dummy segment
                segment = {"start": 0, "end": 100, "length_cm": shared_cm}
                id_to_shared_ibd[(i, j)] = [segment]
    
    return id_to_up_dct, id_to_shared_ibd

def visualize_pedigree(up_node_dict, title="Pedigree"):
    """
    Visualize a pedigree using networkx.
    
    Args:
        up_node_dict: Up-node dictionary representing the pedigree
        title: Title for the visualization
    """
    # Create a directed graph (edges point from child to parent)
    G = nx.DiGraph()
    
    # Add all nodes to the graph
    all_ids = set(up_node_dict.keys())
    for parents in up_node_dict.values():
        all_ids.update(parents.keys())
    
    # Add nodes with colors (blue for genotyped, gray for ungenotyped)
    for node_id in all_ids:
        if node_id > 0:  # Genotyped individuals have positive IDs
            G.add_node(node_id, color='lightblue')
        else:  # Ungenotyped individuals have negative IDs
            G.add_node(node_id, color='lightgray')
    
    # Add edges (from child to parent)
    for child, parents in up_node_dict.items():
        for parent in parents:
            G.add_edge(child, parent)
    
    # Create plot
    plt.figure(figsize=(10, 6))
    plt.title(title)
    
    # Get node colors
    node_colors = [G.nodes[n]['color'] for n in G.nodes]
    
    # Set layout (force-directed spring layout; fixed seed for reproducibility)
    pos = nx.spring_layout(G, seed=42)
    
    # Draw the graph
    nx.draw(G, pos, with_labels=True, node_color=node_colors, 
            node_size=800, font_weight='bold')
    
    plt.tight_layout()
    plt.show()

def run_combine_up_dicts_demo():
    # Generate simulated IBD data
    id_to_up_dct, id_to_shared_ibd = simulate_ibd_data(num_individuals=5, density=0.3)
    
    # Display the IBD sharing data
    print("Simulated IBD sharing:")
    for (id1, id2), segments in id_to_shared_ibd.items():
        total_cm = sum(seg["length_cm"] for seg in segments)
        print(f"Individuals {id1} and {id2} share {total_cm:.1f} cM")
    
    # Run our simplified combine_up_dicts implementation
    final_pedigrees = simplified_combine_up_dicts(id_to_up_dct, id_to_shared_ibd)
    
    if final_pedigrees:
        best_pedigree, likelihood = final_pedigrees[0]
        print(f"\nBest pedigree has likelihood {likelihood:.2f}")
        
        # Visualize the pedigree
        visualize_pedigree(best_pedigree, "Final Merged Pedigree")
    else:
        print("\nNo valid pedigrees found")

# Run the demonstration
np.random.seed(42)  # For reproducibility
random.seed(42)
run_combine_up_dicts_demo()

### 3.1 Comparison with the Real Bonsai v3 Implementation

Our simplified implementation captures the core loop of the `combine_up_dicts()` algorithm, but the real Bonsai v3 implementation adds features that make it more accurate and robust. Key differences include:

1. **Advanced Relationship Inference**: The real implementation uses statistical models that infer relationships from the number and lengths of IBD segments, not just the total amount of sharing.

2. **Comprehensive Likelihood Calculation**: The Bonsai v3 likelihood calculation integrates genetic evidence with biological constraints like age, sex, and generation gaps.

3. **Multiple Hypothesis Tracking**: The real implementation maintains multiple alternative hypotheses about how individuals might be related, allowing it to recover from local optima.

4. **Connection Point Selection**: Bonsai v3 searches for the best points at which to connect pedigrees, considering both genetic evidence and structural constraints.

5. **Error Handling and Edge Cases**: The production code includes robust handling of edge cases like conflicting evidence or incompatible pedigrees.
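
To make the multiple hypothesis tracking in item 3 concrete, here is a minimal sketch of keeping only the best few candidate pedigrees after a merge step. The helper name `keep_top_hypotheses`, the `max_keep` parameter, and the log-likelihood values are illustrative assumptions, not part of the Bonsai v3 API.

```python
import heapq

def keep_top_hypotheses(candidates, max_keep=3):
    """Return the max_keep highest-scoring (pedigree, log_like) pairs, best first.

    Hypothetical helper for illustration only; not a Bonsai v3 function.
    """
    return heapq.nlargest(max_keep, candidates, key=lambda pair: pair[1])

# Three made-up hypotheses about individuals 1 and 2, written as up-node
# dictionaries {child: {parent: 1}}; the log-likelihoods are invented.
candidates = [
    ({1: {-1: 1}, 2: {-1: 1}}, -12.4),  # 1 and 2 share ungenotyped parent -1
    ({2: {1: 1}}, -15.9),               # 1 is a parent of 2
    ({1: {2: 1}}, -30.2),               # 2 is a parent of 1
]

top = keep_top_hypotheses(candidates, max_keep=2)
for pedigree, log_like in top:
    print(f"{log_like:7.1f}  {pedigree}")
```

Keeping more than one candidate means that a configuration ranked second at this step can still win later, once additional relatives are merged in and the likelihoods are recomputed.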

Let's see if we can use the real Bonsai v3 implementation directly on our simulated data:

In [None]:
if not is_jupyterlite():
    # Generate simulated data before attempting any imports, so that the
    # fallback branches below can always use it
    id_to_up_dct, id_to_shared_ibd = simulate_ibd_data(num_individuals=5, density=0.3)
    
    try:
        # Import the real combine_up_dicts function
        from utils.bonsaitree.bonsaitree.v3.connections import combine_up_dicts as real_combine_up_dicts
        
        
        # Convert IBD data to Bonsai v3 format
        unphased_ibd_seg_list = []
        for (id1, id2), segments in id_to_shared_ibd.items():
            for segment in segments:
                # Create segment in Bonsai format
                bonsai_segment = {
                    "id1": id1,
                    "id2": id2,
                    "chrom": 1,  # Dummy chromosome
                    "start_cm": 0,
                    "end_cm": segment["length_cm"],
                    "length_cm": segment["length_cm"]
                }
                unphased_ibd_seg_list.append(bonsai_segment)
        
        # Create biographical information
        bio_info = []
        for id_val in range(1, 6):
            info = {
                "id": id_val,
                "sex": random.choice(["M", "F"]),
                "age": random.randint(20, 80)
            }
            bio_info.append(info)
        
        # Try to run the real combine_up_dicts function
        try:
            from utils.bonsaitree.bonsaitree.v3.pw_log_like import PwLogLike
            
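            # Create a PwLogLike instance as a sanity check that the inputs are
            # accepted (the instance itself is not passed on below)
            pw_ll = PwLogLike(bio_info=bio_info, unphased_ibd_seg_list=unphased_ibd_seg_list)
            
            # Run the real combine_up_dicts function. The keyword arguments here
            # are this lab's guess at the interface; if the real signature differs,
            # the except branch below falls back to the simplified version.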
            result = real_combine_up_dicts(
                unphased_ibd_seg_list=unphased_ibd_seg_list,
                bio_info=bio_info,
                pw_ll_cls=PwLogLike
            )
            
            print("Successfully ran real Bonsai v3 combine_up_dicts function")
            print(f"Result: {result}")
            
            # Visualize the result if available
            if result and len(result) > 0 and len(result[0]) > 0:
                best_pedigree = result[0][0]
                visualize_pedigree(best_pedigree, "Bonsai v3 Result")
            
        except Exception as e:
            print(f"Error running real combine_up_dicts: {e}")
            print("Using simplified implementation instead")
            
            # Run our simplified implementation
            result = simplified_combine_up_dicts(id_to_up_dct, id_to_shared_ibd)
            
            if result:
                best_pedigree, likelihood = result[0]
                print(f"Simplified result: {best_pedigree}")
                print(f"Likelihood: {likelihood}")
                visualize_pedigree(best_pedigree, "Simplified Result")
            
    except ImportError as e:
        print(f"Could not import Bonsai v3 functions: {e}")
        print("Using simplified implementation instead")
        
        # Run our simplified implementation
        result = simplified_combine_up_dicts(id_to_up_dct, id_to_shared_ibd)
        
        if result:
            best_pedigree, likelihood = result[0]
            print(f"Simplified result: {best_pedigree}")
            print(f"Likelihood: {likelihood}")
            visualize_pedigree(best_pedigree, "Simplified Result")
else:
    print("Running in JupyterLite environment - skipping real Bonsai v3 implementation")

## Summary

In this lab, we explored the `combine_up_dicts()` algorithm, which is at the heart of Bonsai v3's ability to scale from small pedigree structures to larger, more complex family networks. Key takeaways include:

1. **Iterative Merge Approach**: The algorithm works by repeatedly finding and merging the most closely related pedigrees until all individuals are connected or no more reliable connections can be made.

2. **Pedigree Tracking**: Dedicated data structures track each pedigree, its likelihood, and the individuals it contains throughout the merging process.

3. **Finding Optimal Connections**: The algorithm systematically evaluates different ways to connect pedigrees, selecting the connections that maximize the overall likelihood.

4. **Multiple Hypothesis Tracking**: By maintaining multiple alternative pedigree configurations at each step, the algorithm can explore different hypotheses about how individuals might be related.

5. **Comprehensive Likelihood Calculation**: The likelihood calculation integrates genetic evidence with biological constraints to evaluate different pedigree configurations.

This incremental, likelihood-based approach is what enables Bonsai v3 to reconstruct complex family networks from genetic data, even in the presence of noise, missing data, and ambiguous relationships. By dividing the reconstruction problem into smaller, more manageable pieces, the `combine_up_dicts()` algorithm makes it possible to scale to larger pedigrees while maintaining computational feasibility and accuracy.
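
As a compact recap of the iterative merge approach, the sketch below greedily joins groups of individuals until no remaining link is strong enough. It is a simplification under stated assumptions: total shared cM stands in for the likelihood that Bonsai v3 actually computes, pedigrees are reduced to plain sets of IDs, and `greedy_merge` and the `min_cm` cutoff are hypothetical names chosen for this example.

```python
from itertools import combinations

def greedy_merge(groups, pair_cm, min_cm=50.0):
    """Greedily merge groups of IDs by total cross-group IBD sharing.

    groups:  list of sets of individual IDs (one set per pedigree)
    pair_cm: dict mapping (id1, id2) with id1 < id2 to shared cM
    min_cm:  assumed cutoff below which a link is treated as unreliable
    """
    groups = [set(g) for g in groups]

    def cross_cm(a, b):
        # Total cM shared between any member of a and any member of b
        return sum(pair_cm.get((min(i, j), max(i, j)), 0.0) for i in a for j in b)

    while len(groups) > 1:
        # Pick the pair of groups with the most shared IBD
        a, b = max(
            combinations(range(len(groups)), 2),
            key=lambda ij: cross_cm(groups[ij[0]], groups[ij[1]]),
        )
        if cross_cm(groups[a], groups[b]) < min_cm:
            break  # no remaining connection is reliable enough
        merged = groups[a] | groups[b]
        groups = [g for k, g in enumerate(groups) if k not in (a, b)] + [merged]
    return groups

# Made-up sharing: 1-2 and 2-3 are close relatives, 4-5 share very little
pair_cm = {(1, 2): 1700.0, (2, 3): 850.0, (4, 5): 20.0}
result = greedy_merge([{1}, {2}, {3}, {4}, {5}], pair_cm)
print(sorted(sorted(g) for g in result))
```

With these values, individuals 1, 2, and 3 end up in one group, while 4 and 5 stay separate because their 20 cM of sharing falls below the assumed cutoff. This mirrors the stopping rule described in the first takeaway.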

In [None]:
# Convert this notebook to PDF using poetry
!poetry run jupyter nbconvert --to pdf Lab15_Combine_Up_Dicts.ipynb

# Note: PDF conversion requires LaTeX to be installed on your system
# If you encounter errors, you may need to install it:
# On Ubuntu/Debian: sudo apt-get install texlive-xetex
# On macOS with Homebrew: brew install texlive