# Lab 15: The combine_up_dicts() Algorithm

## Overview

This notebook explores the `combine_up_dicts()` algorithm, which is a core component of the Bonsai v3 framework for merging pedigree structures. This algorithm is essential for scaling pedigree reconstruction from small family units to large, complex family trees.

In this lab, we'll:
1. Examine the actual implementation of `combine_up_dicts()` in the Bonsai v3 codebase
2. Understand how pedigrees are merged using optimal connection points
3. Create an example that demonstrates the algorithm in action
4. Visualize the results of pedigree merging

In [None]:
# Standard imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown
import inspect
import importlib
import copy
from itertools import combinations

sys.path.append(os.path.dirname(os.getcwd()))

# Cross-compatibility setup
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite, save_results, save_plot

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [None]:
# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
def display_module_classes(module_name):
    """Display classes and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all classes
        classes = inspect.getmembers(module, inspect.isclass)
        
        # Filter classes defined in this module (not imported)
        classes = [(name, cls) for name, cls in classes if cls.__module__ == module_name]
        
        # Print info for each class
        for name, cls in classes:
            print(f"\n## {name}")
            
            # Get docstring
            doc = inspect.getdoc(cls)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
            
            # Get methods
            methods = inspect.getmembers(cls, inspect.isfunction)
            if methods:
                print("\nMethods:")
                for method_name, method in methods:
                    if not method_name.startswith('_'):  # Skip private methods
                        print(f"- {method_name}")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def display_module_functions(module_name):
    """Display functions and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all functions
        functions = inspect.getmembers(module, inspect.isfunction)
        
        # Filter functions defined in this module (not imported)
        functions = [(name, func) for name, func in functions if func.__module__ == module_name]
        
        # Print info for each function
        for name, func in functions:
            if name.startswith('_'):  # Skip private functions
                continue
                
            print(f"\n## {name}")
            
            # Get signature
            sig = inspect.signature(func)
            print(f"Signature: {name}{sig}")
            
            # Get docstring
            doc = inspect.getdoc(func)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def view_function_source(module_name, function_name):
    """Display the source code of a function"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Get the function
        func = getattr(module, function_name)
        
        # Get the source code
        source = inspect.getsource(func)
        
        # Print the source code
        from IPython.display import display, Markdown
        display(Markdown(f"```python\n{source}\n```"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except AttributeError:
        print(f"Function {function_name} not found in module {module_name}")
    except Exception as e:
        print(f"Error processing function {function_name}: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [None]:
try:
    from bonsaitree.bonsaitree import v3
    print("✅ Successfully imported Bonsai v3 module")
except ImportError as e:
    try:
        from utils.bonsaitree.bonsaitree import v3
        print("✅ Successfully imported Bonsai v3 module from utils directory")
    except ImportError as e:
        print(f"❌ Failed to import Bonsai v3 module: {e}")
        print("This lab requires access to the Bonsai v3 codebase.")
        print("Make sure you've properly set up your environment with the Bonsai repository.")

## Part 1: Understanding the combine_up_dicts() Algorithm

The `combine_up_dicts()` function is a cornerstone of Bonsai v3's pedigree reconstruction strategy. It takes multiple pedigrees (represented as up-node dictionaries) and merges them into larger structures, selecting optimal connection points based on IBD sharing.

Let's examine the actual implementation of this function in the Bonsai v3 codebase.

In [None]:
# Let's first check if we're running in JupyterLite or have access to the actual codebase
if not is_jupyterlite():
    # If in a local environment, we can access the actual Bonsai codebase
    try:
        view_function_source("bonsaitree.bonsaitree.v3.connections", "combine_up_dicts")
    except Exception as e:
        try:
            view_function_source("utils.bonsaitree.bonsaitree.v3.connections", "combine_up_dicts")
        except Exception as e2:
            print(f"Could not access the Bonsai codebase: {e2}")
else:
    # For JupyterLite, we'll provide a simplified description of the function
    print("Running in JupyterLite environment - displaying a simplified overview of combine_up_dicts():")
    combine_up_dicts_desc = """
    def combine_up_dicts(
        idx_to_up_dict_ll_list, # Dict of indices to pedigrees and log-likelihoods
        id_to_idx,              # Dict mapping genotype IDs to pedigree indices
        idx_to_id_set,          # Dict mapping pedigree indices to sets of genotype IDs
        ibd_seg_list,           # List of IBD segments 
        pw_ll_cls,              # Instance of PwLogLike class
        condition=True,         # Whether to condition on IBD
        max_peds=20,            # Maximum pedigrees to retain at each step
        max_start_peds=5,       # Number of pedigrees from previous round to use
        max_con_pts=100,        # Maximum connection points to consider
        min_seg_len=3.0,        # Minimum segment length in cM
        restrict_connection_points=False,  # Whether to restrict connection points
        connect_up_only=False,  # Whether to only connect through ancestors
        db_fig_base_dir=None,   # Directory for debugging figures
        true_ped=None,          # True pedigree for debugging
    ):
        '''Core algorithm for merging multiple pedigrees into larger structures'''
    """
    display(Markdown(f"```python\n{combine_up_dicts_desc}\n```"))

### How combine_up_dicts() Works

Let's break down the algorithm step by step:

1. **Input**: Multiple pedigrees represented as up-node dictionaries, with their associated log-likelihoods

2. **Main Loop**:
   - While there are multiple pedigrees and IBD relationships between them:
     - Find the pair of individuals with the strongest IBD relationship who are in different pedigrees
     - Get the pedigrees containing these individuals
     - Consider all possible ways to combine these pedigrees
     - Select the most likely merged pedigrees
     - Update the data structures to reflect the merged pedigrees

3. **Output**: A set of merged pedigrees, represented as up-node dictionaries with their log-likelihoods

The algorithm iteratively merges pedigrees until either all individuals are in a single pedigree, or there are no more IBD relationships between pedigrees.
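
The loop described above can be sketched in a few lines. This is a minimal illustration, not Bonsai's implementation: it tracks only which IDs belong to which pedigree, takes total shared cM per pair as input, and replaces the likelihood-based merge with a plain set union. The function name `greedy_merge_order` and its input format are assumptions made for this example.

```python
def greedy_merge_order(id_sets, pair_to_cm):
    """Merge groups of IDs in order of strongest cross-group IBD sharing.

    id_sets: list of sets of individual IDs, one per starting pedigree
    pair_to_cm: dict mapping frozenset({id1, id2}) -> total shared cM
    Returns the list of merges performed and the final groups.
    """
    groups = [set(s) for s in id_sets]
    merges = []
    while len(groups) > 1:
        # Locate each individual's current group
        id_to_idx = {i: idx for idx, g in enumerate(groups) for i in g}
        # Keep only pairs that span two different groups
        cross = {
            pair: cm for pair, cm in pair_to_cm.items()
            if len({id_to_idx[i] for i in pair}) == 2
        }
        if not cross:
            break  # no IBD links remain between groups
        best_pair = max(cross, key=cross.get)
        a, b = sorted(id_to_idx[i] for i in best_pair)
        merges.append((tuple(sorted(best_pair)), cross[best_pair]))
        groups[a] |= groups[b]  # stand-in for the real pedigree merge
        del groups[b]
    return merges, groups

merges, groups = greedy_merge_order(
    [{1, 2, 3, 4}, {10, 11, 12}, {20, 21}],
    {frozenset([3, 11]): 850.0, frozenset([12, 21]): 1700.0},
)
print(merges)
print(groups)
```

The pair sharing the most cM is merged first, and the loop stops as soon as no pair spans two groups, which mirrors the two termination conditions stated above.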

Let's look at some key helper functions that are used by `combine_up_dicts()`:

In [None]:
# Check if we can access the get_closest_pair function
if not is_jupyterlite():
    try:
        view_function_source("bonsaitree.bonsaitree.v3.ibd", "get_closest_pair")
    except Exception as e:
        try:
            view_function_source("utils.bonsaitree.bonsaitree.v3.ibd", "get_closest_pair")
        except Exception as e2:
            print(f"Could not access the get_closest_pair function: {e2}")
else:
    # Simplified description for JupyterLite
    print("Running in JupyterLite environment - displaying a simplified overview of get_closest_pair():")
    get_closest_pair_desc = """
    def get_closest_pair(ibd_stats):
        '''Find the pair of individuals with the most IBD sharing.
        
        Args:
            ibd_stats: Dictionary mapping pairs of individual IDs to their IBD sharing metrics
            
        Returns:
            Tuple of (id1, id2) representing the pair with the most IBD sharing
        '''
        # Find the pair with the highest IBD sharing
        max_ibd = 0
        max_pair = None
        
        for pair, stats in ibd_stats.items():
            # The pair is a frozenset, so convert it to a list
            id_pair = list(pair)
            
            # The ibd_stats stores the total shared cM for each pair
            total_ibd = stats[0]  # First element is total shared cM
            
            if total_ibd > max_ibd:
                max_ibd = total_ibd
                max_pair = (id_pair[0], id_pair[1])
        
        return max_pair
    """
    display(Markdown(f"```python\n{get_closest_pair_desc}\n```"))

In [None]:
# Let's also look at the combine_pedigrees function that's called by combine_up_dicts
if not is_jupyterlite():
    try:
        view_function_source("bonsaitree.bonsaitree.v3.connections", "combine_pedigrees")
    except Exception as e:
        try:
            view_function_source("utils.bonsaitree.bonsaitree.v3.connections", "combine_pedigrees")
        except Exception as e2:
            print(f"Could not access the combine_pedigrees function: {e2}")
else:
    # Simplified description for JupyterLite
    print("Running in JupyterLite environment - displaying a simplified overview of combine_pedigrees():")
    combine_pedigrees_desc = """
    def combine_pedigrees(
        up_dct1,                # Up-node dict for pedigree 1
        up_dct2,                # Up-node dict for pedigree 2
        pw_ll_cls,              # Instance of PwLogLike class
        ibd_seg_list,           # List of IBD segments
        condition=True,         # Whether to condition on IBD
        max_peds=20,            # Maximum number of pedigrees to build
        max_con_pts=100,        # Maximum connection points to consider
        min_seg_len=3.0,        # Minimum segment length
        restrict_connection_points=False,  # Whether to restrict connection points
        connect_up_only=False,  # Whether to only connect through ancestors
        fig_dir=None,           # Directory for figures
    ):
        '''Combine two pedigrees through the most likely connection points.
        
        This function finds the best ways to connect two pedigrees by examining 
        potential connection points between them and evaluating different
        possible relationships between those points.
        
        Returns:
            List of (pedigree, log_likelihood) tuples for the merged pedigrees
        '''
    """
    display(Markdown(f"```python\n{combine_pedigrees_desc}\n```"))

### Key Components of the Algorithm

1. **Finding Connection Points**: 
   The function `get_connecting_points_degs_and_log_likes()` identifies potential points in each pedigree where connections could be made.

2. **Evaluating Relationships**: 
   For each pair of connection points, different potential relationships (parent-child, siblings, cousins, etc.) are evaluated based on the observed IBD sharing.

3. **Selecting Optimal Connections**: 
   The connections with the highest likelihood are selected to merge the pedigrees.

4. **Iterative Merging**: 
   The algorithm iteratively merges pedigrees, starting with the pairs having the strongest IBD sharing, until all pedigrees are merged or no more connections can be made.
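
The relationship-evaluation step can be illustrated with a toy scoring function. The sketch assumes a Gaussian model around an expected total cM per relationship, with a standard deviation of 10% of the mean (and a small floor). Both the expected values, which are the rough figures used in this lab's simulation, and the noise model are simplifying assumptions; they are not the likelihoods computed by Bonsai's `PwLogLike` class. The name `rank_relationships` is hypothetical.

```python
import math

# Rough expected total shared cM per relationship (illustrative values only)
EXPECTED_CM = {
    "parent-child": 3400.0,
    "full-sibling": 2550.0,
    "half-sibling": 1700.0,
    "first-cousin": 850.0,
    "second-cousin": 212.5,
}

def rank_relationships(observed_cm, rel_sd_frac=0.1, sd_floor=10.0):
    """Rank candidate relationships by a toy Gaussian log-likelihood."""
    scored = []
    for rel, mean in EXPECTED_CM.items():
        sd = max(mean * rel_sd_frac, sd_floor)
        log_like = (
            -0.5 * ((observed_cm - mean) / sd) ** 2
            - math.log(sd * math.sqrt(2 * math.pi))
        )
        scored.append((rel, log_like))
    # Highest log-likelihood first
    return sorted(scored, key=lambda item: item[1], reverse=True)

for rel, ll in rank_relationships(900.0)[:3]:
    print(f"{rel}: {ll:.2f}")
```

Ranking every candidate instead of returning a single winner leaves room to carry several plausible relationships forward, which is the idea behind retaining multiple pedigrees per step.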

Let's now implement a simplified example to demonstrate how pedigrees are merged using the `combine_up_dicts()` algorithm.

## Part 2: Creating and Visualizing Up-Node Dictionaries

Before we can demonstrate pedigree merging, we need to create some example pedigrees using the up-node dictionary structure. We'll also need functions to visualize these pedigrees.
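
To make the structure concrete: an up-node dictionary maps each individual to a dictionary of that individual's parents, and founders map to an empty dictionary. The sketch below is illustrative only. The helper names `get_founders` and `get_ancestors` are hypothetical and not part of Bonsai's API, and the value `1` on each parent entry simply follows the convention used in this lab's examples.

```python
# Toy up-node dictionary: child ID -> {parent ID: 1}; founders map to {}.
toy_up_dict = {
    1: {},
    2: {},
    3: {1: 1, 2: 1},
    4: {},
    5: {3: 1, 4: 1},
}

def get_founders(up_dict):
    """Return the IDs that have no recorded parents (hypothetical helper)."""
    return sorted(node for node, parents in up_dict.items() if not parents)

def get_ancestors(up_dict, node):
    """Collect every ancestor of `node` by walking parent links upward."""
    ancestors = set()
    stack = list(up_dict.get(node, {}))
    while stack:
        parent = stack.pop()
        if parent not in ancestors:
            ancestors.add(parent)
            stack.extend(up_dict.get(parent, {}))
    return ancestors

print(get_founders(toy_up_dict))
print(sorted(get_ancestors(toy_up_dict, 5)))
```

Because every lookup goes from child to parents, walking upward is cheap, while finding someone's children requires scanning the whole dictionary.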

In [None]:
def create_simple_family(family_id, base_id=1, add_sex=True):
    """
    Create a simple family with two parents and two children.
    
    Args:
        family_id: Identifier for this family
        base_id: Starting ID for individuals (will be incremented)
        add_sex: Whether to add sex information to the individuals
        
    Returns:
        up_dict: Up-node dictionary representing the family
        bioinfo: List of dictionaries with individual metadata
    """
    # Create IDs for the family members
    father_id = base_id
    mother_id = base_id + 1
    child1_id = base_id + 2
    child2_id = base_id + 3
    
    # Create the up-node dictionary
    up_dict = {
        father_id: {},  # Father has no parents in this example
        mother_id: {},  # Mother has no parents in this example
        child1_id: {father_id: 1, mother_id: 1},  # Child 1 has both parents
        child2_id: {father_id: 1, mother_id: 1}   # Child 2 has both parents
    }
    
    # Create bioinfo for visualization
    bioinfo = [
        {"genotype_id": father_id, "family_id": family_id, "age": 40},
        {"genotype_id": mother_id, "family_id": family_id, "age": 38},
        {"genotype_id": child1_id, "family_id": family_id, "age": 15},
        {"genotype_id": child2_id, "family_id": family_id, "age": 12}
    ]
    
    # Add sex information if requested
    if add_sex:
        bioinfo[0]["sex"] = "M"  # Father
        bioinfo[1]["sex"] = "F"  # Mother
        bioinfo[2]["sex"] = "M"  # Child 1 (son)
        bioinfo[3]["sex"] = "F"  # Child 2 (daughter)
    
    return up_dict, bioinfo

def create_extended_family(family_id, base_id=10, add_sex=True):
    """
    Create a more complex family with grandparents, parents, and children.
    
    Args:
        family_id: Identifier for this family
        base_id: Starting ID for individuals
        add_sex: Whether to add sex information
        
    Returns:
        up_dict: Up-node dictionary representing the family
        bioinfo: List of dictionaries with individual metadata
    """
    # Create IDs for the family members
    grandpa_id = base_id
    grandma_id = base_id + 1
    father_id = base_id + 2
    mother_id = base_id + 3
    child1_id = base_id + 4
    child2_id = base_id + 5
    
    # Create the up-node dictionary
    up_dict = {
        grandpa_id: {},  # Grandpa has no parents in this example
        grandma_id: {},  # Grandma has no parents in this example
        father_id: {grandpa_id: 1, grandma_id: 1},  # Father's parents are grandpa and grandma
        mother_id: {},   # Mother has no parents in this example
        child1_id: {father_id: 1, mother_id: 1},  # Child 1 has both parents
        child2_id: {father_id: 1, mother_id: 1}   # Child 2 has both parents
    }
    
    # Create bioinfo for visualization
    bioinfo = [
        {"genotype_id": grandpa_id, "family_id": family_id, "age": 70},
        {"genotype_id": grandma_id, "family_id": family_id, "age": 68},
        {"genotype_id": father_id, "family_id": family_id, "age": 45},
        {"genotype_id": mother_id, "family_id": family_id, "age": 42},
        {"genotype_id": child1_id, "family_id": family_id, "age": 18},
        {"genotype_id": child2_id, "family_id": family_id, "age": 15}
    ]
    
    # Add sex information if requested
    if add_sex:
        bioinfo[0]["sex"] = "M"  # Grandpa
        bioinfo[1]["sex"] = "F"  # Grandma
        bioinfo[2]["sex"] = "M"  # Father
        bioinfo[3]["sex"] = "F"  # Mother
        bioinfo[4]["sex"] = "M"  # Child 1 (son)
        bioinfo[5]["sex"] = "F"  # Child 2 (daughter)
    
    return up_dict, bioinfo

In [None]:
def visualize_pedigree(up_dict, bioinfo=None, title="Pedigree", figsize=(10, 6), node_size=800):
    """
    Visualize a pedigree from an up-node dictionary.
    
    Args:
        up_dict: Up-node dictionary representing the pedigree
        bioinfo: List of dictionaries with individual metadata (optional)
        title: Title for the plot
        figsize: Figure size as (width, height) tuple
        node_size: Size of nodes in the visualization
    """
    # Create a directed graph for the pedigree
    G = nx.DiGraph()
    
    # Create a mapping of IDs to sex if bioinfo is provided
    sex_map = {}
    labels = {}
    if bioinfo is not None:
        for info in bioinfo:
            gid = info["genotype_id"]
            if "sex" in info:
                sex_map[gid] = info["sex"]
            # Create labels with ID and age if available
            if "age" in info:
                labels[gid] = f"{gid}\n({info['age']} yrs)"
            else:
                labels[gid] = f"{gid}"
    
    # Add nodes and edges to the graph
    for child_id, parents in up_dict.items():
        # Add the child node
        if child_id not in G:
            G.add_node(child_id)
            if child_id not in labels:
                labels[child_id] = f"{child_id}"
        
        # Add parent nodes and edges from parents to child
        for parent_id in parents:
            if parent_id not in G:
                G.add_node(parent_id)
                if parent_id not in labels:
                    labels[parent_id] = f"{parent_id}"
            # Edge is from parent to child in a pedigree
            G.add_edge(parent_id, child_id)
    
    # Create figure
    plt.figure(figsize=figsize)
    
    # Set node colors based on sex
    node_colors = []
    for node in G.nodes():
        if node in sex_map:
            if sex_map[node] == "M":
                node_colors.append("lightblue")
            elif sex_map[node] == "F":
                node_colors.append("pink")
            else:
                node_colors.append("lightgray")
        else:
            node_colors.append("lightgray")
    
    # Use hierarchical layout for pedigree
    try:
        # Try to use graphviz for better layout if available (requires pygraphviz)
        pos = nx.nx_agraph.graphviz_layout(G, prog="dot")
    except Exception:
        # Fall back to networkx layout if graphviz is not available
        pos = nx.spring_layout(G)
    
    # Draw the graph
    nx.draw(
        G, pos, 
        with_labels=False, 
        node_color=node_colors, 
        node_size=node_size, 
        arrows=True,
        arrowsize=20,
        arrowstyle="->",
        edge_color="gray"
    )
    
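    # Draw labels (only for nodes in the graph; bioinfo may list IDs that are not in up_dict)
    labels = {node: label for node, label in labels.items() if node in G}
    nx.draw_networkx_labels(G, pos, labels=labels, font_size=12)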
    
    # Add title and legend
    plt.title(title, fontsize=16)
    
    # Add a legend
    from matplotlib.patches import Patch
    legend_elements = [
        Patch(facecolor='lightblue', edgecolor='black', label='Male'),
        Patch(facecolor='pink', edgecolor='black', label='Female'),
        Patch(facecolor='lightgray', edgecolor='black', label='Unknown')
    ]
    plt.legend(handles=legend_elements, loc='upper right')
    
    plt.axis('off')  # Hide the axes
    plt.tight_layout()
    plt.show()

In [None]:
# Create two separate family pedigrees
family1_up_dict, family1_bioinfo = create_simple_family(family_id="FAM1", base_id=1)
family2_up_dict, family2_bioinfo = create_extended_family(family_id="FAM2", base_id=10)

# Visualize the pedigrees
visualize_pedigree(family1_up_dict, family1_bioinfo, title="Family 1 Pedigree")
visualize_pedigree(family2_up_dict, family2_bioinfo, title="Family 2 Pedigree")

## Part 3: Implementing a Simplified combine_up_dicts

Now that we have our example pedigrees and visualization tools, let's implement a simplified version of the `combine_up_dicts()` algorithm to demonstrate its core functionality.

In [None]:
def combine_pedigrees_simplified(up_dict1, up_dict2, connection_points):
    """
    A simplified function to combine two pedigrees through specified connection points.
    
    Args:
        up_dict1: The first up-node dictionary
        up_dict2: The second up-node dictionary
        connection_points: Tuple of (id1, id2) where id1 is from up_dict1 and id2 is from up_dict2
        
    Returns:
        Combined up-node dictionary
    """
    # Create a deep copy of both dictionaries to avoid modifying the originals
    up_dict1_copy = copy.deepcopy(up_dict1)
    up_dict2_copy = copy.deepcopy(up_dict2)
    
    # Extract the connection points
    id1, id2 = connection_points
    
    # Check if both IDs exist in their respective pedigrees
    if id1 not in up_dict1_copy or id2 not in up_dict2_copy:
        raise ValueError(f"Connection points {id1} and {id2} must exist in their respective pedigrees")
    
    # In this simplified version, we'll make id2 a child of id1
    # Create a combined pedigree by adding all nodes from both pedigrees
    combined_up_dict = {**up_dict1_copy, **up_dict2_copy}
    
    # Add the new parent-child relationship
    if id1 not in combined_up_dict[id2]:
        combined_up_dict[id2][id1] = 1  # Direct parent-child relationship
    
    return combined_up_dict

def simplified_combine_up_dicts(pedigrees, bioinfo_list, connection_map):
    """
    A simplified version of the combine_up_dicts algorithm.
    
    Args:
        pedigrees: List of up-node dictionaries representing separate pedigrees
        bioinfo_list: List of bioinfo dictionaries for each pedigree
        connection_map: Dictionary mapping pairs of pedigree indices to connection points
            Example: {(0, 1): (id_from_ped0, id_from_ped1)}
            
    Returns:
        Combined pedigree and bioinfo
    """
    # Create a copy of the pedigrees to avoid modifying the originals
    current_pedigrees = copy.deepcopy(pedigrees)
    current_bioinfo = [item for sublist in bioinfo_list for item in sublist]  # Flatten the bioinfo list
    
    # Track the position of the most recently merged pedigree so we can return it
    merged_idx = 0
    
    # While there are multiple pedigrees and connections to make
    while len(current_pedigrees) > 1 and connection_map:
        # Find a valid connection
        valid_connections = [k for k in connection_map.keys() 
                             if k[0] < len(current_pedigrees) and k[1] < len(current_pedigrees)]
        
        if not valid_connections:
            break
        
        # Take the first valid connection
        idx1, idx2 = valid_connections[0]
        connection_points = connection_map[(idx1, idx2)]
        
        # Combine the pedigrees
        combined_pedigree = combine_pedigrees_simplified(
            current_pedigrees[idx1], 
            current_pedigrees[idx2], 
            connection_points
        )
        
        # Replace the first pedigree with the combined one
        current_pedigrees[idx1] = combined_pedigree
        
        # Remove the second pedigree
        current_pedigrees.pop(idx2)
        
        # Removing idx2 shifts every later pedigree down by one position
        merged_idx = idx1 if idx1 < idx2 else idx1 - 1
        
        # Update connection map to account for the removed pedigree
        new_connection_map = {}
        for (a, b), conn in connection_map.items():
            # Connections that touched idx2 now belong to the merged pedigree
            a = idx1 if a == idx2 else a
            b = idx1 if b == idx2 else b
            # Adjust indices for pedigrees after idx2
            new_a = a if a < idx2 else a - 1
            new_b = b if b < idx2 else b - 1
            if new_a != new_b:  # Avoid connecting a pedigree to itself
                new_connection_map[(new_a, new_b)] = conn
        connection_map = new_connection_map
    
    # Return the most recently merged pedigree, with bioinfo limited to its members
    merged_pedigree = current_pedigrees[merged_idx]
    merged_bioinfo = [info for info in current_bioinfo if info["genotype_id"] in merged_pedigree]
    return merged_pedigree, merged_bioinfo

In [None]:
# Create a connection map between our two example pedigrees
# Let's connect person 3 (child1 from family1) as a parent of person 13 (mother from family2)
connection_map = {(0, 1): (3, 13)}

# Combine the pedigrees
combined_pedigree, combined_bioinfo = simplified_combine_up_dicts(
    [family1_up_dict, family2_up_dict],
    [family1_bioinfo, family2_bioinfo],
    connection_map
)

# Visualize the combined pedigree
visualize_pedigree(combined_pedigree, combined_bioinfo, title="Combined Pedigree", figsize=(12, 8))

### Analyzing the Connection Process

Let's examine what happens when we combine pedigrees in our simplified example:

1. We started with two separate pedigrees:
   - Family 1: A simple family with two parents (1, 2) and two children (3, 4)
   - Family 2: An extended family with grandparents (10, 11), parents (12, 13), and children (14, 15)

2. We specified a connection point: making individual 3 from Family 1 a parent of individual 13 from Family 2

3. The result is a merged pedigree that preserves all the original relationships while adding the new connection
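
The claim in point 3 can be checked directly. The sketch below rebuilds the two example families as plain dictionaries, applies the same parent link (3 becomes a parent of 13), and verifies that every original parent-child edge survives the merge. The helper `edges` is a hypothetical name introduced here for illustration.

```python
def edges(up_dict):
    """Return the set of (parent, child) edges in an up-node dictionary."""
    return {(parent, child) for child, parents in up_dict.items() for parent in parents}

fam1 = {1: {}, 2: {}, 3: {1: 1, 2: 1}, 4: {1: 1, 2: 1}}
fam2 = {10: {}, 11: {}, 12: {10: 1, 11: 1}, 13: {},
        14: {12: 1, 13: 1}, 15: {12: 1, 13: 1}}

# Merge by dictionary union (copying the inner dicts), then add the new link
merged = {child: dict(parents) for child, parents in {**fam1, **fam2}.items()}
merged[13][3] = 1

original_edges = edges(fam1) | edges(fam2)
new_edges = edges(merged) - original_edges

print(original_edges <= edges(merged))  # True when no original edge was lost
print(new_edges)
```

The dictionary union is safe here only because the two families use disjoint ID ranges; with overlapping IDs, entries from the second family would silently overwrite those from the first.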

In the actual Bonsai v3 implementation, the most important differences are:

1. **Connection Point Selection**: Instead of manually specifying connection points, Bonsai uses IBD data to find the most likely connection points automatically

2. **Relationship Evaluation**: Bonsai evaluates multiple possible relationships between the connection points (parent-child, siblings, cousins, etc.) and selects the most likely ones

3. **Optimization**: Bonsai maintains multiple candidate pedigrees at each step and selects the ones with the highest likelihood scores
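
Keeping several candidates rather than a single best guess, as in point 3, resembles a beam search. The sketch below is an illustration under stated assumptions: the candidate labels and log-likelihood values are made up, and `keep_top_candidates` is a hypothetical helper whose `max_peds` argument merely mirrors the parameter name in the signature shown in Part 1.

```python
import heapq

def keep_top_candidates(candidates, max_peds=20):
    """Keep the `max_peds` highest-scoring (pedigree, log_likelihood) pairs."""
    return heapq.nlargest(max_peds, candidates, key=lambda item: item[1])

# Made-up candidates: (label standing in for a pedigree, log-likelihood)
candidates = [
    ("connect as parent-child", -310.4),
    ("connect as half-siblings", -295.1),
    ("connect as first cousins", -301.7),
    ("connect as full siblings", -342.9),
]

for label, ll in keep_top_candidates(candidates, max_peds=2):
    print(label, ll)
```

Carrying more than one candidate forward means an early choice that looks best locally can still be overtaken later, at the cost of evaluating more pedigrees per round.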

Let's create a more complex example that better illustrates the search for optimal connection points.

In [None]:
def simulate_ibd_sharing(id1, id2, relationship_type):
    """
    Simulate IBD sharing between two individuals based on their relationship.
    
    Args:
        id1: First individual ID
        id2: Second individual ID
        relationship_type: String describing the relationship
        
    Returns:
        Simulated total cM sharing
    """
    # Simplified IBD sharing values based on relationship type
    sharing_values = {
        "parent-child": 3400,     # ~50% of genome
        "full-sibling": 2550,     # ~37.5% of genome
        "grandparent": 1700,      # ~25% of genome
        "half-sibling": 1700,     # ~25% of genome
        "first-cousin": 850,      # ~12.5% of genome
        "second-cousin": 212.5,   # ~3.125% of genome
        "unrelated": 0            # No sharing
    }
    
    # Add some random noise to make it more realistic
    base_sharing = sharing_values.get(relationship_type, 0)
    noise = np.random.normal(0, base_sharing * 0.1)  # 10% noise
    return max(0, base_sharing + noise)

def get_relationship_type(id1, id2, combined_pedigrees, bioinfo_list):
    """
    Determine the relationship type between two individuals.
    
    This is a simplified version that just checks a few common relationships.
    
    Args:
        id1: First individual ID
        id2: Second individual ID
        combined_pedigrees: List of pedigrees
        bioinfo_list: List of bioinfo dictionaries
        
    Returns:
        String describing the relationship type
    """
    # Find which pedigree each ID belongs to
    ped1_idx = None
    ped2_idx = None
    
    for i, (ped, bioinfo) in enumerate(zip(combined_pedigrees, bioinfo_list)):
        genotype_ids = [info["genotype_id"] for info in bioinfo]
        if id1 in genotype_ids:
            ped1_idx = i
        if id2 in genotype_ids:
            ped2_idx = i
    
    # If they're in different pedigrees, they're unrelated (for our simulation)
    if ped1_idx != ped2_idx:
        return "unrelated"
    
    # If they're in the same pedigree, check relationship
    pedigree = combined_pedigrees[ped1_idx]
    
    # Check parent-child relationship
    if id1 in pedigree and id2 in pedigree[id1]:
        return "parent-child"
    if id2 in pedigree and id1 in pedigree[id2]:
        return "parent-child"
    
    # Check siblings (share both parents)
    if id1 in pedigree and id2 in pedigree:
        parents1 = set(pedigree[id1].keys())
        parents2 = set(pedigree[id2].keys())
        common_parents = parents1.intersection(parents2)
        
        if len(common_parents) == 2:
            return "full-sibling"
        elif len(common_parents) == 1:
            return "half-sibling"
    
    # For simplicity, we'll default to "unrelated" for other cases
    return "unrelated"

def generate_ibd_stats(combined_pedigrees, bioinfo_list):
    """
    Generate simulated IBD statistics for all pairs of individuals across pedigrees.
    
    Args:
        combined_pedigrees: List of pedigrees
        bioinfo_list: List of bioinfo dictionaries
        
    Returns:
        Dictionary with IBD sharing statistics
    """
    # Get all individuals across all pedigrees
    all_individuals = []
    for bioinfo in bioinfo_list:
        all_individuals.extend([info["genotype_id"] for info in bioinfo])
    
    # Create IBD stats for all pairs
    ibd_stats = {}
    
    for id1, id2 in combinations(all_individuals, 2):
        # Determine relationship type
        relationship = get_relationship_type(id1, id2, combined_pedigrees, bioinfo_list)
        
        # Simulate IBD sharing
        total_sharing = simulate_ibd_sharing(id1, id2, relationship)
        
        # Skip pairs with no IBD sharing
        if total_sharing > 0:
            # Store the IBD stats
            key = frozenset([id1, id2])
            ibd_stats[key] = (total_sharing, relationship)
    
    return ibd_stats

In [None]:
# Create three separate family pedigrees
family1_up_dict, family1_bioinfo = create_simple_family(family_id="FAM1", base_id=1)
family2_up_dict, family2_bioinfo = create_extended_family(family_id="FAM2", base_id=10)
family3_up_dict, family3_bioinfo = create_simple_family(family_id="FAM3", base_id=20)

# Generate IBD statistics
pedigrees = [family1_up_dict, family2_up_dict, family3_up_dict]
bioinfo_list = [family1_bioinfo, family2_bioinfo, family3_bioinfo]
ibd_stats = generate_ibd_stats(pedigrees, bioinfo_list)

# Visualize the three pedigrees
visualize_pedigree(family1_up_dict, family1_bioinfo, title="Family 1 Pedigree")
visualize_pedigree(family2_up_dict, family2_bioinfo, title="Family 2 Pedigree")
visualize_pedigree(family3_up_dict, family3_bioinfo, title="Family 3 Pedigree")

In [None]:
# Add synthetic IBD sharing between families (to simulate distant relationships)
# Family 1 child (ID 3) is related to Family 2 grandmother (ID 11)
ibd_stats[frozenset([3, 11])] = (850, "first-cousin")

# Family 2 child (ID 14) is related to Family 3 mother (ID 21)
ibd_stats[frozenset([14, 21])] = (1700, "half-sibling")

# Find the pair with the strongest IBD sharing between different pedigrees
max_sharing = 0
closest_pair = None

for key, (sharing, relationship) in ibd_stats.items():
    id_pair = list(key)
    id1, id2 = id_pair
    
    # Find which pedigree each ID belongs to
    ped1_idx = None
    ped2_idx = None
    
    for i, bioinfo in enumerate(bioinfo_list):
        genotype_ids = [info["genotype_id"] for info in bioinfo]
        if id1 in genotype_ids:
            ped1_idx = i
        if id2 in genotype_ids:
            ped2_idx = i
    
    # If they're in different pedigrees and have stronger sharing
    if ped1_idx is not None and ped2_idx is not None and ped1_idx != ped2_idx and sharing > max_sharing:
        max_sharing = sharing
        closest_pair = (id1, id2, ped1_idx, ped2_idx, sharing, relationship)

# Print the pair with the strongest IBD sharing between different pedigrees
if closest_pair:
    id1, id2, ped1_idx, ped2_idx, sharing, relationship = closest_pair
    print("Strongest IBD sharing between pedigrees:")
    print(f"  Individual {id1} from Family {ped1_idx+1} and Individual {id2} from Family {ped2_idx+1}")
    print(f"  Sharing: {sharing:.2f} cM")
    print(f"  Relationship: {relationship}")
    
    # Create connection map based on the strongest IBD sharing
    connection_map = {(ped1_idx, ped2_idx): (id1, id2)}
    
    # Combine the pedigrees using this connection
    intermediate_pedigree, intermediate_bioinfo = simplified_combine_up_dicts(
        copy.deepcopy(pedigrees),  # Deep copy so the original pedigrees stay untouched
        copy.deepcopy(bioinfo_list),  # Deep copy so the original bioinfo stays untouched
        connection_map
    )
    
    # Visualize the intermediate combined pedigree
    visualize_pedigree(intermediate_pedigree, intermediate_bioinfo,
                      title="Combined Pedigree (First Iteration)", figsize=(14, 10))

### Implementing More Realistic Relationship Integration

In the actual Bonsai v3 implementation, when two pedigrees are merged, the algorithm carefully evaluates different possible relationships between the connection points. Let's add more flexibility to our simplified algorithm to better mimic this behavior.
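Before writing the connection logic, it helps to see what "evaluating different possible relationships" could look like in its simplest form. The sketch below ranks a few relationship types by how close their expected total IBD sharing is to an observed value. The `EXPECTED_CM` table and the `rank_relationships()` helper are hypothetical names introduced for this lab, and the cM values are rough averages chosen for illustration; Bonsai itself uses a likelihood model rather than this nearest-value heuristic.

```python
# Rough expected total IBD sharing (in cM) per relationship type.
# Assumption: illustrative averages only, not the values or model used by Bonsai.
EXPECTED_CM = {
    "parent-child": 3400,
    "full-sibling": 2600,
    "half-sibling": 1700,
    "first-cousin": 850,
}


def rank_relationships(observed_cm, expected_cm=EXPECTED_CM):
    """Return relationship types sorted from best to worst match.

    A relationship matches better when its expected sharing is closer
    to the observed total sharing.
    """
    return sorted(expected_cm, key=lambda rel: abs(expected_cm[rel] - observed_cm))


# Example: 900 cM of observed sharing is closest to the first-cousin expectation
ranking = rank_relationships(900)
print(ranking)
```

The first entry of the ranking is the relationship label that a caller could pass on as `relationship_type` when connecting two pedigrees.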

In [None]:
def connect_pedigrees_with_relationship(up_dict1, up_dict2, connection_points, relationship_type):
    """
    Connect two pedigrees with a specific relationship type between connection points.
    
    Args:
        up_dict1: First pedigree
        up_dict2: Second pedigree
        connection_points: Tuple of (id1, id2) with id1 from up_dict1 and id2 from up_dict2
        relationship_type: Type of relationship to establish
        
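    Returns:
        Combined up-node dictionary. Unrecognized relationship types return
        the union of both pedigrees without a connecting link.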
    """
    # Create deep copies of the pedigrees
    up_dict1_copy = copy.deepcopy(up_dict1)
    up_dict2_copy = copy.deepcopy(up_dict2)
    
    # Extract the connection points
    id1, id2 = connection_points
    
    # Create combined pedigree with all nodes from both pedigrees
    combined_up_dict = {**up_dict1_copy, **up_dict2_copy}
    
    # Establish the appropriate relationship based on type
    if relationship_type == "parent-child":
        # Make id1 a parent of id2
        if id1 not in combined_up_dict[id2]:
            combined_up_dict[id2][id1] = 1
    
    elif relationship_type in ("siblings", "full-sibling"):
        # Create a common ancestor pair for both individuals.
        # Use fresh negative IDs for inferred ancestors so that repeated
        # merges never reuse an ID that is already in the pedigree.
        father_id = min(min(combined_up_dict), 0) - 1
        mother_id = father_id - 1
        
        # Add the inferred parents to the pedigree
        combined_up_dict[father_id] = {}
        combined_up_dict[mother_id] = {}
        
        # Make both individuals children of the inferred parents
        if id1 in combined_up_dict:
            combined_up_dict[id1][father_id] = 1
            combined_up_dict[id1][mother_id] = 1
        else:
            combined_up_dict[id1] = {father_id: 1, mother_id: 1}
            
        if id2 in combined_up_dict:
            combined_up_dict[id2][father_id] = 1
            combined_up_dict[id2][mother_id] = 1
        else:
            combined_up_dict[id2] = {father_id: 1, mother_id: 1}
    
    elif relationship_type == "half-sibling":
        # Create a common parent with a fresh negative ID that is not yet in use
        common_parent_id = min(min(combined_up_dict), 0) - 1
        combined_up_dict[common_parent_id] = {}
        
        # Make both individuals children of the common parent
        if id1 in combined_up_dict:
            combined_up_dict[id1][common_parent_id] = 1
        else:
            combined_up_dict[id1] = {common_parent_id: 1}
            
        if id2 in combined_up_dict:
            combined_up_dict[id2][common_parent_id] = 1
        else:
            combined_up_dict[id2] = {common_parent_id: 1}
    
    elif relationship_type == "first-cousin":
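        # Create a structure for first cousins (shared grandparents),
        # using fresh negative IDs that are not yet in the pedigree
        grandparent1_id = min(min(combined_up_dict), 0) - 1
        grandparent2_id = grandparent1_id - 1
        parent1_id = grandparent1_id - 2
        parent2_id = grandparent1_id - 3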
        
        # Add the inferred ancestors
        combined_up_dict[grandparent1_id] = {}
        combined_up_dict[grandparent2_id] = {}
        combined_up_dict[parent1_id] = {grandparent1_id: 1, grandparent2_id: 1}
        combined_up_dict[parent2_id] = {grandparent1_id: 1, grandparent2_id: 1}
        
        # Make id1 a child of parent1
        if id1 in combined_up_dict:
            combined_up_dict[id1][parent1_id] = 1
        else:
            combined_up_dict[id1] = {parent1_id: 1}
        
        # Make id2 a child of parent2
        if id2 in combined_up_dict:
            combined_up_dict[id2][parent2_id] = 1
        else:
            combined_up_dict[id2] = {parent2_id: 1}
    
    return combined_up_dict

def combine_up_dicts_realistic(pedigrees, bioinfo_list, ibd_stats):
    """
    A more realistic implementation of combine_up_dicts that selects connection points
    based on IBD sharing and creates appropriate relationships.
    
    Args:
        pedigrees: List of pedigrees
        bioinfo_list: List of bioinfo dictionaries
        ibd_stats: Dictionary of IBD sharing statistics
        
    Returns:
        Combined pedigree and bioinfo
    """
    # Create working copies
    current_pedigrees = copy.deepcopy(pedigrees)
    current_bioinfo = copy.deepcopy(bioinfo_list)
    current_ibd_stats = copy.deepcopy(ibd_stats)
    
    # Create mappings between individuals and pedigrees
    id_to_ped_idx = {}
    ped_idx_to_ids = {}
    
    for i, bioinfo in enumerate(current_bioinfo):
        ids = [info["genotype_id"] for info in bioinfo]
        ped_idx_to_ids[i] = set(ids)
        for id_ in ids:
            id_to_ped_idx[id_] = i
    
    # Track combined bioinfo
    combined_bioinfo = [item for sublist in bioinfo_list for item in sublist]
    
    # While there are multiple pedigrees and IBD stats
    while len(current_pedigrees) > 1 and current_ibd_stats:
        # Find the pair with the strongest IBD sharing between different pedigrees
        max_sharing = 0
        closest_pair = None
        best_relationship = None
        
        for key, (sharing, relationship) in current_ibd_stats.items():
            id_pair = list(key)
            id1, id2 = id_pair
            
            # Check if both IDs are still in the mappings
            if id1 not in id_to_ped_idx or id2 not in id_to_ped_idx:
                continue
            
            ped1_idx = id_to_ped_idx[id1]
            ped2_idx = id_to_ped_idx[id2]
            
            # If they're in different pedigrees and have stronger sharing
            if ped1_idx != ped2_idx and sharing > max_sharing:
                max_sharing = sharing
                closest_pair = (id1, id2, ped1_idx, ped2_idx)
                best_relationship = relationship
        
        if not closest_pair:
            break
        
        # Extract information about the closest pair
        id1, id2, ped1_idx, ped2_idx = closest_pair
        
        print(f"Merging pedigrees with connection: {id1} and {id2}, relationship: {best_relationship}")
        
        # Ensure we're always merging the higher index into the lower index
        if ped1_idx > ped2_idx:
            ped1_idx, ped2_idx = ped2_idx, ped1_idx
            id1, id2 = id2, id1
        
        # Connect the pedigrees with the appropriate relationship
        combined_pedigree = connect_pedigrees_with_relationship(
            current_pedigrees[ped1_idx],
            current_pedigrees[ped2_idx],
            (id1, id2),
            best_relationship
        )
        
        # Update pedigree at the lower index
        current_pedigrees[ped1_idx] = combined_pedigree
        
        # Combine bioinfo sets
        current_bioinfo[ped1_idx].extend(current_bioinfo[ped2_idx])
        
        # Update mappings
        ped_idx_to_ids[ped1_idx] = ped_idx_to_ids[ped1_idx].union(ped_idx_to_ids[ped2_idx])
        for id_ in ped_idx_to_ids[ped2_idx]:
            id_to_ped_idx[id_] = ped1_idx
        
        # Remove the higher index pedigree
        current_pedigrees.pop(ped2_idx)
        current_bioinfo.pop(ped2_idx)
        ped_idx_to_ids.pop(ped2_idx)
        
        # Adjust indices for remaining pedigrees
        new_id_to_ped_idx = {}
        for id_, idx in id_to_ped_idx.items():
            if idx < ped2_idx:
                new_id_to_ped_idx[id_] = idx
            elif idx > ped2_idx:
                new_id_to_ped_idx[id_] = idx - 1
            else:  # idx == ped2_idx
                new_id_to_ped_idx[id_] = ped1_idx
        id_to_ped_idx = new_id_to_ped_idx
        
        new_ped_idx_to_ids = {}
        for idx, id_set in ped_idx_to_ids.items():
            if idx < ped2_idx:
                new_ped_idx_to_ids[idx] = id_set
            elif idx > ped2_idx:
                new_ped_idx_to_ids[idx-1] = id_set
        new_ped_idx_to_ids[ped1_idx] = ped_idx_to_ids[ped1_idx]
        ped_idx_to_ids = new_ped_idx_to_ids
        
        # Update IBD stats - remove pairs that are now in the same pedigree
        new_ibd_stats = {}
        for key, value in current_ibd_stats.items():
            id_pair = list(key)
            id1, id2 = id_pair
            
            # Skip if either ID is no longer in mappings
            if id1 not in id_to_ped_idx or id2 not in id_to_ped_idx:
                continue
            
            # Keep only pairs that are still in different pedigrees
            if id_to_ped_idx[id1] != id_to_ped_idx[id2]:
                new_ibd_stats[key] = value
        
        current_ibd_stats = new_ibd_stats
    
    # Return the first pedigree and combined bioinfo
    return current_pedigrees[0], combined_bioinfo

In [None]:
# Use our more realistic implementation to combine all three pedigrees
final_pedigree, final_bioinfo = combine_up_dicts_realistic(
    [family1_up_dict, family2_up_dict, family3_up_dict],
    [family1_bioinfo, family2_bioinfo, family3_bioinfo],
    ibd_stats
)

# Visualize the final combined pedigree
visualize_pedigree(final_pedigree, final_bioinfo, title="Final Combined Pedigree", figsize=(16, 12))
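
A quick structural check can complement the plot. The sketch below assumes the up-node dictionary has the shape used throughout this lab, `{child_id: {parent_id: degree}}`, and reports whether the pedigree forms a single connected component and which individuals ended up with more than two parents. `check_up_dict()` is a hypothetical helper written for this notebook, not part of Bonsai.

```python
def check_up_dict(up_dict):
    """Return (is_connected, too_many_parents) for an up-node dictionary.

    Assumes the form {child_id: {parent_id: degree}} with sortable IDs.
    """
    # Collect every node, including parents that only appear as values
    nodes = set(up_dict)
    for parents in up_dict.values():
        nodes.update(parents)

    # Build an undirected adjacency map from the child -> parent links
    neighbors = {node: set() for node in nodes}
    for child, parents in up_dict.items():
        for parent in parents:
            neighbors[child].add(parent)
            neighbors[parent].add(child)

    # Depth-first search from an arbitrary starting node
    seen = set()
    stack = [next(iter(nodes))] if nodes else []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(neighbors[node] - seen)

    # Individuals with more than two recorded parents
    too_many_parents = sorted(c for c, p in up_dict.items() if len(p) > 2)
    return seen == nodes, too_many_parents


# Toy pedigree: individual 4 has two known parents plus an inferred one (-1)
toy = {1: {}, 2: {}, 3: {1: 1, 2: 1}, 4: {1: 1, 2: 1, -1: 1}, -1: {}}
print(check_up_dict(toy))
```

Calling `check_up_dict(final_pedigree)` on the merged result is one way to spot side effects of the simplified connection logic, which adds inferred parents without removing existing ones and can therefore leave an individual with more than two parents.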

## Summary

In this lab, we explored the `combine_up_dicts()` algorithm, which is a core component of the Bonsai v3 framework for merging pedigree structures. Key takeaways include:

1. The `combine_up_dicts()` algorithm iteratively merges pedigrees, starting with those sharing the strongest IBD, until all related pedigrees are combined.

2. For each merge, Bonsai evaluates multiple possible connection points and relationship types to find the most likely configuration.

3. The algorithm relies on helper functions such as `get_closest_pair()`, which identifies the best candidates for merging, and `combine_pedigrees()`, which performs the actual combination.

4. The up-node dictionary data structure enables efficient pedigree representation and manipulation during the merging process.

5. Bonsai maintains multiple candidate pedigrees at each step, selecting those with the highest likelihood for further consideration.

This approach allows Bonsai to tackle the challenging problem of pedigree reconstruction by breaking it down into smaller, manageable merges, making it possible to reconstruct large and complex family trees from genetic data.
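
Takeaway 5 can be illustrated with a few lines of Python. The sketch below keeps the `k` highest-scoring candidates from a list of `(up_dict, log_likelihood)` tuples. The function name `keep_top_candidates()`, the tuple layout, and the example scores are assumptions made for illustration; they do not reflect Bonsai's internal data structures.

```python
import heapq


def keep_top_candidates(candidates, k=3):
    """Keep the k candidates with the highest log-likelihood.

    `candidates` is a list of (up_dict, log_likelihood) tuples
    (a hypothetical layout used only in this example).
    """
    return heapq.nlargest(k, candidates, key=lambda item: item[1])


# Three made-up candidate pedigrees with made-up log-likelihoods
candidates = [
    ({3: {1: 1}}, -12.5),
    ({3: {-1: 1}, 1: {-1: 1}}, -9.8),
    ({1: {3: 1}}, -15.1),
]

best = keep_top_candidates(candidates, k=2)
print([log_like for _, log_like in best])
```

Keeping several candidates instead of only the best one lets a later merge recover when an early, locally optimal choice turns out to fit the remaining data poorly.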

In [None]:
# Convert this notebook to PDF using poetry
!poetry run jupyter nbconvert --to pdf Lab15_Explore_Bonsai.ipynb

# Note: PDF conversion requires LaTeX to be installed on your system
# If you encounter errors, you may need to install it:
# On Ubuntu/Debian: sudo apt-get install texlive-xetex
# On macOS with Homebrew: brew install texlive