# Lab 17: Incremental Individual Addition Strategies

## Overview

In this lab, we'll explore how Bonsai v3 efficiently adds new individuals to existing pedigrees one at a time, rather than rebuilding the entire structure. These incremental addition strategies are critical for performance when working with large datasets and for updating pedigrees as new genetic data becomes available.

In [None]:
# Standard imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown
import inspect
import importlib
import copy
import random
import math
from collections import defaultdict

sys.path.append(os.path.dirname(os.getcwd()))

# Cross-compatibility setup
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite, save_results, save_plot

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [None]:
# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
def display_module_classes(module_name):
    """Display classes and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all classes
        classes = inspect.getmembers(module, inspect.isclass)
        
        # Filter classes defined in this module (not imported)
        classes = [(name, cls) for name, cls in classes if cls.__module__ == module_name]
        
        # Print info for each class
        for name, cls in classes:
            print(f"\n## {name}")
            
            # Get docstring
            doc = inspect.getdoc(cls)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
            
            # Get methods
            methods = inspect.getmembers(cls, inspect.isfunction)
            if methods:
                print("\nMethods:")
                for method_name, method in methods:
                    if not method_name.startswith('_'):  # Skip private methods
                        print(f"- {method_name}")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def display_module_functions(module_name):
    """Display functions and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all functions
        functions = inspect.getmembers(module, inspect.isfunction)
        
        # Filter functions defined in this module (not imported)
        functions = [(name, func) for name, func in functions if func.__module__ == module_name]
        
        # Print info for each function
        for name, func in functions:
            if name.startswith('_'):  # Skip private functions
                continue
                
            print(f"\n## {name}")
            
            # Get signature
            sig = inspect.signature(func)
            print(f"Signature: {name}{sig}")
            
            # Get docstring
            doc = inspect.getdoc(func)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def view_source(obj):
    """Display the source code of an object (function or class)"""
    try:
        source = inspect.getsource(obj)
        display(Markdown(f"```python\n{source}\n```"))
    except Exception as e:
        print(f"Error retrieving source: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [None]:
try:
    from utils.bonsaitree.bonsaitree import v3
    print("✅ Successfully imported Bonsai v3 module")
except ImportError as e:
    print(f"❌ Failed to import Bonsai v3 module: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")
    print("Make sure you've properly set up your environment with the Bonsai repository.")

## Lab 17: Incremental Individual Addition Strategies

Incremental addition places each new individual into the existing pedigree instead of rebuilding the whole structure. This approach is critical for:

1. **Performance**: Adding one individual is much faster than rebuilding an entire pedigree
2. **Stability**: Existing relationships in the pedigree remain stable as new individuals are added
3. **Scalability**: The approach can handle continuous addition of new individuals over time
4. **Real-time Updates**: Results can be updated promptly as new data becomes available

We'll implement simplified versions of the key algorithms to understand how they work, and explore the trade-offs and optimizations involved in incremental addition.

## Part 1: Understanding Incremental Addition

Let's start by examining the core incremental addition functions in Bonsai v3:

In [None]:
# Import the incremental addition functions from Bonsai v3
if not is_jupyterlite():
    try:
        from utils.bonsaitree.bonsaitree.v3.connections import add_individual_to_pedigree
        
        # Display the source code of the function
        print("Source code for add_individual_to_pedigree:")
        view_source(add_individual_to_pedigree)
    except (ImportError, AttributeError) as e:
        print(f"Could not import function: {e}")
else:
    print("Cannot display source code in JupyterLite environment.")

### 1.1 Finding Potential Connection Points

The first step in adding a new individual to a pedigree is to identify potential connection points: individuals in the existing pedigree who share IBD (identity-by-descent) segments with the new individual. Let's implement a simplified version of this function:

In [None]:
def find_potential_connection_points(id_val, up_dct, id_to_shared_ibd):
    """
    Find potential points for connecting a new individual to a pedigree.
    
    Args:
        id_val: ID of the individual to add
        up_dct: Up-node dictionary representing the existing pedigree
        id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
        
    Returns:
        connection_points: List of (existing_id, ibd_amount) tuples
    """
    # Find individuals in the pedigree who share IBD with the new individual
    connection_points = []
    
    # Check all IBD pairs
    for pair, segments in id_to_shared_ibd.items():
        id1, id2 = pair
        
        # Check if one of the IDs is the new individual
        if id1 == id_val or id2 == id_val:
            # Get the other ID
            other_id = id2 if id1 == id_val else id1
            
            # Check if the other ID is in the pedigree
            if other_id in up_dct:
                # Calculate total shared IBD
                total_cm = sum(seg.get('length_cm', 0) for seg in segments)
                
                # Add to connection points
                connection_points.append((other_id, total_cm))
    
    # Sort by amount of IBD sharing (descending)
    connection_points.sort(key=lambda x: x[1], reverse=True)
    
    return connection_points

# Example usage
def demonstrate_connection_points():
    # Create a simple pedigree
    #   -1 (grandparent)     -2 (grandparent)
    #           \               /
    #            \             /
    #             -3 (parent)      -4 (parent)
    #                     \         /
    #                      \       /
    #                        1 (child)
    #                     
    # Create the pedigree structure
    pedigree = {
        -1: {},               # Grandparent 1
        -2: {},               # Grandparent 2
        -3: {-1: 1, -2: 1},   # Parent 1 with parents
        -4: {},               # Parent 2
        1: {-3: 1, -4: 1}     # Child with parents
    }
    
    # Create IBD sharing data
    # New individual 2 shares IBD with individual 1 (half-sibling level)
    id_to_shared_ibd = {
        (1, 2): [{'length_cm': 900}]  # Half-sibling level
    }
    
    # Find potential connection points for individual 2
    connection_points = find_potential_connection_points(2, pedigree, id_to_shared_ibd)
    
    print("Potential connection points for individual 2:")
    for existing_id, ibd_amount in connection_points:
        print(f"Individual {existing_id}: {ibd_amount} cM")
    
    # Visualize the pedigree
    visualize_pedigree(pedigree, "Existing Pedigree")

# Helper function to visualize pedigrees
def visualize_pedigree(up_node_dict, title="Pedigree"):
    # Create a directed graph (edges point from child to parent)
    G = nx.DiGraph()
    
    # Add all nodes to the graph
    all_ids = set(up_node_dict.keys())
    for parents in up_node_dict.values():
        all_ids.update(parents.keys())
    
    # Add nodes with colors (blue for genotyped, gray for ungenotyped)
    for node_id in all_ids:
        if node_id > 0:  # Genotyped individuals have positive IDs
            G.add_node(node_id, color='lightblue')
        else:  # Ungenotyped individuals have negative IDs
            G.add_node(node_id, color='lightgray')
    
    # Add edges (from child to parent)
    for child, parents in up_node_dict.items():
        for parent in parents:
            G.add_edge(child, parent)
    
    # Create plot
    plt.figure(figsize=(10, 6))
    plt.title(title)
    
    # Get node colors
    node_colors = [G.nodes[n]['color'] for n in G.nodes]
    
    # Set layout (force-directed spring layout; the fixed seed makes it reproducible)
    pos = nx.spring_layout(G, seed=42)
    
    # Draw the graph
    nx.draw(G, pos, with_labels=True, node_color=node_colors, 
            node_size=800, font_weight='bold')
    
    plt.tight_layout()
    plt.show()

# Run the demonstration
demonstrate_connection_points()
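
The lookup above depends on two data shapes: an up-node dictionary whose keys are the IDs already in the pedigree, and an IBD dictionary keyed by ID pairs whose values are lists of segment records. The following is a minimal, self-contained sketch of the same idea under a few assumptions: the helper name `total_cm_by_partner` and the sample values are hypothetical, and only the `length_cm` field is taken from the lab's data format. It shows that a pair can carry several segments, that the new ID may appear in either position of the pair, and that partners outside the pedigree are ignored.

```python
from collections import defaultdict

# Hypothetical sample data; only the 'length_cm' key mirrors the lab's format.
id_to_shared_ibd = {
    (1, 2): [{'length_cm': 600.0}, {'length_cm': 300.0}],  # two segments, 900 cM total
    (2, 7): [{'length_cm': 50.0}],                          # 7 is not in the pedigree
    (3, 2): [{'length_cm': 120.0}],                         # new ID listed second
}
pedigree_ids = {1, 3, -1, -2}  # keys of an up-node dictionary
new_id = 2


def total_cm_by_partner(new_id, pedigree_ids, id_to_shared_ibd):
    """Sum segment lengths per pedigree member who shares IBD with new_id."""
    totals = defaultdict(float)
    for (id1, id2), segments in id_to_shared_ibd.items():
        if new_id not in (id1, id2):
            continue  # this pair does not involve the new individual
        other = id2 if id1 == new_id else id1
        if other in pedigree_ids:
            totals[other] += sum(seg.get('length_cm', 0) for seg in segments)
    # Strongest sharing first, as in find_potential_connection_points
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


partners = total_cm_by_partner(new_id, pedigree_ids, id_to_shared_ibd)
print(partners)
```

Accumulating into a dictionary, rather than appending to a list, also merges the totals if the same pair happens to be stored under both `(a, b)` and `(b, a)`.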

### 1.2 Evaluating Relationship Configurations

Once potential connection points are identified, we need to evaluate different relationship configurations between the new individual and each connection point. Let's implement a simplified version of this process:

In [None]:
def evaluate_relationship_configs(id_val, existing_id, shared_cm, id_to_info=None):
    """
    Evaluate different relationship configurations between two individuals.
    
    Args:
        id_val: ID of the new individual
        existing_id: ID of the existing individual
        shared_cm: Amount of shared DNA in centimorgans
        id_to_info: Dict with demographic information for individuals
        
    Returns:
        rel_configs: List of (up, down, num_ancs, likelihood, description) tuples
    """
    id_to_info = id_to_info or {}
    
    # Get demographic information
    info1 = id_to_info.get(id_val, {})
    info2 = id_to_info.get(existing_id, {})
    age1 = info1.get('age')
    age2 = info2.get('age')
    sex1 = info1.get('sex')
    sex2 = info2.get('sex')
    
    # Define possible relationship configurations
    relationship_configs = [
        # (up, down, num_ancs, description)
        (0, 1, 1, "Parent-Child (new→existing)"),  # new is parent of existing
        (1, 0, 1, "Parent-Child (existing→new)"),  # existing is parent of new
        (1, 1, 2, "Full Siblings"),               # full siblings
        (1, 1, 1, "Half Siblings"),               # half siblings
        (2, 2, 2, "First Cousins"),               # first cousins
        (2, 1, 1, "Half-Aunt/Uncle-Niece/Nephew"),  # half-aunt/uncle-niece/nephew
        (1, 2, 1, "Half-Niece/Nephew-Aunt/Uncle"),  # half-niece/nephew-aunt/uncle
        (3, 3, 2, "Second Cousins")               # second cousins
    ]
    
    # Expected IBD ranges for different relationships
    ibd_ranges = {
        "Parent-Child (new→existing)": (1700, 3400),
        "Parent-Child (existing→new)": (1700, 3400),
        "Full Siblings": (1700, 2800),
        "Half Siblings": (700, 1800),
        "First Cousins": (200, 900),
        "Half-Aunt/Uncle-Niece/Nephew": (400, 1200),
        "Half-Niece/Nephew-Aunt/Uncle": (400, 1200),
        "Second Cousins": (50, 400)
    }
    
    # Simple age compatibility check
    def is_age_compatible(rel_desc, age1, age2):
        if age1 is None or age2 is None:
            return True  # Can't check without ages
            
        if "Parent-Child (new→existing)" == rel_desc:
            return age1 > age2 + 12  # Parent should be older than child
            
        if "Parent-Child (existing→new)" == rel_desc:
            return age2 > age1 + 12  # Parent should be older than child
            
        if "Siblings" in rel_desc:
            return abs(age1 - age2) < 30  # Siblings typically close in age
            
        return True  # Default to compatible
    
    # Evaluate each relationship configuration
    rel_configs = []
    for up, down, num_ancs, description in relationship_configs:
        # Check age compatibility
        if not is_age_compatible(description, age1, age2):
            continue
            
        # Check IBD compatibility
        min_cm, max_cm = ibd_ranges.get(description, (0, float('inf')))
        
        # Calculate likelihood based on IBD amount
        if min_cm <= shared_cm <= max_cm:
            # Higher likelihood for IBD in middle of range
            range_center = (min_cm + max_cm) / 2
            distance_from_center = abs(shared_cm - range_center)
            range_width = (max_cm - min_cm) / 2
            
            # Normalize distance to [0, 1] range
            normalized_distance = min(distance_from_center / range_width, 1.0)
            
            # Calculate likelihood (higher for values closer to center)
            likelihood = (1 - normalized_distance) * math.log(1 + shared_cm)
        else:
            # Outside expected range - lower likelihood
            likelihood = 0.1 * math.log(1 + shared_cm)
        
        rel_configs.append((up, down, num_ancs, likelihood, description))
    
    # Sort by likelihood (descending)
    rel_configs.sort(key=lambda x: x[3], reverse=True)
    
    return rel_configs

# Demonstrate the function
def demonstrate_relationship_configs():
    # Create demographic information
    id_to_info = {
        1: {'age': 25, 'sex': 'M'},
        2: {'age': 23, 'sex': 'F'},
        3: {'age': 50, 'sex': 'F'}
    }
    
    # Evaluate different IBD amounts
    ibd_amounts = [1800, 900, 400, 100]
    labels = ["Parent-Child level", "Half-sibling level", "First cousin level", "Second cousin level"]
    
    for shared_cm, label in zip(ibd_amounts, labels):
        print(f"\nEvaluating relationships for IBD sharing of {shared_cm} cM ({label}):")
        
        # Evaluate relationship configurations
        rel_configs = evaluate_relationship_configs(2, 1, shared_cm, id_to_info)
        
        # Display top 3 configurations
        print("Top 3 relationship configurations:")
        for i, (up, down, num_ancs, likelihood, description) in enumerate(rel_configs[:3]):
            print(f"{i+1}. {description} (up={up}, down={down}, ancs={num_ancs}): likelihood={likelihood:.2f}")

# Run the demonstration
demonstrate_relationship_configs()
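
To make the scoring rule concrete, here is a small worked example that isolates the range-based heuristic used above and applies it to 900 cM. The function name `range_score` is hypothetical; the formula and the cM ranges are copied from this lab's simplified code, and they are a teaching heuristic rather than a calibrated likelihood.

```python
import math


def range_score(shared_cm, min_cm, max_cm):
    """Triangular score: peaks at the range centre and falls to 0 at the edges."""
    if min_cm <= shared_cm <= max_cm:
        center = (min_cm + max_cm) / 2
        half_width = (max_cm - min_cm) / 2
        closeness = 1 - min(abs(shared_cm - center) / half_width, 1.0)
        return closeness * math.log(1 + shared_cm)
    # Outside the expected range: small fixed fraction of the log term
    return 0.1 * math.log(1 + shared_cm)


shared_cm = 900
scores = {
    "Half Siblings": range_score(shared_cm, 700, 1800),
    "Half-Aunt/Uncle-Niece/Nephew": range_score(shared_cm, 400, 1200),
    "First Cousins": range_score(shared_cm, 200, 900),
    "Full Siblings": range_score(shared_cm, 1700, 2800),
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Two properties of the heuristic show up here. First, 900 cM lies closer to the centre of the half-avuncular range (800 cM) than to the centre of the half-sibling range (1250 cM), so the half-avuncular configuration scores higher. Second, a value exactly on a range boundary (900 cM for first cousins) scores 0, which is lower than the out-of-range fallback; keep this discontinuity in mind when interpreting the rankings.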

### 1.3 Adding an Individual with a Specific Relationship

Once we've determined the most likely relationship configuration, we need to add the new individual to the pedigree with that relationship. Let's implement a simplified version of this function:

In [None]:
def add_with_relationship(id_val, existing_id, up_dct, up, down, num_ancs):
    """
    Add a new individual to a pedigree with a specific relationship.
    
    Args:
        id_val: ID of the individual to add
        existing_id: ID of the existing individual to connect to
        up_dct: Up-node dictionary representing the existing pedigree
        up, down, num_ancs: Relationship parameters
        
    Returns:
        up_dct_new: Updated pedigree with the new individual
    """
    # Create a copy of the existing pedigree
    up_dct_new = copy.deepcopy(up_dct)
    
    # Ensure the new individual exists in the pedigree
    if id_val not in up_dct_new:
        up_dct_new[id_val] = {}
    
    # Handle different relationship types
    if up == 0 and down == 1:  # id_val is parent of existing_id
        # Add id_val as parent of existing_id
        up_dct_new[existing_id][id_val] = 1
        
    elif up == 1 and down == 0:  # existing_id is parent of id_val
        # Add existing_id as parent of id_val
        up_dct_new[id_val][existing_id] = 1
        
    elif up == 1 and down == 1:  # Siblings
        # Get existing parents of existing_id
        existing_parents = list(up_dct_new.get(existing_id, {}).keys())
        
        if num_ancs == 1:  # Half siblings
            # Need at least one parent to create half siblings
            if existing_parents:
                shared_parent = existing_parents[0]
                up_dct_new[id_val][shared_parent] = 1
            else:
                # Create a new ungenotyped parent; IDs must stay strictly negative,
                # even when every existing ID is positive
                new_parent_id = min([0, *up_dct_new]) - 1
                up_dct_new[existing_id][new_parent_id] = 1
                up_dct_new[id_val][new_parent_id] = 1
                up_dct_new[new_parent_id] = {}
                
        elif num_ancs == 2:  # Full siblings
            # Need two parents to create full siblings
            if len(existing_parents) >= 2:
                parent1, parent2 = existing_parents[:2]
                up_dct_new[id_val][parent1] = 1
                up_dct_new[id_val][parent2] = 1
            elif len(existing_parents) == 1:
                # Use the existing parent and create a new one
                shared_parent = existing_parents[0]
                new_parent_id = min([0, *up_dct_new]) - 1  # Strictly negative ungenotyped ID
                up_dct_new[existing_id][new_parent_id] = 1
                up_dct_new[id_val][shared_parent] = 1
                up_dct_new[id_val][new_parent_id] = 1
                up_dct_new[new_parent_id] = {}
            else:
                # Create two new ungenotyped parents (strictly negative IDs)
                parent1_id = min([0, *up_dct_new]) - 1
                parent2_id = parent1_id - 1
                up_dct_new[existing_id][parent1_id] = 1
                up_dct_new[existing_id][parent2_id] = 1
                up_dct_new[id_val][parent1_id] = 1
                up_dct_new[id_val][parent2_id] = 1
                up_dct_new[parent1_id] = {}
                up_dct_new[parent2_id] = {}
                
    else:  # More distant relationships
        # Helper function to get a new ungenotyped ID (always strictly negative)
        def get_new_id():
            return min([0, *up_dct_new]) - 1
        
        # Create ancestor chains as needed
        if up > 0 and down > 0:  # Relationship through common ancestors
            # The common ancestors sit `up` generations above id_val and `down`
            # generations above existing_id, so each chain stops one step short
            # and the common ancestors are attached on top.
            curr_id = id_val
            for _ in range(up - 1):
                parent_id = get_new_id()
                up_dct_new[curr_id][parent_id] = 1
                up_dct_new[parent_id] = {}
                curr_id = parent_id
            top1 = curr_id  # Child of the common ancestor(s) on id_val's side
            
            # Create chain up from existing_id
            curr_id = existing_id
            for _ in range(down - 1):
                parent_id = get_new_id()
                up_dct_new[curr_id][parent_id] = 1
                up_dct_new[parent_id] = {}
                curr_id = parent_id
            top2 = curr_id  # Child of the common ancestor(s) on existing_id's side
            
            # Connect the tops of the chains through common ancestors
            if num_ancs >= 1:
                common_anc1 = get_new_id()
                up_dct_new[top1][common_anc1] = 1
                up_dct_new[top2][common_anc1] = 1
                up_dct_new[common_anc1] = {}
                
            if num_ancs >= 2:
                common_anc2 = get_new_id()
                up_dct_new[top1][common_anc2] = 1
                up_dct_new[top2][common_anc2] = 1
                up_dct_new[common_anc2] = {}
    
    return up_dct_new

# Demonstrate the function
def demonstrate_adding_relationships():
    # Create a simple pedigree with one individual
    pedigree = {1: {}}
    
    # Add individual 2 as child of individual 1
    ped_child = add_with_relationship(2, 1, pedigree, 1, 0, 1)
    print("Adding as child of individual 1:")
    visualize_pedigree(ped_child, "2 as Child of 1")
    
    # Add individual 3 as parent of individual 1
    ped_parent = add_with_relationship(3, 1, pedigree, 0, 1, 1)
    print("\nAdding as parent of individual 1:")
    visualize_pedigree(ped_parent, "3 as Parent of 1")
    
    # Add individual 4 as full sibling of individual 1
    ped_sibling = add_with_relationship(4, 1, pedigree, 1, 1, 2)
    print("\nAdding as full sibling of individual 1:")
    visualize_pedigree(ped_sibling, "4 as Full Sibling of 1")
    
    # Add individual 5 as first cousin of individual 1
    ped_cousin = add_with_relationship(5, 1, pedigree, 2, 2, 2)
    print("\nAdding as first cousin of individual 1:")
    visualize_pedigree(ped_cousin, "5 as First Cousin of 1")

# Run the demonstration
demonstrate_adding_relationships()
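
One way to sanity-check a pedigree produced by a function like this is to read the `(up, down, num_ancs)` triple back out of the up-node dictionary. The sketch below does that with two hypothetical helpers, `ancestor_depths` and `relationship_tuple`, which are not part of Bonsai. It assumes a pedigree without inbreeding loops, so that all closest common ancestors lie at the same depths.

```python
def ancestor_depths(up_dct, node):
    """Map node (depth 0) and each of its ancestors to their minimum depth."""
    depths = {node: 0}
    frontier = [node]
    while frontier:  # breadth-first walk up the pedigree
        next_frontier = []
        for current in frontier:
            for parent in up_dct.get(current, {}):
                if parent not in depths:
                    depths[parent] = depths[current] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return depths


def relationship_tuple(up_dct, id1, id2):
    """Return (up, down, num_ancs) via the closest common ancestors, or None."""
    d1, d2 = ancestor_depths(up_dct, id1), ancestor_depths(up_dct, id2)
    common = set(d1) & set(d2)
    if not common:
        return None  # no shared ancestor in this pedigree
    best = min(d1[a] + d2[a] for a in common)
    closest = [a for a in common if d1[a] + d2[a] == best]
    return d1[closest[0]], d2[closest[0]], len(closest)


# Hand-built first cousins: 5 and 1 share grandparents -3 and -4.
first_cousins = {
    5: {-1: 1}, 1: {-2: 1},
    -1: {-3: 1, -4: 1}, -2: {-3: 1, -4: 1},
    -3: {}, -4: {},
}
print(relationship_tuple(first_cousins, 5, 1))
```

For first cousins the common ancestors are grandparents, so both `up` and `down` are 2. Running the same check on the output of `add_with_relationship` is a quick way to confirm that the requested configuration is the one that was actually built.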

## Part 2: Implementing Incremental Addition

Now let's combine these functions to implement the complete incremental addition process:

In [None]:
def add_individual_to_pedigree_simple(id_val, up_dct, id_to_shared_ibd, id_to_info=None, n_keep=3):
    """
    Add a single individual to an existing pedigree.
    
    Args:
        id_val: ID of the individual to add
        up_dct: Up-node dictionary representing the existing pedigree
        id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
        id_to_info: Dict with demographic information for individuals
        n_keep: Number of top pedigrees to keep
        
    Returns:
        best_pedigrees: List of (pedigree, likelihood) pairs
    """
    id_to_info = id_to_info or {}
    
    # Step 1: Find potential connection points
    connection_points = find_potential_connection_points(id_val, up_dct, id_to_shared_ibd)
    
    # If no connection points found, just add the individual without connections
    if not connection_points:
        new_ped = copy.deepcopy(up_dct)
        new_ped[id_val] = {}  # Add as isolated individual
        return [(new_ped, 0.0)]
    
    # Step 2: Evaluate relationship configurations for each connection point
    all_pedigrees = []
    
    for existing_id, shared_cm in connection_points:
        # Get possible relationship configurations
        rel_configs = evaluate_relationship_configs(id_val, existing_id, shared_cm, id_to_info)
        
        # Add individual with each relationship configuration
        for up, down, num_ancs, likelihood, description in rel_configs:
            # Create new pedigree with this relationship
            new_ped = add_with_relationship(id_val, existing_id, up_dct, up, down, num_ancs)
            
            # Add to list of potential pedigrees
            all_pedigrees.append((new_ped, likelihood, description))
    
    # Step 3: Sort by likelihood and keep top n_keep
    all_pedigrees.sort(key=lambda x: x[1], reverse=True)
    best_pedigrees = [(ped, likelihood) for ped, likelihood, _ in all_pedigrees[:n_keep]]
    
    return best_pedigrees

# Demonstrate the function
def demonstrate_incremental_addition():
    # Create a simple pedigree
    #   -1 (grandparent)     -2 (grandparent)
    #           \               /
    #            \             /
    #             -3 (parent)      -4 (parent)
    #                     \         /
    #                      \       /
    #                        1 (child)
    #                     
    # Create the pedigree structure
    pedigree = {
        -1: {},               # Grandparent 1
        -2: {},               # Grandparent 2
        -3: {-1: 1, -2: 1},   # Parent 1 with parents
        -4: {},               # Parent 2
        1: {-3: 1, -4: 1}     # Child with parents
    }
    
    # Create IBD sharing data for new individuals
    id_to_shared_ibd = {
        (1, 2): [{'length_cm': 900}],   # Individual 2 shares with 1 (half-sibling level)
        (1, 3): [{'length_cm': 1800}],  # Individual 3 shares with 1 (parent-child level)
        (1, 4): [{'length_cm': 400}]    # Individual 4 shares with 1 (first cousin level)
    }
    
    # Create demographic information
    id_to_info = {
        1: {'age': 25, 'sex': 'M'},
        2: {'age': 23, 'sex': 'F'},
        3: {'age': 50, 'sex': 'F'},
        4: {'age': 20, 'sex': 'M'}
    }
    
    # Visualize the original pedigree
    print("Original pedigree:")
    visualize_pedigree(pedigree, "Original Pedigree")
    
    # Add individual 2 (half-sibling level IBD)
    best_pedigrees_2 = add_individual_to_pedigree_simple(2, pedigree, id_to_shared_ibd, id_to_info)
    best_ped_2, likelihood_2 = best_pedigrees_2[0]
    
    print(f"\nAdded individual 2 with likelihood {likelihood_2:.2f}:")
    visualize_pedigree(best_ped_2, "Pedigree with Individual 2")
    
    # Add individual 3 (parent-child level IBD)
    best_pedigrees_3 = add_individual_to_pedigree_simple(3, pedigree, id_to_shared_ibd, id_to_info)
    best_ped_3, likelihood_3 = best_pedigrees_3[0]
    
    print(f"\nAdded individual 3 with likelihood {likelihood_3:.2f}:")
    visualize_pedigree(best_ped_3, "Pedigree with Individual 3")
    
    # Add individual 4 (first cousin level IBD)
    best_pedigrees_4 = add_individual_to_pedigree_simple(4, pedigree, id_to_shared_ibd, id_to_info)
    best_ped_4, likelihood_4 = best_pedigrees_4[0]
    
    print(f"\nAdded individual 4 with likelihood {likelihood_4:.2f}:")
    visualize_pedigree(best_ped_4, "Pedigree with Individual 4")

# Run the demonstration
demonstrate_incremental_addition()
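
Step 3 above sorts every candidate and slices off the top `n_keep`. When the candidate list is long, the standard library's `heapq.nlargest` gives the same result without sorting the full list. The sketch below uses hypothetical `(label, score)` tuples in place of `(pedigree, likelihood)` pairs; it illustrates the selection step only and is not how Bonsai itself prunes candidates.

```python
import heapq

# Hypothetical candidates standing in for (pedigree, likelihood) pairs.
candidates = [("A", 2.5), ("B", 5.0), ("C", 0.7), ("D", 5.0), ("E", 0.0)]
n_keep = 3

# Equivalent to sorted(candidates, key=..., reverse=True)[:n_keep]
top = heapq.nlargest(n_keep, candidates, key=lambda item: item[1])
print(top)
```

In the simplified implementation the larger cost is usually the `copy.deepcopy` performed for every candidate, so scoring configurations first and copying only the survivors would be the more significant saving.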

### 2.1 Building a Pedigree Incrementally

Now let's implement a function to build a pedigree incrementally by adding individuals one at a time:

In [None]:
def order_individuals_by_connectivity(id_to_shared_ibd):
    """
    Order individuals by their IBD connectivity.
    
    Args:
        id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
        
    Returns:
        ordered_ids: List of IDs ordered by connectivity
    """
    # Get all individuals
    all_ids = set()
    id_to_total_ibd = defaultdict(float)
    id_to_connection_count = defaultdict(int)
    
    # Calculate total IBD and connection count for each individual
    for (id1, id2), segments in id_to_shared_ibd.items():
        all_ids.add(id1)
        all_ids.add(id2)
        
        total_cm = sum(seg.get('length_cm', 0) for seg in segments)
        
        id_to_total_ibd[id1] += total_cm
        id_to_total_ibd[id2] += total_cm
        
        id_to_connection_count[id1] += 1
        id_to_connection_count[id2] += 1
    
    # Calculate connectivity score (combination of total IBD and connection count)
    id_to_score = {}
    for id_val in all_ids:
        # Weighted sum of total IBD and connection count
        score = id_to_total_ibd[id_val] + 100 * id_to_connection_count[id_val]
        id_to_score[id_val] = score
    
    # Sort by score (descending)
    ordered_ids = sorted(all_ids, key=lambda x: id_to_score[x], reverse=True)
    
    return ordered_ids

def build_pedigree_incremental(id_to_shared_ibd, id_to_info=None, n_keep=3, id_order=None):
    """
    Build a pedigree incrementally by adding individuals one at a time.
    
    Args:
        id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
        id_to_info: Dict with demographic information for individuals
        n_keep: Number of top pedigrees to keep
        id_order: Order in which to add individuals (if None, determined automatically)
        
    Returns:
        best_pedigree: The final best pedigree
    """
    id_to_info = id_to_info or {}
    
    # If no specific order provided, order by IBD connectivity
    if id_order is None:
        id_order = order_individuals_by_connectivity(id_to_shared_ibd)
    
    # Nothing to build if there are no individuals to add
    if not id_order:
        return {}
    
    # Start with the first individual
    first_id = id_order[0]
    up_dct = {first_id: {}}  # Single individual pedigree
    
    # Add remaining individuals one at a time
    for i, id_val in enumerate(id_order[1:], 1):
        print(f"\nAdding individual {id_val} ({i}/{len(id_order)-1})...")
        
        # Add this individual to the current pedigree
        result = add_individual_to_pedigree_simple(
            id_val=id_val,
            up_dct=up_dct,
            id_to_shared_ibd=id_to_shared_ibd,
            id_to_info=id_to_info,
            n_keep=n_keep
        )
        
        # Update the current pedigree
        if result:
            up_dct, likelihood = result[0]
            print(f"  Added with likelihood {likelihood:.2f}")
        else:
            print("  Failed to add individual")
    
    return up_dct

# Demonstrate the function
def demonstrate_full_incremental_build():
    # Create IBD sharing data for a small group of individuals
    id_to_shared_ibd = {
        # Family 1: Parent (1) and children (2, 3)
        (1, 2): [{'length_cm': 1800}],  # Parent-child
        (1, 3): [{'length_cm': 1800}],  # Parent-child
        (2, 3): [{'length_cm': 1700}],  # Full siblings
        
        # Family 2: Parent (4) and children (5, 6)
        (4, 5): [{'length_cm': 1800}],  # Parent-child
        (4, 6): [{'length_cm': 1800}],  # Parent-child
        (5, 6): [{'length_cm': 1700}],  # Full siblings
        
        # Connection between families: 1 and 4 are half siblings (share ~900 cM)
        (1, 4): [{'length_cm': 900}],   # Half siblings
        
        # More distant relationships
        (2, 5): [{'length_cm': 400}],   # First cousins
        (3, 6): [{'length_cm': 400}],   # First cousins
    }
    
    # Create demographic information
    id_to_info = {
        1: {'age': 45, 'sex': 'M'},  # Parent in family 1
        2: {'age': 20, 'sex': 'F'},  # Child 1 in family 1
        3: {'age': 18, 'sex': 'M'},  # Child 2 in family 1
        4: {'age': 42, 'sex': 'F'},  # Parent in family 2
        5: {'age': 19, 'sex': 'M'},  # Child 1 in family 2
        6: {'age': 17, 'sex': 'F'},  # Child 2 in family 2
    }
    
    # Order individuals by connectivity
    ordered_ids = order_individuals_by_connectivity(id_to_shared_ibd)
    print(f"Building pedigree with individuals ordered by connectivity: {ordered_ids}")
    
    # Build the pedigree incrementally
    final_pedigree = build_pedigree_incremental(
        id_to_shared_ibd=id_to_shared_ibd,
        id_to_info=id_to_info,
        n_keep=3,
        id_order=ordered_ids
    )
    
    # Visualize the final pedigree
    print("\nFinal pedigree:")
    visualize_pedigree(final_pedigree, "Final Incrementally Built Pedigree")

# Run the demonstration
demonstrate_full_incremental_build()
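
The connectivity score can be worked through by hand for the demonstration data. The sketch below recomputes it with a hypothetical helper, `order_by_connectivity`, that takes total cM per pair directly instead of segment lists. It also adds one assumption that the version above does not make: ties are broken by ID, so the ordering does not depend on set iteration order.

```python
from collections import defaultdict


def order_by_connectivity(pair_to_cm, per_link_bonus=100):
    """Order IDs by total shared cM plus a per-connection bonus; ties go to the lower ID."""
    total_cm = defaultdict(float)
    n_links = defaultdict(int)
    for (id1, id2), cm in pair_to_cm.items():
        for id_val in (id1, id2):
            total_cm[id_val] += cm
            n_links[id_val] += 1
    score = {i: total_cm[i] + per_link_bonus * n_links[i] for i in total_cm}
    ordered = sorted(score, key=lambda i: (-score[i], i))
    return ordered, score


# Total cM per pair, taken from the demonstration data above.
pair_to_cm = {
    (1, 2): 1800, (1, 3): 1800, (2, 3): 1700,
    (4, 5): 1800, (4, 6): 1800, (5, 6): 1700,
    (1, 4): 900, (2, 5): 400, (3, 6): 400,
}
ordered_ids, score = order_by_connectivity(pair_to_cm)
print(ordered_ids)
print(score[1], score[2])
```

Individuals 1 and 4 each have three connections totalling 4500 cM, giving a score of 4800, while the four children each score 4200. Because several individuals tie, an explicit tie-break keeps the build order, and therefore the resulting pedigree, reproducible from run to run.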

## Part 3: Optimizing Incremental Addition

In real-world applications, Bonsai v3 includes several optimizations to make the incremental addition process more efficient, especially for large pedigrees. Let's implement some of these optimizations:

In [None]:
def optimize_addition(id_val, up_dct, id_to_shared_ibd, id_to_info=None):
    """
    Apply optimizations to make the addition process more efficient.
    
    Args:
        id_val: ID of the individual to add
        up_dct: Up-node dictionary representing the existing pedigree
        id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
        id_to_info: Dict with demographic information for individuals
        
    Returns:
        optimized_data: Tuple of (top_points, suggested_max_up, filtered_ibd)
    """
    id_to_info = id_to_info or {}
    
    # Calculate IBD threshold based on pedigree size
    pedigree_size = len(up_dct)
    ibd_threshold = max(20, 10 * math.log(pedigree_size + 1))  # Avoid log(0)
    
    print(f"Using IBD threshold of {ibd_threshold:.1f} cM based on pedigree size {pedigree_size}")
    
    # Find potential connection points
    connection_points = find_potential_connection_points(id_val, up_dct, id_to_shared_ibd)
    
    # Filter by IBD threshold, then sort explicitly so the strongest
    # connections come first (the slicing and max_up logic below rely on this order)
    filtered_points = sorted(
        ((id_i, ibd) for id_i, ibd in connection_points if ibd >= ibd_threshold),
        key=lambda point: point[1],
        reverse=True,
    )
    
    print(f"Found {len(connection_points)} potential connection points, {len(filtered_points)} above threshold")
    
    # Limit the number of connection points to evaluate (slicing handles shorter lists)
    max_points = 10
    top_points = filtered_points[:max_points]
    
    # Determine optimal max_up parameter based on IBD amounts
    if top_points:
        max_ibd = top_points[0][1]
        if max_ibd > 1700:  # Parent-child level
            suggested_max_up = 1
        elif max_ibd > 700:  # Sibling level
            suggested_max_up = 2
        elif max_ibd > 200:  # First cousin level
            suggested_max_up = 3
        elif max_ibd > 50:   # Second cousin level
            suggested_max_up = 4
        else:
            suggested_max_up = 5
            
        print(f"Using max_up={suggested_max_up} based on maximum IBD={max_ibd:.1f} cM")
    else:
        suggested_max_up = 3  # Default
        print(f"Using default max_up={suggested_max_up} (no connection points above threshold)")
    
    # Filter id_to_shared_ibd to include only relevant pairs
    relevant_ids = set([id_val])
    relevant_ids.update([id_i for id_i, _ in top_points])
    
    filtered_ibd = {}
    for pair, segments in id_to_shared_ibd.items():
        id1, id2 = pair
        if id1 in relevant_ids or id2 in relevant_ids:
            filtered_ibd[pair] = segments
    
    print(f"Filtered IBD data from {len(id_to_shared_ibd)} to {len(filtered_ibd)} relevant pairs")
    
    return (top_points, suggested_max_up, filtered_ibd)

def add_individual_optimized(id_val, up_dct, id_to_shared_ibd, id_to_info=None, n_keep=3):
    """
    Add a single individual to an existing pedigree with optimizations.
    
    Args:
        id_val: ID of the individual to add
        up_dct: Up-node dictionary representing the existing pedigree
        id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
        id_to_info: Dict with demographic information for individuals
        n_keep: Number of top pedigrees to keep
        
    Returns:
        best_pedigrees: List of (pedigree, likelihood) pairs
    """
    id_to_info = id_to_info or {}
    
    # Apply optimizations
    top_points, max_up, filtered_ibd = optimize_addition(id_val, up_dct, id_to_shared_ibd, id_to_info)
    
    # If no connection points found, just add the individual without connections
    if not top_points:
        new_ped = copy.deepcopy(up_dct)
        new_ped[id_val] = {}  # Add as isolated individual
        return [(new_ped, 0.0)]
    
    # Evaluate relationship configurations for each connection point
    all_pedigrees = []
    
    for existing_id, shared_cm in top_points:
        # Get possible relationship configurations
        rel_configs = evaluate_relationship_configs(id_val, existing_id, shared_cm, id_to_info)
        
        # Limit the depth of relationships based on max_up
        filtered_configs = [rc for rc in rel_configs if rc[0] <= max_up and rc[1] <= max_up]
        
        # Add individual with each relationship configuration
        for up, down, num_ancs, likelihood, description in filtered_configs:
            # Create new pedigree with this relationship
            new_ped = add_with_relationship(id_val, existing_id, up_dct, up, down, num_ancs)
            
            # Add to list of potential pedigrees
            all_pedigrees.append((new_ped, likelihood, description))
    
    # Sort by likelihood and keep top n_keep
    all_pedigrees.sort(key=lambda x: x[1], reverse=True)
    best_pedigrees = [(ped, likelihood) for ped, likelihood, _ in all_pedigrees[:n_keep]]
    
    # Print the top relationship configurations
    print("Top relationship configurations:")
    for i, (_, likelihood, description) in enumerate(all_pedigrees[:3]):
        print(f"  {i+1}. {description}: likelihood={likelihood:.2f}")
    
    return best_pedigrees

# Demonstrate the optimized function
def demonstrate_optimized_addition():
    # Create a larger pedigree with 20 individuals
    up_dct = {}
    for i in range(1, 21):
        up_dct[i] = {}
    
    # Add some relationship structure
    # Family 1: Individuals 1-5
    up_dct[2][1] = 1  # 1 is parent of 2
    up_dct[3][1] = 1  # 1 is parent of 3
    up_dct[4][2] = 1  # 2 is parent of 4
    up_dct[5][3] = 1  # 3 is parent of 5
    
    # Family 2: Individuals 6-10
    up_dct[7][6] = 1  # 6 is parent of 7
    up_dct[8][6] = 1  # 6 is parent of 8
    up_dct[9][7] = 1  # 7 is parent of 9
    up_dct[10][8] = 1  # 8 is parent of 10
    
    # Family 3: Individuals 11-15
    up_dct[12][11] = 1  # 11 is parent of 12
    up_dct[13][11] = 1  # 11 is parent of 13
    up_dct[14][12] = 1  # 12 is parent of 14
    up_dct[15][13] = 1  # 13 is parent of 15
    
    # Family 4: Individuals 16-20
    up_dct[17][16] = 1  # 16 is parent of 17
    up_dct[18][16] = 1  # 16 is parent of 18
    up_dct[19][17] = 1  # 17 is parent of 19
    up_dct[20][18] = 1  # 18 is parent of 20
    
    # Create IBD sharing data for new individual 21
    id_to_shared_ibd = {
        (5, 21): [{'length_cm': 1800}],   # Parent-child level with individual 5
        (10, 21): [{'length_cm': 400}],   # First cousin level with individual 10
        (15, 21): [{'length_cm': 100}],   # Second cousin level with individual 15
        (20, 21): [{'length_cm': 50}]     # Very distant with individual 20
    }
    
    # Create demographic information
    id_to_info = {
        5: {'age': 35, 'sex': 'F'},
        10: {'age': 30, 'sex': 'M'},
        15: {'age': 28, 'sex': 'F'},
        20: {'age': 25, 'sex': 'M'},
        21: {'age': 10, 'sex': 'M'}
    }
    
    # Add the new individual with optimization
    print("Using optimized incremental addition:")
    result = add_individual_optimized(21, up_dct, id_to_shared_ibd, id_to_info)
    
    # Visualize the result
    if result:
        best_ped, likelihood = result[0]
        print(f"\nAdded individual 21 with likelihood {likelihood:.2f}")
        
        # For large pedigrees, visualize only the relevant portion
        relevant_ids = {21}  # Start with the new individual
        
        # Add direct connections
        for id_val in best_ped:
            if id_val == 21 or 21 in best_ped.get(id_val, {}):
                relevant_ids.add(id_val)
                # Add parents
                relevant_ids.update(best_ped.get(id_val, {}).keys())
        
        # Create a subgraph with only relevant individuals
        subgraph = {id_val: parents for id_val, parents in best_ped.items() 
                   if id_val in relevant_ids}
        
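        # Add additional parents. Iterate over a snapshot of the items:
        # adding keys to a dict while iterating over it raises RuntimeError.
        for id_val, parents in list(subgraph.items()):
            for parent_id in parents:
                if parent_id not in subgraph:
                    subgraph[parent_id] = best_ped.get(parent_id, {})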
        
        # Visualize the subgraph
        visualize_pedigree(subgraph, "Pedigree with Individual 21 (Relevant Portion)")
    else:
        print("Failed to add individual")

# Run the demonstration
demonstrate_optimized_addition()
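
To see how the two heuristics in `optimize_addition` behave on their own, the sketch below pulls them out into standalone functions. The names `ibd_threshold_cm` and `suggest_max_up` are hypothetical, and the cM cut-offs are the toy values used in this lab, not calibrated Bonsai parameters:

```python
import math

def ibd_threshold_cm(pedigree_size, floor_cm=20.0, scale=10.0):
    """Minimum shared cM for a connection point; grows with log(pedigree size)."""
    return max(floor_cm, scale * math.log(pedigree_size + 1))

def suggest_max_up(max_ibd_cm):
    """Map the strongest IBD match to a search depth, using this lab's toy tiers."""
    tiers = [(1700, 1), (700, 2), (200, 3), (50, 4)]  # (lower bound in cM, max_up)
    for lower_bound, depth in tiers:
        if max_ibd_cm > lower_bound:
            return depth
    return 5  # weakest matches get the deepest search

for n in (1, 20, 1000):
    print(f"pedigree size {n:>4}: threshold {ibd_threshold_cm(n):.1f} cM")

for cm in (1800, 400, 100, 50):
    print(f"max IBD {cm:>4} cM -> max_up={suggest_max_up(cm)}")
```

The threshold stays at the 20 cM floor for small pedigrees and rises slowly after that, so weak matches are pruned more aggressively only once the pedigree is large. The tiers use strict inequalities, so a match of exactly 50 cM falls through to the deepest search.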

## Summary

In this lab, we explored how Bonsai v3 efficiently adds new individuals to existing pedigrees one at a time, rather than rebuilding the entire structure. Key takeaways include:

1. **Connection Point Identification**: Finding individuals in the existing pedigree who share IBD with the new individual.

2. **Relationship Evaluation**: Systematically evaluating different relationship configurations based on IBD sharing and demographic constraints.

3. **Physical Addition**: Implementing the addition of a new individual with a specific relationship to an existing individual.

4. **Incremental Building**: Using these techniques to build a pedigree incrementally by adding individuals one at a time in an optimal order.

5. **Optimizations**: Implementing optimizations to make the process more efficient, especially for large pedigrees.

This incremental approach is critical for performance when working with large datasets and for updating pedigrees as new genetic data becomes available. It allows Bonsai v3 to scale to much larger pedigrees than would be possible with direct optimization of the entire structure.
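
As a compact recap of the first takeaway, here is a minimal sketch of connection point identification. The helper name `connection_points` and its `min_cm` parameter are hypothetical and only illustrate the idea. It totals the shared cM between the new individual and each pedigree member and returns the matches strongest first:

```python
def connection_points(new_id, pedigree_ids, id_to_shared_ibd, min_cm=0.0):
    """Hypothetical helper: (existing_id, total_cm) pairs, strongest first."""
    totals = {}
    for (a, b), segments in id_to_shared_ibd.items():
        if new_id not in (a, b):
            continue  # pair does not involve the new individual
        other = b if a == new_id else a
        if other not in pedigree_ids:
            continue  # the match is not in the pedigree yet
        totals[other] = totals.get(other, 0.0) + sum(seg['length_cm'] for seg in segments)
    points = [(other, cm) for other, cm in totals.items() if cm >= min_cm]
    return sorted(points, key=lambda point: point[1], reverse=True)

# IBD data from the optimized-addition demonstration (new individual 21)
demo_ibd = {
    (5, 21): [{'length_cm': 1800}],
    (10, 21): [{'length_cm': 400}],
    (15, 21): [{'length_cm': 100}],
    (20, 21): [{'length_cm': 50}],
}
demo_points = connection_points(21, set(range(1, 21)), demo_ibd, min_cm=60)
print(demo_points)
```

With `min_cm=60`, the 50 cM match to individual 20 is dropped and the remaining three candidates are returned in descending order of shared IBD.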

In [None]:
# Convert this notebook to PDF using poetry
!poetry run jupyter nbconvert --to pdf Lab17_Incremental_Addition.ipynb

# Note: PDF conversion requires LaTeX to be installed on your system
# If you encounter errors, you may need to install it:
# On Ubuntu/Debian: sudo apt-get install texlive-xetex
# On macOS with Homebrew: brew install texlive