# Lab 16: Merging Pedigrees with Optimal Connection Points

## Overview

In this lab, we'll explore how Bonsai v3 finds and uses optimal connection points to merge separate pedigrees into larger structures. This process is crucial for reconstructing complex family networks from genetic data, as it determines how smaller family units connect to form larger genealogies.

In [None]:
# Standard imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown
import inspect
import importlib
import copy
import random
import math
from collections import defaultdict

sys.path.append(os.path.dirname(os.getcwd()))

# Cross-compatibility setup
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite, save_results, save_plot

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [None]:
# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
def display_module_classes(module_name):
    """Display classes and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all classes
        classes = inspect.getmembers(module, inspect.isclass)
        
        # Filter classes defined in this module (not imported)
        classes = [(name, cls) for name, cls in classes if cls.__module__ == module_name]
        
        # Print info for each class
        for name, cls in classes:
            print(f"\n## {name}")
            
            # Get docstring
            doc = inspect.getdoc(cls)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
            
            # Get methods
            methods = inspect.getmembers(cls, inspect.isfunction)
            if methods:
                print("\nMethods:")
                for method_name, method in methods:
                    if not method_name.startswith('_'):  # Skip private methods
                        print(f"- {method_name}")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def display_module_functions(module_name):
    """Display functions and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all functions
        functions = inspect.getmembers(module, inspect.isfunction)
        
        # Filter functions defined in this module (not imported)
        functions = [(name, func) for name, func in functions if func.__module__ == module_name]
        
        # Print info for each function
        for name, func in functions:
            if name.startswith('_'):  # Skip private functions
                continue
                
            print(f"\n## {name}")
            
            # Get signature
            sig = inspect.signature(func)
            print(f"Signature: {name}{sig}")
            
            # Get docstring
            doc = inspect.getdoc(func)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def view_source(obj):
    """Display the source code of an object (function or class)"""
    try:
        source = inspect.getsource(obj)
        display(Markdown(f"```python\n{source}\n```"))
    except Exception as e:
        print(f"Error retrieving source: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [None]:
try:
    from utils.bonsaitree.bonsaitree import v3
    print("✅ Successfully imported Bonsai v3 module")
except ImportError as e:
    print(f"❌ Failed to import Bonsai v3 module: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")
    print("Make sure you've properly set up your environment with the Bonsai repository.")

## Lab 16: Merging Pedigrees with Optimal Connection Points

In this lab, we'll explore how Bonsai v3 finds and uses optimal connection points to merge separate pedigrees into larger structures. This involves:

1. Identifying potential connection points in each pedigree
2. Evaluating different relationship configurations for connecting these points
3. Calculating likelihoods for each connection option
4. Physically merging pedigrees through the optimal connection points

This process is essential for reconstructing large family networks from genetic data, as it determines how smaller family units connect to form larger genealogies.

## Part 1: Understanding Connection Points

Let's start by examining how Bonsai v3 identifies potential connection points within pedigrees.

In [None]:
# Import the connection point functions from Bonsai v3
if not is_jupyterlite():
    try:
        from utils.bonsaitree.bonsaitree.v3.pedigrees import get_possible_connection_point_set
        
        # Display the source code of the function
        print("Source code for get_possible_connection_point_set:")
        view_source(get_possible_connection_point_set)
    except (ImportError, AttributeError) as e:
        print(f"Could not import function: {e}")
else:
    print("Cannot display source code in JupyterLite environment.")

### 1.1 What are Connection Points?

Connection points are positions within a pedigree where it can be connected to another pedigree. In Bonsai v3, connection points are represented as tuples of `(id, parent_id, direction)` where:

- `id`: The ID of the individual at the connection point
- `parent_id`: The ID of the individual's parent (if connecting through a parent relationship), or `None`
- `direction`: Indicates which direction the connection should go (e.g., up to ancestors, down to descendants)
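
To make the tuple layout concrete, the sketch below wraps it in a `NamedTuple` and prints a readable description of a few points. The class name `ConnectionPoint`, the helper `describe`, and the string labels `'up'`/`'down'` are illustrative assumptions for this lab, not Bonsai v3's internal encoding.

```python
from typing import NamedTuple, Optional


class ConnectionPoint(NamedTuple):
    """Hypothetical wrapper around the (id, parent_id, direction) tuple."""
    id_val: int
    parent_id: Optional[int]
    direction: Optional[str]  # 'up', 'down', or None (assumed labels)


def describe(pt: ConnectionPoint) -> str:
    """Return a short human-readable description of a connection point."""
    if pt.parent_id is None:
        return f"connect directly through {pt.id_val}"
    return f"connect {pt.direction} from {pt.id_val} via parent {pt.parent_id}"


points = [
    ConnectionPoint(1, None, None),
    ConnectionPoint(1, -3, 'up'),
    ConnectionPoint(1, -3, 'down'),
]
for pt in points:
    # A NamedTuple still behaves like a plain tuple
    print(tuple(pt), "->", describe(pt))
```

Because a `NamedTuple` is still a tuple, such objects can be stored in sets and unpacked exactly like the plain tuples used in the rest of this lab.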

Let's implement a simplified version of the connection point finder to better understand how it works:

In [None]:
def simplified_get_connection_points(up_dct):
    """
    Find all possible connection points in a pedigree.
    
    Args:
        up_dct: Up-node dictionary representing the pedigree
        
    Returns:
        con_pt_set: Set of connection point tuples (id, parent_id, direction)
    """
    con_pt_set = set()
    
    # Add connection points for each individual
    for id_val in up_dct:
        # Direct connection through the individual
        con_pt_set.add((id_val, None, None))
        
        # Connections through parents
        for parent_id in up_dct.get(id_val, {}):
            # Connect as a child of this parent
            con_pt_set.add((id_val, parent_id, 'up'))
            
            # Connect as a sibling through this parent
            con_pt_set.add((id_val, parent_id, 'down'))
    
    return con_pt_set

# Example usage
def demonstrate_connection_points():
    # Create a simple pedigree
    #   -1 (grandmother)   -2 (grandfather)
    #              \         /
    #               \       /
    #              -3 (mother)   -4 (father)
    #                      \       /
    #                       \     /
    #                      1 (child)
    
    pedigree = {
        -1: {},            # Grandmother
        -2: {},            # Grandfather
        -3: {-1: 1, -2: 1},  # Mother with her parents
        -4: {},            # Father
        1: {-3: 1, -4: 1}   # Child with mother and father
    }
    
    # Find connection points
    con_pts = simplified_get_connection_points(pedigree)
    
    # Display the connection points
    print("Connection points in the example pedigree:")
    for pt in sorted(con_pts, key=lambda x: (x[0], x[1] or 0)):
        id_val, parent_id, direction = pt
        
        # Generate a human-readable description
        if parent_id is None:
            description = f"Direct connection through individual {id_val}"
        elif direction == 'up':
            description = f"Connect as a child of {parent_id} through {id_val}"
        elif direction == 'down':
            description = f"Connect as a sibling of {id_val} through parent {parent_id}"
        else:
            description = f"Unspecified connection through {id_val} and {parent_id}"
        
        print(f"({id_val}, {parent_id}, {direction}): {description}")
    
    # Visualize the pedigree
    visualize_pedigree(pedigree, title="Example Pedigree")

# Helper function to visualize pedigrees
def visualize_pedigree(up_node_dict, title="Pedigree"):
    # Create a directed graph (edges point from child to parent)
    G = nx.DiGraph()
    
    # Add all nodes to the graph
    all_ids = set(up_node_dict.keys())
    for parents in up_node_dict.values():
        all_ids.update(parents.keys())
    
    # Add nodes with colors (blue for genotyped, gray for ungenotyped)
    for node_id in all_ids:
        if node_id > 0:  # Genotyped individuals have positive IDs
            G.add_node(node_id, color='lightblue')
        else:  # Ungenotyped individuals have negative IDs
            G.add_node(node_id, color='lightgray')
    
    # Add edges (from child to parent)
    for child, parents in up_node_dict.items():
        for parent in parents:
            G.add_edge(child, parent)
    
    # Create plot
    plt.figure(figsize=(10, 6))
    plt.title(title)
    
    # Get node colors
    node_colors = [G.nodes[n]['color'] for n in G.nodes]
    
    # Set layout (spring layout with a fixed seed for reproducibility)
    pos = nx.spring_layout(G, seed=42)
    
    # Draw the graph
    nx.draw(G, pos, with_labels=True, node_color=node_colors, 
            node_size=800, font_weight='bold')
    
    plt.tight_layout()
    plt.show()

# Run the demonstration
demonstrate_connection_points()

### 1.2 Restricting Connection Points

One of the key challenges in finding optimal connection points is the vast search space of possible connections. Bonsai v3 addresses this by intelligently restricting the search space using several specialized functions.
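
To see why restriction matters, note that every pairing of one connection point from each pedigree is a candidate join, so the number of candidates grows as the product of the two set sizes. The sketch below counts candidates before and after a restriction. The point sets are invented, and the restriction rule (keep only points anchored on genotyped, positive-ID individuals) is a hypothetical rule chosen only to show the effect.

```python
from itertools import product

# Toy connection point sets of (id, parent_id, direction); values are illustrative
con_pts_a = {(1, None, None), (1, -3, 'up'), (1, -3, 'down'), (-3, None, None)}
con_pts_b = {(2, None, None), (2, -7, 'up'), (2, -7, 'down'), (-7, None, None)}

# Every pairing of one point from each pedigree is a candidate join
all_pairs = list(product(con_pts_a, con_pts_b))

# Hypothetical restriction: keep only points anchored on genotyped (positive) IDs
focused_a = {pt for pt in con_pts_a if pt[0] > 0}
focused_b = {pt for pt in con_pts_b if pt[0] > 0}
restricted_pairs = list(product(focused_a, focused_b))

print(f"All candidate pairs: {len(all_pairs)}")
print(f"Restricted candidate pairs: {len(restricted_pairs)}")
```

Each candidate pair must later be scored against several relationship configurations, so trimming either set reduces the total work multiplicatively.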

Let's look at the `get_restricted_connection_point_sets` function, which allows us to focus on connection points involving specific individuals:

In [None]:
# Import the connection restriction function from Bonsai v3
if not is_jupyterlite():
    try:
        from utils.bonsaitree.bonsaitree.v3.connections import get_restricted_connection_point_sets
        
        # Display the source code of the function
        print("Source code for get_restricted_connection_point_sets:")
        view_source(get_restricted_connection_point_sets)
    except (ImportError, AttributeError) as e:
        print(f"Could not import function: {e}")
else:
    print("Cannot display source code in JupyterLite environment.")

In [None]:
def simplified_restrict_connection_points(up_dct, con_pt_set, focused_ids):
    """
    Restrict connection points to those involving specific individuals.
    
    Args:
        up_dct: Up-node dictionary representing the pedigree
        con_pt_set: Set of all connection points
        focused_ids: List of IDs to focus on
        
    Returns:
        restricted_set: Restricted set of connection points
    """
    # Define helper functions for finding ancestors and descendants
    def get_ancestors(iid, up_dct, ancestors=None):
        if ancestors is None:
            ancestors = set()
        for parent in up_dct.get(iid, {}):
            ancestors.add(parent)
            get_ancestors(parent, up_dct, ancestors)
        return ancestors
    
    def get_descendants(iid, up_dct, descendants=None):
        if descendants is None:
            descendants = set()
        for child, parents in up_dct.items():
            if iid in parents:
                descendants.add(child)
                get_descendants(child, up_dct, descendants)
        return descendants
    
    # Get subtrees for each focused ID
    subtree_ids = set(focused_ids)  # Start with the focused IDs themselves
    
    for iid in focused_ids:
        # Add ancestors and descendants
        subtree_ids.update(get_ancestors(iid, up_dct))
        subtree_ids.update(get_descendants(iid, up_dct))
    
    # Restrict connection points to those in the subtree
    restricted_set = set()
    for pt in con_pt_set:
        id_val, parent_id, direction = pt
        
        # Check if the connection point involves the subtree
        if id_val in subtree_ids and (parent_id is None or parent_id in subtree_ids):
            restricted_set.add(pt)
    
    return restricted_set

# Demonstrate the function
def demonstrate_restricted_connection_points():
    # Create a larger pedigree
    #   -1 (grandmother)   -2 (grandfather)      -5      -6
    #              \         /                    |       |
    #               \       /                     |       |
    #              -3 (mother)   -4 (father)     -7      -8
    #                      \       /               \     /
    #                       \     /                 \   /
    #                      1 (child)                  2
    
    pedigree = {
        -1: {},            # Grandmother
        -2: {},            # Grandfather
        -3: {-1: 1, -2: 1},  # Mother with her parents
        -4: {},            # Father
        1: {-3: 1, -4: 1},   # Child with mother and father
        -5: {},            # Another grandmother
        -6: {},            # Another grandfather
        -7: {-5: 1},       # Another mother
        -8: {-6: 1},       # Another father
        2: {-7: 1, -8: 1}   # Another child
    }
    
    # Find all connection points
    all_con_pts = simplified_get_connection_points(pedigree)
    print(f"Total connection points: {len(all_con_pts)}")
    
    # Restrict to connection points within the subtree of individual 1 (ancestors and descendants)
    restricted_con_pts = simplified_restrict_connection_points(pedigree, all_con_pts, [1])
    print(f"Restricted connection points: {len(restricted_con_pts)}")
    
    # Display some of the restricted connection points
    print("\nSample of restricted connection points:")
    for pt in list(restricted_con_pts)[:5]:  # Show first 5
        print(f"- {pt}")
    
    # Visualize the pedigree
    visualize_pedigree(pedigree, title="Larger Example Pedigree")

# Run the demonstration
demonstrate_restricted_connection_points()

### 1.3 Finding Likely Connection Points

Another important aspect of Bonsai v3's connection point selection is identifying the most likely connection points based on IBD sharing patterns. The `get_likely_con_pt_set` function uses the correlation between IBD sharing and relationship distances to find promising connection points:

In [None]:
# Import the likely connection point function from Bonsai v3
if not is_jupyterlite():
    try:
        from utils.bonsaitree.bonsaitree.v3.connections import get_likely_con_pt_set
        
        # Display the source code of the function
        print("Source code for get_likely_con_pt_set:")
        view_source(get_likely_con_pt_set)
    except (ImportError, AttributeError) as e:
        print(f"Could not import function: {e}")
else:
    print("Cannot display source code in JupyterLite environment.")

In [None]:
def simplified_get_likely_connection_points(up_dct, id_to_shared_ibd, con_pt_set):
    """
    Find likely connection points based on IBD sharing patterns.
    
    Args:
        up_dct: Up-node dictionary representing the pedigree
        id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
        con_pt_set: Set of connection points to evaluate
        
    Returns:
        likely_con_pts: Set of likely connection points
    """
    # Calculate IBD sharing for each pair
    id_pair_to_cm = {}
    for (id1, id2), segments in id_to_shared_ibd.items():
        id_pair_to_cm[(id1, id2)] = sum(seg.get('length_cm', 0) for seg in segments)
    
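    # Calculate relationship distances in the pedigree
    def get_rel_distance(id1, id2, up_dct):
        # Breadth-first search over parent-child edges, treated as undirected.
        # Returns the number of edges on the shortest path, or None if the two
        # individuals are not connected. (Simplification: a path through a
        # shared child also counts, although that is not a genetic relationship.)
        neighbors = defaultdict(set)
        for child, parents in up_dct.items():
            for parent in parents:
                neighbors[child].add(parent)
                neighbors[parent].add(child)
        frontier, seen, dist = {id1}, {id1}, 0
        while frontier:
            if id2 in frontier:
                return dist
            frontier = {n for f in frontier for n in neighbors[f]} - seen
            seen |= frontier
            dist += 1
        return None
    
    # Evaluate each connection point
    con_pt_scores = {}
    for pt in con_pt_set:
        id_val, parent_id, direction = pt
        
        # Find all pairs involving this individual
        ibd_values = []
        distance_values = []
        
        for (id1, id2) in id_pair_to_cm:
            # Check if one of the IDs matches our connection point
            if id1 == id_val or id2 == id_val:
                # Get the other ID
                other_id = id2 if id1 == id_val else id1
                
                # Skip pairs that are not connected within this pedigree
                distance = get_rel_distance(id_val, other_id, up_dct)
                if distance is None:
                    continue
                
                # Record the IBD amount and the relationship distance
                ibd_values.append(id_pair_to_cm[(id1, id2)])
                distance_values.append(distance)
        
        # Calculate correlation if we have enough data points and both series
        # vary (np.corrcoef returns NaN when either series is constant)
        if len(ibd_values) >= 3 and len(set(ibd_values)) > 1 and len(set(distance_values)) > 1:
            # Calculate correlation coefficient
            correlation = np.corrcoef(ibd_values, distance_values)[0, 1]
            con_pt_scores[pt] = correlation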
    
    # Find points with strong negative correlation
    # (negative because closer relationships = more IBD)
    likely_con_pts = set()
    for pt, score in con_pt_scores.items():
        if score < -0.3:  # Strong negative correlation
            likely_con_pts.add(pt)
    
    return likely_con_pts
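
The function above is not exercised elsewhere in this lab, so here is a small, self-contained illustration of the signal it relies on. All numbers are invented: one candidate connection point has distances that track the IBD amounts (more IBD at shorter distance), the other does not. The `pearson` helper is written out by hand so the snippet has no dependencies; `np.corrcoef(xs, ys)[0, 1]` gives the same value.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented data: total shared IBD (cM) between an outside individual and four
# pedigree members, and each member's distance from two candidate points
shared_cm = [1700.0, 850.0, 425.0, 210.0]
dist_from_good_point = [1, 2, 3, 4]  # IBD falls as distance grows
dist_from_poor_point = [3, 4, 1, 2]  # distances do not track IBD

r_good = pearson(shared_cm, dist_from_good_point)
r_poor = pearson(shared_cm, dist_from_poor_point)

threshold = -0.3  # same cutoff as in the simplified function above
print(f"good candidate: r = {r_good:.2f}, likely = {r_good < threshold}")
print(f"poor candidate: r = {r_poor:.2f}, likely = {r_poor < threshold}")
```

Only the first candidate clears the negative-correlation cutoff, which is the behavior the simplified function uses to keep a connection point.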

## Part 2: Evaluating Connection Configurations

Once potential connection points are identified, Bonsai v3 evaluates different configurations for connecting them. This involves determining the optimal relationship types and degrees for connecting the points, which is handled by the `get_connection_degs_and_log_likes` function.
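
In this lab's simplified code, a relationship is written as a triple `(up, down, num_ancs)`: `up` and `down` count the generations from each individual to the common ancestor(s), and `num_ancs` is the number of common ancestors (1 or 2). For non-inbred pedigrees, the expected coefficient of relatedness follows the standard formula $r = \text{num\_ancs} \times (1/2)^{\text{up} + \text{down}}$. The helper name `expected_relatedness` below is illustrative, not a Bonsai v3 function.

```python
def expected_relatedness(up: int, down: int, num_ancs: int) -> float:
    """Expected coefficient of relatedness, assuming no inbreeding."""
    if up + down == 0:
        raise ValueError("up + down must be at least 1 (two distinct individuals)")
    return num_ancs * 0.5 ** (up + down)


examples = {
    "Parent-child": (0, 1, 1),
    "Full siblings": (1, 1, 2),
    "Half siblings": (1, 1, 1),
    "First cousins": (2, 2, 2),
    "Second cousins": (3, 3, 2),
}
for name, (up, down, num_ancs) in examples.items():
    r = expected_relatedness(up, down, num_ancs)
    print(f"{name:15s} (up={up}, down={down}, ancs={num_ancs}): r = {r:.4f}")
```

Different triples can give the same `r` (parent-child and full siblings both give 0.5), which is one reason total shared cM alone cannot always separate relationship types.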

In [None]:
# Import the connection evaluation function from Bonsai v3
if not is_jupyterlite():
    try:
        from utils.bonsaitree.bonsaitree.v3.connections import get_connection_degs_and_log_likes
        
        # Display the source code of the function
        print("Source code for get_connection_degs_and_log_likes:")
        view_source(get_connection_degs_and_log_likes)
    except (ImportError, AttributeError) as e:
        print(f"Could not import function: {e}")
else:
    print("Cannot display source code in JupyterLite environment.")

In [None]:
def simplified_evaluate_connection(id1, id2, id_to_shared_ibd, id_to_info=None):
    """
    Evaluate different relationship configurations for connecting two individuals.
    
    Args:
        id1, id2: IDs of individuals to connect
        id_to_shared_ibd: Dict mapping ID pairs to their IBD segments
        id_to_info: Dict with demographic information for individuals
        
    Returns:
        deg_ll_list: List of (up, down, num_ancs, log_likelihood, description) tuples,
            sorted by log_likelihood in descending order
    """
    id_to_info = id_to_info or {}
    
    # Get IBD segments between id1 and id2
    pair = (min(id1, id2), max(id1, id2))
    segments = id_to_shared_ibd.get(pair, [])
    total_cm = sum(seg.get('length_cm', 0) for seg in segments)
    
    # Get demographic information
    info1 = id_to_info.get(id1, {})
    info2 = id_to_info.get(id2, {})
    age1 = info1.get('age')
    age2 = info2.get('age')
    sex1 = info1.get('sex')
    sex2 = info2.get('sex')
    
    # Define possible relationship configurations
    relationship_configs = [
        # (up, down, num_ancs, description)
        (0, 0, 2, "Identical/Self"),          # Self (identical)
        (0, 1, 1, "Parent-Child (id1→id2)"),  # Parent-child (id1 is parent of id2)
        (1, 0, 1, "Parent-Child (id2→id1)"),  # Parent-child (id2 is parent of id1)
        (1, 1, 2, "Full Siblings"),           # Full siblings
        (1, 1, 1, "Half Siblings"),           # Half siblings
        (2, 2, 2, "First Cousins"),           # First cousins
        (2, 1, 1, "Half-Aunt/Uncle-Niece/Nephew"), # Half-aunt/uncle-niece/nephew
        (1, 2, 1, "Half-Niece/Nephew-Aunt/Uncle"), # Half-niece/nephew-aunt/uncle
        (3, 3, 2, "Second Cousins")           # Second cousins
    ]
    
    # Expected total IBD ranges (in cM) for different relationships.
    # These are rough illustrative values, not empirical distributions.
    ibd_ranges = {
        "Identical/Self": (3400, 3600),
        "Parent-Child (id1→id2)": (1700, 3400),
        "Parent-Child (id2→id1)": (1700, 3400),
        "Full Siblings": (1700, 2800),
        "Half Siblings": (700, 1800),
        "First Cousins": (200, 900),
        "Half-Aunt/Uncle-Niece/Nephew": (400, 1200),
        "Half-Niece/Nephew-Aunt/Uncle": (400, 1200),
        "Second Cousins": (50, 400)
    }
    
    # Simple filter for age compatibility
    def is_age_compatible(rel_desc, age1, age2):
        if age1 is None or age2 is None:
            return True  # Can't check without ages
            
        if "Parent-Child (id1→id2)" in rel_desc:
            return age1 > age2 + 12  # Parent should be older than child
            
        if "Parent-Child (id2→id1)" in rel_desc:
            return age2 > age1 + 12  # Parent should be older than child
            
        if "Siblings" in rel_desc:
            return abs(age1 - age2) < 30  # Siblings typically close in age
            
        return True  # Default to compatible
    
    # Evaluate each relationship configuration
    deg_ll_list = []
    for up, down, num_ancs, description in relationship_configs:
        # Check age compatibility
        if not is_age_compatible(description, age1, age2):
            continue
            
        # No sex filter is applied: a parent and child may be of the same or
        # different sex, so sex alone cannot rule out a parent-child pair
            
        # Check IBD compatibility
        min_cm, max_cm = ibd_ranges.get(description, (0, float('inf')))
        
        # Calculate likelihood based on IBD amount
        if min_cm <= total_cm <= max_cm:
            # Higher likelihood for IBD in middle of range
            range_center = (min_cm + max_cm) / 2
            distance_from_center = abs(total_cm - range_center)
            range_width = (max_cm - min_cm) / 2
            
            # Normalize distance to [0, 1] range
            normalized_distance = min(distance_from_center / range_width, 1.0)
            
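            # Heuristic score standing in for a log-likelihood (higher closer to center)
            log_like = math.log(1 + total_cm) * (1 - normalized_distance)
        else:
            # Outside the expected range: penalize by the distance from the range,
            # so the score is always below any in-range score (which is >= 0)
            gap = min_cm - total_cm if total_cm < min_cm else total_cm - max_cm
            log_like = -math.log(1 + gap)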
        
        deg_ll_list.append((up, down, num_ancs, log_like, description))
    
    # Sort by likelihood (descending)
    deg_ll_list.sort(key=lambda x: x[3], reverse=True)
    
    return deg_ll_list

# Demonstrate the function
def demonstrate_evaluate_connection():
    # Create some IBD sharing data
    id_to_shared_ibd = {
        (1, 2): [{'length_cm': 1800}],  # Parent-child level
        (1, 3): [{'length_cm': 900}],   # Half-sibling level
        (1, 4): [{'length_cm': 400}],   # First cousin level
        (1, 5): [{'length_cm': 200}]    # Second cousin level
    }
    
    # Create some biographical information
    id_to_info = {
        1: {'age': 50, 'sex': 'M'},
        2: {'age': 25, 'sex': 'F'},
        3: {'age': 48, 'sex': 'F'},
        4: {'age': 30, 'sex': 'M'},
        5: {'age': 15, 'sex': 'F'}
    }
    
    # Evaluate different pairs
    pairs = [(1, 2), (1, 3), (1, 4), (1, 5)]
    
    for id1, id2 in pairs:
        print(f"\nEvaluating connection between individuals {id1} and {id2}:")
        
        # Get IBD amount
        pair = (min(id1, id2), max(id1, id2))
        total_cm = sum(seg.get('length_cm', 0) for seg in id_to_shared_ibd.get(pair, []))
        print(f"Total shared IBD: {total_cm} cM")
        
        # Evaluate connection configurations
        results = simplified_evaluate_connection(id1, id2, id_to_shared_ibd, id_to_info)
        
        # Display top results
        print("Top 3 relationship configurations:")
        for i, (up, down, num_ancs, log_like, description) in enumerate(results[:3]):
            print(f"{i+1}. {description} (up={up}, down={down}, ancs={num_ancs}): {log_like:.2f}")

# Run the demonstration
demonstrate_evaluate_connection()
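
Log-likelihoods are easier to compare once converted to relative weights. The sketch below assumes the candidate list is exhaustive and that all candidates have equal prior probability (both are assumptions), and uses the log-sum-exp trick so that very negative values do not underflow. The input values are hypothetical, and the scores produced by this lab's simplified function are heuristic rather than true log-likelihoods, so treat this as an illustration of the normalization step only.

```python
import math


def normalize_log_likes(log_likes):
    """Convert log-likelihoods to weights that sum to 1 (log-sum-exp trick)."""
    max_ll = max(log_likes)
    # Subtract the maximum before exponentiating to avoid overflow/underflow
    exps = [math.exp(ll - max_ll) for ll in log_likes]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical log-likelihoods for three candidate relationships
candidates = {
    "Half Siblings": -10.2,
    "First Cousins": -12.5,
    "Second Cousins": -20.0,
}
weights = normalize_log_likes(list(candidates.values()))
for name, weight in zip(candidates, weights):
    print(f"{name}: {weight:.3f}")
```

A difference of a few log units already concentrates most of the weight on the best candidate, so closely spaced log-likelihoods are the cases worth a second look.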

## Part 3: Physically Merging Pedigrees

Once optimal connection points and relationship configurations are identified, Bonsai v3 uses this information to physically merge pedigrees. This is implemented in the `connect_pedigrees_through_points` function:

In [None]:
# Import the pedigree connection function from Bonsai v3
if not is_jupyterlite():
    try:
        from utils.bonsaitree.bonsaitree.v3.connections import connect_pedigrees_through_points
        
        # Display the source code of the function
        print("Source code for connect_pedigrees_through_points:")
        view_source(connect_pedigrees_through_points)
    except (ImportError, AttributeError) as e:
        print(f"Could not import function: {e}")
else:
    print("Cannot display source code in JupyterLite environment.")

In [None]:
def simplified_connect_pedigrees(id1, id2, up_dct1, up_dct2, up, down, num_ancs):
    """
    Connect two pedigrees through specified individuals with a given relationship.
    
    Args:
        id1: ID in the first pedigree
        id2: ID in the second pedigree
        up_dct1, up_dct2: The pedigrees to connect
        up, down, num_ancs: Relationship parameters
        
    Returns:
        merged: The merged pedigree
    """
    # Create copies to avoid modifying originals
    up_dct1 = copy.deepcopy(up_dct1)
    up_dct2 = copy.deepcopy(up_dct2)
    
    # Ensure individuals exist in their pedigrees
    if id1 not in up_dct1:
        up_dct1[id1] = {}
    if id2 not in up_dct2:
        up_dct2[id2] = {}
    
    # Handle different relationship types
    if up == 0 and down == 0:  # Same individual
        # This is a special case where id1 and id2 are the same individual
        return None  # Not implemented in simplified version
        
    elif up == 0 and down == 1:  # id1 is parent of id2
        # Add id1 as parent of id2
        up_dct2[id2][id1] = 1
        
    elif up == 1 and down == 0:  # id2 is parent of id1
        # Add id2 as parent of id1
        up_dct1[id1][id2] = 1
        
    elif up == 1 and down == 1:  # Siblings
        # Create common ancestors
        if num_ancs == 1:  # Half siblings (one common ancestor)
            common_anc = -1000  # Use a very negative ID to avoid conflicts
            up_dct1[id1][common_anc] = 1
            up_dct2[id2][common_anc] = 1
            up_dct1[common_anc] = {}
            
        elif num_ancs == 2:  # Full siblings (two common ancestors)
            common_anc1 = -1000
            common_anc2 = -1001
            up_dct1[id1][common_anc1] = 1
            up_dct1[id1][common_anc2] = 1
            up_dct2[id2][common_anc1] = 1
            up_dct2[id2][common_anc2] = 1
            up_dct1[common_anc1] = {}
            up_dct1[common_anc2] = {}
            
    else:  # More distant relationships
        # Create chains of intermediate ancestors. The common ancestors are
        # added separately below, so each chain needs one fewer node than the
        # number of generations (up or down) to the common ancestors.
        curr_id1 = id1
        for i in range(up - 1):
            anc_id = -1000 - i
            if curr_id1 not in up_dct1:
                up_dct1[curr_id1] = {}
            up_dct1[curr_id1][anc_id] = 1
            up_dct1[anc_id] = {}
            curr_id1 = anc_id
            
        curr_id2 = id2
        for i in range(down - 1):
            anc_id = -2000 - i
            if curr_id2 not in up_dct2:
                up_dct2[curr_id2] = {}
            up_dct2[curr_id2][anc_id] = 1
            up_dct2[anc_id] = {}
            curr_id2 = anc_id
            
        # Connect the chains at the top through common ancestors
        if num_ancs >= 1:
            common_anc1 = -3000
            up_dct1[curr_id1][common_anc1] = 1
            up_dct2[curr_id2][common_anc1] = 1
            up_dct1[common_anc1] = {}
            
        if num_ancs >= 2:
            common_anc2 = -3001
            up_dct1[curr_id1][common_anc2] = 1
            up_dct2[curr_id2][common_anc2] = 1
            up_dct1[common_anc2] = {}
    
    # Merge the pedigrees
    merged = copy.deepcopy(up_dct1)
    
    # Add all nodes from up_dct2
    for node, parents in up_dct2.items():
        if node not in merged:
            merged[node] = parents.copy()
        else:
            # Merge parents
            for parent, deg in parents.items():
                merged[node][parent] = deg
    
    return merged

# Demonstrate the function
def demonstrate_connect_pedigrees():
    # Create two simple pedigrees
    # Pedigree 1: Individual 1
    ped1 = {1: {}}
    
    # Pedigree 2: Individual 2
    ped2 = {2: {}}
    
    # Connect as parent-child (1 is parent of 2)
    merged1 = simplified_connect_pedigrees(1, 2, ped1, ped2, 0, 1, 1)
    
    print("Connecting as parent-child (1 is parent of 2):")
    print(f"Merged pedigree: {merged1}")
    visualize_pedigree(merged1, title="1 as Parent of 2")
    
    # Connect as siblings
    merged2 = simplified_connect_pedigrees(1, 2, ped1, ped2, 1, 1, 2)
    
    print("\nConnecting as full siblings:")
    print(f"Merged pedigree: {merged2}")
    visualize_pedigree(merged2, title="1 and 2 as Full Siblings")
    
    # Connect as first cousins
    merged3 = simplified_connect_pedigrees(1, 2, ped1, ped2, 2, 2, 2)
    
    print("\nConnecting as first cousins:")
    print(f"Merged pedigree: {merged3}")
    visualize_pedigree(merged3, title="1 and 2 as First Cousins")

# Run the demonstration
demonstrate_connect_pedigrees()
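
After a merge it is worth checking that the result is still a valid pedigree. The sketch below is an illustrative sanity check on an up-node dictionary: it reports anyone with more than two parents and detects cycles (an individual who is their own ancestor). The function name `check_pedigree` is an assumption for this lab, not part of Bonsai v3.

```python
def check_pedigree(up_dct):
    """Return a list of problems found in an up-node dictionary."""
    problems = []

    # Nobody can have more than two biological parents
    for child, parents in up_dct.items():
        if len(parents) > 2:
            problems.append(f"{child} has {len(parents)} parents")

    # Depth-first search up the parent links; meeting a node that is still on
    # the current path means someone is their own ancestor
    done = set()

    def has_cycle(node, path):
        if node in path:
            return True
        if node in done:
            return False
        path.add(node)
        found = any(has_cycle(parent, path) for parent in up_dct.get(node, {}))
        path.discard(node)
        done.add(node)
        return found

    if any(has_cycle(node, set()) for node in up_dct):
        problems.append("pedigree contains a cycle")
    return problems


good = {1: {-1: 1, -2: 1}, 2: {-1: 1, -2: 1}, -1: {}, -2: {}}
cyclic = {1: {2: 1}, 2: {1: 1}}
crowded = {1: {-1: 1, -2: 1, -3: 1}}

print(check_pedigree(good))
print(check_pedigree(cyclic))
print(check_pedigree(crowded))
```

Running a check like this on the output of `simplified_connect_pedigrees` can catch cases where a merge adds new parents to an individual who already has two.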

## Part 4: Putting It All Together - Merging Pedigrees with Optimal Connection Points

Now let's put all the pieces together to demonstrate how Bonsai v3 finds and uses optimal connection points to merge pedigrees:

In [None]:
def demonstrate_optimal_pedigree_merging():
    # Create two pedigrees
    # Pedigree 1: A small family with a parent and child
    ped1 = {
        -1: {},          # Ungenotyped parent
        1: {-1: 1}       # Genotyped child
    }
    
    # Pedigree 2: Another small family with a parent and child
    ped2 = {
        -2: {},          # Ungenotyped parent
        2: {-2: 1}       # Genotyped child
    }
    
    # Visualize the original pedigrees
    print("Original pedigrees:")
    visualize_pedigree(ped1, title="Pedigree 1")
    visualize_pedigree(ped2, title="Pedigree 2")
    
    # Create IBD sharing data - individuals 1 and 2 share DNA (half-sibling level)
    id_to_shared_ibd = {(1, 2): [{'length_cm': 900}]}
    
    # Demographic information
    id_to_info = {
        1: {'age': 30, 'sex': 'M'},
        2: {'age': 28, 'sex': 'F'}
    }
    
    # Step 1: Find connection points in each pedigree
    con_pts1 = simplified_get_connection_points(ped1)
    con_pts2 = simplified_get_connection_points(ped2)
    
    print(f"\nConnection points in pedigree 1: {len(con_pts1)}")
    print(f"Connection points in pedigree 2: {len(con_pts2)}")
    
    # Step 2: Evaluate relationship configurations
    relationship_results = simplified_evaluate_connection(1, 2, id_to_shared_ibd, id_to_info)
    
    print("\nTop relationship configurations:")
    for i, (up, down, num_ancs, log_like, description) in enumerate(relationship_results[:3]):
        print(f"{i+1}. {description} (up={up}, down={down}, ancs={num_ancs}): {log_like:.2f}")
    
    # Step 3: Connect pedigrees using the top relationship configuration
    best_rel = relationship_results[0]
    up, down, num_ancs, _, description = best_rel
    
    merged = simplified_connect_pedigrees(1, 2, ped1, ped2, up, down, num_ancs)
    
    print(f"\nMerged pedigree using {description} relationship:")
    visualize_pedigree(merged, title=f"Merged Pedigree ({description})")
    
    # Try a different relationship configuration
    second_best_rel = relationship_results[1]
    up2, down2, num_ancs2, _, description2 = second_best_rel
    
    merged2 = simplified_connect_pedigrees(1, 2, ped1, ped2, up2, down2, num_ancs2)
    
    print(f"\nAlternative merged pedigree using {description2} relationship:")
    visualize_pedigree(merged2, title=f"Merged Pedigree ({description2})")

# Run the demonstration
demonstrate_optimal_pedigree_merging()
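
Raw log-likelihoods like those printed above can be hard to compare by eye. One common way to read them is to normalize them into relative weights, as in the sketch below. The tuples follow the `(up, down, num_ancs, log_like, description)` layout used above, but the values are made up for illustration, and `relative_weights` is a hypothetical helper, not a Bonsai v3 function. The normalization also assumes every configuration is equally plausible before seeing the data.

```python
import math

# Hypothetical results; the log-likelihood values are invented for illustration.
results = [
    (1, 1, 1, -6.3, "half siblings"),
    (2, 2, 2, -4.1, "first cousins"),
    (1, 1, 2, -15.0, "full siblings"),
]


def relative_weights(results):
    """Normalize log-likelihoods into weights that sum to 1.

    Subtracting the maximum before exponentiating avoids underflow
    when log-likelihoods are large and negative.
    """
    log_likes = [r[3] for r in results]
    max_ll = max(log_likes)
    unnormalized = [math.exp(ll - max_ll) for ll in log_likes]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Rank from most to least likely, then attach a weight to each configuration.
ranked = sorted(results, key=lambda r: r[3], reverse=True)
weights = relative_weights(ranked)
for (up, down, num_ancs, log_like, description), weight in zip(ranked, weights):
    print(f"{description:14s} log-like={log_like:7.2f} weight={weight:.3f}")
```

A top weight close to 1 suggests the alternatives are poorly supported, while several comparable weights are a reason to keep more than one merged pedigree around.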

## Part 5: The Real Bonsai v3 Implementation

Our simplified examples have illustrated the core concepts of how Bonsai v3 finds and uses optimal connection points to merge pedigrees. However, the real implementation includes many sophisticated features that make it more accurate and robust. Let's try using the real Bonsai v3 implementation to merge pedigrees:

In [None]:
if not is_jupyterlite():
    try:
        # Import necessary functions from Bonsai v3
        from utils.bonsaitree.bonsaitree.v3.connections import combine_pedigrees
        from utils.bonsaitree.bonsaitree.v3.pw_log_like import PwLogLike
        
        # Create sample pedigrees
        ped1 = {
            -1: {},          # Ungenotyped parent
            1: {-1: 1}       # Genotyped child
        }
        
        ped2 = {
            -2: {},          # Ungenotyped parent
            2: {-2: 1}       # Genotyped child
        }
        
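        # Create IBD sharing data (a single toy segment: real segments are spread
        # across chromosomes, and no single chromosome is 900 cM long)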
        id_to_shared_ibd = {(1, 2): [{
            'id1': 1,
            'id2': 2,
            'chrom': 1,
            'start_cm': 0,
            'end_cm': 900,
            'length_cm': 900
        }]}
        
        # Create biographical information
        id_to_info = {
            1: {'id': 1, 'age': 30, 'sex': 'M'},
            2: {'id': 2, 'age': 28, 'sex': 'F'}
        }
        
        # Prepare the inputs for PwLogLike
        unphased_ibd_seg_list = [id_to_shared_ibd[(1, 2)][0]]
        bio_info = [id_to_info[1], id_to_info[2]]
        
        try:
            # Create a PwLogLike instance
            pw_ll = PwLogLike(bio_info=bio_info, unphased_ibd_seg_list=unphased_ibd_seg_list)
            
            # Combine the pedigrees
            combined_results = combine_pedigrees(
                up_dct1=ped1,
                up_dct2=ped2,
                id_to_shared_ibd=id_to_shared_ibd,
                id_to_info=id_to_info,
                pw_ll=pw_ll,
                max_up=3,
                keep_num=3,
                return_many=True
            )
            
            print("Successfully used real Bonsai v3 combine_pedigrees function!")
            print(f"Number of combined pedigrees: {len(combined_results)}")
            
            # Visualize the top combined pedigree
            if combined_results:
                best_ped, likelihood = combined_results[0]
                print(f"Best pedigree has likelihood {likelihood:.2f}")
                visualize_pedigree(best_ped, title="Bonsai v3 Combined Pedigree")
                
                # Display other top results
                for i, (ped, ll) in enumerate(combined_results[1:3], 1):
                    print(f"\nAlternative pedigree {i} (likelihood: {ll:.2f}):")
                    visualize_pedigree(ped, title=f"Alternative {i} (LL={ll:.2f})")
        except Exception as e:
            print(f"Error using Bonsai v3 functions: {e}")
            print("Refer to the simplified implementation above instead")
            
    except ImportError as e:
        print(f"Could not import Bonsai v3 functions: {e}")
        print("Refer to the simplified implementation above instead")
else:
    print("Running in JupyterLite environment - skipping real Bonsai v3 implementation")
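
Whichever implementation produces a merged pedigree, it helps to confirm that the two genotyped individuals really are joined through shared ancestors. The sketch below does this for an up-node dictionary of the form used throughout this lab. `get_ancestors` and `common_ancestors` are hypothetical helpers written for this check, and the `merged` dictionary is hand-built for illustration rather than taken from Bonsai output.

```python
def get_ancestors(up_dct, node):
    """Return every ancestor of `node` in an up-node dictionary."""
    ancestors = set()
    stack = list(up_dct.get(node, {}))
    while stack:
        parent = stack.pop()
        if parent not in ancestors:
            ancestors.add(parent)
            stack.extend(up_dct.get(parent, {}))
    return ancestors


def common_ancestors(up_dct, id1, id2):
    """Ancestors shared by id1 and id2; empty if they are not connected this way."""
    return get_ancestors(up_dct, id1) & get_ancestors(up_dct, id2)


# Hand-built example: 1 and 2 joined as first cousins.
# Negative IDs are ungenotyped, following the convention used above.
merged = {
    1: {-1: 1},
    2: {-2: 1},
    -1: {-3: 1, -4: 1},
    -2: {-3: 1, -4: 1},
    -3: {},
    -4: {},
}

shared = common_ancestors(merged, 1, 2)
print(f"Common ancestors of 1 and 2: {sorted(shared)}")
print(f"Connected through shared ancestors: {bool(shared)}")
```

This check only covers pairs related through common ancestors. A direct ancestor and descendant pair would need an extra test for whether one ID appears in the other's ancestor set.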

## Summary

In this lab, we explored how Bonsai v3 finds and uses optimal connection points to merge separate pedigrees into larger structures. Key takeaways include:

1. **Connection Points**: Positions within pedigrees where they can be connected to other pedigrees, represented as tuples of (id, parent_id, direction).

2. **Restricting Connection Points**: Bonsai v3 intelligently restricts the search space by focusing on connection points involving individuals who share IBD.

3. **Evaluating Connections**: Different relationship configurations are evaluated based on IBD sharing patterns, demographic constraints, and other factors.

4. **Physical Merging**: Once optimal connection points and relationship configurations are identified, pedigrees are physically merged through these points.

5. **Multiple Hypotheses**: Bonsai v3 maintains multiple alternative pedigree configurations, ranked by their likelihood.
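
The restriction described in takeaway 2 can be sketched in a few lines. Given two pedigrees and the `id_to_shared_ibd` dictionary, the code below collects the individuals in each pedigree that share IBD with someone in the other one, which is the set a search over connection points would focus on. `ids_sharing_ibd` is a hypothetical helper written for this summary, not the function Bonsai v3 uses.

```python
def all_nodes(up_dct):
    """Every ID in an up-node dictionary, whether it appears as a child or a parent."""
    nodes = set(up_dct)
    for parents in up_dct.values():
        nodes.update(parents)
    return nodes


def ids_sharing_ibd(up_dct1, up_dct2, id_to_shared_ibd):
    """Return (ids in pedigree 1, ids in pedigree 2) that share IBD across pedigrees."""
    nodes1, nodes2 = all_nodes(up_dct1), all_nodes(up_dct2)
    linked1, linked2 = set(), set()
    for (a, b), segments in id_to_shared_ibd.items():
        if not segments:
            continue  # No segments recorded for this pair
        if a in nodes1 and b in nodes2:
            linked1.add(a)
            linked2.add(b)
        elif a in nodes2 and b in nodes1:
            linked1.add(b)
            linked2.add(a)
    return linked1, linked2


# Toy data: 1 and 3 are siblings in pedigree 1; only 1 shares IBD with pedigree 2.
ped1 = {-1: {}, 1: {-1: 1}, 3: {-1: 1}}
ped2 = {-2: {}, 2: {-2: 1}}
id_to_shared_ibd = {
    (1, 2): [{'length_cm': 900}],
    (1, 3): [{'length_cm': 2500}],  # Within pedigree 1, so ignored here
}

linked1, linked2 = ids_sharing_ibd(ped1, ped2, id_to_shared_ibd)
print(f"Pedigree 1 IDs sharing IBD with pedigree 2: {sorted(linked1)}")
print(f"Pedigree 2 IDs sharing IBD with pedigree 1: {sorted(linked2)}")
```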

This process is essential for reconstructing large family networks from genetic data, as it determines how smaller family units connect to form larger genealogies.

The concepts and techniques explored in this lab build on the foundation established in Lab 15 (Combine Up Dicts) and are essential for understanding how Bonsai v3 scales to larger and more complex pedigrees.

In [None]:
# Convert this notebook to PDF using poetry
!poetry run jupyter nbconvert --to pdf Lab16_Merging_Pedigrees.ipynb

# Note: PDF conversion requires LaTeX to be installed on your system
# If you encounter errors, you may need to install it:
# On Ubuntu/Debian: sudo apt-get install texlive-xetex
# On macOS with Homebrew: brew install texlive