# Lab 13: Connecting Individuals into Small Structures

## Overview

In this lab, we'll explore how Bonsai v3 connects individuals into small pedigree structures. Building on our understanding of connection points and relationship assessment, we'll examine the algorithms and techniques used to construct coherent family units from genetic data.

In [ ]:
# Standard imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown
import inspect
import importlib
import copy

sys.path.append(os.path.dirname(os.getcwd()))

# Cross-compatibility setup
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite, save_results, save_plot

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [ ]:
# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [ ]:
# Helper functions for exploring modules
def display_module_classes(module_name):
    """Display classes and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all classes
        classes = inspect.getmembers(module, inspect.isclass)
        
        # Filter classes defined in this module (not imported)
        classes = [(name, cls) for name, cls in classes if cls.__module__ == module_name]
        
        # Print info for each class
        for name, cls in classes:
            print(f"\n## {name}")
            
            # Get docstring
            doc = inspect.getdoc(cls)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
            
            # Get methods
            methods = inspect.getmembers(cls, inspect.isfunction)
            if methods:
                print("\nMethods:")
                for method_name, method in methods:
                    if not method_name.startswith('_'):  # Skip private methods
                        print(f"- {method_name}")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def display_module_functions(module_name):
    """Display functions and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all functions
        functions = inspect.getmembers(module, inspect.isfunction)
        
        # Filter functions defined in this module (not imported)
        functions = [(name, func) for name, func in functions if func.__module__ == module_name]
        
        # Print info for each function
        for name, func in functions:
            if name.startswith('_'):  # Skip private functions
                continue
                
            print(f"\n## {name}")
            
            # Get signature
            sig = inspect.signature(func)
            print(f"Signature: {name}{sig}")
            
            # Get docstring
            doc = inspect.getdoc(func)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def view_source(obj):
    """Display the source code of an object (function or class)"""
    try:
        source = inspect.getsource(obj)
        display(Markdown(f"```python\n{source}\n```"))
    except Exception as e:
        print(f"Error retrieving source: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [ ]:
try:
    from utils.bonsaitree.bonsaitree import v3
    print("✅ Successfully imported Bonsai v3 module")
except ImportError as e:
    print(f"❌ Failed to import Bonsai v3 module: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")
    print("Make sure you've properly set up your environment with the Bonsai repository.")

## Lab 13: Connecting Individuals into Small Structures

In this lab, we'll explore how Bonsai v3 takes individuals who share genetic data (IBD segments) and connects them into small, coherent pedigree structures. This process involves:

1. Identifying the most likely relationships between pairs of individuals
2. Finding optimal ways to connect multiple individuals into a single pedigree
3. Building and evaluating small pedigree structures

We'll focus on the core algorithms and techniques used in Bonsai v3 for constructing these small pedigree structures, a critical step in the larger process of pedigree reconstruction.

## Part 1: Building Basic Pedigree Structures

Let's begin by exploring how Bonsai v3 builds basic pedigree structures. We'll look at the core functions used to create and manipulate these structures, focusing on the building blocks of small pedigrees.
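
The helpers in this lab all operate on an up-node dictionary, in which each individual maps to a dictionary of its parents. The sketch below builds one by hand and reads a few basic facts from it. The names `example_up_dict`, `all_individuals`, and `founders` are illustrative and are not Bonsai functions. The conventions that a value of 1 marks a direct parent link and that negative IDs stand for ungenotyped individuals are assumptions taken from the simplified helpers later in this lab.

```python
# Hand-built up-node dictionary: child -> {parent: degree}.
# Assumption: positive IDs are genotyped, negative IDs are ungenotyped placeholders.
example_up_dict = {
    1: {2: 1, -1: 1},   # individual 1 has parents 2 and -1
    2: {},              # individual 2 has no recorded parents
    -1: {},             # ungenotyped parent with no recorded parents
}

def all_individuals(up_dict):
    """Return every ID that appears as a key or as a parent."""
    ids = set(up_dict)
    for parents in up_dict.values():
        ids.update(parents)
    return ids

def founders(up_dict):
    """Return the IDs that have no recorded parents."""
    return {i for i in all_individuals(up_dict) if not up_dict.get(i)}

print(sorted(all_individuals(example_up_dict)))  # [-1, 1, 2]
print(sorted(founders(example_up_dict)))         # [-1, 2]
```

Reading the structure this way makes it easier to follow what the Bonsai functions below add or change.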

In [ ]:
# Import key pedigree functions for examination (the Bonsai codebase is only available outside JupyterLite)
if not is_jupyterlite():
    from utils.bonsaitree.bonsaitree.v3.pedigrees import (
        get_possible_connection_point_set as actual_get_possible_connection_point_set,
        get_partner_id_set as actual_get_partner_id_set,
        add_parent as actual_add_parent,
        get_min_id as actual_get_min_id,
        reverse_node_dict as actual_reverse_node_dict,
        fill_in_partners as actual_fill_in_partners
    )

    from utils.bonsaitree.bonsaitree.v3.connections import (
        connect_pedigrees_through_points as actual_connect_pedigrees_through_points,
        extend_up as actual_extend_up
    )

# Examine the source code of key functions if not in JupyterLite
if not is_jupyterlite():
    print("Source code for key Bonsai v3 functions:")
    
    print("\n1. get_possible_connection_point_set:")
    view_source(actual_get_possible_connection_point_set)
    
    print("\n2. add_parent:")
    view_source(actual_add_parent)
    
    print("\n3. get_partner_id_set:")
    view_source(actual_get_partner_id_set)
    
    print("\n4. connect_pedigrees_through_points:")
    view_source(actual_connect_pedigrees_through_points)
    
    print("\n5. extend_up:")
    view_source(actual_extend_up)

# Now import the functions for actual use (JupyterLite relies on the simplified versions defined in the next cell)
if not is_jupyterlite():
    from utils.bonsaitree.bonsaitree.v3.pedigrees import (
        get_possible_connection_point_set,
        get_partner_id_set,
        add_parent,
        get_min_id,
        reverse_node_dict
    )

# For visualization
def visualize_pedigree(up_node_dict, title="Pedigree", highlight_nodes=None, individual_metadata=None):
    """Visualize a pedigree from an up_node_dict using networkx.
    
    Args:
        up_node_dict: Dictionary mapping individuals to their parents
        title: Title for the visualization
        highlight_nodes: Set of nodes to highlight
        individual_metadata: Dictionary mapping individuals to their metadata (age, sex, etc.)
    """
    # Create a directed graph (edges point from child to parent)
    G = nx.DiGraph()
    
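    # Add all nodes to the graph (combine all IDs from keys and values)
    all_ids = set(up_node_dict.keys())
    for parents in up_node_dict.values():
        all_ids.update(parents.keys())
    # Register nodes up front so isolated individuals are drawn and the
    # color list built below follows the graph's node order
    G.add_nodes_from(all_ids)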
    
    # Create node labels
    node_labels = {}
    for node_id in all_ids:
        label = str(node_id)
        if individual_metadata and node_id in individual_metadata:
            metadata = individual_metadata[node_id]
            if 'sex' in metadata and metadata['sex']:
                label += f" ({metadata['sex']})"
            if 'age' in metadata and metadata['age'] is not None:
                label += f"\nAge: {metadata['age']}"
        node_labels[node_id] = label
    
    # Create a color map - blue for males, pink for females, gray for unknown
    highlight_nodes = highlight_nodes or set()
    color_map = []
    for node_id in all_ids:
        if node_id in highlight_nodes:
            color_map.append('red')
        elif individual_metadata and node_id in individual_metadata and 'sex' in individual_metadata[node_id]:
            if individual_metadata[node_id]['sex'] == 'M':
                color_map.append('lightblue')
            elif individual_metadata[node_id]['sex'] == 'F':
                color_map.append('pink')
            else:
                color_map.append('lightgray')
        else:
            color_map.append('lightgray')
    
    # Add edges (from child to parent)
    edges = []
    for child, parents in up_node_dict.items():
        for parent in parents:
            edges.append((child, parent))
    
    G.add_edges_from(edges)
    
    # Create plot
    plt.figure(figsize=(10, 6))
    plt.title(title)
    
    # Layout: spring layout spreads nodes out but does not place parents above children
    pos = nx.spring_layout(G, seed=42)  # For reproducibility, use a fixed seed
    
    # Draw nodes
    nx.draw(G, pos, with_labels=True, labels=node_labels, node_color=color_map, 
            node_size=800, font_weight='bold')
    
    # Draw edges
    nx.draw_networkx_edges(G, pos, width=1.0, alpha=0.5, arrows=True)
    
    plt.tight_layout()
    plt.show()

In [ ]:
# For JupyterLite compatibility, provide simplified implementations
if is_jupyterlite():
    print("Using simplified implementations for JupyterLite compatibility. These are not the actual Bonsai v3 functions.")
    
    def reverse_node_dict(dct):
        """Reverse a node dict. If it's a down dict make it an up dict and vice versa.
        
        This is a simplified implementation for JupyterLite compatibility.
        """
        rev_dct = {}
        for i, info in dct.items():
            for a, d in info.items():
                if a not in rev_dct:
                    rev_dct[a] = {}
                rev_dct[a][i] = d
        return rev_dct
    
    def get_min_id(dct):
        """Get the minimal ID in a node dict.
        
        This is a simplified implementation for JupyterLite compatibility.
        """
        all_ids = set(dct.keys())
        for parents in dct.values():
            all_ids.update(parents.keys())
        min_id = min(all_ids) if all_ids else 0
        return min(-1, min_id)  # ensure ID is negative
    
    def add_parent(node, up_dct, min_id=None):
        """Add an ungenotyped parent to node in up_dct.
        
        This is a simplified implementation for JupyterLite compatibility.
        """
        import copy
        up_dct = copy.deepcopy(up_dct)
        
        if node not in up_dct:
            raise ValueError(f"Node {node} is not in up dct.")
            
        pid_dict = up_dct[node]
        if len(pid_dict) >= 2:
            return up_dct, None
            
        if min_id is None:
            min_id = get_min_id(up_dct)
            
        new_pid = min_id - 1
        up_dct[node][new_pid] = 1
        up_dct[new_pid] = {}
        
        return up_dct, new_pid
    
    def get_partner_id_set(node, up_dct):
        """Find the set of partners of node in pedigree up_dct.
        
        This is a simplified implementation for JupyterLite compatibility.
        """
        down_dct = reverse_node_dict(up_dct)
        child_id_set = {c for c, d in down_dct.get(node, {}).items() if d == 1}
        partner_id_set = set()
        for cid in child_id_set:
            pids = {p for p, d in up_dct.get(cid, {}).items() if d == 1}
            partner_id_set |= pids
        partner_id_set -= {node}
        return partner_id_set
    
    def get_possible_connection_point_set(ped):
        """Find all possible points through which a pedigree can be connected to another pedigree.
        
        This is a simplified implementation for JupyterLite compatibility.
        """
        point_set = set()
        all_ids = set(ped.keys())
        for parents in ped.values():
            all_ids.update(parents.keys())
            
        for a in all_ids:
            parent_to_deg = ped.get(a, {})
            if len(parent_to_deg) < 2:
                point_set.add((a, None, 1))  # Can connect upward
                
            partners = get_partner_id_set(a, ped)
            point_set.add((a, None, 0))  # Can connect downward
            for partner in partners:
                if (partner, a, 0) not in point_set:
                    point_set.add((a, partner, 0))
                point_set.add((a, partner, None))
                
            point_set.add((a, None, None))  # Can replace node
            
        return point_set
    
    def fill_in_partners(up_dct):
        """Fill in all missing partners in a pedigree.
        
        This is a simplified implementation for JupyterLite compatibility.
        """
        import copy
        up_dct = copy.deepcopy(up_dct)
        
        # Get all nodes with exactly one parent
        # (iterate over a snapshot, because new partner nodes are added inside the loop)
        for node, parents in list(up_dct.items()):
            if len(parents) == 1:
                parent = list(parents.keys())[0]
                # Check if this parent has a partner who is already a parent of node
                partners = get_partner_id_set(parent, up_dct)
                if not any(p in parents for p in partners):
                    # Add a new partner
                    min_id = get_min_id(up_dct)
                    new_partner = min_id - 1
                    up_dct[node][new_partner] = 1
                    up_dct[new_partner] = {}
        
        return up_dct
    
    def extend_up(iid, deg, num_ancs, up_dct):
        """Extend a lineage up from iid in up node dict up_dct.
        
        This is a simplified implementation for JupyterLite compatibility.
        """
        import copy
        up_dct = copy.deepcopy(up_dct)
        
        if deg == 0:
            return up_dct, None, iid, None
        
        min_id = get_min_id(up_dct)
        new_id = min(-1, min_id-1)
        
        prev_id = None
        part_id = None
        while deg > 0:
            if iid not in up_dct:
                up_dct[iid] = {}
                
            if len(up_dct[iid]) >= 2:
                raise ValueError(f"Cannot add parent to node {iid} - already has 2 parents")
                
            up_dct[iid][new_id] = 1
            
            if deg == 1 and num_ancs == 2:
                part_id = new_id - 1
                up_dct[iid][part_id] = 1
                
            prev_id = iid
            if deg > 1:
                iid = new_id
                new_id -= 1
                
            deg -= 1
            
        return up_dct, prev_id, new_id, part_id
    
    def connect_pedigrees_through_points(id1, id2, pid1, pid2, up_dct1, up_dct2, deg1, deg2, num_ancs, simple=True):
        """Connect up_dct1 to up_dct2 through points.
        
        This is a simplified implementation for JupyterLite compatibility.
        """
        import copy
        
        # Can't connect "on" genotyped nodes
        if deg1 == deg2 == 0 and (id1 > 0 and id2 > 0) and id1 != id2:
            return []
        
        # Can't connect "on" genotyped or non-existent partner nodes
        if deg1 == deg2 == 0 and (pid1 != pid2):
            if pid1 is None or pid2 is None:
                return []
            elif pid1 > 0 and pid2 > 0:
                return []
                
        up_dct1 = copy.deepcopy(up_dct1)
        up_dct2 = copy.deepcopy(up_dct2)
        
        if deg1 > 0:
            up_dct1, _, id1, pid1 = extend_up(id1, deg1, num_ancs, up_dct1)
            
        if deg2 > 0:
            up_dct2, _, id2, pid2 = extend_up(id2, deg2, num_ancs, up_dct2)
            
        # Shift IDs in up_dct2 to avoid conflicts
        min_id = get_min_id(up_dct1) - 1
        
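        # Simple ID mapping implementation
        # (include IDs that appear only as parents, such as ancestors added by extend_up)
        old_to_new = {}
        ids2 = set(up_dct2) | {p for parents in up_dct2.values() for p in parents}
        for node in sorted(ids2, reverse=True):
            if node < 0:  # Only remap ungenotyped nodes
                old_to_new[node] = min_id
                min_id -= 1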
                
        # Apply mapping to up_dct2
        if old_to_new:
            new_up_dct2 = {}
            for node, parents in up_dct2.items():
                new_node = old_to_new.get(node, node)
                new_parents = {}
                for p, d in parents.items():
                    new_p = old_to_new.get(p, p)
                    new_parents[new_p] = d
                new_up_dct2[new_node] = new_parents
            up_dct2 = new_up_dct2
            
            # Update id2 and pid2 if they were remapped
            id2 = old_to_new.get(id2, id2)
            if pid2 is not None:
                pid2 = old_to_new.get(pid2, pid2)
        
        # Connect the pedigrees. This simplified version only implements the
        # simple connection, so simple=False falls back to the same result.
        id_map = {id1: id2}
        if pid1 is not None and pid2 is not None:
            id_map[pid1] = pid2

        # Reverse lookup: node in up_dct2 -> the node in up_dct1 it merges into
        rev_map = {v: k for k, v in id_map.items()}

        combined = copy.deepcopy(up_dct1)

        # Copy every node of up_dct2, renaming merged nodes (and any reference
        # to them as parents) to their counterparts in up_dct1
        for node, parents in up_dct2.items():
            target = rev_map.get(node, node)
            combined.setdefault(target, {})
            for p, d in parents.items():
                combined[target][rev_map.get(p, p)] = d

        # Make sure the merged nodes exist even if they have no parents
        for node1 in id_map:
            combined.setdefault(node1, {})

        return [combined]

### 1.1 Building Basic Family Units

Let's examine how Bonsai v3 builds small family units like parent-child relationships and sibling sets. We'll start by exploring the core functions that handle these operations.

In [ ]:
# Let's look at the source code for add_parent if not in JupyterLite
if not is_jupyterlite():
    print("Source code for add_parent:")
    view_source(add_parent)
else:
    print("Using simplified add_parent implementation for JupyterLite.")

In [ ]:
# Let's demonstrate building a small family unit
def create_parent_child_unit(child_id, parent1_id=None, parent2_id=None):
    """Create a simple parent-child unit.
    
    Args:
        child_id: ID of the child
        parent1_id: ID of first parent (optional, will create if None)
        parent2_id: ID of second parent (optional, will create if None)
        
    Returns:
        pedigree: Up-node dictionary representing the pedigree
    """
    pedigree = {child_id: {}}
    
    # Create first parent if not provided
    if parent1_id is None:
        min_id = get_min_id(pedigree)
        parent1_id = min_id - 1
    
    # Add first parent
    pedigree[child_id][parent1_id] = 1
    if parent1_id not in pedigree:
        pedigree[parent1_id] = {}
    
    # Add second parent if needed
    if parent2_id is not None:
        pedigree[child_id][parent2_id] = 1
        if parent2_id not in pedigree:
            pedigree[parent2_id] = {}
    else:
        # Add an ungenotyped second parent
        pedigree, new_parent_id = add_parent(child_id, pedigree)
        if new_parent_id not in pedigree:
            pedigree[new_parent_id] = {}
    
    return pedigree

# Create a simple parent-child unit with a child (ID 1) and two parents (IDs 2 and 3)
parent_child_unit = create_parent_child_unit(child_id=1, parent1_id=2, parent2_id=3)

# Create metadata for visualization
metadata = {
    1: {"sex": "M", "age": 15},  # Male child, 15 years old
    2: {"sex": "M", "age": 45},  # Male parent, 45 years old
    3: {"sex": "F", "age": 42}   # Female parent, 42 years old
}

# Visualize the parent-child unit
visualize_pedigree(parent_child_unit, "Simple Parent-Child Unit", individual_metadata=metadata)
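
A quick structural check can catch mistakes when units are built by hand. The sketch below uses a hypothetical helper, `check_up_node_dict`, which is not part of Bonsai. It assumes the up-node dictionary format used in this lab, in which every parent also has its own (possibly empty) entry, and reports individuals with more than two parents, self-parent links, and parents without an entry.

```python
def check_up_node_dict(up_dict):
    """Return a list of human-readable problems found in an up-node dict.

    Hypothetical helper for this lab; not a Bonsai function.
    """
    problems = []
    for child, parents in up_dict.items():
        if len(parents) > 2:
            problems.append(f"{child} has {len(parents)} parents")
        if child in parents:
            problems.append(f"{child} is listed as its own parent")
        for parent in parents:
            if parent not in up_dict:
                problems.append(f"parent {parent} of {child} has no entry")
    return problems

good_unit = {1: {2: 1, 3: 1}, 2: {}, 3: {}}
bad_unit = {1: {2: 1, 3: 1, 4: 1}, 2: {}, 3: {}}

print(check_up_node_dict(good_unit))  # []
print(check_up_node_dict(bad_unit))   # ['1 has 3 parents', 'parent 4 of 1 has no entry']
```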

### 1.2 Building Sibling Groups

Now let's create a function to build a sibling group - multiple children who share the same parents:

In [ ]:
def create_sibling_group(child_ids, parent1_id=None, parent2_id=None):
    """Create a sibling group with the specified children and parents.
    
    Args:
        child_ids: List of IDs for the children
        parent1_id: ID of first parent (optional, will create if None)
        parent2_id: ID of second parent (optional, will create if None)
        
    Returns:
        pedigree: Up-node dictionary representing the pedigree
    """
    # Initialize an empty pedigree
    pedigree = {}
    
    # Create entries for each child
    for child_id in child_ids:
        pedigree[child_id] = {}
    
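    # Create parents if not provided
    if parent1_id is None:
        min_id = get_min_id(pedigree)
        parent1_id = min_id - 1
    
    if parent2_id is None:
        # parent1_id is not in the pedigree yet, so step below it as well
        # to avoid giving both parents the same ID
        min_id = min(get_min_id(pedigree), parent1_id)
        parent2_id = min_id - 1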
    
    # Add parents to each child
    for child_id in child_ids:
        pedigree[child_id][parent1_id] = 1
        pedigree[child_id][parent2_id] = 1
    
    # Add parent entries to the pedigree
    pedigree[parent1_id] = {}
    pedigree[parent2_id] = {}
    
    return pedigree

# Create a sibling group with three children (IDs 1, 2, 3) and two parents (IDs 4 and 5)
sibling_group = create_sibling_group(child_ids=[1, 2, 3], parent1_id=4, parent2_id=5)

# Create metadata for visualization
metadata = {
    1: {"sex": "M", "age": 17},  # Male sibling, 17 years old
    2: {"sex": "F", "age": 15},  # Female sibling, 15 years old
    3: {"sex": "M", "age": 12},  # Male sibling, 12 years old
    4: {"sex": "M", "age": 45},  # Male parent, 45 years old
    5: {"sex": "F", "age": 42}   # Female parent, 42 years old
}

# Visualize the sibling group
visualize_pedigree(sibling_group, "Sibling Group with Three Children", individual_metadata=metadata)
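
The reverse operation, recovering sibling groups from an existing pedigree, follows from the same representation: full siblings are the individuals whose parent sets are identical. The sketch below is a minimal illustration with a hypothetical helper, `full_sibling_groups`, which is not a Bonsai function. It assumes an up-node dictionary like the one built above.

```python
from collections import defaultdict

def full_sibling_groups(up_dict):
    """Group individuals that share the same two parents.

    Hypothetical helper for this lab; not a Bonsai function.
    """
    groups = defaultdict(set)
    for child, parents in up_dict.items():
        if len(parents) == 2:
            groups[frozenset(parents)].add(child)
    # Keep only parent pairs with at least two children
    return {pair: kids for pair, kids in groups.items() if len(kids) >= 2}

example = {1: {4: 1, 5: 1}, 2: {4: 1, 5: 1}, 3: {4: 1, 5: 1}, 4: {}, 5: {}}
for pair, kids in full_sibling_groups(example).items():
    print(sorted(pair), sorted(kids))  # [4, 5] [1, 2, 3]
```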

### 1.3 Building More Complex Family Structures

Now let's create a more complex family structure with half-siblings, who share one parent but have different second parents:

In [ ]:
def create_half_sibling_structure(child1_id, child2_id, common_parent_id=None, parent1_id=None, parent2_id=None):
    """Create a half-sibling structure with two children sharing one parent.
    
    Args:
        child1_id: ID of the first child
        child2_id: ID of the second child
        common_parent_id: ID of the parent shared by both children (optional)
        parent1_id: ID of the second parent of the first child (optional)
        parent2_id: ID of the second parent of the second child (optional)
        
    Returns:
        pedigree: Up-node dictionary representing the pedigree
    """
    # Initialize an empty pedigree
    pedigree = {child1_id: {}, child2_id: {}}
    
    # Create common parent if not provided
    if common_parent_id is None:
        min_id = get_min_id(pedigree)
        common_parent_id = min_id - 1
    
    # Add common parent to both children
    pedigree[child1_id][common_parent_id] = 1
    pedigree[child2_id][common_parent_id] = 1
    pedigree[common_parent_id] = {}
    
    # Create and add second parent for first child if needed
    if parent1_id is None:
        min_id = get_min_id(pedigree)
        parent1_id = min_id - 1
    pedigree[child1_id][parent1_id] = 1
    pedigree[parent1_id] = {}
    
    # Create and add second parent for second child if needed
    if parent2_id is None:
        min_id = get_min_id(pedigree)
        parent2_id = min_id - 1
    pedigree[child2_id][parent2_id] = 1
    pedigree[parent2_id] = {}
    
    return pedigree

# Create a half-sibling structure with two children (IDs 1 and 2)
# They share parent ID 5, but have different second parents (IDs 3 and 4)
half_sibs = create_half_sibling_structure(
    child1_id=1, 
    child2_id=2, 
    common_parent_id=5, 
    parent1_id=3, 
    parent2_id=4
)

# Create metadata for visualization
metadata = {
    1: {"sex": "M", "age": 17},  # Male half-sibling, 17 years old
    2: {"sex": "F", "age": 15},  # Female half-sibling, 15 years old
    3: {"sex": "M", "age": 45},  # First child's second parent, 45 years old
    4: {"sex": "M", "age": 40},  # Second child's second parent, 40 years old
    5: {"sex": "F", "age": 42}   # Common parent (mother), 42 years old
}

# Visualize the half-sibling structure
visualize_pedigree(half_sibs, "Half-Sibling Structure", individual_metadata=metadata)
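
The difference between full and half siblings comes down to how many parents two individuals share. The sketch below makes that explicit with a hypothetical helper, `sibling_type`, which is not a Bonsai function. It only inspects recorded parents in an up-node dictionary and does not use any genetic data.

```python
def sibling_type(id_a, id_b, up_dict):
    """Classify two individuals by the number of parents they share.

    Returns "full", "half", or None. Hypothetical helper; not a Bonsai function.
    """
    if id_a == id_b:
        return None
    shared = set(up_dict.get(id_a, {})) & set(up_dict.get(id_b, {}))
    if len(shared) == 2:
        return "full"
    if len(shared) == 1:
        return "half"
    return None

# Same shape as the half-sibling structure above: 1 and 2 share parent 5 only
half_sib_example = {1: {5: 1, 3: 1}, 2: {5: 1, 4: 1}, 3: {}, 4: {}, 5: {}}
print(sibling_type(1, 2, half_sib_example))  # half
print(sibling_type(1, 3, half_sib_example))  # None
```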

## Part 2: Connecting Small Structures

Now that we've learned how to build basic pedigree structures, let's explore how Bonsai connects these small structures to form more complex pedigrees.

### 2.1 Finding Connection Points

In an earlier lab (Lab 11), we explored how Bonsai identifies connection points within pedigrees. These connection points represent positions where different pedigrees can be connected. Let's apply this to our small pedigree structures:

In [ ]:
# Find connection points in our sibling structure
sibling_connection_points = get_possible_connection_point_set(sibling_group)

# Convert to a more readable format
sibling_cp_df = pd.DataFrame([
    {
        "Primary ID": cp[0],
        "Secondary ID": cp[1] if cp[1] is not None else "None",
        "Direction": {
            0: "Down (add child)",
            1: "Up (add parent)",
            None: "Replace/Lateral"
        }.get(cp[2], str(cp[2]))
    }
    for cp in sibling_connection_points
])

# Display connection points
print(f"Found {len(sibling_connection_points)} possible connection points in the sibling group:")
display(sibling_cp_df.sample(min(10, len(sibling_cp_df))))  # Display a sample of up to 10 connection points

# Highlight some potential connection points in the visualization
connection_points_to_highlight = {cp[0] for cp in list(sibling_connection_points)[:3]}
connection_points_to_highlight.update(cp[1] for cp in list(sibling_connection_points)[:3] if cp[1] is not None)

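# The `metadata` variable now describes the half-sibling structure,
# so restate the sibling group's metadata before plotting
sibling_metadata = {
    1: {"sex": "M", "age": 17},  # Male sibling, 17 years old
    2: {"sex": "F", "age": 15},  # Female sibling, 15 years old
    3: {"sex": "M", "age": 12},  # Male sibling, 12 years old
    4: {"sex": "M", "age": 45},  # Male parent, 45 years old
    5: {"sex": "F", "age": 42}   # Female parent, 42 years old
}

visualize_pedigree(
    sibling_group, 
    "Sibling Group with Potential Connection Points Highlighted",
    highlight_nodes=connection_points_to_highlight,
    individual_metadata=sibling_metadata
)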
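
To get a feel for a set of connection points, it can help to tally them by direction. The sketch below assumes each point is a tuple `(primary_id, partner_id, direction)` with direction `0` for down, `1` for up, and `None` for replace/lateral, matching the table built above. The helper `summarize_connection_points` and the hand-written `example_points` are illustrative only and are not output from Bonsai.

```python
from collections import Counter

# Direction codes as used in the table above
DIRECTION_LABELS = {0: "down", 1: "up", None: "replace"}

def summarize_connection_points(points):
    """Count connection points by direction label.

    Hypothetical helper for this lab; not a Bonsai function.
    """
    return Counter(DIRECTION_LABELS.get(direction, "other") for _, _, direction in points)

# Hand-written points for an individual 2 with partner 3 (illustrative only)
example_points = {
    (2, None, 1),
    (2, None, 0),
    (2, 3, 0),
    (2, 3, None),
    (2, None, None),
}

summary = summarize_connection_points(example_points)
for label in ("up", "down", "replace"):
    print(label, summary[label])
```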

### 2.2 Connecting Two Pedigrees

Now let's implement a function to connect two pedigrees based on a specified connection point:

In [ ]:
# First, let's examine how Bonsai v3 actually connects pedigrees
if not is_jupyterlite():
    from utils.bonsaitree.bonsaitree.v3.connections import (
        connect_pedigrees as actual_connect_pedigrees,
        connect_pedigrees_through_points
    )
    print("Source code for connect_pedigrees in Bonsai v3:")
    view_source(actual_connect_pedigrees)
else:
    print("Cannot display actual Bonsai v3 functions in JupyterLite environment.")
    
# For our lab, we'll implement a simplified version of connect_pedigrees
def connect_pedigrees(pedigree1, pedigree2, connection_point):
    """Connect two pedigrees using the specified connection point.
    
    Args:
        pedigree1: First pedigree (up-node dictionary)
        pedigree2: Second pedigree (up-node dictionary)
        connection_point: Tuple (id1, id2, dir) specifying how to connect
            id1: ID in pedigree1
            id2: Optional partner ID in pedigree1 (can be None)
            dir: Direction of connection (0=down, 1=up, None=replace/lateral)
            
    Returns:
        combined_pedigree: The combined pedigree as an up-node dictionary
    """
    import copy
    combined_pedigree = copy.deepcopy(pedigree1)
    pedigree2_copy = copy.deepcopy(pedigree2)
    
    # Extract connection information
    id1, id2, direction = connection_point
    
    # Get lowest ID in both pedigrees to use for new ungenotyped individuals
    all_ids1 = set(pedigree1.keys()).union(*[set(parents.keys()) for parents in pedigree1.values()])
    all_ids2 = set(pedigree2.keys()).union(*[set(parents.keys()) for parents in pedigree2.values()])
    # Ungenotyped individuals use negative IDs, so start at -1 or lower
    min_id = min(-1, min(all_ids1 | all_ids2) - 1)
    
    # Adjust IDs in pedigree2 to avoid conflicts
    id_map = {}
    for old_id in sorted(all_ids2):
        if old_id in all_ids1 and old_id > 0:  # Only remap genotyped IDs
            # Give each conflicting ID its own fresh ID so that distinct
            # individuals are not merged into one node by accident
            id_map[old_id] = min_id
            min_id -= 1
    
    # Apply the ID mapping to pedigree2
    if id_map:
        remapped_pedigree2 = {}
        for node, parents in pedigree2_copy.items():
            new_node = id_map.get(node, node)
            remapped_parents = {id_map.get(p, p): d for p, d in parents.items()}
            remapped_pedigree2[new_node] = remapped_parents
        pedigree2_copy = remapped_pedigree2
    
    # Connect based on direction
    if direction == 0:  # Connect downward (add as child)
        # Create a new individual as child of id1 (and id2 if provided)
        connector_id = min_id
        min_id -= 1
        
        # Add the connector as child of id1 (and id2 if provided)
        combined_pedigree[connector_id] = {id1: 1}
        if id2 is not None:
            combined_pedigree[connector_id][id2] = 1
        
        # Make the connector the parent of all founders in pedigree2
        # Get founders (nodes with no parents) in pedigree2
        founders2 = [node for node, parents in pedigree2_copy.items() if not parents]
        for founder in founders2:
            pedigree2_copy[founder][connector_id] = 1
    
    elif direction == 1:  # Connect upward (add as parent)
        # Get founders in pedigree2
        founders2 = [node for node, parents in pedigree2_copy.items() if not parents]
        
        if len(founders2) == 1:  # If pedigree2 has a single founder, connect directly
            combined_pedigree[id1][founders2[0]] = 1
        else:  # Otherwise, create a connector individual
            connector_id = min_id
            min_id -= 1
            
            # Add the connector as parent of id1
            combined_pedigree[id1][connector_id] = 1
            combined_pedigree[connector_id] = {}
            
            # Make the founders of pedigree2 parents of the connector
            for founder in founders2:
                combined_pedigree[connector_id][founder] = 1
    
    else:  # Replace/lateral connection
        # We'll implement this as replacing id1 with a founder from pedigree2
        founders2 = [node for node, parents in pedigree2_copy.items() if not parents]
        if founders2:  # If there are founders in pedigree2
            replaced_founder = founders2[0]
            
            # Replace all occurrences of id1 with replaced_founder
            for node, parents in combined_pedigree.items():
                if id1 in parents:
                    degree = parents.pop(id1)
                    parents[replaced_founder] = degree
            
            # Handle the connections of id1
            if id1 in combined_pedigree:
                # Transfer the parents of id1 to replaced_founder
                if replaced_founder not in combined_pedigree:
                    combined_pedigree[replaced_founder] = {}
                combined_pedigree[replaced_founder].update(combined_pedigree[id1])
                del combined_pedigree[id1]  # Remove id1 from the pedigree
    
    # Merge the modified pedigree2 into the combined pedigree
    for node, parents in pedigree2_copy.items():
        if node not in combined_pedigree:
            combined_pedigree[node] = parents
        else:
            combined_pedigree[node].update(parents)
    
    return combined_pedigree

In [ ]:
# Let's connect our sibling group and half-sibling structure.
# The half-sibling structure is renumbered to IDs 6-10 (see the renaming step
# later in this cell) so it does not collide with the sibling group (IDs 1-5).
# After renumbering, the common parent of the half-siblings is ID 10.

# The connection point is on the half-sibling structure.
# Connecting downward from ID 10 adds a new connector individual as a child of
# ID 10 and makes it a parent of every founder in the sibling group (IDs 4 and 5).
connection_point = (10, None, 0)  # Connect downward from ID 10

# Create metadata for the connected pedigree
combined_metadata = {
    # Sibling group (with adjusted IDs if needed)
    1: {"sex": "M", "age": 17},  # Male sibling, 17 years old
    2: {"sex": "F", "age": 15},  # Female sibling, 15 years old
    3: {"sex": "M", "age": 12},  # Male sibling, 12 years old
    4: {"sex": "M", "age": 45},  # Male parent, 45 years old
    5: {"sex": "F", "age": 42},  # Female parent, 42 years old
    
    # Half-sibling structure
    6: {"sex": "M", "age": 17},  # First half-sibling
    7: {"sex": "F", "age": 15},  # Second half-sibling
    8: {"sex": "M", "age": 45},  # First child's second parent
    9: {"sex": "M", "age": 40},  # Second child's second parent
    10: {"sex": "F", "age": 70}  # Common parent (mother)
}

# Renumber the half-sibling structure to IDs 6-10.
# Both the individuals and the parent references inside each entry are remapped;
# renaming only the keys would leave parents pointing at the old IDs 3-5,
# which belong to the sibling group.
half_sib_id_map = {1: 6, 2: 7, 3: 8, 4: 9, 5: 10}
half_sibs_renamed = {
    half_sib_id_map.get(node, node): {
        half_sib_id_map.get(p, p): d for p, d in parents.items()
    }
    for node, parents in half_sibs.items()
}

# Connect the pedigrees
combined_pedigree = connect_pedigrees(half_sibs_renamed, sibling_group, connection_point)

# Visualize the combined pedigree
visualize_pedigree(combined_pedigree, "Combined Pedigree", individual_metadata=combined_metadata)
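
Before evaluating a connected pedigree, it can help to sanity-check the result. The sketch below is not part of Bonsai: `check_up_node_dict` is a hypothetical helper that assumes the up-node dictionary format used in this lab (`{child: {parent: 1, ...}}`). It reports individuals listed with more than two parents and whether the parent links contain a cycle.

```python
def check_up_node_dict(pedigree):
    """Return (too_many_parents, has_cycle) for an up-node dictionary.

    Hypothetical helper for this lab, not a Bonsai function.
    """
    # Individuals listed with more than two parents
    too_many_parents = sorted(
        node for node, parents in pedigree.items() if len(parents) > 2
    )

    # Depth-first search along parent links; meeting a node that is still
    # "in progress" means someone is their own ancestor
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(node):
        state[node] = IN_PROGRESS
        for parent in pedigree.get(node, {}):
            parent_state = state.get(parent, UNSEEN)
            if parent_state == IN_PROGRESS:
                return True
            if parent_state == UNSEEN and visit(parent):
                return True
        state[node] = DONE
        return False

    has_cycle = any(
        state.get(node, UNSEEN) == UNSEEN and visit(node)
        for node in list(pedigree)
    )
    return too_many_parents, has_cycle


# Toy example: two siblings (1, 2) with parents 3 and 4
toy = {1: {3: 1, 4: 1}, 2: {3: 1, 4: 1}, 3: {}, 4: {}}
print(check_up_node_dict(toy))

# A broken structure in which 1 and 3 are each other's parent
broken = {1: {3: 1}, 3: {1: 1}}
print(check_up_node_dict(broken))
```

Running the same check on `combined_pedigree` is a quick way to catch a connection point that refers to the wrong individual, since that kind of mistake tends to show up as a cycle or as a node with three parents.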

## Part 3: Evaluating Small Pedigree Structures

Once we've built and connected small pedigree structures, we need to evaluate them to determine how well they explain the observed genetic data. Let's explore how Bonsai evaluates these structures.

### 3.1 Simulating IBD Data for Evaluation

In a real-world scenario, Bonsai would evaluate pedigrees based on observed IBD (Identity by Descent) segments. Let's create a simple function to simulate IBD data for evaluation purposes:

In [ ]:
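import random  # Used by simulate_ibd_segments below

# Use the actual Bonsai v3 function when the codebase is available
if not is_jupyterlite():
    from utils.bonsaitree.bonsaitree.v3.pedigrees import get_simple_rel_tuple
else:
    # Simplified stand-in for JupyterLite, where the import above would fail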
    def get_simple_rel_tuple(up_node_dict, i, j):
        """Get relationship tuple (up, down, num_ancs) between individuals i and j."""
        if i == j:
            return (0, 0, 2)
        
        # Simple implementation for JupyterLite - this would be more complex in reality
        if j in up_node_dict.get(i, {}):
            return (1, 0, 1)  # i is child of j
        elif i in up_node_dict.get(j, {}):
            return (0, 1, 1)  # i is parent of j
        
        # Check for siblings/cousins (simplified)
        i_parents = set(up_node_dict.get(i, {}).keys())
        j_parents = set(up_node_dict.get(j, {}).keys())
        common_parents = i_parents.intersection(j_parents)
        
        if common_parents:
            if len(common_parents) == 2:
                return (1, 1, 2)  # Full siblings
            else:
                return (1, 1, 1)  # Half siblings
        
        # Default - no relationship found
        return None

def simulate_ibd_segments(rel_tuple, num_segments=None, noise_level=0.1):
    """Simulate IBD segments for a given relationship tuple.
    
    Args:
        rel_tuple: (up, down, num_ancs) tuple representing the relationship
        num_segments: Number of segments to simulate. If None, the count is
            chosen so that the simulated total is close to the expected total.
        noise_level: Level of noise to add to segment lengths
        
    Returns:
        segments: List of simulated IBD segments
    """
    if rel_tuple is None:
        return []  # No relationship, no IBD
    
    up, down, num_ancs = rel_tuple
    degree = up + down  # Number of meioses separating the two individuals
    
    # Toy expected IBD per degree. These are illustrative values, not Bonsai's
    # IBD model: the total halves with each extra meiosis, and beyond degree 2
    # the number of common ancestors is ignored.
    if degree == 0:  # Self
        expected_total_cm = 3400  # Entire genome
        avg_segment_cm = 340
    elif degree == 1:  # Parent-child
        expected_total_cm = 3400 / 2  # Half the genome
        avg_segment_cm = 170
    elif degree == 2 and num_ancs == 2:  # Full siblings
        expected_total_cm = 2550  # Toy value (75% of 3400 cM)
        avg_segment_cm = 85
    elif degree == 2 and num_ancs == 1:  # Half siblings/grandparents
        expected_total_cm = 1700  # Toy value (50% of 3400 cM)
        avg_segment_cm = 42.5
    elif degree == 3:  # e.g. avuncular, great-grandparent
        expected_total_cm = 850  # ~25% of the genome
        avg_segment_cm = 21.25
    elif degree == 4:  # e.g. first cousins
        expected_total_cm = 425  # ~12.5% of the genome
        avg_segment_cm = 10.6
    elif degree == 5:  # e.g. first cousins once removed
        expected_total_cm = 212.5  # ~6.25% of the genome
        avg_segment_cm = 5.3
    elif degree == 6:  # e.g. second cousins
        expected_total_cm = 106.25  # ~3.125% of the genome
        avg_segment_cm = 5.3 / 2
    else:  # More distant
        expected_total_cm = 53.125  # ~1.5625% of the genome
        avg_segment_cm = 5.3 / 4
    
    # A fixed count of 10 segments only reproduces expected_total_cm for
    # degrees 0 and 1, so derive the count from the expected total by default
    if num_segments is None:
        num_segments = max(1, round(expected_total_cm / avg_segment_cm))
    
    # Generate simulated segments
    segments = []
    chromosomes = list(range(1, 23))  # Chromosomes 1-22
    
    for _ in range(num_segments):
        # Select a random chromosome
        chromosome = random.choice(chromosomes)
        
        # Generate a segment length with some noise
        segment_cm = avg_segment_cm * (1 + noise_level * (random.random() - 0.5))
        
        # Generate random start and end positions (in genetic distance).
        # max_pos is a toy chromosome length; a segment longer than it starts at 0
        # rather than at a negative position.
        max_pos = 100 + 20 * chromosome
        start_cm = random.uniform(0, max(0.0, max_pos - segment_cm))
        end_cm = start_cm + segment_cm
        
        segments.append({
            "chromosome": chromosome,
            "start_cm": start_cm,
            "end_cm": end_cm,
            "length_cm": segment_cm
        })
    
    return segments

# Create a function to simulate IBD data for a pedigree
def simulate_pedigree_ibd(pedigree, genotyped_ids=None):
    """Simulate IBD segments for all genotyped individuals in a pedigree.
    
    Args:
        pedigree: Up-node dictionary representing the pedigree
        genotyped_ids: List of IDs to simulate (defaults to all positive IDs)
        
    Returns:
        ibd_data: Dictionary mapping pairs of IDs to their simulated IBD segments
    """
    # If no genotyped IDs are provided, use all positive IDs
    if genotyped_ids is None:
        all_ids = set(pedigree.keys()).union(*[set(parents.keys()) for parents in pedigree.values()])
        genotyped_ids = [i for i in all_ids if i > 0]
    
    # Create a dictionary to store IBD segments for each pair
    ibd_data = {}
    
    # For each pair of genotyped individuals
    for i in range(len(genotyped_ids)):
        for j in range(i + 1, len(genotyped_ids)):
            id1 = genotyped_ids[i]
            id2 = genotyped_ids[j]
            
            # Get the relationship tuple
            rel_tuple = get_simple_rel_tuple(pedigree, id1, id2)
            
            # Simulate IBD segments based on the relationship
            segments = simulate_ibd_segments(rel_tuple)
            
            # Store the segments
            pair_key = (id1, id2)
            ibd_data[pair_key] = segments
    
    return ibd_data

# Simulate IBD data for our combined pedigree
# Use only genotyped IDs (positive numbers)
all_ids = set(combined_pedigree.keys()).union(*[set(parents.keys()) for parents in combined_pedigree.values()])
genotyped_ids = sorted([i for i in all_ids if i > 0])
print(f"Genotyped individuals in the combined pedigree: {genotyped_ids}")

# Simulate IBD data
ibd_data = simulate_pedigree_ibd(combined_pedigree, genotyped_ids)

# Display summary of IBD sharing
ibd_summary = []
for (id1, id2), segments in ibd_data.items():
    total_cm = sum(seg["length_cm"] for seg in segments)
    num_segments = len(segments)
    
    # Get relationship information
    rel_tuple = get_simple_rel_tuple(combined_pedigree, id1, id2)
    relationship = "Unknown"
    if rel_tuple:
        up, down, num_ancs = rel_tuple
        degree = up + down
        if degree == 0:
            relationship = "Self"
        elif degree == 1:
            relationship = "Parent-Child"
        elif degree == 2 and num_ancs == 2:
            relationship = "Full Siblings"
        elif degree == 2 and num_ancs == 1:
            relationship = "Half Siblings/Grandparent"
        elif degree == 3:
            relationship = "Avuncular/Great-Grandparent"
        elif degree == 4:
            relationship = "First Cousins (or other degree 4)"
        else:
            relationship = f"Degree {degree} (Distant Relative)"
    
    ibd_summary.append({
        "Individual 1": id1,
        "Individual 2": id2,
        "Relationship": relationship,
        "Total cM": total_cm,
        "Num Segments": num_segments
    })

# Display the IBD summary
ibd_df = pd.DataFrame(ibd_summary).sort_values(by="Total cM", ascending=False)
display(ibd_df)

### 3.2 Evaluating Alternative Pedigree Structures

A key challenge in pedigree reconstruction is choosing between alternative structures that could explain the observed IBD data. Let's create a function to evaluate how well a pedigree explains a set of IBD data:

In [ ]:
# First, let's examine how Bonsai v3 evaluates pedigrees
# The import is done only outside JupyterLite, where the Bonsai codebase is available
if not is_jupyterlite():
    from utils.bonsaitree.bonsaitree.v3.likelihoods import (
        get_ped_like as actual_get_ped_like
    )
    print("Source code for get_ped_like in Bonsai v3:")
    view_source(actual_get_ped_like)
else:
    print("Cannot display actual Bonsai v3 functions in JupyterLite environment.")

# For our lab, we'll implement a simplified pedigree evaluation function
def evaluate_pedigree(pedigree, ibd_data):
    """Evaluate how well a pedigree explains a set of IBD data.
    
    Args:
        pedigree: Up-node dictionary representing the pedigree
        ibd_data: Dictionary mapping pairs of IDs to their IBD segments
        
    Returns:
        log_likelihood: Log-likelihood score (higher is better)
        relationship_consistency: Percentage of relationships consistent with IBD
    """
    import math
    
    log_likelihood = 0.0
    consistent_relationships = 0
    total_relationships = 0
    
    # For each pair with IBD data
    for (id1, id2), segments in ibd_data.items():
        total_relationships += 1
        
        # Get the expected relationship from the pedigree
        expected_rel = get_simple_rel_tuple(pedigree, id1, id2)
        if expected_rel is None:
            # No relationship expected: penalize if IBD is observed,
            # otherwise the pair is consistent with the pedigree
            if segments:
                log_likelihood -= 10.0  # Arbitrary penalty
            else:
                consistent_relationships += 1
            continue
            
        expected_up, expected_down, expected_num_ancs = expected_rel
        expected_degree = expected_up + expected_down
        
        # Calculate expected IBD sharing based on the relationship
        if expected_degree == 0:  # Self
            expected_total_cm = 3400
        elif expected_degree == 1:  # Parent-child
            expected_total_cm = 3400 / 2
        elif expected_degree == 2 and expected_num_ancs == 2:  # Full siblings
            expected_total_cm = 2550
        elif expected_degree == 2 and expected_num_ancs == 1:  # Half siblings/grandparents
            expected_total_cm = 1700
        elif expected_degree == 3:  # e.g. avuncular, great-grandparent
            expected_total_cm = 850
        elif expected_degree == 4:  # e.g. first cousins
            expected_total_cm = 425
        elif expected_degree == 5:  # e.g. first cousins once removed
            expected_total_cm = 212.5
        elif expected_degree == 6:  # e.g. second cousins
            expected_total_cm = 106.25
        else:  # More distant
            expected_total_cm = 53.125
        
        # Calculate observed IBD sharing
        observed_total_cm = sum(seg["length_cm"] for seg in segments)
        
        # Allow for some variation in IBD sharing (standard deviation proportional to expected)
        std_dev = expected_total_cm * 0.2  # 20% variation
        
        # Calculate log-likelihood using a normal distribution around the expected value
        if std_dev > 0:
            log_likelihood += -0.5 * ((observed_total_cm - expected_total_cm) / std_dev) ** 2 - math.log(std_dev)
        
        # Check if the relationship is consistent with the IBD
        # A simple check: is the observed IBD within 50% of the expected?
        if 0.5 * expected_total_cm <= observed_total_cm <= 1.5 * expected_total_cm:
            consistent_relationships += 1
    
    # Calculate relationship consistency as a percentage
    relationship_consistency = 100 * consistent_relationships / total_relationships if total_relationships > 0 else 0
    
    return log_likelihood, relationship_consistency
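
To see how the per-pair term in `evaluate_pedigree` behaves, the sketch below evaluates the same expression on its own: -0.5 * ((observed - expected) / sigma)^2 - ln(sigma), with sigma set to 20% of the expected total. `pair_log_likelihood` is a hypothetical helper introduced only for illustration, and the observed totals are made up.

```python
import math

def pair_log_likelihood(observed_cm, expected_cm, rel_sd=0.2):
    """Unnormalized Gaussian log-likelihood term, as used in evaluate_pedigree."""
    std_dev = expected_cm * rel_sd
    z = (observed_cm - expected_cm) / std_dev
    return -0.5 * z ** 2 - math.log(std_dev)

expected = 1700.0  # Toy expected total for a parent-child pair in this lab
for observed in (1700.0, 1500.0, 850.0):
    score = pair_log_likelihood(observed, expected)
    print(f"observed = {observed:7.1f} cM -> log-likelihood term = {score:.2f}")
```

The term is largest when the observed total equals the expected total and falls off quadratically as the two diverge. Because of the `-log(std_dev)` part, even a perfect fit contributes a negative value that depends on the expected total, so scores are only meaningful when compared across pedigrees evaluated on the same set of pairs.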

In [ ]:
# Let's create a few alternative pedigree structures and evaluate them
# 1. The original combined pedigree (our baseline)
# 2. A pedigree where we mistakenly connect with a different direction
# 3. A pedigree where we connect completely different individuals

# Evaluate the original combined pedigree
original_ll, original_consistency = evaluate_pedigree(combined_pedigree, ibd_data)
print(f"Original pedigree: Log-likelihood = {original_ll:.2f}, Consistency = {original_consistency:.1f}%")

# Alternative 1: Connect with upward direction instead of downward
# ID 10 is the common parent in the renumbered half-sibling structure
alt_connection_point = (10, None, 1)  # Connect upward from ID 10
alt_pedigree1 = connect_pedigrees(half_sibs_renamed, sibling_group, alt_connection_point)
alt1_ll, alt1_consistency = evaluate_pedigree(alt_pedigree1, ibd_data)
print(f"Alternative 1 (different direction): Log-likelihood = {alt1_ll:.2f}, Consistency = {alt1_consistency:.1f}%")

# Alternative 2: Connect different individuals
alt_connection_point2 = (6, None, 0)  # Connect downward from ID 6 (one of the half-siblings)
alt_pedigree2 = connect_pedigrees(half_sibs_renamed, sibling_group, alt_connection_point2)
alt2_ll, alt2_consistency = evaluate_pedigree(alt_pedigree2, ibd_data)
print(f"Alternative 2 (different individuals): Log-likelihood = {alt2_ll:.2f}, Consistency = {alt2_consistency:.1f}%")

# Create a summary of the evaluations
evaluation_summary = pd.DataFrame([
    {"Pedigree": "Original", "Log-Likelihood": original_ll, "Consistency %": original_consistency},
    {"Pedigree": "Alternative 1 (Different Direction)", "Log-Likelihood": alt1_ll, "Consistency %": alt1_consistency},
    {"Pedigree": "Alternative 2 (Different Individuals)", "Log-Likelihood": alt2_ll, "Consistency %": alt2_consistency}
]).sort_values(by="Log-Likelihood", ascending=False)

# Display the evaluation summary
display(evaluation_summary)

# Visualize the best pedigree
best_pedigree_idx = evaluation_summary.index[0]
best_pedigree_name = evaluation_summary.loc[best_pedigree_idx, "Pedigree"]
best_pedigree = {
    "Original": combined_pedigree,
    "Alternative 1 (Different Direction)": alt_pedigree1,
    "Alternative 2 (Different Individuals)": alt_pedigree2
}[best_pedigree_name]

print(f"The best pedigree is: {best_pedigree_name}")
visualize_pedigree(best_pedigree, f"Best Pedigree: {best_pedigree_name}", individual_metadata=combined_metadata)
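
Raw log-likelihoods can be hard to compare by eye. One common way to read them, shown here as a sketch rather than as anything Bonsai itself does, is to convert them into normalized weights under the assumption that every candidate pedigree is equally likely a priori. The function name `relative_weights` and the example scores are hypothetical.

```python
import math

def relative_weights(log_likelihoods):
    """Convert {name: log-likelihood} into normalized weights (equal priors assumed)."""
    # Subtract the maximum before exponentiating to avoid numerical underflow
    max_ll = max(log_likelihoods.values())
    unnormalized = {
        name: math.exp(ll - max_ll) for name, ll in log_likelihoods.items()
    }
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}

# Made-up scores for illustration only
example_scores = {"Original": -120.0, "Alternative 1": -125.0, "Alternative 2": -300.0}
for name, weight in relative_weights(example_scores).items():
    print(f"{name}: {weight:.4f}")
```

With the simplified scoring in this lab, such weights inherit the arbitrary penalty and the assumed 20% standard deviation, so they are best read as a ranking aid rather than as calibrated probabilities.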

## Summary

In this lab, we explored how Bonsai v3 connects individuals into small pedigree structures. We examined the actual Bonsai v3 implementations of key functions before providing simplified versions for educational purposes. Key takeaways include:

1. **Understanding Bonsai's Core Functions**: We examined the source code of critical Bonsai v3 functions like `get_possible_connection_point_set`, `add_parent`, `connect_pedigrees_through_points`, and `get_ped_like` to understand how Bonsai builds and evaluates pedigrees.

2. **Building Blocks**: We learned how to create basic pedigree structures like parent-child units, sibling groups, and half-sibling structures, which form the building blocks of more complex pedigrees.

3. **Connection Points**: We identified potential connection points within pedigrees using Bonsai's `get_possible_connection_point_set` function, which represents positions where different pedigrees can be joined together.

4. **Connecting Pedigrees**: We implemented simplified versions of Bonsai's pedigree connection functions to connect two pedigrees based on specified connection points, considering different types of connections (upward, downward, or lateral).

5. **Evaluating Structures**: We created simplified methods based on Bonsai's likelihood evaluation system to evaluate how well different pedigree structures explain observed IBD data, allowing us to select the most likely structure.

6. **Alternative Hypotheses**: We compared different pedigree hypotheses by evaluating their log-likelihood scores and consistency with the observed genetic data, similar to how Bonsai compares alternative pedigree configurations.

These techniques form the foundation of Bonsai's approach to pedigree reconstruction. By connecting small structures based on genetic evidence and evaluating alternative hypotheses, Bonsai builds cohesive pedigrees that best explain the observed patterns of DNA sharing. The simplified implementations we've created in this lab help illustrate the core principles while being compatible with JupyterLite, though they lack some of the sophisticated optimizations and statistical models found in the full Bonsai v3 codebase.

In [ ]:
# Convert this notebook to PDF using poetry
!poetry run jupyter nbconvert --to pdf Lab13_Small_Pedigree_Structures.ipynb

# Note: PDF conversion requires LaTeX to be installed on your system
# If you encounter errors, you may need to install it:
# On Ubuntu/Debian: sudo apt-get install texlive-xetex
# On macOS with Homebrew: brew install texlive