# Lab 14: Data Structures and Algorithmic Design in Bonsai

Building on the mathematical foundations covered in Lab 13, this lab examines the data structures and algorithmic design behind Bonsai's pedigree reconstruction. It focuses on how Bonsai organizes and processes genetic data for efficient relationship inference and pedigree construction.

> **Why This Matters:** The efficacy of pedigree reconstruction algorithms depends heavily on the design of their underlying data structures. Understanding how Bonsai organizes and processes genetic data allows researchers to optimize algorithm performance, implement custom extensions, and interpret complex outputs accurately.

**Learning Objectives**:
- Analyze the core data structures that power the Bonsai algorithm
- Understand how IBD segment data is preprocessed and organized for efficient access
- Explore the up-node dictionary structure and its role in representing pedigree relationships
- Examine the graph-theoretical foundations of pedigree representation
- Learn how Bonsai's data structures enable efficient search and optimization
- Implement key data structures and operations used in pedigree reconstruction

## Environment Setup

In [None]:
import os
import sys
import math
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from scipy.stats import poisson, expon, norm, multivariate_normal
from collections import defaultdict, deque
from pathlib import Path
from IPython.display import display, HTML
from dotenv import load_dotenv
import heapq

## 1. Core Data Structures in Bonsai

Bonsai's effectiveness stems from its carefully designed data structures that balance computational efficiency with biological accuracy. Let's examine the key structures that power the algorithm.

### IBD Segment Representation

The fundamental input to Bonsai is Identity-By-Descent (IBD) segment data, which is structured to facilitate rapid relationship inference.

In [None]:
class IBDSegment:
    def __init__(self, ind1, ind2, chrom, start_pos, end_pos, is_ibd2, length_cm):
        self.ind1 = ind1          # First individual ID
        self.ind2 = ind2          # Second individual ID
        self.chrom = chrom        # Chromosome number
        self.start_pos = start_pos  # Start position (base pairs)
        self.end_pos = end_pos    # End position (base pairs)
        self.is_ibd2 = is_ibd2    # Whether this is an IBD2 segment
        self.length_cm = length_cm  # Genetic length in centiMorgans
    
    def __repr__(self):
        return f"IBDSegment({self.ind1}, {self.ind2}, chr{self.chrom}, {self.length_cm:.2f}cM, {'IBD2' if self.is_ibd2 else 'IBD1'})"

# Example of how segments are organized in Bonsai
def organize_segments_by_pair(ibd_segments):
    segments_by_pair = {}
    for seg in ibd_segments:
        pair = tuple(sorted([seg.ind1, seg.ind2]))
        if pair not in segments_by_pair:
            segments_by_pair[pair] = []
        segments_by_pair[pair].append(seg)
    return segments_by_pair

Let's create some example IBD segments to work with throughout this lab:

In [None]:
# Create example IBD segments
example_segments = [
    # Parent-child segments (extensive IBD1 sharing)
    IBDSegment(1000, 1001, 1, 10000000, 50000000, False, 45.5),  # IBD1 segment on chr1
    IBDSegment(1000, 1001, 1, 60000000, 125000000, False, 65.2),  # IBD1 segment on chr1
    IBDSegment(1000, 1001, 2, 5000000, 80000000, False, 75.0),   # IBD1 segment on chr2
    
    # Sibling segments (mix of IBD1 and IBD2)
    IBDSegment(1001, 1002, 1, 15000000, 45000000, True, 30.1),   # IBD2 segment on chr1
    IBDSegment(1001, 1002, 1, 50000000, 90000000, False, 40.3),  # IBD1 segment on chr1
    IBDSegment(1001, 1002, 2, 10000000, 45000000, False, 35.5),  # IBD1 segment on chr2
    IBDSegment(1001, 1002, 2, 50000000, 70000000, True, 20.8),   # IBD2 segment on chr2
    
    # First cousin segments (modest IBD1 sharing)
    IBDSegment(1000, 1003, 1, 25000000, 50000000, False, 25.0),  # IBD1 segment on chr1
    IBDSegment(1000, 1003, 3, 15000000, 35000000, False, 20.5),  # IBD1 segment on chr3
    
    # Distant relative (single small segment)
    IBDSegment(1002, 1004, 5, 30000000, 45000000, False, 15.2),  # IBD1 segment on chr5
]

# Organize segments by pair
segments_by_pair = organize_segments_by_pair(example_segments)

# Display the organized segments
for pair, segs in segments_by_pair.items():
    print(f"Pair {pair}:")
    for seg in segs:
        print(f"  {seg}")
    print()

This organization allows Bonsai to efficiently access all IBD segments between a specific pair of individuals, which is crucial for calculating relationship likelihoods.
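
The same pair index can be written more compactly with `collections.defaultdict`. The sketch below is a minimal illustration rather than Bonsai's actual implementation; the `Seg` tuple, `build_pair_index`, and `get_pair_segments` are hypothetical names introduced here. It shows that normalizing the key with `sorted` makes lookups independent of the order in which the two IDs are supplied.

```python
from collections import defaultdict, namedtuple

# Hypothetical lightweight stand-in for IBDSegment (illustration only)
Seg = namedtuple("Seg", ["ind1", "ind2", "chrom", "length_cm"])

def build_pair_index(segments):
    """Group segments under an order-independent (low_id, high_id) key."""
    index = defaultdict(list)
    for seg in segments:
        index[tuple(sorted((seg.ind1, seg.ind2)))].append(seg)
    return dict(index)  # Plain dict so missing pairs are not created on lookup

def get_pair_segments(index, id_a, id_b):
    """Return all segments shared by id_a and id_b, in either argument order."""
    return index.get(tuple(sorted((id_a, id_b))), [])

demo_segments = [
    Seg(1001, 1000, 1, 45.5),   # IDs deliberately given in descending order
    Seg(1000, 1001, 2, 75.0),
    Seg(1002, 1004, 5, 15.2),
]
demo_index = build_pair_index(demo_segments)

print(len(get_pair_segments(demo_index, 1000, 1001)))
print(len(get_pair_segments(demo_index, 1001, 1000)))
print(get_pair_segments(demo_index, 1000, 1004))
```

Returning an empty list for unknown pairs keeps downstream code free of `KeyError` handling when two individuals share no detected IBD.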

### The Up-Node Dictionary

The core data structure for representing pedigrees in Bonsai is the "up-node dictionary," which encodes parent-child relationships in a compact format:

In [None]:
# Example up-node dictionary structure
up_node_dict = {
    1000: {1001: 1, 1002: 1},  # Individual 1000 has parents 1001 and 1002
    1003: {1001: 1, 1002: 1},  # Individual 1003 has the same parents
    1004: {-1: 1, -2: 1},      # Individual 1004 has inferred parents -1 and -2
    -1: {1005: 1, 1006: 1},    # Inferred individual -1 has parents 1005 and 1006
    -2: {},                    # Inferred individual -2 has no recorded parents
    1005: {},                  # Individual 1005 has no recorded parents
    1006: {},                  # Individual 1006 has no recorded parents
    1001: {},                  # Individual 1001 has no recorded parents
    1002: {}                   # Individual 1002 has no recorded parents
}

# Function to visualize pedigree from up-node dictionary
def visualize_pedigree(up_node_dict):
    """Create a visualization of the pedigree from an up-node dictionary."""
    # Create a directed graph
    G = nx.DiGraph()
    
    # Add nodes and edges
    for child, parents in up_node_dict.items():
        G.add_node(child)
        for parent in parents:
            G.add_node(parent)
            G.add_edge(parent, child)  # Direction from parent to child
    
    # Set node colors: green for real individuals (positive IDs), gray for latent (negative IDs)
    node_colors = ['lightgreen' if isinstance(node, int) and node > 0 else 'lightgray' 
                   for node in G.nodes()]
    
    # Calculate layout
    try:
        pos = nx.nx_agraph.graphviz_layout(G, prog='dot')
    except Exception:  # e.g., ImportError when pygraphviz is not installed
        pos = nx.spring_layout(G)  # Fallback if graphviz is not available
    
    # Draw the graph
    plt.figure(figsize=(12, 8))
    nx.draw(G, pos, with_labels=True, node_color=node_colors, 
            node_size=700, font_size=10, arrows=True)
    plt.title('Pedigree Structure')
    plt.show()

# Visualize the example pedigree
visualize_pedigree(up_node_dict)

Key features of the up-node dictionary:
- Each key represents an individual ID (positive for observed individuals, negative for inferred ancestors)
- Each value is a dictionary mapping parent IDs to 1 (the value 1 is a placeholder; the structure could be extended to include additional information)
- An empty dictionary indicates an individual with no recorded parents (either a founder or an individual with unknown parentage)
- The structure supports efficient traversal of ancestors and descendants
- It can represent complex multi-generational pedigrees with inferred latent ancestors
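
To make the traversal claim concrete, here is a small sketch of two common operations on an up-node dictionary: collecting all ancestors of an individual, and inverting the structure into a parent-to-children mapping for descendant lookups. The helper names `get_ancestors` and `build_down_node_dict` are assumptions made for this example and are not taken from the Bonsai codebase.

```python
from collections import deque

def get_ancestors(up_dict, individual_id):
    """Return the set of all ancestors of an individual (breadth-first)."""
    ancestors = set()
    queue = deque(up_dict.get(individual_id, {}))  # Start from the direct parents
    while queue:
        current = queue.popleft()
        if current in ancestors:
            continue  # Already reached through another path
        ancestors.add(current)
        queue.extend(up_dict.get(current, {}))
    return ancestors

def build_down_node_dict(up_dict):
    """Invert an up-node dictionary into a parent -> set of children mapping."""
    down = {ind: set() for ind in up_dict}
    for child, parents in up_dict.items():
        for parent in parents:
            down.setdefault(parent, set()).add(child)
    return down

demo_up = {
    1000: {1001: 1, 1002: 1},
    1004: {-1: 1, -2: 1},
    -1: {1005: 1, 1006: 1},
    -2: {}, 1001: {}, 1002: {}, 1005: {}, 1006: {},
}
demo_down = build_down_node_dict(demo_up)

print(sorted(get_ancestors(demo_up, 1004)))
print(sorted(demo_down[1001]))
```

Because the up-node dictionary only stores child-to-parent links, descendant queries are cheapest when the inverted mapping is built once and reused.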

### BioInfo Structure

Bonsai incorporates biological metadata about individuals through the BioInfo structure:

In [None]:
# Example BioInfo structure
bio_info = [
    {'genotype_id': 1000, 'age': 75, 'sex': 'F'},
    {'genotype_id': 1001, 'age': 80, 'sex': 'M'},
    {'genotype_id': 1002, 'age': 78, 'sex': 'F'},
    {'genotype_id': 1003, 'age': 45, 'sex': 'M'},
    {'genotype_id': 1004, 'age': 43, 'sex': 'F'},
    {'genotype_id': 1005, 'age': 20, 'sex': 'M'}
]

# Display the information
bio_df = pd.DataFrame(bio_info)
bio_df

This structure is used to enforce biological constraints (e.g., age-appropriate relationships, sex-specific reproduction) and to enhance the accuracy of pedigree reconstruction.
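
As an illustration of how such constraints might be checked, the sketch below tests whether a proposed parent is old enough and whether two proposed parents have different recorded sexes. The 12-year minimum age gap and both function names are assumptions chosen for this example, not parameters or functions documented for Bonsai.

```python
# Hypothetical threshold: minimum age difference (years) between parent and child
MIN_PARENT_AGE_GAP = 12

def is_plausible_parent(parent, child, min_gap=MIN_PARENT_AGE_GAP):
    """Return True if `parent` is old enough to be the parent of `child`.

    Records with a missing age are treated as unconstrained.
    """
    parent_age, child_age = parent.get('age'), child.get('age')
    if parent_age is None or child_age is None:
        return True
    return parent_age - child_age >= min_gap

def is_plausible_parent_pair(parent_a, parent_b):
    """Return True if two proposed parents of one child have different recorded sexes."""
    sex_a, sex_b = parent_a.get('sex'), parent_b.get('sex')
    if sex_a is None or sex_b is None:
        return True
    return sex_a != sex_b

demo_bio = {
    1001: {'genotype_id': 1001, 'age': 80, 'sex': 'M'},
    1002: {'genotype_id': 1002, 'age': 78, 'sex': 'F'},
    1003: {'genotype_id': 1003, 'age': 45, 'sex': 'M'},
}

print(is_plausible_parent(demo_bio[1001], demo_bio[1003]))       # 35-year gap
print(is_plausible_parent(demo_bio[1001], demo_bio[1002]))       # 2-year gap
print(is_plausible_parent_pair(demo_bio[1001], demo_bio[1002]))  # M and F
```

Treating missing values as unconstrained keeps the check conservative: it only rules out relationships that the recorded metadata actually contradicts.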

## 2. Preprocessing IBD Data for Efficient Access

Before pedigree reconstruction begins, Bonsai preprocesses the IBD data to optimize subsequent operations.

### Indexing and Filtering

The preprocessing pipeline includes several key steps:

In [None]:
def preprocess_ibd_segments(segments, min_cm=7):
    """Preprocess IBD segments for efficient access."""
    # 1. Filter by minimum length
    filtered_segments = [seg for seg in segments if seg.length_cm >= min_cm]
    
    # 2. Create pair-based index
    pair_index = {}
    for seg in filtered_segments:
        pair = tuple(sorted([seg.ind1, seg.ind2]))
        if pair not in pair_index:
            pair_index[pair] = []
        pair_index[pair].append(seg)
    
    # 3. Create individual-based index
    ind_index = {}
    for seg in filtered_segments:
        for ind in [seg.ind1, seg.ind2]:
            if ind not in ind_index:
                ind_index[ind] = []
            ind_index[ind].append(seg)
    
    # 4. Calculate summary statistics
    pair_stats = {}
    for pair, segs in pair_index.items():
        ind1, ind2 = pair
        total_ibd1_cm = sum(seg.length_cm for seg in segs if not seg.is_ibd2)
        total_ibd2_cm = sum(seg.length_cm for seg in segs if seg.is_ibd2)
        ibd1_count = sum(1 for seg in segs if not seg.is_ibd2)
        ibd2_count = sum(1 for seg in segs if seg.is_ibd2)
        
        pair_stats[pair] = {
            'total_ibd1_cm': total_ibd1_cm,
            'total_ibd2_cm': total_ibd2_cm,
            'ibd1_count': ibd1_count,
            'ibd2_count': ibd2_count,
            'total_cm': total_ibd1_cm + total_ibd2_cm,
            'segment_count': len(segs)
        }
    
    return {
        'filtered_segments': filtered_segments,
        'pair_index': pair_index,
        'ind_index': ind_index,
        'pair_stats': pair_stats
    }

# Preprocess our example segments
preprocessed_data = preprocess_ibd_segments(example_segments)

# Display the summary statistics
print("Pair Statistics:")
for pair, stats in preprocessed_data['pair_stats'].items():
    print(f"\nPair {pair}:")
    for stat, value in stats.items():
        print(f"  {stat}: {value:.2f}" if isinstance(value, float) else f"  {stat}: {value}")

This preprocessing creates several indexes that enable efficient data access during pedigree reconstruction:
- The pair index allows quick lookup of all segments between a specific pair of individuals
- The individual index facilitates finding all segments involving a particular individual
- The pair statistics cache frequently used summary metrics to avoid redundant calculations
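
The individual index is what makes a question like "who shares the most IBD with this person?" cheap to answer. The sketch below is a simplified illustration using plain `(ind1, ind2, length_cm)` tuples instead of `IBDSegment` objects; `rank_relatives` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical minimal segment records: (ind1, ind2, length_cm)
demo_records = [
    (1000, 1001, 45.5), (1000, 1001, 65.2), (1000, 1003, 25.0),
    (1001, 1002, 30.1), (1002, 1004, 15.2),
]

# Individual-based index: individual -> list of (relative, length_cm)
demo_ind_index = defaultdict(list)
for ind1, ind2, length_cm in demo_records:
    demo_ind_index[ind1].append((ind2, length_cm))
    demo_ind_index[ind2].append((ind1, length_cm))

def rank_relatives(ind_index, individual_id):
    """Return [(relative, total_cm), ...] sorted by total sharing, highest first."""
    totals = defaultdict(float)
    for relative, length_cm in ind_index.get(individual_id, []):
        totals[relative] += length_cm
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

for relative, total_cm in rank_relatives(demo_ind_index, 1000):
    print(f"1000 shares {total_cm:.1f} cM with {relative}")
```

Only the segments touching the queried individual are visited, so the cost depends on that individual's number of segments rather than on the size of the whole dataset.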

### Moment-Based Summary Statistics

Bonsai relies heavily on "IBD moments" to summarize the IBD sharing between individuals:

In [None]:
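def calculate_ibd_moments(pair, pair_index, min_cm=7):
    """Calculate IBD moments for a pair of individuals.

    The labels m1, m2, m3 follow this lab's convention; in strict statistical
    terms they are the power sums of segment lengths of order 0, 1, and 2.
    """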
    if pair not in pair_index:
        return {'m1': 0, 'm2': 0, 'm3': 0}  # No segments
    
    segments = pair_index[pair]
    filtered_segments = [seg for seg in segments if seg.length_cm >= min_cm]
    
    # First moment: number of segments
    m1 = len(filtered_segments)
    
    # Second moment: total length
    m2 = sum(seg.length_cm for seg in filtered_segments)
    
    # Third moment: sum of squares (optional)
    m3 = sum(seg.length_cm ** 2 for seg in filtered_segments)
    
    return {'m1': m1, 'm2': m2, 'm3': m3}

# Calculate moments for each pair in our example data
pair_moments = {}
for pair in preprocessed_data['pair_index']:
    pair_moments[pair] = calculate_ibd_moments(pair, preprocessed_data['pair_index'])

# Display the moments
print("IBD Moments by Pair:")
for pair, moments in pair_moments.items():
    print(f"\nPair {pair}:")
    print(f"  First moment (count): {moments['m1']}")
    print(f"  Second moment (total length): {moments['m2']:.2f} cM")
    print(f"  Third moment (sum of squares): {moments['m3']:.2f} cM²")

These moments are used in the likelihood calculations that drive pedigree reconstruction. By caching them, Bonsai avoids recalculating these statistics repeatedly during optimization.

## 3. Graph-Theoretical Representation of Pedigrees

Pedigrees are fundamentally graph structures, with individuals as nodes and parent-child relationships as edges. Bonsai leverages graph theory to represent and manipulate pedigrees efficiently.

### Directed Acyclic Graphs (DAGs)

A valid pedigree forms a directed acyclic graph (DAG), where:
- Nodes represent individuals
- Directed edges represent parent-child relationships (pointing from parent to child)
- The graph is acyclic (no cycles allowed, as individuals cannot be their own ancestors)

Bonsai enforces these properties through constraints in its optimization algorithm:

In [None]:
def would_create_cycle(up_node_dict, child_id, proposed_parent_id):
    """Check if adding a parent-child relationship would create a cycle."""
    # If the proposed parent is already a descendant of the child,
    # adding this relationship would create a cycle
    
    # Start with the proposed parent
    current_ids = [proposed_parent_id]
    visited = set(current_ids)
    
    # Traverse up the pedigree
    while current_ids:
        next_ids = []
        for current_id in current_ids:
            # If we've reached the child, a cycle would be created
            if current_id == child_id:
                return True
                
            # Add this individual's parents to the search
            if current_id in up_node_dict:
                parents = up_node_dict[current_id].keys()
                for parent_id in parents:
                    if parent_id not in visited:
                        next_ids.append(parent_id)
                        visited.add(parent_id)
        
        current_ids = next_ids
    
    # No cycle found
    return False

# Test the cycle detection function
test_pedigree = {
    1000: {1001: 1, 1002: 1},  # 1000 has parents 1001 and 1002
    1001: {1003: 1, 1004: 1},  # 1001 has parents 1003 and 1004
    1002: {},
    1003: {},
    1004: {}
}

# Would adding 1000 as a parent of 1003 create a cycle?
# (1003 is already a grandparent of 1000 via 1001, so 1000 cannot also be a parent of 1003)
cycle_detected = would_create_cycle(test_pedigree, 1003, 1000)
print(f"Would adding 1000 as a parent of 1003 create a cycle? {cycle_detected}")

# Would adding 1002 as a parent of 1003 create a cycle?
cycle_detected = would_create_cycle(test_pedigree, 1003, 1002)
print(f"Would adding 1002 as a parent of 1003 create a cycle? {cycle_detected}")
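
The incremental check above tests a single proposed edge. To validate a whole pedigree at once, one option is a topological sort, which succeeds only when the graph is acyclic. The sketch below uses the standard-library `graphlib` module (Python 3.9+); it is an alternative technique shown for illustration, and the helper names are assumptions rather than Bonsai functions.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+

def ancestors_first_order(up_dict):
    """Return individuals ordered so that every parent precedes its children.

    Raises CycleError if the up-node dictionary is not a valid DAG.
    """
    # TopologicalSorter expects node -> predecessors, which matches the
    # child -> parents layout of the up-node dictionary
    return list(TopologicalSorter(up_dict).static_order())

def is_valid_pedigree(up_dict):
    """Return True if the up-node dictionary contains no cycles."""
    try:
        ancestors_first_order(up_dict)
        return True
    except CycleError:
        return False

demo_valid = {1000: {1001: 1, 1002: 1}, 1001: {1003: 1}, 1002: {}, 1003: {}}
demo_cyclic = {1000: {1001: 1}, 1001: {1003: 1}, 1003: {1000: 1}}

demo_order = ancestors_first_order(demo_valid)
print(demo_order)
print(is_valid_pedigree(demo_valid), is_valid_pedigree(demo_cyclic))
```

An ancestors-first ordering is also convenient for any computation that must process parents before their children, such as assigning generation depths.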

### Community Detection and Graph Partitioning

For large datasets, Bonsai uses graph-based community detection algorithms (e.g., Louvain method) to partition the data into more manageable subproblems:

In [None]:
def partition_with_community_detection(segments):
    """Partition individuals into communities based on IBD sharing."""
    # Create a graph where nodes are individuals and edges represent IBD sharing
    G = nx.Graph()
    
    # Add edges with weights based on IBD sharing
    for seg in segments:
        ind1, ind2 = seg.ind1, seg.ind2
        weight = seg.length_cm
        
        if G.has_edge(ind1, ind2):
            G[ind1][ind2]['weight'] += weight
        else:
            G.add_edge(ind1, ind2, weight=weight)
    
    # Apply Louvain community detection if available
    try:
        communities = nx.community.louvain_communities(G, weight='weight')
    except AttributeError:
        # Fallback to connected components if Louvain isn't available
        communities = [c for c in nx.connected_components(G)]
    
    # Return communities as lists of individual IDs
    return [list(community) for community in communities]

# Apply community detection to our example segments
communities = partition_with_community_detection(example_segments)

# Display the communities
print(f"Detected {len(communities)} communities:")
for i, community in enumerate(communities):
    print(f"Community {i+1}: {community}")

This partitioning strategy offers several advantages:
- Reduces computational complexity by breaking a large problem into smaller subproblems
- Focuses reconstruction on groups of individuals that are likely related
- Enables parallel processing of different communities
- Improves scalability to handle large datasets with thousands of individuals
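
Once communities are known, the IBD data can be split so that each community is processed on its own. The following sketch assumes segments are reduced to `(ind1, ind2, length_cm)` tuples and that communities are lists of IDs; `split_segments_by_community` is a hypothetical helper, not a Bonsai function.

```python
def split_segments_by_community(records, communities):
    """Assign each (ind1, ind2, length_cm) record to the community of its endpoints.

    Records whose endpoints fall in different communities are returned separately.
    """
    community_of = {ind: i for i, members in enumerate(communities) for ind in members}
    per_community = {i: [] for i in range(len(communities))}
    cross_community = []
    for record in records:
        c1, c2 = community_of.get(record[0]), community_of.get(record[1])
        if c1 is not None and c1 == c2:
            per_community[c1].append(record)
        else:
            cross_community.append(record)
    return per_community, cross_community

demo_communities = [[1000, 1001, 1003], [1002, 1004]]
demo_pairs = [
    (1000, 1001, 45.5), (1000, 1003, 25.0),
    (1002, 1004, 15.2), (1001, 1002, 30.1),
]

demo_within, demo_cross = split_segments_by_community(demo_pairs, demo_communities)
for idx, recs in demo_within.items():
    print(f"Community {idx}: {len(recs)} record(s)")
print(f"Cross-community records: {demo_cross}")
```

Keeping the cross-community records separate, instead of discarding them, preserves the evidence needed if sub-pedigrees are later joined.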

Let's visualize the communities based on IBD sharing:

In [None]:
def visualize_ibd_communities(segments, communities):
    """Visualize communities based on IBD sharing."""
    # Create a graph where nodes are individuals and edges represent IBD sharing
    G = nx.Graph()
    
    # Add edges with weights based on IBD sharing
    for seg in segments:
        ind1, ind2 = seg.ind1, seg.ind2
        weight = seg.length_cm
        
        if G.has_edge(ind1, ind2):
            G[ind1][ind2]['weight'] += weight
        else:
            G.add_edge(ind1, ind2, weight=weight)
    
    # Create a mapping from individual to community
    community_map = {}
    for i, community in enumerate(communities):
        for ind in community:
            community_map[ind] = i
    
    # Create node colors based on community
    cmap = plt.get_cmap('tab10', len(communities))  # plt.cm.get_cmap was removed in recent Matplotlib releases
    node_colors = [cmap(community_map[node]) for node in G.nodes()]
    
    # Create edge weights based on IBD sharing
    edge_weights = [G[u][v]['weight'] / 10 for u, v in G.edges()]
    
    # Set up the plot
    plt.figure(figsize=(10, 8))
    pos = nx.spring_layout(G, seed=42)  # Consistent layout
    
    # Draw the graph
    nx.draw(G, pos, with_labels=True, node_color=node_colors, 
            edge_color='gray', width=edge_weights, alpha=0.7,
            node_size=500, font_size=10)
    
    # Add edge labels (IBD sharing in cM)
    edge_labels = {(u, v): f"{d['weight']:.1f} cM" for u, v, d in G.edges(data=True)}
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=8)
    
    plt.title('IBD Sharing Communities')
    plt.tight_layout()
    plt.show()

# Visualize the communities
visualize_ibd_communities(example_segments, communities)

## 4. Building and Manipulating Pedigrees

Bonsai includes operations for building and modifying pedigree structures during the reconstruction process.

### Pedigree Modification Operations

These operations form the basis of Bonsai's optimization algorithm:

In [None]:
def add_parent(up_node_dict, child_id, parent_id):
    """Add a parent-child relationship to the pedigree."""
    # Ensure child exists in dictionary
    if child_id not in up_node_dict:
        up_node_dict[child_id] = {}
    
    # Check if adding this parent would create a cycle
    if would_create_cycle(up_node_dict, child_id, parent_id):
        return False  # Cannot add this relationship
    
    # Check if child already has two parents
    if len(up_node_dict[child_id]) >= 2:
        return False  # Child already has maximum number of parents
    
    # Add the parent
    up_node_dict[child_id][parent_id] = 1
    
    # Ensure parent exists in dictionary
    if parent_id not in up_node_dict:
        up_node_dict[parent_id] = {}
    
    return True

def remove_parent(up_node_dict, child_id, parent_id):
    """Remove a parent-child relationship from the pedigree."""
    if child_id not in up_node_dict or parent_id not in up_node_dict[child_id]:
        return False  # Relationship doesn't exist
    
    # Remove the relationship
    del up_node_dict[child_id][parent_id]
    return True

def swap_parent(up_node_dict, child_id, old_parent_id, new_parent_id):
    """Replace one parent with another."""
    # Remove old parent
    if not remove_parent(up_node_dict, child_id, old_parent_id):
        return False
    
    # Add new parent
    if not add_parent(up_node_dict, child_id, new_parent_id):
        # If adding new parent fails, restore old parent
        add_parent(up_node_dict, child_id, old_parent_id)
        return False
    
    return True

# Create a simple pedigree
test_pedigree = {
    1000: {},
    1001: {},
    1002: {}
}

# Add a parent-child relationship
add_parent(test_pedigree, 1000, 1001)
print("After adding 1001 as parent of 1000:")
print(test_pedigree)

# Add another parent
add_parent(test_pedigree, 1000, 1002)
print("\nAfter adding 1002 as parent of 1000:")
print(test_pedigree)

# Visualize the pedigree
visualize_pedigree(test_pedigree)

### Adding Inferred Ancestors

A key feature of Bonsai is its ability to infer missing ancestors. This is implemented by creating "latent nodes" with negative IDs:

In [None]:
def create_latent_ancestor(up_node_dict, next_latent_id=-1):
    """Create a new latent ancestor node in the pedigree."""
    # Find an unused negative ID
    while next_latent_id in up_node_dict:
        next_latent_id -= 1
    
    # Create the new latent node with no parents
    up_node_dict[next_latent_id] = {}
    
    return next_latent_id

def add_latent_parent_pair(up_node_dict, child_id):
    """Add a pair of latent parents to a child."""
    # Create two latent parents
    latent_parent1 = create_latent_ancestor(up_node_dict)
    latent_parent2 = create_latent_ancestor(up_node_dict, latent_parent1 - 1)
    
    # Add them as parents of the child
    add_parent(up_node_dict, child_id, latent_parent1)
    add_parent(up_node_dict, child_id, latent_parent2)
    
    return latent_parent1, latent_parent2

# Extend our test pedigree with latent ancestors
test_pedigree = {
    1000: {},
    1001: {},
    1002: {},
    1003: {}
}

# Add observed relationships
add_parent(test_pedigree, 1000, 1001)
add_parent(test_pedigree, 1000, 1002)

# Add latent parents for 1003
latent_parent1, latent_parent2 = add_latent_parent_pair(test_pedigree, 1003)
print(f"Added latent parents {latent_parent1} and {latent_parent2} to individual 1003")

# Connect latent parents to observed individuals to create a more complex pedigree
add_parent(test_pedigree, latent_parent1, 1001)
print("\nFinal pedigree with latent ancestors:")
print(test_pedigree)

# Visualize the extended pedigree
visualize_pedigree(test_pedigree)

This capability allows Bonsai to reconstruct more complete pedigrees even when data for some ancestors is unavailable.
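
One common use of latent nodes is to connect two observed individuals through shared, unobserved parents. The sketch below links two parentless individuals as full siblings. It is an illustration under stated assumptions: `next_latent_id` and `link_as_full_siblings` are hypothetical names, and the ID search also scans parent entries so that an ID appearing only as a parent (and not as a key) is never reused.

```python
def next_latent_id(up_dict):
    """Return an unused negative ID, checking keys and parent entries alike."""
    used = set(up_dict)
    for parents in up_dict.values():
        used.update(parents)
    # One below the most negative ID in use, or -1 if no negative ID exists yet
    return min(min(used, default=0), 0) - 1

def link_as_full_siblings(up_dict, id_a, id_b):
    """Give two parentless individuals the same pair of new latent parents."""
    if up_dict.get(id_a) or up_dict.get(id_b):
        raise ValueError("Both individuals must have no recorded parents")
    latent_1 = next_latent_id(up_dict)
    up_dict[latent_1] = {}
    latent_2 = next_latent_id(up_dict)
    up_dict[latent_2] = {}
    up_dict[id_a] = {latent_1: 1, latent_2: 1}
    up_dict[id_b] = {latent_1: 1, latent_2: 1}
    return latent_1, latent_2

# -1 appears only as a parent entry here, not as a key
demo_ped = {1001: {}, 1002: {}, 1004: {-1: 1}}
demo_latents = link_as_full_siblings(demo_ped, 1001, 1002)

print(demo_latents)
print(demo_ped)
```

Sharing both latent parents encodes a full-sibling relationship; sharing only one would encode half-siblings.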

## 5. Efficient Search and Optimization

Bonsai's data structures are designed to support efficient search and optimization of pedigree structures.

### Caching and Memoization

To avoid redundant calculations, Bonsai extensively uses caching and memoization:

In [None]:
class RelationshipCalculator:
    def __init__(self, up_node_dict):
        self.up_node_dict = up_node_dict
        self.coefficient_cache = {}  # Cache for relationship coefficients
    
    def get_coefficient(self, id1, id2):
        """Get the relationship coefficient between two individuals."""
        # Check cache first
        pair = tuple(sorted([id1, id2]))
        if pair in self.coefficient_cache:
            return self.coefficient_cache[pair]
        
        # Calculate coefficient
        coefficient = self._calculate_coefficient(id1, id2)
        
        # Cache the result
        self.coefficient_cache[pair] = coefficient
        return coefficient
    
    def _calculate_coefficient(self, id1, id2):
        """Calculate the coefficient of relationship by path counting (simplified sketch)."""
        # Self-relationship
        if id1 == id2:
            return 1.0
        
        # Direct parent-child relationship
        if id1 in self.up_node_dict.get(id2, {}) or id2 in self.up_node_dict.get(id1, {}):
            return 0.5
        
        # Calculate common ancestor contributions
        paths1 = self._get_paths_to_ancestors(id1)
        paths2 = self._get_paths_to_ancestors(id2)
        
        # Find common ancestors
        common_ancestors = set(paths1.keys()).intersection(set(paths2.keys()))
        
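        # Sum contributions from each common ancestor, counting only pairs of
        # paths that meet at that ancestor and nowhere below it. Without this
        # check, full siblings would also be credited for their grandparents.
        # (Inbreeding of the common ancestor is ignored in this sketch.)
        coefficient = 0.0
        for ancestor in common_ancestors:
            for path1 in paths1[ancestor]:
                for path2 in paths2[ancestor]:
                    below1 = set(path1[:-1]) | {id1}
                    below2 = set(path2[:-1]) | {id2}
                    if below1 & below2:
                        continue  # Paths overlap below the common ancestor
                    # Contribution is 0.5^(total number of parent-child links)
                    contribution = 0.5 ** (len(path1) + len(path2))
                    coefficient += contribution
        
        return coefficient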
    
    def _get_paths_to_ancestors(self, individual_id):
        """Find all paths from an individual to ancestors."""
        paths = {individual_id: [[]]}
        
        # Queue of (individual, current_path) tuples to process
        queue = deque([(individual_id, [])])
        
        while queue:
            current_id, current_path = queue.popleft()  # O(1), unlike list.pop(0)
            
            # Get parents of current individual
            parents = self.up_node_dict.get(current_id, {})
            
            for parent_id in parents:
                # Create a new path that includes this parent
                new_path = current_path + [parent_id]
                
                # Add path to parent's paths
                if parent_id not in paths:
                    paths[parent_id] = []
                paths[parent_id].append(new_path)
                
                # Add parent to queue for further processing
                queue.append((parent_id, new_path))
        
        return paths
    
    def invalidate_cache_for_individual(self, individual_id):
        """Invalidate cache entries involving a specific individual."""
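        # When the pedigree changes, we need to invalidate affected cache entries.
        # Note: this only drops pairs that include individual_id itself; coefficients
        # involving its descendants can also change and are not invalidated here.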
        keys_to_remove = []
        for key in self.coefficient_cache:
            if individual_id in key:
                keys_to_remove.append(key)
        
        for key in keys_to_remove:
            del self.coefficient_cache[key]

# Test the relationship calculator with our example pedigree
calculator = RelationshipCalculator(test_pedigree)

# Calculate some relationships
relationships = [
    (1000, 1001),  # Parent-child
    (1000, 1002),  # Parent-child
    (1003, 1001),  # Grandparent-grandchild (through latent parent)
    (1000, 1003)   # Half-avuncular: 1000 is a half-sibling of 1003's latent parent
]

for id1, id2 in relationships:
    coef = calculator.get_coefficient(id1, id2)
    print(f"Relationship coefficient between {id1} and {id2}: {coef:.4f}")

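This caching strategy avoids recomputing coefficients for pairs that have already been evaluated, which matters most during optimization, when many similar pedigree configurations are scored. Cached values remain valid only while the relevant part of the pedigree is unchanged, so entries must be invalidated whenever parent links are edited.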
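
When an individual's parents change, the coefficients of that individual and of all its descendants may change too. The sketch below shows one way to invalidate every affected cache entry. It is a minimal illustration working on a plain dictionary cache keyed by sorted ID pairs; `get_descendants` and `invalidate_affected` are hypothetical helper names.

```python
def get_descendants(up_dict, individual_id):
    """Return all descendants of an individual by scanning child -> parents links."""
    descendants = set()
    frontier = {individual_id}
    while frontier:
        # Children are individuals that list anyone in the frontier as a parent
        children = {child for child, parents in up_dict.items()
                    if frontier & set(parents)}
        frontier = children - descendants
        descendants |= frontier
    return descendants

def invalidate_affected(cache, up_dict, changed_id):
    """Drop cached pairs involving changed_id or any of its descendants."""
    affected = get_descendants(up_dict, changed_id) | {changed_id}
    stale = [pair for pair in cache if affected.intersection(pair)]
    for pair in stale:
        del cache[pair]
    return len(stale)

demo_tree = {1000: {1001: 1}, 1003: {1000: 1}, 1001: {}, 1002: {}, 1004: {}}
demo_cache = {(1000, 1001): 0.5, (1001, 1003): 0.25, (1002, 1004): 0.0}

# Suppose the parents of 1000 change: pairs involving 1000 or its child 1003 are stale
demo_removed = invalidate_affected(demo_cache, demo_tree, 1000)
print(demo_removed, demo_cache)
```

Scanning the whole dictionary for children is simple but linear in pedigree size per step; a precomputed parent-to-children mapping would make the descendant search faster.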

### Priority-Based Processing

Bonsai processes relationships in order of confidence, focusing computational resources on the most reliable inferences first. In the sketch below, total IBD sharing serves as a simple proxy for confidence:

In [None]:
def process_relationships_by_priority(pair_stats):
    """Process relationships in order of priority (higher IBD sharing first)."""
    # Create priority queue
    priority_queue = []
    
    # Add all pairs to the queue with priority based on IBD sharing
    for pair, stats in pair_stats.items():
        # Priority is negative of total IBD sharing (so higher sharing = higher priority)
        priority = -stats['total_cm']
        heapq.heappush(priority_queue, (priority, pair))
    
    # Process queue
    processed_pairs = []
    while priority_queue:
        _, pair = heapq.heappop(priority_queue)
        processed_pairs.append(pair)
    
    return processed_pairs

# Process our preprocessed data by priority
prioritized_pairs = process_relationships_by_priority(preprocessed_data['pair_stats'])

# Display the pairs in priority order
print("Pairs in order of processing priority:")
for i, pair in enumerate(prioritized_pairs):
    stats = preprocessed_data['pair_stats'][pair]
    print(f"{i+1}. Pair {pair}: {stats['total_cm']:.2f} cM total sharing")

This approach helps Bonsai establish the most obvious relationships first, which constrains the search space for more ambiguous relationships.

## 6. Data Structure Implementation Challenges

Implementing Bonsai's data structures involves several challenges that must be addressed for optimal performance.

### Memory Efficiency

For large datasets with thousands of individuals and millions of IBD segments, memory usage becomes a critical concern. Bonsai employs several strategies to minimize memory footprint:

* **Sparse representation:** The up-node dictionary only stores relationships that exist, making it memory-efficient for sparse pedigrees
* **Data filtering:** IBD segments below a minimum threshold are filtered out early in the preprocessing pipeline
* **Community partitioning:** By breaking the problem into communities, each subproblem can be processed with a smaller memory footprint
* **Selective caching:** Caching strategies that prioritize frequently used data while allowing less frequent data to be recalculated
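
Two of these ideas can be illustrated in plain Python. The sketch below is not taken from Bonsai: `SlimSegment` is a hypothetical variant of `IBDSegment` that uses `__slots__` to avoid a per-instance attribute dictionary, and `iter_long_segments` filters lazily with a generator so that no intermediate list is created.

```python
class SlimSegment:
    """Hypothetical slotted variant of IBDSegment (illustration only)."""
    __slots__ = ("ind1", "ind2", "chrom", "start_pos", "end_pos", "is_ibd2", "length_cm")

    def __init__(self, ind1, ind2, chrom, start_pos, end_pos, is_ibd2, length_cm):
        self.ind1 = ind1
        self.ind2 = ind2
        self.chrom = chrom
        self.start_pos = start_pos
        self.end_pos = end_pos
        self.is_ibd2 = is_ibd2
        self.length_cm = length_cm

def iter_long_segments(segments, min_cm=7):
    """Lazily yield segments at or above the length threshold."""
    return (seg for seg in segments if seg.length_cm >= min_cm)

demo_slim = [
    SlimSegment(1000, 1001, 1, 10000000, 50000000, False, 45.5),
    SlimSegment(1002, 1004, 5, 30000000, 33000000, False, 5.0),
]

# Slotted instances carry no per-object __dict__
print(hasattr(demo_slim[0], "__dict__"))

demo_kept = list(iter_long_segments(demo_slim))
print(len(demo_kept), demo_kept[0].length_cm)
```

The trade-off is flexibility: a slotted class cannot take ad hoc attributes, so any extra per-segment metadata has to be declared up front.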

### Time Complexity Considerations

The time complexity of key operations in Bonsai:

| Operation | Time Complexity | Notes |
|-----------|-----------------|-------|
| Accessing IBD segments for a pair | O(1) | Using pair-based index |
| Finding all pairs involving an individual | O(d) where d is degree | Using individual-based index |
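| Calculating relationship coefficient | Roughly O(p1 · p2), where p1 and p2 are the numbers of ancestral paths from each individual | Path counts can grow quickly in deep or inbred pedigrees; with caching, repeated lookups are O(1) |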
| Checking for cycles | O(n) where n is individuals in pedigree | Worst case, but typically much faster |
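| Community detection | Approximately O(m log n), where m is the number of IBD-sharing pairs (graph edges) and n is the number of individuals | Using optimized Louvain algorithm; segments are first aggregated into one weighted edge per pair |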
| Overall reconstruction | O(i * p^2) where i is iterations, p is pairs | With optimizations, scales to thousands of individuals |

## 7. Custom Extensions and Adaptations

The modular design of Bonsai's data structures facilitates custom extensions for specialized applications.

### Incorporating Additional Metadata

In [None]:
# Extended BioInfo structure with additional metadata
extended_bio_info = [
    {
        'genotype_id': 1000,
        'age': 75,
        'sex': 'F',
        'population': 'EUR',
        'birth_year': 1947,
        'is_genotyped': True,
        'phenotypes': {'height': 165, 'weight': 68},
        'haplogroups': {'mt': 'H1', 'y': None}
    },
    # ... additional individuals ...
]

# Display the extended metadata
extended_bio_df = pd.DataFrame(extended_bio_info)
extended_bio_df

This extended metadata can be used to enhance pedigree reconstruction accuracy or to analyze the resulting pedigrees in more detail.

### Custom Relationship Types

In [None]:
# Extended up-node dictionary with relationship types
extended_up_node_dict = {
    1000: {
        1001: {'type': 'biological', 'confidence': 0.98},
        1002: {'type': 'biological', 'confidence': 0.97}
    },
    1003: {
        1001: {'type': 'adoptive', 'confidence': 0.99},
        1004: {'type': 'biological', 'confidence': 0.95}
    },
    # ... additional relationships ...
}

# Function to visualize extended pedigree
def visualize_extended_pedigree(extended_dict):
    """Visualize pedigree with extended relationship information."""
    # Create a directed graph
    G = nx.DiGraph()
    
    # Add nodes
    for indiv_id in extended_dict.keys():
        G.add_node(indiv_id)
    
    # Add edges with attributes
    for child, parents in extended_dict.items():
        for parent, attrs in parents.items():
            G.add_edge(parent, child, **attrs)
    
    # Set up the visualization
    pos = nx.spring_layout(G, seed=42)
    plt.figure(figsize=(10, 8))
    
    # Draw nodes
    nx.draw_networkx_nodes(G, pos, node_color='lightgreen', node_size=500)
    nx.draw_networkx_labels(G, pos)
    
    # Draw edges with different styles based on relationship type
    biological_edges = [(p, c) for c, parents in extended_dict.items() 
                        for p, attrs in parents.items() if attrs.get('type') == 'biological']
    adoptive_edges = [(p, c) for c, parents in extended_dict.items() 
                      for p, attrs in parents.items() if attrs.get('type') == 'adoptive']
    
    nx.draw_networkx_edges(G, pos, edgelist=biological_edges, edge_color='blue')
    nx.draw_networkx_edges(G, pos, edgelist=adoptive_edges, edge_color='red', style='dashed')
    
    # Add edge labels with confidence scores
    edge_labels = {(p, c): f"{attrs.get('confidence', 1.0):.2f}" 
                  for c, parents in extended_dict.items() 
                  for p, attrs in parents.items()}
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=8)
    
    plt.title('Extended Pedigree with Relationship Types')
    plt.axis('off')
    plt.tight_layout()
    plt.show()

# Visualize the extended pedigree
visualize_extended_pedigree(extended_up_node_dict)

This extension allows the representation of more complex family structures, such as adoptive relationships, step-relationships, or relationships with uncertain confidence.
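
Code written for the plain up-node dictionary expects each parent to map to the placeholder `1`. One way to keep such code usable is to project the extended structure back to the plain form, keeping only biological edges. The sketch below is an illustrative assumption about how this could be done; `biological_up_node_dict` is a hypothetical helper.

```python
def biological_up_node_dict(extended_dict):
    """Reduce an extended dictionary to the plain {child: {parent: 1}} form,
    keeping only edges whose type is 'biological'."""
    plain = {}
    for child, parents in extended_dict.items():
        plain[child] = {parent: 1 for parent, attrs in parents.items()
                        if attrs.get('type') == 'biological'}
        # Keep every referenced parent as a node, even if its edge was dropped
        for parent in parents:
            plain.setdefault(parent, {})
    return plain

demo_extended = {
    1000: {
        1001: {'type': 'biological', 'confidence': 0.98},
        1002: {'type': 'biological', 'confidence': 0.97},
    },
    1003: {
        1001: {'type': 'adoptive', 'confidence': 0.99},
        1004: {'type': 'biological', 'confidence': 0.95},
    },
}

demo_plain = biological_up_node_dict(demo_extended)
print(demo_plain)
```

Genetic computations such as relationship coefficients should run on the biological projection, while the extended structure remains available for display and record keeping.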

## 8. Exercises

Complete the following exercises to deepen your understanding of the data structures and algorithmic design in Bonsai.

### Exercise 1: Implement an efficient IBD segment indexing structure and benchmark its performance on a large dataset.

In [None]:
# Your code for Exercise 1


### Exercise 2: Extend the up-node dictionary implementation to handle additional relationship metadata, such as relationship confidence or type.

In [None]:
# Your code for Exercise 2


### Exercise 3: Implement and test the cycle detection algorithm for pedigree validation.

In [None]:
# Your code for Exercise 3


### Exercise 4: Create a visualization function that renders a pedigree from an up-node dictionary, highlighting different generations and relationship types.

In [None]:
# Your code for Exercise 4


### Exercise 5: Implement a memory-efficient version of the relationship coefficient calculator using sparse matrix representations.

In [None]:
# Your code for Exercise 5


> **Tip:** When designing data structures for pedigree reconstruction, always consider the trade-offs between memory usage, computational complexity, and biological accuracy. For large datasets, efficient data structures can make the difference between a reconstruction that completes in minutes versus one that takes days.

## Conclusion

In this lab, we explored the data structures and algorithmic design that power the Bonsai pedigree reconstruction algorithm. We implemented key components such as the IBD segment representation, up-node dictionary, preprocessing pipelines, and graph-theoretical analyses. We also examined how these structures enable efficient pedigree manipulation, search, and optimization.

Key takeaways:
- Bonsai's effectiveness stems from carefully designed data structures that balance computational efficiency with biological accuracy
- The up-node dictionary provides a compact and efficient representation of pedigree structures
- Preprocessing and indexing IBD segment data is crucial for efficient access during pedigree reconstruction
- Graph theory provides valuable tools for representing and analyzing pedigree structures
- Caching and prioritization strategies significantly improve computational performance
- The modular design facilitates custom extensions for specialized applications

In the next lab, we will explore model calibration techniques to ensure that Bonsai's likelihood calculations accurately reflect real-world genetic inheritance patterns.