# Lab 26: Performance Tuning for Large-Scale Applications

## Overview

This notebook covers advanced performance tuning techniques for deploying Bonsai v3 in large-scale applications. We'll examine the computational challenges of processing extensive datasets and complex pedigrees, then work through strategies that improve performance while maintaining accuracy.

**Learning Objectives:**
- Understand the performance scaling challenges in genetic genealogy computation
- Learn systematic profiling and benchmarking techniques for Python code
- Implement algorithmic optimizations for core Bonsai functions
- Explore memory optimization techniques for large datasets
- Apply intelligent precision-performance tradeoffs in relationship inference

**Prerequisites:**
- Completion of Lab 12: Relationship Assessment
- Completion of Lab 14: Optimizing Pedigrees
- Familiarity with Python performance concepts

**Estimated completion time:** 60-90 minutes

In [None]:
# Standard imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown
import inspect
import importlib
import time
import memory_profiler
import cProfile
import pstats
from io import StringIO

sys.path.append(os.path.dirname(os.getcwd()))

# Cross-compatibility setup
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite, save_results, save_plot

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")
sns.set_palette("colorblind")  # Improve accessibility with colorblind-friendly palette

# Configure plot defaults for better readability
plt.rcParams.update({
    'figure.figsize': (10, 6),
    'font.size': 12,
    'axes.labelsize': 12,
    'axes.titlesize': 14,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10
})

In [None]:
# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("\u26a0\ufe0f Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
def display_module_classes(module_name):
    """Display classes and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all classes
        classes = inspect.getmembers(module, inspect.isclass)
        
        # Filter classes defined in this module (not imported)
        classes = [(name, cls) for name, cls in classes if cls.__module__ == module_name]
        
        if not classes:
            print(f"No classes found in module {module_name}")
            return
            
        # Print info for each class
        for name, cls in classes:
            display(Markdown(f"### Class: {name}"))
            
            # Get docstring
            doc = inspect.getdoc(cls)
            if doc:
                display(Markdown(f"**Documentation:**\n{doc}"))
            else:
                display(Markdown("*No documentation available*"))
            
            # Get methods
            methods = inspect.getmembers(cls, inspect.isfunction)
            public_methods = [(method_name, method) for method_name, method in methods 
                             if not method_name.startswith('_')]
            
            if public_methods:
                display(Markdown("**Public Methods:**"))
                for method_name, method in public_methods:
                    sig = inspect.signature(method)
                    display(Markdown(f"- `{method_name}{sig}`"))
            else:
                display(Markdown("*No public methods*"))
            
            display(Markdown("---"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def display_module_functions(module_name):
    """Display functions and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all functions
        functions = inspect.getmembers(module, inspect.isfunction)
        
        # Filter functions defined in this module (not imported)
        functions = [(name, func) for name, func in functions if func.__module__ == module_name]
        
        if not functions:
            print(f"No functions found in module {module_name}")
            return
            
        # Filter public functions
        public_functions = [(name, func) for name, func in functions if not name.startswith('_')]
        
        if not public_functions:
            print(f"No public functions found in module {module_name}")
            return
            
        # Print info for each function
        for name, func in public_functions:                
            display(Markdown(f"### Function: {name}"))
            
            # Get signature
            sig = inspect.signature(func)
            display(Markdown(f"**Signature:** `{name}{sig}`"))
            
            # Get docstring
            doc = inspect.getdoc(func)
            if doc:
                display(Markdown(f"**Documentation:**\n{doc}"))
            else:
                display(Markdown("*No documentation available*"))
                
            display(Markdown("---"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def view_function_source(module_name, function_name):
    """Display the source code of a function"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Get the function
        func = getattr(module, function_name)
        
        # Get the source code
        source = inspect.getsource(func)
        
        # Print the source code with syntax highlighting
        display(Markdown(f"### Source code for `{function_name}`\n```python\n{source}\n```"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except AttributeError:
        print(f"Function {function_name} not found in module {module_name}")
    except Exception as e:
        print(f"Error processing function {function_name}: {e}")

def view_class_source(module_name, class_name):
    """Display the source code of a class"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Get the class
        cls = getattr(module, class_name)
        
        # Get the source code
        source = inspect.getsource(cls)
        
        # Print the source code with syntax highlighting
        display(Markdown(f"### Source code for class `{class_name}`\n```python\n{source}\n```"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except AttributeError:
        print(f"Class {class_name} not found in module {module_name}")
    except Exception as e:
        print(f"Error processing class {class_name}: {e}")

def explore_module(module_name):
    """Display a comprehensive overview of a module with classes and functions"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Module docstring
        doc = inspect.getdoc(module)
        display(Markdown(f"# Module: {module_name}"))
        
        if doc:
            display(Markdown(f"**Module Documentation:**\n{doc}"))
        else:
            display(Markdown("*No module documentation available*"))
            
        display(Markdown("---"))
        
        # Display classes
        display(Markdown("## Classes"))
        display_module_classes(module_name)
        
        # Display functions
        display(Markdown("## Functions"))
        display_module_functions(module_name)
        
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error exploring module {module_name}: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [None]:
try:
    from bonsaitree import v3
    print("\u2705 Successfully imported Bonsai v3 module")
    
    # Print Bonsai version information if available
    if hasattr(v3, "__version__"):
        print(f"Bonsai v3 version: {v3.__version__}")
    
    # List key submodules
    print("\nAvailable Bonsai submodules:")
    for module_name in dir(v3):
        if not module_name.startswith("_"):
            print(f"- {module_name}")
except ImportError as e:
    print(f"\u274c Failed to import Bonsai v3 module: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")
    print("Make sure you've properly set up your environment with the Bonsai repository.")

## Introduction

As genetic genealogy datasets grow in size and complexity, computational performance becomes a critical factor in the practical application of tools like Bonsai v3. Large datasets with thousands of individuals and millions of IBD segments can challenge even the most efficiently designed algorithms, requiring careful optimization and performance tuning.

In this lab, we'll explore systematic approaches to identifying and addressing performance bottlenecks in Bonsai v3. We'll examine how to profile code to locate inefficiencies, implement algorithmic optimizations to improve computational efficiency, and apply memory optimization techniques to handle large-scale datasets.

**Key concepts we'll cover:**
- Understanding the computational complexity challenges in genetic genealogy
- Applying profiling tools to identify performance bottlenecks
- Implementing algorithmic optimizations for core Bonsai functions
- Optimizing memory usage for large datasets
- Making intelligent precision-performance tradeoffs

## Part 1: Performance Scaling Challenges

### Theory and Background

Genetic genealogy applications face several distinct performance scaling challenges:

1. **Quadratic Growth of Pairwise Comparisons**: The number of potential relationships grows quadratically with the number of individuals. With n individuals, there are n(n-1)/2 possible pairs to analyze. For example:
   - 100 individuals: 4,950 pairs
   - 1,000 individuals: 499,500 pairs
   - 10,000 individuals: 49,995,000 pairs

2. **IBD Segment Analysis Complexity**: Each pair of individuals may share multiple IBD segments, and analyzing these segments requires comparing genetic data across multiple positions.

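3. **Pedigree Structure Optimization**: Finding the pedigree structure that best explains the observed genetic relationships is a combinatorial search problem. The space of candidate structures grows factorially with the number of individuals, so exhaustive search is impractical beyond small pedigrees.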

4. **Memory Requirements**: Storing genetic data, IBD segments, and relationship information for large datasets can quickly exceed available memory.
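The pair counts in item 1 follow directly from the binomial coefficient C(n, 2) = n(n-1)/2. A minimal sketch using only the standard library is shown here; the helper name `count_pairs` is illustrative and not part of Bonsai.

```python
import math


def count_pairs(n: int) -> int:
    """Number of unordered pairs among n individuals: n(n-1)/2."""
    return math.comb(n, 2)


# Reproduce the pair counts quoted in the list above.
for n in (100, 1_000, 10_000):
    print(f"{n:>6,} individuals -> {count_pairs(n):>12,} pairs")
```

Growing the dataset tenfold multiplies the number of pairs by roughly one hundred, which is why pairwise steps tend to dominate runtime first.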

The computational complexity of key operations in Bonsai v3 can be summarized as follows:

| Operation | Time Complexity | Space Complexity | Scaling Factor |
|-----------|-----------------|------------------|----------------|
| IBD Detection | O(n²) | O(n²) | Number of individuals (n) |
| Relationship Inference | O(n²) | O(n²) | Number of individuals (n) |
| Pedigree Construction | O(n³) | O(n²) | Number of individuals (n) |
| Pedigree Optimization | O(n³) | O(n²) | Number of individuals (n) |
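To make the O(n²) space column concrete, the following sketch estimates the memory needed to hold one 8-byte value per pair of individuals. It assumes a dense layout of fixed-size values (for example `float64`); the function names are illustrative and say nothing about how Bonsai stores its data.

```python
def dense_matrix_bytes(n: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a full n x n matrix of fixed-size values."""
    return n * n * bytes_per_value


def upper_triangle_bytes(n: int, bytes_per_value: int = 8) -> int:
    """Bytes needed if only the n(n-1)/2 unordered pairs are stored."""
    return n * (n - 1) // 2 * bytes_per_value


for n in (1_000, 10_000, 100_000):
    full_mb = dense_matrix_bytes(n) / 1e6
    tri_mb = upper_triangle_bytes(n) / 1e6
    print(f"n={n:>7,}: full {full_mb:>10,.1f} MB, upper triangle {tri_mb:>10,.1f} MB")
```

Storing only the unordered pairs roughly halves the footprint, but the growth is still quadratic, so sparse or streaming representations become necessary at larger n.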

### Implementation in Bonsai v3

Let's examine how these performance challenges manifest in the Bonsai v3 codebase. We'll focus on understanding the computational bottlenecks in key modules and functions.

In [None]:
# Load the Bonsai modules we'll be examining
try:
    from bonsaitree.v3 import pwlogl
    from bonsaitree.v3 import pedigree
    from bonsaitree.v3 import relationships
    
    print("\u2705 Successfully imported Bonsai modules for performance analysis")
except ImportError as e:
    print(f"\u274c Failed to import Bonsai modules: {e}")
    print("Will proceed with theoretical discussion.")

### Simulating Computation Scaling with Dataset Size

Let's create a simple simulation to illustrate how computation time grows with dataset size:

In [None]:
# Simulate the computational complexity of different operations
def simulate_operation_scaling():
    """Simulate how different operations scale with dataset size"""
    # Dataset sizes to test
    dataset_sizes = [10, 50, 100, 200, 500, 1000]
    
    # Time units (arbitrary)
    linear_times = []
    quadratic_times = []
    cubic_times = []
    
    # Base unit of computation (arbitrary constant)
    base_unit = 0.001
    
    # Calculate time for each operation and dataset size
    for n in dataset_sizes:
        # Linear time complexity: O(n)
        linear_times.append(base_unit * n)
        
        # Quadratic time complexity: O(n²)
        quadratic_times.append(base_unit * n * n)
        
        # Cubic time complexity: O(n³)
        cubic_times.append(base_unit * n * n * n)
    
    # Create a DataFrame for easy display
    scaling_df = pd.DataFrame({
        'Dataset Size': dataset_sizes,
        'Linear (O(n))': linear_times,
        'Quadratic (O(n\u00b2))': quadratic_times,
        'Cubic (O(n\u00b3))': cubic_times
    })
    
    # Display the data
    display(scaling_df)
    
    # Visualize the scaling
    plt.figure(figsize=(12, 6))
    
    # Plot each complexity on the same graph
    plt.plot(dataset_sizes, linear_times, 'o-', label='Linear (O(n))')
    plt.plot(dataset_sizes, quadratic_times, 's-', label='Quadratic (O(n\u00b2))')
    plt.plot(dataset_sizes, cubic_times, '^-', label='Cubic (O(n\u00b3))')
    
    plt.xlabel('Dataset Size (Number of Individuals)')
    plt.ylabel('Computation Time (arbitrary units)')
    plt.title('Scaling of Computation Time with Dataset Size')
    plt.legend()
    plt.grid(True)
    
    # Use log scale for better visualization
    plt.yscale('log')
    
    plt.tight_layout()
    plt.show()
    
    # Create a second plot focusing on the smaller dataset sizes
    plt.figure(figsize=(12, 6))
    
    # Use only the first few dataset sizes for better visibility
    small_sizes = dataset_sizes[:4]  # Up to 200 individuals
    
    plt.plot(small_sizes, linear_times[:4], 'o-', label='Linear (O(n))')
    plt.plot(small_sizes, quadratic_times[:4], 's-', label='Quadratic (O(n\u00b2))')
    plt.plot(small_sizes, cubic_times[:4], '^-', label='Cubic (O(n\u00b3))')
    
    plt.xlabel('Dataset Size (Number of Individuals)')
    plt.ylabel('Computation Time (arbitrary units)')
    plt.title('Scaling of Computation Time (Small Dataset Sizes)')
    plt.legend()
    plt.grid(True)
    
    plt.tight_layout()
    plt.show()

# Run the simulation
simulate_operation_scaling()
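On a log-log plot, a power law t = c·nᵏ appears as a straight line with slope k, so the scaling exponent of an operation can be estimated from timing data with a least-squares fit. The sketch below does this with the standard library only; `estimate_scaling_exponent` is a hypothetical helper, and the input reuses the synthetic quadratic times from the simulation rather than real Bonsai measurements.

```python
import math


def estimate_scaling_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    return covariance / variance


sizes = [10, 50, 100, 200, 500, 1000]
quadratic_times = [0.001 * n ** 2 for n in sizes]  # synthetic O(n^2) data

exponent = estimate_scaling_exponent(sizes, quadratic_times)
print(f"Estimated exponent: {exponent:.3f}")
```

With measured timings the fitted exponent will be noisier, and constant overheads at small sizes tend to pull it below the asymptotic value.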

### Exercise 1: Identifying Performance Bottlenecks

Let's analyze a typical Bonsai v3 workflow to identify potential performance bottlenecks based on computational complexity.

**Task:** Examine the code snippets below and identify the potential performance bottlenecks, explaining why they would be problematic for large datasets.

**Hint:** Look for nested loops, large data structures, and operations that would scale poorly with dataset size.

In [None]:
# Example 1: Pairwise relationship likelihood calculation
def pairwise_relationship_likelihood_bottleneck(all_individuals, segments_dict):
    """Example of a pairwise calculation that would be a bottleneck"""
    # This is a simplified example for illustration purposes
    n_individuals = len(all_individuals)
    relationship_matrix = np.zeros((n_individuals, n_individuals))
    
    for i in range(n_individuals):
        for j in range(i+1, n_individuals):
            id1 = all_individuals[i]
            id2 = all_individuals[j]
            
            # Get shared segments
            pair_key = frozenset([id1, id2])
            if pair_key not in segments_dict:
                continue
                
            segments = segments_dict[pair_key]
            
            # Calculate likelihoods for various relationships
            # This would involve multiple calculations per segment
            relationship_matrix[i, j] = calculate_max_likelihood(segments)
            relationship_matrix[j, i] = relationship_matrix[i, j]  # Symmetric
    
    return relationship_matrix

def calculate_max_likelihood(segments):
    """Placeholder for likelihood calculation"""
    # In a real implementation, this would calculate likelihoods
    # for multiple relationship types
    return len(segments) * 0.01

# Example 2: Optimizing pedigree structure
def optimize_pedigree_bottleneck(pedigree, all_individuals, relationship_scores):
    """Example of a pedigree optimization that would be a bottleneck"""
    # Simplified example for illustration
    best_score = compute_pedigree_score(pedigree, relationship_scores)
    
    # Try all possible individual swaps to improve the pedigree
    improved = True
    while improved:
        improved = False
        
        for i in range(len(all_individuals)):
            for j in range(i+1, len(all_individuals)):
                # Try swapping positions of individuals i and j
                new_pedigree = swap_individuals(pedigree, i, j)
                new_score = compute_pedigree_score(new_pedigree, relationship_scores)
                
                if new_score > best_score:
                    pedigree = new_pedigree
                    best_score = new_score
                    improved = True
    
    return pedigree

def compute_pedigree_score(pedigree, relationship_scores):
    """Placeholder for pedigree scoring"""
    # In a real implementation, this would evaluate the pedigree quality
    return sum(relationship_scores.values())

def swap_individuals(pedigree, i, j):
    """Placeholder for individual swapping"""
    # In a real implementation, this would create a new pedigree
    # with individuals i and j swapped
    return pedigree.copy()

# Example 3: Memory-intensive segment handling
def process_all_segments_bottleneck(segment_data, genetic_map):
    """Example of memory-intensive segment processing"""
    # Simplified example for illustration
    all_segments = []
    
    # Load all segment data into memory
    for segment in segment_data:
        # Enhance segment with additional data
        segment['genetic_distance'] = lookup_genetic_distance(segment, genetic_map)
        segment['shared_snps'] = calculate_shared_snps(segment)
        
        all_segments.append(segment)
    
    # Process all segments
    results = []
    for segment in all_segments:
        # Multiple calculations per segment
        results.append(analyze_segment(segment))
    
    return results

def lookup_genetic_distance(segment, genetic_map):
    """Placeholder for genetic distance lookup"""
    # Parenthesize the subtraction so the whole span is divided, not just start_pos
    return (segment['end_pos'] - segment['start_pos']) / 1_000_000

def calculate_shared_snps(segment):
    """Placeholder for SNP counting"""
    return (segment['end_pos'] - segment['start_pos']) // 1000

def analyze_segment(segment):
    """Placeholder for segment analysis"""
    return {'length': segment['genetic_distance'], 'quality': segment['shared_snps'] / 100}

### Bottleneck Analysis

For each example, identify the performance bottlenecks and explain why they would be problematic for large datasets. Consider both time and memory complexity.

#### Your Analysis:

**Example 1: Pairwise relationship likelihood calculation**
- Bottleneck 1:


**Example 2: Optimizing pedigree structure**
- Bottleneck 1:


**Example 3: Memory-intensive segment handling**
- Bottleneck 1:

## Part 2: Profiling and Benchmarking Methodology

### Theory and Background

Profiling is the process of systematically measuring the performance characteristics of code to identify bottlenecks and inefficiencies. Effective profiling helps focus optimization efforts on the parts of the code that will yield the greatest improvements.

Key profiling metrics include:

1. **Time Profiling**:
   - Function call counts and time spent in each function
   - Line-by-line execution time 
   - Call graph analysis to understand the call stack

2. **Memory Profiling**:
   - Memory allocation patterns
   - Peak memory usage
   - Object lifetime and reference patterns

3. **I/O Profiling**:
   - Disk read/write operations
   - Network traffic
   - Database queries

Python provides several powerful profiling tools:

1. **cProfile**: A built-in deterministic profiler that measures function call times
2. **memory_profiler**: A package that measures line-by-line memory usage
3. **line_profiler**: A package that provides line-by-line time profiling
4. **Scalene**: A high-performance CPU and memory profiler
5. **PyInstrument**: A low-overhead call stack profiler

Let's explore how to use these tools to profile Bonsai v3 code.

### Implementation in Bonsai v3

We'll now look at how to apply profiling tools to understand the performance characteristics of Bonsai v3 functions. Let's start with a simple example using cProfile.

In [None]:
# Define a wrapper for cProfile to easily profile functions
def profile_function(func, *args, **kwargs):
    """Profile a function using cProfile and display sorted results.
    
    Args:
        func: The function to profile
        *args, **kwargs: Arguments to pass to the function
        
    Returns:
        The result of the function call
    """
    profiler = cProfile.Profile()
    profiler.enable()
    
    result = func(*args, **kwargs)
    
    profiler.disable()
    s = StringIO()
    ps = pstats.Stats(profiler, stream=s).sort_stats('cumtime')
    ps.print_stats(20)  # Print top 20 functions by cumulative time
    
    print(s.getvalue())
    return result

# Define a sample function to profile (simulating a Bonsai operation)
def sample_bonsai_operation(n_individuals=100, n_segments=500):
    """Simulate a complex Bonsai operation for profiling purposes.
    
    Args:
        n_individuals: Number of individuals to simulate
        n_segments: Number of segments per individual pair
        
    Returns:
        A dictionary of results
    """
    # Create synthetic data
    individuals = [f"ind_{i}" for i in range(n_individuals)]
    
    # Create segments (simplified)
    segments_dict = {}
    for i in range(n_individuals):
        for j in range(i+1, n_individuals):
            id1 = individuals[i]
            id2 = individuals[j]
            
            # Generate random segments for this pair
            pair_segments = []
            for _ in range(np.random.poisson(n_segments / 10)):
                chr_num = np.random.randint(1, 23)
                start = np.random.randint(1, 200_000_000)
                length = np.random.exponential(5_000_000)
                end = start + length
                
                segment = {
                    'chr': chr_num,
                    'start_pos': start,
                    'end_pos': end,
                    'cm_length': length / 1_000_000
                }
                pair_segments.append(segment)
            
            if pair_segments:
                segments_dict[frozenset([id1, id2])] = pair_segments
    
    # Calculate pairwise statistics (intentionally inefficient for demonstration)
    pair_stats = {}
    for pair, segments in segments_dict.items():
        # Calculate total segments and length
        total_length = sum(seg['cm_length'] for seg in segments)
        
        # Perform an expensive operation (simulating likelihood calculation)
        likelihoods = {}
        for rel_type in ['parent-child', 'full-sibling', 'half-sibling', 'first-cousin']:
            # Simulate relationship likelihood calculation
            likelihoods[rel_type] = expensive_calculation(segments, rel_type)
        
        pair_stats[pair] = {
            'total_segments': len(segments),
            'total_length': total_length,
            'likelihoods': likelihoods
        }
    
    return {
        'n_individuals': n_individuals,
        'n_pairs': len(segments_dict),
        'pair_stats': pair_stats
    }

def expensive_calculation(segments, rel_type):
    """Simulate an expensive calculation for profiling purposes.
    
    Args:
        segments: List of segment dictionaries
        rel_type: Relationship type
        
    Returns:
        A likelihood value
    """
    # Make this function artificially slow for demonstration
    result = 0
    for segment in segments:
        # Simulate complex calculation
        for _ in range(100):
            result += np.sin(segment['cm_length']) * np.cos(segment['start_pos'] / 1e6)
            if rel_type == 'parent-child':
                result *= 1.01
            elif rel_type == 'full-sibling':
                result *= 0.99
                
    return result

In [None]:
# Profile our sample operation with a small dataset
print("Profiling sample Bonsai operation with 20 individuals:")
result = profile_function(sample_bonsai_operation, n_individuals=20, n_segments=100)

print(f"\nProcessed {result['n_individuals']} individuals and {result['n_pairs']} pairs.")
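Sorting by cumulative time highlights high-level drivers, because their totals include everything they call. Sorting the same statistics by `tottime` (time spent in a function's own body) often points more directly at the hot spot. The sketch below illustrates the difference with two toy functions, `inner_work` and `outer_driver`, which are assumptions made for this example and not Bonsai code.

```python
import cProfile
import io
import pstats


def inner_work(n):
    """Does the actual arithmetic, so it should dominate self time."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def outer_driver(repeats, n):
    """Thin wrapper: high cumulative time but very little self time."""
    return sum(inner_work(n) for _ in range(repeats))


profiler = cProfile.Profile()
profiler.enable()
outer_driver(50, 20_000)
profiler.disable()

# Sort by time spent inside each function itself and show the top entries
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("tottime").print_stats(5)
report = stream.getvalue()
print(report)
```

In the resulting report, `inner_work` should appear near the top, while `outer_driver` shows a small `tottime` despite a large `cumtime`.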

### Memory Profiling Example

Next, we use the `memory_profiler` package to report line-by-line memory usage for a decorated function:

In [None]:
# Memory profiling example
@memory_profiler.profile
def memory_intensive_operation(n_size=1000):
    """A function that demonstrates memory usage patterns.
    
    Args:
        n_size: Size parameter controlling memory usage
        
    Returns:
        Sum of all values created
    """
    # Create a large list
    print("Creating large list...")
    large_list = [i * 2 for i in range(n_size * 1000)]
    
    # Create a large dictionary
    print("Creating large dictionary...")
    large_dict = {i: np.random.random(100) for i in range(n_size)}
    
    # Create a large numpy array
    print("Creating large numpy array...")
    large_array = np.random.random((n_size, n_size))
    
    # Calculate something using all the structures
    print("Performing calculations...")
    result = sum(large_list) + sum(sum(values) for values in large_dict.values()) + np.sum(large_array)
    
    # Clean up to reduce memory (demonstrate memory release)
    print("Cleaning up...")
    del large_list
    del large_dict
    del large_array
    
    return result

# Run the memory-intensive operation
print("Running memory-intensive operation...")
result = memory_intensive_operation(500)
print(f"Operation result: {result}")
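One caveat with the decorator approach: `memory_profiler.profile` reads the decorated function's source from a file on disk, so for a function defined directly in a notebook cell it may report that the file could not be found instead of printing a line-by-line table. A minimal alternative sketch, assuming only the standard library, uses `tracemalloc` to record the peak of Python-level allocations during a call. The helper `measure_peak_memory` and the toy workload `build_doubled_list` are illustrative names and not part of Bonsai.

```python
import tracemalloc


def measure_peak_memory(func, *args, **kwargs):
    """Run func and return (result, peak_bytes) traced during the call."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_doubled_list(n):
    """Toy workload: materialize a list of n integers, then sum it."""
    values = [i * 2 for i in range(n)]
    return sum(values)


result, peak = measure_peak_memory(build_doubled_list, 200_000)
print(f"result={result:,}, peak traced memory={peak / 1e6:.1f} MB")
```

`tracemalloc` reports allocations made through Python's memory allocator rather than the whole process footprint, so its numbers will generally differ from those of `memory_profiler`.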

### Exercise 2: Building a Benchmark Framework

Benchmarking is the process of measuring the performance of code under controlled conditions to establish baseline metrics and track improvements. Let's create a simple benchmarking framework for Bonsai operations.

**Task:** Complete the benchmark framework below to measure the performance of different operations across varying dataset sizes.

**Hint:** Focus on tracking both time and memory usage for each operation and dataset size.
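For the timing half of the task, the standard-library `timeit` module is a common alternative to a hand-written timing loop: it uses a high-resolution clock and temporarily disables garbage collection while timing. The sketch below wraps `timeit.repeat`; `time_operation` and `toy_operation` are hypothetical names introduced for illustration only.

```python
import timeit


def time_operation(func, *args, repeat=5, **kwargs):
    """Return a list of per-run wall-clock times (seconds) for func."""
    return timeit.repeat(lambda: func(*args, **kwargs), repeat=repeat, number=1)


def toy_operation(n_individuals):
    """Hypothetical stand-in doing quadratic work in n_individuals."""
    return sum(i * j for i in range(n_individuals) for j in range(i))


times = time_operation(toy_operation, 200, repeat=5)
print(f"best of {len(times)} runs: {min(times):.6f} s")
```

Reporting the minimum alongside the mean is a common convention, since slower runs usually reflect interference from other processes rather than the code under test.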

In [None]:
# Exercise 2: Complete the benchmark framework
class BonsaiBenchmark:
    """Framework for benchmarking Bonsai operations."""
    
    def __init__(self):
        """Initialize the benchmark framework."""
        self.results = {}
    
    def benchmark_operation(self, operation_func, dataset_sizes, repeat=3, **kwargs):
        """Benchmark an operation across multiple dataset sizes.
        
        Args:
            operation_func: Function to benchmark
            dataset_sizes: List of dataset sizes to test
            repeat: Number of times to repeat each benchmark
            **kwargs: Additional arguments to pass to the operation function
            
        Returns:
            DataFrame with benchmark results
        """
        # TODO: Implement benchmarking logic
        # 1. For each dataset size, run the operation 'repeat' times
        # 2. Measure execution time for each run
        # 3. Track memory usage using memory_profiler (optional, can be challenging)
        # 4. Record and return results
        
        operation_name = operation_func.__name__
        results = []
        
        print(f"Benchmarking {operation_name}...")
        
        for size in dataset_sizes:
            print(f"  Dataset size: {size}")
            
            # Run multiple times for reliable timing
            run_times = []
            for i in range(repeat):
                # Measure execution time with a monotonic, high-resolution clock
                start_time = time.perf_counter()
                operation_func(n_individuals=size, **kwargs)
                end_time = time.perf_counter()
                
                execution_time = end_time - start_time
                run_times.append(execution_time)
                print(f"    Run {i+1}/{repeat}: {execution_time:.4f} seconds")
            
            # Calculate statistics
            avg_time = np.mean(run_times)
            min_time = np.min(run_times)
            max_time = np.max(run_times)
            std_time = np.std(run_times)
            
            # Store results
            results.append({
                'Operation': operation_name,
                'Dataset Size': size,
                'Average Time (s)': avg_time,
                'Min Time (s)': min_time,
                'Max Time (s)': max_time,
                'Std Dev (s)': std_time
            })
        
        # Create a DataFrame with results
        results_df = pd.DataFrame(results)
        self.results[operation_name] = results_df
        
        return results_df
    
    def plot_results(self, operation_name=None, log_scale=True):
        """Plot benchmark results.
        
        Args:
            operation_name: Name of operation to plot (if None, plot all)
            log_scale: Whether to use logarithmic scale for time axis
        """
        # TODO: Implement plotting logic
        # 1. Create a plot showing execution time vs dataset size
        # 2. Include error bars for timing variability
        # 3. If multiple operations are being compared, show them on the same plot
        
        plt.figure(figsize=(10, 6))
        
        if operation_name is not None and operation_name in self.results:
            # Plot specific operation
            df = self.results[operation_name]
            plt.errorbar(
                df['Dataset Size'], 
                df['Average Time (s)'], 
                yerr=df['Std Dev (s)'],
                marker='o',
                label=operation_name
            )
        else:
            # Plot all operations
            for op_name, df in self.results.items():
                plt.errorbar(
                    df['Dataset Size'], 
                    df['Average Time (s)'], 
                    yerr=df['Std Dev (s)'],
                    marker='o',
                    label=op_name
                )
        
        plt.xlabel('Dataset Size (Number of Individuals)')
        plt.ylabel('Execution Time (seconds)')
        plt.title('Benchmark Results: Execution Time vs Dataset Size')
        plt.grid(True, alpha=0.3)
        plt.legend()
        
        if log_scale:
            plt.yscale('log')
        
        plt.tight_layout()
        plt.show()
        
    def compare_operations(self, operations, dataset_size, repeat=3, **kwargs):
        """Compare multiple operations on the same dataset size.
        
        Args:
            operations: List of operation functions to compare
            dataset_size: Size of dataset to use
            repeat: Number of times to repeat each benchmark
            **kwargs: Additional arguments to pass to the operation functions
            
        Returns:
            DataFrame with comparison results
        """
        # TODO: Implement comparison logic
        # 1. Run each operation on the same dataset size
        # 2. Compare execution times
        # 3. Return and visualize results
        
        comparison_results = []
        
        print(f"Comparing operations on dataset size {dataset_size}...")
        
        for operation_func in operations:
            operation_name = operation_func.__name__
            print(f"  Running {operation_name}...")
            
            # Run multiple times for reliable timing
            run_times = []
            for i in range(repeat):
                # Measure execution time with a monotonic, high-resolution clock
                start_time = time.perf_counter()
                operation_func(n_individuals=dataset_size, **kwargs)
                end_time = time.perf_counter()
                
                execution_time = end_time - start_time
                run_times.append(execution_time)
                print(f"    Run {i+1}/{repeat}: {execution_time:.4f} seconds")
            
            # Calculate statistics
            avg_time = np.mean(run_times)
            min_time = np.min(run_times)
            max_time = np.max(run_times)
            std_time = np.std(run_times)
            
            # Store results
            comparison_results.append({
                'Operation': operation_name,
                'Dataset Size': dataset_size,
                'Average Time (s)': avg_time,
                'Min Time (s)': min_time,
                'Max Time (s)': max_time,
                'Std Dev (s)': std_time
            })
        
        # Create a DataFrame with results
        comparison_df = pd.DataFrame(comparison_results)
        
        # Visualize the comparison
        plt.figure(figsize=(10, 6))
        plt.bar(
            comparison_df['Operation'],
            comparison_df['Average Time (s)'],
            yerr=comparison_df['Std Dev (s)'],
            capsize=5
        )
        plt.xlabel('Operation')
        plt.ylabel('Execution Time (seconds)')
        plt.title(f'Operation Performance Comparison (Dataset Size: {dataset_size})')
        plt.xticks(rotation=45, ha='right')
        plt.grid(True, axis='y', alpha=0.3)
        plt.tight_layout()
        plt.show()
        
        return comparison_df

In [None]:
# Test the benchmark framework with our sample operation
benchmark = BonsaiBenchmark()

# Define a variant of our sample operation with different characteristics
def optimized_bonsai_operation(n_individuals=100, n_segments=500):
    """A more optimized version of the sample operation."""
    # Simplified implementation with better performance characteristics
    # (Still slow enough to demonstrate benchmarking)
    individuals = [f"ind_{i}" for i in range(n_individuals)]
    
    # Pre-compute some values to avoid redundant calculations
    rel_types = ['parent-child', 'full-sibling', 'half-sibling', 'first-cousin']
    rel_factors = {'parent-child': 1.01, 'full-sibling': 0.99, 'half-sibling': 1.0, 'first-cousin': 0.98}
    
    # Create segments more efficiently
    segments_dict = {}
    for i in range(n_individuals):
        for j in range(i+1, n_individuals):
            id1 = individuals[i]
            id2 = individuals[j]
            
            n_pair_segments = np.random.poisson(n_segments / 10)
            if n_pair_segments > 0:
                # Generate all segments at once instead of in a loop
                chr_nums = np.random.randint(1, 23, n_pair_segments)
                starts = np.random.randint(1, 200_000_000, n_pair_segments)
                lengths = np.random.exponential(5_000_000, n_pair_segments)
                ends = starts + lengths
                cm_lengths = lengths / 1_000_000
                
                pair_segments = [
                    {
                        'chr': chr_nums[k],
                        'start_pos': starts[k],
                        'end_pos': ends[k],
                        'cm_length': cm_lengths[k]
                    }
                    for k in range(n_pair_segments)
                ]
                
                segments_dict[frozenset([id1, id2])] = pair_segments
    
    # Calculate pairwise statistics more efficiently
    pair_stats = {}
    for pair, segments in segments_dict.items():
        # Calculate total length once
        total_length = sum(seg['cm_length'] for seg in segments)
        
        # Pre-compute values shared across relationship calculations
        segment_values = np.array([np.sin(seg['cm_length']) * np.cos(seg['start_pos'] / 1e6) for seg in segments])
        base_likelihood = np.sum(segment_values) * 100
        
        # Calculate likelihoods for all relationship types at once
        likelihoods = {rel_type: base_likelihood * rel_factors[rel_type] for rel_type in rel_types}
        
        pair_stats[pair] = {
            'total_segments': len(segments),
            'total_length': total_length,
            'likelihoods': likelihoods
        }
    
    return {
        'n_individuals': n_individuals,
        'n_pairs': len(segments_dict),
        'pair_stats': pair_stats
    }

# Run the benchmark
dataset_sizes = [10, 20, 50, 100]
results_original = benchmark.benchmark_operation(sample_bonsai_operation, dataset_sizes, repeat=2, n_segments=100)
results_optimized = benchmark.benchmark_operation(optimized_bonsai_operation, dataset_sizes, repeat=2, n_segments=100)

# Plot the results
benchmark.plot_results()

<cell_type>markdown</cell_type>## Part 3: Algorithmic Optimization Strategies

### Theory and Background

Algorithmic optimization is often the most effective approach for improving performance in computational genetics applications. By reducing the computational complexity of key operations, we can achieve significant speedups that scale well with dataset size.

Key algorithmic optimization strategies include:

1. **Early Termination**: Stopping computations as soon as a conclusive result is reached
2. **Pruning**: Eliminating branches of computation that cannot lead to optimal solutions
3. **Memoization**: Caching results of expensive function calls
4. **Precomputation**: Calculating and storing values that will be needed multiple times
5. **Approximation**: Using faster, approximate methods when exact solutions are not required
6. **Divide and Conquer**: Breaking problems into smaller, more manageable subproblems

Let's explore how these strategies can be applied to Bonsai v3.
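
As a minimal sketch of early termination (strategy 1), assume every per-segment term is a log-likelihood that is less than or equal to zero, so a running sum can only decrease. Under that assumption, a candidate model can be abandoned as soon as its partial sum drops below the best complete score seen so far. The function name `best_model_with_early_exit` and the toy terms are hypothetical and are not part of Bonsai v3.

```python
import math

def best_model_with_early_exit(term_lists):
    """Return (best_name, best_score, terms_evaluated).

    term_lists maps a model name to a list of per-segment log-likelihood
    terms. Every term is assumed to be <= 0, so a partial sum is an upper
    bound on the final sum for that model.
    """
    best_name, best_score = None, -math.inf
    evaluated = 0
    for name, terms in term_lists.items():
        total = 0.0
        abandoned = False
        for term in terms:
            total += term
            evaluated += 1
            # The total can only go down from here, so this model cannot win
            if total < best_score:
                abandoned = True
                break
        if not abandoned and total > best_score:
            best_name, best_score = name, total
    return best_name, best_score, evaluated

toy_terms = {
    "model_a": [-1.0, -1.0, -1.0],   # completes with total -3.0
    "model_b": [-5.0, -0.1, -0.1],   # abandoned after the first term
    "model_c": [-0.5, -0.5, -0.5],   # completes with total -1.5
}
name, score, evaluated = best_model_with_early_exit(toy_terms)
print(name, score, evaluated)
```

If the per-segment terms can be positive, a partial sum is no longer an upper bound and this shortcut can discard the true best model.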

<cell_type>markdown</cell_type>### Implementation in Bonsai v3

Let's examine and optimize some of the computationally intensive functions in Bonsai v3. We'll focus on the pairwise likelihood calculations, which are often a major bottleneck in relationship inference.
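
One way to see where the savings come from is to hoist model-independent work out of the per-model loop. The sketch below assumes, as in this lab's toy scoring formula, that a model only contributes a multiplicative `weight`. Real likelihood models are generally not separable like this, and the function names here are hypothetical.

```python
import numpy as np

def likelihoods_nested(base_scores, model_weights):
    """Reference version: re-walks every segment for every model."""
    result = {}
    for name, weight in model_weights.items():
        total = 0.0
        for score in base_scores:
            total += weight * score
        result[name] = total
    return result

def likelihoods_hoisted(base_scores, model_weights):
    """Sum the model-independent scores once, then scale per model."""
    total = float(np.sum(base_scores))
    return {name: weight * total for name, weight in model_weights.items()}

base_scores = np.array([1.0, 2.0, 3.5])   # toy per-segment scores
model_weights = {"parent-child": 1.2, "full-sibling": 1.1}

nested = likelihoods_nested(base_scores, model_weights)
hoisted = likelihoods_hoisted(base_scores, model_weights)
print(nested, hoisted)
```

The nested version performs one multiplication per segment per model, while the hoisted version performs one sum over segments plus one multiplication per model.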

In [ ]:
# Example: Optimizing the relationship likelihood calculation
def original_compute_likelihoods(segments, relationship_models):
    """Original (unoptimized) function to compute relationship likelihoods.
    
    Args:
        segments: List of IBD segments shared between two individuals
        relationship_models: Dictionary of relationship models
        
    Returns:
        Dictionary of relationship likelihoods
    """
    likelihoods = {}
    
    for rel_type, model in relationship_models.items():
        # Compute likelihood for each relationship type
        likelihood = 0
        
        # Process each segment
        for segment in segments:
            # Expensive calculation for each segment
            segment_likelihood = compute_segment_likelihood(segment, model)
            likelihood += segment_likelihood
        
        likelihoods[rel_type] = likelihood
    
    return likelihoods

def optimized_compute_likelihoods(segments, relationship_models, early_termination_threshold=0.01):
    """Optimized function to compute relationship likelihoods with several optimizations.
    
    Args:
        segments: List of IBD segments shared between two individuals
        relationship_models: Dictionary of relationship models
        early_termination_threshold: Threshold for early termination
        
    Returns:
        Dictionary of relationship likelihoods
    """
    likelihoods = {}
    
    # Optimization 1: Early exit when there is nothing to score
    # (checked first so that no precomputation is wasted on empty input)
    if len(segments) == 0:
        # If no segments, set all likelihoods to minimum value
        return {rel_type: float('-inf') for rel_type in relationship_models}
    
    # Optimization 2: Precomputation
    # Precompute segment features that will be used by all relationship models
    segment_features = [precompute_segment_features(segment) for segment in segments]
    
    # Optimization 3: Sort relationships by computational cost
    # Process cheaper models first to allow for earlier filtering
    rel_types = sorted(relationship_models.keys(), 
                      key=lambda r: relationship_computation_cost(r))
    
    # Optimization 4: Track best likelihood for early termination
    best_likelihood = float('-inf')
    best_rel_type = None
    
    for rel_type in rel_types:
        model = relationship_models[rel_type]
        
        # Skip unlikely relationships based on segment count heuristic
        if skip_unlikely_relationship(rel_type, len(segments)):
            likelihoods[rel_type] = float('-inf')
            continue
        
        # Compute likelihood using precomputed features
        likelihood = 0
        for features in segment_features:
            segment_likelihood = compute_segment_likelihood_from_features(features, model)
            likelihood += segment_likelihood
            
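            # Optimization 5: Early termination within segment processing
            # If this relationship is already much worse than the best, stop computing.
            # NOTE: this check is only sound when every per-segment term is <= 0
            # (e.g. log-likelihoods), so that the running sum can only decrease.
            # The toy scores in this lab are positive, so the check prunes
            # aggressively and pruned relationships are reported as -inf.
            if best_likelihood - likelihood > early_termination_threshold * len(segments):
                likelihood = float('-inf')
                break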
        
        likelihoods[rel_type] = likelihood
        
        # Update best likelihood for early termination check
        if likelihood > best_likelihood:
            best_likelihood = likelihood
            best_rel_type = rel_type
    
    return likelihoods

# Helper functions
def precompute_segment_features(segment):
    """Precompute features for a segment that will be used by multiple relationship models."""
    # Simulate an expensive computation
    length_feature = segment.get('cm_length', 0)
    position_feature = (segment.get('end_pos', 0) - segment.get('start_pos', 0)) / 1e6
    density_feature = segment.get('snp_count', 100) / position_feature if position_feature > 0 else 0
    
    return {
        'length': length_feature,
        'position': position_feature,
        'density': density_feature
    }

def compute_segment_likelihood(segment, model):
    """Compute the likelihood of a segment under a relationship model."""
    # Simulate the original expensive computation without precomputation
    length = segment.get('cm_length', 0)
    position = (segment.get('end_pos', 0) - segment.get('start_pos', 0)) / 1e6
    density = segment.get('snp_count', 100) / position if position > 0 else 0
    
    # Complex calculation based on the model
    return model.get('weight', 1.0) * (length * 0.1 + position * 0.01 + density * 0.001)

def compute_segment_likelihood_from_features(features, model):
    """Compute the likelihood of a segment using precomputed features."""
    # Same calculation but using precomputed features
    return model.get('weight', 1.0) * (features['length'] * 0.1 + 
                                     features['position'] * 0.01 + 
                                     features['density'] * 0.001)

def relationship_computation_cost(rel_type):
    """Estimate the computational cost of a relationship type."""
    # In a real implementation, this would depend on the complexity of the model
    cost_map = {
        'unrelated': 1,
        'parent-child': 2,
        'full-sibling': 3,
        'half-sibling': 4,
        'first-cousin': 5,
        'second-cousin': 6
    }
    return cost_map.get(rel_type, 10)  # Default for unknown relationships

def skip_unlikely_relationship(rel_type, segment_count):
    """Determine if a relationship is unlikely based on segment count."""
    # Simple heuristic: different relationships have expected segment count ranges
    if rel_type == 'parent-child' and segment_count < 10:
        return True
    if rel_type == 'full-sibling' and segment_count < 5:
        return True
    if rel_type == 'unrelated' and segment_count > 15:
        return True
    return False

In [ ]:
# Test the original and optimized functions with a benchmark
def benchmark_likelihood_computation(n_segments=50, n_trials=10):
    """Compare the performance of original and optimized likelihood computation."""
    # Create test data
    segments = []
    for i in range(n_segments):
        segment = {
            'chr': np.random.randint(1, 23),
            'start_pos': np.random.randint(1, 200_000_000),
            'end_pos': np.random.randint(200_000_001, 250_000_000),
            'cm_length': np.random.uniform(1, 20),
            'snp_count': np.random.randint(100, 10000)
        }
        segments.append(segment)
    
    # Create relationship models
    relationship_models = {
        'parent-child': {'weight': 1.2, 'params': {'a': 0.5, 'b': 0.3}},
        'full-sibling': {'weight': 1.1, 'params': {'a': 0.4, 'b': 0.4}},
        'half-sibling': {'weight': 1.0, 'params': {'a': 0.3, 'b': 0.5}},
        'first-cousin': {'weight': 0.9, 'params': {'a': 0.2, 'b': 0.6}},
        'second-cousin': {'weight': 0.8, 'params': {'a': 0.1, 'b': 0.7}},
        'unrelated': {'weight': 0.5, 'params': {'a': 0.0, 'b': 1.0}}
    }
    
    # Benchmark the original function
    original_times = []
    for i in range(n_trials):
        # perf_counter() is monotonic and has higher resolution than time.time()
        start_time = time.perf_counter()
        original_likelihoods = original_compute_likelihoods(segments, relationship_models)
        end_time = time.perf_counter()
        original_times.append(end_time - start_time)
    
    # Benchmark the optimized function
    optimized_times = []
    for i in range(n_trials):
        # perf_counter() is monotonic and has higher resolution than time.time()
        start_time = time.perf_counter()
        optimized_likelihoods = optimized_compute_likelihoods(segments, relationship_models)
        end_time = time.perf_counter()
        optimized_times.append(end_time - start_time)
    
    # Display the results
    original_avg = np.mean(original_times)
    optimized_avg = np.mean(optimized_times)
    speedup = original_avg / optimized_avg if optimized_avg > 0 else float('inf')
    
    print(f"Performance comparison with {n_segments} segments:")
    print(f"  Original: {original_avg:.6f} seconds (avg)")
    print(f"  Optimized: {optimized_avg:.6f} seconds (avg)")
    print(f"  Speedup: {speedup:.2f}x")
    
    # Compare the actual likelihoods to ensure correctness
    original_likelihoods = original_compute_likelihoods(segments, relationship_models)
    optimized_likelihoods = optimized_compute_likelihoods(segments, relationship_models)
    
    print("\nLikelihood comparison (to verify correctness):")
    for rel in relationship_models:
        if rel in original_likelihoods and rel in optimized_likelihoods:
            # Infinite values may differ, but that's okay
            if original_likelihoods[rel] == float('-inf') and optimized_likelihoods[rel] == float('-inf'):
                print(f"  {rel}: Both methods return -inf")
            elif original_likelihoods[rel] == float('-inf') or optimized_likelihoods[rel] == float('-inf'):
                print(f"  {rel}: DIFFERENT! Original: {original_likelihoods[rel]}, Optimized: {optimized_likelihoods[rel]}")
            else:
                diff = abs(original_likelihoods[rel] - optimized_likelihoods[rel])
                rel_diff = diff / abs(original_likelihoods[rel]) if original_likelihoods[rel] != 0 else float('inf')
                print(f"  {rel}: Original: {original_likelihoods[rel]:.4f}, Optimized: {optimized_likelihoods[rel]:.4f}, Diff: {rel_diff:.6f}")
        else:
            print(f"  {rel}: Missing from one of the results!")
    
    # Create a dictionary to return benchmark results
    benchmark_results = {
        'n_segments': n_segments,
        'n_trials': n_trials,
        'original_avg': original_avg,
        'optimized_avg': optimized_avg,
        'speedup': speedup,
        'original_times': original_times,
        'optimized_times': optimized_times
    }
    
    return benchmark_results

# Test with different numbers of segments
benchmark_results = {}
for n_segments in [10, 50, 100, 200]:
    print(f"\nBenchmarking with {n_segments} segments...")
    benchmark_results[n_segments] = benchmark_likelihood_computation(n_segments)

# Visualize the scaling with number of segments
plt.figure(figsize=(10, 6))

segment_sizes = list(benchmark_results.keys())
original_times = [benchmark_results[n]['original_avg'] for n in segment_sizes]
optimized_times = [benchmark_results[n]['optimized_avg'] for n in segment_sizes]

plt.plot(segment_sizes, original_times, 'o-', label='Original Implementation')
plt.plot(segment_sizes, optimized_times, 's-', label='Optimized Implementation')

plt.xlabel('Number of Segments')
plt.ylabel('Execution Time (seconds)')
plt.title('Performance Scaling with Number of Segments')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Plot the speedup
plt.figure(figsize=(10, 6))

speedups = [benchmark_results[n]['speedup'] for n in segment_sizes]

plt.plot(segment_sizes, speedups, 'o-')
plt.axhline(y=1, color='r', linestyle='--', alpha=0.3, label='No Speedup')

plt.xlabel('Number of Segments')
plt.ylabel('Speedup Factor (Original Time / Optimized Time)')
plt.title('Performance Speedup with Optimized Implementation')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

<cell_type>markdown</cell_type>### Exercise 3: Implementing a Memoization Optimization

Memoization is a powerful optimization technique that caches the results of expensive function calls to avoid redundant calculations. Let's implement a memoization decorator for Bonsai calculations.

**Task:** Complete the memoization decorator and apply it to an expensive function that would benefit from caching.

**Hint:** Use a dictionary to store function results based on input arguments, and ensure proper handling of mutable arguments.
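
For comparison, the standard library already provides `functools.lru_cache`, which requires every argument to be hashable. The sketch below shows one way to satisfy that requirement by converting a list to a tuple at the call site. The function `total_length_cm` and the `call_count` counter are hypothetical and only serve to make the cache behaviour visible.

```python
from functools import lru_cache

call_count = 0  # counts how often the function body actually runs

@lru_cache(maxsize=None)
def total_length_cm(segment_lengths):
    """Sum segment lengths; the argument must be hashable (here, a tuple)."""
    global call_count
    call_count += 1
    return sum(segment_lengths)

lengths = [10.0, 15.5, 7.25]              # a list is unhashable
first = total_length_cm(tuple(lengths))   # computed and cached
second = total_length_cm(tuple(lengths))  # served from the cache
info = total_length_cm.cache_info()
print(first, second, call_count, info.hits, info.misses)
```

Passing the list directly would raise `TypeError`, which is the situation the hint about mutable arguments refers to.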

In [ ]:
# Exercise 3: Implement a memoization decorator
import functools

def memoize(func):
    """Memoization decorator to cache function results.
    
    Args:
        func: Function to be memoized
        
    Returns:
        Wrapped function with caching
    """
    # TODO: Implement the memoization decorator
    # 1. Create a cache to store function results
    # 2. Create a wrapper function that checks the cache before computing
    # 3. Store results in the cache after computation
    # 4. Handle the case of mutable arguments
    
    # Create cache
    cache = {}
    
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Create a key for the cache
        # For mutable arguments, we need to make them hashable
        key_parts = []
        
        # Process positional arguments
        for arg in args:
            try:
                # Try to use the argument directly as a key
                hash(arg)
                key_parts.append(arg)
            except TypeError:
                # For unhashable types (like lists or dicts), convert to a hashable representation
                if isinstance(arg, list):
                    key_parts.append(tuple(arg))
                elif isinstance(arg, dict):
                    key_parts.append(tuple(sorted(arg.items())))
                else:
                    # For other unhashable types, use a string representation
                    key_parts.append(str(arg))
        
        # Process keyword arguments (sorted for consistency)
        for k in sorted(kwargs.keys()):
            v = kwargs[k]
            try:
                hash(v)
                key_parts.append((k, v))
            except TypeError:
                if isinstance(v, list):
                    key_parts.append((k, tuple(v)))
                elif isinstance(v, dict):
                    key_parts.append((k, tuple(sorted(v.items()))))
                else:
                    key_parts.append((k, str(v)))
        
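        # Use the tuple itself as the key; keying on hash(...) alone would let two
        # different argument sets with colliding hashes share one cache entry
        key = tuple(key_parts)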
        
        # Check if result is already in cache
        if key in cache:
            return cache[key]
        
        # If not in cache, compute the result
        result = func(*args, **kwargs)
        
        # Store in cache for future use
        cache[key] = result
        
        return result
    
    # Add a method to clear the cache
    wrapper.clear_cache = lambda: cache.clear()
    
    # Add a method to get cache info
    wrapper.cache_info = lambda: {'size': len(cache)}
    
    return wrapper

# Example expensive function that would benefit from memoization
@memoize
def compute_relationship_probability(segment_length, relationship_type):
    """Compute the probability of a segment of a given length under a relationship model.
    
    Args:
        segment_length: Length of the segment in cM
        relationship_type: Type of relationship to model
        
    Returns:
        Probability of the segment under the relationship model
    """
    # Simulate an expensive computation
    print(f"Computing for segment {segment_length:.2f} cM under {relationship_type} relationship...")
    
    # Add artificial delay to simulate a complex calculation
    time.sleep(0.1)
    
    # Different models for different relationships
    if relationship_type == 'parent-child':
        return np.exp(-segment_length / 100) * 0.9
    elif relationship_type == 'full-sibling':
        return np.exp(-segment_length / 50) * 0.7
    elif relationship_type == 'half-sibling':
        return np.exp(-segment_length / 30) * 0.5
    elif relationship_type == 'first-cousin':
        return np.exp(-segment_length / 20) * 0.3
    else:
        return np.exp(-segment_length / 10) * 0.1

# Test the memoized function
print("First call (should compute):")
prob1 = compute_relationship_probability(15.0, 'parent-child')
print(f"Result: {prob1}")

print("\nSecond call with same arguments (should use cache):")
prob2 = compute_relationship_probability(15.0, 'parent-child')
print(f"Result: {prob2}")

print("\nThird call with different arguments (should compute):")
prob3 = compute_relationship_probability(15.0, 'full-sibling')
print(f"Result: {prob3}")

print("\nFourth call with first arguments again (should use cache):")
prob4 = compute_relationship_probability(15.0, 'parent-child')
print(f"Result: {prob4}")

# Check cache info
print(f"\nCache info: {compute_relationship_probability.cache_info()}")

# Benchmark with and without memoization
def benchmark_memoization():
    """Compare performance with and without memoization."""
    # Define test cases
    segment_lengths = [10.0, 15.0, 20.0, 25.0, 30.0]
    relationship_types = ['parent-child', 'full-sibling', 'half-sibling', 'first-cousin', 'unrelated']
    
    # Create a non-memoized version for comparison
    def compute_relationship_probability_no_memo(segment_length, relationship_type):
        # Same function without memoization
        time.sleep(0.1)  # Artificial delay
        
        if relationship_type == 'parent-child':
            return np.exp(-segment_length / 100) * 0.9
        elif relationship_type == 'full-sibling':
            return np.exp(-segment_length / 50) * 0.7
        elif relationship_type == 'half-sibling':
            return np.exp(-segment_length / 30) * 0.5
        elif relationship_type == 'first-cousin':
            return np.exp(-segment_length / 20) * 0.3
        else:
            return np.exp(-segment_length / 10) * 0.1
    
    # Benchmark non-memoized version
    print("\nBenchmarking without memoization:")
    start_time = time.time()
    
    # Call the function multiple times, including repeated calls
    for _ in range(3):  # Repeat the whole test set 3 times
        for length in segment_lengths:
            for rel_type in relationship_types:
                compute_relationship_probability_no_memo(length, rel_type)
    
    no_memo_time = time.time() - start_time
    print(f"Time without memoization: {no_memo_time:.2f} seconds")
    
    # Clear the cache for the memoized version
    compute_relationship_probability.clear_cache()
    
    # Benchmark memoized version
    print("\nBenchmarking with memoization:")
    start_time = time.time()
    
    # Call the function multiple times, including repeated calls
    for _ in range(3):  # Repeat the whole test set 3 times
        for length in segment_lengths:
            for rel_type in relationship_types:
                compute_relationship_probability(length, rel_type)
    
    memo_time = time.time() - start_time
    print(f"Time with memoization: {memo_time:.2f} seconds")
    
    # Calculate speedup
    speedup = no_memo_time / memo_time
    print(f"Speedup: {speedup:.2f}x")
    
    return {
        'no_memo_time': no_memo_time,
        'memo_time': memo_time,
        'speedup': speedup
    }

# Run the benchmark
benchmark_memoization()

<cell_type>markdown</cell_type>## Part 4: Memory Optimization Techniques

### Theory and Background

Memory optimization is crucial when processing large genetic datasets, as memory constraints often limit scalability before CPU constraints do. Key memory optimization techniques include:

1. **Efficient Data Structures**: Using memory-efficient data structures appropriate for the task
2. **Streaming Processing**: Processing data in chunks rather than loading everything into memory
3. **Object Pooling**: Reusing objects instead of creating new ones
4. **Memory-Mapped Files**: Accessing file content without loading it entirely into memory
5. **Sparse Representations**: Using data structures that only store non-default values
6. **Compression**: Storing data in compressed formats

Let's explore how these techniques can be applied to Bonsai v3 when working with large IBD datasets.
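
As a small illustration of streaming processing (technique 2), the sketch below accumulates the total shared cM per pair while reading one row at a time, so only the running totals stay in memory. The column names `id1`, `id2`, and `length_cm` are assumptions made for this example and do not describe a specific Bonsai v3 file format.

```python
import csv
import io
from collections import defaultdict

def total_cm_per_pair(lines):
    """Accumulate total shared cM per unordered pair, one row at a time."""
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        # frozenset makes (A, B) and (B, A) the same key
        pair = frozenset((row["id1"], row["id2"]))
        totals[pair] += float(row["length_cm"])
    return dict(totals)

# In practice `lines` would be an open file handle; StringIO stands in here
sample = io.StringIO(
    "id1,id2,length_cm\n"
    "A,B,12.5\n"
    "B,A,7.5\n"
    "A,C,30.0\n"
)
totals = total_cm_per_pair(sample)
print(totals)
```

Because the reader pulls rows lazily, peak memory depends on the number of distinct pairs rather than on the number of segments in the file.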

<cell_type>markdown</cell_type>### Implementation in Bonsai v3

Let's examine memory optimization strategies for storing and processing IBD segments. In Bonsai v3, IBD segments can consume substantial memory, especially with large cohorts.
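
When comparing representations, keep in mind that `sys.getsizeof` is shallow: for an ordinary class instance it does not include the per-instance `__dict__`. The sketch below uses two hypothetical classes (`WithDict` and `WithSlots`) and a hypothetical helper `instance_size` that adds the `__dict__` when one exists. Attribute values themselves are still not counted.

```python
import sys

class WithDict:
    """Ordinary class: attributes live in a per-instance __dict__."""
    def __init__(self, start_cm, end_cm):
        self.start_cm = start_cm
        self.end_cm = end_cm

class WithSlots:
    """Slotted class: attributes live in fixed slots, no __dict__."""
    __slots__ = ("start_cm", "end_cm")

    def __init__(self, start_cm, end_cm):
        self.start_cm = start_cm
        self.end_cm = end_cm

def instance_size(obj):
    """Size of the instance plus its __dict__, if it has one."""
    size = sys.getsizeof(obj)
    if hasattr(obj, "__dict__"):
        size += sys.getsizeof(obj.__dict__)
    return size

a = WithDict(1.0, 2.0)
b = WithSlots(1.0, 2.0)
print(sys.getsizeof(a), instance_size(a), instance_size(b))
```

The exact byte counts vary between Python versions, so the printed numbers should be read as relative rather than absolute.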

In [ ]:
# Example 1: Memory-Efficient IBD Segment Representation

# Original representation (memory-intensive)
class IBDSegment:
    """Standard representation of an IBD segment."""
    
    def __init__(self, id1, id2, chromosome, start_pos, end_pos, start_cm, end_cm, n_snps=None, score=None):
        """Initialize a new IBD segment."""
        self.id1 = id1
        self.id2 = id2
        self.chromosome = chromosome
        self.start_pos = start_pos
        self.end_pos = end_pos
        self.start_cm = start_cm
        self.end_cm = end_cm
        self.n_snps = n_snps
        self.score = score
        
        # Calculate derived properties
        self.length_bp = end_pos - start_pos
        self.length_cm = end_cm - start_cm

# Memory-optimized representation using slots
class OptimizedIBDSegment:
    """Memory-efficient representation of an IBD segment using __slots__."""
    
    __slots__ = ('id1', 'id2', 'chromosome', 'start_pos', 'end_pos', 
                'start_cm', 'end_cm', 'n_snps', 'score', 'length_bp', 'length_cm')
    
    def __init__(self, id1, id2, chromosome, start_pos, end_pos, start_cm, end_cm, n_snps=None, score=None):
        """Initialize a new IBD segment."""
        self.id1 = id1
        self.id2 = id2
        self.chromosome = chromosome
        self.start_pos = start_pos
        self.end_pos = end_pos
        self.start_cm = start_cm
        self.end_cm = end_cm
        self.n_snps = n_snps
        self.score = score
        
        # Calculate derived properties
        self.length_bp = end_pos - start_pos
        self.length_cm = end_cm - start_cm

# Example 2: Compact IBD Segment Tuple Representation
def create_compact_segment(id1, id2, chromosome, start_pos, end_pos, start_cm, end_cm, n_snps=None, score=None):
    """Create a compact tuple representation of an IBD segment."""
    # Use a namedtuple-like approach, but with just a basic tuple for maximum memory efficiency
    return (id1, id2, chromosome, start_pos, end_pos, start_cm, end_cm, n_snps, score)

def get_segment_length_cm(compact_segment):
    """Get the segment length in cM from a compact segment representation."""
    return compact_segment[6] - compact_segment[5]  # end_cm - start_cm

# Example 3: Using numpy structured arrays for bulk storage
def create_segment_array(n_segments):
    """Create a numpy structured array for efficient storage of many segments."""
    # Define the structured array dtype
    segment_dtype = np.dtype([
        ('id1', np.int32),          # Use integer IDs instead of strings for efficiency
        ('id2', np.int32),
        ('chromosome', np.int8),     # Chromosomes 1-23 fit in a byte
        ('start_pos', np.int32),     # Base positions in bp
        ('end_pos', np.int32),
        ('start_cm', np.float32),    # Genetic positions in cM (32-bit float to save memory)
        ('end_cm', np.float32),
        ('n_snps', np.int16),        # Number of SNPs in segment
        ('score', np.float32)        # IBD detection score
    ])
    
    # Create the array
    segments = np.zeros(n_segments, dtype=segment_dtype)
    return segments

# Measure memory usage of different representations
def compare_segment_memory_usage(n_segments=100000):
    """Compare memory usage of different IBD segment representations."""
    import sys
    import numpy as np
    
    # Generate random segment data
    ids = np.arange(1000)  # 1000 possible individual IDs
    
    # Memory usage results
    memory_usage = {}
    
    # Method 1: Standard class instances (baseline)
    print(f"Creating {n_segments} standard IBD segment objects...")
    standard_segments = []
    for i in range(n_segments):
        id1, id2 = np.random.choice(ids, 2, replace=False)
        chromosome = np.random.randint(1, 23)
        start_pos = np.random.randint(1, 200_000_000)
        end_pos = start_pos + np.random.randint(1000, 5_000_000)
        start_cm = start_pos / 1_000_000 * 1.5  # Approximate cM position
        end_cm = end_pos / 1_000_000 * 1.5
        n_snps = np.random.randint(10, 1000)
        score = np.random.random()
        
        segment = IBDSegment(id1, id2, chromosome, start_pos, end_pos, start_cm, end_cm, n_snps, score)
        standard_segments.append(segment)
    
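    # Measure memory usage
    # sys.getsizeof() is shallow, so add each instance's __dict__ explicitly;
    # otherwise the standard class looks smaller than it really is
    memory_usage['standard'] = (
        sys.getsizeof(standard_segments)
        + sum(sys.getsizeof(seg) + sys.getsizeof(seg.__dict__) for seg in standard_segments)
    )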
    
    # Method 2: Optimized class instances with __slots__
    print(f"Creating {n_segments} optimized IBD segment objects with __slots__...")
    optimized_segments = []
    for i in range(n_segments):
        id1, id2 = np.random.choice(ids, 2, replace=False)
        chromosome = np.random.randint(1, 23)
        start_pos = np.random.randint(1, 200_000_000)
        end_pos = start_pos + np.random.randint(1000, 5_000_000)
        start_cm = start_pos / 1_000_000 * 1.5
        end_cm = end_pos / 1_000_000 * 1.5
        n_snps = np.random.randint(10, 1000)
        score = np.random.random()
        
        segment = OptimizedIBDSegment(id1, id2, chromosome, start_pos, end_pos, start_cm, end_cm, n_snps, score)
        optimized_segments.append(segment)
    
    # Measure memory usage
    memory_usage['optimized'] = sys.getsizeof(optimized_segments) + \
                               sum(sys.getsizeof(seg) for seg in optimized_segments)
    
    # Method 3: Tuple representation
    print(f"Creating {n_segments} tuple-based IBD segments...")
    tuple_segments = []
    for i in range(n_segments):
        id1, id2 = np.random.choice(ids, 2, replace=False)
        chromosome = np.random.randint(1, 23)
        start_pos = np.random.randint(1, 200_000_000)
        end_pos = start_pos + np.random.randint(1000, 5_000_000)
        start_cm = start_pos / 1_000_000 * 1.5
        end_cm = end_pos / 1_000_000 * 1.5
        n_snps = np.random.randint(10, 1000)
        score = np.random.random()
        
        segment = create_compact_segment(id1, id2, chromosome, start_pos, end_pos, start_cm, end_cm, n_snps, score)
        tuple_segments.append(segment)
    
    # Measure memory usage
    memory_usage['tuple'] = sys.getsizeof(tuple_segments) + \
                           sum(sys.getsizeof(seg) for seg in tuple_segments)
    
    # Method 4: Numpy structured array
    print(f"Creating a numpy structured array for {n_segments} IBD segments...")
    segment_array = create_segment_array(n_segments)
    
    # Fill with random data
    for i in range(n_segments):
        id1, id2 = np.random.choice(ids, 2, replace=False)
        chromosome = np.random.randint(1, 23)
        start_pos = np.random.randint(1, 200_000_000)
        end_pos = start_pos + np.random.randint(1000, 5_000_000)
        start_cm = start_pos / 1_000_000 * 1.5
        end_cm = end_pos / 1_000_000 * 1.5
        n_snps = np.random.randint(10, 1000)
        score = np.random.random()
        
        segment_array[i] = (id1, id2, chromosome, start_pos, end_pos, start_cm, end_cm, n_snps, score)
    
    # Measure memory usage
    memory_usage['numpy'] = segment_array.nbytes
    
    # Print results
    print("\nMemory usage comparison:")
    for method, memory in memory_usage.items():
        print(f"  {method}: {memory/1024/1024:.2f} MB")
    
    # Calculate memory savings
    baseline = memory_usage['standard']
    for method, memory in memory_usage.items():
        if method != 'standard':
            savings = (baseline - memory) / baseline * 100
            print(f"  {method} saves {savings:.2f}% compared to standard")
    
    # Visualize the comparison
    plt.figure(figsize=(12, 6))
    methods = list(memory_usage.keys())
    memory_mb = [memory_usage[m]/1024/1024 for m in methods]
    
    colors = ['#3498db', '#2ecc71', '#e74c3c', '#f39c12']
    bars = plt.bar(methods, memory_mb, color=colors)
    
    # Add labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.2f} MB',
                ha='center', va='bottom', fontsize=12)
    
    plt.xlabel('Representation Method')
    plt.ylabel('Memory Usage (MB)')
    plt.title('Memory Usage Comparison for IBD Segment Representations')
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    return memory_usage

# Run the memory usage comparison
memory_usage = compare_segment_memory_usage(100000)

### Exercise 4: Implement Chunked Processing for Large Datasets

Large datasets often can't be loaded entirely into memory. Chunked processing allows us to work with these datasets by processing them in manageable portions.

**Task:** Implement a streaming processor for large IBD segment files that analyzes data without loading the entire file into memory.

**Hint:** Use Python's file handling capabilities to read and process the file line by line.
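
As a starting point, here is a minimal sketch of the chunking idea on its own, separate from any statistics. It assumes the whitespace-delimited layout used in this exercise; the helper name `iter_chunks` and the sample rows are illustrative only and are not part of Bonsai. The generator filters lines lazily and uses `itertools.islice` to pull at most `chunk_size` lines at a time, so only one chunk is held in memory.

```python
import io
from itertools import islice

def iter_chunks(lines, chunk_size):
    """Yield lists of up to chunk_size non-empty, non-comment lines."""
    # Both generator expressions are lazy: nothing is read until a chunk is requested
    stripped = (ln.strip() for ln in lines)
    data_lines = (ln for ln in stripped if ln and not ln.startswith('#'))
    while True:
        chunk = list(islice(data_lines, chunk_size))
        if not chunk:
            return  # input exhausted
        yield chunk

# Hypothetical in-memory "file"; a real call would pass an open file object instead
sample = io.StringIO(
    "# comment\n"
    "a b 1 100 200 5.0\n"
    "a c 2 100 300 7.5\n"
    "b c 3 100 400 12.0\n"
)
chunk_lengths = [len(chunk) for chunk in iter_chunks(sample, chunk_size=2)]
print(chunk_lengths)
```

Because an open file object is itself an iterator over lines, the same helper works unchanged with `open(path)`; the final chunk may be shorter than `chunk_size`.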

In [ ]:
# Exercise 4: Implement a streaming processor for large IBD segment files

class StreamingIBDProcessor:
    """Process large IBD segment files without loading the entire file into memory."""
    
    def __init__(self, chunk_size=1000):
        """Initialize the processor.
        
        Args:
            chunk_size: Number of segments to process at once
        """
        self.chunk_size = chunk_size
        self.stats = {
            'total_segments': 0,
            'total_pairs': 0,
            'chr_counts': {},
            'length_stats': {'min': float('inf'), 'max': 0, 'total': 0},
            'individual_counts': {}
        }
    
    def process_file(self, file_path, callback=None):
        """Process an IBD segment file line by line.
        
        Args:
            file_path: Path to the IBD segment file
            callback: Optional function to call for each chunk of segments
            
        Returns:
            Dictionary of statistics about the file
        """
        # TODO: Implement the streaming processor
        # 1. Open the file and read it line by line
        # 2. Parse each line into an IBD segment
        # 3. Process segments in chunks
        # 4. Update statistics
        # 5. Call the callback function with each chunk if provided
        
        # Implementation
        current_chunk = []
        
        print(f"Processing {file_path}...")
        
        try:
            with open(file_path, 'r') as f:
                # Skip header if present
                first_line = f.readline().strip()
                if first_line.startswith('#') or not self._is_segment_line(first_line):
                    print("Skipping header line")
                else:
                    # Process the first line if it's a segment
                    self._process_segment_line(first_line, current_chunk)
                
                # Process the rest of the file
                for line_num, line in enumerate(f, start=2):
                    # Skip empty lines and comments
                    line = line.strip()
                    if not line or line.startswith('#'):
                        continue
                    
                    # Process this segment
                    self._process_segment_line(line, current_chunk)
                    
                    # If we've reached the chunk size, process the chunk
                    if len(current_chunk) >= self.chunk_size:
                        self._process_chunk(current_chunk)
                        if callback:
                            callback(current_chunk)
                        current_chunk = []
                        
                        # Print progress every 100,000 segments
                        if self.stats['total_segments'] % 100000 == 0:
                            print(f"  Processed {self.stats['total_segments']} segments...")
            
            # Process any remaining segments
            if current_chunk:
                self._process_chunk(current_chunk)
                if callback:
                    callback(current_chunk)
            
            # Calculate averages
            if self.stats['total_segments'] > 0:
                self.stats['avg_length'] = self.stats['length_stats']['total'] / self.stats['total_segments']
            else:
                self.stats['avg_length'] = 0
                
            print(f"Completed processing. Found {self.stats['total_segments']} segments across {self.stats['total_pairs']} pairs.")
            return self.stats
            
        except Exception as e:
            print(f"Error processing file: {e}")
            return self.stats
    
    def _is_segment_line(self, line):
        """Check if a line contains an IBD segment."""
        parts = line.strip().split()
        return len(parts) >= 6 and all(self._is_numeric(p) for p in parts[2:6])
    
    def _is_numeric(self, text):
        """Check if a string represents a number."""
        try:
            float(text)
            return True
        except (ValueError, TypeError):
            return False
    
    def _process_segment_line(self, line, current_chunk):
        """Parse a line into an IBD segment and add it to the current chunk."""
        parts = line.strip().split()
        
        # Expected format: id1 id2 chromosome start_pos end_pos genetic_length [additional fields]
        if len(parts) < 6:
            # Skip malformed lines
            return
        
        try:
            id1 = parts[0]
            id2 = parts[1]
            chromosome = int(parts[2])
            start_pos = int(parts[3])
            end_pos = int(parts[4])
            genetic_length = float(parts[5])
            
            # Create a simple tuple representation for memory efficiency
            segment = (id1, id2, chromosome, start_pos, end_pos, genetic_length)
            current_chunk.append(segment)
            
        except (ValueError, IndexError) as e:
            # Skip malformed lines
            print(f"Skipping malformed line: {line.strip()} - Error: {e}")
    
    def _process_chunk(self, chunk):
        """Process a chunk of segments."""
        # Update total segments
        self.stats['total_segments'] += len(chunk)
        
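        # Track pairs across ALL chunks so each pair is counted once per file;
        # a set created per chunk would recount pairs that recur in later chunks
        if not hasattr(self, '_pairs_seen'):
            self._pairs_seen = set()
        pairs_seen = self._pairs_seen
        
        # Process each segment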
        
        for segment in chunk:
            id1, id2, chromosome, start_pos, end_pos, genetic_length = segment
            
            # Update chromosome counts
            self.stats['chr_counts'][chromosome] = self.stats['chr_counts'].get(chromosome, 0) + 1
            
            # Update length statistics
            self.stats['length_stats']['min'] = min(self.stats['length_stats']['min'], genetic_length)
            self.stats['length_stats']['max'] = max(self.stats['length_stats']['max'], genetic_length)
            self.stats['length_stats']['total'] += genetic_length
            
            # Update individual counts
            self.stats['individual_counts'][id1] = self.stats['individual_counts'].get(id1, 0) + 1
            self.stats['individual_counts'][id2] = self.stats['individual_counts'].get(id2, 0) + 1
            
            # Update pair count (each pair is counted only once)
            pair = (min(id1, id2), max(id1, id2))
            if pair not in pairs_seen:
                pairs_seen.add(pair)
                self.stats['total_pairs'] += 1

# Create a simple custom callback function for the streaming processor
def example_callback(chunk):
    """Example callback function for the streaming processor."""
    # In a real application, this might update a progress bar, save to a database, etc.
    long_segments = [seg for seg in chunk if seg[5] > 15]  # Filter segments longer than 15 cM
    if long_segments:
        print(f"  Found {len(long_segments)} segments longer than 15 cM in this chunk")
        
# Test with a mock IBD file
def create_mock_ibd_file(filename, n_segments=10000):
    """Create a mock IBD segment file for testing."""
    print(f"Creating mock IBD file with {n_segments} segments: {filename}")
    
    with open(filename, 'w') as f:
        # Write header
        f.write("# Mock IBD segment file\n")
        f.write("id1 id2 chrom start_pos end_pos genetic_length\n")
        
        # Write segments
        for i in range(n_segments):
            id1 = f"ind_{np.random.randint(1, 100)}"
            id2 = f"ind_{np.random.randint(1, 100)}"
            while id2 == id1:
                id2 = f"ind_{np.random.randint(1, 100)}"
                
            chromosome = np.random.randint(1, 23)
            start_pos = np.random.randint(1, 200_000_000)
            end_pos = start_pos + np.random.randint(1000, 5_000_000)
            genetic_length = np.random.exponential(10) # Mean 10 cM
            
            f.write(f"{id1} {id2} {chromosome} {start_pos} {end_pos} {genetic_length:.2f}\n")
    
    print(f"Mock file created: {filename}")
    return filename

# Create a test file
import os
mock_file_path = os.path.join(RESULTS_DIR, "mock_ibd_segments.txt")
create_mock_ibd_file(mock_file_path, n_segments=50000)

# Test the streaming processor
processor = StreamingIBDProcessor(chunk_size=5000)
stats = processor.process_file(mock_file_path, callback=example_callback)

# Display summary statistics
print("\nSummary Statistics:")
print(f"Total Segments: {stats['total_segments']}")
print(f"Total Pairs: {stats['total_pairs']}")
print(f"Chromosome Distribution:")
for chr_num in sorted(stats['chr_counts'].keys()):
    print(f"  Chr {chr_num}: {stats['chr_counts'][chr_num]} segments")
print(f"Segment Length Statistics:")
print(f"  Min: {stats['length_stats']['min']:.2f} cM")
print(f"  Max: {stats['length_stats']['max']:.2f} cM")
print(f"  Avg: {stats['avg_length']:.2f} cM")
print(f"Top 5 individuals by segment count:")
top_individuals = sorted(stats['individual_counts'].items(), key=lambda x: x[1], reverse=True)[:5]
for ind, count in top_individuals:
    print(f"  {ind}: {count} segments")

# Visualize some statistics
plt.figure(figsize=(15, 6))

# Plot 1: Chromosome distribution
plt.subplot(1, 2, 1)
chr_nums = sorted(stats['chr_counts'].keys())
chr_counts = [stats['chr_counts'][chr_num] for chr_num in chr_nums]
plt.bar(chr_nums, chr_counts)
plt.xlabel('Chromosome')
plt.ylabel('Number of Segments')
plt.title('IBD Segment Distribution by Chromosome')

# Plot 2: Top individuals
plt.subplot(1, 2, 2)
top_n = 10
top_individuals = sorted(stats['individual_counts'].items(), key=lambda x: x[1], reverse=True)[:top_n]
inds = [ind for ind, _ in top_individuals]
counts = [count for _, count in top_individuals]
plt.barh(inds, counts)
plt.xlabel('Number of Segments')
plt.ylabel('Individual ID')
plt.title(f'Top {top_n} Individuals by Segment Count')

plt.tight_layout()
plt.show()

## Part 5: Precision-Performance Tradeoffs

### Theory and Background

In genetic genealogy computations, precision and performance trade off against each other: more exact models and more data cost more time. Choosing carefully where to give up precision can reduce runtime substantially with little loss of accuracy.

Key precision-performance tradeoff strategies include:

1. **Early Termination**: Stopping computations once a sufficient level of confidence is reached
2. **Approximation Algorithms**: Using faster, approximate methods for computationally intensive operations
3. **Pruning Low-Information Data**: Focusing on high-quality data and ignoring noisy or ambiguous information
4. **Adaptive Precision**: Adjusting computational precision based on the specific relationship being analyzed
5. **Confidence-Weighted Operations**: Allocating more computational resources to higher-confidence predictions

In Bonsai v3, there are several areas where precision-performance tradeoffs can be intelligently applied without significantly compromising accuracy.
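
To make the first strategy concrete, the following is a minimal sketch of early termination, not Bonsai's implementation. It assumes a toy exponential model of segment length per hypothesis and a uniform prior; the function names and the `mean_lengths` values are hypothetical. Evidence is accumulated one batch of segments at a time, and evaluation stops as soon as one hypothesis reaches the confidence threshold.

```python
import numpy as np

def posterior_from_loglik(loglik):
    """Normalize log-likelihoods into probabilities (uniform prior assumed)."""
    shifted = np.asarray(loglik, dtype=float) - np.max(loglik)  # guards against underflow
    weights = np.exp(shifted)
    return weights / weights.sum()

def classify_with_early_stop(lengths_cm, mean_lengths, threshold=0.95, batch_size=2):
    """Accumulate evidence batch by batch; stop once one hypothesis dominates.

    mean_lengths maps a hypothesis name to the mean of a toy exponential
    segment-length model (an illustrative assumption, not a Bonsai model).
    """
    names = list(mean_lengths)
    means = np.array([mean_lengths[n] for n in names], dtype=float)
    loglik = np.zeros(len(names))
    post = posterior_from_loglik(loglik)  # uniform before any data is seen
    used = 0
    for start in range(0, len(lengths_cm), batch_size):
        batch = np.asarray(lengths_cm[start:start + batch_size], dtype=float)
        # Exponential log-density: -log(mean) - x / mean, summed over the batch
        loglik += (-np.log(means)[:, None] - batch[None, :] / means[:, None]).sum(axis=1)
        used += len(batch)
        post = posterior_from_loglik(loglik)
        if post.max() >= threshold:
            break  # confident enough: the remaining segments are never evaluated
    best = names[int(np.argmax(post))]
    return best, float(post.max()), used

# Hypothetical two-hypothesis example with six long segments
models = {'close': 30.0, 'distant': 5.0}
best, confidence, used = classify_with_early_stop(
    [40.0, 35.0, 50.0, 45.0, 38.0, 42.0], models)
print(best, round(confidence, 4), f"segments used: {used} of 6")
```

The saving comes from the `break`: work that has not started is never done. The cost is that a decision made on a subset of segments can differ from the decision on all of them, so the threshold controls how much of that risk is accepted.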

In [None]:
# Implementation and Examples

class AdaptivePrecisionCalculator:
    """A calculator that adjusts precision based on confidence thresholds."""
    
    def __init__(self, precision_levels=None, confidence_thresholds=None):
        """Initialize the adaptive precision calculator.
        
        Args:
            precision_levels: Dictionary of precision levels for different operations
            confidence_thresholds: Dictionary of confidence thresholds for early termination
        """
        # Default precision levels (higher is more precise but slower)
        self.precision_levels = precision_levels or {
            'low': 1,      # Fast, approximate calculations
            'medium': 2,   # Balanced precision/speed
            'high': 3,     # High precision, slower calculations
            'ultra': 4     # Maximum precision, slowest calculations
        }
        
        # Default confidence thresholds for early termination
        self.confidence_thresholds = confidence_thresholds or {
            'relationship': 0.95,  # Stop when we're 95% confident in the relationship
            'connection_point': 0.90,  # Stop when we're 90% confident in the connection point
            'pedigree': 0.85,  # Stop when we're 85% confident in the pedigree structure
        }
        
        # Current settings
        self.current_precision = 'medium'
        self.enable_early_termination = True
        self.enable_approximation = True
        
    def set_precision(self, level):
        """Set the precision level.
        
        Args:
            level: Precision level ('low', 'medium', 'high', 'ultra')
        """
        if level in self.precision_levels:
            self.current_precision = level
        else:
            raise ValueError(f"Invalid precision level: {level}. Valid levels: {list(self.precision_levels.keys())}")
    
    def get_precision_value(self):
        """Get the numerical precision value for the current level."""
        return self.precision_levels[self.current_precision]
    
    def adaptive_relationship_likelihood(self, segments, relationship_types, confidence_target=None):
        """Calculate relationship likelihoods with adaptive precision.
        
        Args:
            segments: List of IBD segments
            relationship_types: List of relationship types to evaluate
            confidence_target: Target confidence level (0-1) or None to use default
            
        Returns:
            Dictionary of relationship likelihoods and confidence scores
        """
        # Use default threshold if not specified
        confidence_target = confidence_target or self.confidence_thresholds['relationship']
        
        # Get precision parameters based on current level
        precision = self.get_precision_value()
        
        # Adjust algorithm parameters based on precision level
        if precision == 1:  # Low precision
            segment_sampling = 0.5  # Use only 50% of segments (random sample)
            model_simplification = 0.7  # Use simplified models (70% complexity)
            iterations = 10  # Low number of iterations for Monte Carlo methods
        elif precision == 2:  # Medium precision
            segment_sampling = 0.8  # Use 80% of segments
            model_simplification = 0.9  # Slightly simplified models
            iterations = 50  # Medium number of iterations
        elif precision == 3:  # High precision
            segment_sampling = 1.0  # Use all segments
            model_simplification = 1.0  # Full models
            iterations = 100  # High number of iterations
        else:  # Ultra precision
            segment_sampling = 1.0  # Use all segments
            model_simplification = 1.0  # Full models
            iterations = 500  # Very high number of iterations
        
        # Apply segment sampling if enabled
        working_segments = segments
        if segment_sampling < 1.0:
            n_samples = max(1, int(len(segments) * segment_sampling))
            indices = np.random.choice(len(segments), n_samples, replace=False)
            working_segments = [segments[i] for i in indices]
        
        # Calculate likelihoods for each relationship type
        results = {}
        max_likelihood = float('-inf')
        max_rel_type = None
        
        for rel_type in relationship_types:
            # Calculate likelihood with the appropriate precision
            likelihood = self._calculate_relationship_likelihood(
                working_segments, rel_type, model_simplification, iterations)
            
            results[rel_type] = {
                'likelihood': likelihood,
                'confidence': None  # Will be filled in later
            }
            
            # Track maximum likelihood for early termination
            if likelihood > max_likelihood:
                max_likelihood = likelihood
                max_rel_type = rel_type
        
        # Calculate confidence scores
        total_evidence = 0
        for rel_type, result in results.items():
            # Convert likelihoods to evidence values (prevent underflow)
            evidence = np.exp(result['likelihood'] - max_likelihood)
            results[rel_type]['evidence'] = evidence
            total_evidence += evidence
        
        # Normalize to get confidence scores
        for rel_type, result in results.items():
            confidence = result['evidence'] / total_evidence if total_evidence > 0 else 0
            results[rel_type]['confidence'] = confidence
        
        # Apply early termination if enabled
        if self.enable_early_termination and max_rel_type is not None:
            max_confidence = results[max_rel_type]['confidence']
            
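            # If we're confident enough, flag the remaining hypotheses.
            # NOTE: in this simplified demo every likelihood above has already been
            # computed, so the flag records what could be skipped rather than saving work.
            if max_confidence >= confidence_target:
                # Mark other relationships as skipped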
                for rel_type in relationship_types:
                    if rel_type != max_rel_type:
                        results[rel_type]['skipped'] = True
        
        return results
    
    def _calculate_relationship_likelihood(self, segments, relationship_type, model_simplification, iterations):
        """Calculate the likelihood of segments under a relationship model.
        
        Args:
            segments: List of IBD segments
            relationship_type: Type of relationship to model
            model_simplification: Factor for model simplification (0-1)
            iterations: Number of iterations for Monte Carlo methods
            
        Returns:
            Log-likelihood of the segments under the relationship model
        """
        # This is a simplified simulation of the likelihood calculation
        # In a real implementation, this would use the actual Bonsai models
        
        # Base likelihood depends on the relationship type
        base_factors = {
            'parent-child': 10.0,
            'full-sibling': 8.0,
            'half-sibling': 6.0,
            'grandparent': 5.0,
            'aunt-uncle': 4.0,
            'first-cousin': 3.0,
            'second-cousin': 2.0,
            'third-cousin': 1.0,
            'unrelated': 0.5
        }
        
        # Get the base factor for this relationship
        base_factor = base_factors.get(relationship_type, 0.1)
        
        # Apply model simplification (reduces precision but increases speed)
        effective_factor = base_factor * model_simplification
        
        # Monte Carlo integration for likelihood (more iterations = more precision)
        likelihood_samples = []
        for _ in range(iterations):
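            # Generate a random sample from the relationship model.
            # NOTE: this simplified simulation never uses `sample` below, so every
            # iteration yields the same value and `iterations` only affects runtime.
            sample = np.random.normal(effective_factor, 0.5)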
            
            # Calculate likelihood for this sample
            segment_likelihoods = []
            for segment in segments:
                # Extract segment features (simplified)
                length = segment[5] if isinstance(segment, tuple) else segment.get('length_cm', 5.0)
                
                # Calculate likelihood contribution of this segment
                segment_likelihood = self._segment_likelihood_model(length, effective_factor)
                segment_likelihoods.append(segment_likelihood)
            
            # Combine segment likelihoods (sum of log-likelihoods)
            combined_likelihood = sum(segment_likelihoods)
            likelihood_samples.append(combined_likelihood)
        
        # Average the samples (in log space)
        log_likelihood = np.mean(likelihood_samples)
        
        return log_likelihood
    
    def _segment_likelihood_model(self, length, factor):
        """Model for segment likelihood calculation.
        
        Args:
            length: Length of the segment in cM
            factor: Model parameter based on relationship type
            
        Returns:
            Log-likelihood of the segment
        """
        # This is a very simplified model
        # Real models would be based on empirical distributions
        
        # Basic model: likelihood depends on segment length and relationship factor
        # We use logarithmic scale to avoid numerical underflow
        log_likelihood = np.log(factor) - length / (10.0 * factor)
        
        return log_likelihood

# Example of applying adaptive precision in a relationship inference workflow
def demonstrate_adaptive_precision():
    """Demonstrate adaptive precision in relationship inference."""
    # Create an adaptive precision calculator
    calculator = AdaptivePrecisionCalculator()
    
    # Create some test IBD segments (using tuple representation for memory efficiency)
    # Format: (id1, id2, chromosome, start_pos, end_pos, length_cm)
    pc_segments = [
        ('ind1', 'ind2', 1, 10000000, 50000000, 25.0),
        ('ind1', 'ind2', 2, 20000000, 80000000, 30.0),
        ('ind1', 'ind2', 5, 30000000, 90000000, 35.0),
        ('ind1', 'ind2', 7, 40000000, 70000000, 20.0),
        ('ind1', 'ind2', 10, 50000000, 100000000, 40.0)
    ]
    
    fs_segments = [
        ('ind1', 'ind3', 1, 10000000, 50000000, 15.0),
        ('ind1', 'ind3', 3, 20000000, 60000000, 17.0),
        ('ind1', 'ind3', 6, 30000000, 70000000, 12.0),
        ('ind1', 'ind3', 9, 40000000, 80000000, 14.0)
    ]
    
    hs_segments = [
        ('ind1', 'ind4', 2, 10000000, 40000000, 10.0),
        ('ind1', 'ind4', 5, 20000000, 50000000, 8.0),
        ('ind1', 'ind4', 8, 30000000, 60000000, 12.0)
    ]
    
    fc_segments = [
        ('ind1', 'ind5', 3, 10000000, 30000000, 7.0),
        ('ind1', 'ind5', 7, 20000000, 40000000, 6.0)
    ]
    
    un_segments = [
        ('ind1', 'ind6', 4, 10000000, 25000000, 4.0)
    ]
    
    # Define relationship types to test
    relationship_types = [
        'parent-child', 
        'full-sibling', 
        'half-sibling', 
        'first-cousin',
        'unrelated'
    ]
    
    # Test with different precision levels
    precision_levels = ['low', 'medium', 'high', 'ultra']
    segment_sets = {
        'Parent-Child': pc_segments,
        'Full Sibling': fs_segments,
        'Half Sibling': hs_segments,
        'First Cousin': fc_segments,
        'Unrelated': un_segments
    }
    
    # Results storage for comparison
    results = {}
    timings = {}
    
    for precision in precision_levels:
        print(f"\nTesting with {precision} precision:")
        calculator.set_precision(precision)
        
        timings[precision] = {}
        results[precision] = {}
        
        for relation_name, segments in segment_sets.items():
            print(f"  Analyzing {relation_name} relationship ({len(segments)} segments)...")
            
            # Measure execution time
            start_time = time.time()
            likelihood_results = calculator.adaptive_relationship_likelihood(segments, relationship_types)
            end_time = time.time()
            
            execution_time = end_time - start_time
            timings[precision][relation_name] = execution_time
            results[precision][relation_name] = likelihood_results
            
            # Find the most likely relationship
            max_likelihood = float('-inf')
            max_rel_type = None
            for rel_type, result in likelihood_results.items():
                if result['likelihood'] > max_likelihood:
                    max_likelihood = result['likelihood']
                    max_rel_type = rel_type
            
            # Print the results
            print(f"    Most likely relationship: {max_rel_type}")
            print(f"    Confidence: {likelihood_results[max_rel_type]['confidence']:.4f}")
            print(f"    Execution time: {execution_time:.4f} seconds")
            
            # Check if we skipped any calculations due to early termination
            skipped = sum(1 for r in likelihood_results.values() if r.get('skipped', False))
            if skipped:
                print(f"    Skipped {skipped} relationship calculations due to early termination")
    
    # Visualize the timing results
    plt.figure(figsize=(15, 6))
    
    # Plot 1: Execution time by precision level
    plt.subplot(1, 2, 1)
    
    x = np.arange(len(segment_sets))
    width = 0.2
    
    for i, precision in enumerate(precision_levels):
        times = [timings[precision][relation] for relation in segment_sets.keys()]
        plt.bar(x + i*width, times, width, label=precision)
    
    plt.xlabel('Relationship Type')
    plt.ylabel('Execution Time (seconds)')
    plt.title('Execution Time by Precision Level')
    plt.xticks(x + width * (len(precision_levels) - 1) / 2, segment_sets.keys(), rotation=45, ha='right')
    plt.legend()
    
    # Plot 2: Accuracy comparison
    plt.subplot(1, 2, 2)
    
    # Collect the correct classification rates
    correct_rates = []
    expected_relations = {
        'Parent-Child': 'parent-child',
        'Full Sibling': 'full-sibling',
        'Half Sibling': 'half-sibling',
        'First Cousin': 'first-cousin',
        'Unrelated': 'unrelated'
    }
    
    for precision in precision_levels:
        correct = 0
        for relation_name, segments in segment_sets.items():
            # Get the expected relationship type
            expected = expected_relations[relation_name]
            
            # Find the predicted relationship type
            likelihood_results = results[precision][relation_name]
            max_likelihood = float('-inf')
            predicted = None
            for rel_type, result in likelihood_results.items():
                if result['likelihood'] > max_likelihood:
                    max_likelihood = result['likelihood']
                    predicted = rel_type
            
            # Check if correct
            if predicted == expected:
                correct += 1
        
        correct_rate = correct / len(segment_sets)
        correct_rates.append(correct_rate)
    
    plt.bar(precision_levels, correct_rates)
    plt.xlabel('Precision Level')
    plt.ylabel('Correct Classification Rate')
    plt.title('Accuracy by Precision Level')
    plt.ylim(0, 1.1)
    
    # Add text labels for the correct rates
    for i, rate in enumerate(correct_rates):
        plt.text(i, rate + 0.05, f"{rate:.2f}", ha='center')
    
    plt.tight_layout()
    plt.show()
    
    # Return the complete results for further analysis
    return {
        'results': results,
        'timings': timings
    }

# Run the demonstration
demo_results = demonstrate_adaptive_precision()

### Exercise 5: Configure a Performance-Optimized Pipeline

In this exercise, you'll create a configuration-based performance optimization system that applies different optimization techniques based on dataset characteristics and desired tradeoffs.

**Task:** Complete the ConfigurablePerformancePipeline class below to enable configurable performance tuning for different scenarios.

**Hint:** Focus on making the pipeline flexible enough to handle different optimization strategies while maintaining a consistent interface.
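
One possible way to keep the interface consistent is to treat a profile as an immutable set of defaults and apply per-run overrides on top of it. The sketch below illustrates that idea with a frozen dataclass; `PipelineConfig`, `PROFILES`, and `resolve_config` are hypothetical names used only for illustration, and the field names mirror the configuration keys used in this exercise.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PipelineConfig:
    """Immutable bundle of tuning options (defaults form a balanced profile)."""
    precision: str = 'medium'
    early_termination: bool = True
    memory_optimization: str = 'medium'
    segment_representation: str = 'slotted'
    chunked_processing: bool = False
    parallel_processing: bool = False

# Hypothetical named profiles; each is a complete, read-only configuration
PROFILES = {
    'balanced': PipelineConfig(),
    'realtime': PipelineConfig(precision='low', memory_optimization='high',
                               segment_representation='tuple',
                               chunked_processing=True, parallel_processing=True),
}

def resolve_config(profile='balanced', **overrides):
    """Return the named profile with overrides applied; unknown names fall back to 'balanced'."""
    base = PROFILES.get(profile, PROFILES['balanced'])
    # replace() builds a new object, so the shared profile is never mutated;
    # it raises TypeError for a misspelled option name
    return replace(base, **overrides)

cfg = resolve_config('realtime', precision='medium')
print(cfg.precision, cfg.chunked_processing)
```

Because the profiles are frozen and `replace` returns a copy, two pipelines built from the same profile cannot affect each other's settings.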

In [None]:
# Exercise 5: Configure a Performance-Optimized Pipeline

class ConfigurablePerformancePipeline:
    """Configurable pipeline that applies different optimization techniques based on scenarios.
    
    This pipeline allows for flexible performance tuning of Bonsai operations
    based on dataset characteristics and precision requirements.
    """
    
    # Performance profiles for different scenarios
    PERFORMANCE_PROFILES = {
        'small_family': {
            'precision': 'high',  # High precision for small datasets
            'early_termination': True,
            'memory_optimization': 'standard',  # Standard memory usage is fine
            'segment_representation': 'class',  # Use class-based representation
            'chunked_processing': False,  # Load everything in memory
            'parallel_processing': False  # Single-threaded is sufficient
        },
        'large_pedigree': {
            'precision': 'medium',  # Medium precision for balance
            'early_termination': True,
            'memory_optimization': 'high',  # Optimize memory usage
            'segment_representation': 'numpy',  # Use numpy arrays
            'chunked_processing': True,  # Process in chunks
            'parallel_processing': True  # Use parallel processing
        },
        'endogamous': {
            'precision': 'ultra',  # Maximum precision for complex relationships
            'early_termination': False,  # Don't terminate early due to complex patterns
            'memory_optimization': 'medium',  # Balance memory and precision
            'segment_representation': 'slotted',  # Use slotted classes
            'chunked_processing': True,  # Process in chunks for large datasets
            'parallel_processing': True  # Use parallel processing
        },
        'realtime': {
            'precision': 'low',  # Low precision for speed
            'early_termination': True,  # Terminate early for speed
            'memory_optimization': 'high',  # Optimize memory usage
            'segment_representation': 'tuple',  # Use tuple representation
            'chunked_processing': True,  # Process in chunks
            'parallel_processing': True  # Use parallel processing
        }
    }
    
    def __init__(self, profile='medium', custom_config=None):
        """Initialize the pipeline with a performance profile.
        
        Args:
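            profile: Name of a profile in PERFORMANCE_PROFILES, or 'custom'.
                Any other value (including the default 'medium') selects the
                balanced fallback configuration defined below.
            custom_config: Custom configuration dictionary (used only if profile is 'custom')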
        """
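        # Set the configuration based on the profile.
        # Copy the dictionary so later changes to self.config cannot
        # alter the caller's dict or the shared class-level profiles.
        if profile == 'custom' and custom_config is not None:
            self.config = dict(custom_config)
        elif profile in self.PERFORMANCE_PROFILES:
            self.config = dict(self.PERFORMANCE_PROFILES[profile])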
        else:
            # Default to a balanced profile
            self.config = {
                'precision': 'medium',
                'early_termination': True,
                'memory_optimization': 'medium',
                'segment_representation': 'slotted',
                'chunked_processing': False,
                'parallel_processing': False
            }
            
        # Initialize components based on configuration
        self._init_components()
        
        # Print the configuration
        print("Pipeline configuration:")
        for key, value in self.config.items():
            print(f"  {key}: {value}")
    
    def _init_components(self):
        """Initialize components based on configuration."""
        # Create an adaptive precision calculator with the configured precision
        self.precision_calculator = AdaptivePrecisionCalculator()
        self.precision_calculator.set_precision(self.config['precision'])
        self.precision_calculator.enable_early_termination = self.config['early_termination']
        
        # TODO: Initialize other components based on configuration
        # For now, we'll just set up placeholders
        
        # Set up the IBD segment factory based on representation type
        self.segment_factory = self._create_segment_factory()
        
        # Set up the processor based on chunked processing setting
        if self.config['chunked_processing']:
            self.processor = StreamingIBDProcessor(chunk_size=5000)
        else:
            # Simple in-memory processor
            self.processor = lambda file_path, callback=None: self._process_in_memory(file_path)
    
    def _create_segment_factory(self):
        """Create a factory function for IBD segments based on configuration."""
        # Different segment creation strategies based on configuration
        representation = self.config['segment_representation']
        
        if representation == 'class':
            # Standard class-based representation
            return lambda id1, id2, chr, start, end, length: IBDSegment(id1, id2, chr, start, end, 0, length)
        
        elif representation == 'slotted':
            # Optimized class with __slots__
            return lambda id1, id2, chr, start, end, length: OptimizedIBDSegment(id1, id2, chr, start, end, 0, length)
        
        elif representation == 'tuple':
            # Tuple-based representation
            return lambda id1, id2, chr, start, end, length: (id1, id2, chr, start, end, length)
        
        elif representation == 'numpy':
            # Factory that creates segments in a numpy array
            # For simplicity, this returns a function that adds to a pre-allocated array
            array = create_segment_array(1000)  # Pre-allocate
            idx = [0]  # Use a list for the mutable reference
            
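            def add_to_array(id1, id2, chr, start, end, length):
                # `array` is rebound below, so it must be declared nonlocal;
                # otherwise Python treats it as a local variable and the
                # first access raises UnboundLocalError
                nonlocal array
                if idx[0] >= len(array):
                    # Resize the array if needed
                    new_array = create_segment_array(len(array) * 2)
                    new_array[:len(array)] = array
                    array = new_array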
                
                # Add the segment to the array
                array[idx[0]] = (int(id1), int(id2), chr, start, end, 0, length, 0, 0)
                idx[0] += 1
                return array[:idx[0]]
            
            return add_to_array
        
        else:
            # Default to tuples
            return lambda id1, id2, chr, start, end, length: (id1, id2, chr, start, end, length)
    
    def _process_in_memory(self, file_path):
        """Process a file in memory (non-chunked approach)."""
        # Simple implementation that loads all data into memory
        segments = []
        
        with open(file_path, 'r') as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('#'):
                    continue
                
                parts = line.split()
                if len(parts) >= 6:
                    try:
                        id1 = parts[0]
                        id2 = parts[1]
                        chromosome = int(parts[2])
                        start_pos = int(parts[3])
                        end_pos = int(parts[4])
                        genetic_length = float(parts[5])
                        
                        segment = self.segment_factory(id1, id2, chromosome, start_pos, end_pos, genetic_length)
                        segments.append(segment)
                    except (ValueError, IndexError):
                        # Skip malformed lines rather than aborting the whole file
                        continue
        
        return segments
    
    def process_ibd_file(self, file_path):
        """Process an IBD segment file using the configured pipeline.
        
        Args:
            file_path: Path to the IBD segment file
            
        Returns:
            Processed data based on the pipeline configuration
        """
        # Process the file based on configuration
        if self.config['chunked_processing']:
            # Use the streaming processor
            return self.processor.process_file(file_path)
        else:
            # Process in memory
            return self._process_in_memory(file_path)
    
    def analyze_relationships(self, segments, relationship_types=None):
        """Analyze relationships from IBD segments.
        
        Args:
            segments: List of IBD segments
            relationship_types: List of relationship types to evaluate (optional)
            
        Returns:
            Dictionary of relationship likelihoods
        """
        # Default relationship types if not specified
        if relationship_types is None:
            relationship_types = [
                'parent-child', 
                'full-sibling', 
                'half-sibling', 
                'first-cousin',
                'second-cousin',
                'unrelated'
            ]
            
        # Use the adaptive precision calculator to analyze relationships
        return self.precision_calculator.adaptive_relationship_likelihood(
            segments, relationship_types)
    
    def optimize_pedigree(self, relationships, individuals):
        """Optimize a pedigree structure based on relationship data.
        
        Args:
            relationships: Dictionary of pairwise relationships
            individuals: List of individuals in the pedigree
            
        Returns:
            Optimized pedigree structure
        """
        # This is a placeholder for a real pedigree optimization
        # In a real implementation, this would use the Bonsai pedigree optimization
        # with performance settings based on the configuration
        
        # For demonstration purposes:
        print(f"Optimizing pedigree with {len(individuals)} individuals...")
        print(f"Using precision level: {self.config['precision']}")
        
        if self.config['early_termination']:
            print("Early termination enabled: Will stop when confidence threshold is reached")
        
        if self.config['parallel_processing']:
            print("Parallel processing enabled: Using multiple threads for optimization")
        
        # Simulate computation time based on configuration
        start_time = time.time()
        time_factor = {
            'low': 0.2,
            'medium': 0.5,
            'high': 1.0,
            'ultra': 2.0
        }.get(self.config['precision'], 0.5)
        
        # Simulate the optimization process
        time.sleep(0.1 * time_factor * min(10, len(individuals) / 10))
        
        # Create a mock optimized pedigree
        pedigree = {
            'individuals': individuals,
            'relationships': relationships,
            'optimization_time': time.time() - start_time,
            'configuration': self.config
        }
        
        return pedigree

# Test the configurable pipeline with different profiles
def test_configurable_pipeline():
    """Test the configurable pipeline with different profiles."""
    # Path to the mock IBD segment file (assumed to have been written earlier in this notebook)
    mock_file_path = os.path.join(RESULTS_DIR, "mock_ibd_segments.txt")
    
    # Test with different profiles
    profiles = ['small_family', 'large_pedigree', 'endogamous', 'realtime']
    
    results = {}
    
    for profile in profiles:
        print(f"\nTesting pipeline with '{profile}' profile:")
        
        # Create pipeline with this profile
        pipeline = ConfigurablePerformancePipeline(profile)
        
        # Process the file
        start_time = time.time()
        processed_data = pipeline.process_ibd_file(mock_file_path)
        processing_time = time.time() - start_time
        
        # Get individuals from the processed data
        if isinstance(processed_data, dict) and 'individual_counts' in processed_data:
            individuals = list(processed_data['individual_counts'].keys())
        else:
            # If we processed in memory, extract individuals from segments
            individuals = set()
            for segment in processed_data:
                if isinstance(segment, tuple):
                    individuals.add(segment[0])
                    individuals.add(segment[1])
                else:
                    individuals.add(segment.id1)
                    individuals.add(segment.id2)
            individuals = list(individuals)
        
        # Sample some relationships for pedigree optimization
        relationships = {}
        for i in range(min(len(individuals), 10)):
            for j in range(i+1, min(len(individuals), 10)):
                relationships[(individuals[i], individuals[j])] = {
                    'likelihood': np.random.random(),
                    'relationship': np.random.choice([
                        'parent-child', 'full-sibling', 'half-sibling', 
                        'first-cousin', 'unrelated'
                    ])
                }
        
        # Optimize pedigree
        start_time = time.time()
        optimized_pedigree = pipeline.optimize_pedigree(relationships, individuals[:10])
        optimization_time = time.time() - start_time
        
        # Store results
        results[profile] = {
            'processing_time': processing_time,
            'optimization_time': optimization_time,
            'total_time': processing_time + optimization_time,
            'pedigree': optimized_pedigree
        }
        
        print(f"  Processing time: {processing_time:.4f} seconds")
        print(f"  Optimization time: {optimization_time:.4f} seconds")
        print(f"  Total time: {processing_time + optimization_time:.4f} seconds")
    
    # Visualize the results
    plt.figure(figsize=(12, 6))
    
    # Collect times
    profiles_list = list(results.keys())
    processing_times = [results[p]['processing_time'] for p in profiles_list]
    optimization_times = [results[p]['optimization_time'] for p in profiles_list]
    
    # Create stacked bar chart
    bar_width = 0.6
    bar_positions = np.arange(len(profiles_list))
    
    plt.bar(bar_positions, processing_times, bar_width, label='Processing Time')
    plt.bar(bar_positions, optimization_times, bar_width, bottom=processing_times, label='Optimization Time')
    
    # Add labels and title
    plt.xlabel('Performance Profile')
    plt.ylabel('Execution Time (seconds)')
    plt.title('Performance Comparison of Different Profiles')
    plt.xticks(bar_positions, profiles_list)
    plt.legend()
    
    # Add total time labels
    for i, profile in enumerate(profiles_list):
        total_time = results[profile]['total_time']
        plt.text(i, total_time + 0.05, f"{total_time:.2f}s", ha='center')
    
    plt.tight_layout()
    plt.show()
    
    return results

# Run the test
pipeline_results = test_configurable_pipeline()
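
The `numpy` segment representation used by the pipeline grows its buffer by doubling when it fills up. A minimal standalone sketch of that pattern follows, written as a small class so the buffer and fill index live on the instance rather than in closure variables. `SEGMENT_DTYPE` and `GrowableSegmentArray` are hypothetical names for this example; the dtype returned by the notebook's `create_segment_array` may differ.

```python
import numpy as np

# Hypothetical dtype for illustration; the notebook's create_segment_array may differ
SEGMENT_DTYPE = np.dtype([
    ("id1", np.int64), ("id2", np.int64), ("chrom", np.int8),
    ("start", np.int64), ("end", np.int64), ("cm", np.float64),
])


class GrowableSegmentArray:
    """Append-only segment buffer that doubles its capacity when full."""

    def __init__(self, capacity=4):
        self._data = np.zeros(capacity, dtype=SEGMENT_DTYPE)
        self._size = 0

    def append(self, id1, id2, chrom, start, end, cm):
        if self._size == len(self._data):
            # Double the capacity and copy the existing rows across
            grown = np.zeros(len(self._data) * 2, dtype=SEGMENT_DTYPE)
            grown[:self._size] = self._data
            self._data = grown
        self._data[self._size] = (id1, id2, chrom, start, end, cm)
        self._size += 1

    def view(self):
        """Return only the filled rows (a view, not a copy)."""
        return self._data[:self._size]


buf = GrowableSegmentArray(capacity=2)
for i in range(5):
    buf.append(1, 2, 1, i * 1000, i * 1000 + 500, 1.5)

print(len(buf.view()), buf.view()["cm"].sum())
```

Doubling keeps the amortized cost of each append constant, at the price of up to half the buffer being unused at any time.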

## Summary

In this lab, we've explored a range of performance tuning techniques for large-scale genetic genealogy applications using Bonsai v3. We've examined the computational challenges of processing large datasets and complex pedigrees, and implemented various optimization strategies.

Key concepts covered:

1. **Performance Scaling Challenges**: Understanding the computational complexity of key operations and how they scale with dataset size.

2. **Profiling and Benchmarking**: Using tools like cProfile and memory_profiler to identify bottlenecks and establish baseline metrics.

3. **Algorithmic Optimizations**: Implementing strategies like early termination, precomputation, and memoization to reduce computational complexity.

4. **Memory Optimizations**: Exploring memory-efficient data structures and streaming processing for large datasets.

5. **Precision-Performance Tradeoffs**: Making intelligent tradeoffs between precision and performance based on application requirements.
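
As a reminder of how little code memoization can require, here is a minimal sketch using the standard library's `functools.lru_cache`. The function `toy_log_likelihood` is a hypothetical stand-in (an exponential log-density with an assumed rate of `meioses / 100`), not Bonsai's likelihood model; the point is only that repeated inputs are served from the cache.

```python
from functools import lru_cache
import math


@lru_cache(maxsize=None)
def toy_log_likelihood(length_cm, meioses):
    """Toy log-density of a segment length; a stand-in for an expensive call."""
    # Assumed rate for illustration only
    rate = meioses / 100.0
    return math.log(rate) - rate * length_cm


# Repeated segment lengths are common, so many calls hit the cache
lengths = (12.5, 30.0, 12.5, 30.0, 12.5)
total = sum(toy_log_likelihood(x, 4) for x in lengths)

info = toy_log_likelihood.cache_info()
print(f"hits={info.hits}, misses={info.misses}")
```

Arguments must be hashable for `lru_cache` to work, so lists of segments would need converting to tuples before being passed to a cached function.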

These techniques can be combined and tailored to specific use cases, as demonstrated in the configurable pipeline we implemented. By applying the right combination of optimizations, Bonsai v3 can efficiently handle large-scale genetic genealogy applications with thousands of individuals and millions of IBD segments.
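
As one concrete instance of the memory-efficient data structures mentioned above, the sketch below contrasts a regular class with a `__slots__` class. `PlainSegment` and `SlottedSegment` are hypothetical, simplified stand-ins for the notebook's `IBDSegment` and `OptimizedIBDSegment`.

```python
class PlainSegment:
    """Regular class: each instance carries its own attribute dictionary."""

    def __init__(self, id1, id2, cm):
        self.id1, self.id2, self.cm = id1, id2, cm


class SlottedSegment:
    """Slotted class: attributes are stored in fixed slots, with no per-instance dict."""

    __slots__ = ("id1", "id2", "cm")

    def __init__(self, id1, id2, cm):
        self.id1, self.id2, self.cm = id1, id2, cm


plain = PlainSegment("A", "B", 12.5)
slotted = SlottedSegment("A", "B", 12.5)

# The slotted instance has no __dict__, which is where the memory saving comes from
print(hasattr(plain, "__dict__"), hasattr(slotted, "__dict__"))

# The tradeoff: attributes outside __slots__ cannot be added later
try:
    slotted.note = "extra"
except AttributeError as exc:
    print("Cannot add attribute:", exc)
```

The saving per object is small, so the benefit shows up mainly when millions of segments are held in memory at once.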

### Next Steps

To further explore performance optimization:

1. Implement and benchmark distributed processing approaches for very large datasets
2. Explore GPU acceleration for computationally intensive operations
3. Develop database integration for efficient storage and retrieval of relationship data
4. Create performance profiles for specific application scenarios
5. Implement continuous performance monitoring to identify bottlenecks in production systems


### Self-Assessment Questions

1. What is the computational complexity of performing pairwise relationship inference on a dataset with n individuals? How does this scale as n grows?

2. What are three algorithmic optimization strategies that can improve performance in Bonsai v3, and which operations would benefit most from each?

3. How can memory usage be optimized when processing large IBD segment datasets? What are the tradeoffs?

4. When might it be appropriate to use lower precision settings in relationship inference? How would you determine the appropriate precision level?

5. How would you configure a performance pipeline for analyzing a dense, endogamous population with complex relationships? What specific optimizations would you prioritize?