# Lab 28: Integration with Other Genealogical Tools

## Overview

This notebook explores Bonsai v3's integration capabilities, particularly through the DRUID algorithm in `druid.py`. We'll examine how Bonsai interfaces with external data sources and tools to create a comprehensive genetic genealogy workflow.

**Learning Objectives:**
- Understand the DRUID (Deep Relatedness Utilizing Identity by Descent) algorithm and its implementation in Bonsai v3
- Learn how to integrate Bonsai with other genealogical tools and data formats
- Explore interfaces between Bonsai and external IBD detection tools
- Implement data transformation and compatibility techniques
- Apply Bonsai in a comprehensive genetic genealogy workflow

**Prerequisites:**
- Completion of Lab 3: IBD Formats
- Completion of Lab 6: Probabilistic Relationship Inference
- Familiarity with external IBD detection tools (IBIS, hap-IBD)

**Estimated completion time:** 60-90 minutes

In [None]:
# Standard imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown
import inspect
import importlib

sys.path.append(os.path.dirname(os.getcwd()))

# Cross-compatibility setup
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite, save_results, save_plot

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")
sns.set_palette("colorblind")  # Improve accessibility with colorblind-friendly palette

# Configure plot defaults for better readability
plt.rcParams.update({
    'figure.figsize': (10, 6),
    'font.size': 12,
    'axes.labelsize': 12,
    'axes.titlesize': 14,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10
})

In [None]:
# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
def display_module_classes(module_name):
    """Display classes and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all classes
        classes = inspect.getmembers(module, inspect.isclass)
        
        # Filter classes defined in this module (not imported)
        classes = [(name, cls) for name, cls in classes if cls.__module__ == module_name]
        
        if not classes:
            print(f"No classes found in module {module_name}")
            return
            
        # Print info for each class
        for name, cls in classes:
            display(Markdown(f"### Class: {name}"))
            
            # Get docstring
            doc = inspect.getdoc(cls)
            if doc:
                display(Markdown(f"**Documentation:**\n{doc}"))
            else:
                display(Markdown("*No documentation available*"))
            
            # Get methods
            methods = inspect.getmembers(cls, inspect.isfunction)
            public_methods = [(method_name, method) for method_name, method in methods 
                             if not method_name.startswith('_')]
            
            if public_methods:
                display(Markdown("**Public Methods:**"))
                for method_name, method in public_methods:
                    sig = inspect.signature(method)
                    display(Markdown(f"- `{method_name}{sig}`"))
            else:
                display(Markdown("*No public methods*"))
            
            display(Markdown("---"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def display_module_functions(module_name):
    """Display functions and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all functions
        functions = inspect.getmembers(module, inspect.isfunction)
        
        # Filter functions defined in this module (not imported)
        functions = [(name, func) for name, func in functions if func.__module__ == module_name]
        
        if not functions:
            print(f"No functions found in module {module_name}")
            return
            
        # Filter public functions
        public_functions = [(name, func) for name, func in functions if not name.startswith('_')]
        
        if not public_functions:
            print(f"No public functions found in module {module_name}")
            return
            
        # Print info for each function
        for name, func in public_functions:                
            display(Markdown(f"### Function: {name}"))
            
            # Get signature
            sig = inspect.signature(func)
            display(Markdown(f"**Signature:** `{name}{sig}`"))
            
            # Get docstring
            doc = inspect.getdoc(func)
            if doc:
                display(Markdown(f"**Documentation:**\n{doc}"))
            else:
                display(Markdown("*No documentation available*"))
                
            display(Markdown("---"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def view_function_source(module_name, function_name):
    """Display the source code of a function"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Get the function
        func = getattr(module, function_name)
        
        # Get the source code
        source = inspect.getsource(func)
        
        # Print the source code with syntax highlighting
        display(Markdown(f"### Source code for `{function_name}`\n```python\n{source}\n```"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except AttributeError:
        print(f"Function {function_name} not found in module {module_name}")
    except Exception as e:
        print(f"Error processing function {function_name}: {e}")

def view_class_source(module_name, class_name):
    """Display the source code of a class"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Get the class
        cls = getattr(module, class_name)
        
        # Get the source code
        source = inspect.getsource(cls)
        
        # Print the source code with syntax highlighting
        display(Markdown(f"### Source code for class `{class_name}`\n```python\n{source}\n```"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except AttributeError:
        print(f"Class {class_name} not found in module {module_name}")
    except Exception as e:
        print(f"Error processing class {class_name}: {e}")

def explore_module(module_name):
    """Display a comprehensive overview of a module with classes and functions"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Module docstring
        doc = inspect.getdoc(module)
        display(Markdown(f"# Module: {module_name}"))
        
        if doc:
            display(Markdown(f"**Module Documentation:**\n{doc}"))
        else:
            display(Markdown("*No module documentation available*"))
            
        display(Markdown("---"))
        
        # Display classes
        display(Markdown("## Classes"))
        display_module_classes(module_name)
        
        # Display functions
        display(Markdown("## Functions"))
        display_module_functions(module_name)
        
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error exploring module {module_name}: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [None]:
try:
    from bonsaitree import v3
    print("✅ Successfully imported Bonsai v3 module")
    
    # Print Bonsai version information if available
    if hasattr(v3, "__version__"):
        print(f"Bonsai v3 version: {v3.__version__}")
    
    # List key submodules
    print("\nAvailable Bonsai submodules:")
    for module_name in dir(v3):
        if not module_name.startswith("_"):
            print(f"- {module_name}")
except ImportError as e:
    print(f"❌ Failed to import Bonsai v3 module: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")
    print("Make sure you've properly set up your environment with the Bonsai repository.")

## Introduction

Genetic genealogy involves integrating multiple data types and tools to create comprehensive family trees. While Bonsai v3 provides powerful pedigree reconstruction capabilities, it must often work alongside other specialized tools in a complete genetic genealogy workflow.

In this lab, we'll explore the integration capabilities of Bonsai v3, focusing on the DRUID (Deep Relatedness Utilizing Identity by Descent) algorithm. This algorithm enables Bonsai to infer relationship degrees between individuals based on IBD sharing patterns, providing a bridge between raw genetic data and pedigree structures.

**Key concepts we'll cover:**
- The DRUID algorithm and its mathematical foundations
- Integration between Bonsai and IBD detection tools
- Data format conversion and standardization
- End-to-end genetic genealogy workflows

## Part 1: The DRUID Algorithm

### Theory and Background

DRUID (Deep Relatedness Utilizing Identity by Descent) is a method for inferring the degree of relationship between two individuals based on their shared Identity-By-Descent (IBD) segments. Unlike more complex relationship inference methods that try to determine specific relationship types (e.g., aunt/nephew, first cousins), DRUID focuses on estimating the *degree* of relationship:

- Degree 1: Parent/child, full siblings
- Degree 2: Grandparent/grandchild, aunt/uncle/niece/nephew, half-siblings
- Degree 3: First cousins, great-grandparent/great-grandchild
- And so on...

DRUID works on the principle that the amount of IBD sharing between two individuals is related to their genetic distance (measured in meioses or generation steps). The algorithm uses statistical models to estimate this relationship based on:

1. **Total IBD shared**: The amount of genome shared between two individuals
2. **Expected IBD patterns**: Theoretical models of how much genome is expected to be shared at different degrees
3. **Segment length distributions**: Information about the length distribution of IBD segments

The DRUID algorithm has several advantages that make it valuable for integration with other tools:

- **Speed**: It can quickly estimate relationship degrees without complex pedigree reconstruction
- **Robustness**: Works with both phased and unphased IBD data
- **Compatibility**: Provides a standardized interface between raw IBD data and higher-level pedigree algorithms

In Bonsai v3, DRUID serves as a bridge between raw genetic data and more complex pedigree structures, allowing quick relationship assessment before detailed pedigree construction.
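
The halving principle behind the degree list can be made concrete with a few lines of Python. This is a minimal sketch, not Bonsai code: the helper name `expected_sharing` is hypothetical, and it assumes the idealized rule that the expected sharing fraction at degree `d` is `2**(-d)`.

```python
def expected_sharing(degree: int) -> float:
    """Idealized expected fraction of genome shared at a relationship degree.

    Assumes sharing halves with every additional degree (hypothetical helper).
    """
    if degree < 0:
        raise ValueError("degree must be non-negative")
    return 2.0 ** (-degree)


# Tabulate the first few degrees
for d in range(1, 6):
    print(f"Degree {d}: expected sharing ~ {expected_sharing(d):.4f}")
```

Real pairs scatter around these values, which is why the later sections work with intervals around each expected value instead of exact matches.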

### Implementation in Bonsai v3

Let's explore the implementation of the DRUID algorithm in Bonsai v3 by examining the `druid.py` module. This module contains the core functions used for relationship degree inference.

In [ ]:
# Import the druid module
try:
    from bonsaitree.v3 import druid
    print("✅ Successfully imported the druid module")

    # Print key constants used in DRUID algorithm
    from bonsaitree.v3.constants import GENOME_LENGTH, R, C, MIN_SEG_LEN
    print("\nKey constants used in DRUID algorithm:")
    print(f"GENOME_LENGTH = {GENOME_LENGTH} cM  # Autosomal genome length")
    print(f"R = {R}  # Expected recombinations per genome per meiosis")
    print(f"C = {C}  # Number of autosomes")
    print(f"MIN_SEG_LEN = {MIN_SEG_LEN} cM  # Minimum detectable segment length")

    # List the key functions in the druid module
    print("\nKey functions in the druid module:")
    druid_functions = [
        "infer_degree_generalized_druid",
        "get_conditional_king_bds",
        "get_king_bds",
        "get_total_ibd_bds_conditional",
        "get_total_ibd_bds_unconditional",
        "get_ibd_pattern_log_prob",
        "get_proximal_and_shared"
    ]

    for func_name in druid_functions:
        if hasattr(druid, func_name):
            print(f"- {func_name}")
        else:
            print(f"- {func_name} (not found)")

except ImportError as e:
    print(f"❌ Failed to import druid module: {e}")
    print("We'll continue with a theoretical discussion of the module.")

In [ ]:
# Examine the main DRUID algorithm implementation
try:
    view_function_source("bonsaitree.v3.druid", "infer_degree_generalized_druid")
except Exception as e:
    print(f"Error viewing function source: {e}")

    # If we can't view the source, provide a summary of the function
    print("\nThe 'infer_degree_generalized_druid' function infers the degree of relationship between")
    print("two ancestors based on the shared IBD between their most proximal genotyped relatives.")
    print("\nKey steps in the algorithm:")
    print("1. Find proximal genotyped relatives for each ancestor")
    print("2. Calculate the expected fraction of genome shared")
    print("3. Compare the observed genome sharing to theoretical thresholds")
    print("4. Return the estimated relationship degree")

In [ ]:
# Examine the IBD boundary functions used by DRUID
try:
    view_function_source("bonsaitree.v3.druid", "get_king_bds")
except Exception as e:
    print(f"Error viewing function source: {e}")

### The KING Method and Relationship Thresholds

The DRUID algorithm in Bonsai v3 is based on the KING method (Kinship-based INference for Gwas), which uses IBD sharing to infer relationship degrees. The key insight behind KING is that for each additional degree of separation between individuals, the expected fraction of genome shared is approximately halved.

In Bonsai's implementation, the `get_king_bds` function calculates these theoretical boundaries using the formula:

```python
deg_arr = np.arange(0, max_deg)
expon = deg_arr + 1/2
bds = 2**(-expon)
bds = 2*bds
```

Evaluated as written, entry `d` of `bds` equals 2^(0.5 - d). It is the upper boundary of the sharing interval assigned to degree `d`, not the expected sharing itself:
- Degree 0 (self): ~1.414 from the formula (adjusted to 2.0 for implementation reasons)
- Degree 1 (parent/child, full siblings): upper boundary ~0.707, expected sharing 0.5
- Degree 2 (grandparent, aunt/uncle, half-siblings): upper boundary ~0.354, expected sharing 0.25
- Degree 3 (first cousins): upper boundary ~0.177, expected sharing 0.125
- Degree 4 (first cousins once removed): upper boundary ~0.088, expected sharing 0.0625

Each boundary is the geometric mean of the expected sharing fractions of two adjacent degrees, so an observed fraction is assigned to degree `d` when it lies between entry `d + 1` and entry `d`. The DRUID algorithm uses these boundaries to determine the most likely relationship degree based on observed IBD sharing.
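
To see what the quoted formula evaluates to, the sketch below recomputes the array and checks that the idealized expected sharing `2**(-d)` lies between entries `d + 1` and `d`. The helper `king_upper_bounds` is hypothetical and written only for this lab; it mirrors the four quoted lines and is not Bonsai's `get_king_bds` itself.

```python
import numpy as np


def king_upper_bounds(max_deg: int) -> np.ndarray:
    """Recompute the boundary array from the quoted formula (hypothetical helper).

    Entry d equals 2**(0.5 - d).
    """
    deg_arr = np.arange(0, max_deg)
    expon = deg_arr + 0.5
    return 2 * 2.0 ** (-expon)


bds = king_upper_bounds(6)
for d in range(1, 5):
    expected = 2.0 ** (-d)
    # Degree d is assigned to sharing values in the interval (bds[d + 1], bds[d]]
    inside = bds[d + 1] < expected <= bds[d]
    print(f"Degree {d}: {bds[d + 1]:.4f} < {expected:.4f} <= {bds[d]:.4f} -> {inside}")
```

Because each entry is half the previous one, the intervals have equal width on a log scale, which is why sharing plots in this lab use a logarithmic y-axis.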

In [ ]:
# Visualization of KING boundaries for relationship inference
try:
    # Create a range of degrees
    max_deg = 8
    degrees = np.arange(1, max_deg + 1)

    # Upper boundary of each degree's interval, using the formula from get_king_bds
    expon = degrees + 0.5
    boundaries = 2**(-expon)
    boundaries = 2 * boundaries

    # Labels for each degree
    relationship_labels = [
        "Parent/Child, Full Siblings",
        "Grandparent, Aunt/Uncle, Half-Siblings",
        "Great-Grandparent, First Cousins",
        "First Cousins Once Removed",
        "Second Cousins",
        "Second Cousins Once Removed",
        "Third Cousins",
        "Third Cousins Once Removed"
    ]

    # Create the visualization
    plt.figure(figsize=(12, 8))

    # Plot the boundaries
    plt.bar(degrees, boundaries, width=0.6, alpha=0.7, color='steelblue')

    # Add value labels on top of each bar
    for i, v in enumerate(boundaries):
        plt.text(degrees[i], v + 0.01, f"{v:.4f}", ha='center', fontsize=10)

    # Add relationship labels
    plt.xticks(degrees, [f"Degree {d}:\n{relationship_labels[i]}" for i, d in enumerate(degrees)],
               rotation=45, ha='right', fontsize=9)

    # Add labels and title
    plt.ylabel('Upper Boundary on Fraction of Genome Shared', fontsize=12)
    plt.title('KING Boundaries for Relationship Degree Inference', fontsize=14)
    plt.grid(axis='y', alpha=0.3)

    # Add a horizontal line showing the minimum detectable IBD threshold
    min_ibd_threshold = 0.01  # Example threshold, often dependent on the detection method
    plt.axhline(y=min_ibd_threshold, color='red', linestyle='--',
                label=f'Minimum Detectable IBD Threshold ({min_ibd_threshold})')

    plt.legend()
    plt.tight_layout()
    plt.show()

except Exception as e:
    print(f"Error creating visualization: {e}")

### Exercise 1: Understanding DRUID's Relationship Inference

In this exercise, we'll simulate some IBD sharing scenarios and use the principles behind DRUID to infer relationship degrees.

**Task:** Complete the function below to estimate relationship degrees based on the fraction of genome shared, using the KING method boundaries. Then analyze how different factors might affect the accuracy of these estimates.

**Hint:** Use the KING boundary formula (2^(-(degree+0.5)) * 2) to compute the upper boundary of the sharing interval for each degree; the expected sharing for that degree lies inside the interval.

In [ ]:
# Exercise 1: Implementing a simplified DRUID-like relationship estimator
def estimate_relationship_degree(shared_ibd_fraction, max_degree=10):
    """
    Estimate the relationship degree based on the fraction of genome shared.

    Args:
        shared_ibd_fraction: The fraction of the genome shared (0.0 to 1.0)
        max_degree: Number of degrees to consider (degrees 0 to max_degree - 1)

    Returns:
        estimated_degree: The estimated relationship degree
        confidence: A confidence score (high, medium, low)
    """
    # Calculate the upper KING boundary for degrees 0 to max_degree - 1
    degrees = np.arange(0, max_degree)
    expon = degrees + 0.5
    boundaries = 2**(-expon) * 2

    # Adjust the self boundary
    boundaries[0] = 2.0

    # The estimate is the last degree whose upper boundary exceeds the observed sharing
    estimated_degree = int(np.sum(boundaries > shared_ibd_fraction)) - 1
    estimated_degree = max(0, estimated_degree)  # Ensure degree is not negative

    # Interval of sharing values assigned to the estimated degree
    upper_boundary = boundaries[estimated_degree]
    if estimated_degree < max_degree - 1:
        lower_boundary = boundaries[estimated_degree + 1]
    else:
        lower_boundary = 0

    # Distance from each boundary, normalized by the boundary interval
    boundary_interval = upper_boundary - lower_boundary
    distance_from_upper = (upper_boundary - shared_ibd_fraction) / boundary_interval
    distance_from_lower = (shared_ibd_fraction - lower_boundary) / boundary_interval

    # Values close to a boundary are ambiguous between two degrees, so confidence is low there
    min_distance = min(distance_from_upper, distance_from_lower)
    if min_distance < 0.2:
        confidence = "low"
    elif min_distance < 0.4:
        confidence = "medium"
    else:
        confidence = "high"

    return estimated_degree, confidence

# Test the function with various IBD sharing values
ibd_sharing_examples = [
    ("Parent-Child", 0.5),
    ("Full Siblings", 0.5),
    ("Half Siblings", 0.25),
    ("Grandparent-Grandchild", 0.25),
    ("First Cousins", 0.125),
    ("First Cousins Once Removed", 0.0625),
    ("Second Cousins", 0.03125),
    ("Third Cousins", 0.0078125),
    ("Fourth Cousins", 0.001953125),
    # Add some non-standard examples
    ("Between 2nd and 3rd degree", 0.18),  # Between aunt/uncle and first cousin
    ("Low sharing first cousins", 0.09),   # First cousins with lower than expected sharing
    ("High sharing second cousins", 0.05), # Second cousins with higher than expected sharing
    ("Very distant relatives", 0.003)      # Distant relatives with limited sharing
]

# Display results in a table
results = []
for desc, sharing in ibd_sharing_examples:
    degree, confidence = estimate_relationship_degree(sharing)
    results.append({
        "Relationship Description": desc,
        "IBD Fraction Shared": sharing,
        "Estimated Degree": degree,
        "Confidence": confidence
    })

# Create and display the results DataFrame
results_df = pd.DataFrame(results)
display(results_df)

# Visualize the results
plt.figure(figsize=(12, 8))

# Create the main plot with expected boundaries
max_deg = 9  # Highest degree the estimator can return with max_degree=10
degrees = np.arange(0, max_deg + 1)
expon = degrees + 0.5
boundaries = 2**(-expon) * 2
boundaries[0] = 1.0  # Adjust for visualization

# Plot the boundaries as steps (no zero values, which a log axis cannot show)
plt.step(list(degrees) + [max_deg + 1],
         list(boundaries) + [boundaries[-1]],
         where='post', linestyle='--', color='gray',
         label='KING Relationship Boundaries')

# Plot the examples with different confidence levels
confidence_colors = {"high": "green", "medium": "orange", "low": "red"}
for i, row in enumerate(results):
    degree = row["Estimated Degree"]
    sharing = row["IBD Fraction Shared"]
    confidence = row["Confidence"]
    description = row["Relationship Description"]

    plt.scatter(degree, sharing, color=confidence_colors[confidence],
                s=100, alpha=0.7, edgecolor='black', linewidth=1)

    # Add labels for selected points (avoid overcrowding)
    if i % 2 == 0 or confidence != "high":
        plt.annotate(description, (degree, sharing),
                     xytext=(5, 5), textcoords='offset points', fontsize=8)

# Add labels and legend
plt.xlabel('Relationship Degree', fontsize=12)
plt.ylabel('Fraction of Genome Shared', fontsize=12)
plt.title('Relationship Degree Estimation Based on IBD Sharing', fontsize=14)
plt.xticks(range(0, max_deg + 1))
plt.yscale('log')
plt.grid(True, alpha=0.3)

# Create custom legend for confidence levels
legend_elements = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=color, markersize=10,
                              label=f"{conf.capitalize()} Confidence")
                   for conf, color in confidence_colors.items()]
legend_elements.append(plt.Line2D([0], [0], linestyle='--', color='gray',
                                  label='KING Relationship Boundaries'))
plt.legend(handles=legend_elements, loc='best')

plt.tight_layout()
plt.show()

### Factors Affecting DRUID Accuracy

While the DRUID algorithm provides a powerful method for estimating relationship degrees, several factors can affect its accuracy:

1. **Statistical Variation in IBD Sharing**: The actual amount of genome shared between relatives varies around the expected value due to the random nature of recombination. For example, full siblings can share anywhere from ~35% to ~65% of their genome, even though the expected value is 50%.

2. **IBD Detection Limitations**: The ability to detect IBD segments depends on the quality and coverage of genetic data, as well as the performance of the IBD detection algorithm. Small segments may be missed, especially for distant relatives.

3. **Conditional vs. Unconditional Probabilities**: DRUID can use either unconditional boundaries (standard KING method) or conditional boundaries that account for the fact that we're only observing pairs with detectable IBD. The conditional approach may be more accurate for distant relationships.

4. **Genotyping Coverage**: When individuals in a pedigree aren't directly genotyped, DRUID must estimate relationships through their most proximal genotyped relatives, which introduces additional uncertainty.

5. **Population Structure**: Expected IBD sharing can vary across different populations due to factors like endogamy, founder effects, and bottlenecks.

In the Bonsai v3 implementation, the DRUID algorithm addresses some of these challenges by:

- Providing both conditional and unconditional boundary options
- Taking into account the fraction of genome shared between proximal relatives
- Scaling observed IBD appropriately when working with indirectly genotyped individuals
- Offering likelihood-based relationship assessment in addition to threshold-based methods

## Part 2: Integration with IBD Detection Tools

### Theory and Background

Before Bonsai can reconstruct pedigrees, it needs information about genetic relationships in the form of Identity-By-Descent (IBD) segments. These segments are typically detected using specialized IBD detection tools, which analyze genotype data to identify shared chromosomal regions that likely come from a common ancestor.

The most common IBD detection tools include:

1. **IBIS**: Fast IBD segment detection using unphased genotype data
2. **hap-IBD**: High-accuracy IBD detection using phased haplotypes
3. **refined-IBD**: Precise IBD detection with phased data and refined segment boundaries
4. **GERMLINE**: Fast and efficient detection for large datasets

Each of these tools produces output in different formats, which creates an integration challenge for pedigree reconstruction systems like Bonsai. To address this, Bonsai implements adapter functions that convert these various formats into its internal IBD representation.

The integration process typically involves:

1. **Data Format Conversion**: Transforming tool-specific output formats into Bonsai's internal format
2. **Segment Filtering**: Removing low-quality or very short segments that might introduce noise
3. **Segment Merging**: Combining adjacent or overlapping segments when appropriate
4. **Metadata Enrichment**: Adding additional information like segment quality scores or phase information
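
The filtering and merging steps listed above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not Bonsai's implementation: the `(chromosome, start_cM, end_cM)` tuple layout, the function name `filter_and_merge`, and the 7 cM and 0.5 cM thresholds are all hypothetical choices for one pair of individuals.

```python
from typing import List, Tuple

# Hypothetical layout for one pair's segments: (chromosome, start_cM, end_cM)
Segment = Tuple[int, float, float]


def filter_and_merge(segments: List[Segment],
                     min_len_cm: float = 7.0,
                     max_gap_cm: float = 0.5) -> List[Segment]:
    """Merge segments separated by at most max_gap_cm on the same chromosome,
    then drop merged segments shorter than min_len_cm."""
    merged: List[Segment] = []
    for chrom, start, end in sorted(segments):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap_cm:
            # Overlapping or nearly adjacent: extend the previous segment
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return [seg for seg in merged if seg[2] - seg[1] >= min_len_cm]


example = [(1, 10.0, 15.0), (1, 15.2, 22.0), (1, 40.0, 43.0), (2, 5.0, 14.0)]
print(filter_and_merge(example))
# [(1, 10.0, 22.0), (2, 5.0, 14.0)]
```

The sketch merges before filtering so that two short neighbouring pieces of one true segment are not discarded individually; a pipeline that filters first would lose the 5 cM piece on chromosome 1.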

### Implementation in Bonsai v3

Bonsai v3 uses a standardized internal format for IBD segments, which serves as the universal interface between external IBD detection tools and its pedigree reconstruction algorithms. Let's examine how Bonsai implements this integration.

In [ ]:
# Examine Bonsai's IBD segment representation
try:
    from bonsaitree.v3 import ibd
    print("✅ Successfully imported the IBD module")

    # Examine the IBDSegment class
    if hasattr(ibd, "IBDSegment"):
        view_class_source("bonsaitree.v3.ibd", "IBDSegment")
    else:
        print("IBDSegment class not found. Looking for alternative implementations...")

    # Alternatively, if the class is not available, look for relevant functions
    key_functions = [
        "get_total_ibd_between_id_sets",
        "merge_adjacent_segments",
        "get_ibd_stats_unphased"
    ]

    print("\nKey IBD processing functions:")
    for func_name in key_functions:
        if hasattr(ibd, func_name):
            print(f"- {func_name} (Found)")
        else:
            print(f"- {func_name} (Not found)")

except ImportError as e:
    print(f"❌ Failed to import IBD module: {e}")
    print("We'll continue with a theoretical discussion of IBD integration.")

### Bonsai's IBD Segment Format

Bonsai v3 uses a standardized internal representation for IBD segments, typically as a tuple or list with the following structure:

```python
# Standard IBD segment format
ibd_segment = [id1, id2, hap1, hap2, chromosome, start_pos, end_pos, length_cM]
```

Where:
- `id1`, `id2`: IDs of the two individuals sharing the segment
- `hap1`, `hap2`: Haplotype indices (0 or 1) if phased data is available (often set to -1 for unphased data)
- `chromosome`: Chromosome number
- `start_pos`, `end_pos`: Start and end positions in base pairs
- `length_cM`: Segment length in centiMorgans

This standardized format allows Bonsai to process IBD data from different sources consistently. To support integration with various IBD detection tools, Bonsai implements parser functions that convert from tool-specific formats to this internal representation.
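
One benefit of a uniform segment list is that summary statistics become simple loops. As a minimal sketch, the hypothetical helper `total_cm_per_pair` below sums `length_cM` for each unordered pair of IDs in the eight-field layout described above. It is not a Bonsai function, and it assumes the segments of a pair do not overlap, since overlapping segments would be counted twice.

```python
from collections import defaultdict


def total_cm_per_pair(segments):
    """Sum segment lengths (cM) per unordered pair of individual IDs.

    Expects rows shaped like
    [id1, id2, hap1, hap2, chromosome, start_pos, end_pos, length_cM].
    """
    totals = defaultdict(float)
    for id1, id2, _hap1, _hap2, _chrom, _start, _end, length_cm in segments:
        pair = tuple(sorted((id1, id2)))  # treat (A, B) and (B, A) as the same pair
        totals[pair] += length_cm
    return dict(totals)


segments = [
    ["A", "B", -1, -1, 1, 5_000_000, 10_000_000, 7.5],
    ["B", "A", -1, -1, 2, 15_000_000, 25_000_000, 12.5],
    ["A", "C", 0, 1, 5, 50_000_000, 60_000_000, 9.0],
]
print(total_cm_per_pair(segments))
# {('A', 'B'): 20.0, ('A', 'C'): 9.0}
```

Dividing such a total by the genome length gives the kind of sharing fraction that the boundary-based degree estimate in Part 1 takes as input.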

### Integration with Specific IBD Detection Tools

Bonsai v3 is designed to integrate with various IBD detection tools, each with its own output format. Let's examine how Bonsai handles the integration with some of the most common tools. The column layouts shown here are simplified for this lab; the actual output of each tool can contain additional columns and can differ between versions, so check the tool's documentation before writing a parser.

#### 1. IBIS Integration

IBIS is a fast IBD detection tool that works with unphased genotype data. In the simplified layout used in this lab, each IBIS segment has the following fields:

```
sample1 sample2 chromosome start_pos end_pos length_cM score
```

To convert IBIS output to Bonsai's internal format, the integration code needs to:
- Read the IBIS output file
- Set haplotype indices to -1 (unphased)
- Add the necessary fields to match Bonsai's expected format

#### 2. hap-IBD Integration

hap-IBD is a high-accuracy IBD detection tool that works with phased haplotype data. In the simplified layout used in this lab, its output looks like:

```
sample1 hap1 sample2 hap2 chromosome start_pos end_pos length_cM LOD_score
```

The integration code for hap-IBD needs to:
- Read the hap-IBD output file
- Extract the haplotype indices (hap1 and hap2)
- Format the data according to Bonsai's internal representation

#### 3. refined-IBD Integration

refined-IBD is another phased IBD detection tool that provides very precise segment boundaries. In the simplified layout used in this lab, its output format is:

```
sample1 hap1 sample2 hap2 chromosome start_pos end_pos length_cM LOD_score
```

The integration process is similar to hap-IBD, with adjustments for any differences in refined-IBD's column order.

#### 4. GERMLINE Integration

GERMLINE is a fast IBD detection tool often used for large datasets. In the simplified layout used in this lab, its output format is:

```
family1 sample1 family2 sample2 chromosome start_pos end_pos genetic_distance unit_type
```

Integrating GERMLINE output requires:
- Extracting sample IDs from the family and sample ID columns
- Converting genetic distances to centiMorgans if needed
- Setting appropriate haplotype indices based on available phasing information
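
Before a converter can run, each text line has to be split into typed fields. The sketch below does this for the simplified seven-field IBIS layout described in this lab. `parse_simplified_ibis_line` is a hypothetical helper, it assumes whitespace-delimited fields and numeric (autosomal) chromosome labels, and it would need to be adapted to the column order of a real IBIS output file.

```python
def parse_simplified_ibis_line(line: str) -> list:
    """Parse one line of the lab's simplified IBIS layout:
    sample1 sample2 chromosome start_pos end_pos length_cM score
    """
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    sample1, sample2, chrom, start, end, length_cm, score = fields
    # Convert numeric columns; assumes autosomes labelled with integers
    return [sample1, sample2, int(chrom), int(start), int(end),
            float(length_cm), float(score)]


line = "sample_001 sample_002 1 5000000 10000000 7.5 10.2"
print(parse_simplified_ibis_line(line))
# ['sample_001', 'sample_002', 1, 5000000, 10000000, 7.5, 10.2]
```

Rejecting lines with an unexpected field count up front surfaces a format mismatch immediately, instead of letting a shifted column silently end up as a segment length.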

### Example: Converting IBD Tool Output to Bonsai Format

Let's implement a simple converter function that transforms IBD segment data from different tools into Bonsai's internal format.

In [ ]:
def convert_ibd_to_bonsai_format(ibd_data, tool_name):\n    \"\"\"\n    Convert IBD data from various tools to Bonsai's internal format.\n    \n    Args:\n        ibd_data: List of IBD segments in tool-specific format\n        tool_name: Name of the IBD detection tool ('ibis', 'hapibd', 'refinedibd', 'germline')\n        \n    Returns:\n        List of IBD segments in Bonsai's internal format:\n        [id1, id2, hap1, hap2, chromosome, start_pos, end_pos, length_cM]\n    \"\"\"\n    bonsai_segments = []\n    \n    if tool_name.lower() == 'ibis':\n        # IBIS format: sample1 sample2 chromosome start_pos end_pos length_cM score\n        for segment in ibd_data:\n            sample1, sample2, chrom, start, end, length, score = segment\n            # Convert to Bonsai format with -1 for unphased haplotypes\n            bonsai_segment = [sample1, sample2, -1, -1, chrom, start, end, length]\n            bonsai_segments.append(bonsai_segment)\n            \n    elif tool_name.lower() == 'hapibd':\n        # hap-IBD format: sample1 hap1 sample2 hap2 chromosome start_pos end_pos length_cM LOD_score\n        for segment in ibd_data:\n            sample1, hap1, sample2, hap2, chrom, start, end, length, lod = segment\n            # Convert to Bonsai format with phased haplotypes\n            bonsai_segment = [sample1, sample2, hap1, hap2, chrom, start, end, length]\n            bonsai_segments.append(bonsai_segment)\n            \n    elif tool_name.lower() == 'refinedibd':\n        # refined-IBD format: sample1 hap1 sample2 hap2 chromosome start_pos end_pos length_cM LOD_score\n        for segment in ibd_data:\n            sample1, hap1, sample2, hap2, chrom, start, end, length, lod = segment\n            # Convert to Bonsai format with phased haplotypes\n            bonsai_segment = [sample1, sample2, hap1, hap2, chrom, start, end, length]\n            bonsai_segments.append(bonsai_segment)\n            \n    elif tool_name.lower() == 'germline':\n        # GERMLINE 
format: family1 sample1 family2 sample2 chromosome start_pos end_pos genetic_distance unit_type\n        for segment in ibd_data:\n            family1, sample1, family2, sample2, chrom, start, end, genetic_dist, unit = segment\n            # Extract sample IDs and convert to Bonsai format with unphased haplotypes\n            # Assume genetic_dist is already in cM\n            bonsai_segment = [sample1, sample2, -1, -1, chrom, start, end, genetic_dist]\n            bonsai_segments.append(bonsai_segment)\n            \n    else:\n        raise ValueError(f\"Unsupported IBD tool: {tool_name}\")\n        \n    return bonsai_segments\n\n# Example data for different IBD tools\nibis_example = [\n    [\"sample_001\", \"sample_002\", 1, 5000000, 10000000, 7.5, 10.2],\n    [\"sample_001\", \"sample_003\", 2, 15000000, 25000000, 12.3, 15.8],\n    [\"sample_002\", \"sample_003\", 5, 50000000, 60000000, 9.7, 11.5]\n]\n\nhapibd_example = [\n    [\"sample_001\", 0, \"sample_002\", 1, 1, 5000000, 10000000, 7.5, 20.1],\n    [\"sample_001\", 1, \"sample_003\", 0, 2, 15000000, 25000000, 12.3, 25.4],\n    [\"sample_002\", 0, \"sample_003\", 1, 5, 50000000, 60000000, 9.7, 18.9]\n]\n\n# Convert examples to Bonsai format\nbonsai_ibis = convert_ibd_to_bonsai_format(ibis_example, 'ibis')\nbonsai_hapibd = convert_ibd_to_bonsai_format(hapibd_example, 'hapibd')\n\n# Display the results\nprint(\"IBIS data converted to Bonsai format:\")\nfor segment in bonsai_ibis:\n    print(segment)\n    \nprint(\"\\nhap-IBD data converted to Bonsai format:\")\nfor segment in bonsai_hapibd:\n    print(segment)\n\n# Create DataFrames for better visualization\nibis_df = pd.DataFrame(bonsai_ibis, \n                      columns=['ID1', 'ID2', 'Hap1', 'Hap2', 'Chr', 'Start', 'End', 'Length_cM'])\nhapibd_df = pd.DataFrame(bonsai_hapibd, \n                        columns=['ID1', 'ID2', 'Hap1', 'Hap2', 'Chr', 'Start', 'End', 'Length_cM'])\n\nprint(\"\\nIBIS conversion 
result:\")\ndisplay(ibis_df)\n\nprint(\"\\nhap-IBD conversion result:\")\ndisplay(hapibd_df)
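
The records above are simplified for teaching. As a hedged sketch of reading real output: hap-IBD's documentation describes eight tab-delimited fields per line (sample 1, haplotype 1, sample 2, haplotype 2, chromosome, start, end, cM length), with haplotypes numbered 1 and 2 and no LOD column. The function name `read_hapibd_segments` and the column labels below are assumptions made for this example, not part of Bonsai.

```python
import io
import pandas as pd

# Labels chosen for this sketch; hap-IBD output has no header row.
HAPIBD_COLUMNS = ["id1", "hap1", "id2", "hap2", "chrom", "start", "end", "length_cm"]

def read_hapibd_segments(handle):
    """Read hap-IBD style output into Bonsai-format segment lists.

    `handle` may be a path (gzip is inferred from a .gz suffix) or a file-like object.
    """
    df = pd.read_csv(
        handle,
        sep="\t",
        header=None,
        names=HAPIBD_COLUMNS,
        dtype={"id1": str, "id2": str},
    )
    # hap-IBD numbers haplotypes 1 and 2; shift to the 0/1 convention used in this lab
    df["hap1"] -= 1
    df["hap2"] -= 1
    # Reorder to [id1, id2, hap1, hap2, chromosome, start_pos, end_pos, length_cM]
    ordered = df[["id1", "id2", "hap1", "hap2", "chrom", "start", "end", "length_cm"]]
    return ordered.values.tolist()

# Two made-up lines in the eight-field layout
example_text = (
    "sample_001\t1\tsample_002\t2\t1\t5000000\t10000000\t7.5\n"
    "sample_001\t2\tsample_003\t1\t2\t15000000\t25000000\t12.3\n"
)
hapibd_segments = read_hapibd_segments(io.StringIO(example_text))
for segment in hapibd_segments:
    print(segment)
```

Chromosome labels are left to pandas' type inference here, so purely numeric chromosomes come back as integers and labels such as `X` as strings.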

### Exercise 2: IBD Segment Processing\n\nIn this exercise, we'll implement post-processing functions for IBD segments that are commonly used in Bonsai's workflow after importing data from external IBD detection tools.\n\n**Task:** Complete the functions below to filter and merge IBD segments according to quality criteria.\n\n**Hint:** Pay attention to both segment length and position when merging adjacent segments.

In [ ]:
def filter_segments(ibd_segments, min_length=7.0, min_snps=None):\n    \"\"\"\n    Filter IBD segments based on quality criteria.\n    \n    Args:\n        ibd_segments: List of IBD segments in Bonsai format\n        min_length: Minimum segment length in cM to keep\n        min_snps: Minimum number of SNPs in a segment to keep (optional)\n        \n    Returns:\n        Filtered list of IBD segments\n    \"\"\"\n    filtered_segments = []\n    \n    for segment in ibd_segments:\n        # Index the length field instead of unpacking exactly eight fields, so that\n        # segments carrying an optional 9th element (SNP count) are accepted too\n        length = segment[7]\n        \n        # Apply length filter\n        if length < min_length:\n            continue\n        \n        # Apply the SNP count filter only when requested and when the segment\n        # stores a SNP count as its 9th element\n        if min_snps is not None and len(segment) > 8 and segment[8] < min_snps:\n            continue\n        \n        filtered_segments.append(segment)\n    \n    return filtered_segments\n\ndef merge_adjacent_segments(ibd_segments, max_gap=2.0, phase_sensitive=True):\n    \"\"\"\n    Merge adjacent or overlapping IBD segments between the same pair of individuals.\n    \n    Args:\n        ibd_segments: List of IBD segments in Bonsai format\n        max_gap: Maximum gap in cM between segments to be merged\n        phase_sensitive: If True, only merge segments on the same haplotypes\n        \n    Returns:\n        List of merged IBD segments\n    \"\"\"\n    # Positions are in base pairs but max_gap is in cM. Without a genetic map,\n    # approximate the conversion with a genome-wide average of ~1 cM per Mb\n    # (an assumption made for this exercise; use a genetic map for real data)\n    bp_per_cm = 1_000_000\n    \n    # Group segments by individual pair and chromosome\n    segment_groups = {}\n    \n    for segment in ibd_segments:\n        id1, id2, hap1, hap2, chrom, start, end, length = segment[:8]\n        \n        # Ensure consistent ordering of IDs\n        if id1 > id2:\n            id1, id2 = id2, id1\n            hap1, hap2 = hap2, hap1\n        \n        # Create a key that groups segments by individuals, chromosome, and haplotypes if phase_sensitive\n        if phase_sensitive and hap1 >= 0 and hap2 >= 0:  # Only apply phase sensitivity if phased\n            key = (id1, id2, chrom, hap1, hap2)\n        else:\n            key = (id1, id2, chrom)\n        \n        if key not in segment_groups:\n            segment_groups[key] = []\n            \n        segment_groups[key].append(segment)\n    \n    # Process each group to merge adjacent segments\n    merged_segments = []\n    \n    for key, segments in segment_groups.items():\n        # Sort segments by start position\n        sorted_segments = sorted(segments, key=lambda x: x[5])\n        \n        # Initialize the merged segment with the first segment\n        current_merged = list(sorted_segments[0])  # Convert to list to allow modification\n        \n        for segment in sorted_segments[1:]:\n            current_end = current_merged[6]  # End position (bp)\n            next_start, next_end, next_length = segment[5], segment[6], segment[7]\n            \n            # Gap between the segments in cM (zero or negative means they overlap)\n            gap_cm = (next_start - current_end) / bp_per_cm\n            \n            if gap_cm <= max_gap:\n                if next_end > current_end:\n                    # Add only the part of the next segment that lies beyond the current\n                    # end, so overlapping stretches are not counted twice\n                    new_fraction = (next_end - max(current_end, next_start)) / (next_end - next_start)\n                    current_merged[7] += next_length * new_fraction\n                    current_merged[6] = next_end\n            else:\n                # Add the current merged segment and start a new one\n                merged_segments.append(tuple(current_merged))\n                current_merged = list(segment)\n        \n        # Add the last merged segment\n        merged_segments.append(tuple(current_merged))\n    \n    return merged_segments\n\n# Test with sample data\nsample_segments = [\n    # ID1, ID2, Hap1, Hap2, Chr, Start, End, Length\n    [\"A\", \"B\", 0, 1, 1, 1000000, 5000000, 4.0],\n    [\"A\", \"B\", 0, 1, 1, 5500000, 10000000, 4.5],  # Adjacent to first segment\n    [\"A\", \"B\", 1, 0, 1, 15000000, 20000000, 5.0],  # Different haplotype\n    [\"A\", \"C\", -1, -1, 2, 1000000, 8000000, 7.0],\n    [\"A\", \"C\", -1, -1, 2, 7500000, 12000000, 4.5],  # Overlaps with previous\n    [\"B\", \"C\", 0, 0, 3, 1000000, 3000000, 2.0],    # Below minimum length\n    [\"B\", \"C\", 0, 0, 3, 4000000, 8000000, 4.0],\n    [\"A\", \"B\", 0, 1, 1, 20000000, 30000000, 10.0]  # Same haplotype as first but gap too large\n]\n\n# Apply filtering\nfiltered_segments = filter_segments(sample_segments, min_length=3.0)\nprint(f\"Filtered segments (min length 3.0 cM): {len(filtered_segments)}\")\n\n# Apply merging\nmerged_segments_phase_sensitive = merge_adjacent_segments(filtered_segments, max_gap=2.0, phase_sensitive=True)\nmerged_segments_phase_insensitive = merge_adjacent_segments(filtered_segments, max_gap=2.0, phase_sensitive=False)\n\nprint(f\"Merged segments (phase-sensitive): {len(merged_segments_phase_sensitive)}\")\nprint(f\"Merged segments (phase-insensitive): {len(merged_segments_phase_insensitive)}\")\n\n# Display results\ndef display_segment_comparison(original, filtered, merged_sensitive, merged_insensitive):\n    # Convert to DataFrames for better visualization\n    columns = ['ID1', 'ID2', 'Hap1', 'Hap2', 'Chr', 'Start', 'End', 'Length_cM']\n    \n    original_df = pd.DataFrame(original, columns=columns)\n    filtered_df = pd.DataFrame(filtered, columns=columns)\n    merged_sensitive_df = pd.DataFrame(merged_sensitive, columns=columns)\n    merged_insensitive_df = pd.DataFrame(merged_insensitive, columns=columns)\n    \n    print(\"\\nOriginal Segments:\")\n    display(original_df)\n    \n    print(\"\\nFiltered Segments (min length 3.0 cM):\")\n    display(filtered_df)\n    \n    print(\"\\nMerged Segments (Phase-Sensitive):\")\n    display(merged_sensitive_df)\n    \n    print(\"\\nMerged Segments (Phase-Insensitive):\")\n    display(merged_insensitive_df)\n\n# Display the comparison\ndisplay_segment_comparison(\n    sample_segments,\n    filtered_segments,\n    merged_segments_phase_sensitive,\n    merged_segments_phase_insensitive\n)
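
Segment positions are stored in base pairs while the merge threshold is expressed in cM, so the two cannot be compared directly. The sketch below shows one way to measure a gap in cM by linear interpolation on a genetic map. The map values are invented for illustration, and `bp_to_cm` and `gap_in_cm` are hypothetical helper names, not Bonsai functions.

```python
import numpy as np

# Hypothetical genetic map for one chromosome:
# physical position (bp) and the matching genetic position (cM)
map_bp = np.array([0, 5_000_000, 10_000_000, 20_000_000])
map_cm = np.array([0.0, 6.0, 10.0, 25.0])

def bp_to_cm(position_bp, map_bp, map_cm):
    """Linearly interpolate a genetic position (cM) from a physical position (bp)."""
    return float(np.interp(position_bp, map_bp, map_cm))

def gap_in_cm(end_bp, next_start_bp, map_bp, map_cm):
    """Genetic distance between the end of one segment and the start of the next."""
    return bp_to_cm(next_start_bp, map_bp, map_cm) - bp_to_cm(end_bp, map_bp, map_cm)

# Gap between a segment ending at 5.0 Mb and one starting at 5.5 Mb
gap = gap_in_cm(5_000_000, 5_500_000, map_bp, map_cm)
print(f"Gap: {gap:.2f} cM")
```

Because recombination rates vary along a chromosome, the same physical gap can correspond to very different genetic distances, which is why a map lookup is preferable to a fixed bp-per-cM factor when a map is available.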

## Part 3: End-to-End Genetic Genealogy Workflows\n\n### Theory and Background\n\nIn real-world applications, Bonsai is typically part of a larger genetic genealogy workflow that includes multiple stages and tools. A complete end-to-end workflow might include:\n\n1. **Raw Genetic Data Processing**: Converting raw genetic data (e.g., from consumer DNA testing companies) into a standard format\n2. **Quality Control**: Filtering out low-quality or problematic SNPs/variants\n3. **Phasing**: Determining which alleles at heterozygous sites were inherited together on the same parental chromosome (haplotype)\n4. **IBD Detection**: Identifying IBD segments shared between individuals\n5. **Relationship Inference**: Estimating the relationships between pairs of individuals\n6. **Pedigree Reconstruction**: Building family trees based on inferred relationships\n7. **Visualization and Analysis**: Presenting the results in an interpretable format\n\nBonsai v3 focuses on steps 5 and 6 and includes integration capabilities for steps 4 and 7. The end-to-end workflow requires a coordinated data flow between these stages, with appropriate data format conversions at each step.
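
One practical way to keep the data flow between stages coordinated is to validate records at each hand-off. The following is a minimal sketch of a check at the boundary between IBD detection (step 4) and relationship inference (step 5), assuming the eight-field segment layout used earlier in this lab with `-1` marking unphased haplotypes. `validate_bonsai_segment` is a hypothetical helper, not a Bonsai function.

```python
def validate_bonsai_segment(segment):
    """Return a list of problems found in one segment (an empty list means it passed)."""
    if len(segment) < 8:
        return [f"expected at least 8 fields, got {len(segment)}"]

    problems = []
    id1, id2, hap1, hap2, chrom, start, end, length_cm = segment[:8]

    if id1 == id2:
        problems.append("segment pairs an individual with itself")
    if hap1 not in (-1, 0, 1) or hap2 not in (-1, 0, 1):
        problems.append("haplotype index must be -1 (unphased), 0, or 1")
    if start >= end:
        problems.append("start position must be smaller than end position")
    if length_cm <= 0:
        problems.append("length in cM must be positive")
    return problems

good = ["sample_001", "sample_002", -1, -1, 1, 5000000, 10000000, 7.5]
bad = ["sample_001", "sample_001", 2, 0, 1, 10000000, 5000000, 0.0]

print(validate_bonsai_segment(good))
for problem in validate_bonsai_segment(bad):
    print("-", problem)
```

Running a check like this right after format conversion surfaces tool-specific quirks (for example, 1-based haplotype indices) before they reach relationship inference.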

### Implementation in Bonsai v3\n\nBonsai v3 provides several functions and modules that facilitate its integration into end-to-end genetic genealogy workflows. Let's examine some of the key integration points:

In [ ]:
# Examine Bonsai's end-to-end workflow integration\nprint(\"Key integration points in Bonsai v3:\")\nprint(\"\\n1. IBD Data Import and Export\")\nprint(\"   - Functions for loading and saving IBD segments\")\nprint(\"   - Format conversion utilities for various IBD detection tools\")\nprint(\"   - Filters and quality control for IBD segments\")\n\nprint(\"\\n2. Demographic Data Integration\")\nprint(\"   - Age and sex information integration\")\nprint(\"   - Historical records integration\")\nprint(\"   - Population-specific parameter adjustments\")\n\nprint(\"\\n3. Visualization and Reporting\")\nprint(\"   - Pedigree rendering with Graphviz\")\nprint(\"   - Relationship confidence scoring\")\nprint(\"   - Interactive exploration of alternative pedigree structures\")\n\nprint(\"\\n4. External Tool Interfaces\")\nprint(\"   - DRUID algorithm for KING-like relationship degree inference\")\nprint(\"   - VCF file processing utilities\")\nprint(\"   - Network analysis and community detection integration\")\n\n# Check for the run_bonsai.py script, which demonstrates end-to-end usage\nimport os\nfrom itertools import islice\n\ntry:\n    # Assuming the script is in the scripts_work directory\n    script_path = \"/home/lakishadavid/computational_genetic_genealogy/scripts_work/run_bonsai.py\"\n    \n    # Check for the file with the standard library so the cell does not depend on shell utilities\n    if os.path.isfile(script_path):\n        print(\"\\nFound run_bonsai.py script for end-to-end workflow example:\")\n        print(f\"{script_path} ({os.path.getsize(script_path)} bytes)\")\n        \n        # Show the first 20 lines to get an overview\n        with open(script_path, \"r\") as script_file:\n            preview = \"\".join(islice(script_file, 20))\n        print(\"\\nPreview of run_bonsai.py:\")\n        print(preview)\n    else:\n        print(\"\\nCould not find run_bonsai.py script. Using theoretical workflow instead.\")\n        \nexcept Exception as e:\n    print(f\"\\nError checking for run_bonsai.py: {e}\")\n    print(\"Will proceed with theoretical workflow example.\")

In [ ]:
# Create a conceptual flowchart of the end-to-end genetic genealogy workflow\nimport matplotlib.pyplot as plt\nimport matplotlib.patches as mpatches\nimport numpy as np\n\n# Define the workflow components\ncomponents = [\n    \"Raw Genetic Data\\n(VCF, 23andMe, etc.)\",\n    \"Quality Control\\nFilter & Normalize\",\n    \"Phasing\\n(Beagle, ShapeIT, etc.)\",\n    \"IBD Detection\\n(IBIS, hap-IBD, etc.)\",\n    \"Relationship Inference\\n(DRUID)\",\n    \"Pedigree Reconstruction\\n(Bonsai Core)\",\n    \"Visualization & Analysis\\n(Graphviz, NetworkX)\"\n]\n\n# Define component categories and colors\ncomponent_types = [\n    \"External Tool\",   # Raw data\n    \"External Tool\",   # QC\n    \"External Tool\",   # Phasing\n    \"External Tool\",   # IBD Detection\n    \"Bonsai Module\",   # Relationship\n    \"Bonsai Module\",   # Pedigree\n    \"Integration Point\" # Viz\n]\n\ncolors = {\n    \"External Tool\": \"#ADD8E6\",  # Light blue\n    \"Bonsai Module\": \"#90EE90\",  # Light green\n    \"Integration Point\": \"#FFFACD\"  # Light yellow\n}\n\n# Create the figure\nfig, ax = plt.subplots(figsize=(12, 8))\n\n# Set up the plot\nax.set_xlim(0, 12)\nax.set_ylim(0, 10)\nax.axis('off')\n\n# Draw component boxes\nbox_width = 2.5\nbox_height = 1.2\nx_positions = [1.5, 3.5, 5.5, 7.5, 9.5, 7.5, 5.5]\ny_positions = [8, 6, 4, 2, 4, 6, 8]\n\nfor i, (component, comp_type) in enumerate(zip(components, component_types)):\n    x = x_positions[i]\n    y = y_positions[i]\n    \n    # Draw box with the category fill color; use facecolor rather than color,\n    # because color would override the black edge set below\n    rect = mpatches.Rectangle((x - box_width/2, y - box_height/2), \n                             box_width, box_height, \n                             facecolor=colors[comp_type], \n                             alpha=0.8, \n                             edgecolor='black', \n                             linewidth=1)\n    ax.add_patch(rect)\n    \n    # Add text\n    ax.text(x, y, component, ha='center', va='center', fontsize=10)\n\n# Draw arrows connecting 
components\narrows = [\n    (0, 1),  # Raw data to QC\n    (1, 2),  # QC to Phasing\n    (2, 3),  # Phasing to IBD Detection\n    (3, 4),  # IBD Detection to Relationship Inference\n    (4, 5),  # Relationship Inference to Pedigree Reconstruction\n    (5, 6)   # Pedigree Reconstruction to Visualization\n]\n\narrow_props = dict(arrowstyle=\"->\", linewidth=1.5, color='gray')\n\nfor start, end in arrows:\n    start_x = x_positions[start] + box_width/2 if x_positions[start] < x_positions[end] else x_positions[start] - box_width/2\n    start_y = y_positions[start]\n    \n    end_x = x_positions[end] - box_width/2 if x_positions[end] > x_positions[start] else x_positions[end] + box_width/2\n    end_y = y_positions[end]\n    \n    # If moving diagonally, use a curved arrow\n    if start_y != end_y:\n        connection_style = \"arc3,rad=0.2\"\n    else:\n        connection_style = \"arc3,rad=0\"\n    \n    arrow = mpatches.FancyArrowPatch((start_x, start_y), (end_x, end_y), \n                                  connectionstyle=connection_style, **arrow_props)\n    ax.add_patch(arrow)\n\n# Add integration annotations, each placed midway between the two boxes it connects\nintegration_points = [\n    (8.5, 3, \"Data Format\\nConversion\"),  # Between IBD Detection and DRUID\n    (6.5, 7, \"Pedigree\\nVisualization\"),  # Between Pedigree and Viz\n    (8.5, 5, \"Parameter\\nAdjustment\")     # Between Relationship and Pedigree\n]\n\nfor x, y, text in integration_points:\n    # Draw a star to indicate integration point\n    ax.scatter(x, y, marker='*', s=150, color='orange', edgecolor='black', zorder=10)\n    \n    # Add text\n    ax.text(x, y - 0.5, text, ha='center', va='center', fontsize=8, \n           bbox=dict(boxstyle=\"round,pad=0.3\", facecolor='white', alpha=0.7, edgecolor='orange'))\n\n# Add legend\nlegend_elements = [mpatches.Patch(color=color, label=label) \n                  for label, color in colors.items()]\nlegend_elements.append(plt.Line2D([0], [0], marker='*', color='w', \n                            
  markerfacecolor='orange', markersize=15, label='Integration Point'))\n\nax.legend(handles=legend_elements, loc='upper center', bbox_to_anchor=(0.5, 0.05), \n          ncol=len(colors) + 1, frameon=True, fancybox=True, shadow=True)\n\n# Add title\nax.set_title('End-to-End Genetic Genealogy Workflow', fontsize=14, pad=20)\n\nplt.tight_layout()\nplt.show()

### Exercise 3: Designing an End-to-End Workflow\n\nIn this exercise, you'll design a high-level workflow for a genetic genealogy application that integrates Bonsai with other tools.\n\n**Task:** Complete the workflow implementation below, which outlines the main steps in a genetic genealogy pipeline from raw data to visualized pedigrees.\n\n**Hint:** Consider what data formats and transformations are needed at each step of the workflow.

In [ ]:
import os\nfrom typing import List, Dict, Tuple, Optional, Any\n\nclass GeneticGenealogyPipeline:\n    \"\"\"A class implementing an end-to-end genetic genealogy workflow\"\"\"\n    \n    def __init__(self, input_dir: str, output_dir: str, min_segment_cm: float = 7.0):\n        \"\"\"Initialize the workflow with input/output directories and parameters\"\"\"\n        self.input_dir = input_dir\n        self.output_dir = output_dir\n        self.min_segment_cm = min_segment_cm\n        self.demographic_data = {}\n        self.ibd_segments = []\n        \n        # Create output directory if it doesn't exist\n        os.makedirs(output_dir, exist_ok=True)\n        \n        print(f\"Initialized pipeline with min segment length: {min_segment_cm} cM\")\n        \n    def load_raw_genetic_data(self, file_format: str = 'vcf') -> None:\n        \"\"\"Load and preprocess raw genetic data\"\"\"\n        print(\"1. Loading raw genetic data...\")\n        \n        # Implement data loading based on format\n        if file_format == 'vcf':\n            print(f\"   Loading VCF files from {self.input_dir}\")\n            # In a real implementation, this would use PyVCF or similar\n            print(\"   Preprocessing VCF files (filtering variants, etc.)\")\n        elif file_format == '23andme':\n            print(f\"   Loading 23andMe files from {self.input_dir}\")\n            # Process consumer DNA test files\n        else:\n            raise ValueError(f\"Unsupported file format: {file_format}\")\n            \n        print(\"   Raw genetic data loaded and preprocessed\")\n    \n    def perform_quality_control(self) -> None:\n        \"\"\"Apply quality control filters to genetic data\"\"\"\n        print(\"\\n2. 
Performing quality control...\")\n        # Implement quality control steps:\n        print(\"   Checking variant call rates\")\n        print(\"   Removing low-quality SNPs\")\n        print(\"   Checking for genetic relatedness outliers\")\n        print(\"   Quality control complete\")\n    \n    def phase_genotypes(self, phasing_tool: str = 'beagle') -> None:\n        \"\"\"Phase genotypes using external phasing tool\"\"\"\n        print(\"\\n3. Phasing genotypes...\")\n        \n        if phasing_tool == 'beagle':\n            print(\"   Using Beagle for phasing\")\n            # In a real implementation, this would call Beagle via subprocess\n        elif phasing_tool == 'shapeit':\n            print(\"   Using ShapeIT for phasing\")\n        else:\n            raise ValueError(f\"Unsupported phasing tool: {phasing_tool}\")\n            \n        print(\"   Phasing complete\")\n    \n    def detect_ibd_segments(self, ibd_tool: str = 'ibis') -> None:\n        \"\"\"Detect IBD segments using specified tool\"\"\"\n        print(\"\\n4. 
Detecting IBD segments...\")\n        \n        # Different tools for phased vs unphased data\n        if ibd_tool == 'ibis':\n            print(\"   Using IBIS for IBD detection (works with unphased data)\")\n            # Call IBIS via subprocess\n        elif ibd_tool == 'hapibd':\n            print(\"   Using hap-IBD for IBD detection (requires phased data)\")\n            # Call hap-IBD via subprocess\n        elif ibd_tool == 'germline':\n            print(\"   Using GERMLINE for IBD detection\")\n        else:\n            raise ValueError(f\"Unsupported IBD detection tool: {ibd_tool}\")\n            \n        # Convert tool output to Bonsai format\n        self.ibd_segments = self._convert_ibd_output(ibd_tool)\n        print(f\"   Detected {len(self.ibd_segments)} IBD segments\")\n        print(f\"   Filtering segments by minimum length: {self.min_segment_cm} cM\")\n        # Apply filtering\n        self.ibd_segments = [seg for seg in self.ibd_segments if seg[7] >= self.min_segment_cm]\n        print(f\"   Retained {len(self.ibd_segments)} segments after filtering\")\n    \n    def _convert_ibd_output(self, ibd_tool: str) -> List[List]:\n        \"\"\"Convert IBD tool output to Bonsai format\"\"\"\n        # Simulate some IBD segments\n        print(\"   Converting IBD tool output to Bonsai format\")\n        \n        # This would call our convert_ibd_to_bonsai_format function in a real implementation\n        # Simulated data for demonstration\n        sample_ids = [f\"sample_{i:03d}\" for i in range(1, 11)]\n        simulated_segments = []\n        \n        # Generate some random segments between samples\n        import random\n        for _ in range(30):\n            id1, id2 = random.sample(sample_ids, 2)\n            \n            # Phased or unphased based on tool\n            if ibd_tool in ['hapibd', 'refinedibd']:\n                hap1, hap2 = random.randint(0, 1), random.randint(0, 1)\n            else:\n                hap1, hap2 = -1, -1\n    
            \n            chrom = random.randint(1, 22)\n            start_pos = random.randint(1000000, 100000000)\n            end_pos = start_pos + random.randint(1000000, 10000000)\n            length_cm = random.uniform(5.0, 20.0)  # Between 5 and 20 cM\n            \n            segment = [id1, id2, hap1, hap2, chrom, start_pos, end_pos, length_cm]\n            simulated_segments.append(segment)\n            \n        return simulated_segments\n    \n    def load_demographic_data(self, demographic_file: str) -> None:\n        \"\"\"Load demographic data (ages, sex, etc.) from file\"\"\"\n        print(\"\\n5. Loading demographic data...\")\n        \n        # In a real implementation, this would read from a CSV or other file\n        print(f\"   Loading demographic data from {demographic_file}\")\n        \n        # Simulate demographic data for our samples\n        import random\n        sample_ids = set()\n        for segment in self.ibd_segments:\n            sample_ids.add(segment[0])\n            sample_ids.add(segment[1])\n        \n        self.demographic_data = {}\n        for sample_id in sample_ids:\n            self.demographic_data[sample_id] = {\n                'age': random.randint(18, 80),\n                'sex': random.choice(['M', 'F']),\n                'birth_year': random.randint(1940, 2000)\n            }\n            \n        print(f\"   Loaded demographic data for {len(self.demographic_data)} individuals\")\n    \n    def infer_relationships(self) -> Dict:\n        \"\"\"Infer relationships using DRUID algorithm\"\"\"\n        print(\"\\n6. 
Inferring relationships using DRUID...\")\n        # In a real implementation, this would use Bonsai's DRUID algorithm\n        \n        # Simulate relationship inference\n        relationship_estimates = {}\n        pair_count = 0\n        \n        # Process each pair of individuals that share IBD\n        unique_pairs = set()\n        for segment in self.ibd_segments:\n            id1, id2 = segment[0], segment[1]\n            \n            # Ensure consistent ordering for pairs\n            if id1 > id2:\n                id1, id2 = id2, id1\n                \n            pair = (id1, id2)\n            if pair not in unique_pairs:\n                unique_pairs.add(pair)\n                \n                # Calculate total IBD sharing for this pair\n                total_ibd = sum(seg[7] for seg in self.ibd_segments \n                               if (seg[0] == id1 and seg[1] == id2) or \n                                  (seg[0] == id2 and seg[1] == id1))\n                \n                # Simple relationship inference based on total IBD\n                def estimate_degree(ibd_amount):\n                    if ibd_amount >= 1700:  # ~50% of genome\n                        return 1  # Parent/child or full sibling\n                    elif ibd_amount >= 850:  # ~25% of genome\n                        return 2  # Grandparent or avuncular\n                    elif ibd_amount >= 425:  # ~12.5% of genome\n                        return 3  # First cousin\n                    elif ibd_amount >= 212:  # ~6.25% of genome\n                        return 4  # First cousin once removed\n                    elif ibd_amount >= 106:  # ~3.125% of genome\n                        return 5  # Second cousin\n                    else:\n                        return 6  # Distant relation\n                \n                degree = estimate_degree(total_ibd)\n                relationship_estimates[pair] = {\n                    'total_ibd': total_ibd,\n                    
'estimated_degree': degree,\n                    'confidence': 'high' if total_ibd > 500 else 'medium' if total_ibd > 100 else 'low'\n                }\n                pair_count += 1\n                \n        print(f\"   Inferred relationships for {pair_count} pairs of individuals\")\n        return relationship_estimates\n        \n    def reconstruct_pedigree(self, relationships: Dict) -> Dict:\n        \"\"\"Reconstruct pedigree using Bonsai\"\"\"\n        print(\"\\n7. Reconstructing pedigree...\")\n        # In a real implementation, this would use Bonsai's pedigree reconstruction\n        \n        # Simplified pedigree structure\n        # Just create a basic structure that tracks parent-child relationships\n        pedigree = {'up_dict': {}, 'down_dict': {}}\n        \n        # Start with closest relationships (lowest degrees)\n        sorted_pairs = sorted(relationships.items(), key=lambda x: x[1]['estimated_degree'])\n        \n        for (id1, id2), rel_info in sorted_pairs:\n            if rel_info['estimated_degree'] == 1 and rel_info['total_ibd'] > 2500:\n                # Likely parent-child - determine direction using age if available\n                if id1 in self.demographic_data and id2 in self.demographic_data:\n                    age1 = self.demographic_data[id1]['age']\n                    age2 = self.demographic_data[id2]['age']\n                    \n                    if age1 > age2 + 15:  # id1 is likely parent of id2\n                        if id2 not in pedigree['up_dict']:\n                            pedigree['up_dict'][id2] = {}\n                        pedigree['up_dict'][id2][id1] = 1\n                        \n                        if id1 not in pedigree['down_dict']:\n                            pedigree['down_dict'][id1] = {}\n                        pedigree['down_dict'][id1][id2] = 1\n                        \n                    elif age2 > age1 + 15:  # id2 is likely parent of id1\n                        if id1 
not in pedigree['up_dict']:\n                            pedigree['up_dict'][id1] = {}\n                        pedigree['up_dict'][id1][id2] = 1\n                        \n                        if id2 not in pedigree['down_dict']:\n                            pedigree['down_dict'][id2] = {}\n                        pedigree['down_dict'][id2][id1] = 1\n        \n        # Count the number of parent-child relationships\n        parent_child_count = sum(len(parents) for parents in pedigree['up_dict'].values())\n        print(f\"   Identified {parent_child_count} parent-child relationships\")\n        print(\"   Pedigree reconstruction complete\")\n        return pedigree\n    \n    def visualize_results(self, pedigree: Dict) -> None:\n        \"\"\"Visualize the reconstructed pedigree\"\"\"\n        print(\"\\n8. Visualizing pedigree...\")\n        # In a real implementation, this would use Graphviz through Bonsai's rendering module\n        \n        print(\"   Generating pedigree visualization\")\n        \n        # Create a simple network visualization of the pedigree\n        import networkx as nx\n        import matplotlib.pyplot as plt\n        \n        G = nx.DiGraph()\n        \n        # Add one edge per up_dict entry, drawn from parent to child\n        for child, parents in pedigree['up_dict'].items():\n            for parent in parents:\n                G.add_edge(parent, child)\n        \n        # Create a plot\n        plt.figure(figsize=(10, 8))\n        pos = nx.spring_layout(G, seed=42)\n        \n        # Draw the graph\n        nx.draw_networkx_nodes(G, pos, node_size=500, node_color='lightblue')\n        nx.draw_networkx_edges(G, pos, arrowstyle='->', arrowsize=15)\n        nx.draw_networkx_labels(G, pos, font_size=10)\n        \n        plt.title(\"Simplified Pedigree Visualization\")\n        plt.axis('off')\n        plt.tight_layout()\n        \n        # Save the figure before showing it so the reported output file actually exists\n        output_path = os.path.join(self.output_dir, \"pedigree.png\")\n        plt.savefig(output_path, dpi=150, bbox_inches=\"tight\")\n        print(f\"   Output saved to {output_path}\")\n        plt.show()\n    \n    def run_pipeline(self, input_format: str = 'vcf', ibd_tool: str = 'ibis', \n                     demographic_file: Optional[str] = None) -> None:\n        \"\"\"Run the complete end-to-end pipeline\"\"\"\n        print(\"Starting genetic genealogy pipeline...\")\n        \n        self.load_raw_genetic_data(file_format=input_format)\n        self.perform_quality_control()\n        \n        if ibd_tool in ['hapibd', 'refinedibd']:  # These tools require phased data\n            self.phase_genotypes()\n            \n        self.detect_ibd_segments(ibd_tool=ibd_tool)\n        \n        if demographic_file:\n            self.load_demographic_data(demographic_file)\n        else:\n            print(\"\\n5. Demographic data not provided, using simulated data\")\n            self.load_demographic_data(\"simulated\")\n            \n        relationships = self.infer_relationships()\n        pedigree = self.reconstruct_pedigree(relationships)\n        self.visualize_results(pedigree)\n        \n        print(\"\\nPipeline complete!\")\n\n# Run a demo of the pipeline\n# DATA_DIR and RESULTS_DIR come from setup_environment() in the setup cell, so the demo\n# writes to a directory the notebook can create rather than a root-level path\npipeline = GeneticGenealogyPipeline(\n    input_dir=DATA_DIR,\n    output_dir=RESULTS_DIR,\n    min_segment_cm=7.0\n)\n\npipeline.run_pipeline(input_format='vcf', ibd_tool='hapibd')
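
The relationship inference step in the pipeline rescans the full segment list for every new pair it meets. As a sketch of an alternative, the totals for all pairs can be accumulated in a single pass. `total_ibd_per_pair` is a hypothetical helper written for this example and assumes the eight-field Bonsai segment layout used throughout this lab.

```python
from collections import defaultdict

def total_ibd_per_pair(segments):
    """Sum segment lengths (cM) and count segments for each unordered pair of IDs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for seg in segments:
        # Sort the IDs so (A, B) and (B, A) are counted as the same pair
        pair = tuple(sorted((seg[0], seg[1])))
        totals[pair] += seg[7]
        counts[pair] += 1
    return {
        pair: {"total_ibd": totals[pair], "n_segments": counts[pair]}
        for pair in totals
    }

demo_segments = [
    ["A", "B", -1, -1, 1, 1000000, 9000000, 7.5],
    ["B", "A", -1, -1, 4, 2000000, 14000000, 10.0],
    ["A", "C", -1, -1, 2, 3000000, 11000000, 8.0],
]

pair_summary = total_ibd_per_pair(demo_segments)
for pair, stats in sorted(pair_summary.items()):
    print(pair, stats)
```

The single pass scales linearly with the number of segments, which matters once a dataset holds many more pairs than the 30 simulated segments in the demo.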

## Summary

In this lab, we explored the integration capabilities of Bonsai v3, particularly through the DRUID algorithm and its interfaces with external genetic genealogy tools. Key takeaways include:

1. **The DRUID Algorithm**: A powerful method for quickly estimating relationship degrees based on IBD sharing, providing a bridge between raw genetic data and detailed pedigree reconstruction.

2. **IBD Format Integration**: Bonsai's ability to work with IBD segments from various detection tools (IBIS, hap-IBD, GERMLINE) through standardized format conversion and post-processing.

3. **End-to-End Workflows**: The complete pipeline from raw genetic data to visualized pedigrees, with Bonsai serving as the core pedigree reconstruction component.

4. **Integration Best Practices**: Strategies for effective integration, including data format standardization, quality control, and appropriate parameter adjustments for different data sources.

These integration capabilities make Bonsai v3 a versatile tool for genetic genealogy research, capable of working with various data sources and complementary software tools to create comprehensive family tree reconstructions.

### Connections to Other Labs

The concepts covered in this lab connect to:
- **Lab 3: IBD Formats**: The foundational understanding of IBD formats explored in that lab is essential for the integration capabilities covered here.
- **Lab 6: Probabilistic Relationship Inference**: The DRUID algorithm provides a complementary approach to the probabilistic methods covered in that lab.
- **Lab 21: Pedigree Rendering**: The visualization aspects discussed here connect with the more detailed rendering techniques covered in that lab.

### Further Reading

To deepen your understanding of these topics, consider exploring:

- Manichaikul, A., et al. (2010). "Robust relationship inference in genome-wide association studies." *Bioinformatics*, 26(22), 2867-2873. (The basis for the KING method)
- Browning, B. L., & Browning, S. R. (2013). "Improving the accuracy and efficiency of identity-by-descent detection in population data." *Genetics*, 194(2), 459-471. (The basis for refined-IBD)
- Zhou, Y., et al. (2020). "Robust genome-wide ancestry inference for heterogeneous datasets: illustrated using the 1,000 genomes project." *Bioinformatics*, 36(9), 2851-2859. (A modern application of IBD in genomics)
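
As an illustration of the format standardization mentioned in takeaway 2, the following is a minimal sketch of converting one line of hap-IBD output into the 8-field segment tuple used in this lab's exercises `(id1, id2, hap1, hap2, chrom, start, end, length_cm)`. The function name `parse_hapibd_line` is hypothetical and is not part of Bonsai. The sketch assumes the standard 8-column hap-IBD layout (sample 1, haplotype 1, sample 2, haplotype 2, chromosome, start, end, cM length) and assumes that the lab's haplotype indices are 0-based, so the 1/2 indices written by hap-IBD are shifted down by one.

```python
from typing import Tuple

# Assumed lab convention: (id1, id2, hap1, hap2, chrom, start, end, length_cm)
Segment = Tuple[str, str, int, int, str, int, int, float]


def parse_hapibd_line(line: str) -> Segment:
    """Convert one whitespace-delimited hap-IBD output line to a segment tuple.

    Hypothetical helper for illustration only; it is not a Bonsai function.
    """
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, found {len(fields)}")

    id1, hap1, id2, hap2, chrom, start, end, length_cm = fields

    # hap-IBD writes haplotype indices as 1 or 2; shifting them to 0/1 is an
    # assumption about the internal convention, not a documented requirement.
    return (id1, id2, int(hap1) - 1, int(hap2) - 1,
            chrom, int(start), int(end), float(length_cm))


example_line = "sampleA\t1\tsampleB\t2\t1\t1000000\t9000000\t8.5"
segment = parse_hapibd_line(example_line)
print(segment)
```

The chromosome is kept as a string here so that labels such as `X` survive the conversion; cast it to `int` if the downstream code expects numeric chromosomes.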

## Self-Assessment Questions

Test your understanding with these questions:

1. How does the DRUID algorithm differ from more detailed relationship inference methods in Bonsai?

2. What are the main challenges when integrating IBD data from different detection tools, and how does Bonsai address them?

3. Why is phase information important when merging adjacent IBD segments, and how does it affect the results?

4. In an end-to-end genetic genealogy workflow, what role does demographic data (age, sex) play in improving pedigree reconstruction?

5. What are the key integration points between Bonsai and external tools in a complete genetic genealogy pipeline?

*Answers to self-assessment questions can be found at the end of the lab document.*

---

## Answer Key (for instructors)

### Exercise 1
```python
import numpy as np

def estimate_relationship_degree(shared_ibd_fraction, max_degree=10):
    # Calculate KING-style boundaries for degrees 0 to max_degree - 1
    degrees = np.arange(0, max_degree)
    expon = degrees + 0.5
    boundaries = 2**(-expon) * 2

    # Adjust the self boundary
    boundaries[0] = 2.0

    # The estimated degree is the index of the last boundary that still
    # exceeds the observed sharing
    estimated_degree = int(np.sum(boundaries > shared_ibd_fraction)) - 1
    estimated_degree = max(0, estimated_degree)  # Ensure degree is not negative

    # Boundaries of the interval that contains the observed sharing
    if estimated_degree < max_degree - 1:
        upper_boundary = boundaries[estimated_degree]
        lower_boundary = boundaries[estimated_degree + 1]
    else:
        upper_boundary = boundaries[estimated_degree]
        lower_boundary = 0

    # Distance from each boundary, normalized by the boundary interval
    boundary_interval = upper_boundary - lower_boundary
    distance_from_upper = (upper_boundary - shared_ibd_fraction) / boundary_interval
    distance_from_lower = (shared_ibd_fraction - lower_boundary) / boundary_interval

    # Determine confidence level: sharing close to a boundary is ambiguous
    # between two adjacent degrees, so confidence grows with the distance
    # to the nearest boundary
    min_distance = min(distance_from_upper, distance_from_lower)
    if min_distance < 0.2:
        confidence = "low"
    elif min_distance < 0.4:
        confidence = "medium"
    else:
        confidence = "high"

    return estimated_degree, confidence
```

### Exercise 2
```python
def filter_segments(ibd_segments, min_length=7.0, min_snps=None):
    filtered_segments = []

    for segment in ibd_segments:
        # Fields: id1, id2, hap1, hap2, chrom, start, end, length[, snp_count]
        length = segment[7]

        # Apply length filter
        if length >= min_length:
            # Add SNP count filter if provided
            if min_snps is not None and len(segment) > 8:  # Assuming SNP count is stored as 9th element
                snp_count = segment[8]  # Extract SNP count if available
                if snp_count >= min_snps:
                    filtered_segments.append(segment)
            else:
                filtered_segments.append(segment)

    return filtered_segments

def merge_adjacent_segments(ibd_segments, max_gap=2.0, phase_sensitive=True):
    # Group segments by individual pair and chromosome
    segment_groups = {}

    for segment in ibd_segments:
        # Slice so that segments carrying an optional 9th field still unpack
        id1, id2, hap1, hap2, chrom, start, end, length = segment[:8]

        # Ensure consistent ordering of IDs
        if id1 > id2:
            id1, id2 = id2, id1
            hap1, hap2 = hap2, hap1

        # Create a key that groups segments by individuals, chromosome, and haplotypes if phase_sensitive
        if phase_sensitive and hap1 >= 0 and hap2 >= 0:  # Only apply phase sensitivity if phased
            key = (id1, id2, chrom, hap1, hap2)
        else:
            key = (id1, id2, chrom)

        if key not in segment_groups:
            segment_groups[key] = []

        segment_groups[key].append(segment)

    # Process each group to merge adjacent segments
    merged_segments = []

    for key, segments in segment_groups.items():
        # Sort segments by start position
        sorted_segments = sorted(segments, key=lambda x: x[5])

        # Initialize the merged segment with the first segment
        current_merged = list(sorted_segments[0])  # Convert to list to allow modification

        for segment in sorted_segments[1:]:
            # Get current end and next start positions
            current_end = current_merged[6]  # End position
            next_start = segment[5]  # Start position
            next_length = segment[7]  # Length in cM

            # Calculate gap or overlap (same units as the start/end fields)
            gap = next_start - current_end

            # If positions overlap or gap is small enough, merge segments
            if gap <= max_gap:
                # Update end position and length for the merged segment
                current_merged[6] = max(current_merged[6], segment[6])  # Take the maximum end position
                # Summing lengths is an approximation: it ignores the gap and
                # double-counts any overlap between the two segments
                current_merged[7] = current_merged[7] + next_length
            else:
                # Add the current merged segment and start a new one
                merged_segments.append(tuple(current_merged))
                current_merged = list(segment)

        # Add the last merged segment
        merged_segments.append(tuple(current_merged))

    return merged_segments
```

### Exercise 3
The GeneticGenealogyPipeline class is already implemented in the lab with comprehensive functionality for an end-to-end workflow.

### Self-Assessment Answers

1. How does the DRUID algorithm differ from more detailed relationship inference methods in Bonsai?
   * Answer: DRUID focuses on estimating the degree of relationship (e.g., 1st, 2nd, 3rd) rather than specific relationship types (e.g., aunt/nephew, first cousins). It uses simple thresholds based on total IBD sharing, making it faster but less specific than the detailed likelihood-based methods in Bonsai that incorporate segment counts, lengths, and distributions.

2. What are the main challenges when integrating IBD data from different detection tools, and how does Bonsai address them?
   * Answer: The main challenges include differences in file formats, handling of phasing information, confidence score interpretation, and segment filtering criteria. Bonsai addresses these through format conversion functions, standardized internal representation, quality filters, and segment merging algorithms that can work with both phased and unphased data.

3. Why is phase information important when merging adjacent IBD segments, and how does it affect the results?
   * Answer: Phase information indicates which parental chromosome a segment comes from. When merging segments, considering phase ensures that only segments from the same ancestral chromosome are merged. Phase-sensitive merging prevents inappropriately combining segments from different ancestral lines, which could artificially inflate total IBD measurements and lead to inaccurate relationship estimates.

4. In an end-to-end genetic genealogy workflow, what role does demographic data (age, sex) play in improving pedigree reconstruction?
   * Answer: Demographic data helps disambiguate relationship types with similar genetic signatures. Age information helps determine directionality in parent-child relationships and generational placement. Sex information helps identify impossible relationships (e.g., maternal relationships for males) and distinguishes between relationship types with similar IBD patterns but different sex constraints.

5. What are the key integration points between Bonsai and external tools in a complete genetic genealogy pipeline?
   * Answer: Key integration points include: (1) IBD segment import from detection tools like IBIS and hap-IBD, (2) demographic data import from genealogical records or user input, (3) parameter tuning interfaces for population-specific adjustments, and (4) visualization outputs through Graphviz or network analysis tools for pedigree representation.
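
To make the "simple thresholds" in Answer 1 concrete, here is a small worked sketch of the degree boundaries used in the Exercise 1 answer, `2 * 2**-(d + 0.5)` for degree `d`, with the self boundary raised to 2.0. The helper names `degree_boundaries` and `degree_from_sharing` are hypothetical and are not taken from `druid.py`; the sketch only reproduces the thresholding step and says nothing about how Bonsai's own implementation is organized.

```python
import numpy as np


def degree_boundaries(max_degree: int = 6) -> np.ndarray:
    """Upper sharing boundary for each degree 0..max_degree-1 (illustrative)."""
    degrees = np.arange(max_degree)
    boundaries = 2.0 ** (0.5 - degrees)  # same as 2 * 2**-(d + 0.5)
    boundaries[0] = 2.0                  # self boundary, as in Exercise 1
    return boundaries


def degree_from_sharing(shared_ibd_fraction: float, max_degree: int = 6) -> int:
    """Count the boundaries above the observed sharing to pick a degree."""
    boundaries = degree_boundaries(max_degree)
    return max(0, int(np.sum(boundaries > shared_ibd_fraction)) - 1)


# Expected sharing halves with each additional degree of separation
for fraction in (1.0, 0.5, 0.25, 0.125):
    print(f"sharing={fraction:<6} -> degree {degree_from_sharing(fraction)}")
```

Because each boundary sits halfway (on a log2 scale) between the expected sharing of adjacent degrees, sharing values of 0.5, 0.25, and 0.125 fall in the intervals for degrees 1, 2, and 3 respectively.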

In [ ]:
# Optional: Convert this notebook to PDF
# Uncomment and run this cell if you want to generate a PDF version

# !jupyter nbconvert --to pdf "Lab28_Integration_Tools.ipynb"