# Lab 29: End-to-End Pedigree Reconstruction Implementation

## Overview

This notebook guides you through developing a complete pedigree reconstruction pipeline using Bonsai v3. You'll learn how to integrate all components from data preparation through pedigree visualization, creating a comprehensive solution that handles real-world data challenges while producing interpretable, confidence-scored results.

**Learning Objectives:**
- Design and implement an end-to-end pedigree reconstruction pipeline
- Integrate Bonsai's core components into a cohesive workflow
- Apply appropriate preprocessing and filtering techniques for input data
- Handle real-world genetic data challenges like missing data and complex relationships
- Evaluate and interpret confidence metrics for reconstructed pedigrees
- Generate meaningful visualizations and outputs for different use cases

**Prerequisites:**
- Completion of Lab 9: Pedigree Data Structures
- Completion of Lab 16: Merging Pedigrees
- Completion of Lab 21: Pedigree Rendering
- Completion of Lab 28: Integration with Other Genealogical Tools

**Estimated completion time:** 90-120 minutes

In [None]:
# Standard imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown
import inspect
import importlib

sys.path.append(os.path.dirname(os.getcwd()))

# Cross-compatibility setup
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite, save_results, save_plot

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")
sns.set_palette("colorblind")  # Improve accessibility with colorblind-friendly palette

# Configure plot defaults for better readability
plt.rcParams.update({
    'figure.figsize': (10, 6),
    'font.size': 12,
    'axes.labelsize': 12,
    'axes.titlesize': 14,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10
})

In [None]:
# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
def display_module_classes(module_name):
    """Display classes and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all classes
        classes = inspect.getmembers(module, inspect.isclass)
        
        # Filter classes defined in this module (not imported)
        classes = [(name, cls) for name, cls in classes if cls.__module__ == module_name]
        
        if not classes:
            print(f"No classes found in module {module_name}")
            return
            
        # Print info for each class
        for name, cls in classes:
            display(Markdown(f"### Class: {name}"))
            
            # Get docstring
            doc = inspect.getdoc(cls)
            if doc:
                display(Markdown(f"**Documentation:**\n{doc}"))
            else:
                display(Markdown("*No documentation available*"))
            
            # Get methods
            methods = inspect.getmembers(cls, inspect.isfunction)
            public_methods = [(method_name, method) for method_name, method in methods 
                             if not method_name.startswith('_')]
            
            if public_methods:
                display(Markdown("**Public Methods:**"))
                for method_name, method in public_methods:
                    sig = inspect.signature(method)
                    display(Markdown(f"- `{method_name}{sig}`"))
            else:
                display(Markdown("*No public methods*"))
            
            display(Markdown("---"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def display_module_functions(module_name):
    """Display functions and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all functions
        functions = inspect.getmembers(module, inspect.isfunction)
        
        # Filter functions defined in this module (not imported)
        functions = [(name, func) for name, func in functions if func.__module__ == module_name]
        
        if not functions:
            print(f"No functions found in module {module_name}")
            return
            
        # Filter public functions
        public_functions = [(name, func) for name, func in functions if not name.startswith('_')]
        
        if not public_functions:
            print(f"No public functions found in module {module_name}")
            return
            
        # Print info for each function
        for name, func in public_functions:                
            display(Markdown(f"### Function: {name}"))
            
            # Get signature
            sig = inspect.signature(func)
            display(Markdown(f"**Signature:** `{name}{sig}`"))
            
            # Get docstring
            doc = inspect.getdoc(func)
            if doc:
                display(Markdown(f"**Documentation:**\n{doc}"))
            else:
                display(Markdown("*No documentation available*"))
                
            display(Markdown("---"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def view_function_source(module_name, function_name):
    """Display the source code of a function"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Get the function
        func = getattr(module, function_name)
        
        # Get the source code
        source = inspect.getsource(func)
        
        # Print the source code with syntax highlighting
        display(Markdown(f"### Source code for `{function_name}`\n```python\n{source}\n```"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except AttributeError:
        print(f"Function {function_name} not found in module {module_name}")
    except Exception as e:
        print(f"Error processing function {function_name}: {e}")

def view_class_source(module_name, class_name):
    """Display the source code of a class"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Get the class
        cls = getattr(module, class_name)
        
        # Get the source code
        source = inspect.getsource(cls)
        
        # Print the source code with syntax highlighting
        display(Markdown(f"### Source code for class `{class_name}`\n```python\n{source}\n```"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except AttributeError:
        print(f"Class {class_name} not found in module {module_name}")
    except Exception as e:
        print(f"Error processing class {class_name}: {e}")

def explore_module(module_name):
    """Display a comprehensive overview of a module with classes and functions"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Module docstring
        doc = inspect.getdoc(module)
        display(Markdown(f"# Module: {module_name}"))
        
        if doc:
            display(Markdown(f"**Module Documentation:**\n{doc}"))
        else:
            display(Markdown("*No module documentation available*"))
            
        display(Markdown("---"))
        
        # Display classes
        display(Markdown("## Classes"))
        display_module_classes(module_name)
        
        # Display functions
        display(Markdown("## Functions"))
        display_module_functions(module_name)
        
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error exploring module {module_name}: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [None]:
try:
    from bonsaitree import v3
    print("✅ Successfully imported Bonsai v3 module")
    
    # Print Bonsai version information if available
    if hasattr(v3, "__version__"):
        print(f"Bonsai v3 version: {v3.__version__}")
    
    # List key submodules
    print("\nAvailable Bonsai submodules:")
    for module_name in dir(v3):
        # A single check covers both "_private" and "__dunder__" names
        if not module_name.startswith("_"):
            print(f"- {module_name}")
except ImportError as e:
    print(f"❌ Failed to import Bonsai v3 module: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")
    print("Make sure you've properly set up your environment with the Bonsai repository.")

## Introduction

In previous labs, we've explored individual components of the Bonsai v3 system, from IBD processing to relationship inference and pedigree rendering. In this lab, we'll bring all these components together to build a complete end-to-end pipeline for pedigree reconstruction from genetic data.

A comprehensive pedigree reconstruction system needs to handle multiple stages:

1. **Input Data Processing**: Loading and validating genetic data, IBD segments, and demographic information
2. **Data Preparation**: Filtering, merging, and normalizing data for consistency
3. **Relationship Inference**: Estimating relationships between pairs of individuals
4. **Pedigree Construction**: Building and optimizing pedigree structures
5. **Confidence Evaluation**: Assessing the reliability of reconstructed relationships
6. **Result Visualization**: Rendering pedigrees in an interpretable format
7. **Output Generation**: Producing structured data for further analysis

By integrating these components into a cohesive pipeline, we can create a powerful tool for genetic genealogy research, ancestry analysis, and family discovery applications.

## Part 1: Pipeline Design and Architecture

### Theory and Background

Designing an effective pedigree reconstruction pipeline requires careful consideration of both the algorithm flow and data structures. A well-architected pipeline should:

1. **Be Modular**: Components should be cleanly separated with well-defined interfaces
2. **Handle Errors Gracefully**: Robust error handling and validation at each stage
3. **Support Configuration**: Allow parameters to be customized for different scenarios
4. **Scale Appropriately**: Efficiently process datasets of varying sizes
5. **Provide Feedback**: Offer visibility into progress and intermediate results

In Bonsai v3, the pedigree reconstruction process follows a progression from raw IBD segments to complete pedigrees:

```
IBD Segments → Pairwise Relationships → Small Pedigree Clusters → Merged Pedigree
```

This workflow leverages multiple Bonsai components in sequence, with each component handling a specific aspect of the reconstruction process. Let's examine how these components can be integrated into a cohesive pipeline.
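
The staged flow above can be sketched as a list of named stage functions that are run in order, with timing, progress feedback, and error reporting around each one. This is a minimal sketch under stated assumptions: `run_stages`, the stage names, and the toy segment dictionaries are hypothetical illustrations and are not part of the Bonsai v3 API.

```python
import logging
import time
from typing import Any, Callable, Dict, List, Tuple

logger = logging.getLogger("pipeline_sketch")

# Hypothetical stage interface: take the shared state dict, return an updated one.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_stages(stages: List[Tuple[str, Stage]], state: Dict[str, Any]) -> Dict[str, Any]:
    """Run each (name, stage) pair in order and record which stages completed."""
    state = dict(state)  # shallow copy so the caller's dict is not replaced
    completed: List[str] = []
    for name, stage in stages:
        start = time.perf_counter()
        try:
            state = stage(state)
        except Exception:
            # Report which stage failed, then let the caller decide what to do
            logger.exception("Stage %r failed", name)
            raise
        completed.append(name)
        logger.info("Stage %r finished in %.3f s", name, time.perf_counter() - start)
    state["completed"] = completed
    return state


def filter_segments(state: Dict[str, Any]) -> Dict[str, Any]:
    """Toy stage: keep segments at least `min_cm` long."""
    kept = [s for s in state["segments"] if s["cm"] >= state["min_cm"]]
    return {**state, "segments": kept}


def total_sharing(state: Dict[str, Any]) -> Dict[str, Any]:
    """Toy stage: sum segment lengths per unordered pair of individuals."""
    totals: Dict[Tuple[str, str], float] = {}
    for s in state["segments"]:
        pair = tuple(sorted((s["id1"], s["id2"])))
        totals[pair] = totals.get(pair, 0.0) + s["cm"]
    return {**state, "pair_totals": totals}


toy_state = {
    "min_cm": 7.0,
    "segments": [
        {"id1": "A", "id2": "B", "cm": 20.0},
        {"id1": "B", "id2": "A", "cm": 10.0},
        {"id1": "A", "id2": "C", "cm": 5.0},  # dropped by the length filter
    ],
}
result = run_stages([("filter", filter_segments), ("total", total_sharing)], toy_state)
print(result["completed"], result["pair_totals"])
```

Keeping each stage a plain function of the state makes it straightforward to test stages in isolation or to swap one implementation for another.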

### Implementation in Bonsai v3

Let's look at how Bonsai v3 implements key pipeline components through its module structure. We'll examine the most important interfaces for building an end-to-end pipeline.

In [ ]:
# Explore the bonsai.py module, which provides the main pipeline functionality
try:
    explore_module("bonsaitree.v3.bonsai")
except Exception as e:
    print(f"Error exploring bonsai module: {e}")
    print("Will continue with a theoretical discussion of the pipeline architecture.")

In [ ]:
# Look for the run_bonsai.py script, which demonstrates end-to-end usage
from itertools import islice

try:
    # Assuming the script is in the scripts_work directory
    script_path = "/home/lakishadavid/computational_genetic_genealogy/scripts_work/run_bonsai.py"

    # Check for the file with the standard library rather than shelling out to
    # `ls` and `head`, so the cell also works where those commands are unavailable
    if os.path.isfile(script_path):
        print(f"Found run_bonsai.py script for end-to-end workflow example at:\n{script_path}")

        # Show the first 20 lines to get an overview
        with open(script_path, "r", encoding="utf-8") as script_file:
            preview = "".join(islice(script_file, 20))
        print("\nPreview of run_bonsai.py:")
        print(preview)
    else:
        print("Could not find run_bonsai.py script. Using theoretical discussion instead.")

except Exception as e:
    print(f"Error checking for run_bonsai.py: {e}")
    print("Will proceed with theoretical workflow example.")

### Bonsai Pipeline Architecture

Based on the module exploration and documentation, we can outline the architecture of a complete Bonsai pedigree reconstruction pipeline:

1. **IBD Data Processing**:
   - Loading IBD segments from various formats
   - Filtering segments by length, quality, etc.
   - Converting to Bonsai's internal format

2. **Demographic Data Integration**:
   - Loading age and sex information
   - Incorporating prior relationship knowledge if available
   - Setting up demographic constraints

3. **Pairwise Relationship Inference**:
   - Computing IBD statistics between pairs
   - Calculating likelihood scores for different relationship types
   - Ranking relationship hypotheses

4. **Community Detection**:
   - Clustering individuals into related groups
   - Identifying independent pedigree components
   - Prioritizing clusters for reconstruction

5. **Pedigree Construction**:
   - Building small pedigree structures
   - Optimizing placement of individuals
   - Merging pedigree fragments

6. **Confidence Assessment**:
   - Calculating confidence scores for relationships
   - Identifying ambiguous relationships
   - Generating alternative pedigree hypotheses

7. **Visualization and Output**:
   - Rendering pedigrees with Graphviz
   - Exporting results in various formats
   - Generating summary statistics and reports
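
To make the community detection stage concrete, the sketch below groups individuals into independent components by linking any pair whose total shared IBD reaches a threshold, using a small union-find structure. This is only an illustration: the function name `connected_clusters`, the 20 cM threshold, and the pair totals are assumptions made for the example. Thresholded connected components are a simpler stand-in for a modularity-based method such as Louvain, and this is not how Bonsai v3 itself clusters individuals.

```python
from typing import Dict, List, Set, Tuple


def connected_clusters(
    pair_totals: Dict[Tuple[str, str], float],
    min_total_cm: float = 20.0,  # assumed threshold, chosen for illustration
) -> List[Set[str]]:
    """Group individuals linked by pairs sharing at least `min_total_cm`."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        """Return the representative of x's group, registering x if new."""
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for (a, b), total in pair_totals.items():
        root_a, root_b = find(a), find(b)  # registers both individuals
        if total >= min_total_cm and root_a != root_b:
            parent[root_a] = root_b  # union the two groups

    groups: Dict[str, Set[str]] = {}
    for person in list(parent):
        groups.setdefault(find(person), set()).add(person)

    # Largest clusters first, so they can be prioritized for reconstruction
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical total shared cM per pair
example_totals = {
    ("A", "B"): 1700.0,
    ("B", "C"): 850.0,
    ("C", "D"): 8.0,    # below the threshold, so D stays on its own
    ("E", "F"): 400.0,
}
clusters = connected_clusters(example_totals)
print(clusters)
```

Each resulting set is an independent component that could be reconstructed separately before any merging step.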

In [ ]:
# Create a visualization of the pipeline architecture
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.path import Path
import numpy as np

# Define the pipeline components with their dependencies and Bonsai modules
pipeline_stages = [
    {"name": "IBD Data Processing", "module": "ibd.py", "position": 1},
    {"name": "Demographic Data", "module": "demographics.py", "position": 2},
    {"name": "Relationship Inference", "module": "likelihoods.py, druid.py", "position": 3},
    {"name": "Community Detection", "module": "clusters.py", "position": 4},
    {"name": "Pedigree Construction", "module": "bonsai.py, pedigrees.py", "position": 5},
    {"name": "Confidence Assessment", "module": "confidence.py", "position": 6},
    {"name": "Visualization & Output", "module": "rendering.py", "position": 7}
]

# Create the figure and axis
fig, ax = plt.subplots(figsize=(12, 8))
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)

# Remove axis ticks and labels
ax.set_xticks([])
ax.set_yticks([])
ax.set_frame_on(False)

# Define colors and styles
box_props = {
    "facecolor": "#3498db",  # Blue
    "edgecolor": "#2c3e50",  # Dark blue
    "alpha": 0.7,
    "boxstyle": "round,pad=0.5"
}

module_props = {
    "facecolor": "#ecf0f1",  # Light gray
    "edgecolor": "#bdc3c7",  # Darker gray
    "alpha": 0.9,
    "boxstyle": "round,pad=0.3"
}

# Calculate positions for the pipeline stages
def calculate_positions(stages):
    positions = []
    max_position = max(stage["position"] for stage in stages)
    spacing = 8.0 / (max_position + 1)  # Horizontal spacing

    for stage in stages:
        x = stage["position"] * spacing + 1.0

        # Zigzag pattern for visual interest
        if stage["position"] % 2 == 0:
            y = 6.5
        else:
            y = 3.5

        positions.append((x, y))

    return positions

# Get positions for each stage
positions = calculate_positions(pipeline_stages)

# Draw boxes and module info for each stage
for i, (stage, pos) in enumerate(zip(pipeline_stages, positions)):
    x, y = pos

    # Draw the main stage box
    text_box = mpatches.FancyBboxPatch(
        (x - 1.2, y - 0.6), 2.4, 1.2,
        **box_props
    )
    ax.add_patch(text_box)

    # Add the stage name
    ax.text(x, y, stage["name"], ha='center', va='center', color='white',
            fontsize=12, fontweight='bold')

    # Add the module information below
    ax.text(x, y - 0.8, f"Module: {stage['module']}", ha='center', va='center',
            fontsize=9, bbox=module_props)

    # Add stage number above
    ax.text(x, y + 0.8, f"Stage {stage['position']}", ha='center', va='center',
            fontsize=10, fontweight='bold')

# Draw arrows connecting the stages
arrow_props = dict(arrowstyle="->", connectionstyle="arc3,rad=0.1",
                  color="#34495e", lw=2, alpha=0.7)

for i in range(len(positions) - 1):
    start = positions[i]
    end = positions[i + 1]

    # Calculate start and end points for the arrow
    if i % 2 == 0:  # Every other connection goes down
        start_point = (start[0] + 0.8, start[1] - 0.2)
        end_point = (end[0] - 0.8, end[1] - 0.2)
    else:  # The others go up
        start_point = (start[0] + 0.8, start[1] + 0.2)
        end_point = (end[0] - 0.8, end[1] + 0.2)

    arrow = mpatches.FancyArrowPatch(start_point, end_point, **arrow_props)
    ax.add_patch(arrow)

# Add data flow indicators
data_types = [
    ("IBD Segments", 1.5, 2.5),
    ("Age & Sex Data", 2.5, 2.5),
    ("Pairwise Likelihoods", 4.2, 2.5),
    ("Related Clusters", 5.8, 2.5),
    ("Pedigree Structures", 7.5, 2.5),
    ("Confidence Scores", 8.5, 5.5),
    ("Visualized Pedigrees", 9.0, 3.5)
]

for text, x, y in data_types:
    ax.text(x, y, text, ha='center', va='center', fontsize=8,
            bbox=dict(facecolor='#f39c12', alpha=0.7, boxstyle="round,pad=0.3"))

# Add title
ax.set_title("Bonsai v3 End-to-End Pedigree Reconstruction Pipeline", fontsize=14, pad=20)

# Show the plot
plt.tight_layout()
plt.show()

### Exercise 1: Designing a Pipeline Class Structure

In this exercise, you'll design a class structure for a modular pedigree reconstruction pipeline using Bonsai v3 components.

**Task:** Complete the skeleton of a `BonsaiPipeline` class that implements an end-to-end workflow with well-defined component interfaces.

**Hint:** Focus on how data flows between pipeline stages and how configuration options are handled.
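
As one possible approach to the configuration part of the hint, the sketch below merges user options over a set of defaults, rejects unknown keys, and validates a couple of values before any stage runs. The helper name `resolve_config` and the validation rules are assumptions made for illustration. The option names and default values mirror the ones used in this exercise.

```python
from typing import Any, Dict, Optional

# Defaults for the options used in this exercise
DEFAULT_CONFIG: Dict[str, Any] = {
    "input_dir": "./input",
    "output_dir": "./output",
    "min_segment_cm": 7.0,
    "use_age_info": True,
    "clustering_method": "louvain",
    "confidence_threshold": 0.8,
    "max_pedigree_size": 100,
}


def resolve_config(user_config: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a new config dict: defaults overridden by validated user options."""
    user_config = user_config or {}

    # Catch typos such as 'min_segment_cM' early instead of silently ignoring them
    unknown = set(user_config) - set(DEFAULT_CONFIG)
    if unknown:
        raise ValueError(f"Unknown configuration keys: {sorted(unknown)}")

    # Build a fresh dict so neither the defaults nor the caller's dict is mutated
    config = {**DEFAULT_CONFIG, **user_config}

    # Illustrative sanity checks (assumed rules, adjust to your data)
    if config["min_segment_cm"] <= 0:
        raise ValueError("min_segment_cm must be positive")
    if not 0.0 <= config["confidence_threshold"] <= 1.0:
        raise ValueError("confidence_threshold must be between 0 and 1")
    return config


custom = {"min_segment_cm": 10.0}
config = resolve_config(custom)
print(config["min_segment_cm"], config["clustering_method"])
```

Validating up front means a misconfigured run fails immediately rather than partway through a long reconstruction.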

In [ ]:
from typing import Dict, List, Optional, Set, Tuple, Union, Any
import os
import logging
import time

class BonsaiPipeline:
    """End-to-end pedigree reconstruction pipeline using Bonsai v3."""

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        """Initialize the pipeline with configuration options.

        Args:
            config: Dictionary of configuration options, including:
                - input_dir: Directory containing input data
                - output_dir: Directory for output files
                - min_segment_cm: Minimum IBD segment length in cM
                - use_age_info: Whether to use age information in relationship inference
                - clustering_method: Method for community detection
                - confidence_threshold: Minimum confidence score for reported relationships
                - max_pedigree_size: Maximum size of pedigrees to reconstruct
        """
        # Copy so that filling in defaults does not mutate the caller's dictionary
        self.config = dict(config or {})

        # Set default configuration values
        self.config.setdefault('input_dir', './input')
        self.config.setdefault('output_dir', './output')
        self.config.setdefault('min_segment_cm', 7.0)
        self.config.setdefault('use_age_info', True)
        self.config.setdefault('clustering_method', 'louvain')
        self.config.setdefault('confidence_threshold', 0.8)
        self.config.setdefault('max_pedigree_size', 100)

        # Create output directory if it doesn't exist
        os.makedirs(self.config['output_dir'], exist_ok=True)

        # Initialize pipeline state
        self.ibd_segments = []
        self.demographic_data = {}
        self.pairwise_relationships = {}
        self.clusters = []
        self.pedigrees = {}
        self.confidence_scores = {}

        # Set up logging
        self._setup_logging()

        self.logger.info(f"Initialized BonsaiPipeline with configuration: {self.config}")

    def _setup_logging(self):
        """Set up logging configuration."""
        # Create a logger
        self.logger = logging.getLogger("BonsaiPipeline")
        self.logger.setLevel(logging.INFO)

        # Remove handlers left over from earlier instances (for example when a
        # notebook cell is re-run) so that messages are not logged more than once
        for handler in list(self.logger.handlers):
            self.logger.removeHandler(handler)
            handler.close()

        # Create a file handler
        log_file = os.path.join(self.config['output_dir'], 'pipeline.log')
        file_handler = logging.FileHandler(log_file, mode='w')

        # Create a console handler
        console_handler = logging.StreamHandler()

        # Create a formatter and add it to the handlers
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        file_handler.setFormatter(formatter)
        console_handler.setFormatter(formatter)

        # Add handlers to the logger
        self.logger.addHandler(file_handler)
        self.logger.addHandler(console_handler)

    def load_ibd_data(self, ibd_file: str, format: str = 'bonsai'):
        """Load IBD segment data from a file.

        Args:
            ibd_file: Path to the IBD segment file
            format: Format of the IBD segment file ('bonsai', 'ibis', 'hapibd', etc.)
        """
        self.logger.info(f"Loading IBD data from {ibd_file} in {format} format")

        # TODO: Implement IBD data loading based on format
        # You would use different loaders based on the format parameter
        if format == 'bonsai':
            # Direct loading of Bonsai format
            # self.ibd_segments = load_bonsai_ibd_segments(ibd_file)
            pass
        elif format == 'ibis':
            # Convert IBIS format to Bonsai format
            # self.ibd_segments = convert_ibis_to_bonsai(ibd_file)
            pass
        elif format == 'hapibd':
            # Convert hap-IBD format to Bonsai format
            # self.ibd_segments = convert_hapibd_to_bonsai(ibd_file)
            pass
        else:
            self.logger.error(f"Unsupported IBD format: {format}")
            raise ValueError(f"Unsupported IBD format: {format}")

        # Apply filtering based on configuration
        self._filter_ibd_segments()

        self.logger.info(f"Loaded {len(self.ibd_segments)} IBD segments")

    def _filter_ibd_segments(self):
        """Filter IBD segments based on configuration criteria."""
        if not self.ibd_segments:
            return

        min_segment_cm = self.config['min_segment_cm']
        self.logger.info(f"Filtering segments by minimum length: {min_segment_cm} cM")

        # Apply length filter
        # NOTE: this assumes each segment stores its length in cM at index 7;
        # adjust the index if your loader produces a different segment layout
        original_count = len(self.ibd_segments)
        self.ibd_segments = [seg for seg in self.ibd_segments if seg[7] >= min_segment_cm]

        self.logger.info(f"Filtered {original_count - len(self.ibd_segments)} segments below {min_segment_cm} cM")

    def load_demographic_data(self, demographic_file: str):
        """Load demographic data (age, sex, etc.) from a file.

        Args:
            demographic_file: Path to the demographic data file
        """
        self.logger.info(f"Loading demographic data from {demographic_file}")

        # TODO: Implement demographic data loading
        # This would typically read from a CSV file with columns like ID, Age, Sex, etc.
        # Example:
        # import pandas as pd
        # df = pd.read_csv(demographic_file)
        # self.demographic_data = {row['ID']: {'age': row['Age'], 'sex': row['Sex']} for _, row in df.iterrows()}

        self.logger.info(f"Loaded demographic data for {len(self.demographic_data)} individuals")

    def infer_pairwise_relationships(self):
        """Infer pairwise relationships between individuals based on IBD data."""
        self.logger.info("Inferring pairwise relationships")

        # TODO: Implement relationship inference using Bonsai functions
        # This would use functions from the likelihoods module to calculate
        # log-likelihoods for different relationship hypotheses
        # Example:
        # from bonsaitree.v3 import likelihoods
        # self.pairwise_relationships = calculate_all_pairwise_likelihoods(self.ibd_segments, self.demographic_data)

        self.logger.info(f"Inferred relationships for {len(self.pairwise_relationships)} pairs")

    def detect_communities(self):
        """Detect communities of related individuals."""
        self.logger.info("Detecting communities of related individuals")

        # TODO: Implement community detection
        # This would use network analysis to group individuals into related clusters
        # Example:
        # import networkx as nx
        # G = create_relationship_graph(self.pairwise_relationships)
        # self.clusters = detect_communities(G, method=self.config['clustering_method'])

        self.logger.info(f"Detected {len(self.clusters)} communities")

    def reconstruct_pedigrees(self):
        """Reconstruct pedigrees for each community."""
        self.logger.info("Reconstructing pedigrees")

        # TODO: Implement pedigree reconstruction
        # For each cluster, use Bonsai's pedigree reconstruction algorithms
        # Example:
        # from bonsaitree.v3 import bonsai
        # for i, cluster in enumerate(self.clusters):
        #     self.pedigrees[i] = bonsai.build_pedigree(
        #         genotyped_ids=cluster,
        #         ibd_segments=self.ibd_segments,
        #         demographic_data=self.demographic_data
        #     )

        self.logger.info(f"Reconstructed {len(self.pedigrees)} pedigrees")

    def assess_confidence(self):
        """Assess confidence in reconstructed relationships."""
        self.logger.info("Assessing confidence in reconstructed relationships")

        # TODO: Implement confidence assessment
        # Calculate confidence scores for each relationship in the pedigrees
        # Example:
        # from bonsaitree.v3 import confidence
        # for ped_id, pedigree in self.pedigrees.items():
        #     self.confidence_scores[ped_id] = calculate_confidence_scores(pedigree, self.ibd_segments)

        self.logger.info("Completed confidence assessment")

    def visualize_results(self):
        """Visualize the reconstructed pedigrees."""
        self.logger.info("Visualizing pedigrees")

        # TODO: Implement pedigree visualization
        # Render pedigrees using Graphviz or NetworkX
        # Example:
        # from bonsaitree.v3 import rendering
        # for ped_id, pedigree in self.pedigrees.items():
        #     output_file = os.path.join(self.config['output_dir'], f"pedigree_{ped_id}.png")
        #     rendering.render_pedigree(pedigree, output_file, confidence_scores=self.confidence_scores.get(ped_id))

        self.logger.info("Completed pedigree visualization")

    def generate_reports(self):
        """Generate summary reports and detailed relationship information."""
        self.logger.info("Generating reports")

        # TODO: Implement report generation
        # Create summary statistics and detailed relationship reports
        # Example:
        # summary_file = os.path.join(self.config['output_dir'], "summary.csv")
        # detailed_file = os.path.join(self.config['output_dir'], "detailed_relationships.csv")
        # write_summary_report(self.pedigrees, self.confidence_scores, summary_file)
        # write_detailed_report(self.pedigrees, self.confidence_scores, detailed_file)

        self.logger.info("Completed report generation")

    def run_pipeline(self, ibd_file: str, demographic_file: Optional[str] = None):
        """Run the complete pipeline from data loading to visualization.

        Args:
            ibd_file: Path to the IBD segment file
            demographic_file: Path to the demographic data file (optional)
        """
        self.logger.info("Starting pipeline execution")
        start_time = time.time()

        # Step 1: Load IBD data
        self.load_ibd_data(ibd_file)

        # Step 2: Load demographic data if provided
        if demographic_file:
            self.load_demographic_data(demographic_file)

        # Step 3: Infer pairwise relationships
        self.infer_pairwise_relationships()

        # Step 4: Detect communities
        self.detect_communities()

        # Step 5: Reconstruct pedigrees
        self.reconstruct_pedigrees()

        # Step 6: Assess confidence
        self.assess_confidence()

        # Step 7: Visualize results
        self.visualize_results()

        # Step 8: Generate reports
        self.generate_reports()

        elapsed_time = time.time() - start_time
        self.logger.info(f"Pipeline completed in {elapsed_time:.2f} seconds")

        return {
            "pedigrees": self.pedigrees,
            "confidence_scores": self.confidence_scores,
            "execution_time": elapsed_time
        }

# Example usage
if __name__ == "__main__":
    # Create a pipeline with custom configuration
    pipeline = BonsaiPipeline({
        'input_dir': './data',
        'output_dir': './results',
        'min_segment_cm': 7.0,
        'use_age_info': True,
        'clustering_method': 'louvain',
        'confidence_threshold': 0.8,
        'max_pedigree_size': 100
    })

    # Run the pipeline
    results = pipeline.run_pipeline(
        ibd_file='./data/ibd_segments.csv',
        demographic_file='./data/demographics.csv'
    )

    print(f"Reconstructed {len(results['pedigrees'])} pedigrees")
    print(f"Pipeline execution time: {results['execution_time']:.2f} seconds")

## Part 2: IBD Processing and Relationship Inference

### Theory and Background

IBD (Identity-By-Descent) segments form the foundational data for genetic pedigree reconstruction. These segments represent regions of the genome that two individuals have inherited from a common ancestor. Properly processing and analyzing IBD segments is critical for accurate relationship inference.

In an end-to-end pipeline, the IBD processing stage typically involves:

1. **Format Standardization**: Converting tool-specific IBD formats to a common representation
2. **Quality Filtering**: Removing low-quality or unreliable segments
3. **Segment Merging**: Combining adjacent or overlapping segments when appropriate
4. **IBD Statistics Calculation**: Computing summary statistics like total IBD sharing

Once IBD segments are processed, the next step is pairwise relationship inference, which:

1. **Computes Likelihoods**: Calculates how likely different relationship types are given the observed IBD
2. **Incorporates Demographic Data**: Uses age and sex information to refine predictions
3. **Ranks Hypotheses**: Orders possible relationships by their likelihood
4. **Provides Confidence Scores**: Quantifies certainty in the inferred relationships
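
The ranking and confidence steps can be sketched as follows. This is a minimal illustration rather than Bonsai's implementation: `rank_hypotheses` is a hypothetical helper, the log-likelihood values are made up, and a flat prior over hypotheses is assumed. It normalizes per-hypothesis log-likelihoods into probabilities and sorts them from most to least likely.

```python
import math

def rank_hypotheses(log_likelihoods):
    """Rank relationship hypotheses by normalized likelihood.

    Hypothetical helper for illustration; assumes a flat prior over hypotheses.
    """
    # Subtract the maximum before exponentiating for numerical stability
    max_ll = max(log_likelihoods.values())
    weights = {h: math.exp(ll - max_ll) for h, ll in log_likelihoods.items()}
    total = sum(weights.values())
    # Sort hypotheses from most to least probable
    return sorted(((h, w / total) for h, w in weights.items()),
                  key=lambda item: item[1], reverse=True)

# Made-up log-likelihoods for one pair of individuals
example_ll = {"half sibling": -12.1, "first cousin": -10.3, "avuncular": -11.8}
ranked = rank_hypotheses(example_ll)
for hypothesis, probability in ranked:
    print(f"{hypothesis}: {probability:.3f}")
```

The normalized probability of the top hypothesis can serve as a simple confidence score; a small gap between the first and second entries signals an ambiguous pair.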

### Implementation in Bonsai v3

Let's examine how Bonsai v3 implements IBD processing and pairwise relationship inference. We'll look at key functions that our pipeline can leverage for these stages.

In [ ]:
# Explore the IBD module which handles segment processing
try:
    explore_module("bonsaitree.v3.ibd")
except Exception as e:
    print(f"Error exploring IBD module: {e}")
    print("Will continue with a theoretical discussion of IBD processing.")

In [ ]:
# Explore the likelihoods module which handles relationship inference
try:
    explore_module("bonsaitree.v3.likelihoods")
except Exception as e:
    print(f"Error exploring likelihoods module: {e}")
    print("Will continue with a theoretical discussion of relationship inference.")

### Implementing IBD Processing

Now let's implement the IBD processing components for our pipeline. This includes loading IBD segments from various formats, filtering, and merging.

In [ ]:
def load_ibd_segments(file_path, format='bonsai'):
    """Load IBD segments from a file in the specified format.

    Args:
        file_path: Path to the IBD segment file
        format: Format of the file ('bonsai', 'ibis', 'hapibd', etc.)

    Returns:
        List of IBD segments in Bonsai format:
        [id1, id2, hap1, hap2, chromosome, start_pos, end_pos, length_cM]
    """
    print(f"Loading IBD segments from {file_path} in {format} format")

    # We'll simulate loading data for demonstration purposes
    # In a real implementation, this would read from a file

    # Create some sample IBD segments
    import random
    sample_ids = [f"sample_{i:03d}" for i in range(1, 21)]
    segments = []

    # Generate random segments between samples
    for _ in range(100):
        id1, id2 = random.sample(sample_ids, 2)

        # Phased or unphased based on format
        if format in ['hapibd', 'refinedibd']:
            hap1, hap2 = random.randint(0, 1), random.randint(0, 1)
        else:
            hap1, hap2 = -1, -1

        chrom = random.randint(1, 22)
        start_pos = random.randint(1000000, 200000000)
        end_pos = start_pos + random.randint(1000000, 20000000)

        # Generate segment length in cM (simulated range: 1-30 cM)
        length_cm = random.uniform(1.0, 30.0)

        segment = [id1, id2, hap1, hap2, chrom, start_pos, end_pos, length_cm]
        segments.append(segment)

    print(f"Loaded {len(segments)} IBD segments")
    return segments

def filter_ibd_segments(segments, min_length=7.0, min_snps=None, max_gap=None):
    """Filter IBD segments based on quality criteria.

    Args:
        segments: List of IBD segments in Bonsai format
        min_length: Minimum segment length in cM
        min_snps: Minimum number of SNPs in a segment (optional; not applied in this
            simplified version because the Bonsai format carries no SNP counts)
        max_gap: Maximum gap between adjacent segments (optional; not applied here,
            gap handling is done in merge_adjacent_segments)

    Returns:
        Filtered list of IBD segments
    """
    print(f"Filtering IBD segments: min_length={min_length} cM")

    # Apply length filter (index 7 is length_cM)
    filtered_segments = [seg for seg in segments if seg[7] >= min_length]

    print(f"Filtered out {len(segments) - len(filtered_segments)} segments below length threshold")
    return filtered_segments

def merge_adjacent_segments(segments, max_gap=2.0, phase_sensitive=True):
    """Merge adjacent or overlapping IBD segments between the same pair of individuals.

    Args:
        segments: List of IBD segments in Bonsai format
        max_gap: Maximum gap in cM between segments to be merged
        phase_sensitive: If True, only merge segments on the same haplotypes

    Returns:
        List of merged IBD segments
    """
    print(f"Merging adjacent segments: max_gap={max_gap} cM, phase_sensitive={phase_sensitive}")

    # Segment coordinates are in base pairs while max_gap is in cM. Without a genetic
    # map we convert using a rough 1 cM ~ 1 Mb average (an assumption for this demo).
    bp_per_cm = 1_000_000
    max_gap_bp = max_gap * bp_per_cm

    # Group segments by individual pair, chromosome, and haplotypes if phase_sensitive
    segment_groups = {}

    for segment in segments:
        id1, id2, hap1, hap2, chrom, start, end, length = segment

        # Ensure consistent ordering of IDs
        if id1 > id2:
            id1, id2 = id2, id1
            hap1, hap2 = hap2, hap1

        # Create a key that includes haplotype information if phase_sensitive
        if phase_sensitive and hap1 >= 0 and hap2 >= 0:
            key = (id1, id2, chrom, hap1, hap2)
        else:
            key = (id1, id2, chrom)

        # Store the segment with the normalized ID and haplotype order
        segment_groups.setdefault(key, []).append(
            [id1, id2, hap1, hap2, chrom, start, end, length])

    # Process each group to merge adjacent segments
    merged_segments = []

    for key, group in segment_groups.items():
        # Sort segments by start position
        sorted_segments = sorted(group, key=lambda x: x[5])

        # Initialize with the first segment
        current = list(sorted_segments[0])  # Make a copy that we can modify

        for segment in sorted_segments[1:]:
            # Check if this segment is adjacent to the current one
            if segment[5] <= current[6] + max_gap_bp:  # start <= current_end + max_gap
                # Merge by extending the end position and length
                if segment[6] > current[6]:  # if new_end > current_end
                    # Convert the bp overlap to cM in proportion to the segment's own length
                    overlap_bp = max(0, current[6] - segment[5])
                    overlap_cm = (segment[7] * overlap_bp / (segment[6] - segment[5])
                                  if overlap_bp else 0.0)
                    current[6] = segment[6]  # Update end position
                    current[7] = current[7] + segment[7] - overlap_cm  # Update length accounting for overlap
            else:
                # Not adjacent, so add the current segment and start a new one
                merged_segments.append(tuple(current))  # Convert to tuple for immutability
                current = list(segment)  # Start a new current segment

        # Add the last segment in the group
        merged_segments.append(tuple(current))

    print(f"Merged {len(segments) - len(merged_segments)} segments")
    return merged_segments

def calculate_ibd_statistics(segments):
    """Calculate summary statistics for IBD segments.

    Args:
        segments: List of IBD segments in Bonsai format

    Returns:
        Dictionary mapping pairs of individuals to their IBD statistics
    """
    print("Calculating IBD statistics for all pairs")

    # Initialize statistics dictionary
    pair_stats = {}

    for segment in segments:
        id1, id2, hap1, hap2, chrom, start, end, length = segment

        # Ensure consistent ordering of IDs
        if id1 > id2:
            id1, id2 = id2, id1

        pair = (id1, id2)

        # Initialize stats for this pair if needed
        if pair not in pair_stats:
            pair_stats[pair] = {
                'total_ibd': 0.0,
                'segment_count': 0,
                'max_segment': 0.0,
                'chromosomes': set()
            }

        # Update statistics
        pair_stats[pair]['total_ibd'] += length
        pair_stats[pair]['segment_count'] += 1
        pair_stats[pair]['max_segment'] = max(pair_stats[pair]['max_segment'], length)
        pair_stats[pair]['chromosomes'].add(chrom)

    print(f"Calculated statistics for {len(pair_stats)} pairs")
    return pair_stats

# Demonstrate the IBD processing pipeline
def demonstrate_ibd_processing():
    # 1. Load IBD segments
    segments = load_ibd_segments('simulated_data.csv', format='ibis')

    # 2. Filter segments
    filtered_segments = filter_ibd_segments(segments, min_length=7.0)

    # 3. Merge adjacent segments
    merged_segments = merge_adjacent_segments(filtered_segments, max_gap=2.0, phase_sensitive=False)

    # 4. Calculate IBD statistics
    pair_stats = calculate_ibd_statistics(merged_segments)

    # Display summary of results
    print("\nSummary of IBD Processing:")
    print(f"  Original segments: {len(segments)}")
    print(f"  After filtering: {len(filtered_segments)}")
    print(f"  After merging: {len(merged_segments)}")
    print(f"  Number of related pairs: {len(pair_stats)}")

    # Calculate distribution of total IBD
    if pair_stats:
        total_ibd_values = sorted(stats['total_ibd'] for stats in pair_stats.values())

        print("\nDistribution of Total IBD (cM):")
        print(f"  Minimum: {total_ibd_values[0]:.2f} cM")
        print(f"  Maximum: {total_ibd_values[-1]:.2f} cM")
        print(f"  Median: {np.median(total_ibd_values):.2f} cM")
        print(f"  Mean: {sum(total_ibd_values)/len(total_ibd_values):.2f} cM")

        # Plot distribution of total IBD
        plt.figure(figsize=(10, 6))
        plt.hist(total_ibd_values, bins=20, alpha=0.7, color='steelblue')
        plt.xlabel('Total IBD (cM)')
        plt.ylabel('Number of Pairs')
        plt.title('Distribution of Total IBD Sharing Between Pairs')
        plt.grid(alpha=0.3)
        plt.show()

    # Return all intermediate results for further analysis
    return segments, filtered_segments, merged_segments, pair_stats

# Run the demonstration
ibd_segments, filtered_segments, merged_segments, pair_stats = demonstrate_ibd_processing()
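
Since `load_ibd_segments` above only simulates data, here is a minimal sketch of what reading real segments might look like. The helper `parse_ibd_csv` and the column names (`id1`, `hap1`, `length_cm`, and so on) are assumptions for illustration; actual tool output will use its own headers. Each row is converted to the eight-field Bonsai-style list described in the docstring above.

```python
import csv
import io

def parse_ibd_csv(handle):
    """Parse IBD segments from a CSV file handle into Bonsai-format lists.

    The column names are hypothetical; adjust them to the actual tool output.
    """
    segments = []
    for row in csv.DictReader(handle):
        segments.append([
            row["id1"], row["id2"],
            int(row["hap1"]), int(row["hap2"]),
            int(row["chromosome"]),
            int(row["start_pos"]), int(row["end_pos"]),
            float(row["length_cm"]),
        ])
    return segments

# In-memory stand-in for a file on disk (made-up rows)
sample_csv = io.StringIO(
    "id1,id2,hap1,hap2,chromosome,start_pos,end_pos,length_cm\n"
    "sample_001,sample_002,-1,-1,3,1500000,9500000,8.4\n"
    "sample_001,sample_003,0,1,7,20000000,31000000,12.0\n"
)
parsed_segments = parse_ibd_csv(sample_csv)
print(parsed_segments[0])
```

For a file on disk, the same function can be called inside `with open(path, newline="") as handle:`.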

### Implementing Relationship Inference

Now let's implement the relationship inference component of our pipeline. This involves calculating likelihoods for different relationship hypotheses based on the processed IBD data.

In [ ]:
def load_demographic_data(file_path):
    """Load demographic data (age, sex, etc.) from a file.
    
    Args:
        file_path: Path to the demographic data file
        
    Returns:
        Dictionary mapping individual IDs to demographic information
    """
    print(f"Loading demographic data from {file_path}")
    
    # We'll simulate loading data for demonstration purposes
    # In a real implementation, this would read from a CSV file
    
    # Create some sample demographic data
    import random
    
    # Get list of all individuals from the IBD segments
    # (uses the module-level merged_segments produced by the IBD processing cell)
    all_individuals = set()
    for segment in merged_segments:
        all_individuals.add(segment[0])
        all_individuals.add(segment[1])
    
    # Hypothetical reference year so that the simulated age and birth_year agree
    reference_year = 2020
    
    demographic_data = {}
    for individual in all_individuals:
        age = random.randint(18, 80)
        demographic_data[individual] = {
            'age': age,
            'sex': random.choice(['M', 'F']),
            'birth_year': reference_year - age,
            'population': random.choice(['EUR', 'AFR', 'EAS', 'SAS', 'AMR'])
        }
    
    print(f"Loaded demographic data for {len(demographic_data)} individuals")
    return demographic_data

def estimate_degree_by_ibd(total_ibd):
    """Estimate the relationship degree based on total IBD sharing.
    
    Args:
        total_ibd: Total IBD sharing in cM
        
    Returns:
        estimated_degree: Estimated relationship degree (None if no IBD is shared)
        relationship_type: Most likely relationship type
        confidence: Confidence in the estimate (high, medium, low)
    """
    # No shared IBD means no detectable relationship
    if total_ibd <= 0:
        return None, "Unrelated", "high"
    
    # Lower-bound thresholds (cM) for each band, ordered from closest to most distant.
    # These are simplified, illustrative values; a production pipeline should rely on
    # empirically calibrated sharing distributions instead.
    thresholds = [
        (2800, 0, "Self"),                  # > 2800 cM: Self
        (2250, 0.5, "Identical twin"),      # > 2250 cM: Identical twin
        (1450, 1, "Parent-child"),          # > 1450 cM: Parent-child
        (1300, 1, "Full sibling"),          # > 1300 cM: Full sibling
        (650, 2, "Half sibling/Grandparent"),  # > 650 cM: Half sibling/Grandparent/Avuncular
        (325, 3, "First cousin"),           # > 325 cM: First cousin
        (160, 4, "First cousin once removed"),  # > 160 cM: First cousin once removed
        (80, 5, "Second cousin"),            # > 80 cM: Second cousin
        (40, 6, "Second cousin once removed"),  # > 40 cM: Second cousin once removed
        (20, 7, "Third cousin"),             # > 20 cM: Third cousin
        (0, 8, "Distant relation")           # Any IBD: Distant relation
    ]
    
    # Find the first (closest) band whose lower bound is met
    for idx, (threshold, degree, relationship) in enumerate(thresholds):
        if total_ibd >= threshold:
            if idx == 0 or idx == len(thresholds) - 1:
                # The first band has no upper bound and the last is a catch-all
                confidence = "medium"
            else:
                # Confidence reflects how far total_ibd sits from either edge of its band.
                # The 0.25 and 0.10 cutoffs are heuristic choices for this demo.
                upper = thresholds[idx - 1][0]
                position = (total_ibd - threshold) / (upper - threshold)
                margin = min(position, 1.0 - position)
                if margin >= 0.25:
                    confidence = "high"
                elif margin >= 0.10:
                    confidence = "medium"
                else:
                    confidence = "low"
                
            return degree, relationship, confidence

def refine_with_demographic_data(pair, degree, relationship, demographic_data):
    """Refine relationship estimate using demographic data.
    
    Args:
        pair: Tuple of (id1, id2)
        degree: Estimated relationship degree
        relationship: Estimated relationship type
        demographic_data: Dictionary of demographic information
        
    Returns:
        refined_relationship: Refined relationship type
        rationale: Explanation for the refinement
    """
    id1, id2 = pair
    
    # Make sure we have demographic data for both individuals
    if id1 not in demographic_data or id2 not in demographic_data:
        return relationship, "No demographic data available"
    
    demo1 = demographic_data[id1]
    demo2 = demographic_data[id2]
    
    # Calculate age difference
    age_diff = abs(demo1['age'] - demo2['age'])
    
    # Apply demographic constraints to refine relationship type
    if degree == 1:
        # Parent-child vs full sibling
        if age_diff > 15:  # Assume parent-child relationship if age difference > 15
            if demo1['age'] > demo2['age']:
                return "Parent-child (1→2)", f"Age difference of {age_diff} years suggests parent-child"
            else:
                return "Parent-child (2→1)", f"Age difference of {age_diff} years suggests parent-child"
        else:
            return "Full sibling", f"Age difference of {age_diff} years suggests siblings"
    
    elif degree == 2:
        # Half sibling vs grandparent vs avuncular
        if age_diff > 40:  # Large age difference suggests grandparent
            if demo1['age'] > demo2['age']:
                return "Grandparent (1→2)", f"Age difference of {age_diff} years suggests grandparent"
            else:
                return "Grandparent (2→1)", f"Age difference of {age_diff} years suggests grandparent"
        elif age_diff > 15:  # Moderate age difference suggests avuncular
            return "Avuncular", f"Age difference of {age_diff} years suggests avuncular"
        else:  # Small age difference suggests half sibling
            return "Half sibling", f"Age difference of {age_diff} years suggests half siblings"
    
    # For more distant relationships, we keep the estimate based on IBD
    return relationship, "Based on IBD sharing alone"

def infer_pairwise_relationships(ibd_stats, demographic_data=None):
    """Infer pairwise relationships based on IBD statistics and demographic data.
    
    Args:
        ibd_stats: Dictionary mapping pairs to IBD statistics
        demographic_data: Dictionary of demographic information (optional)
        
    Returns:
        Dictionary mapping pairs to inferred relationships
    """
    print("Inferring pairwise relationships")
    
    # Initialize relationships dictionary
    relationships = {}
    
    for pair, stats in ibd_stats.items():
        # Estimate degree based on total IBD
        total_ibd = stats['total_ibd']
        degree, relationship, confidence = estimate_degree_by_ibd(total_ibd)
        
        # Skip unrelated pairs (no degree)
        if degree is None:
            continue
        
        # Refine with demographic data if available
        if demographic_data:
            refined_relationship, rationale = refine_with_demographic_data(
                pair, degree, relationship, demographic_data)
        else:
            refined_relationship = relationship
            rationale = "Based on IBD sharing alone"
        
        # Store the inferred relationship
        relationships[pair] = {
            'degree': degree,
            'relationship': refined_relationship,
            'confidence': confidence,
            'total_ibd': total_ibd,
            'segment_count': stats['segment_count'],
            'max_segment': stats['max_segment'],
            'chromosome_count': len(stats['chromosomes']),
            'rationale': rationale
        }
    
    print(f"Inferred {len(relationships)} relationships")
    return relationships

# Demonstrate relationship inference
def demonstrate_relationship_inference():
    # Load demographic data
    demographic_data = load_demographic_data('simulated_demographics.csv')
    
    # Infer relationships using IBD statistics and demographic data
    relationships = infer_pairwise_relationships(pair_stats, demographic_data)
    
    # Analyze the distribution of relationship degrees
    degree_counts = {}
    for rel_info in relationships.values():
        degree = rel_info['degree']
        degree_counts[degree] = degree_counts.get(degree, 0) + 1
    
    # Display summary of relationship inference
    print("\nSummary of Relationship Inference:")
    print(f"  Total pairs analyzed: {len(pair_stats)}")
    print(f"  Relationships inferred: {len(relationships)}")
    
    print("\nRelationship Degree Distribution:")
    for degree in sorted(degree_counts.keys()):
        print(f"  Degree {degree}: {degree_counts[degree]} pairs")
    
    # Create a visualization of relationship degrees
    degrees = sorted(degree_counts.keys())
    counts = [degree_counts[d] for d in degrees]
    
    plt.figure(figsize=(10, 6))
    bars = plt.bar(degrees, counts, color='steelblue', alpha=0.7)
    
    # Add count labels above bars
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                f'{int(height)}', ha='center', va='bottom')
    
    plt.xlabel('Relationship Degree')
    plt.ylabel('Number of Pairs')
    plt.title('Distribution of Inferred Relationship Degrees')
    plt.xticks(degrees)
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    # Display examples of each relationship type
    print("\nExamples of Inferred Relationships:")
    examples = {}
    for pair, rel_info in relationships.items():
        relationship = rel_info['relationship']
        # Only show one example of each relationship type
        if relationship not in examples:
            examples[relationship] = (pair, rel_info)
    
    for relationship, (pair, rel_info) in examples.items():
        print(f"  {pair[0]} and {pair[1]}: {relationship} (Degree {rel_info['degree']}, {rel_info['confidence']} confidence)")
        print(f"    Total IBD: {rel_info['total_ibd']:.2f} cM in {rel_info['segment_count']} segments")
        print(f"    Rationale: {rel_info['rationale']}")
        print()
    
    return demographic_data, relationships

# Run the demonstration
demographic_data, pairwise_relationships = demonstrate_relationship_inference()
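
The lower bounds in `estimate_degree_by_ibd` from the full-sibling row onward follow a simple pattern: each additional degree halves the threshold, mirroring the fact that expected sharing roughly halves with each extra degree of separation. The sketch below regenerates that pattern. The helper `halving_thresholds` is hypothetical, and the base value of 1300 cM is taken from the table above rather than from any calibrated reference.

```python
def halving_thresholds(base_cm, max_degree):
    """Return {degree: threshold_cm}, halving the threshold for each extra degree.

    Hypothetical helper; base_cm is the degree-1 lower bound.
    """
    return {degree: base_cm / 2 ** (degree - 1)
            for degree in range(1, max_degree + 1)}

generated_thresholds = halving_thresholds(1300, 7)
for degree, cm in generated_thresholds.items():
    print(f"Degree {degree}: >= {cm:.1f} cM")
```

The table rounds the later values (162.5, 81.25, 40.6, 20.3) to 160, 80, 40, and 20 cM.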

## Part 3: Pedigree Construction and Community Detection

### Theory and Background

After inferring pairwise relationships, the next step in a pedigree reconstruction pipeline is to organize individuals into coherent family structures. This process involves two key components:

1. **Community Detection**: Identifying clusters of related individuals
2. **Pedigree Construction**: Assembling individuals into biologically valid pedigree structures

#### Community Detection

For large datasets with many individuals, it's often beneficial to first partition the data into smaller clusters of related individuals. This serves several purposes:

- Reduces computational complexity by breaking the problem into manageable subproblems
- Ensures that reconstructed pedigrees represent truly related individuals
- Allows parallelization of the pedigree reconstruction process
- Makes the final output more interpretable

Community detection algorithms from network science are well-suited for this task. We can represent individuals as nodes in a graph, with edges representing inferred relationships. Edge weights can be derived from the relationship degree or from the total amount of IBD shared.

Popular community detection algorithms include:
- **Louvain Method**: Maximizes modularity through iterative node reassignment and community merging
- **Label Propagation**: Spreads community labels based on neighborhood majority
- **Spectral Clustering**: Uses eigenvalues of the graph Laplacian matrix to identify communities
- **Hierarchical Clustering**: Builds a dendrogram of relationships and cuts at an appropriate level
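
To make "maximizes modularity" concrete, the sketch below computes the modularity Q of a given partition for a small unweighted graph in plain Python. The helper `modularity` and the toy edge list are illustrative assumptions; in practice NetworkX's community functions would be used instead.

```python
def modularity(edges, partition):
    """Compute Newman modularity Q for an unweighted, undirected graph.

    edges: list of (u, v) pairs; partition: list of sets of nodes.
    """
    m = len(edges)
    community_of = {node: idx for idx, nodes in enumerate(partition) for node in nodes}
    internal = [0] * len(partition)    # edges with both endpoints in the community
    degree_sum = [0] * len(partition)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    # Q = sum over communities of (fraction of internal edges - expected fraction)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in range(len(partition)))

# Two tightly connected families joined by a single distant link (toy data)
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
q_split = modularity(toy_edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_merged = modularity(toy_edges, [{"a", "b", "c", "d", "e", "f"}])
print(f"Two communities: Q = {q_split:.3f}")
print(f"One community:   Q = {q_merged:.3f}")
```

Splitting the graph into its two triangles scores higher than keeping everything in one community, which is the kind of partition a modularity-maximizing method such as Louvain searches for.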

#### Pedigree Construction

Once we have identified communities of related individuals, we can reconstruct pedigrees within each community. This process involves:

1. **Identifying Key Relationships**: Parent-child and sibling relationships form the core structure
2. **Placing Ungenotyped Ancestors**: Adding ancestors not in the dataset to connect individuals
3. **Resolving Ambiguities**: Handling cases where multiple pedigree configurations are possible
4. **Optimizing the Structure**: Finding the pedigree that best explains the observed IBD sharing

Bonsai v3 employs several algorithms for pedigree construction:
- **Bottom-up Construction**: Starting with close relationships and building upward
- **Template Matching**: Identifying common family structures
- **Constrained Optimization**: Finding the best pedigree given pairwise constraints
- **Incremental Building**: Adding individuals one by one to an existing structure
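
Whichever construction strategy is used, the result has to be biologically valid. The following is a minimal validity check, not Bonsai's own logic. It assumes a hypothetical `parents` mapping from each child ID to a tuple of parent IDs, and verifies that nobody has more than two parents and that nobody is their own ancestor.

```python
def is_valid_pedigree(parents):
    """Check basic biological validity of a child -> parents mapping.

    parents: dict mapping each individual ID to a tuple of parent IDs
    (hypothetical structure for illustration).
    """
    # Rule 1: at most two parents per individual
    if any(len(p) > 2 for p in parents.values()):
        return False

    # Rule 2: no cycles (nobody may be their own ancestor), via depth-first search
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def has_cycle(node):
        state[node] = IN_PROGRESS
        for parent in parents.get(node, ()):
            status = state.get(parent, UNSEEN)
            if status == IN_PROGRESS:
                return True  # Reached an individual still on the current path
            if status == UNSEEN and has_cycle(parent):
                return True
        state[node] = DONE
        return False

    return not any(state.get(node, UNSEEN) == UNSEEN and has_cycle(node)
                   for node in parents)

valid = {"child": ("mother", "father"), "mother": ("grandma",)}
cyclic = {"a": ("b",), "b": ("a",)}
print(is_valid_pedigree(valid))   # True
print(is_valid_pedigree(cyclic))  # False
```

A fuller check would also compare ages and sexes, for example requiring each parent to be older than the child.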

### Implementation of Community Detection

Let's implement community detection to identify clusters of related individuals based on our pairwise relationship data. We'll use the NetworkX library for graph analysis and community detection algorithms.

In [ ]:
def build_relationship_graph(relationships):
    """Build a relationship graph from pairwise relationship data.
    
    Args:
        relationships: Dictionary mapping pairs to relationship information
        
    Returns:
        NetworkX Graph object representing the relationship network
    """
    import networkx as nx
    
    # Create a new undirected graph
    G = nx.Graph()
    
    # Add edges with relationship attributes
    for pair, rel_info in relationships.items():
        id1, id2 = pair
        
        # Add nodes if they don't exist
        if id1 not in G:
            G.add_node(id1)
        if id2 not in G:
            G.add_node(id2)
        
        # Add edge with relationship attributes
        # Weight is inversely proportional to degree (closer relationships have higher weight)
        degree = rel_info['degree']
        weight = 1.0 / (degree + 0.1)  # Add 0.1 to avoid division by zero
        
        G.add_edge(id1, id2, 
                  degree=degree,
                  relationship=rel_info['relationship'],
                  confidence=rel_info['confidence'],
                  total_ibd=rel_info['total_ibd'],
                  weight=weight)  # Weight for community detection
    
    print(f"Created relationship graph with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")
    return G

def detect_communities(G, method='louvain', resolution=1.0, min_degree=7):
    """Detect communities of related individuals in the relationship graph.
    
    Args:
        G: NetworkX Graph object representing the relationship network
        method: Community detection algorithm to use ('louvain', 'label_propagation', 'hierarchical')
        resolution: Resolution parameter for Louvain method (higher values create more communities)
        min_degree: Despite its name, the maximum relationship degree to include
            (e.g., 3 = include relationships up to first cousins)
        
    Returns:
        List of communities, where each community is a set of individual IDs
    """
    import networkx as nx
    
    # Create a filtered graph with only relationships up to min_degree
    filtered_G = nx.Graph()
    
    for u, v, data in G.edges(data=True):
        if data['degree'] <= min_degree:
            filtered_G.add_edge(u, v, **data)
    
    print(f"Filtered graph has {filtered_G.number_of_nodes()} nodes and {filtered_G.number_of_edges()} edges")
    
    # Detect communities using the specified method
    if method == 'louvain':
        try:
            # Try to use the community module if available
            from community import best_partition
            
            # Get the partition
            partition = best_partition(filtered_G, weight='weight', resolution=resolution)
            
            # Convert the partition to a list of communities
            communities = {}
            for node, community_id in partition.items():
                if community_id not in communities:
                    communities[community_id] = set()
                communities[community_id].add(node)
            
            communities_list = list(communities.values())
            
        except ImportError:
            print("Warning: python-louvain module not available, using networkx_community instead")
            import networkx.algorithms.community as nx_comm
            
            # Use Louvain method from NetworkX
            communities_list = list(nx_comm.louvain_communities(filtered_G, weight='weight', resolution=resolution))
    
    elif method == 'label_propagation':
        import networkx.algorithms.community as nx_comm
        communities_list = list(nx_comm.label_propagation_communities(filtered_G))
    
    elif method == 'hierarchical':
        import networkx.algorithms.community as nx_comm
        # Girvan-Newman is a divisive hierarchical method; use the first split of its dendrogram
        communities_list = [set(c) for c in next(nx_comm.girvan_newman(filtered_G))]
    
    else:
        raise ValueError(f"Unknown community detection method: {method}")
    
    # Sort communities by size (largest first)
    communities_list.sort(key=len, reverse=True)
    
    print(f"Detected {len(communities_list)} communities")
    
    # Print information about each community
    for i, community in enumerate(communities_list):
        print(f"  Community {i+1}: {len(community)} members")
    
    return communities_list

# Demonstrate community detection using the pairwise relationships
def demonstrate_community_detection():
    # Build the relationship graph
    G = build_relationship_graph(pairwise_relationships)
    
    # Detect communities
    communities = detect_communities(G, method='louvain', min_degree=5)
    
    # Visualize the relationship graph with community colors
    plt.figure(figsize=(12, 10))
    pos = nx.spring_layout(G, seed=42)  # Layout algorithm
    
    # Create a mapping from nodes to community index
    community_map = {}
    for i, community in enumerate(communities):
        for node in community:
            community_map[node] = i
    
    # Create a color map based on community membership
    # (at least one color, so the lookup also works when no communities were detected)
    colors = plt.cm.tab20(range(max(1, min(20, len(communities)))))
    node_colors = [colors[community_map.get(node, 0) % len(colors)] for node in G.nodes()]
    
    # Draw nodes (sized by degree centrality)
    node_size = dict(nx.degree_centrality(G))
    node_sizes = [300 * node_size.get(node, 0.1) + 50 for node in G.nodes()]
    
    # Draw the network
    nx.draw_networkx_nodes(G, pos, node_color=node_colors, node_size=node_sizes, alpha=0.8)
    
    # Draw edges with varying width based on relationship closeness
    edge_weights = [G[u][v]['weight'] * 3 for u, v in G.edges()]
    nx.draw_networkx_edges(G, pos, width=edge_weights, alpha=0.3, edge_color='gray')
    
    # Draw labels for all nodes
    nx.draw_networkx_labels(G, pos, font_size=8, font_color='black', alpha=0.7)
    
    plt.title("Relationship Network with Detected Communities")
    plt.axis('off')
    plt.tight_layout()
    plt.show()
    
    # Print details about the largest communities
    print("\nLargest Communities and Their Key Relationships:")
    for i, community in enumerate(communities[:5]):  # Show top 5 communities
        if len(community) <= 2:  # Skip very small communities
            continue
            
        print(f"\nCommunity {i+1}: {len(community)} members")
        print("  Members: " + ", ".join(list(community)[:10]) + 
              ("..." if len(community) > 10 else ""))
        
        # Find all relationships within this community
        community_relationships = {}
        for pair, rel_info in pairwise_relationships.items():
            id1, id2 = pair
            if id1 in community and id2 in community:
                community_relationships[pair] = rel_info
        
        # Sort by relationship degree (closest first)
        sorted_rels = sorted(community_relationships.items(), key=lambda x: x[1]['degree'])
        
        # Print the closest relationships in this community
        print("  Key relationships:")
        for pair, rel_info in sorted_rels[:5]:  # Show top 5 relationships
            print(f"    {pair[0]} and {pair[1]}: {rel_info['relationship']} " + 
                  f"(Degree {rel_info['degree']}, Total IBD: {rel_info['total_ibd']:.2f} cM)")
    
    return G, communities

# Run the demonstration
relationship_graph, detected_communities = demonstrate_community_detection()

### Implementation of Pedigree Construction

Now that we've identified communities of related individuals, let's implement pedigree construction algorithms to build family trees within each community.

In [ ]:
class Pedigree:
    """A class representing a pedigree (family tree) structure."""
    
    def __init__(self, name="Untitled Pedigree"):
        """Initialize an empty pedigree.
        
        Args:
            name: Name of the pedigree
        """
        self.name = name
        self.individuals = {}  # id -> individual data
        self.relationships = {}  # (id1, id2) -> relationship
        self.parents = {}  # id -> (mother_id, father_id)
        self.children = {}  # id -> set of child_ids
        self.confidence_scores = {}  # (id1, id2) -> confidence score
        
    def add_individual(self, individual_id, **attributes):
        """Add an individual to the pedigree.
        
        Args:
            individual_id: Unique identifier for the individual
            **attributes: Additional attributes (e.g., age, sex, birth_year)
        """
        self.individuals[individual_id] = attributes
        self.children[individual_id] = set()
        
    def add_relationship(self, id1, id2, relationship_type, confidence=None):
        """Add a relationship between two individuals.
        
        Args:
            id1: ID of the first individual
            id2: ID of the second individual
            relationship_type: Type of relationship
            confidence: Confidence score for this relationship
        """
        # Ensure the individuals exist in the pedigree
        if id1 not in self.individuals:
            raise ValueError(f"Individual {id1} does not exist in the pedigree")
        if id2 not in self.individuals:
            raise ValueError(f"Individual {id2} does not exist in the pedigree")
            
        # Store the relationship
        self.relationships[(id1, id2)] = relationship_type
        
        # Store the confidence score if provided
        if confidence is not None:
            self.confidence_scores[(id1, id2)] = confidence
            
    def set_parent_child(self, parent_id, child_id):
        """Set a parent-child relationship.
        
        Args:
            parent_id: ID of the parent
            child_id: ID of the child
        """
        # Ensure the individuals exist
        if parent_id not in self.individuals:
            raise ValueError(f"Parent {parent_id} does not exist in the pedigree")
        if child_id not in self.individuals:
            raise ValueError(f"Child {child_id} does not exist in the pedigree")
            
        # Get the parent's sex
        parent_sex = self.individuals.get(parent_id, {}).get('sex', None)
        
        # Get the current parents of the child
        mother_id, father_id = self.parents.get(child_id, (None, None))
        
        # Update the parent information based on sex
        if parent_sex == 'F':
            mother_id = parent_id
        elif parent_sex == 'M':
            father_id = parent_id
        else:
            # If sex is unknown, use an available parent slot
            if mother_id is None:
                mother_id = parent_id
            else:
                father_id = parent_id
                
        # Update the parents dictionary
        self.parents[child_id] = (mother_id, father_id)
        
        # Update the children set for the parent
        self.children[parent_id].add(child_id)
        
        # Add the relationship
        self.add_relationship(parent_id, child_id, 'parent-child', confidence=1.0)
            
    def add_ungenotyped_individual(self, individual_id, sex=None, **attributes):
        """Add an ungenotyped individual (ancestor/connector) to the pedigree.
        
        Args:
            individual_id: Unique identifier for the individual
            sex: Sex of the individual ('M' or 'F')
            **attributes: Additional attributes
        """
        # Create a unique ID for the ungenotyped individual if not provided
        if individual_id is None or individual_id in self.individuals:
            individual_id = f"ungenotyped_{len(self.individuals) + 1}"
            
        # Set attributes
        attributes['genotyped'] = False
        if sex:
            attributes['sex'] = sex
            
        # Add to the pedigree
        self.add_individual(individual_id, **attributes)
        return individual_id
    
    def get_siblings(self, individual_id):
        """Get the siblings of an individual.
        
        Args:
            individual_id: ID of the individual
            
        Returns:
            Set of sibling IDs
        """
        # Check if the individual has parents
        if individual_id not in self.parents:
            return set()
            
        mother_id, father_id = self.parents[individual_id]
        siblings = set()
        
        # Look for individuals with the same mother
        if mother_id is not None:
            for child_id in self.children.get(mother_id, set()):
                if child_id != individual_id:
                    siblings.add(child_id)
                    
        # Look for individuals with the same father
        if father_id is not None:
            for child_id in self.children.get(father_id, set()):
                if child_id != individual_id and child_id not in siblings:
                    siblings.add(child_id)
                    
        return siblings
    
    def get_all_relatives(self, individual_id):
        """Get all relatives of an individual in the pedigree.
        
        Args:
            individual_id: ID of the individual
            
        Returns:
            Set of relative IDs
        """
        relatives = set()
        visited = set()
        
        def dfs(id):
            if id in visited:
                return
            visited.add(id)
            
            # Add parents
            if id in self.parents:
                mother_id, father_id = self.parents[id]
                if mother_id is not None:
                    relatives.add(mother_id)
                    dfs(mother_id)
                if father_id is not None:
                    relatives.add(father_id)
                    dfs(father_id)
            
            # Add children
            for child_id in self.children.get(id, set()):
                relatives.add(child_id)
                dfs(child_id)
            
            # Add siblings and their descendants
            for sibling_id in self.get_siblings(id):
                relatives.add(sibling_id)
                dfs(sibling_id)
        
        dfs(individual_id)
        
        # Remove the individual themselves
        if individual_id in relatives:
            relatives.remove(individual_id)
            
        return relatives
    
    def to_networkx(self):
        """Convert the pedigree to a NetworkX graph for visualization.
        
        Returns:
            NetworkX DiGraph object representing the pedigree
        """
        import networkx as nx
        
        G = nx.DiGraph()
        
        # Add nodes (individuals)
        for ind_id, attributes in self.individuals.items():
            # Copy attributes
            node_attrs = dict(attributes)
            
            # Add ID as an attribute
            node_attrs['id'] = ind_id
            
            # Add node to the graph
            G.add_node(ind_id, **node_attrs)
        
        # Add edges (parent-child relationships)
        for child_id, (mother_id, father_id) in self.parents.items():
            if mother_id is not None:
                G.add_edge(mother_id, child_id, relationship='mother-child')
            if father_id is not None:
                G.add_edge(father_id, child_id, relationship='father-child')
        
        return G
    
    def plot(self, figsize=(12, 10), node_size=300, font_size=8):
        """Visualize the pedigree using NetworkX and matplotlib.
        
        Args:
            figsize: Figure size as (width, height)
            node_size: Base size of nodes
            font_size: Size of text labels
        """
        import networkx as nx
        import matplotlib.pyplot as plt
        
        # Convert to NetworkX graph
        G = self.to_networkx()
        
        # Create figure
        plt.figure(figsize=figsize)
        
        # Use a hierarchical layout
        pos = nx.nx_agraph.graphviz_layout(G, prog='dot')
        
        # Create node colors based on sex
        node_colors = []
        for node in G.nodes():
            sex = G.nodes[node].get('sex', None)
            if sex == 'M':
                node_colors.append('skyblue')
            elif sex == 'F':
                node_colors.append('pink')
            else:
                node_colors.append('lightgray')
        
        # Draw nodes by genotyped status (circles for genotyped, squares for ungenotyped).
        # draw_networkx_nodes accepts one shape per call, so each group is drawn separately.
        color_by_node = dict(zip(G.nodes(), node_colors))
        for shape, is_genotyped in (('o', True), ('s', False)):
            nodelist = [n for n in G.nodes()
                        if bool(G.nodes[n].get('genotyped', True)) == is_genotyped]
            if nodelist:
                nx.draw_networkx_nodes(G, pos, nodelist=nodelist, node_shape=shape,
                                       node_color=[color_by_node[n] for n in nodelist],
                                       node_size=node_size)
        
        # Draw edges and labels
        nx.draw_networkx_edges(G, pos, edge_color='gray', width=1.5, alpha=0.7)
        nx.draw_networkx_labels(G, pos, font_size=font_size, font_color='black')
        
        plt.title(self.name)
        plt.axis('off')
        plt.tight_layout()
        plt.show()

def build_pedigree_from_relationships(community, relationships, demographic_data=None):
    """Build a pedigree from pairwise relationships.
    
    Args:
        community: Set of individual IDs in the community
        relationships: Dictionary mapping pairs to relationship information
        demographic_data: Dictionary mapping individual IDs to demographic information
        
    Returns:
        Pedigree object representing the community
    """
    # Create a new pedigree
    pedigree = Pedigree(name=f"Community of {len(community)} individuals")
    
    # Add all individuals to the pedigree
    for ind_id in community:
        # Get demographic data if available
        attributes = {}
        if demographic_data and ind_id in demographic_data:
            attributes = demographic_data[ind_id].copy()
        
        # Set as genotyped (observed in the data)
        attributes['genotyped'] = True
        
        # Add to the pedigree
        pedigree.add_individual(ind_id, **attributes)
    
    # First pass: add all clear parent-child relationships
    parent_child_pairs = []
    for pair, rel_info in relationships.items():
        id1, id2 = pair
        
        # Skip relationships not in this community
        if id1 not in community or id2 not in community:
            continue
        
        # Check if this is a parent-child relationship
        relationship = rel_info['relationship']
        if 'Parent-child' in relationship:
            if '(1→2)' in relationship:
                parent_id, child_id = id1, id2
            else:
                parent_id, child_id = id2, id1
                
            parent_child_pairs.append((parent_id, child_id, rel_info['confidence']))
    
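    # Sort so the most confident parent-child relationships are added first.
    # Categorical labels are ranked explicitly: sorting the strings alphabetically
    # in reverse would place 'medium' and 'low' ahead of 'high'.
    confidence_rank = {'high': 2, 'medium': 1, 'low': 0}
    
    def confidence_key(item):
        confidence = item[2]
        if isinstance(confidence, str):
            return confidence_rank.get(confidence.lower(), -1)
        return confidence  # Numeric confidences sort by value
    
    parent_child_pairs.sort(key=confidence_key, reverse=True)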
    
    # Add parent-child relationships to the pedigree
    for parent_id, child_id, confidence in parent_child_pairs:
        try:
            pedigree.set_parent_child(parent_id, child_id)
        except Exception as e:
            print(f"Warning: Could not add parent-child relationship {parent_id}->{child_id}: {e}")
    
    # Second pass: add sibling relationships
    for pair, rel_info in relationships.items():
        id1, id2 = pair
        
        # Skip relationships not in this community
        if id1 not in community or id2 not in community:
            continue
        
        # Check if this is a sibling relationship
        relationship = rel_info['relationship']
        if relationship == 'Full sibling':
            # Create ungenotyped parents if needed
            # First, check if either individual already has parents
            has_parents1 = id1 in pedigree.parents and any(pedigree.parents[id1])
            has_parents2 = id2 in pedigree.parents and any(pedigree.parents[id2])
            
            if not has_parents1 and not has_parents2:
                # Neither has parents, create new ungenotyped parents
                mother_id = pedigree.add_ungenotyped_individual(None, sex='F', note="Inferred from sibling relationship")
                father_id = pedigree.add_ungenotyped_individual(None, sex='M', note="Inferred from sibling relationship")
                
                # Set as parents for both siblings
                pedigree.set_parent_child(mother_id, id1)
                pedigree.set_parent_child(mother_id, id2)
                pedigree.set_parent_child(father_id, id1)
                pedigree.set_parent_child(father_id, id2)
                
            elif has_parents1:
                # Individual 1 has parents, use them for individual 2
                mother_id, father_id = pedigree.parents[id1]
                if mother_id:
                    pedigree.set_parent_child(mother_id, id2)
                if father_id:
                    pedigree.set_parent_child(father_id, id2)
                    
            elif has_parents2:
                # Individual 2 has parents, use them for individual 1
                mother_id, father_id = pedigree.parents[id2]
                if mother_id:
                    pedigree.set_parent_child(mother_id, id1)
                if father_id:
                    pedigree.set_parent_child(father_id, id1)
    
    # Third pass: handle half-sibling, avuncular, and grandparent relationships
    # ... These require more complex logic and would be added here
    
    return pedigree

# Demonstrate pedigree construction for a community
def demonstrate_pedigree_construction():
    # Get the largest community
    largest_community = detected_communities[0] if detected_communities else set()
    
    if not largest_community:
        print("No communities detected")
        return None
        
    print(f"Building pedigree for largest community with {len(largest_community)} members")
    
    # Build a pedigree from the relationships in this community
    pedigree = build_pedigree_from_relationships(largest_community, pairwise_relationships, demographic_data)
    
    # Visualize the pedigree
    pedigree.plot()
    
    # Print pedigree statistics
    print("\nPedigree Statistics:")
    print(f"  Total individuals: {len(pedigree.individuals)}")
    print(f"  Genotyped individuals: {sum(1 for attrs in pedigree.individuals.values() if attrs.get('genotyped', True))}")
    print(f"  Ungenotyped individuals: {sum(1 for attrs in pedigree.individuals.values() if not attrs.get('genotyped', True))}")
    # Count each (parent, child) link; len(pedigree.parents) only counts children with a recorded parent
    n_parent_child = sum(1 for pair in pedigree.parents.values() for p in pair if p is not None)
    print(f"  Parent-child relationships: {n_parent_child}")
    
    # Find the most connected individuals
    individual_connections = {}
    for ind_id in pedigree.individuals:
        relatives = pedigree.get_all_relatives(ind_id)
        individual_connections[ind_id] = len(relatives)
    
    # Sort by number of connections
    sorted_connections = sorted(individual_connections.items(), key=lambda x: x[1], reverse=True)
    
    print("\nMost connected individuals:")
    for ind_id, connection_count in sorted_connections[:5]:
        genotyped = "genotyped" if pedigree.individuals[ind_id].get('genotyped', True) else "ungenotyped"
        print(f"  {ind_id}: {connection_count} relatives ({genotyped})")
    
    return pedigree

# Run the demonstration
constructed_pedigree = demonstrate_pedigree_construction()
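
The third pass in `build_pedigree_from_relationships` is left as a placeholder. As one possible starting point, the sketch below attaches a half-sibling pair through exactly one shared parent. It works on a plain `parents` dictionary shaped like `Pedigree.parents` rather than on the class itself. The function name, the `shared_sex` argument, and the placeholder ID format are assumptions made for illustration. Which parent is shared usually cannot be decided from total autosomal IBD alone, so the caller has to supply that choice.

```python
def attach_half_siblings(parents, id1, id2, shared_sex="F", next_id=1):
    """Link id1 and id2 through exactly one shared parent (hypothetical helper).

    parents maps child_id -> (mother_id, father_id), as in Pedigree.parents.
    Returns the ID of the shared parent.
    """
    slot = 0 if shared_sex == "F" else 1   # 0 = mother slot, 1 = father slot
    other = 1 - slot
    p1 = list(parents.get(id1, (None, None)))
    p2 = list(parents.get(id2, (None, None)))

    # Refuse to overwrite two different known parents in the shared slot
    if p1[slot] and p2[slot] and p1[slot] != p2[slot]:
        raise ValueError(f"{id1} and {id2} already have different parents in slot {slot}")

    # Half siblings must not share the other parent as well
    if p1[other] and p1[other] == p2[other]:
        raise ValueError(f"{id1} and {id2} share both parents (full siblings)")

    # Reuse an existing parent if either sibling has one, else create a placeholder
    shared = p1[slot] or p2[slot]
    if shared is None:
        shared = f"ungenotyped_halfsib_{next_id}"  # placeholder ID format is an assumption

    p1[slot] = shared
    p2[slot] = shared
    parents[id1] = tuple(p1)
    parents[id2] = tuple(p2)
    return shared


# Usage: A already has a known mother, so B is attached to the same mother
example_parents = {"A": ("M1", None)}
shared_parent = attach_half_siblings(example_parents, "A", "B", shared_sex="F")
print(shared_parent, example_parents)
```

In a full implementation the same checks would sit inside the `Pedigree` class so that the `children` sets stay in sync with `parents`.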

## Part 4: Confidence Assessment and Result Interpretation

### Theory and Background

A critical aspect of pedigree reconstruction is evaluating the confidence of inferred relationships and pedigree structures. Even with high-quality IBD data, some relationships may be ambiguous or have multiple plausible explanations.

Effective confidence assessment helps users:
1. **Identify Reliable Results**: Distinguish between high and low confidence relationships
2. **Focus Investigation**: Prioritize areas where additional evidence is needed
3. **Make Informed Decisions**: Balance confidence levels against the implications of incorrect inferences
4. **Communicate Uncertainty**: Present results to users with appropriate caveats

In Bonsai v3, confidence assessment is implemented at multiple levels:
- **Pairwise Relationship Confidence**: How certain we are about the relationship between two individuals
- **Pedigree Structure Confidence**: How well the overall structure is supported by the data
- **Alternative Hypothesis Scoring**: Evaluating multiple possible pedigree configurations

In this section, we'll implement confidence assessment methods and interpret the reconstructed pedigrees.
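
To make the idea of alternative hypothesis scoring concrete, here is a minimal sketch that converts per-hypothesis log-likelihoods into normalized probabilities under a uniform prior. It is not Bonsai's API: the function name and the example log-likelihood values are made up for illustration, and a real pipeline would obtain the log-likelihoods from its own IBD likelihood model.

```python
import math


def normalize_hypotheses(log_likelihoods):
    """Convert log-likelihoods into probabilities that sum to 1 (uniform prior assumed)."""
    if not log_likelihoods:
        return {}
    # Subtract the maximum before exponentiating to avoid numerical underflow
    max_ll = max(log_likelihoods.values())
    weights = {h: math.exp(ll - max_ll) for h, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


# Made-up log-likelihoods for three candidate relationships of one pair
candidates = {"Full sibling": -10.2, "Half sibling": -12.5, "Avuncular": -12.9}
posteriors = normalize_hypotheses(candidates)
best = max(posteriors, key=posteriors.get)
print(best, round(posteriors[best], 3))
```

A small gap between the best and second-best probability is a sign that the pair deserves a lower confidence score, even if the best hypothesis looks plausible on its own.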

In [ ]:
def assess_pedigree_confidence(pedigree, relationships):
    """Assess confidence in the reconstructed pedigree.
    
    Args:
        pedigree: Pedigree object
        relationships: Dictionary mapping pairs to relationship information
        
    Returns:
        Tuple of (confidence_scores, overall_confidence): a dictionary mapping
        relationship pairs to confidence scores, and the mean score across the pedigree
    """
    print("Assessing pedigree confidence...")
    
    # Initialize confidence scores
    confidence_scores = {}
    relationship_conflicts = {}
    
    # Assess each relationship in the pedigree
    for pair, rel_type in pedigree.relationships.items():
        id1, id2 = pair
        
        # If we have inferred this relationship from IBD data
        if pair in relationships or (id2, id1) in relationships:
            # Get the original relationship information
            rel_info = relationships.get(pair, relationships.get((id2, id1)))
            
            # Compare the pedigree relationship with the inferred relationship
            if rel_type == 'parent-child' and 'Parent-child' in rel_info['relationship']:
                # Parent-child relationship matches
                confidence = rel_info['confidence']
                if confidence == 'high':
                    score = 0.95
                elif confidence == 'medium':
                    score = 0.75
                else:
                    score = 0.5
            elif rel_type == 'sibling' and rel_info['relationship'] in ['Full sibling', 'Half sibling']:
                # Sibling relationship matches
                confidence = rel_info['confidence']
                if confidence == 'high':
                    score = 0.9
                elif confidence == 'medium':
                    score = 0.7
                else:
                    score = 0.4
            else:
                # Relationship type mismatch
                score = 0.3
                relationship_conflicts[pair] = (rel_type, rel_info['relationship'])
        else:
            # This is an inferred relationship (e.g., through transitivity)
            # For example, if A is parent of B and B is parent of C, then A is grandparent of C
            score = 0.6  # Moderate confidence for transitive relationships
            
        # Store the confidence score
        confidence_scores[pair] = score
    
    # Evaluate overall pedigree confidence
    if confidence_scores:
        overall_confidence = sum(confidence_scores.values()) / len(confidence_scores)
    else:
        overall_confidence = 0.0
        
    # Print confidence information
    print(f"Overall pedigree confidence: {overall_confidence:.2f}")
    print(f"Assessed confidence for {len(confidence_scores)} relationships")
    
    if relationship_conflicts:
        print(f"Found {len(relationship_conflicts)} relationship conflicts:")
        for pair, (ped_rel, inf_rel) in list(relationship_conflicts.items())[:5]:  # Show first 5
            print(f"  {pair[0]} and {pair[1]}: {ped_rel} in pedigree vs {inf_rel} from IBD")
    
    return confidence_scores, overall_confidence

def visualize_confidence(pedigree, confidence_scores):
    """Visualize the pedigree with confidence information.
    
    Args:
        pedigree: Pedigree object
        confidence_scores: Dictionary mapping pairs to confidence scores
    """
    import networkx as nx
    import matplotlib.pyplot as plt
    import matplotlib.cm as cm
    
    # Convert to NetworkX graph
    G = pedigree.to_networkx()
    
    # Create figure
    plt.figure(figsize=(14, 10))
    
    # Use a hierarchical layout
    pos = nx.nx_agraph.graphviz_layout(G, prog='dot')
    
    # Create node colors based on sex
    node_colors = []
    for node in G.nodes():
        sex = G.nodes[node].get('sex', None)
        if sex == 'M':
            node_colors.append('skyblue')
        elif sex == 'F':
            node_colors.append('pink')
        else:
            node_colors.append('lightgray')
    
    # Draw nodes (different shape for genotyped vs ungenotyped)
    genotyped_nodes = [node for node in G.nodes() if G.nodes[node].get('genotyped', True)]
    ungenotyped_nodes = [node for node in G.nodes() if not G.nodes[node].get('genotyped', True)]
    
    nx.draw_networkx_nodes(G, pos, 
                          nodelist=genotyped_nodes,
                          node_color=[node_colors[list(G.nodes()).index(node)] for node in genotyped_nodes],
                          node_shape='o',
                          node_size=300)
    
    nx.draw_networkx_nodes(G, pos, 
                          nodelist=ungenotyped_nodes,
                          node_color=[node_colors[list(G.nodes()).index(node)] for node in ungenotyped_nodes],
                          node_shape='s',
                          node_size=200)
    
    # Draw edges with color based on confidence
    edges = list(G.edges())
    edge_colors = []
    edge_widths = []
    
    # Color map for confidence: red (low) to green (high)
    # plt.get_cmap is used because matplotlib.cm.get_cmap was removed in matplotlib 3.9
    cmap = plt.get_cmap('RdYlGn')
    
    for u, v in edges:
        # Check if we have a confidence score for this edge
        if (u, v) in confidence_scores:
            score = confidence_scores[(u, v)]
        elif (v, u) in confidence_scores:
            score = confidence_scores[(v, u)]
        else:
            score = 0.5  # Default medium confidence
            
        edge_colors.append(cmap(score))
        edge_widths.append(1.0 + score)
    
    nx.draw_networkx_edges(G, pos, 
                          edgelist=edges,
                          edge_color=edge_colors,
                          width=edge_widths)
    
    # Draw labels
    nx.draw_networkx_labels(G, pos, font_size=8, font_color='black')
    
    # Add a legend for confidence
    import matplotlib.patches as mpatches
    
    legend_elements = [
        mpatches.Patch(color=cmap(0.1), label='Low Confidence'),
        mpatches.Patch(color=cmap(0.5), label='Medium Confidence'),
        mpatches.Patch(color=cmap(0.9), label='High Confidence')
    ]
    
    plt.legend(handles=legend_elements, loc='upper right')
    
    plt.title(f"{pedigree.name} with Confidence Assessment")
    plt.axis('off')
    plt.tight_layout()
    plt.show()

def generate_pedigree_report(pedigree, confidence_scores, overall_confidence):
    """Generate a detailed report about the pedigree.
    
    Args:
        pedigree: Pedigree object
        confidence_scores: Dictionary mapping pairs to confidence scores
        overall_confidence: Overall confidence score for the pedigree
    """
    from IPython.display import Markdown, display
    
    # Create a markdown report
    report = [f"# Pedigree Report: {pedigree.name}"]
    
    # Add overall statistics
    report.append("## Overall Statistics")
    report.append(f"- **Total individuals:** {len(pedigree.individuals)}")
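    report.append(f"- **Genotyped individuals:** {sum(1 for attrs in pedigree.individuals.values() if attrs.get('genotyped', True))}")
    report.append(f"- **Ungenotyped individuals:** {sum(1 for attrs in pedigree.individuals.values() if not attrs.get('genotyped', True))}")
    # Count each (parent, child) link; len(pedigree.parents) only counts children with a recorded parent
    n_parent_child = sum(1 for pair in pedigree.parents.values() for p in pair if p is not None)
    report.append(f"- **Parent-child relationships:** {n_parent_child}")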
    report.append(f"- **Overall confidence score:** {overall_confidence:.2f}")
    
    # Group individuals by families
    families = {}
    
    for ind_id in pedigree.individuals:
        # Find the top ancestor for this individual
        current_id = ind_id
        ancestors = []
        
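        seen = {ind_id}  # Guard against cycles caused by inconsistent parent links
        while current_id in pedigree.parents and any(pedigree.parents[current_id]):
            mother_id, father_id = pedigree.parents[current_id]
            # Follow the maternal line when known, otherwise the paternal line
            next_id = mother_id if mother_id else father_id
            if next_id in seen:
                break
            seen.add(next_id)
            ancestors.append(next_id)
            current_id = next_id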
        
        # The top ancestor is the family identifier
        if ancestors:
            family_id = ancestors[-1]
        else:
            family_id = ind_id
            
        if family_id not in families:
            families[family_id] = []
        families[family_id].append(ind_id)
    
    # Add information about each family
    report.append(f"\n## Family Structures ({len(families)} families)")
    
    for i, (family_id, members) in enumerate(sorted(families.items(), key=lambda x: len(x[1]), reverse=True)):
        report.append(f"\n### Family {i+1}: {family_id}")
        report.append(f"- **Members:** {len(members)}")
        
        # List key individuals in the family
        genotyped_members = [m for m in members if pedigree.individuals[m].get('genotyped', True)]
        if genotyped_members:
            report.append(f"- **Genotyped members:** {', '.join(genotyped_members[:5])}" + 
                         ("..." if len(genotyped_members) > 5 else ""))
        
        # Find key relationships in this family
        family_relationships = []
        for pair, score in confidence_scores.items():
            if pair[0] in members and pair[1] in members:
                rel_type = pedigree.relationships.get(pair, "unknown")
                family_relationships.append((pair, rel_type, score))
        
        # Sort by confidence (highest first)
        family_relationships.sort(key=lambda x: x[2], reverse=True)
        
        if family_relationships:
            report.append("- **Key relationships:**")
            for (id1, id2), rel_type, score in family_relationships[:5]:  # Show top 5
                confidence = "high" if score > 0.8 else "medium" if score > 0.5 else "low"
                report.append(f"  - {id1} — {id2}: {rel_type} ({confidence} confidence: {score:.2f})")
    
    # Add information about high and low confidence relationships
    high_conf = {pair: score for pair, score in confidence_scores.items() if score > 0.8}
    low_conf = {pair: score for pair, score in confidence_scores.items() if score < 0.5}
    
    report.append("\n## Confidence Assessment")
    total_scored = len(confidence_scores) or 1  # Avoid division by zero when nothing was scored
    report.append(f"- **High confidence relationships:** {len(high_conf)} ({len(high_conf) / total_scored * 100:.0f}%)")
    report.append(f"- **Low confidence relationships:** {len(low_conf)} ({len(low_conf) / total_scored * 100:.0f}%)")
    
    if low_conf:
        report.append("\n### Low Confidence Relationships (Requiring Further Investigation)")
        for pair, score in sorted(low_conf.items(), key=lambda x: x[1]):
            rel_type = pedigree.relationships.get(pair, "unknown")
            report.append(f"- {pair[0]} — {pair[1]}: {rel_type} (confidence: {score:.2f})")
    
    # Display the report
    report_text = "\n".join(report)
    display(Markdown(report_text))
    
    return report_text

# Evaluate and report on the pedigree
def demonstrate_pedigree_evaluation():
    # Skip if no pedigree was constructed; return three values so callers can still unpack
    if constructed_pedigree is None:
        print("No pedigree available for evaluation")
        return None, None, None
    
    # Assess confidence in the pedigree
    confidence_scores, overall_confidence = assess_pedigree_confidence(
        constructed_pedigree, pairwise_relationships)
    
    # Visualize the pedigree with confidence information
    visualize_confidence(constructed_pedigree, confidence_scores)
    
    # Generate a report
    report = generate_pedigree_report(constructed_pedigree, confidence_scores, overall_confidence)
    
    # Return for further analysis
    return confidence_scores, overall_confidence, report

# Run the demonstration
try:
    confidence_scores, overall_confidence, pedigree_report = demonstrate_pedigree_evaluation()
except Exception as e:
    print(f"Error evaluating pedigree: {e}")
    confidence_scores, overall_confidence, pedigree_report = None, None, None
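
The scores above compare pedigree edges with the pairwise calls, but they do not check that the structure itself is biologically possible. One inexpensive structural check is that nobody is their own ancestor. The sketch below is a standard depth-first cycle search over a `parents` mapping shaped like `Pedigree.parents`. The function name is an assumption, and the recursive form is meant for the small pedigrees used in this lab.

```python
def find_ancestry_cycle(parents):
    """Return a list of IDs forming a cycle in child -> parent links, or None.

    parents maps child_id -> (mother_id, father_id).
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(node, path):
        state[node] = IN_PROGRESS
        path.append(node)
        for parent in parents.get(node, (None, None)):
            if parent is None:
                continue
            parent_state = state.get(parent, UNSEEN)
            if parent_state == IN_PROGRESS:
                # The parent is already on the current path, so we found a loop
                return path[path.index(parent):] + [parent]
            if parent_state == UNSEEN:
                cycle = visit(parent, path)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in list(parents):
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


# Usage: a valid three-generation chain, then an impossible loop
print(find_ancestry_cycle({"C": ("M", "F"), "M": ("G", None)}))
print(find_ancestry_cycle({"A": ("B", None), "B": ("A", None)}))
```

A non-`None` result could reasonably be treated as a hard failure for the pedigree, independent of the per-relationship scores.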

## Conclusion and Key Takeaways

In this lab, we've built a complete end-to-end pedigree reconstruction pipeline using Bonsai v3. We've covered each major component of the pipeline:

1. **Pipeline Architecture**: We designed a modular, robust pipeline architecture with well-defined interfaces between components
2. **IBD Processing**: We implemented functions for loading, filtering, and analyzing IBD segments
3. **Relationship Inference**: We developed algorithms for inferring relationships from IBD statistics and demographic data
4. **Community Detection**: We used network analysis to identify clusters of related individuals
5. **Pedigree Construction**: We built pedigree structures that respect biological constraints and explain observed IBD patterns
6. **Confidence Assessment**: We evaluated the reliability of our reconstructed relationships and pedigrees

Key takeaways from this lab include:

- **Modular Design**: Breaking the pipeline into independent components allows for easier testing, maintenance, and improvement
- **Data Quality**: Proper filtering and preprocessing of IBD data is critical for accurate results
- **Demographic Integration**: Age and sex information significantly improves relationship inference
- **Confidence Metrics**: Quantifying uncertainty helps users interpret and trust the results
- **Visualization**: Effective visualization makes complex pedigree structures interpretable

Bonsai v3 provides a powerful framework for pedigree reconstruction, but real-world applications often require customization and fine-tuning based on the specific data and requirements of the project.

## Exercise: End-to-End Integration

For this final exercise, you'll put together everything you've learned to implement a complete pedigree reconstruction pipeline for a real-world scenario.

### Scenario
You've been given IBD data from a genetic ancestry testing company. The data includes IBD segments for 100 individuals, some of whom are known to be related. Your task is to reconstruct pedigrees for these individuals, providing confidence scores and visualizations.

### Task
Implement a complete end-to-end pipeline that:

1. Processes and filters the IBD data appropriately
2. Integrates available demographic information
3. Detects communities of related individuals
4. Reconstructs pedigrees within each community
5. Assesses confidence in the reconstructed relationships
6. Generates visualizations and reports

### Implementation Guidelines

1. Start with the `BonsaiPipeline` class we designed earlier.
2. Fill in the implementation of each component.
3. Add error handling and logging to ensure robustness.
4. Include configuration options for different use cases.
5. Think about how to parallelize processing for larger datasets.
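
Guidelines 3 and 5 can be combined in a small helper. The following is a minimal sketch, assuming each community can be processed independently. `make_logger`, `process_communities`, and the `build_fn` callback are hypothetical names, not part of Bonsai. Threads are used to keep the example simple. For CPU-bound pedigree construction a process pool may be the better fit.

```python
import logging
from concurrent.futures import ThreadPoolExecutor


def make_logger(name="bonsai_pipeline", level=logging.INFO):
    """Create a console logger; the name is an arbitrary choice for this sketch."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # Avoid duplicate handlers when a notebook cell is re-run
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger


def process_communities(communities, build_fn, max_workers=4, logger=None):
    """Apply build_fn to each community in parallel.

    A failure in one community is logged and recorded as None so that the
    remaining communities are still processed. Results keep the input order.
    """
    logger = logger or make_logger()

    def safe_build(indexed):
        index, community = indexed
        try:
            return build_fn(community)
        except Exception:
            logger.exception("Community %d failed", index)
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_build, enumerate(communities)))


# Usage with a stand-in build function that just counts members
results = process_communities([{"a", "b"}, {"c"}], build_fn=len, max_workers=2)
print(results)
```

In the lab's pipeline, `build_fn` could wrap `build_pedigree_from_relationships` with the relationship dictionary bound in advance, for example via `functools.partial`.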

### Expected Output
Your pipeline should produce:

1. Filtered and processed IBD statistics
2. Community assignments for all individuals
3. Reconstructed pedigrees for each community
4. Confidence scores for all relationships
5. Visualizations of the pedigrees with confidence information
6. A detailed report summarizing the findings
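
One practical detail when saving these outputs: the confidence dictionaries in this lab are keyed by `(id1, id2)` tuples, which `json` cannot serialize as object keys. A minimal sketch of one workaround is to write each pair as a record. The function name and the output field names are assumptions chosen for illustration.

```python
import json


def results_to_json(communities, confidence_scores, overall_confidence):
    """Serialize pipeline outputs to a JSON string (field names are illustrative)."""
    payload = {
        # Sets are not JSON-serializable, so each community becomes a sorted list
        "communities": [sorted(c) for c in communities],
        # Tuple keys become explicit id1/id2 fields
        "confidence_scores": [
            {"id1": a, "id2": b, "score": score}
            for (a, b), score in sorted(confidence_scores.items())
        ],
        "overall_confidence": overall_confidence,
    }
    return json.dumps(payload, indent=2)


# Usage with toy values
text = results_to_json(
    communities=[{"ind_2", "ind_1"}],
    confidence_scores={("ind_1", "ind_2"): 0.95},
    overall_confidence=0.95,
)
print(text)
```

The resulting string can be written to disk with ordinary file I/O or passed to the notebook's own result-saving helper.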

### Template

```python
from typing import Dict, List, Set, Tuple, Optional, Any
import os
import logging
import time
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd

class CompleteBonsaiPipeline:
    """Complete end-to-end pedigree reconstruction pipeline using Bonsai v3."""
    
    def __init__(self, config: Optional[Dict[str, Any]] = None):
        """Initialize the pipeline.
        
        Args:
            config: Configuration dictionary
        """
        # Set up configuration (copy so the caller's dictionary is not modified)
        self.config = dict(config or {})
        self.config.setdefault('min_segment_cm', 7.0)
        self.config.setdefault('max_pedigree_size', 100)
        
        # Initialize pipeline components
        self.ibd_processor = None
        self.relationship_inferrer = None
        self.community_detector = None
        self.pedigree_builder = None
        self.confidence_assessor = None
        self.visualizer = None
        
        # Results populated by run()
        self.pedigrees = []
        self.confidence_scores = {}
        
        # Set up logging
        self._setup_logging()
    
    def _setup_logging(self):
        """Set up logging for the pipeline."""
        # Minimal default so self.logger exists when run() is called
        self.logger = logging.getLogger(self.__class__.__name__)
        # ... implement logging setup (handlers, levels, formatting) ...
    
    def run(self, ibd_file: str, demographic_file: Optional[str] = None):
        """Run the complete pipeline.
        
        Args:
            ibd_file: Path to the IBD segments file
            demographic_file: Path to the demographic data file (optional)
        """
        start_time = time.time()
        
        # Step 1: Process IBD data
        self.logger.info("Processing IBD data...")
        
        # Step 2: Load demographic data
        self.logger.info("Loading demographic data...")
        
        # Step 3: Infer pairwise relationships
        self.logger.info("Inferring pairwise relationships...")
        
        # Step 4: Detect communities
        self.logger.info("Detecting communities...")
        
        # Step 5: Construct pedigrees
        self.logger.info("Constructing pedigrees...")
        
        # Step 6: Assess confidence
        self.logger.info("Assessing confidence...")
        
        # Step 7: Visualize results
        self.logger.info("Visualizing results...")
        
        # Step 8: Generate reports
        self.logger.info("Generating reports...")
        
        elapsed_time = time.time() - start_time
        self.logger.info(f"Pipeline completed in {elapsed_time:.2f} seconds")
        
        return {
            "pedigrees": self.pedigrees,
            "confidence_scores": self.confidence_scores,
            "execution_time": elapsed_time
        }

# Implement your complete pipeline here
```

Use the functions and classes developed in this lab as building blocks for your implementation. If the class dataset is not accessible in your environment, simulate the input data; otherwise, start with a small subset of the class data.
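
If you need simulated input, the sketch below generates toy IBD segments for the 100-individual scenario. It is only meant to exercise the pipeline plumbing: the column names are hypothetical, sharing occurs only within families, and segment counts and lengths are drawn uniformly at random rather than from any genetic model.

```python
import random

def simulate_ibd_segments(n_families=5, family_size=4, seed=42):
    """Generate toy IBD segments shared only within families.

    Returns (rows, families). Each row is a dict with hypothetical
    column names; values are arbitrary, not genetically realistic.
    """
    rng = random.Random(seed)  # local RNG keeps results reproducible
    families = {
        f: [f"F{f}_I{i}" for i in range(family_size)]
        for f in range(n_families)
    }
    rows = []
    for members in families.values():
        for i, id1 in enumerate(members):
            for id2 in members[i + 1:]:
                # Each within-family pair shares 1 to 5 segments
                for _ in range(rng.randint(1, 5)):
                    start = rng.uniform(0.0, 150.0)
                    length = rng.uniform(3.0, 60.0)
                    rows.append({
                        "id1": id1,
                        "id2": id2,
                        "chromosome": rng.randint(1, 22),
                        "start_cm": round(start, 2),
                        "end_cm": round(start + length, 2),
                    })
    return rows, families

# 25 families of 4 gives the 100 individuals described in the scenario
rows, families = simulate_ibd_segments(n_families=25, family_size=4)
individuals = {m for members in families.values() for m in members}
print(len(individuals), "individuals,", len(rows), "segments")
```

Because some simulated segments are shorter than the default `min_segment_cm` of 7.0, this data also exercises the filtering step.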

## References and Further Reading

### Bonsai v3 Documentation
- Bonsai v3 User Guide: Comprehensive documentation of the Bonsai v3 codebase
- Bonsai v3 API Reference: Detailed descriptions of all classes and functions

### IBD Detection and Processing
- Browning, B. L., & Browning, S. R. (2013). Improving the accuracy and efficiency of identity-by-descent detection in population data. *Genetics*, 194(2), 459-471.
- Ramstetter, M. D., et al. (2017). Benchmarking relatedness inference methods with genome-wide data from thousands of relatives. *Genetics*, 207(1), 75-82.

### Relationship Inference
- Kling, D., et al. (2012). DNA microarray as a tool in establishing genetic relatedness—Current status and future prospects. *Forensic Science International: Genetics*, 6(3), 322-329.
- Staples, J., et al. (2014). PRIMUS: rapid reconstruction of pedigrees from genome-wide estimates of identity by descent. *The American Journal of Human Genetics*, 95(5), 553-564.

### Pedigree Reconstruction
- Ko, A., & Nielsen, R. (2017). Composite likelihood method for inferring local pedigrees. *PLoS Genetics*, 13(8), e1006963.
- Shchur, V., et al. (2019). Fast and accurate pedigree reconstruction using proximity-dependent splitting. *bioRxiv*, 675462.

### Community Detection in Networks
- Blondel, V. D., et al. (2008). Fast unfolding of communities in large networks. *Journal of Statistical Mechanics: Theory and Experiment*, 2008(10), P10008.
- Fortunato, S. (2010). Community detection in graphs. *Physics Reports*, 486(3-5), 75-174.

### Genetic Genealogy Applications
- Erlich, Y., et al. (2018). Identity inference of genomic data using long-range familial searches. *Science*, 362(6415), 690-694.
- Edge, M. D., & Coop, G. (2019). Reconstructing the history of polygenic scores using coalescent trees. *Genetics*, 211(1), 235-262.

### Software Tools
- [PLINK](https://www.cog-genomics.org/plink/): Whole genome association analysis toolset
- [IBDseq](https://faculty.washington.edu/sguy/ibdseq/): IBD segment detection tool
- [GERMLINE](http://www1.cs.columbia.edu/~gusev/germline/): IBD detection software
- [PRIMUS](https://primus.gs.washington.edu/): Pedigree reconstruction tool
- [DRUID](https://github.com/23andMe/druid): Deep Relatedness Utilizing Identity by Descent

### Related Jupyter Notebooks
- **Lab 21: Pedigree Rendering**: Techniques for visualizing pedigree structures
- **Lab 28: Integration with Other Genealogical Tools**: Working with external tools and formats