# Lab 25: Real-World Datasets and Challenges

## Overview

This notebook applies Bonsai v3 to real-world genetic datasets, covering common challenges, adaptation strategies, and validation approaches. We'll look at how to prepare and process data from different sources, adjust for population-specific considerations, and validate results in a range of scenarios.

**Learning Objectives:**
- Understand the types and characteristics of real-world genetic datasets
- Learn data preparation and quality control techniques for diverse data sources
- Explore population-specific considerations for relationship inference
- Apply Bonsai to different case studies with real-world complexities
- Evaluate and validate genetic relationship reconstruction results

**Prerequisites:**
- Completion of Lab 9: Pedigree Data Structures
- Completion of Lab 12: Relationship Assessment
- Completion of Lab 24: Complex Relationship Patterns

**Estimated completion time:** 60-90 minutes

In [None]:
# 🧬 Google Colab Setup - Run this cell first!
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown

def is_colab():
    '''Check if running in Google Colab'''
    try:
        import google.colab
        return True
    except ImportError:
        return False

if is_colab():
    print("🔬 Setting up Google Colab environment...")
    
    # Install dependencies
    print("📦 Installing packages...")
    !pip install -q pysam biopython scikit-allel networkx pygraphviz seaborn plotly
    !apt-get update -qq && apt-get install -qq samtools bcftools tabix graphviz-dev
    
    # Create directories
    !mkdir -p /content/class_data /content/results
    
    # Download essential class data
    print("📥 Downloading class data...")
    S3_BASE = "https://computational-genetic-genealogy.s3.us-east-2.amazonaws.com/class_data/"
    data_files = [
        "pedigree.fam", "pedigree.def", 
        "merged_opensnps_autosomes_ped_sim.seg",
        "merged_opensnps_autosomes_ped_sim-everyone.fam",
        "ped_sim_run2.seg", "ped_sim_run2-everyone.fam"
    ]
    
    for file in data_files:
        !wget -q -O /content/class_data/{file} {S3_BASE}{file}
        print(f"  ✅ {file}")
    
    # Define utility functions
    def setup_environment():
        return "/content/class_data", "/content/results"
    
    def save_results(dataframe, filename, description="results"):
        os.makedirs("/content/results", exist_ok=True)
        full_path = f"/content/results/{filename}"
        dataframe.to_csv(full_path, index=False)
        display(HTML(f'''
        <div style="padding: 10px; background-color: #e3f2fd; border-left: 4px solid #2196f3; margin: 10px 0;">
            <p><strong>💾 Results saved!</strong> To download: 
            <code>from google.colab import files; files.download('{full_path}')</code></p>
        </div>
        '''))
        return full_path
    
    def save_plot(plt, filename, description="plot"):
        os.makedirs("/content/results", exist_ok=True)
        full_path = f"/content/results/{filename}"
        plt.savefig(full_path, dpi=300, bbox_inches='tight')
        plt.show()
        display(HTML(f'''
        <div style="padding: 10px; background-color: #e8f5e8; border-left: 4px solid #4caf50; margin: 10px 0;">
            <p><strong>📊 Plot saved!</strong> To download: 
            <code>from google.colab import files; files.download('{full_path}')</code></p>
        </div>
        '''))
        return full_path
    
    print("✅ Colab setup complete! Ready to explore genetic genealogy.")
    
else:
    print("🏠 Local environment detected")
    def setup_environment():
        return "class_data", "results"
    def save_results(df, filename, description=""):
        os.makedirs("results", exist_ok=True)
        path = f"results/{filename}"
        df.to_csv(path, index=False)
        return path
    def save_plot(plt, filename, description=""):
        os.makedirs("results", exist_ok=True)
        path = f"results/{filename}"
        plt.savefig(path, dpi=300, bbox_inches='tight')
        plt.show()
        return path

# Set up paths and configure visualization
DATA_DIR, RESULTS_DIR = setup_environment()
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [None]:
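# Setup Bonsai module paths
def is_jupyterlite():
    '''Check if running in JupyterLite (Pyodide reports sys.platform as "emscripten")'''
    return sys.platform == 'emscripten'

if not is_jupyterlite():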
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
import importlib
import inspect

def display_module_classes(module_name):
    """Display classes and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all classes
        classes = inspect.getmembers(module, inspect.isclass)
        
        # Filter classes defined in this module (not imported)
        classes = [(name, cls) for name, cls in classes if cls.__module__ == module_name]
        
        if not classes:
            print(f"No classes found in module {module_name}")
            return
            
        # Print info for each class
        for name, cls in classes:
            display(Markdown(f"### Class: {name}"))
            
            # Get docstring
            doc = inspect.getdoc(cls)
            if doc:
                display(Markdown(f"**Documentation:**\n{doc}"))
            else:
                display(Markdown("*No documentation available*"))
            
            # Get methods
            methods = inspect.getmembers(cls, inspect.isfunction)
            public_methods = [(method_name, method) for method_name, method in methods 
                             if not method_name.startswith('_')]
            
            if public_methods:
                display(Markdown("**Public Methods:**"))
                for method_name, method in public_methods:
                    sig = inspect.signature(method)
                    display(Markdown(f"- `{method_name}{sig}`"))
            else:
                display(Markdown("*No public methods*"))
            
            display(Markdown("---"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def display_module_functions(module_name):
    """Display functions and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all functions
        functions = inspect.getmembers(module, inspect.isfunction)
        
        # Filter functions defined in this module (not imported)
        functions = [(name, func) for name, func in functions if func.__module__ == module_name]
        
        if not functions:
            print(f"No functions found in module {module_name}")
            return
            
        # Filter public functions
        public_functions = [(name, func) for name, func in functions if not name.startswith('_')]
        
        if not public_functions:
            print(f"No public functions found in module {module_name}")
            return
            
        # Print info for each function
        for name, func in public_functions:                
            display(Markdown(f"### Function: {name}"))
            
            # Get signature
            sig = inspect.signature(func)
            display(Markdown(f"**Signature:** `{name}{sig}`"))
            
            # Get docstring
            doc = inspect.getdoc(func)
            if doc:
                display(Markdown(f"**Documentation:**\n{doc}"))
            else:
                display(Markdown("*No documentation available*"))
                
            display(Markdown("---"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def view_function_source(module_name, function_name):
    """Display the source code of a function"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Get the function
        func = getattr(module, function_name)
        
        # Get the source code
        source = inspect.getsource(func)
        
        # Print the source code with syntax highlighting
        display(Markdown(f"### Source code for `{function_name}`\n```python\n{source}\n```"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except AttributeError:
        print(f"Function {function_name} not found in module {module_name}")
    except Exception as e:
        print(f"Error processing function {function_name}: {e}")

def view_class_source(module_name, class_name):
    """Display the source code of a class"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Get the class
        cls = getattr(module, class_name)
        
        # Get the source code
        source = inspect.getsource(cls)
        
        # Print the source code with syntax highlighting
        display(Markdown(f"### Source code for class `{class_name}`\n```python\n{source}\n```"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except AttributeError:
        print(f"Class {class_name} not found in module {module_name}")
    except Exception as e:
        print(f"Error processing class {class_name}: {e}")

def explore_module(module_name):
    """Display a comprehensive overview of a module with classes and functions"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Module docstring
        doc = inspect.getdoc(module)
        display(Markdown(f"# Module: {module_name}"))
        
        if doc:
            display(Markdown(f"**Module Documentation:**\n{doc}"))
        else:
            display(Markdown("*No module documentation available*"))
            
        display(Markdown("---"))
        
        # Display classes
        display(Markdown("## Classes"))
        display_module_classes(module_name)
        
        # Display functions
        display(Markdown("## Functions"))
        display_module_functions(module_name)
        
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error exploring module {module_name}: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [None]:
try:
    from bonsaitree import v3
    print("✅ Successfully imported Bonsai v3 module")
    
    # Print Bonsai version information if available
    if hasattr(v3, "__version__"):
        print(f"Bonsai v3 version: {v3.__version__}")
    
    # List key submodules
    print("\nAvailable Bonsai submodules:")
    for module_name in dir(v3):
        if not module_name.startswith("_"):
            print(f"- {module_name}")
except ImportError as e:
    print(f"❌ Failed to import Bonsai v3 module: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")
    print("Make sure you've properly set up your environment with the Bonsai repository.")

## Introduction

Theoretical models and simulated datasets are essential for developing and testing genetic genealogy algorithms, but real-world applications introduce numerous practical challenges. In this lab, we'll explore how Bonsai v3 handles real-world genetic datasets with their inherent complexities, including data quality issues, diverse population structures, and the need for rigorous validation.

**Key concepts we'll cover:**
- Types and characteristics of real-world genetic datasets
- Data preparation and quality control techniques
- Population-specific considerations for accurate relationship inference
- Case studies illustrating common real-world scenarios
- Validation approaches for genetic relationship reconstruction

The transition from controlled simulations to real-world datasets is a critical step in applying genetic genealogy tools to practical problems. Real datasets often contain challenges like missing data, varying quality across samples, platform-specific biases, and diverse population structures that can affect IBD detection and relationship inference.

This lab builds on your knowledge of Bonsai's architecture and methods, focusing on practical applications and the adaptations required for successfully working with diverse datasets encountered in the field.

## Part 1: Types of Real-World Genetic Datasets

Various types of genetic datasets are used in genealogical research, each with unique characteristics, strengths, and limitations. Understanding these differences is essential for properly configuring Bonsai and interpreting results.

### 1.1 Direct-to-Consumer (DTC) Testing Data

DTC genetic testing companies like 23andMe, AncestryDNA, MyHeritage, and FamilyTreeDNA have generated massive databases of consumer genetic data. These datasets have several important characteristics:

- **Diverse SNP coverage**: Different companies use different genotyping arrays with varying numbers of SNPs (typically 500,000 to 1 million) and different SNP selections
- **Variable data quality**: Quality can vary based on sample collection methods, processing protocols, and genotyping technology
- **Format variations**: Each company uses proprietary file formats, requiring conversion to standard formats
- **Privacy restrictions**: Access and sharing limitations based on user agreements and privacy policies
- **Self-reported metadata**: User-provided information about ancestry, relationships, and demographics may be incomplete or inaccurate
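
To make the format-variation point concrete, here is a minimal sketch of loading one DTC raw-data export into a common table. It assumes a 23andMe-style layout (comment lines starting with `#`, then tab-separated rsid, chromosome, position, and genotype columns, with `--` marking a no-call). Other vendors lay out their files differently, so the column names would need adjusting. The function name `read_dtc_raw` and the sample rows are hypothetical and are not part of Bonsai.

```python
import io

import pandas as pd

# Hypothetical excerpt in a 23andMe-style layout (made-up markers, not real data)
RAW_TEXT = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t10000\tAA\n"
    "rs0000002\t1\t20000\tAG\n"
    "rs0000003\t2\t15000\t--\n"
    "rs0000004\tX\t5000\tC\n"
)

def read_dtc_raw(handle):
    """Read a 23andMe-style raw data file into a DataFrame (illustrative helper)."""
    df = pd.read_csv(
        handle,
        sep="\t",
        comment="#",          # skip the vendor's comment/header lines
        header=None,
        names=["rsid", "chromosome", "position", "genotype"],
        dtype={"rsid": str, "chromosome": str, "position": int, "genotype": str},
    )
    # Flag no-calls so they can be treated as missing data downstream
    df["no_call"] = df["genotype"].str.contains("-", regex=False)
    return df

df = read_dtc_raw(io.StringIO(RAW_TEXT))
print(df)
print(f"No-call rate: {df['no_call'].mean():.2%}")
```

Reading chromosomes as strings keeps labels such as `X` or `MT` intact, which matters when files from several vendors are later merged.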

### 1.2 Research and Clinical Datasets

Academic and clinical research generates genetic datasets with different characteristics:

- **Whole genome sequencing (WGS)**: Complete genome coverage with higher resolution than DTC tests
- **Exome sequencing**: Focused on protein-coding regions with high depth but limited coverage for IBD detection
- **Specialized arrays**: Custom arrays designed for specific research questions
- **Well-documented metadata**: Often includes verified pedigrees and detailed phenotypic information
- **Controlled access**: Usually restricted by ethics boards and data access committees

### 1.3 Historical and Forensic Datasets

These datasets present unique challenges:

- **Ancient DNA**: Degraded samples with high missing data rates
- **Low coverage sequencing**: Limited genetic information requiring specialized analysis
- **Forensic samples**: Often focused on identity markers rather than genealogical markers
- **Mixed samples**: May contain DNA from multiple individuals
- **Limited reference data**: May come from populations underrepresented in reference panels
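
As a rough illustration of why missing data matters for these sample types, the sketch below computes a per-sample call rate from a small genotype matrix and flags samples that fall under a cutoff. The matrix, the sample labels, and the 0.9 cutoff are all assumptions chosen for illustration. They are not Bonsai defaults or recommended thresholds.

```python
import numpy as np

# Hypothetical genotype matrix: rows are samples, columns are SNPs.
# Values are alternate-allele counts (0, 1, 2); -1 marks a missing call.
genotypes = np.array([
    [0, 1, 2, 0, 1, 1, 0, 2],
    [0, -1, 2, -1, 1, -1, 0, 2],
    [-1, -1, -1, 0, -1, -1, -1, 2],
])
sample_ids = ["modern_array", "low_coverage", "degraded"]

# Call rate = fraction of SNPs with a non-missing genotype
call_rate = (genotypes >= 0).mean(axis=1)

MIN_CALL_RATE = 0.9  # illustrative cutoff only
for sample_id, rate in zip(sample_ids, call_rate):
    status = "keep" if rate >= MIN_CALL_RATE else "review"
    print(f"{sample_id}: call rate {rate:.3f} -> {status}")
```

Samples flagged for review are not necessarily unusable, but IBD segments called from them deserve extra scrutiny.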

### 1.4 Population Genetic Datasets

Public reference datasets such as the 1000 Genomes Project, HapMap, and gnomAD share several characteristics:

- **High-quality reference data**: Well-validated genetic data across diverse populations
- **Population structure information**: Detailed metadata on ancestral origins
- **Complete documentation**: Published methodologies and quality metrics
- **Open access**: Generally available for research and algorithm development
- **Limited genealogical information**: Typically lacks detailed family structures

## Implementation: Exploring Dataset Characteristics

Let's implement some tools to explore and characterize genetic datasets that you might encounter in real-world applications. We'll create functions to analyze VCF files, which are a common format for storing genetic variation data.
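
The utilities in this section rely on the fixed column layout of a VCF data line: eight mandatory columns, an optional FORMAT column, and then one column per sample. The short sketch below splits a single made-up record to show where the genotype (`GT`) value sits. The record itself is hypothetical.

```python
# A single hypothetical VCF data line (tab-separated) with two samples
line = "1\t10177\trs0000001\tA\tC\t100\tPASS\t.\tGT:DP\t0|1:12\t./.:0"

fields = line.split("\t")

# The first nine columns are fixed: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chrom, pos, var_id, ref, alt, qual, flt, info, fmt = fields[:9]
sample_columns = fields[9:]

# FORMAT lists the per-sample keys in order, so locate GT once per line
gt_index = fmt.split(":").index("GT")
genotypes = [column.split(":")[gt_index] for column in sample_columns]

print(chrom, pos, ref, alt, genotypes)
# prints: 1 10177 A C ['0|1', './.']
```

Here `0|1` is a phased heterozygous call and `./.` is a missing call, which are the two cases the quality metrics in this section need to tell apart.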

In [ ]:
# First, let's create some utility functions for reading and analyzing VCF files
import gzip

def parse_vcf_metadata(vcf_file, max_lines=1000):
    """Parse VCF metadata to extract basic information about the dataset.
    
    Args:
        vcf_file (str): Path to VCF file (can be gzipped)
        max_lines (int): Maximum number of header lines to read
        
    Returns:
        dict: Dictionary containing metadata information
    """
    is_gzipped = vcf_file.endswith('.gz')
    opener = gzip.open if is_gzipped else open
    mode = 'rt' if is_gzipped else 'r'
    
    metadata = {
        'file_path': vcf_file,
        'format_version': None,
        'file_format': None,
        'contigs': [],
        'samples': [],
        'filters': {},
        'infos': {},
        'formats': {},
        'n_header_lines': 0
    }
    
    with opener(vcf_file, mode) as f:
        for i, line in enumerate(f):
            if i >= max_lines:
                break
                
            # Count header lines
            metadata['n_header_lines'] += 1
            
            # Skip empty lines
            if line.strip() == '':
                continue
                
            # Extract metadata from header lines
            if line.startswith('##'):
                line = line.strip()
                
                # VCF version
                if line.startswith('##fileformat='):
                    metadata['file_format'] = line.split('=')[1]
                
                # Contig information
                elif line.startswith('##contig='):
                    contig_info = {}
                    fields = line[9:].strip('<>').split(',')
                    for field in fields:
                        if '=' in field:
                            key, value = field.split('=', 1)
                            contig_info[key] = value
                    if 'ID' in contig_info:
                        metadata['contigs'].append(contig_info)
                
                # Filter information
                elif line.startswith('##FILTER='):
                    filter_info = {}
                    fields = line[10:].strip('<>').split(',')
                    for field in fields:
                        if '=' in field:
                            key, value = field.split('=', 1)
                            filter_info[key] = value
                    if 'ID' in filter_info:
                        metadata['filters'][filter_info['ID']] = filter_info
                
                # INFO field information
                elif line.startswith('##INFO='):
                    info_field = {}
                    fields = line[8:].strip('<>').split(',')
                    for field in fields:
                        if '=' in field:
                            key, value = field.split('=', 1)
                            info_field[key] = value
                    if 'ID' in info_field:
                        metadata['infos'][info_field['ID']] = info_field
                
                # FORMAT field information
                elif line.startswith('##FORMAT='):
                    format_field = {}
                    fields = line[10:].strip('<>').split(',')
                    for field in fields:
                        if '=' in field:
                            key, value = field.split('=', 1)
                            format_field[key] = value
                    if 'ID' in format_field:
                        metadata['formats'][format_field['ID']] = format_field
            
            # Extract sample information from the header line
            elif line.startswith('#CHROM'):
                fields = line.strip().split('\t')
                if len(fields) > 9:  # VCF has samples
                    metadata['samples'] = fields[9:]
                break  # Found the header line, we're done with metadata
    
    # Calculate some summary statistics
    metadata['n_samples'] = len(metadata['samples'])
    metadata['n_contigs'] = len(metadata['contigs'])
    
    return metadata

def count_variants_in_vcf(vcf_file, max_variants=None):
    """Count the number of variants in a VCF file.
    
    Args:
        vcf_file (str): Path to VCF file (can be gzipped)
        max_variants (int, optional): Maximum number of variants to count
        
    Returns:
        dict: Dictionary with variant counts per chromosome
    """
    is_gzipped = vcf_file.endswith('.gz')
    opener = gzip.open if is_gzipped else open
    mode = 'rt' if is_gzipped else 'r'
    
    variant_counts = {'total': 0, 'by_chrom': {}}
    variants_per_million_positions = {}
    
    with opener(vcf_file, mode) as f:
        for line in f:
            # Skip header lines
            if line.startswith('#'):
                continue
            
            fields = line.strip().split('\t')
            if len(fields) < 8:  # Minimum required VCF fields
                continue
            
            chrom = fields[0]
            pos = int(fields[1])
            
            # Initialize chromosome counter if needed
            if chrom not in variant_counts['by_chrom']:
                variant_counts['by_chrom'][chrom] = 0
            
            # Count this variant
            variant_counts['by_chrom'][chrom] += 1
            variant_counts['total'] += 1
            
            # Track variants per 1 Mb window; window 0 covers positions below 1,000,000
            windows = variants_per_million_positions.setdefault(chrom, [])
            million_mark = pos // 1_000_000
            
            # Extend the list with empty windows until this one exists
            if million_mark >= len(windows):
                windows.extend([0] * (million_mark + 1 - len(windows)))
            
            # Increment the counter for this window
            windows[million_mark] += 1
            
            # Stop if we've reached the maximum number of variants
            if max_variants and variant_counts['total'] >= max_variants:
                break
    
    # Add the density information
    variant_counts['density'] = variants_per_million_positions
    
    return variant_counts

def analyze_sample_genotypes(vcf_file, max_variants=1000):
    """Analyze genotype quality and missingness for samples in a VCF file.
    
    Args:
        vcf_file (str): Path to VCF file (can be gzipped)
        max_variants (int): Maximum number of variants to analyze
        
    Returns:
        dict: Dictionary with sample quality metrics
    """
    is_gzipped = vcf_file.endswith('.gz')
    opener = gzip.open if is_gzipped else open
    mode = 'rt' if is_gzipped else 'r'
    
    samples = []
    sample_stats = {}
    
    with opener(vcf_file, mode) as f:
        variant_count = 0
        
        for line in f:
            if line.startswith('#CHROM'):
                # Extract sample names
                fields = line.strip().split('\t')
                if len(fields) > 9:
                    samples = fields[9:]
                    # Initialize statistics for each sample
                    for sample in samples:
                        sample_stats[sample] = {
                            'total_genotypes': 0,
                            'missing_genotypes': 0,
                            'homozygous_ref': 0,
                            'homozygous_alt': 0,
                            'heterozygous': 0
                        }
                continue
            
            if line.startswith('#'):
                continue
            
            fields = line.strip().split('\t')
            if len(fields) < 10:  # Need at least one sample
                continue
            
            # Parse format field to find GT position
            format_fields = fields[8].split(':')
            try:
                gt_index = format_fields.index('GT')
            except ValueError:
                # Skip variants without genotype information
                continue
            
            # Analyze genotypes for each sample
            for i, sample in enumerate(samples):
                if i + 9 >= len(fields):
                    # Missing sample field
                    continue
                
                sample_data = fields[i + 9].split(':')
                
                # Every inspected genotype counts toward the total so that
                # missing_rate stays between 0 and 1
                sample_stats[sample]['total_genotypes'] += 1
                
                if gt_index >= len(sample_data):
                    # Missing GT field
                    sample_stats[sample]['missing_genotypes'] += 1
                    continue
                
                # Split into alleles; '/' is unphased, '|' is phased
                alleles = sample_data[gt_index].replace('|', '/').split('/')
                
                # Count genotype types
                if '.' in alleles or '' in alleles:
                    # Fully or partially missing call
                    sample_stats[sample]['missing_genotypes'] += 1
                elif all(allele == '0' for allele in alleles):
                    sample_stats[sample]['homozygous_ref'] += 1
                elif len(set(alleles)) == 1:
                    # Same non-reference allele on every copy (any ALT index)
                    sample_stats[sample]['homozygous_alt'] += 1
                else:
                    sample_stats[sample]['heterozygous'] += 1
            
            variant_count += 1
            if max_variants and variant_count >= max_variants:
                break
    
    # Calculate summary statistics
    for sample in sample_stats:
        total = sample_stats[sample]['total_genotypes']
        if total > 0:
            sample_stats[sample]['missing_rate'] = sample_stats[sample]['missing_genotypes'] / total
            sample_stats[sample]['heterozygosity'] = sample_stats[sample]['heterozygous'] / total
    
    return sample_stats

def summarize_dataset(vcf_file, max_metadata_lines=1000, max_variants=10000, max_count_variants=None):
    """Generate a comprehensive summary of a genetic dataset in VCF format.
    
    Args:
        vcf_file (str): Path to VCF file (can be gzipped)
        max_metadata_lines (int): Maximum number of header lines to read
        max_variants (int): Maximum number of variants to analyze for sample quality
        max_count_variants (int, optional): Maximum number of variants to count (None = all)
        
    Returns:
        dict: Dictionary with dataset summary
    """
    print(f"Analyzing dataset: {vcf_file}")
    
    # Parse metadata
    print("  Parsing metadata...")
    metadata = parse_vcf_metadata(vcf_file, max_lines=max_metadata_lines)
    
    # Count variants (limited sample if specified)
    print("  Counting variants...")
    variant_counts = count_variants_in_vcf(vcf_file, max_variants=max_count_variants)
    
    # Analyze sample quality
    print("  Analyzing sample quality...")
    sample_stats = analyze_sample_genotypes(vcf_file, max_variants=max_variants)
    
    # Combine results
    summary = {
        'metadata': metadata,
        'variant_counts': variant_counts,
        'sample_stats': sample_stats
    }
    
    # Average each metric over the samples that actually have it
    missing_rates = [s['missing_rate'] for s in sample_stats.values() if 'missing_rate' in s]
    heterozygosities = [s['heterozygosity'] for s in sample_stats.values() if 'heterozygosity' in s]
    
    # Add high-level summary
    summary['summary'] = {
        'file_path': vcf_file,
        'format': metadata['file_format'],
        'samples': metadata['n_samples'],
        'contigs': metadata['n_contigs'],
        'total_variants': variant_counts['total'],
        'chromosomes': list(variant_counts['by_chrom'].keys()),
        'avg_missing_rate': sum(missing_rates) / len(missing_rates) if missing_rates else None,
        'avg_heterozygosity': sum(heterozygosities) / len(heterozygosities) if heterozygosities else None
    }
    
    print("Analysis complete!")
    return summary

In [ ]:
# Visualization functions for dataset exploration

def plot_variant_distribution(variant_counts, title="Variant Distribution by Chromosome"):
    """Plot the distribution of variants across chromosomes.
    
    Args:
        variant_counts (dict): Dictionary with variant counts per chromosome
        title (str): Plot title
    """
    # Extract chromosome names and counts
    chroms = []
    counts = []
    
    # Handle both string and numeric chromosome names
    for chrom in sorted(variant_counts['by_chrom'].keys(), 
                        key=lambda x: int(x.replace('chr', '')) if x.replace('chr', '').isdigit() else float('inf')):
        chroms.append(chrom)
        counts.append(variant_counts['by_chrom'][chrom])
    
    # Create the bar plot
    fig = plt.figure(figsize=(12, 6))
    plt.bar(chroms, counts)
    plt.title(title)
    plt.xlabel('Chromosome')
    plt.ylabel('Number of Variants')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    
    # Add text with total count
    total = variant_counts['total']
    plt.text(0.95, 0.95, f'Total Variants: {total:,}', 
             transform=plt.gca().transAxes, ha='right', va='top', 
             bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
    
    plt.show()
    
    # Return the figure created above; calling plt.gcf() after plt.show()
    # can hand back a new, empty figure in notebook backends
    return fig

def plot_variant_density(variant_counts, chromosomes=None, figsize=(15, 10)):
    """Plot the density of variants across chromosomes.
    
    Args:
        variant_counts (dict): Dictionary with variant density information
        chromosomes (list, optional): List of chromosomes to plot
        figsize (tuple): Figure size
    """
    density_data = variant_counts['density']
    
    if chromosomes is None:
        # Use a subset of chromosomes if there are many
        all_chroms = list(density_data.keys())
        chromosomes = all_chroms[:min(6, len(all_chroms))]
    
    # Calculate number of rows and columns for subplots
    n_chroms = len(chromosomes)
    n_cols = min(3, n_chroms)
    n_rows = (n_chroms + n_cols - 1) // n_cols
    
    # Create subplots
    fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize)
    if n_rows == 1 and n_cols == 1:
        axes = np.array([axes])
    axes = axes.flatten()
    
    # Plot density for each chromosome
    for i, chrom in enumerate(chromosomes):
        if i < len(axes) and chrom in density_data:
            ax = axes[i]
            data = density_data[chrom]
            x = list(range(1, len(data) + 1))
            
            ax.bar(x, data, width=0.8)
            ax.set_title(f'Chromosome {chrom}')
            ax.set_xlabel('Position (Mb)')
            ax.set_ylabel('Variants per Mb')
            
            # Add mean line
            if data:
                mean_density = sum(data) / len(data)
                ax.axhline(mean_density, color='red', linestyle='--', 
                           label=f'Mean: {mean_density:.1f}')
                ax.legend()
    
    # Hide unused subplots
    for i in range(n_chroms, len(axes)):
        axes[i].axis('off')
    
    plt.suptitle('Variant Density Across Chromosomes (per Megabase)', fontsize=16)
    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.show()
    
    return fig

def plot_sample_quality_metrics(sample_stats, metric='missing_rate', 
                               title=None, top_n=None, threshold=None):
    """Plot quality metrics for samples.
    
    Args:
        sample_stats (dict): Dictionary with sample quality metrics
        metric (str): Which metric to plot ('missing_rate', 'heterozygosity')
        title (str, optional): Plot title
        top_n (int, optional): Show only the top N samples with highest metric values
        threshold (float, optional): Highlight samples above this threshold
    """
    # Extract sample names and metric values
    samples = []
    values = []
    
    for sample, stats in sample_stats.items():
        if metric in stats:
            samples.append(sample)
            values.append(stats[metric])
    
    # Sort by metric value
    sorted_indices = np.argsort(values)[::-1]  # Descending order
    
    if top_n is not None:
        sorted_indices = sorted_indices[:top_n]
    
    sorted_samples = [samples[i] for i in sorted_indices]
    sorted_values = [values[i] for i in sorted_indices]
    
    # Create the bar plot
    plt.figure(figsize=(12, 6))
    bars = plt.bar(sorted_samples, sorted_values)
    
    # Highlight bars above threshold
    if threshold is not None:
        for i, value in enumerate(sorted_values):
            if value > threshold:
                bars[i].set_color('red')
    
    # Set title
    if title is None:
        # Derive a readable label from the metric key (works for any metric)
        metric_name = metric.replace('_', ' ').title()
        title = f'Sample {metric_name}'
    plt.title(title)
    
    plt.xlabel('Sample')
    plt.ylabel(metric.replace('_', ' ').title())
    plt.xticks(rotation=90)
    
    # Add threshold line if specified
    if threshold is not None:
        plt.axhline(threshold, color='red', linestyle='--', 
                   label=f'Threshold: {threshold}')
        plt.legend()
    
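    # Capture the figure before plt.show(), which can close it in some
    # backends (plt.gcf() would then return a new, empty figure)
    fig = plt.gcf()
    plt.tight_layout()
    plt.show()
    
    return fig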

def plot_sample_missingness_vs_heterozygosity(sample_stats):
    """Plot missing rate vs heterozygosity for samples.
    
    Args:
        sample_stats (dict): Dictionary with sample quality metrics
    """
    # Extract values
    samples = []
    missing_rates = []
    heterozygosities = []
    
    for sample, stats in sample_stats.items():
        if 'missing_rate' in stats and 'heterozygosity' in stats:
            samples.append(sample)
            missing_rates.append(stats['missing_rate'])
            heterozygosities.append(stats['heterozygosity'])
    
    # Create scatter plot
    plt.figure(figsize=(10, 8))
    plt.scatter(missing_rates, heterozygosities, alpha=0.7)
    
    # Add labels
    plt.xlabel('Missing Rate')
    plt.ylabel('Heterozygosity')
    plt.title('Sample Quality: Missing Rate vs Heterozygosity')
    
    # Add grid
    plt.grid(True, alpha=0.3)
    
    # Annotate notable samples: highest missing rate, then highest and
    # lowest heterozygosity (skipped when no sample has both metrics)
    if samples:
        notable_indices = {
            int(np.argmax(missing_rates)),
            int(np.argmax(heterozygosities)),
            int(np.argmin(heterozygosities)),
        }
        for idx in sorted(notable_indices):
            plt.annotate(samples[idx],
                         (missing_rates[idx], heterozygosities[idx]),
                         xytext=(5, 5), textcoords='offset points')
    
    # Capture the figure before plt.show(), which can close it in some
    # backends (plt.gcf() would then return a new, empty figure)
    fig = plt.gcf()
    plt.tight_layout()
    plt.show()
    
    return fig

def display_dataset_summary(summary, include_plots=True):
    """Display a formatted summary of a genetic dataset.
    
    Args:
        summary (dict): Dataset summary from summarize_dataset()
        include_plots (bool): Whether to include visualizations
    """
    metadata = summary['metadata']
    variant_counts = summary['variant_counts']
    sample_stats = summary['sample_stats']
    high_summary = summary['summary']
    
    # Basic information
    display(Markdown(f"# Dataset Summary: {os.path.basename(high_summary['file_path'])}"))
    display(Markdown(f"**VCF Format:** {high_summary['format']}"))
    display(Markdown(f"**Samples:** {high_summary['samples']}"))
    display(Markdown(f"**Chromosomes:** {len(high_summary['chromosomes'])}"))
    display(Markdown(f"**Total Variants:** {high_summary['total_variants']:,}"))
    
    if high_summary['avg_missing_rate'] is not None:
        display(Markdown(f"**Average Missing Rate:** {high_summary['avg_missing_rate']:.4f}"))
    
    if high_summary['avg_heterozygosity'] is not None:
        display(Markdown(f"**Average Heterozygosity:** {high_summary['avg_heterozygosity']:.4f}"))
    
    # Include plots if requested
    if include_plots:
        display(Markdown("## Variant Distribution"))
        plot_variant_distribution(variant_counts)
        
        if len(variant_counts['density']) > 0:
            display(Markdown("## Variant Density"))
            # Plot density for a subset of chromosomes
            chroms = list(variant_counts['density'].keys())[:6]
            plot_variant_density(variant_counts, chromosomes=chroms)
        
        if sample_stats and any('missing_rate' in s for s in sample_stats.values()):
            display(Markdown("## Sample Quality"))
            plot_sample_quality_metrics(sample_stats, metric='missing_rate')
            
            if any('heterozygosity' in s for s in sample_stats.values()):
                plot_sample_quality_metrics(sample_stats, metric='heterozygosity')
                plot_sample_missingness_vs_heterozygosity(sample_stats)
    
    # Return a summary dictionary for further use
    return high_summary

## Example: Analyzing a Sample Dataset

Let's use our functions to analyze a sample VCF file from the class data. We'll examine the merged OpenSNPs dataset to understand its characteristics and quality metrics.

In [ ]:
# Path to the sample VCF file
if is_jupyterlite():
    # In JupyterLite, use a path relative to the files directory
    sample_vcf = os.path.join('class_data', 'merged_opensnps_data.vcf.gz')
    print(f"Will attempt to load: {sample_vcf}")
    print("Note: This may not work in JupyterLite due to size limitations.")
    print("The code is provided as a reference for local execution.")
else:
    # Use the actual data directory path
    sample_vcf = os.path.join(DATA_DIR, 'class_data', 'merged_opensnps_data.vcf.gz')
    
    # Check if the file exists
    if os.path.exists(sample_vcf):
        print(f"Found sample VCF file: {sample_vcf}")
    else:
        # Try alternative locations
        alt_paths = [
            os.path.join(DATA_DIR, 'merged_opensnps_data.vcf.gz'),
            os.path.join(os.path.dirname(DATA_DIR), 'data', 'class_data', 'merged_opensnps_data.vcf.gz')
        ]
        
        for path in alt_paths:
            if os.path.exists(path):
                sample_vcf = path
                print(f"Found sample VCF file at alternate location: {sample_vcf}")
                break
        else:
            print("Could not find the sample VCF file.")
            print("Please set the correct path to a VCF file in your environment.")
            sample_vcf = None

In [ ]:
# Analyze the dataset if available
if sample_vcf and not is_jupyterlite():
    # Use limited analysis to prevent excessive computation
    print("Analyzing the dataset (with a limited number of variants for efficiency)...")
    dataset_summary = summarize_dataset(
        sample_vcf, 
        max_metadata_lines=500,   # Only read 500 header lines
        max_variants=5000,        # Only analyze 5000 variants for sample quality
        max_count_variants=50000  # Only count 50000 variants for chromosome distribution
    )
    
    # Display summary
    display_dataset_summary(dataset_summary)
else:
    print("Dataset analysis example:")
    print("1. We would load the VCF file")
    print("2. Process its metadata to understand structure")
    print("3. Count variants across chromosomes")
    print("4. Analyze sample quality metrics")
    print("5. Visualize the results")
    
    # Show expected output format using dummy data
    dummy_summary = {
        'metadata': {
            'file_path': 'example.vcf.gz',
            'file_format': 'VCFv4.2',
            'samples': ['sample1', 'sample2', 'sample3'],
            'n_samples': 3,
            'contigs': [{'ID': '1'}, {'ID': '2'}],
            'n_contigs': 2
        },
        'variant_counts': {
            'total': 10000,
            'by_chrom': {'1': 5500, '2': 4500},
            'density': {'1': [100, 120, 95], '2': [80, 90, 85]}
        },
        'sample_stats': {
            'sample1': {
                'missing_rate': 0.02,
                'heterozygosity': 0.3
            },
            'sample2': {
                'missing_rate': 0.01,
                'heterozygosity': 0.25
            },
            'sample3': {
                'missing_rate': 0.05,
                'heterozygosity': 0.28
            }
        },
        'summary': {
            'file_path': 'example.vcf.gz',
            'format': 'VCFv4.2',
            'samples': 3,
            'contigs': 2,
            'total_variants': 10000,
            'chromosomes': ['1', '2'],
            'avg_missing_rate': 0.027,
            'avg_heterozygosity': 0.277
        }
    }
    
    print("\nExample output would look like this:")
    display_dataset_summary(dummy_summary, include_plots=True)

## Exercise: Dataset Comparison Tool

In real-world scenarios, you'll often need to compare different genetic datasets to understand their compatibility for merging or joint analysis. Let's develop a tool to compare two VCF files and assess their similarity and compatibility for use in Bonsai.

In [ ]:
# Exercise: Implement a function to compare two datasets

def compare_datasets(vcf_file1, vcf_file2, max_variants=10000):
    """Compare two VCF files and assess their compatibility.
    
    Args:
        vcf_file1 (str): Path to first VCF file
        vcf_file2 (str): Path to second VCF file
        max_variants (int): Maximum number of variants to analyze
        
    Returns:
        dict: Dictionary with comparison results
    """
    # TODO: Implement this function to:
    # 1. Analyze both datasets using summarize_dataset
    # 2. Compare sample overlap
    # 3. Compare variant overlap
    # 4. Compare chromosomes covered
    # 5. Calculate overall compatibility score
    
    # For this exercise, stub implementation:
    print(f"Comparing {os.path.basename(vcf_file1)} and {os.path.basename(vcf_file2)}")
    
    # Placeholder comparison results
    comparison = {
        'sample_overlap': {
            'file1_samples': 0,
            'file2_samples': 0,
            'common_samples': 0,
            'overlap_percent': 0.0
        },
        'chromosome_overlap': {
            'file1_chroms': [],
            'file2_chroms': [],
            'common_chroms': [],
            'overlap_percent': 0.0
        },
        'variant_overlap': {
            'estimated_percent': 0.0  # Difficult to compute exactly without reading both files
        },
        'format_compatibility': {
            'same_format': False,
            'compatible_genotype_format': False
        },
        'overall_compatibility': 0.0  # 0 (incompatible) to 1 (perfect compatibility)
    }
    
    # Return comparison results
    return comparison

# Hint: Your implementation should:
# 1. Use summarize_dataset() to get metadata and basic stats for both files
# 2. Compare the samples lists to find common samples: 
#    common = set(summary1['metadata']['samples']) & set(summary2['metadata']['samples'])
# 3. Calculate sample overlap percentage
# 4. Compare chromosome lists similarly
# 5. For variant overlap, you would need to check positions, which is more complex
#    A simple approach could just check if chromosomes and positions match between datasets
# 6. Check format compatibility (VCF versions, genotype formats)
# 7. Calculate a weighted overall compatibility score

## Part 2: Data Preparation Challenges

When working with real-world genetic datasets, various data preparation challenges must be addressed before applying IBD detection and relationship inference algorithms. This section explores common challenges and solutions.

### 2.1 Format Conversion and Standardization

Different genetic testing companies and research institutions use different file formats:

- **Direct-to-consumer formats**:
  - 23andMe (tab-delimited text: rsID, chromosome, position, genotype)
  - AncestryDNA (tab-delimited text: rsID, chromosome, position, and two allele columns)
  - FamilyTreeDNA (CSV: rsID, chromosome, position, genotype)
  
- **Research formats**:
  - VCF (Variant Call Format) - standard for genetic variation data
  - PLINK formats (.bed/.bim/.fam) - commonly used in statistical genetics
  - EIGENSTRAT format - used in population genetics studies
  
- **Conversion considerations**:
  - Reference genome alignment (hg19, hg38, etc.)
  - Strand orientation differences
  - SNP naming conventions (rsIDs, chromosome:position, etc.)
  - Genotype encoding differences
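
As a rough illustration of the conversion considerations above, the sketch below parses a 23andMe-style raw export (assumed here to be tab-separated `rsid`, `chromosome`, `position`, `genotype` columns with `#` comment lines) and encodes each genotype as an unphased VCF-style `GT` string. The `REF_ALT` lookup, the function names, and the sample text are hypothetical; a real conversion would take reference and alternate alleles from a resource matched to the genome build.

```python
# Hypothetical REF/ALT lookup keyed by rsID (illustration only). In practice
# this comes from a reference resource for the correct genome build.
REF_ALT = {
    "rs0001": ("A", "G"),
    "rs0002": ("C", "T"),
    "rs0003": ("G", "A"),
}

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def encode_gt(genotype, ref, alt):
    """Encode a two-letter genotype as an unphased VCF GT string.

    Returns './.' for no-calls, single-letter calls, or alleles that match
    neither REF nor ALT. If the alleles only match after complementing, the
    genotype is assumed to be on the opposite strand and is flipped.
    Palindromic SNPs (A/T, C/G) cannot be resolved this way.
    """
    if len(genotype) != 2 or "-" in genotype:
        return "./."
    alleles = list(genotype)
    if not set(alleles) <= {ref, alt}:
        flipped = [COMPLEMENT.get(a, "N") for a in alleles]
        if not set(flipped) <= {ref, alt}:
            return "./."
        alleles = flipped
    codes = sorted(0 if a == ref else 1 for a in alleles)
    return f"{codes[0]}/{codes[1]}"


def parse_raw_text(text):
    """Return (chrom, pos, rsid, ref, alt, gt) tuples for known rsIDs."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # Skip malformed lines
        rsid, chrom, pos, genotype = fields[:4]
        if rsid not in REF_ALT:
            continue  # No allele information for this marker
        ref, alt = REF_ALT[rsid]
        records.append((chrom, int(pos), rsid, ref, alt,
                        encode_gt(genotype, ref, alt)))
    return records


RAW_TEXT = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0001\t1\t1000\tAG\n"
    "rs0002\t1\t2000\tCC\n"
    "rs0003\t2\t500\t--\n"
)

for record in parse_raw_text(RAW_TEXT):
    print(*record, sep="\t")
```

Markers that fail the allele check are written as missing rather than guessed, which keeps strand or build mismatches visible in the missingness statistics.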

### 2.2 Dealing with Missing Data

Real datasets invariably have missing genotypes due to:

- **Technical limitations**: Low coverage or read quality in specific regions
- **Platform differences**: Different SNP arrays targeting different markers
- **Quality control filters**: Variants removed due to quality concerns
- **Sample-specific issues**: DNA quality affecting certain regions

Addressing missing data requires:

- **Imputation**: Statistical inference of missing genotypes based on reference panels
- **Filtering**: Removing variants/samples with high missingness rates
- **Weighting**: Accounting for missingness in downstream analyses
- **Simulation**: Testing how missingness affects results using controlled experiments
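
The filtering step can be sketched on a toy genotype table. This is a minimal illustration, assuming genotypes are stored as alternate-allele counts with `None` for missing calls; the function names, thresholds, and data are hypothetical. Samples are filtered first and variant missingness is then recomputed on the retained samples, which is one possible ordering rather than a fixed rule.

```python
def missing_rate(calls):
    """Fraction of calls that are missing (None). Assumes a non-empty list."""
    return sum(call is None for call in calls) / len(calls)


def filter_by_missingness(genotypes, sample_ids,
                          max_sample_missing=0.5, max_variant_missing=0.25):
    """Drop high-missingness samples, then high-missingness variants.

    genotypes: list of rows (one per variant), each a list of calls
               (0, 1, 2, or None) in the same order as sample_ids.
    Returns (filtered_rows, kept_sample_ids, per_sample_missing_rates).
    """
    # Per-sample missingness across all variants
    sample_rates = {
        sample: missing_rate([row[j] for row in genotypes])
        for j, sample in enumerate(sample_ids)
    }
    keep_cols = [j for j, sample in enumerate(sample_ids)
                 if sample_rates[sample] <= max_sample_missing]
    if not keep_cols:
        return [], [], sample_rates

    # Per-variant missingness, computed only over retained samples
    kept_rows = []
    for row in genotypes:
        kept_calls = [row[j] for j in keep_cols]
        if missing_rate(kept_calls) <= max_variant_missing:
            kept_rows.append(kept_calls)
    return kept_rows, [sample_ids[j] for j in keep_cols], sample_rates


SAMPLES = ["s1", "s2", "s3", "s4"]
GENOTYPES = [
    [0, 1, None, 2],
    [1, None, None, 0],
    [2, 2, None, 1],
    [0, 0, 0, 1],
    [1, 1, None, None],
]

filtered, kept_samples, rates = filter_by_missingness(GENOTYPES, SAMPLES)
print("Kept samples:", kept_samples)
print("Kept variants:", len(filtered), "of", len(GENOTYPES))
```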

### 2.3 Quality Control and Filtering

Essential QC metrics to consider:

- **Sample-level QC**:
  - Missingness rate (percent of genotypes missing per sample)
  - Heterozygosity rate (excess suggests sample contamination; a deficit suggests inbreeding)
  - Sex checks (genetic vs. reported sex)
  - Relatedness checks (unexpected duplicates or close relatives)
  
- **Variant-level QC**:
  - Minor allele frequency (MAF)
  - Hardy-Weinberg equilibrium (HWE)
  - Missingness rate per variant
  - Mendelian inconsistencies in family data
  
- **Batch effects**:
  - Sample processing date/batch
  - Genotyping platform differences
  - Laboratory-specific artifacts
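
Two of the variant-level metrics above, MAF and the HWE test, can be computed directly from genotype counts. The sketch below uses the chi-square approximation with one degree of freedom (whose tail probability equals `erfc(sqrt(x / 2))`); the function name and example counts are hypothetical, and tools such as PLINK use an exact test, which is preferable when counts are small.

```python
import math


def maf_and_hwe(n_hom_ref, n_het, n_hom_alt):
    """Return (minor allele frequency, HWE chi-square p-value).

    Uses the 1-degree-of-freedom chi-square approximation; this is a
    simplification compared with the exact test used by common QC tools.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("No genotype calls for this variant")

    p = (2 * n_hom_ref + n_het) / (2 * n)  # Reference allele frequency
    q = 1.0 - p
    maf = min(p, q)
    if maf == 0:
        # Monomorphic variant: the HWE test is undefined; report p = 1.0
        return maf, 1.0

    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Upper tail of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return maf, p_value


# A variant in equilibrium versus one with no heterozygotes at all
balanced_maf, balanced_p = maf_and_hwe(25, 50, 25)
skewed_maf, skewed_p = maf_and_hwe(50, 0, 50)
print(f"Balanced: MAF={balanced_maf:.2f}, HWE p={balanced_p:.3g}")
print(f"Skewed:   MAF={skewed_maf:.2f}, HWE p={skewed_p:.3g}")
```

A variant like the second one would fail an HWE threshold such as `1e-6`, which often indicates a genotyping artifact rather than biology.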

### 2.4 IBD Detection Variations

IBD detection algorithms have different sensitivities to data characteristics:

- **GERMLINE**: Sensitive to marker density and phasing accuracy
- **BEAGLE IBD**: Computationally intensive but handles unphased data well
- **Refined IBD**: Good for detecting shorter segments but requires phased data
- **IBIS**: Runs on unphased genotypes, balancing speed and accuracy across varying input quality
- **hap-IBD**: High accuracy for well-phased data

Adjustments needed for different data sources:

- **Parameter tuning**: Error rates, minimum segment lengths
- **Pre-processing**: Phasing quality, marker selection
- **Post-processing**: Filtering spurious segments, merging adjacent segments
- **Validation**: Cross-algorithm comparison for reliability
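
The post-processing step can be sketched as follows: segments for the same pair and chromosome are merged when the gap between them is small, and anything still shorter than a minimum length is dropped. The tuple layout, function name, and thresholds are assumptions for illustration, and haplotype information is ignored, so this may join segments that lie on different haplotypes.

```python
def merge_and_filter_segments(segments, max_gap_cm=1.0, min_length_cm=7.0):
    """Merge nearby IBD segments per pair and chromosome, then filter by length.

    segments: iterable of (id1, id2, chrom, start_cm, end_cm) tuples.
    Returns a list of merged tuples with sample IDs in sorted order.
    """
    # Group segment spans by (unordered pair, chromosome)
    spans_by_key = {}
    for id1, id2, chrom, start, end in segments:
        pair = tuple(sorted((id1, id2)))
        spans_by_key.setdefault((pair, chrom), []).append((start, end))

    merged = []
    for (pair, chrom), spans in sorted(spans_by_key.items()):
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start - cur_end <= max_gap_cm:
                # Close enough to the current segment: extend it
                cur_end = max(cur_end, end)
            else:
                merged.append((pair[0], pair[1], chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((pair[0], pair[1], chrom, cur_start, cur_end))

    # Drop segments that are still too short after merging
    return [seg for seg in merged if seg[4] - seg[3] >= min_length_cm]


example_segments = [
    ("A", "B", "1", 10.0, 14.0),
    ("B", "A", "1", 14.5, 19.0),   # Same pair, 0.5 cM gap: merged with the first
    ("A", "B", "1", 40.0, 44.0),   # Only 4 cM long: filtered out
    ("A", "C", "2", 5.0, 20.0),
]

cleaned_segments = merge_and_filter_segments(example_segments)
for segment in cleaned_segments:
    print(segment)
```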

## Implementation: Data Preparation Pipeline

Let's create a skeleton for a data preparation pipeline that processes raw genetic data for IBD detection and relationship inference with Bonsai.

In [ ]:
# Data preparation pipeline outline
import os
import subprocess
import pandas as pd
import numpy as np
from collections import defaultdict
import json

class DataPreparationPipeline:
    """Pipeline for preparing genetic data for IBD detection and Bonsai analysis."""
    
    def __init__(self, config=None):
        """Initialize the pipeline with configuration parameters.
        
        Args:
            config (dict, optional): Configuration dictionary with parameters
        """
        # Default configuration
        self.config = {
            'input_format': 'vcf',
            'output_format': 'vcf',
            'reference_genome': 'hg38',
            'min_maf': 0.01,
            'max_missing_rate': 0.05,
            'min_hwe_pvalue': 1e-6,
            'output_directory': 'processed_data',
            'temp_directory': 'temp',
            'ibd_algorithm': 'ibis',
            'ibd_min_segment_cm': 7.0,
            'ibd_max_gaps': 1,
            'ibd_max_error': 0.01,
            'phasing_algorithm': 'eagle',
            'phasing_reference': None,
            'threads': 4
        }
        
        # Update with user configuration if provided
        if config:
            self.config.update(config)
        
        # Create output and temp directories if they don't exist
        os.makedirs(self.config['output_directory'], exist_ok=True)
        os.makedirs(self.config['temp_directory'], exist_ok=True)
        
        # Initialize tracking for processed files
        self.processed_files = []
        
    def convert_format(self, input_file, input_format=None, output_format=None):
        """Convert between genetic data formats.
        
        Args:
            input_file (str): Path to input file
            input_format (str, optional): Input format (auto-detected if None)
            output_format (str, optional): Output format (uses config default if None)
            
        Returns:
            str: Path to converted file
        """
        print(f"Converting {input_file} from {input_format or 'auto-detected'} to {output_format or self.config['output_format']}")
        
        # This would be a mock implementation - in reality would use tools like
        # PLINK, bcftools, etc. to perform the actual conversion
        
        # Determine input format if not specified
        if input_format is None:
            if input_file.endswith('.vcf') or input_file.endswith('.vcf.gz'):
                input_format = 'vcf'
            elif input_file.endswith('.bed') or input_file.endswith('.bim') or input_file.endswith('.fam'):
                input_format = 'plink'
            elif input_file.endswith('.txt') or input_file.endswith('.csv'):
                input_format = 'text'
            else:
                input_format = 'unknown'
        
        # Use config output format if not specified
        if output_format is None:
            output_format = self.config['output_format']
        
        # Create output filename
        basename = os.path.splitext(os.path.basename(input_file))[0]
        if basename.endswith('.vcf'):  # Handle .vcf.gz case
            basename = os.path.splitext(basename)[0]
            
        output_file = os.path.join(
            self.config['output_directory'], 
            f"{basename}.{output_format}"
        )
        
        # Mock conversion process with description of what would happen
        print("  Conversion would use appropriate tools based on formats:")
        if input_format == 'vcf' and output_format == 'plink':
            print("  - Would use 'plink2 --vcf input.vcf --make-bed --out output'")
        elif input_format == 'plink' and output_format == 'vcf':
            print("  - Would use 'plink2 --bfile input --export vcf --out output'")
        elif input_format == 'text' and output_format == 'vcf':
            print("  - Would use custom parsing for the specific text format")
            print("  - Would map variant IDs to positions")
            print("  - Would convert genotypes to VCF format")
        
        # In a real implementation, we would run the conversion command here
        # but for this example, we just pretend we did
        
        self.processed_files.append({
            'stage': 'format_conversion',
            'input': input_file,
            'output': output_file,
            'parameters': {
                'input_format': input_format,
                'output_format': output_format
            }
        })
        
        print(f"  Converted file would be saved to: {output_file}")
        return output_file
    
    def perform_qc(self, input_file, variant_qc=True, sample_qc=True):
        """Perform quality control on genetic data.
        
        Args:
            input_file (str): Path to input file
            variant_qc (bool): Whether to perform variant-level QC
            sample_qc (bool): Whether to perform sample-level QC
            
        Returns:
            str: Path to QC'd file
        """
        print(f"Performing QC on {input_file}")
        
        # Create output filename
        basename = os.path.splitext(os.path.basename(input_file))[0]
        if basename.endswith('.vcf'):  # Handle .vcf.gz case
            basename = os.path.splitext(basename)[0]
            
        output_file = os.path.join(
            self.config['output_directory'], 
            f"{basename}.qc.vcf"
        )
        
        # Mock QC process
        filters_applied = []
        
        if variant_qc:
            print("  Variant QC:")
            print(f"  - Minor allele frequency filter: {self.config['min_maf']}")
            filters_applied.append(f"MAF > {self.config['min_maf']}")
            
            print(f"  - Missing rate filter: {self.config['max_missing_rate']}")
            filters_applied.append(f"Missing rate < {self.config['max_missing_rate']}")
            
            print(f"  - Hardy-Weinberg equilibrium filter: {self.config['min_hwe_pvalue']}")
            filters_applied.append(f"HWE p-value > {self.config['min_hwe_pvalue']}")
        
        if sample_qc:
            print("  Sample QC:")
            print("  - Missing rate per sample")
            print("  - Heterozygosity outlier detection")
            print("  - Sex check")
            print("  - Duplicate detection")
            filters_applied.append("Sample QC filters")
        
        self.processed_files.append({
            'stage': 'quality_control',
            'input': input_file,
            'output': output_file,
            'parameters': {
                'variant_qc': variant_qc,
                'sample_qc': sample_qc,
                'filters_applied': filters_applied
            }
        })
        
        print(f"  QC'd file would be saved to: {output_file}")
        return output_file
    
    def phase_genotypes(self, input_file):
        """Phase genotypes using the configured phasing algorithm.
        
        Args:
            input_file (str): Path to input file
            
        Returns:
            str: Path to phased file
        """
        print(f"Phasing genotypes in {input_file}")
        
        # Create output filename
        basename = os.path.splitext(os.path.basename(input_file))[0]
        if basename.endswith('.vcf'):  # Handle .vcf.gz case
            basename = os.path.splitext(basename)[0]
            
        output_file = os.path.join(
            self.config['output_directory'], 
            f"{basename}.phased.vcf"
        )
        
        # Mock phasing process
        print(f"  Using phasing algorithm: {self.config['phasing_algorithm']}")
        
        if self.config['phasing_reference']:
            print(f"  Using reference panel: {self.config['phasing_reference']}")
        
        print(f"  Processing with {self.config['threads']} threads")
        
        # In a real implementation, we would run the phasing command here
        if self.config['phasing_algorithm'] == 'shapeit':
            print("  - Would use SHAPEIT for phasing")
        elif self.config['phasing_algorithm'] == 'eagle':
            print("  - Would use Eagle for phasing")
        elif self.config['phasing_algorithm'] == 'beagle':
            print("  - Would use BEAGLE for phasing")
            
        self.processed_files.append({
            'stage': 'phasing',
            'input': input_file,
            'output': output_file,
            'parameters': {
                'algorithm': self.config['phasing_algorithm'],
                'reference': self.config['phasing_reference'],
                'threads': self.config['threads']
            }
        })
        
        print(f"  Phased file would be saved to: {output_file}")
        return output_file
    
    def detect_ibd(self, input_file):
        """Detect IBD segments using the configured algorithm.
        
        Args:
            input_file (str): Path to input file
            
        Returns:
            str: Path to IBD segments file
        """
        print(f"Detecting IBD segments in {input_file}")
        
        # Create output filename
        basename = os.path.splitext(os.path.basename(input_file))[0]
        if basename.endswith('.vcf'):  # Handle .vcf.gz case
            basename = os.path.splitext(basename)[0]
            
        output_file = os.path.join(
            self.config['output_directory'], 
            f"{basename}.ibd.seg"
        )
        
        # Mock IBD detection process
        print(f"  Using IBD algorithm: {self.config['ibd_algorithm']}")
        print(f"  Minimum segment length: {self.config['ibd_min_segment_cm']} cM")
        print(f"  Maximum gaps allowed: {self.config['ibd_max_gaps']}")
        print(f"  Maximum error rate: {self.config['ibd_max_error']}")
        
        # In a real implementation, we would run the IBD detection command here
        if self.config['ibd_algorithm'] == 'ibis':
            print("  - Would use IBIS for IBD detection")
        elif self.config['ibd_algorithm'] == 'hapibd':
            print("  - Would use hap-IBD for IBD detection")
        elif self.config['ibd_algorithm'] == 'refinedibd':
            print("  - Would use Refined-IBD for IBD detection")
        elif self.config['ibd_algorithm'] == 'germline':
            print("  - Would use GERMLINE for IBD detection")
            
        self.processed_files.append({
            'stage': 'ibd_detection',
            'input': input_file,
            'output': output_file,
            'parameters': {
                'algorithm': self.config['ibd_algorithm'],
                'min_segment_cm': self.config['ibd_min_segment_cm'],
                'max_gaps': self.config['ibd_max_gaps'],
                'max_error': self.config['ibd_max_error']
            }
        })
        
        print(f"  IBD segments would be saved to: {output_file}")
        return output_file
    
    def run_pipeline(self, input_file):
        """Run the complete data preparation pipeline.
        
        Args:
            input_file (str): Path to input file
            
        Returns:
            dict: Dictionary with paths to all output files
        """
        print(f"Running complete pipeline on {input_file}")
        
        results = {}
        
        # Track original input
        results['original_input'] = input_file
        
        # Step 1: Convert format if needed
        input_format = None  # Auto-detect
        if self.config['input_format'] != self.config['output_format']:
            converted_file = self.convert_format(input_file, input_format)
            results['converted_file'] = converted_file
            current_file = converted_file
        else:
            current_file = input_file
        
        # Step 2: Perform QC
        qc_file = self.perform_qc(current_file)
        results['qc_file'] = qc_file
        current_file = qc_file
        
        # Step 3: Phase genotypes
        phased_file = self.phase_genotypes(current_file)
        results['phased_file'] = phased_file
        current_file = phased_file
        
        # Step 4: Detect IBD
        ibd_file = self.detect_ibd(current_file)
        results['ibd_file'] = ibd_file
        
        # Save processing log
        log_file = os.path.join(
            self.config['output_directory'],
            f"{os.path.basename(input_file)}.processing_log.json"
        )
        
        with open(log_file, 'w') as f:
            json.dump({
                'config': self.config,
                'processed_files': self.processed_files,
                'results': results
            }, f, indent=2)
        
        results['log_file'] = log_file
        
        print(f"\nPipeline complete! Processing log saved to: {log_file}")
        return results

In [ ]:
# Example usage of the data preparation pipeline
if sample_vcf and not is_jupyterlite():
    print("Running data preparation pipeline on sample dataset (simulation only)...")
    
    # Configure pipeline with customized parameters
    pipeline_config = {
        'input_format': 'vcf',
        'output_format': 'vcf',
        'reference_genome': 'hg38',
        'min_maf': 0.05,
        'max_missing_rate': 0.1,
        'output_directory': 'example_processed',
        'ibd_algorithm': 'hapibd',
        'ibd_min_segment_cm': 8.0,
        'phasing_algorithm': 'eagle'
    }
    
    # Initialize pipeline
    pipeline = DataPreparationPipeline(config=pipeline_config)
    
    # Run the pipeline (simulation only)
    results = pipeline.run_pipeline(sample_vcf)
    
    print("\nGenerated output files (simulated):")
    for key, path in results.items():
        print(f"  {key}: {path}")
else:
    print("Pipeline usage example:")
    print("1. Configure the pipeline with appropriate parameters")
    print("2. Initialize the pipeline object")
    print("3. Run the pipeline on input data")
    print("4. Process the results for Bonsai analysis")
    
    # Show example code
    print("\nExample code:")
    print("""
    # Configure pipeline
    pipeline_config = {
        'input_format': 'vcf',
        'output_format': 'vcf',
        'min_maf': 0.05,
        'max_missing_rate': 0.1,
        'output_directory': 'processed_data',
        'ibd_algorithm': 'hapibd',
        'ibd_min_segment_cm': 8.0,
        'phasing_algorithm': 'eagle'
    }
    
    # Initialize and run pipeline
    pipeline = DataPreparationPipeline(config=pipeline_config)
    results = pipeline.run_pipeline('input_data.vcf')
    
    # Use the IBD segments for Bonsai analysis
    ibd_file = results['ibd_file']
    # ... proceed with Bonsai analysis ...
    """)

## Part 3: Population-Specific Considerations

Different human populations have distinct genetic characteristics that impact IBD detection and relationship inference. Understanding these differences is crucial for accurate genealogical reconstruction.

### 3.1 Variation in Background IBD by Population

The background level of IBD sharing varies substantially across populations:

- **Founder populations**: Groups descended from a small number of ancestors (e.g., Ashkenazi Jews, Finns) show elevated IBD sharing, even between individuals with no known recent relationship
- **Island populations**: Geographic isolation leads to higher background IBD (e.g., Sardinians, Icelanders)
- **Recently admixed populations**: May show complex IBD patterns reflecting multiple ancestral sources
- **Outbred populations**: Continental Europeans and East Asians typically show lower background IBD

These differences require:
- Population-specific thresholds for detecting relationships
- Adjustment of expected IBD sharing metrics by population
- Calibration of likelihood models based on population structure
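
One crude way to picture the adjustment of IBD sharing metrics is to estimate a background level from pairs believed to be unrelated and subtract it from every pair's total. The sketch below is only an illustration of that idea, not how Bonsai calibrates its likelihood models; the function name, the choice of the median, and all values are hypothetical.

```python
from statistics import median


def subtract_background(pair_totals_cm, background_pairs):
    """Subtract a median background level from each pair's total IBD.

    pair_totals_cm: dict mapping (id1, id2) to total shared cM.
    background_pairs: pairs assumed to be unrelated within the population.
    Returns (adjusted totals floored at 0, background level in cM).
    """
    background = median(pair_totals_cm[pair] for pair in background_pairs)
    adjusted = {
        pair: max(0.0, total - background)
        for pair, total in pair_totals_cm.items()
    }
    return adjusted, background


# Hypothetical totals for a population with noticeable background sharing
totals = {
    ("A", "B"): 12.0,
    ("A", "C"): 15.0,
    ("B", "C"): 9.0,
    ("A", "D"): 230.0,  # Candidate close relationship
}
presumed_unrelated = [("A", "B"), ("A", "C"), ("B", "C")]

adjusted_totals, background_cm = subtract_background(totals, presumed_unrelated)
print(f"Background level: {background_cm} cM")
print(f"Adjusted A-D sharing: {adjusted_totals[('A', 'D')]} cM")
```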

### 3.2 Endogamy Levels and Impact

Endogamy (marriage within a community) profoundly affects genetic genealogy:

- **High endogamy**: Found in historically isolated communities, religious groups, and small populations
- **Moderate endogamy**: Common in rural communities and traditional societies
- **Low endogamy**: Typical in modern urban populations with high mobility

Impact on relationship inference:
- Multiple paths of relationship between individuals
- Elevated total IBD sharing relative to what is expected in outbred populations
- Difficulty distinguishing between relationship types (e.g., 2nd cousins vs. 3rd cousins)
- Need for specialized algorithms that account for multiple relationship paths
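
A short calculation shows why multiple relationship paths blur relationship types. Under the usual simplifying assumptions (independent paths and non-inbred common ancestors), the expected shared fraction is the sum over paths of (1/2)^m, where m is the number of meioses on the path. The helper names below are hypothetical.

```python
from fractions import Fraction


def path_relatedness(meioses, n_common_ancestors=2):
    """Expected shared fraction for one relationship.

    Each common ancestor contributes (1/2)**meioses; full relationships
    go through an ancestral couple, so the default is two ancestors.
    """
    return n_common_ancestors * Fraction(1, 2) ** meioses


def cousin_relatedness(degree):
    """Full k-th cousins are separated by 2 * (k + 1) meioses."""
    return path_relatedness(2 * (degree + 1))


second_cousins = cousin_relatedness(2)
third_cousins = cousin_relatedness(3)

# A pair related as third cousins through four independent ancestral couples
quadruple_third_cousins = 4 * third_cousins

print("Second cousins:            ", second_cousins)
print("Third cousins:             ", third_cousins)
print("Third cousins via 4 paths: ", quadruple_third_cousins)
```

In this idealized model, four independent third-cousin paths give the same expected sharing as a single second-cousin relationship, so total IBD alone cannot separate the two cases.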

### 3.3 Different Demographic Histories

Population histories affect IBD patterns:

- **Bottlenecked populations**: Show longer shared segments due to recent common ancestry
- **Expanding populations**: Display fewer and shorter IBD segments
- **Populations with migration**: May show stratified IBD patterns by geographic region
- **Recently isolated populations**: Often have higher and more varied IBD sharing

Implications for Bonsai parameters:
- Adjustment of expected IBD distributions based on demographic history
- Population-specific age estimation models
- Different priors for relationship probabilities
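
The link between demographic history and segment length can be illustrated with the standard approximation that an IBD segment passed through m meioses has an exponentially distributed length with mean 100/m cM. This is a simplified model that ignores detection error and varying recombination rates, and the helper names and the 7 cM threshold are assumptions made for the example.

```python
import math

def mean_segment_cm(meioses):
    """Expected IBD segment length (cM) after the given number of meioses."""
    return 100.0 / meioses

def prob_longer_than(threshold_cm, meioses):
    """Probability that an exponentially distributed segment exceeds threshold_cm."""
    return math.exp(-meioses * threshold_cm / 100.0)

# Compare a recent connection with progressively older shared ancestry
for meioses in (4, 10, 20):
    mean_len = mean_segment_cm(meioses)
    p_detect = prob_longer_than(7.0, meioses)
    print(f"{meioses:>2} meioses: mean {mean_len:5.1f} cM, P(length > 7 cM) = {p_detect:.3f}")
```

Under this model, populations whose shared ancestors are more recent (for example after a bottleneck) are expected to carry longer segments, and a larger share of them clears a fixed minimum-length filter.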

### 3.4 Cultural Family Structure Variations

Cultural practices affect family structures and should inform analysis:

- **Average family size**: Varies substantially across cultures and time periods
- **Cousin marriage rates**: Common in some cultures, rare in others
- **Polygamy**: Multiple spouses create complex sibship patterns
- **Adoption practices**: Formal and informal adoption affects genetic vs. social relatedness
- **Age at reproduction**: Affects generation time estimates in relationship inference

## Implementation: Population-Specific Relationship Inference

Let's explore how Bonsai v3 can be configured to handle population-specific considerations in relationship inference. We'll implement functions to:

1. Estimate the endogamy level in a dataset
2. Adjust relationship likelihood models based on population information
3. Generate appropriate priors for different populations

In [ ]:
# Population-specific relationship inference functions

def estimate_endogamy_level(ibd_segments_df, population_label=None):
    """Estimate the endogamy level in a dataset based on IBD segment patterns.
    
    Args:
        ibd_segments_df (DataFrame): DataFrame containing IBD segments with columns:
                                    'id1', 'id2', 'chr', 'start', 'end', 'cm_length'
        population_label (str, optional): Population label for reference values
        
    Returns:
        dict: Dictionary with endogamy metrics
    """
    # This is a simplified model of endogamy estimation
    # In a real implementation, this would be more sophisticated
    
    # Create a mock IBD segments DataFrame if none provided
    if ibd_segments_df is None or len(ibd_segments_df) == 0:
        print("Using simulated IBD segments for demonstration")
        # Create a mock dataset for demonstration
        np.random.seed(42)
        n_samples = 100
        n_segments = 1000
        
        # Generate random sample IDs
        sample_ids = [f"sample_{i}" for i in range(n_samples)]
        
        # Generate random IBD segments
        id1 = np.random.choice(sample_ids, n_segments)
        id2 = np.random.choice(sample_ids, n_segments)
        
        # Ensure id1 != id2
        for i in range(n_segments):
            while id1[i] == id2[i]:
                id2[i] = np.random.choice(sample_ids)
        
        chr_num = np.random.randint(1, 23, n_segments)
        start_pos = np.random.randint(1, 200000000, n_segments)
        seg_lengths = np.random.exponential(10, n_segments)  # Length in cM with mean 10
        end_pos = (start_pos + seg_lengths * 1_000_000).astype(int)  # Rough approximation: ~1 cM per Mb
        
        ibd_segments_df = pd.DataFrame({
            'id1': id1,
            'id2': id2,
            'chr': chr_num,
            'start': start_pos,
            'end': end_pos,
            'cm_length': seg_lengths
        })
    
    # Treat pairs as unordered so that (A, B) and (B, A) count as the same pair
    pair_ids = np.sort(ibd_segments_df[['id1', 'id2']].to_numpy(), axis=1)
    pair_totals = ibd_segments_df['cm_length'].groupby([pair_ids[:, 0], pair_ids[:, 1]]).sum()
    
    # Count unique pairs of individuals
    n_unique_pairs = len(pair_totals)
    
    # Count unique individuals
    unique_individuals = set(ibd_segments_df['id1']).union(set(ibd_segments_df['id2']))
    n_unique_individuals = len(unique_individuals)
    
    # Calculate maximum possible pairs
    max_possible_pairs = (n_unique_individuals * (n_unique_individuals - 1)) // 2
    
    # Calculate pair density (fraction of all possible pairs that share IBD)
    pair_density = n_unique_pairs / max_possible_pairs if max_possible_pairs > 0 else 0
    
    # Calculate average number of segments per pair
    segments_per_pair = len(ibd_segments_df) / n_unique_pairs if n_unique_pairs > 0 else 0
    
    # Calculate average total cM shared per pair
    cm_per_pair = pair_totals.mean()
    
    # Calculate distribution of segment lengths
    segment_length_quantiles = ibd_segments_df['cm_length'].quantile([0.25, 0.5, 0.75]).to_dict()
    
    # Reference values for different endogamy levels
    # These are simplified and would be calibrated with real data in practice
    reference_values = {
        'high_endogamy': {
            'pair_density': 0.7,
            'segments_per_pair': 5.0,
            'avg_cm_per_pair': 100.0
        },
        'moderate_endogamy': {
            'pair_density': 0.4,
            'segments_per_pair': 3.0,
            'avg_cm_per_pair': 50.0
        },
        'low_endogamy': {
            'pair_density': 0.1,
            'segments_per_pair': 1.5,
            'avg_cm_per_pair': 20.0
        },
        # Population-specific reference values
        'ashkenazi': {
            'pair_density': 0.8,
            'segments_per_pair': 8.0,
            'avg_cm_per_pair': 150.0
        },
        'finnish': {
            'pair_density': 0.6,
            'segments_per_pair': 4.0,
            'avg_cm_per_pair': 80.0
        },
        'european': {
            'pair_density': 0.2,
            'segments_per_pair': 2.0,
            'avg_cm_per_pair': 30.0
        },
        'east_asian': {
            'pair_density': 0.15,
            'segments_per_pair': 1.8,
            'avg_cm_per_pair': 25.0
        }
    }
    
    # Calculate a simple endogamy score
    # This is a weighted average of normalized metrics
    norm_pair_density = pair_density / reference_values['high_endogamy']['pair_density']
    norm_segments_per_pair = segments_per_pair / reference_values['high_endogamy']['segments_per_pair']
    norm_cm_per_pair = cm_per_pair / reference_values['high_endogamy']['avg_cm_per_pair']
    
    # Weighted average (weights could be adjusted)
    endogamy_score = (norm_pair_density * 0.4 + 
                      norm_segments_per_pair * 0.3 + 
                      norm_cm_per_pair * 0.3)
    
    # Compare to reference population if provided
    if population_label and population_label in reference_values:
        ref = reference_values[population_label]
        relative_to_reference = {
            'pair_density_ratio': pair_density / ref['pair_density'],
            'segments_ratio': segments_per_pair / ref['segments_per_pair'],
            'cm_ratio': cm_per_pair / ref['avg_cm_per_pair']
        }
    else:
        relative_to_reference = None
    
    # Classify endogamy level
    if endogamy_score > 0.7:
        endogamy_level = "High"
    elif endogamy_score > 0.3:
        endogamy_level = "Moderate"
    else:
        endogamy_level = "Low"
    
    # Return comprehensive results
    return {
        'metrics': {
            'unique_pairs': n_unique_pairs,
            'unique_individuals': n_unique_individuals,
            'max_possible_pairs': max_possible_pairs,
            'pair_density': pair_density,
            'segments_per_pair': segments_per_pair,
            'avg_cm_per_pair': cm_per_pair,
            'segment_length_quantiles': segment_length_quantiles
        },
        'endogamy_score': endogamy_score,
        'endogamy_level': endogamy_level,
        'relative_to_reference': relative_to_reference
    }

def adjust_relationship_likelihoods(base_likelihoods, endogamy_level, population=None):
    """Adjust relationship likelihoods based on endogamy level and population.
    
    Args:
        base_likelihoods (dict): Base relationship likelihoods from Bonsai
        endogamy_level (float or str): Endogamy level (numeric score or categorical)
        population (str, optional): Population label for specific adjustments
        
    Returns:
        dict: Adjusted relationship likelihoods
    """
    # Convert categorical endogamy level to numeric if needed
    if isinstance(endogamy_level, str):
        if endogamy_level.lower() == "high":
            endogamy_factor = 0.8
        elif endogamy_level.lower() == "moderate":
            endogamy_factor = 0.5
        elif endogamy_level.lower() == "low":
            endogamy_factor = 0.2
        else:
            endogamy_factor = 0.3  # Default
    else:
        # Use the numeric score directly
        endogamy_factor = min(1.0, max(0.0, endogamy_level))
    
    # Base likelihood adjustments for endogamy
    adjustments = {
        'parent-child': 1.0,  # No adjustment for parent-child
        'full-sibling': 1.0,  # No adjustment for full siblings
        'half-sibling': 0.95,  # Slight adjustment
        'aunt-nephew': 0.9,
        'grandparent-grandchild': 0.9,
        'first-cousin': 0.8,
        'first-cousin-once-removed': 0.7,
        'second-cousin': 0.6,
        'second-cousin-once-removed': 0.5,
        'third-cousin': 0.4,
        'third-cousin-once-removed': 0.3,
        'fourth-cousin': 0.2,
        'unrelated': 0.1
    }
    
    # Population-specific adjustments (multiplicative factors)
    population_factors = {
        'ashkenazi': {
            'first-cousin': 0.7,
            'second-cousin': 0.5,
            'third-cousin': 0.3,
            'fourth-cousin': 0.1
        },
        'finnish': {
            'first-cousin': 0.8,
            'second-cousin': 0.6,
            'third-cousin': 0.4,
            'fourth-cousin': 0.2
        },
        'european': {
            'first-cousin': 0.9,
            'second-cousin': 0.8,
            'third-cousin': 0.7,
            'fourth-cousin': 0.6
        }
    }
    
    # Apply adjustments
    adjusted_likelihoods = {}
    for rel, likelihood in base_likelihoods.items():
        # Base adjustment factor
        if rel in adjustments:
            adj_factor = 1.0 - (endogamy_factor * (1.0 - adjustments[rel]))
        else:
            adj_factor = 1.0  # No adjustment for unknown relationships
        
        # Apply population-specific factor if available
        if population in population_factors and rel in population_factors[population]:
            pop_factor = population_factors[population][rel]
            # Blend with endogamy-based adjustment
            adj_factor = adj_factor * 0.7 + pop_factor * 0.3
        
        # Apply adjustment to likelihood
        adjusted_likelihoods[rel] = likelihood * adj_factor
    
    # Normalize likelihoods to sum to 1
    total = sum(adjusted_likelihoods.values())
    if total > 0:
        for rel in adjusted_likelihoods:
            adjusted_likelihoods[rel] /= total
    
    return adjusted_likelihoods

def generate_population_priors(population, endogamy_level=None):
    """Generate appropriate relationship priors for different populations.
    
    Args:
        population (str): Population label
        endogamy_level (float or str, optional): Endogamy level
        
    Returns:
        dict: Prior probabilities for different relationships
    """
    # Base priors for an outbred population
    base_priors = {
        'parent-child': 0.05,
        'full-sibling': 0.05,
        'half-sibling': 0.04,
        'aunt-nephew': 0.04,
        'grandparent-grandchild': 0.04,
        'first-cousin': 0.1,
        'first-cousin-once-removed': 0.1,
        'second-cousin': 0.1,
        'second-cousin-once-removed': 0.1,
        'third-cousin': 0.1,
        'third-cousin-once-removed': 0.08,
        'fourth-cousin': 0.08,
        'unrelated': 0.12
    }
    
    # Population-specific prior adjustments
    population_adjustments = {
        'ashkenazi': {
            'first-cousin': 1.5,      # Pairs are more often first cousins in a founder population
            'second-cousin': 1.5,      # Likewise for second cousins
            'third-cousin': 1.5,       # Likewise for third cousins
            'unrelated': 0.5           # Lower probability of truly unrelated individuals
        },
        'finnish': {
            'first-cousin': 1.3,
            'second-cousin': 1.3,
            'third-cousin': 1.3,
            'unrelated': 0.6
        },
        'european': {
            'first-cousin': 0.8,      # Lower prior probability that a pair are first cousins
            'unrelated': 1.2          # Higher probability of unrelated individuals
        },
        'east_asian': {
            'first-cousin': 0.9,
            'unrelated': 1.1
        }
    }
    
    # Apply population adjustments if available
    adjusted_priors = base_priors.copy()
    if population in population_adjustments:
        for rel, factor in population_adjustments[population].items():
            if rel in adjusted_priors:
                adjusted_priors[rel] *= factor
    
    # Further adjust based on endogamy level if provided
    if endogamy_level:
        # Convert categorical to numeric if needed
        if isinstance(endogamy_level, str):
            if endogamy_level.lower() == "high":
                endogamy_factor = 0.8
            elif endogamy_level.lower() == "moderate":
                endogamy_factor = 0.5
            elif endogamy_level.lower() == "low":
                endogamy_factor = 0.2
            else:
                endogamy_factor = 0.0
        else:
            endogamy_factor = min(1.0, max(0.0, endogamy_level))
        
        # Apply endogamy adjustments
        if endogamy_factor > 0:
            # Increase priors for distant relationships based on endogamy
            adjusted_priors['second-cousin'] *= (1 + endogamy_factor * 0.5)
            adjusted_priors['third-cousin'] *= (1 + endogamy_factor * 1.0)
            adjusted_priors['fourth-cousin'] *= (1 + endogamy_factor * 1.5)
            
            # Decrease prior for unrelated
            adjusted_priors['unrelated'] *= (1 - endogamy_factor * 0.5)
    
    # Normalize priors to sum to 1
    total = sum(adjusted_priors.values())
    for rel in adjusted_priors:
        adjusted_priors[rel] /= total
    
    return adjusted_priors

In [ ]:
# Demonstration of population-specific relationship inference

# Simulate an IBD segments dataframe for testing
# In practice, you would load this from a real IBD detection output
ibd_df = None  # Will be created inside the function

# Estimate endogamy level from IBD patterns
print("Estimating endogamy level from IBD patterns...")
endogamy_results = estimate_endogamy_level(ibd_df, population_label='european')

# Display endogamy metrics
print(f"\nEndogamy level: {endogamy_results['endogamy_level']}")
print(f"Endogamy score: {endogamy_results['endogamy_score']:.3f}")
print("\nKey metrics:")
for metric, value in endogamy_results['metrics'].items():
    if metric != 'segment_length_quantiles':
        print(f"  {metric}: {value:.4f}" if isinstance(value, float) else f"  {metric}: {value}")

# If we have population-specific reference
if endogamy_results['relative_to_reference']:
    print("\nRelative to reference population:")
    for metric, ratio in endogamy_results['relative_to_reference'].items():
        print(f"  {metric}: {ratio:.4f}")

# Example base likelihoods from a Bonsai analysis
# In practice, these would come from actual Bonsai output
base_likelihoods = {
    'parent-child': 0.01,
    'full-sibling': 0.05,
    'half-sibling': 0.15,
    'first-cousin': 0.30,
    'second-cousin': 0.25,
    'third-cousin': 0.15,
    'fourth-cousin': 0.05,
    'unrelated': 0.04
}

# Adjust likelihoods for population and endogamy
print("\nAdjusting relationship likelihoods for population and endogamy...")
adjusted_likelihoods = adjust_relationship_likelihoods(
    base_likelihoods, 
    endogamy_results['endogamy_level'],
    population='european'
)

# Compare base and adjusted likelihoods
print("\nRelationship likelihoods comparison:")
print(f"{'Relationship':<25} {'Base':<10} {'Adjusted':<10} {'Change':<10}")
print("-" * 55)
for rel in base_likelihoods:
    base = base_likelihoods[rel]
    adj = adjusted_likelihoods[rel]
    change = (adj - base) / base * 100
    print(f"{rel:<25} {base:<10.4f} {adj:<10.4f} {change:+.1f}%")

# Generate population-specific priors
print("\nGenerating population-specific relationship priors...")
populations = ['european', 'ashkenazi', 'finnish', 'east_asian']

# Base priors for reference - same as in the generate_population_priors function
base_priors = {
    'parent-child': 0.05,
    'full-sibling': 0.05,
    'half-sibling': 0.04,
    'aunt-nephew': 0.04,
    'grandparent-grandchild': 0.04,
    'first-cousin': 0.1,
    'first-cousin-once-removed': 0.1,
    'second-cousin': 0.1,
    'second-cousin-once-removed': 0.1,
    'third-cousin': 0.1,
    'third-cousin-once-removed': 0.08,
    'fourth-cousin': 0.08,
    'unrelated': 0.12
}

print("\nRelationship priors by population:")

# Create a DataFrame to compare priors across populations
rels = list(base_priors.keys())
pop_priors_df = pd.DataFrame(index=rels)

for pop in populations:
    priors = generate_population_priors(pop, endogamy_level=endogamy_results['endogamy_level'])
    pop_priors_df[pop] = [priors[rel] for rel in rels]

# Display the comparison
display(pop_priors_df.style.format("{:.4f}").background_gradient(cmap='viridis'))

<cell_type>markdown</cell_type>## Case Study: Population-Specific Analysis

Let's examine a case study showing how population-specific considerations can affect relationship inference. The accuracy figures below are simulated from illustrative modifier tables rather than measured on real data; the goal is to show how Bonsai v3 settings might be varied between populations.

In [ ]:
# Case study: Comparison of population-specific settings in Bonsai
# We'll simulate results for two populations: Ashkenazi Jewish and European

# Setup simulated data
class SimulatedBonsaiConfig:
    """Simple class to represent Bonsai configuration for a population"""
    
    def __init__(self, population, endogamy_level):
        self.population = population
        self.endogamy_level = endogamy_level
        
        # Set appropriate parameters based on population
        if population == 'ashkenazi':
            self.min_segment_cm = 8.0
            self.ibd_detection_algorithm = 'hapibd'
            self.ibd_error_tolerance = 0.005
            self.pedigree_depth_limit = 6
            self.relationship_prior_strength = 2.0
            
        elif population == 'european':
            self.min_segment_cm = 7.0
            self.ibd_detection_algorithm = 'ibis'
            self.ibd_error_tolerance = 0.01
            self.pedigree_depth_limit = 4
            self.relationship_prior_strength = 1.0
            
        else:
            # Default settings
            self.min_segment_cm = 7.0
            self.ibd_detection_algorithm = 'ibis'
            self.ibd_error_tolerance = 0.01
            self.pedigree_depth_limit = 5
            self.relationship_prior_strength = 1.0
    
    def display(self):
        """Display the configuration settings"""
        print(f"Bonsai Configuration for {self.population.capitalize()} Population")
        print(f"Endogamy Level: {self.endogamy_level}")
        print(f"Minimum IBD Segment (cM): {self.min_segment_cm}")
        print(f"IBD Detection Algorithm: {self.ibd_detection_algorithm}")
        print(f"IBD Error Tolerance: {self.ibd_error_tolerance}")
        print(f"Pedigree Depth Limit: {self.pedigree_depth_limit}")
        print(f"Relationship Prior Strength: {self.relationship_prior_strength}")

# Create configurations for different populations
ashkenazi_config = SimulatedBonsaiConfig('ashkenazi', 'high')
european_config = SimulatedBonsaiConfig('european', 'low')

# Simulate expected relationship inference outcomes
def simulate_relationship_inference(population, relationship, endogamy_level):
    """Simulate relationship inference outcomes for different populations"""
    
    # Base accuracy rates (reference values)
    # These rates are simplified for the example; real values would be derived from validation
    base_accuracy = {
        'parent-child': 0.99,
        'full-sibling': 0.98,
        'half-sibling': 0.92,
        'aunt-nephew': 0.90,
        'grandparent-grandchild': 0.95,
        'first-cousin': 0.85,
        'first-cousin-once-removed': 0.75,
        'second-cousin': 0.70,
        'second-cousin-once-removed': 0.60,
        'third-cousin': 0.50,
        'third-cousin-once-removed': 0.40,
        'fourth-cousin': 0.30
    }
    
    # Population-specific modifiers
    population_modifiers = {
        'ashkenazi': {
            'parent-child': 1.0,
            'full-sibling': 1.0,
            'half-sibling': 0.95,
            'aunt-nephew': 0.95,
            'grandparent-grandchild': 0.95,
            'first-cousin': 0.9,
            'first-cousin-once-removed': 0.85,
            'second-cousin': 0.7,
            'second-cousin-once-removed': 0.6,
            'third-cousin': 0.5,
            'third-cousin-once-removed': 0.4,
            'fourth-cousin': 0.3
        },
        'european': {
            'parent-child': 1.0,
            'full-sibling': 1.0,
            'half-sibling': 1.0,
            'aunt-nephew': 1.0,
            'grandparent-grandchild': 1.0,
            'first-cousin': 1.0,
            'first-cousin-once-removed': 1.0,
            'second-cousin': 1.0,
            'second-cousin-once-removed': 0.95,
            'third-cousin': 0.9,
            'third-cousin-once-removed': 0.85,
            'fourth-cousin': 0.8
        }
    }
    
    # Endogamy level modifiers
    endogamy_modifiers = {
        'high': {
            'parent-child': 1.0,
            'full-sibling': 1.0,
            'half-sibling': 0.95,
            'aunt-nephew': 0.9,
            'grandparent-grandchild': 0.9,
            'first-cousin': 0.85,
            'first-cousin-once-removed': 0.8,
            'second-cousin': 0.7,
            'second-cousin-once-removed': 0.6,
            'third-cousin': 0.5,
            'third-cousin-once-removed': 0.4,
            'fourth-cousin': 0.3
        },
        'moderate': {
            'parent-child': 1.0,
            'full-sibling': 1.0,
            'half-sibling': 0.98,
            'aunt-nephew': 0.95,
            'grandparent-grandchild': 0.95,
            'first-cousin': 0.9,
            'first-cousin-once-removed': 0.85,
            'second-cousin': 0.8,
            'second-cousin-once-removed': 0.7,
            'third-cousin': 0.6,
            'third-cousin-once-removed': 0.5,
            'fourth-cousin': 0.4
        },
        'low': {
            'parent-child': 1.0,
            'full-sibling': 1.0,
            'half-sibling': 1.0,
            'aunt-nephew': 1.0,
            'grandparent-grandchild': 1.0,
            'first-cousin': 1.0,
            'first-cousin-once-removed': 0.95,
            'second-cousin': 0.9,
            'second-cousin-once-removed': 0.85,
            'third-cousin': 0.8,
            'third-cousin-once-removed': 0.7,
            'fourth-cousin': 0.6
        }
    }
    
    # Apply modifiers to base accuracy
    if relationship in base_accuracy:
        # Get base accuracy
        accuracy = base_accuracy[relationship]
        
        # Apply population modifier if available
        if population in population_modifiers and relationship in population_modifiers[population]:
            accuracy *= population_modifiers[population][relationship]
        
        # Apply endogamy modifier if available
        if endogamy_level in endogamy_modifiers and relationship in endogamy_modifiers[endogamy_level]:
            accuracy *= endogamy_modifiers[endogamy_level][relationship]
        
        # Ensure accuracy is between 0 and 1
        accuracy = min(1.0, max(0.0, accuracy))
        
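        # Generate random outcomes based on accuracy.
        # Seed from a CRC32 checksum: the built-in hash() of a str is salted per
        # Python process, so it would give different results in each session.
        import zlib
        seed_key = f"{population}|{relationship}|{endogamy_level}".encode()
        np.random.seed(zlib.crc32(seed_key) % 10000)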
        n_trials = 100
        correct_inferences = np.random.binomial(1, accuracy, n_trials).sum()
        
        # Calculate percent correct
        percent_correct = correct_inferences / n_trials * 100
        
        return {
            'accuracy': accuracy,
            'n_trials': n_trials,
            'correct_inferences': correct_inferences,
            'percent_correct': percent_correct
        }
    else:
        return {
            'accuracy': 0.0,
            'n_trials': 0,
            'correct_inferences': 0,
            'percent_correct': 0.0
        }

# List of relationships to test
relationships = [
    'parent-child',
    'full-sibling',
    'half-sibling',
    'aunt-nephew',
    'grandparent-grandchild',
    'first-cousin',
    'first-cousin-once-removed',
    'second-cousin',
    'second-cousin-once-removed',
    'third-cousin',
    'fourth-cousin'
]

# Simulate inference accuracy for each population and relationship
print("Simulating relationship inference accuracy for different populations...")
inference_results = {'ashkenazi': {}, 'european': {}}

for pop in ['ashkenazi', 'european']:
    endogamy = 'high' if pop == 'ashkenazi' else 'low'
    for rel in relationships:
        inference_results[pop][rel] = simulate_relationship_inference(pop, rel, endogamy)

# Display the configurations
print("\nPopulation-specific Bonsai configurations:")
print("-" * 50)
ashkenazi_config.display()
print("-" * 50)
european_config.display()

# Create a DataFrame for comparison
results_df = pd.DataFrame(index=relationships, columns=['Ashkenazi Accuracy (%)', 'European Accuracy (%)'], dtype=float)

for rel in relationships:
    results_df.loc[rel, 'Ashkenazi Accuracy (%)'] = inference_results['ashkenazi'][rel]['percent_correct']
    results_df.loc[rel, 'European Accuracy (%)'] = inference_results['european'][rel]['percent_correct']

# Display the results
print("\nRelationship Inference Accuracy Comparison:")
display(results_df.style.format("{:.1f}").background_gradient(cmap='RdYlGn'))

# Create a horizontal bar chart comparing the two populations
plt.figure(figsize=(12, 8))
bar_width = 0.35
index = np.arange(len(relationships))

ashkenazi_bars = plt.barh(index, results_df['Ashkenazi Accuracy (%)'], bar_width, label='Ashkenazi', color='#3498db')
european_bars = plt.barh(index + bar_width, results_df['European Accuracy (%)'], bar_width, label='European', color='#e74c3c')

plt.ylabel('Relationship Type')
plt.xlabel('Inference Accuracy (%)')
plt.title('Relationship Inference Accuracy by Population')
plt.yticks(index + bar_width/2, relationships)
plt.xlim(0, 105)
plt.legend()

# Add value labels on the bars
for i, bar in enumerate(ashkenazi_bars):
    width = bar.get_width()
    plt.text(width + 1, bar.get_y() + bar.get_height()/2, f'{width:.1f}%', 
             ha='left', va='center', fontsize=9)
    
for i, bar in enumerate(european_bars):
    width = bar.get_width()
    plt.text(width + 1, bar.get_y() + bar.get_height()/2, f'{width:.1f}%', 
             ha='left', va='center', fontsize=9)

plt.tight_layout()
plt.show()

# Key takeaways
print("\nKey Takeaways from Population Comparison:")
print("1. Close relationships (parent-child, siblings) show high accuracy across populations")
print("2. Ashkenazi population shows reduced accuracy for distant relationships due to endogamy")
print("3. European population maintains higher accuracy for distant relationships")
print("4. Population-specific configuration parameters are essential for optimal results")
print("5. Relationship priors should be calibrated based on endogamy level and population")
print("6. IBD detection parameters need adjustment based on expected background IBD")

<cell_type>markdown</cell_type>## Part 4: Validation Strategies

When working with real-world genetic datasets, validating relationship inference results is crucial. This section explores strategies for validating Bonsai's output.

### 4.1 Ground Truth Comparison

When known pedigrees are available:

- **Complete pedigrees**: Compare inferred relationships to known genealogical records
- **Partial pedigrees**: Validate relationships for subsets of individuals with known connections
- **Known relationship pairs**: Test specific relationship types (e.g., known siblings, cousins)
- **Metrics**: Precision (percentage of inferred relationships that are correct), recall (percentage of true relationships that were detected), F1 score (harmonic mean of precision and recall)
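
As a concrete illustration of these metrics, the sketch below computes precision, recall, and F1 for a single relationship class from paired true and predicted labels. The function name and the five example pairs are made up for the illustration.

```python
def precision_recall_f1(true_labels, predicted_labels, target):
    """Compute precision, recall, and F1 for one relationship class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)  # correctly inferred
    fp = sum(1 for t, p in pairs if t != target and p == target)  # inferred but wrong
    fn = sum(1 for t, p in pairs if t == target and p != target)  # true but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up labels for five pairs
truth = ['first-cousin', 'first-cousin', 'second-cousin', 'first-cousin', 'second-cousin']
predicted = ['first-cousin', 'second-cousin', 'second-cousin', 'first-cousin', 'first-cousin']

precision, recall, f1 = precision_recall_f1(truth, predicted, 'first-cousin')
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the three pairs inferred as first cousins are correct, and two of the three true first-cousin pairs were found, so precision, recall, and F1 all equal 2/3.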

### 4.2 Cross-Validation Techniques

When ground truth is limited:

- **K-fold validation**: Divide data into K subsets, use K-1 subsets for training and 1 for testing
- **Leave-one-out**: Exclude a single individual/pair from inference, then predict their relationships
- **Bootstrap validation**: Generate multiple datasets by sampling with replacement
- **Time-split validation**: Use older data to predict newer data (useful for longitudinal datasets)
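
Bootstrap validation can be sketched with the standard library alone. The example assumes you already have one correct/incorrect flag per validated pair; the function name, the 1,000 resamples, and the percentile-interval method are illustrative choices rather than anything prescribed by Bonsai.

```python
import random

def bootstrap_accuracy_interval(correct_flags, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy from 0/1 correctness flags."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct_flags)
    accuracies = []
    for _ in range(n_resamples):
        # Resample pairs with replacement and recompute accuracy
        resample = rng.choices(correct_flags, k=n)
        accuracies.append(sum(resample) / n)
    accuracies.sort()
    lower = accuracies[int((alpha / 2) * n_resamples)]
    upper = accuracies[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up outcome: 42 of 50 validated pairs were inferred correctly
flags = [1] * 42 + [0] * 8
point_estimate = sum(flags) / len(flags)
low, high = bootstrap_accuracy_interval(flags)
print(f"accuracy = {point_estimate:.2f}, 95% interval ~ [{low:.2f}, {high:.2f}]")
```

The width of the interval gives a rough sense of how much the accuracy estimate depends on which pairs happened to be in the validation set.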

### 4.3 Hold-Out Testing

For unbiased evaluation:

- **Hold-out subset**: Set aside a portion of data not used during development
- **Blind validation**: Relationships unknown to the algorithm developer
- **External validation**: Test on entirely different datasets
- **Challenge datasets**: Use publicly available genealogical puzzles or competitions

### 4.4 Consistency Checks

Internal validation approaches:

- **Transitivity checks**: If A is related to B and B is related to C, check that the inferred A-C relationship is compatible with the other two
- **Genetic trait consistency**: Known genetic traits should segregate consistently through inferred pedigrees
- **Age consistency**: Ensure parent-child relationships align with birth dates (if available)
- **Geographic consistency**: Ensure relationships align with historical migration patterns
- **Multiple algorithm consistency**: Compare results from different relationship inference algorithms
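
An age consistency check is straightforward to sketch. The example assumes inferred parent-child pairs are given as (parent, child) tuples and that birth years are available in a dictionary; the function name and the minimum gap of 12 years are hypothetical choices for illustration.

```python
def find_age_inconsistencies(parent_child_pairs, birth_years, min_parent_age=12):
    """Return (parent, child, gap) for pairs whose birth-year gap is implausible."""
    problems = []
    for parent, child in parent_child_pairs:
        # Skip pairs without birth years rather than guessing
        if parent not in birth_years or child not in birth_years:
            continue
        gap = birth_years[child] - birth_years[parent]
        if gap < min_parent_age:
            problems.append((parent, child, gap))
    return problems

# Made-up inferred pairs and birth years
inferred_pairs = [("P1", "C1"), ("P2", "C2"), ("P3", "C3")]
birth_years = {"P1": 1950, "C1": 1975, "P2": 1980, "C2": 1985, "P3": 1960}

flagged = find_age_inconsistencies(inferred_pairs, birth_years)
print(flagged)
```

Flagged pairs are candidates for a reversed parent-child direction or a misclassified relationship such as full siblings, and are worth reviewing before the pedigree is accepted.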

## Implementation: Validation Framework

Let's implement a simple validation framework for Bonsai relationship inference results.

In [ ]:
# Validation framework for Bonsai relationship inference
import json  # used by the JSON branches of the file loaders below
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, classification_report

class BonsaiValidationFramework:
    """Framework for validating Bonsai relationship inference results."""
    
    def __init__(self, relationship_types=None):
        """Initialize the validation framework.
        
        Args:
            relationship_types (list, optional): List of relationship types to consider.
                                               If None, will be determined from the data.
        """
        self.relationship_types = relationship_types
        self.ground_truth = {}  # {(id1, id2): relationship}
        self.predictions = {}   # {(id1, id2): relationship}
        self.results = {}       # Validation results
    
    def add_ground_truth(self, id1, id2, relationship):
        """Add a ground truth relationship.
        
        Args:
            id1 (str): ID of the first individual
            id2 (str): ID of the second individual
            relationship (str): The true relationship between id1 and id2
        """
        # Ensure consistent ordering of pairs
        pair = tuple(sorted([id1, id2]))
        self.ground_truth[pair] = relationship
    
    def add_ground_truth_from_file(self, file_path, format_type='csv'):
        """Load ground truth relationships from a file.
        
        Args:
            file_path (str): Path to the file containing relationships
            format_type (str): File format ('csv', 'tsv', 'json')
        """
        if format_type.lower() == 'csv':
            df = pd.read_csv(file_path)
            for _, row in df.iterrows():
                self.add_ground_truth(row['id1'], row['id2'], row['relationship'])
        elif format_type.lower() == 'tsv':
            df = pd.read_csv(file_path, sep='\t')
            for _, row in df.iterrows():
                self.add_ground_truth(row['id1'], row['id2'], row['relationship'])
        elif format_type.lower() == 'json':
            with open(file_path, 'r') as f:
                data = json.load(f)
            for item in data:
                self.add_ground_truth(item['id1'], item['id2'], item['relationship'])
        else:
            raise ValueError(f"Unsupported format: {format_type}")
    
    def add_prediction(self, id1, id2, relationship):
        """Add a predicted relationship.
        
        Args:
            id1 (str): ID of the first individual
            id2 (str): ID of the second individual
            relationship (str): The predicted relationship between id1 and id2
        """
        # Ensure consistent ordering of pairs
        pair = tuple(sorted([id1, id2]))
        self.predictions[pair] = relationship
    
    def add_predictions_from_file(self, file_path, format_type='csv'):
        """Load predicted relationships from a file.
        
        Args:
            file_path (str): Path to the file containing relationships
            format_type (str): File format ('csv', 'tsv', 'json')
        """
        if format_type.lower() == 'csv':
            df = pd.read_csv(file_path)
            for _, row in df.iterrows():
                self.add_prediction(row['id1'], row['id2'], row['relationship'])
        elif format_type.lower() == 'tsv':
            df = pd.read_csv(file_path, sep='\t')
            for _, row in df.iterrows():
                self.add_prediction(row['id1'], row['id2'], row['relationship'])
        elif format_type.lower() == 'json':
            import json  # local import in case json is not imported earlier in the notebook
            with open(file_path, 'r') as f:
                data = json.load(f)
            for item in data:
                self.add_prediction(item['id1'], item['id2'], item['relationship'])
        else:
            raise ValueError(f"Unsupported format: {format_type}")
    
    def compute_metrics(self):
        """Compute validation metrics."""
        # Find common pairs
        common_pairs = set(self.ground_truth.keys()) & set(self.predictions.keys())
        
        if not common_pairs:
            print("No common pairs found between ground truth and predictions.")
            return None
        
        # Extract relationships for common pairs
        y_true = [self.ground_truth[pair] for pair in common_pairs]
        y_pred = [self.predictions[pair] for pair in common_pairs]
        
        # Determine relationship types if not provided
        if self.relationship_types is None:
            self.relationship_types = sorted(list(set(y_true + y_pred)))
        
        # Compute precision, recall, F1 score
        precision, recall, f1, support = precision_recall_fscore_support(
            y_true, y_pred, labels=self.relationship_types, average=None
        )
        
        # Compute macro averages
        macro_precision, macro_recall, macro_f1, _ = precision_recall_fscore_support(
            y_true, y_pred, labels=self.relationship_types, average='macro'
        )
        
        # Compute weighted averages
        weighted_precision, weighted_recall, weighted_f1, _ = precision_recall_fscore_support(
            y_true, y_pred, labels=self.relationship_types, average='weighted'
        )
        
        # Create results dictionary
        self.results = {
            'by_relationship': {
                rel: {
                    'precision': prec,
                    'recall': rec,
                    'f1': f1_score,
                    'support': sup
                }
                for rel, prec, rec, f1_score, sup in zip(
                    self.relationship_types, precision, recall, f1, support
                )
            },
            'macro_avg': {
                'precision': macro_precision,
                'recall': macro_recall,
                'f1': macro_f1
            },
            'weighted_avg': {
                'precision': weighted_precision,
                'recall': weighted_recall,
                'f1': weighted_f1
            },
            'n_common_pairs': len(common_pairs),
            'n_ground_truth': len(self.ground_truth),
            'n_predictions': len(self.predictions),
            'coverage': len(common_pairs) / len(self.ground_truth) if self.ground_truth else 0
        }
        
        # Compute confusion matrix
        self.results['confusion_matrix'] = confusion_matrix(
            y_true, y_pred, labels=self.relationship_types
        )
        
        return self.results
    
    def check_transitivity(self):
        """Check transitivity in predictions (A related to B, B to C => check A to C)."""
        # This is a simplified implementation that checks basic transitivity
        transitivity_results = {'valid': 0, 'invalid': 0, 'checks': 0}
        
        # Create a dictionary of all individuals and their relationships
        individual_relationships = {}
        for (id1, id2), rel in self.predictions.items():
            if id1 not in individual_relationships:
                individual_relationships[id1] = {}
            if id2 not in individual_relationships:
                individual_relationships[id2] = {}
            
            individual_relationships[id1][id2] = rel
            individual_relationships[id2][id1] = rel
        
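        # Symmetric relationship types (same label in both directions).
        # Kept for reference; this simplified check does not use the set.
        symmetric_relationships = {
            'full-sibling', 'half-sibling', 'first-cousin', 'second-cousin',
            'third-cousin', 'fourth-cousin'
        }
        
        # Define transitive relationship rules (simplified)
        # Format: (rel_A_B, rel_B_C) -> expected_rel_A_C
        # Predictions are stored without direction, so each rule assumes one
        # orientation (e.g. A is B's parent and B is C's parent). Triplets with
        # another valid orientation (e.g. two parents of the same child) are
        # counted as invalid, so treat the result as a rough consistency signal.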
        transitive_rules = {
            ('parent-child', 'parent-child'): 'grandparent-grandchild',
            ('full-sibling', 'parent-child'): 'aunt-nephew',
            ('parent-child', 'full-sibling'): 'aunt-nephew',
            ('parent-child', 'aunt-nephew'): 'first-cousin',
            ('full-sibling', 'full-sibling'): 'full-sibling',
            ('first-cousin', 'parent-child'): 'first-cousin-once-removed',
            ('parent-child', 'first-cousin'): 'first-cousin-once-removed'
        }
        
        # Check transitivity for all possible triplets
        individuals = list(individual_relationships.keys())
        for i in range(len(individuals)):
            for j in range(len(individuals)):
                if i == j:
                    continue
                    
                id_A = individuals[i]
                id_B = individuals[j]
                
                # Skip if A and B are not related
                if id_B not in individual_relationships[id_A]:
                    continue
                    
                rel_A_B = individual_relationships[id_A][id_B]
                
                for k in range(len(individuals)):
                    if k == i or k == j:
                        continue
                        
                    id_C = individuals[k]
                    
                    # Skip if B and C are not related
                    if id_C not in individual_relationships[id_B]:
                        continue
                        
                    rel_B_C = individual_relationships[id_B][id_C]
                    
                    # Skip if A and C are not related in our predictions
                    if id_C not in individual_relationships[id_A]:
                        continue
                        
                    rel_A_C = individual_relationships[id_A][id_C]
                    
                    # Check if this triplet matches a transitivity rule
                    if (rel_A_B, rel_B_C) in transitive_rules:
                        expected_rel_A_C = transitive_rules[(rel_A_B, rel_B_C)]
                        
                        transitivity_results['checks'] += 1
                        if rel_A_C == expected_rel_A_C:
                            transitivity_results['valid'] += 1
                        else:
                            transitivity_results['invalid'] += 1
                            # Record the invalid triplet
                            if 'invalid_triplets' not in transitivity_results:
                                transitivity_results['invalid_triplets'] = []
                            transitivity_results['invalid_triplets'].append({
                                'ids': (id_A, id_B, id_C),
                                'relationships': (rel_A_B, rel_B_C, rel_A_C),
                                'expected_A_C': expected_rel_A_C
                            })
        
        # Calculate transitivity validity percentage
        if transitivity_results['checks'] > 0:
            transitivity_results['validity_percent'] = (
                transitivity_results['valid'] / transitivity_results['checks'] * 100
            )
        else:
            transitivity_results['validity_percent'] = float('nan')
        
        self.results['transitivity'] = transitivity_results
        return transitivity_results
    
    def display_results(self, include_confusion_matrix=True):
        """Display validation results in a readable format."""
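        # check_transitivity() also writes to self.results, so test for the metrics keys
        if 'n_ground_truth' not in self.results:
            print("No results available. Run compute_metrics() first.")
            return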
        
        print("=== Bonsai Relationship Inference Validation Results ===")
        print(f"Total ground truth pairs: {self.results['n_ground_truth']}")
        print(f"Total predicted pairs: {self.results['n_predictions']}")
        print(f"Common pairs for validation: {self.results['n_common_pairs']}")
        print(f"Coverage: {self.results['coverage']:.2%}")
        
        print("\nAggregate Metrics:")
        print(f"Macro Precision: {self.results['macro_avg']['precision']:.4f}")
        print(f"Macro Recall: {self.results['macro_avg']['recall']:.4f}")
        print(f"Macro F1 Score: {self.results['macro_avg']['f1']:.4f}")
        
        print("\nMetrics by Relationship Type:")
        metrics_df = pd.DataFrame(
            {rel: {
                'precision': self.results['by_relationship'][rel]['precision'],
                'recall': self.results['by_relationship'][rel]['recall'],
                'f1': self.results['by_relationship'][rel]['f1'],
                'support': self.results['by_relationship'][rel]['support']
            } for rel in self.results['by_relationship']}
        ).transpose()
        
        # After the transpose every column is float, so format support with '{:.0f}' rather than '{:d}'
        display(metrics_df.style.format({'precision': '{:.4f}', 'recall': '{:.4f}', 'f1': '{:.4f}', 'support': '{:.0f}'})
                .background_gradient(subset=['precision', 'recall', 'f1'], cmap='YlGn'))
        
        if include_confusion_matrix and 'confusion_matrix' in self.results:
            cm = self.results['confusion_matrix']
            
            plt.figure(figsize=(10, 8))
            sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                       xticklabels=self.relationship_types,
                       yticklabels=self.relationship_types)
            plt.xlabel('Predicted')
            plt.ylabel('True')
            plt.title('Confusion Matrix')
            plt.tight_layout()
            plt.show()
        
        # Display transitivity results if available
        if 'transitivity' in self.results:
            trans = self.results['transitivity']
            print("\nTransitivity Check Results:")
            print(f"Total checks: {trans['checks']}")
            print(f"Valid transitivity: {trans['valid']}")
            print(f"Invalid transitivity: {trans['invalid']}")
            if 'validity_percent' in trans:
                print(f"Transitivity validity: {trans['validity_percent']:.2f}%")
            
            if 'invalid_triplets' in trans and trans['invalid_triplets']:
                print("\nSample Invalid Triplets:")
                for i, triplet in enumerate(trans['invalid_triplets'][:5]):  # Show first 5
                    print(f"  {i+1}. {triplet['ids'][0]} -> {triplet['ids'][1]} -> {triplet['ids'][2]}")
                    print(f"     Relationships: {triplet['relationships'][0]} -> {triplet['relationships'][1]} -> {triplet['relationships'][2]}")
                    print(f"     Expected {triplet['ids'][0]} -> {triplet['ids'][2]}: {triplet['expected_A_C']}")
                
                if len(trans['invalid_triplets']) > 5:
                    print(f"  ... and {len(trans['invalid_triplets']) - 5} more.")
    
    def plot_metrics_by_relationship(self):
        """Plot precision, recall, and F1 score for each relationship type."""
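        # check_transitivity() also writes to self.results, so test for the metrics key
        if 'by_relationship' not in self.results:
            print("No results available. Run compute_metrics() first.")
            return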
        
        # Extract data for plotting
        rels = list(self.results['by_relationship'].keys())
        precision = [self.results['by_relationship'][rel]['precision'] for rel in rels]
        recall = [self.results['by_relationship'][rel]['recall'] for rel in rels]
        f1 = [self.results['by_relationship'][rel]['f1'] for rel in rels]
        support = [self.results['by_relationship'][rel]['support'] for rel in rels]
        
        # Create a figure with 2 subplots
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        
        # Plot precision, recall, and F1 score
        x = np.arange(len(rels))
        width = 0.25
        
        ax1.bar(x - width, precision, width, label='Precision')
        ax1.bar(x, recall, width, label='Recall')
        ax1.bar(x + width, f1, width, label='F1 Score')
        
        ax1.set_ylabel('Score')
        ax1.set_title('Precision, Recall, and F1 Score by Relationship Type')
        ax1.set_xticks(x)
        ax1.set_xticklabels(rels, rotation=45, ha='right')
        ax1.legend()
        ax1.set_ylim(0, 1.1)
        
        # Plot support (number of samples); set tick positions before tick labels
        ax2.bar(x, support, color='skyblue')
        ax2.set_ylabel('Number of Samples')
        ax2.set_title('Number of Samples by Relationship Type')
        ax2.set_xticks(x)
        ax2.set_xticklabels(rels, rotation=45, ha='right')
        
        plt.tight_layout()
        plt.show()
        
        return fig

In [ ]:
# Demonstration of the validation framework with simulated data

# Define relationship types
relationship_types = [
    'parent-child',
    'full-sibling',
    'half-sibling',
    'aunt-nephew',
    'grandparent-grandchild',
    'first-cousin',
    'second-cousin',
    'third-cousin',
    'fourth-cousin',
    'unrelated'
]

# Initialize the validation framework
validator = BonsaiValidationFramework(relationship_types=relationship_types)

# Generate simulated ground truth and predictions
np.random.seed(42)
n_pairs = 200
individual_ids = [f"ind_{i}" for i in range(50)]

# Simulate ground truth relationships
for _ in range(n_pairs):
    id1, id2 = np.random.choice(individual_ids, 2, replace=False)
    rel = np.random.choice(relationship_types)
    validator.add_ground_truth(id1, id2, rel)

# Simulate predictions with varying accuracy
# Close relationships have higher accuracy
relationship_accuracy = {
    'parent-child': 0.95,
    'full-sibling': 0.90,
    'half-sibling': 0.85,
    'aunt-nephew': 0.80,
    'grandparent-grandchild': 0.85,
    'first-cousin': 0.75,
    'second-cousin': 0.65,
    'third-cousin': 0.55,
    'fourth-cousin': 0.45,
    'unrelated': 0.70
}

# Add predictions with simulated errors
for pair, true_rel in validator.ground_truth.items():
    # With probability equal to relationship accuracy, predict correctly
    if np.random.random() < relationship_accuracy[true_rel]:
        pred_rel = true_rel
    else:
        # Otherwise, predict a random relationship (error)
        other_rels = [r for r in relationship_types if r != true_rel]
        pred_rel = np.random.choice(other_rels)
    
    # Add the prediction
    validator.add_prediction(pair[0], pair[1], pred_rel)

# Add some extra predictions not in ground truth
for _ in range(50):
    id1, id2 = np.random.choice(individual_ids, 2, replace=False)
    pair = tuple(sorted([id1, id2]))
    
    # Skip if already in ground truth
    if pair in validator.ground_truth:
        continue
    
    rel = np.random.choice(relationship_types)
    validator.add_prediction(id1, id2, rel)

# Compute validation metrics
print("Computing validation metrics...")
validator.compute_metrics()

# Check transitivity
print("Checking transitivity...")
validator.check_transitivity()

# Display the results
validator.display_results()

# Plot metrics by relationship type
validator.plot_metrics_by_relationship()

## Summary

In this lab, we explored the application of Bonsai v3 to real-world genetic datasets, examining practical challenges and adaptation strategies. Key points covered include:

1. **Types of real-world genetic datasets**: We examined different sources of genetic data (DTC testing, research cohorts, historical datasets) and their characteristics.

2. **Data preparation challenges**: We implemented a pipeline to handle format conversion, quality control, phasing, and IBD detection for various data types.

3. **Population-specific considerations**: We explored how population structure affects relationship inference and implemented population-aware adjustments for Bonsai.

4. **Validation strategies**: We developed a validation framework that assesses relationship inference with per-relationship precision, recall, and F1 scores, a confusion matrix, and a transitivity consistency check.

These tools and approaches enable effective application of Bonsai v3 to diverse real-world datasets, accounting for the complexities encountered in practical genetic genealogy.

## Self-Assessment

**1. What factors distinguish direct-to-consumer (DTC) genetic data from research cohort data?**
<details>
<summary>Click to see answer</summary>

DTC genetic data differs from research cohort data in several key ways:
- **Coverage**: DTC tests typically use genotyping arrays with roughly 500,000 to 1 million SNPs, while research cohorts may also include whole-genome sequencing
- **Quality control**: Research data generally undergoes more rigorous, documented QC
- **Metadata**: DTC relies largely on self-reported information, while research cohorts often include clinically verified data
- **Format**: DTC data comes in proprietary formats, research data typically uses standard formats like VCF
- **Sample selection**: DTC data represents self-selected consumers, research cohorts are selected based on study design
</details>

**2. What are the main data preparation steps needed before applying Bonsai to genetic data?**
<details>
<summary>Click to see answer</summary>

Main data preparation steps include:
1. **Format conversion**: Converting between formats (e.g., 23andMe text to VCF)
2. **Quality control**: Filtering for variant quality, sample quality, missingness, etc.
3. **Phasing**: Determining haplotype phase for accurate IBD detection
4. **IBD detection**: Running appropriate IBD detection algorithms
5. **Parameter adjustment**: Tuning IBD parameters for dataset characteristics
6. **Data harmonization**: Ensuring reference genome compatibility and consistent marker sets
7. **Metadata preparation**: Organizing age, sex, and other information for relationship inference
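
As a small illustration of the format-conversion step, the sketch below parses 23andMe-style raw text (tab-separated `rsid`, `chromosome`, `position`, `genotype` columns, with `#` comment lines) into a DataFrame and flags no-calls. The helper name `parse_23andme_text` and the sample rows are made up for illustration. Real exports should be checked against the vendor's current format, and a full conversion to VCF also needs reference alleles, which this sketch does not handle.

```python
import io
import pandas as pd

def parse_23andme_text(handle):
    """Parse 23andMe-style raw text into a DataFrame (hypothetical helper)."""
    df = pd.read_csv(
        handle,
        sep="\t",
        comment="#",          # skip header/comment lines
        header=None,
        names=["rsid", "chromosome", "position", "genotype"],
        dtype={"rsid": str, "chromosome": str, "position": int, "genotype": str},
    )
    # No-calls are written as "--" in this format
    df["is_missing"] = df["genotype"].eq("--")
    return df

# Made-up rows standing in for a real export
sample = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAG\n"
    "rs0000002\t1\t2000\t--\n"
    "rs0000003\tX\t3000\tC\n"
)
genotypes = parse_23andme_text(sample)
print(genotypes)
print(f"Missing rate: {genotypes['is_missing'].mean():.2%}")
```

Reading the chromosome column as a string keeps labels such as `X`, `Y`, and `MT` intact alongside the numbered autosomes.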
</details>

**3. How does endogamy impact the accuracy of relationship inference?**
<details>
<summary>Click to see answer</summary>

Endogamy impacts relationship inference in several ways:
- **Elevated background IBD**: Higher levels of shared IBD between nominally unrelated individuals
- **Multiple relationship paths**: Individuals may be related through multiple common ancestors
- **Relationship classification difficulty**: Harder to distinguish between different relationship types
- **False positives**: Increased risk of inferring relationships that don't exist
- **Confidence reduction**: Lower confidence in distant relationship classifications
- **Parameter adjustment needs**: Requires adjusting IBD thresholds and relationship priors
- **Transitivity issues**: Simple transitivity rules may not hold in highly endogamous populations
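
To make the threshold point concrete, here is a toy sketch of how raising the minimum IBD segment length changes the total shared IBD for two pairs. All segment lengths are invented, the function `total_ibd_cm` is a hypothetical helper, and this is not how Bonsai itself models endogamy. It only illustrates why short background segments matter.

```python
def total_ibd_cm(segment_lengths_cm, min_segment_cm):
    """Sum the IBD segments at or above a minimum length (in cM)."""
    return sum(length for length in segment_lengths_cm if length >= min_segment_cm)

# Invented segment lengths (cM) for two pairs in an endogamous cohort
close_relative_pair = [85.0, 62.5, 40.0, 33.0, 9.0, 7.5]   # a few long segments
background_pair = [9.5, 8.0, 7.5, 7.2, 7.0, 7.0]           # many short segments

totals = {}
for threshold in (7.0, 10.0):
    totals[threshold] = (
        total_ibd_cm(close_relative_pair, threshold),
        total_ibd_cm(background_pair, threshold),
    )
    close_total, background_total = totals[threshold]
    print(f"min segment {threshold:>4.1f} cM: "
          f"close pair {close_total:.1f} cM, background pair {background_total:.1f} cM")
```

In this toy data the stricter threshold removes all sharing for the background pair while the close pair keeps most of its total. Appropriate thresholds depend on the population and the IBD caller, so they should be calibrated rather than copied from this example.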
</details>

**4. What metrics should you use to validate relationship inference results?**
<details>
<summary>Click to see answer</summary>

Key validation metrics include:
- **Precision**: Of the pairs predicted as a given relationship type, the fraction that truly have that relationship
- **Recall**: Of the pairs that truly have a given relationship type, the fraction predicted as that type
- **F1 score**: Harmonic mean of precision and recall
- **Confusion matrix**: Shows the pattern of relationship classification errors
- **Transitivity validity**: Percentage of relationship triplets that follow expected transitivity rules
- **Coverage**: Percentage of ground-truth pairs that received any prediction
- **Relationship-specific metrics**: Performance broken down by relationship type
- **ROC and PR curves**: For evaluating performance across different threshold settings
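
The first three metrics can be computed by hand from simple counts. The sketch below does this for six invented pairs. The function `per_class_metrics` is a hypothetical helper written for illustration, not part of the validation framework.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Precision, recall, and F1 per label, computed directly from counts."""
    true_positive = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)  # true positives + false positives per label
    actual = Counter(y_true)     # true positives + false negatives per label
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = true_positive[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / actual[label] if actual[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {
            "precision": precision, "recall": recall,
            "f1": f1, "support": actual[label],
        }
    return metrics

# Invented labels for six pairs
y_true = ["parent-child", "parent-child", "full-sibling",
          "full-sibling", "first-cousin", "first-cousin"]
y_pred = ["parent-child", "full-sibling", "full-sibling",
          "full-sibling", "first-cousin", "full-sibling"]

metrics = per_class_metrics(y_true, y_pred)
for label, values in metrics.items():
    print(f"{label:>13}: precision={values['precision']:.2f} "
          f"recall={values['recall']:.2f} f1={values['f1']:.2f} "
          f"support={values['support']}")
```

Here `full-sibling` is over-predicted, so its recall is perfect but its precision drops, while the other two types show the opposite pattern.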
</details>

**5. What population-specific adjustments should be made to Bonsai for optimal performance?**
<details>
<summary>Click to see answer</summary>

Population-specific adjustments include:
- **IBD detection parameters**: Adjust minimum segment length and error tolerance 
- **Relationship priors**: Modify prior probabilities based on known population structure
- **Endogamy correction**: Apply population-specific endogamy models
- **Age model parameters**: Adjust generation time estimates based on cultural factors
- **Relationship likelihood functions**: Calibrate expected IBD distributions for the population
- **Pedigree depth limits**: Adjust based on background IBD levels
- **Transitivity rules**: Modify for population-specific family structure patterns
- **Population-specific reference datasets**: Use appropriate reference data for phasing and imputation
</details>

**6. How would you handle a dataset with high levels of missing data?**
<details>
<summary>Click to see answer</summary>

Strategies for handling high levels of missing data:
1. **Imputation**: Use reference panels to statistically infer missing genotypes
2. **Filtering**: Remove variants/samples with excessive missingness
3. **Weighted analysis**: Downweight evidence from regions with high missingness
4. **Confidence adjustment**: Reduce relationship confidence scores proportionally to missingness
5. **Missing data simulation**: Test impact by simulating different levels of missingness
6. **Platform-aware approaches**: Account for systematic platform differences
7. **Marker selection**: Focus on high-coverage markers shared across samples
8. **Region-specific analysis**: Analyze genome regions with good coverage separately
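
The filtering strategy (item 2) can be sketched with NumPy as follows. The genotype matrix, the `-1` missing code, the thresholds, and the function name `filter_by_missingness` are all assumptions made for this example. Real pipelines usually apply such filters with tools like PLINK or bcftools.

```python
import numpy as np

def filter_by_missingness(genotypes, max_variant_missing=0.1,
                          max_sample_missing=0.1, missing_code=-1):
    """Drop high-missingness variants first, then high-missingness samples.

    genotypes: 2D array (samples x variants) with missing_code marking no-calls.
    Returns the filtered matrix plus boolean keep-masks for samples and variants.
    """
    missing = genotypes == missing_code
    # Per-variant missing rate across all samples
    keep_variants = missing.mean(axis=0) <= max_variant_missing
    # Per-sample missing rate, measured only on the retained variants
    keep_samples = missing[:, keep_variants].mean(axis=1) <= max_sample_missing
    filtered = genotypes[np.ix_(keep_samples, keep_variants)]
    return filtered, keep_samples, keep_variants

# Toy matrix: 4 samples x 5 variants, alternate-allele counts with -1 as missing
toy = np.array([
    [0, 1, -1, 2, 0],
    [1, 1, -1, 0, 0],
    [2, 0, -1, 1, 1],
    [-1, -1, -1, -1, 0],
])
filtered, keep_samples, keep_variants = filter_by_missingness(
    toy, max_variant_missing=0.5, max_sample_missing=0.5
)
print("Variants kept:", keep_variants)
print("Samples kept:", keep_samples)
print(filtered)
```

Filtering variants before samples avoids discarding a sample only because of a few uniformly failed markers. The order and thresholds are a design choice that should be tuned to the dataset.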
</details>

## Further Reading

To deepen your understanding of real-world dataset considerations in genetic genealogy, explore these resources:

1. Browning, S. R., & Browning, B. L. (2020). Genetic phasing and imputation. Annual Review of Genomics and Human Genetics, 21, 285-308.

2. Skov, L., & Schierup, M. H. (2017). Analysis of 62 hybrid assembled human Y chromosomes exposes rapid structural changes and high rates of gene conversion. PLoS Genetics, 13(8), e1006834.

3. Carmi, S., Hui, K. Y., Kochav, E., Liu, X., Xue, J., Grady, F., ... & Pe'er, I. (2014). Sequencing an Ashkenazi reference panel supports population-targeted personal genomics and illuminates Jewish and European origins. Nature Communications, 5(1), 4835.

4. Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L. T., Sharp, K., ... & Marchini, J. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature, 562(7726), 203-209.

5. Diroma, M. A., Cinnirella, F., Pesole, G., & Picardi, E. (2020). Investigating population structure and genealogical relationships in human mitochondrial DNA variation: a comprehensive review. Nucleic Acids Research, 48(19), 10545-10566.

6. Ramstetter, M. D., Dyer, T. D., Lehman, D. M., Göring, H. H., Curran, J. E., Duggirala, R., ... & Williams, A. L. (2017). Benchmarking relatedness inference methods with genome-wide data from thousands of relatives. Genetics, 207(1), 75-82.

7. Ball, C. A., Barber, M. J., Byrnes, J., Carbonetto, P., Chahine, K. G., Curtis, R. E., ... & Wang, Y. (2020). AncestryDNA solves for identity. Nature Communications, 11(1), 2711.