# Lab 18: Data Quality and Preprocessing in Bonsai

In this lab, we'll explore how data quality and preprocessing impact pedigree reconstruction with Bonsai. Building upon our understanding of Bonsai's architecture and likelihood calculations from previous labs, we'll focus on techniques to identify and address data quality issues, implement effective filtering strategies, and build robust preprocessing pipelines.

## Why This Matters

Input data quality directly limits the accuracy of pedigree reconstruction. Even with excellent algorithms, poor-quality data leads to unreliable results. Understanding how to detect and mitigate data quality issues is essential for:
- Improving the accuracy of relationship inference
- Reducing false positives and false negatives in IBD detection
- Handling missing or incomplete data effectively
- Adapting to population-specific challenges
- Building confidence in the reconstructed pedigrees

**Learning Objectives**:
- Identify common data quality issues affecting pedigree reconstruction
- Implement advanced filtering techniques for IBD segment data
- Develop strategies for handling missing data and incomplete pedigrees
- Build effective data preprocessing pipelines for Bonsai
- Detect and mitigate errors in IBD detection
- Apply quality control metrics to evaluate input data reliability

## Environment Setup

In [None]:
!poetry install --no-root

In [None]:
import os
import math
import logging
import sys
import re
import warnings
from pathlib import Path
import subprocess
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
import pandas as pd
import numpy as np
import networkx as nx
from scipy import stats
from collections import defaultdict, Counter
import random
import time
import json
from tqdm.notebook import tqdm
from dotenv import load_dotenv

In [None]:
# Environment setup for cross-compatibility
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Now you can use DATA_DIR and RESULTS_DIR consistently across environments


## 1. Common Data Quality Issues in Genetic Genealogy

Before diving into specific preprocessing techniques, let's explore common data quality issues that affect pedigree reconstruction. Understanding these issues is the first step toward developing effective solutions.

### 1.1 IBD Detection Errors

IBD detection algorithms are imperfect and produce three main kinds of error:

- **False positives**: Segments incorrectly identified as IBD
- **False negatives**: True IBD segments that are missed
- **Boundary errors**: Imprecise determination of segment start/end positions

Let's load some IBD segments and examine potential quality issues:

In [None]:
# Load IBD segments from a sample file (DATA_DIR comes from setup_environment above)
seg_file = os.path.join(DATA_DIR, "class_data/ped_sim_run2.seg")
seg_df = pd.read_csv(seg_file, sep="\t", header=None)
seg_df.columns = ["sample1", "sample2", "chrom", "phys_start", "phys_end", "ibd_type", "gen_start", "gen_end", "gen_seg_len"]

# Display basic information about the segments
print(f"Loaded {len(seg_df)} IBD segments")
print(f"Number of unique pairs: {seg_df[['sample1', 'sample2']].drop_duplicates().shape[0]}")
print(f"Number of unique individuals: {len(set(seg_df['sample1']).union(set(seg_df['sample2'])))}")

# Check for potential issues
print("\nPotential data quality issues:")

# Check for very short segments (potential false positives)
short_segments = seg_df[seg_df['gen_seg_len'] < 4]
print(f"Very short segments (< 4 cM): {len(short_segments)} ({len(short_segments)/len(seg_df)*100:.2f}%)")

# Check for very long segments (normal for close relatives such as parent-child, otherwise worth inspecting)
long_segments = seg_df[seg_df['gen_seg_len'] > 200]
print(f"Very long segments (> 200 cM): {len(long_segments)} ({len(long_segments)/len(seg_df)*100:.2f}%)")

# Check for segments with unusual IBD type
unusual_ibd_type = seg_df[~seg_df['ibd_type'].isin(['IBD1', 'IBD2'])]
print(f"Segments with unusual IBD type: {len(unusual_ibd_type)}")

# Display a sample of potentially problematic segments
print("\nSample of short segments (potential false positives):")
display(short_segments.head())

print("\nSample of long segments (potential errors):")
display(long_segments.head())

# Visualize segment length distribution to identify potential issues
plt.figure(figsize=(12, 6))
plt.hist(seg_df['gen_seg_len'], bins=50, alpha=0.7)
plt.axvline(x=4, color='red', linestyle='--', label='4 cM threshold')
plt.axvline(x=7, color='green', linestyle='--', label='7 cM threshold')
plt.axvline(x=20, color='purple', linestyle='--', label='20 cM threshold')
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('IBD Segment Length Distribution')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(0, 100)  # Focus on the relevant range
plt.show()
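
Length thresholds are not the only quick check. Simple structural tests can also flag malformed records before any filtering. The following is a minimal sketch, assuming the nine-column layout loaded above; the helper name `check_segment_integrity` and the 0.01 cM tolerance are illustrative choices, not part of Bonsai.

```python
import pandas as pd


def check_segment_integrity(df, tol=0.01):
    """Count structurally suspicious rows in an IBD segment table.

    Hypothetical helper (not part of Bonsai). Assumes the columns
    sample1, sample2, phys_start, phys_end, gen_start, gen_end and
    gen_seg_len used in this lab.
    """
    length_gap = (df["gen_end"] - df["gen_start"] - df["gen_seg_len"]).abs()
    return {
        # A sample listed as sharing IBD with itself
        "self_pairs": int((df["sample1"] == df["sample2"]).sum()),
        # Coordinates that run backwards or have zero extent
        "bad_phys_order": int((df["phys_end"] <= df["phys_start"]).sum()),
        "bad_gen_order": int((df["gen_end"] <= df["gen_start"]).sum()),
        # Reported length disagrees with gen_end - gen_start
        "length_mismatch": int((length_gap > tol).sum()),
        # Exact duplicate rows (second and later copies)
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Tiny synthetic table (made-up values, for illustration only)
demo = pd.DataFrame(
    [
        ("a", "b", 1, 1_000_000, 9_000_000, "IBD1", 10.0, 22.0, 12.0),
        ("a", "b", 1, 1_000_000, 9_000_000, "IBD1", 10.0, 22.0, 12.0),  # duplicate
        ("a", "a", 2, 5_000_000, 6_000_000, "IBD1", 30.0, 35.0, 5.0),   # self pair
        ("b", "c", 3, 8_000_000, 7_000_000, "IBD1", 40.0, 48.0, 9.5),   # reversed, mismatched
    ],
    columns=["sample1", "sample2", "chrom", "phys_start", "phys_end",
             "ibd_type", "gen_start", "gen_end", "gen_seg_len"],
)
integrity_report = check_segment_integrity(demo)
print(integrity_report)
```

On real data, any non-zero count is a prompt to inspect the offending rows rather than proof of an error; some tools, for example, may round reported lengths.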

### 1.2 Phasing and Genotyping Errors

Errors in the underlying genetic data can propagate to IBD detection and pedigree reconstruction:

- **Phasing errors**: Incorrect assignment of alleles to parental chromosomes
- **Genotyping errors**: Incorrect genotype calls in the raw data
- **Missing genotypes**: Gaps in genetic data

Let's examine how these errors might manifest in segment data:

In [None]:
# Function to simulate the effect of phasing and genotyping errors on IBD detection
def simulate_errors(base_segments, error_rate=0.05, seg_length_effect=0.2, num_segments_effect=0.1):
    """Simulate the effect of phasing and genotyping errors on IBD segments.
    
    Args:
        base_segments: DataFrame of IBD segments without errors
        error_rate: Rate of segments affected by errors
        seg_length_effect: Factor by which segment lengths are affected
        num_segments_effect: Factor affecting the number of segments
        
    Returns:
        DataFrame with simulated errors
    """
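    # Create a copy of the base segments with a fresh 0..n-1 index, because the
    # positional indices drawn below are used as labels with .loc and .drop
    error_segments = base_segments.copy().reset_index(drop=True)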
    
    # Randomly select segments to affect with errors
    num_affected = int(len(error_segments) * error_rate)
    affected_indices = np.random.choice(len(error_segments), size=num_affected, replace=False)
    
    # Apply effects to selected segments
    for idx in affected_indices:
        # Randomly shorten segments (phasing errors often break segments)
        if np.random.random() < 0.7:  # 70% chance of segment shortening
            orig_length = error_segments.loc[idx, 'gen_seg_len']
            new_length = orig_length * (1 - seg_length_effect * np.random.random())
            error_segments.loc[idx, 'gen_seg_len'] = new_length
            
            # Adjust genetic positions accordingly
            orig_range = error_segments.loc[idx, 'gen_end'] - error_segments.loc[idx, 'gen_start']
            new_range = new_length
            scale_factor = new_range / orig_range if orig_range > 0 else 1
            
            error_segments.loc[idx, 'gen_end'] = error_segments.loc[idx, 'gen_start'] + new_range
            
            # Adjust physical positions proportionally
            phys_range = error_segments.loc[idx, 'phys_end'] - error_segments.loc[idx, 'phys_start']
            error_segments.loc[idx, 'phys_end'] = error_segments.loc[idx, 'phys_start'] + int(phys_range * scale_factor)
    
    # Drop some segments entirely (false negatives)
    drop_indices = np.random.choice(len(error_segments), size=int(len(error_segments) * num_segments_effect), replace=False)
    error_segments = error_segments.drop(drop_indices)
    
    # Add some false positive segments (typically shorter)
    num_false_pos = int(len(base_segments) * num_segments_effect * 0.8)  # 80% of dropped segments
    
    false_pos_segments = []
    # Collect the individuals once, outside the loop
    all_individuals = pd.unique(base_segments[['sample1', 'sample2']].values.ravel())
    for _ in range(num_false_pos):
        # Random pair of individuals
        sample1, sample2 = np.random.choice(all_individuals, size=2, replace=False)
        
        # Random chromosome
        chrom = np.random.randint(1, 23)
        
        # Generate a short segment (false positives tend to be short)
        gen_start = np.random.uniform(0, 250)
        gen_seg_len = np.random.uniform(1, 5)  # Short segments
        gen_end = gen_start + gen_seg_len
        
        # Generate physical positions (simplified)
        phys_start = np.random.randint(1000000, 200000000)
        phys_end = phys_start + np.random.randint(1000000, 10000000)
        
        # Always IBD1 for simplicity
        ibd_type = 'IBD1'
        
        false_pos_segments.append({
            'sample1': sample1,
            'sample2': sample2,
            'chrom': chrom,
            'phys_start': phys_start,
            'phys_end': phys_end,
            'ibd_type': ibd_type,
            'gen_start': gen_start,
            'gen_end': gen_end,
            'gen_seg_len': gen_seg_len
        })
    
    # Combine original segments (with errors) and false positives
    error_segments = pd.concat([error_segments, pd.DataFrame(false_pos_segments)], ignore_index=True)
    
    return error_segments

# Simulate some errors to demonstrate their effects
# Select a subset of the data for demonstration
subset_df = seg_df.sample(n=min(1000, len(seg_df)), random_state=42)

# Simulate errors with different rates (seeded so the comparison is reproducible)
np.random.seed(42)
low_error_df = simulate_errors(subset_df, error_rate=0.05)
high_error_df = simulate_errors(subset_df, error_rate=0.2)

# Compare the distributions
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.hist(subset_df['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Original Segments')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.subplot(1, 3, 2)
plt.hist(low_error_df['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Low Error Rate (5%)')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.subplot(1, 3, 3)
plt.hist(high_error_df['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('High Error Rate (20%)')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.tight_layout()
plt.show()

# Compare summary statistics
def print_summary_stats(df, name):
    print(f"Summary for {name}:")
    print(f"  Number of segments: {len(df)}")
    print(f"  Mean segment length: {df['gen_seg_len'].mean():.2f} cM")
    print(f"  Median segment length: {df['gen_seg_len'].median():.2f} cM")
    print(f"  Segments < 5 cM: {(df['gen_seg_len'] < 5).sum()} ({(df['gen_seg_len'] < 5).sum() / len(df) * 100:.1f}%)")
    print(f"  Segments > 50 cM: {(df['gen_seg_len'] > 50).sum()} ({(df['gen_seg_len'] > 50).sum() / len(df) * 100:.1f}%)")

print_summary_stats(subset_df, "Original Data")
print_summary_stats(low_error_df, "Low Error Data")
print_summary_stats(high_error_df, "High Error Data")
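
Because phasing errors tend to break one true segment into adjacent fragments, a common mitigation is to stitch fragments back together when the gap between them is small. Below is a minimal sketch of that idea. It assumes each pair is always stored in the same `sample1`/`sample2` order; the helper name `stitch_segments` and the 1 cM default gap are illustrative assumptions, not Bonsai settings.

```python
import pandas as pd


def stitch_segments(df, max_gap_cm=1.0):
    """Merge same-pair, same-chromosome segments separated by small gaps.

    Hypothetical helper (not part of Bonsai); max_gap_cm is an
    illustrative default, not a recommended value.
    """
    keys = ["sample1", "sample2", "chrom", "ibd_type"]
    merged = []
    # Sorting first means rows inside each group arrive in gen_start order
    for _, grp in df.sort_values("gen_start").groupby(keys, sort=False):
        current = None
        for row in grp.itertuples(index=False):
            if current is not None and row.gen_start - current["gen_end"] <= max_gap_cm:
                # Extend the open segment across the gap (or overlap)
                current["gen_end"] = max(current["gen_end"], row.gen_end)
                current["phys_end"] = max(current["phys_end"], row.phys_end)
            else:
                if current is not None:
                    merged.append(current)
                current = dict(row._asdict())
        merged.append(current)
    out = pd.DataFrame(merged, columns=df.columns)
    # Recompute lengths from the stitched coordinates
    out["gen_seg_len"] = out["gen_end"] - out["gen_start"]
    return out


# Synthetic fragments (made-up values): the first two rows are 0.4 cM apart
frags = pd.DataFrame(
    [
        ("a", "b", 1, 1_000_000, 5_000_000, "IBD1", 10.0, 18.0, 8.0),
        ("a", "b", 1, 5_200_000, 9_000_000, "IBD1", 18.4, 25.0, 6.6),
        ("a", "b", 1, 30_000_000, 36_000_000, "IBD1", 60.0, 70.0, 10.0),
        ("a", "c", 2, 2_000_000, 4_000_000, "IBD1", 5.0, 9.0, 4.0),
    ],
    columns=["sample1", "sample2", "chrom", "phys_start", "phys_end",
             "ibd_type", "gen_start", "gen_end", "gen_seg_len"],
)
stitched = stitch_segments(frags)
print(stitched[["sample1", "sample2", "chrom", "gen_start", "gen_end", "gen_seg_len"]])
```

Stitching should generally run before length filtering: two 5 cM fragments would both fail a 7 cM threshold, while the stitched segment would pass it.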

### 1.3 Population-Specific Issues

Different populations present unique challenges for IBD detection and pedigree reconstruction:

- **Endogamy**: Increased background IBD due to historical intermarriage
- **Population bottlenecks**: Reduced genetic diversity leading to more identical segments
- **Recent admixture**: Complex patterns of ancestry that can complicate IBD detection

Let's examine how these issues might manifest in the data:

In [None]:
# Function to simulate different population scenarios
def simulate_population_effects(base_segments, population_type, num_individuals=50):
    """Simulate the effect of different population structures on IBD sharing.
    
    Args:
        base_segments: DataFrame to use as template for segment structure
        population_type: One of 'outbred', 'endogamous', 'bottleneck', or 'admixed'
        num_individuals: Number of individuals to simulate
        
    Returns:
        DataFrame with simulated segments
    """
    # Create random individual IDs
    individuals = [f"ind_{i}" for i in range(num_individuals)]
    
    # Generate all possible pairs
    pairs = [(i, j) for i in range(num_individuals) for j in range(i+1, num_individuals)]
    
    # Initialize parameters based on population type
    if population_type == 'outbred':
        # Standard outbred population
        bg_ibd_rate = 0.01  # Background IBD rate (proportion of genome shared on average between unrelated individuals)
        related_pair_rate = 0.05  # Proportion of pairs that are related
        related_icor_factor = 0.5  # Coefficient of relationship for related pairs (r = 0.5 for parent-child and full siblings)
        segment_count_factor = 1.0
        
    elif population_type == 'endogamous':
        # Endogamous population with elevated background IBD
        bg_ibd_rate = 0.05  # Higher background IBD
        related_pair_rate = 0.1  # More related pairs
        related_icor_factor = 0.6  # Higher sharing due to multiple relationships
        segment_count_factor = 1.5  # More segments due to endogamy
        
    elif population_type == 'bottleneck':
        # Population that went through a bottleneck
        bg_ibd_rate = 0.03  # Elevated background IBD
        related_pair_rate = 0.08  # More related pairs
        related_icor_factor = 0.55  # Slightly higher sharing
        segment_count_factor = 1.2  # More segments
        
    elif population_type == 'admixed':
        # Recently admixed population
        bg_ibd_rate = 0.02  # Moderate background IBD
        related_pair_rate = 0.05  # Standard related pair rate
        related_icor_factor = 0.5  # Standard relatedness
        segment_count_factor = 0.8  # Fewer segments due to more heterozygosity
        
    else:
        raise ValueError(f"Unknown population type: {population_type}")
    
    # Generate segments for each pair
    simulated_segments = []
    
    for i, j in pairs:
        # Determine if this pair is closely related
        is_related = np.random.random() < related_pair_rate
        
        # Determine amount of sharing
        if is_related:
            # Related pairs (siblings, parent-child, etc.)
            icor = related_icor_factor
        else:
            # Unrelated pairs - still have some background IBD
            icor = bg_ibd_rate
        
        # Generate segments based on icor
        # Simplified model: expected total sharing is icor * 3500 cM (total genome length)
        expected_total_cm = icor * 3500
        
        # For closely related pairs, generate structured segments
        if is_related and icor > 0.1:
            # Parent-child or sibling-like pattern
            if np.random.random() < 0.5:  # Parent-child
                # Generate ~22 segments (one per chromosome) of substantial length
                num_segments = np.random.randint(20, 25)
                segment_lengths = np.random.normal(expected_total_cm / num_segments, 
                                                 expected_total_cm / num_segments * 0.2, 
                                                 num_segments)
                ibd_types = ['IBD1'] * num_segments  # Parent-child is always IBD1
            else:  # Sibling-like
                # More segments with a mix of IBD1 and IBD2
                num_segments = np.random.randint(30, 40)
                segment_lengths = np.random.normal(expected_total_cm / num_segments, 
                                                 expected_total_cm / num_segments * 0.3, 
                                                 num_segments)
                # About 25% of sibling segments are IBD2
                ibd_types = ['IBD1'] * num_segments
                for k in range(int(num_segments * 0.25)):
                    ibd_types[k] = 'IBD2'
        else:
            # For distant relationships, exponential distribution of segment lengths
            # Average segment length decreases with more distant relationships
            avg_segment_len = max(5, 30 * icor)  # in cM
            num_segments = int(max(1, expected_total_cm / avg_segment_len * segment_count_factor))
            
            # Generate segment lengths from exponential distribution
            segment_lengths = np.random.exponential(avg_segment_len, num_segments)
            
            # Clamp segment lengths to the range [3, 100] cM (values are clipped, not dropped)
            segment_lengths = np.clip(segment_lengths, 3, 100)
            
            # All IBD1 for distant relationships
            ibd_types = ['IBD1'] * num_segments
        
        # Create segment records
        for k in range(num_segments):
            # Skip if segment is too short
            if segment_lengths[k] < 1:
                continue
                
            # Random chromosome
            chrom = np.random.randint(1, 23)
            
            # Random genetic positions
            genetic_length = min(segment_lengths[k], 280)  # Cap at chromosome length
            gen_start = np.random.uniform(0, 280 - genetic_length)
            gen_end = gen_start + genetic_length
            
            # Random physical positions (simplified)
            phys_start = np.random.randint(1000000, 200000000)
            phys_end = phys_start + np.random.randint(1000000, 50000000)
            
            # Add segment
            simulated_segments.append({
                'sample1': f"ind_{i}",
                'sample2': f"ind_{j}",
                'chrom': chrom,
                'phys_start': phys_start,
                'phys_end': phys_end,
                'ibd_type': ibd_types[k],
                'gen_start': gen_start,
                'gen_end': gen_end,
                'gen_seg_len': genetic_length
            })
    
    return pd.DataFrame(simulated_segments)

# Simulate different population scenarios
np.random.seed(42)  # For reproducibility
outbred_df = simulate_population_effects(seg_df, 'outbred', num_individuals=30)
endogamous_df = simulate_population_effects(seg_df, 'endogamous', num_individuals=30)
bottleneck_df = simulate_population_effects(seg_df, 'bottleneck', num_individuals=30)
admixed_df = simulate_population_effects(seg_df, 'admixed', num_individuals=30)

# Compare summary statistics
print_summary_stats(outbred_df, "Outbred Population")
print_summary_stats(endogamous_df, "Endogamous Population")
print_summary_stats(bottleneck_df, "Bottleneck Population")
print_summary_stats(admixed_df, "Admixed Population")

# Compare segment length distributions
plt.figure(figsize=(15, 5))

plt.subplot(1, 4, 1)
plt.hist(outbred_df['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Outbred Population')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.subplot(1, 4, 2)
plt.hist(endogamous_df['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Endogamous Population')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.subplot(1, 4, 3)
plt.hist(bottleneck_df['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Bottleneck Population')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.subplot(1, 4, 4)
plt.hist(admixed_df['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Admixed Population')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.tight_layout()
plt.show()

# Compare pairwise IBD sharing distributions
def plot_pairwise_sharing_distributions(dfs, names):
    plt.figure(figsize=(12, 6))
    
    for i, (df, name) in enumerate(zip(dfs, names)):
        # Calculate total IBD sharing for each pair
        pair_sharing = df.groupby(['sample1', 'sample2'])['gen_seg_len'].sum().reset_index()
        
        # Plot as CDF for comparison
        x = np.sort(pair_sharing['gen_seg_len'])
        y = np.arange(1, len(x)+1) / len(x)
        plt.plot(x, y, label=name)
    
    plt.xscale('log')
    plt.xlabel('Total IBD Sharing (cM)')
    plt.ylabel('Cumulative Probability')
    plt.title('Pairwise IBD Sharing Distributions')
    plt.grid(True, alpha=0.3)
    plt.legend()
    plt.show()

# Plot pairwise sharing distributions
plot_pairwise_sharing_distributions(
    [outbred_df, endogamous_df, bottleneck_df, admixed_df],
    ['Outbred', 'Endogamous', 'Bottleneck', 'Admixed']
)
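
One practical way to gauge population-level background sharing in a real dataset is to summarize total sharing among pairs that are not obviously close relatives. The sketch below assumes a segment table with `sample1`, `sample2`, and `gen_seg_len` columns; the helper name `summarize_background_sharing` and the 200 cM cutoff are illustrative assumptions rather than established defaults.

```python
import pandas as pd


def summarize_background_sharing(df, related_cutoff_cm=200.0):
    """Summarize per-pair sharing among pairs below a relatedness cutoff.

    Hypothetical helper (not part of Bonsai). Pairs with no detected
    segments never appear in the table, so they are not counted here.
    """
    totals = df.groupby(["sample1", "sample2"])["gen_seg_len"].agg(
        total_cm="sum", n_segments="count"
    )
    # Treat pairs under the cutoff as "background" for this rough summary
    background = totals[totals["total_cm"] < related_cutoff_cm]
    return {
        "n_pairs": len(totals),
        "n_background_pairs": len(background),
        "median_background_cm": float(background["total_cm"].median()),
        "mean_background_segments": float(background["n_segments"].mean()),
    }


# Synthetic example (made-up values): one close pair, two distant pairs
demo_segments = pd.DataFrame(
    [
        ("a", "b", 1500.0),
        ("a", "b", 900.0),
        ("a", "c", 6.0),
        ("a", "c", 8.0),
        ("b", "c", 10.0),
    ],
    columns=["sample1", "sample2", "gen_seg_len"],
)
background_summary = summarize_background_sharing(demo_segments)
print(background_summary)
```

A median background total that is clearly higher in one cohort than in another suggests that relationship thresholds tuned on outbred data may over-call relatives in the first cohort.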

## 2. Advanced Filtering Techniques for IBD Segments

### 2.1 Length-Based Filtering Strategies

The most common and straightforward filtering technique is to remove segments below a certain genetic length threshold. Let's implement this and other more sophisticated length-based filters:

In [None]:
# Let's implement various filtering techniques for IBD segments
# First, create a simple class to handle IBD filtering operations

class IBDSegmentFilter:
    """A class for filtering IBD segments based on various criteria."""
    
    def __init__(self, segments_df):
        """Initialize with a DataFrame of IBD segments."""
        self.original_segments = segments_df.copy()
        self.filtered_segments = segments_df.copy()
        self.filter_history = []
    
    def reset(self):
        """Reset to original segments."""
        self.filtered_segments = self.original_segments.copy()
        self.filter_history = []
        return self
    
    def filter_by_length(self, min_length=7, max_length=None):
        """Filter segments by genetic length."""
        n_before = len(self.filtered_segments)
        
        # Apply minimum length filter
        self.filtered_segments = self.filtered_segments[self.filtered_segments['gen_seg_len'] >= min_length]
        
        # Apply maximum length filter if specified
        if max_length is not None:
            self.filtered_segments = self.filtered_segments[self.filtered_segments['gen_seg_len'] <= max_length]
        
        n_after = len(self.filtered_segments)
        self.filter_history.append({
            'filter': 'length',
            'params': {'min_length': min_length, 'max_length': max_length},
            'removed': n_before - n_after
        })
        
        return self
    
    def get_filtered_segments(self):
        """Return the filtered segments DataFrame."""
        return self.filtered_segments
    
    def get_filter_summary(self):
        """Return a summary of applied filters."""
        orig_count = len(self.original_segments)
        final_count = len(self.filtered_segments)
        
        summary = {
            'original_count': orig_count,
            'filtered_count': final_count,
            'total_removed': orig_count - final_count,
            'percent_removed': (orig_count - final_count) / orig_count * 100,
            'filter_history': self.filter_history
        }
        
        return summary

# Apply length-based filtering with different thresholds
filter_thresholds = [3, 5, 7, 10, 15]
results = {}

for threshold in filter_thresholds:
    filter_obj = IBDSegmentFilter(seg_df.copy())
    filtered_df = filter_obj.filter_by_length(min_length=threshold).get_filtered_segments()
    results[threshold] = {
        'filtered_count': len(filtered_df),
        'remaining_percent': len(filtered_df) / len(seg_df) * 100
    }

# Visualize the effect of different length thresholds
thresholds = list(results.keys())
remaining = [results[t]['remaining_percent'] for t in thresholds]

plt.figure(figsize=(10, 6))
plt.plot(thresholds, remaining, 'o-', linewidth=2)
plt.xlabel('Minimum Segment Length Threshold (cM)')
plt.ylabel('Remaining Segments (%)')
plt.title('Effect of Length Filtering on Segment Retention')
plt.grid(True, alpha=0.3)
plt.xticks(thresholds)
plt.ylim(0, 100)

# Add text annotations
for i, threshold in enumerate(thresholds):
    plt.annotate(f"{remaining[i]:.1f}%", 
                 (threshold, remaining[i]),
                 textcoords="offset points",
                 xytext=(0,10), 
                 ha='center')

plt.show()

# Show detailed results
print("Effect of different length thresholds:")
print("-" * 50)
for threshold in filter_thresholds:
    print(f"Threshold: {threshold} cM")
    print(f"  Remaining segments: {results[threshold]['filtered_count']} ({results[threshold]['remaining_percent']:.1f}%)")
    print(f"  Removed segments: {len(seg_df) - results[threshold]['filtered_count']} ({100 - results[threshold]['remaining_percent']:.1f}%)")
    print()

# Let's see what the filtered data looks like with a common threshold of 7 cM
filter_obj = IBDSegmentFilter(seg_df.copy())
filtered_7cm = filter_obj.filter_by_length(min_length=7).get_filtered_segments()

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(seg_df['gen_seg_len'], bins=30, alpha=0.7)
plt.axvline(x=7, color='red', linestyle='--', label='7 cM threshold')
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Original Segments')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.subplot(1, 2, 2)
plt.hist(filtered_7cm['gen_seg_len'], bins=30, alpha=0.7)
plt.axvline(x=7, color='red', linestyle='--', label='7 cM threshold')
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Filtered Segments (≥ 7 cM)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.tight_layout()
plt.show()
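
Segment counts alone hide an important side effect of length filtering: some pairs lose all of their segments and disappear from the dataset. The following sketch tabulates segments, pairs, and total cM retained at each threshold. It assumes the `sample1`, `sample2`, and `gen_seg_len` columns used throughout this lab; the helper name `pair_retention_by_threshold` is hypothetical.

```python
import pandas as pd


def pair_retention_by_threshold(df, thresholds):
    """Tabulate how many segments and pairs survive each length threshold.

    Hypothetical helper (not part of Bonsai).
    """
    pair_cols = ["sample1", "sample2"]
    n_pairs_all = df[pair_cols].drop_duplicates().shape[0]
    rows = []
    for t in thresholds:
        kept = df[df["gen_seg_len"] >= t]
        pairs_kept = kept[pair_cols].drop_duplicates().shape[0]
        rows.append({
            "threshold_cm": t,
            "segments_kept": len(kept),
            "pairs_kept": pairs_kept,
            # Pairs whose every segment fell below the threshold
            "pairs_lost": n_pairs_all - pairs_kept,
            "total_cm_kept": float(kept["gen_seg_len"].sum()),
        })
    return pd.DataFrame(rows)


# Synthetic example (made-up values)
demo_segments = pd.DataFrame(
    [
        ("a", "b", 3.5),
        ("a", "b", 12.0),
        ("a", "c", 4.0),
        ("a", "c", 6.0),
        ("b", "c", 25.0),
    ],
    columns=["sample1", "sample2", "gen_seg_len"],
)
retention = pair_retention_by_threshold(demo_segments, [3, 7, 10])
print(retention)
```

In this toy table the pair `a`/`c` vanishes at 7 cM even though it shares 10 cM in total, which is the kind of loss a segment-level retention plot does not show.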

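### 2.2 Confidence-Based Filtering

Some IBD detection tools report a per-segment confidence or quality score, which can be filtered on directly. Our segment file has no such column, so the class below simulates one from segment length purely for illustration:

In [None]: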
# Extend our IBD filter class with confidence-based filtering
class ConfidenceFilter(IBDSegmentFilter):
    """Extended IBD filter with confidence-based filtering."""
    
    def __init__(self, segments_df):
        super().__init__(segments_df)
        
        # Add a simulated confidence score if it doesn't exist
        if 'confidence' not in self.filtered_segments.columns:
            self._add_simulated_confidence()
    
    def _add_simulated_confidence(self):
        """Add simulated confidence scores based on segment characteristics.
        
        In a real-world scenario, these scores would come from the IBD detection algorithm.
        Here we simulate them based on segment length and other features.
        """
        # Base confidence on segment length (longer segments are more reliable)
        # Use a sigmoid function scaled between 0.5 and 1.0
        length_conf = 0.5 + 0.5 / (1 + np.exp(-(self.filtered_segments['gen_seg_len'] - 7)))
        
        # Add some noise (fixed seed so every filter instance sees the same scores)
        rng = np.random.default_rng(42)
        noise = rng.normal(0, 0.05, size=len(self.filtered_segments))
        
        # Combine and clip to [0, 1]
        confidence = np.clip(length_conf + noise, 0, 1)
        
        self.filtered_segments['confidence'] = confidence
        self.original_segments['confidence'] = confidence.copy()
    
    def filter_by_confidence(self, min_confidence=0.8):
        """Filter segments by their confidence score."""
        n_before = len(self.filtered_segments)
        
        self.filtered_segments = self.filtered_segments[self.filtered_segments['confidence'] >= min_confidence]
        
        n_after = len(self.filtered_segments)
        self.filter_history.append({
            'filter': 'confidence',
            'params': {'min_confidence': min_confidence},
            'removed': n_before - n_after
        })
        
        return self
    
    def filter_by_percentile(self, keep_percentile=80):
        """Keep approximately the top `keep_percentile` percent of segments by confidence."""
        n_before = len(self.filtered_segments)
        
        threshold = np.percentile(self.filtered_segments['confidence'], 100 - keep_percentile)
        self.filtered_segments = self.filtered_segments[self.filtered_segments['confidence'] >= threshold]
        
        n_after = len(self.filtered_segments)
        self.filter_history.append({
            'filter': 'percentile',
            'params': {'keep_percentile': keep_percentile, 'threshold': threshold},
            'removed': n_before - n_after
        })
        
        return self

# Apply confidence-based filtering
confidence_filter = ConfidenceFilter(seg_df.copy())

# Show the distribution of confidence scores
plt.figure(figsize=(10, 6))
plt.hist(confidence_filter.filtered_segments['confidence'], bins=30, alpha=0.7)
plt.axvline(x=0.8, color='red', linestyle='--', label='0.8 threshold')
plt.axvline(x=0.9, color='green', linestyle='--', label='0.9 threshold')
plt.xlabel('Confidence Score')
plt.ylabel('Count')
plt.title('Distribution of Segment Confidence Scores')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Try different confidence thresholds
confidence_thresholds = [0.6, 0.7, 0.8, 0.9]
conf_results = {}

for threshold in confidence_thresholds:
    cf = ConfidenceFilter(seg_df.copy())
    filtered = cf.filter_by_confidence(min_confidence=threshold).get_filtered_segments()
    conf_results[threshold] = {
        'filtered_count': len(filtered),
        'remaining_percent': len(filtered) / len(seg_df) * 100
    }

# Plot results
thresholds = list(conf_results.keys())
remaining = [conf_results[t]['remaining_percent'] for t in thresholds]

plt.figure(figsize=(10, 6))
plt.plot(thresholds, remaining, 'o-', linewidth=2)
plt.xlabel('Minimum Confidence Threshold')
plt.ylabel('Remaining Segments (%)')
plt.title('Effect of Confidence Filtering on Segment Retention')
plt.grid(True, alpha=0.3)
plt.xticks(thresholds)
plt.ylim(0, 100)

# Add text annotations
for i, threshold in enumerate(thresholds):
    plt.annotate(f"{remaining[i]:.1f}%", 
                 (threshold, remaining[i]),
                 textcoords="offset points",
                 xytext=(0,10), 
                 ha='center')

plt.show()

# Compare length-based vs. confidence-based filtering
print("Comparison of filtering methods:")
print("-" * 50)
lf = IBDSegmentFilter(seg_df.copy())
cf = ConfidenceFilter(seg_df.copy())

length_filtered = lf.filter_by_length(min_length=7).get_filtered_segments()
confidence_filtered = cf.filter_by_confidence(min_confidence=0.8).get_filtered_segments()

print(f"Original segments: {len(seg_df)}")
print(f"After length filtering (≥ 7 cM): {len(length_filtered)} ({len(length_filtered)/len(seg_df)*100:.1f}%)")
print(f"After confidence filtering (≥ 0.8): {len(confidence_filtered)} ({len(confidence_filtered)/len(seg_df)*100:.1f}%)")

# Visualize the relationship between segment length and confidence
plt.figure(figsize=(10, 6))
plt.scatter(seg_df['gen_seg_len'], cf.original_segments['confidence'], alpha=0.3)
plt.axhline(y=0.8, color='red', linestyle='--', label='0.8 confidence threshold')
plt.axvline(x=7, color='green', linestyle='--', label='7 cM threshold')
plt.xlabel('Segment Length (cM)')
plt.ylabel('Confidence Score')
plt.title('Relationship Between Segment Length and Confidence')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)
plt.ylim(0, 1)
plt.show()
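
Retention percentages do not show whether two filters remove the same segments. A cross-tabulation of the two keep/remove decisions makes the overlap explicit. This is a minimal sketch assuming a table with `gen_seg_len` and `confidence` columns; the helper name `compare_filters` is hypothetical, and the thresholds mirror the 7 cM and 0.8 values used above.

```python
import pandas as pd


def compare_filters(df, min_length=7.0, min_confidence=0.8):
    """Cross-tabulate length-filter and confidence-filter decisions.

    Hypothetical helper (not part of Bonsai).
    """
    labels = {True: "kept", False: "removed"}
    by_length = (df["gen_seg_len"] >= min_length).map(labels)
    by_confidence = (df["confidence"] >= min_confidence).map(labels)
    return pd.crosstab(
        by_length,
        by_confidence,
        rownames=["length filter"],
        colnames=["confidence filter"],
    )


# Synthetic example (made-up values)
demo_segments = pd.DataFrame(
    {
        "gen_seg_len": [5.0, 8.0, 12.0, 3.0, 20.0],
        "confidence": [0.90, 0.95, 0.70, 0.50, 0.99],
    }
)
agreement = compare_filters(demo_segments)
print(agreement)
```

The off-diagonal cells are the interesting ones: short segments with high confidence, and long segments with low confidence, are where the two strategies disagree.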

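### 2.3 Consistency-Based Filtering

Consistency-based filters evaluate each pair of individuals as a whole rather than segment by segment. The class below drops pairs with too few segments, too little total sharing, or sharing on too few distinct chromosomes:

In [None]: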
# Extend our filter with consistency-based methods
class ConsistencyFilter(IBDSegmentFilter):
    """Filter IBD segments based on consistency metrics."""
    
    def __init__(self, segments_df):
        super().__init__(segments_df)
        # Precompute useful metrics
        self._compute_pair_metrics()
    
    def _compute_pair_metrics(self):
        """Compute metrics for each pair of individuals."""
        # Number of segments per pair
        self.pair_counts = self.filtered_segments.groupby(['sample1', 'sample2']).size().reset_index(name='seg_count')
        
        # Total shared cM per pair
        self.pair_totals = self.filtered_segments.groupby(['sample1', 'sample2'])['gen_seg_len'].sum().reset_index(name='total_cm')
        
        # Merge these metrics
        self.pair_metrics = pd.merge(self.pair_counts, self.pair_totals, on=['sample1', 'sample2'])
    
    def filter_by_segment_count(self, min_segments=2):
        """Filter pairs with too few segments."""
        n_before = len(self.filtered_segments)
        
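        # Refresh pair metrics in case another filter (e.g. filter_by_length) ran first
        self._compute_pair_metrics()
        
        # Get valid pairs
        valid_pairs = self.pair_counts[self.pair_counts['seg_count'] >= min_segments]
        
        # Filter segments (a set gives constant-time membership tests)
        valid_pairs_tuples = set(zip(valid_pairs['sample1'], valid_pairs['sample2']))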
        pair_tuples = list(zip(self.filtered_segments['sample1'], self.filtered_segments['sample2']))
        self.filtered_segments = self.filtered_segments[
            [p in valid_pairs_tuples for p in pair_tuples]
        ]
        
        n_after = len(self.filtered_segments)
        self.filter_history.append({
            'filter': 'segment_count',
            'params': {'min_segments': min_segments},
            'removed': n_before - n_after
        })
        
        # Recompute metrics
        self._compute_pair_metrics()
        
        return self
    
    def filter_by_total_sharing(self, min_total_cm=20):
        """Filter pairs with too little total sharing."""
        n_before = len(self.filtered_segments)
        
        # Get valid pairs
        valid_pairs = self.pair_totals[self.pair_totals['total_cm'] >= min_total_cm]
        
        # Filter segments
        valid_pairs_tuples = set(zip(valid_pairs['sample1'], valid_pairs['sample2']))  # set: O(1) membership tests
        pair_tuples = list(zip(self.filtered_segments['sample1'], self.filtered_segments['sample2']))
        self.filtered_segments = self.filtered_segments[
            [p in valid_pairs_tuples for p in pair_tuples]
        ]
        
        n_after = len(self.filtered_segments)
        self.filter_history.append({
            'filter': 'total_sharing',
            'params': {'min_total_cm': min_total_cm},
            'removed': n_before - n_after
        })
        
        # Recompute metrics
        self._compute_pair_metrics()
        
        return self
    
    def filter_by_cross_chromosome_consistency(self, min_chromosomes=2):
        """Keep only pairs that share segments on multiple chromosomes."""
        n_before = len(self.filtered_segments)
        
        # Count unique chromosomes per pair
        chrom_counts = self.filtered_segments.groupby(['sample1', 'sample2'])['chrom'].nunique().reset_index(name='chrom_count')
        
        # Get valid pairs
        valid_pairs = chrom_counts[chrom_counts['chrom_count'] >= min_chromosomes]
        
        # Filter segments
        valid_pairs_tuples = set(zip(valid_pairs['sample1'], valid_pairs['sample2']))  # set: O(1) membership tests
        pair_tuples = list(zip(self.filtered_segments['sample1'], self.filtered_segments['sample2']))
        self.filtered_segments = self.filtered_segments[
            [p in valid_pairs_tuples for p in pair_tuples]
        ]
        
        n_after = len(self.filtered_segments)
        self.filter_history.append({
            'filter': 'cross_chromosome',
            'params': {'min_chromosomes': min_chromosomes},
            'removed': n_before - n_after
        })
        
        # Recompute metrics
        self._compute_pair_metrics()
        
        return self

# Apply consistency-based filtering
consistency_filter = ConsistencyFilter(seg_df.copy())

# Explore initial pair metrics
print("Initial pair statistics:")
print(f"Total pairs: {len(consistency_filter.pair_metrics)}")
print(f"Average segments per pair: {consistency_filter.pair_metrics['seg_count'].mean():.2f}")
print(f"Average total cM per pair: {consistency_filter.pair_metrics['total_cm'].mean():.2f}")
print(f"Pairs with only 1 segment: {(consistency_filter.pair_metrics['seg_count'] == 1).sum()} ({(consistency_filter.pair_metrics['seg_count'] == 1).sum() / len(consistency_filter.pair_metrics) * 100:.1f}%)")

# Visualize segment count distribution
plt.figure(figsize=(10, 6))
plt.hist(consistency_filter.pair_metrics['seg_count'], bins=range(1, 20), alpha=0.7)
plt.axvline(x=2, color='red', linestyle='--', label='min 2 segments')
plt.axvline(x=3, color='green', linestyle='--', label='min 3 segments')
plt.xlabel('Number of Segments per Pair')
plt.ylabel('Count of Pairs')
plt.title('Distribution of Segment Counts per Pair')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(range(1, 20))
plt.show()

# Apply segment count filtering and see the effect
filtered_by_count = consistency_filter.filter_by_segment_count(min_segments=2).get_filtered_segments()
print("\nAfter filtering pairs with fewer than 2 segments:")
print(f"Remaining segments: {len(filtered_by_count)} ({len(filtered_by_count) / len(seg_df) * 100:.1f}%)")
print(f"Remaining pairs: {len(consistency_filter.pair_metrics)}")

# Start from a fresh filter and count the unique chromosomes shared by each pair
consistency_filter = ConsistencyFilter(seg_df.copy())
chrom_counts = consistency_filter.filtered_segments.groupby(['sample1', 'sample2'])['chrom'].nunique().reset_index(name='chrom_count')

# Visualize chromosome count distribution
plt.figure(figsize=(10, 6))
plt.hist(chrom_counts['chrom_count'], bins=range(1, 15), alpha=0.7)
plt.axvline(x=2, color='red', linestyle='--', label='min 2 chromosomes')
plt.xlabel('Number of Unique Chromosomes per Pair')
plt.ylabel('Count of Pairs')
plt.title('Distribution of Chromosome Counts per Pair')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(range(1, 15))
plt.show()

# Apply chromosome consistency filtering
filtered_by_chrom = consistency_filter.filter_by_cross_chromosome_consistency(min_chromosomes=2).get_filtered_segments()
print("\nAfter filtering pairs with segments on fewer than 2 chromosomes:")
print(f"Remaining segments: {len(filtered_by_chrom)} ({len(filtered_by_chrom) / len(seg_df) * 100:.1f}%)")
print(f"Remaining pairs: {len(consistency_filter.pair_metrics)}")

# Compare the effects of different consistency filters
print("\nComparison of consistency filtering methods:")
print("-" * 50)
cf1 = ConsistencyFilter(seg_df.copy())
cf2 = ConsistencyFilter(seg_df.copy())
cf3 = ConsistencyFilter(seg_df.copy())

filtered1 = cf1.filter_by_segment_count(min_segments=2).get_filtered_segments()
filtered2 = cf2.filter_by_total_sharing(min_total_cm=20).get_filtered_segments()
filtered3 = cf3.filter_by_cross_chromosome_consistency(min_chromosomes=2).get_filtered_segments()

print(f"Original segments: {len(seg_df)}")
print(f"After segment count filtering (≥ 2): {len(filtered1)} ({len(filtered1)/len(seg_df)*100:.1f}%)")
print(f"After total sharing filtering (≥ 20 cM): {len(filtered2)} ({len(filtered2)/len(seg_df)*100:.1f}%)")
print(f"After chromosome consistency filtering (≥ 2 chroms): {len(filtered3)} ({len(filtered3)/len(seg_df)*100:.1f}%)")
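
The three pair-level filters above share one pattern: compute a statistic per pair, then keep the segments of pairs that pass a threshold. Below is a minimal sketch of that pattern using `groupby(...).transform`, which broadcasts each pair statistic back onto the rows and avoids building tuple lists. `filter_pairs` and the toy DataFrame are hypothetical names used only for illustration, and the column names are assumed to match `seg_df`.

```python
import pandas as pd

def filter_pairs(segments, min_segments=1, min_total_cm=0.0, min_chromosomes=1):
    """Keep segments whose (sample1, sample2) pair passes every threshold.

    Hypothetical helper, not part of Bonsai; assumes the columns
    sample1, sample2, chrom and gen_seg_len.
    """
    grouped = segments.groupby(['sample1', 'sample2'])
    # transform() returns one value per row, aligned with the original index
    seg_count = grouped['gen_seg_len'].transform('count')
    total_cm = grouped['gen_seg_len'].transform('sum')
    chrom_count = grouped['chrom'].transform('nunique')
    keep = (
        (seg_count >= min_segments)
        & (total_cm >= min_total_cm)
        & (chrom_count >= min_chromosomes)
    )
    return segments[keep]

# Toy data: A-B shares two segments on two chromosomes, the other pairs one each
toy = pd.DataFrame({
    'sample1': ['A', 'A', 'A', 'B'],
    'sample2': ['B', 'B', 'C', 'C'],
    'chrom': [1, 2, 1, 5],
    'gen_seg_len': [12.0, 15.0, 8.0, 30.0],
})
kept = filter_pairs(toy, min_segments=2, min_chromosomes=2)
print(kept)
```

Because all thresholds are evaluated against the same starting table, the result can differ from chaining the class methods, where each filter sees the output of the previous one.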

### 2.4 Population-Specific Filtering Strategies

In [ ]:
# Implement population-specific filtering
class PopulationAwareFilter(IBDSegmentFilter):
    """Filter IBD segments with awareness of population structure."""
    
    def __init__(self, segments_df, population_type='outbred'):
        """Initialize the filter with population type."""
        super().__init__(segments_df)
        self.population_type = population_type
    
    def apply_population_specific_filters(self):
        """Apply filters appropriate for the specified population."""
        n_before = len(self.filtered_segments)
        
        if self.population_type == 'outbred':
            # Standard thresholds for outbred populations
            self.filter_by_length(min_length=7)  # Standard threshold
            
        elif self.population_type == 'endogamous':
            # More stringent thresholds for endogamous populations
            # Endogamous populations have elevated background IBD
            self.filter_by_length(min_length=10)  # Higher threshold to reduce false positives
            
            # Add additional constraints - implement a simple consistency filter
            # Count segments per pair
            pair_counts = self.filtered_segments.groupby(['sample1', 'sample2']).size().reset_index(name='count')
            valid_pairs = pair_counts[pair_counts['count'] >= 2]  # Require at least 2 segments
            
            # Filter segments
            valid_pairs_tuples = set(zip(valid_pairs['sample1'], valid_pairs['sample2']))  # set: O(1) membership tests
            pair_tuples = list(zip(self.filtered_segments['sample1'], self.filtered_segments['sample2']))
            self.filtered_segments = self.filtered_segments[
                [p in valid_pairs_tuples for p in pair_tuples]
            ]
            
        elif self.population_type == 'bottleneck':
            # Intermediate thresholds for bottleneck populations
            self.filter_by_length(min_length=8)  # Slightly higher threshold
            
        elif self.population_type == 'admixed':
            # Different strategy for admixed populations
            self.filter_by_length(min_length=6)  # Slightly lower threshold
            
            # Apply an outlier filter instead
            mean_length = self.filtered_segments['gen_seg_len'].mean()
            std_length = self.filtered_segments['gen_seg_len'].std()
            self.filtered_segments = self.filtered_segments[
                abs(self.filtered_segments['gen_seg_len'] - mean_length) <= 3 * std_length
            ]
            
        else:
            raise ValueError(f"Unknown population type: {self.population_type}")
        
        n_after = len(self.filtered_segments)
        self.filter_history.append({
            'filter': f'population_{self.population_type}',
            'params': {},
            'removed': n_before - n_after
        })
        
        return self

# Compare filtering results for different population types
# First, recreate our simulated population data
np.random.seed(42)  # For reproducibility
outbred_df = simulate_population_effects(seg_df, 'outbred', num_individuals=30)
endogamous_df = simulate_population_effects(seg_df, 'endogamous', num_individuals=30)
bottleneck_df = simulate_population_effects(seg_df, 'bottleneck', num_individuals=30)
admixed_df = simulate_population_effects(seg_df, 'admixed', num_individuals=30)

# Apply population-aware filtering
outbred_filter = PopulationAwareFilter(outbred_df, 'outbred')
endogamous_filter = PopulationAwareFilter(endogamous_df, 'endogamous')
bottleneck_filter = PopulationAwareFilter(bottleneck_df, 'bottleneck')
admixed_filter = PopulationAwareFilter(admixed_df, 'admixed')

# Apply filters
outbred_filtered = outbred_filter.apply_population_specific_filters().get_filtered_segments()
endogamous_filtered = endogamous_filter.apply_population_specific_filters().get_filtered_segments()
bottleneck_filtered = bottleneck_filter.apply_population_specific_filters().get_filtered_segments()
admixed_filtered = admixed_filter.apply_population_specific_filters().get_filtered_segments()

# Compare results
print("Population-specific filtering results:")
print("-" * 50)
print("Outbred population:")
print(f"  Original segments: {len(outbred_df)}")
print(f"  After filtering: {len(outbred_filtered)} ({len(outbred_filtered)/len(outbred_df)*100:.1f}%)")
print(f"  Filter summary: {outbred_filter.get_filter_summary()['filter_history']}")
print()

print("Endogamous population:")
print(f"  Original segments: {len(endogamous_df)}")
print(f"  After filtering: {len(endogamous_filtered)} ({len(endogamous_filtered)/len(endogamous_df)*100:.1f}%)")
print(f"  Filter summary: {endogamous_filter.get_filter_summary()['filter_history']}")
print()

print("Bottleneck population:")
print(f"  Original segments: {len(bottleneck_df)}")
print(f"  After filtering: {len(bottleneck_filtered)} ({len(bottleneck_filtered)/len(bottleneck_df)*100:.1f}%)")
print(f"  Filter summary: {bottleneck_filter.get_filter_summary()['filter_history']}")
print()

print("Admixed population:")
print(f"  Original segments: {len(admixed_df)}")
print(f"  After filtering: {len(admixed_filtered)} ({len(admixed_filtered)/len(admixed_df)*100:.1f}%)")
print(f"  Filter summary: {admixed_filter.get_filter_summary()['filter_history']}")
print()

# Visualize the differences in segment length distributions after filtering
plt.figure(figsize=(20, 10))

plt.subplot(2, 4, 1)
plt.hist(outbred_df['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Outbred - Before Filtering')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.subplot(2, 4, 2)
plt.hist(endogamous_df['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Endogamous - Before Filtering')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.subplot(2, 4, 3)
plt.hist(bottleneck_df['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Bottleneck - Before Filtering')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.subplot(2, 4, 4)
plt.hist(admixed_df['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Admixed - Before Filtering')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.subplot(2, 4, 5)
plt.hist(outbred_filtered['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Outbred - After Filtering')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.subplot(2, 4, 6)
plt.hist(endogamous_filtered['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Endogamous - After Filtering')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.subplot(2, 4, 7)
plt.hist(bottleneck_filtered['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Bottleneck - After Filtering')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.subplot(2, 4, 8)
plt.hist(admixed_filtered['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Admixed - After Filtering')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.tight_layout()
plt.show()

# Function to compare pair-wise sharing distributions before and after filtering
def plot_pairwise_sharing_comparison(before_dfs, after_dfs, names):
    plt.figure(figsize=(15, 6))
    
    plt.subplot(1, 2, 1)
    for i, (df, name) in enumerate(zip(before_dfs, names)):
        # Calculate total IBD sharing for each pair
        pair_sharing = df.groupby(['sample1', 'sample2'])['gen_seg_len'].sum().reset_index()
        
        # Plot as CDF for comparison
        x = np.sort(pair_sharing['gen_seg_len'])
        y = np.arange(1, len(x)+1) / len(x)
        plt.plot(x, y, label=name)
    
    plt.xscale('log')
    plt.xlabel('Total IBD Sharing (cM)')
    plt.ylabel('Cumulative Probability')
    plt.title('Before Filtering')
    plt.grid(True, alpha=0.3)
    plt.legend()
    
    plt.subplot(1, 2, 2)
    for i, (df, name) in enumerate(zip(after_dfs, names)):
        # Calculate total IBD sharing for each pair
        pair_sharing = df.groupby(['sample1', 'sample2'])['gen_seg_len'].sum().reset_index()
        
        # Plot as CDF for comparison
        if len(pair_sharing) > 0:  # Check if there are any pairs left
            x = np.sort(pair_sharing['gen_seg_len'])
            y = np.arange(1, len(x)+1) / len(x)
            plt.plot(x, y, label=name)
    
    plt.xscale('log')
    plt.xlabel('Total IBD Sharing (cM)')
    plt.ylabel('Cumulative Probability')
    plt.title('After Filtering')
    plt.grid(True, alpha=0.3)
    plt.legend()
    
    plt.tight_layout()
    plt.show()

# Compare pairwise sharing distributions before and after filtering
plot_pairwise_sharing_comparison(
    [outbred_df, endogamous_df, bottleneck_df, admixed_df],
    [outbred_filtered, endogamous_filtered, bottleneck_filtered, admixed_filtered],
    ['Outbred', 'Endogamous', 'Bottleneck', 'Admixed']
)
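
The length thresholds in `PopulationAwareFilter` can also be kept in a lookup table, which makes them easier to review and lets the filter history record the value that was applied. The sketch below is one possible arrangement, not the lab's implementation: `MIN_LENGTH_CM`, `length_threshold` and `apply_length_threshold` are hypothetical names, the threshold values are copied from the branches above, and the comparison is assumed to be inclusive (`>=`).

```python
import pandas as pd

# Minimum segment length (cM) per population type, copied from the class above
MIN_LENGTH_CM = {'outbred': 7, 'endogamous': 10, 'bottleneck': 8, 'admixed': 6}

def length_threshold(population_type):
    """Look up the length threshold, failing loudly on unknown types."""
    try:
        return MIN_LENGTH_CM[population_type]
    except KeyError:
        raise ValueError(f"Unknown population type: {population_type}") from None

def apply_length_threshold(segments, population_type):
    """Return (filtered segments, history entry) for one population type."""
    min_length = length_threshold(population_type)
    # Assumption: segments exactly at the threshold are kept
    filtered = segments[segments['gen_seg_len'] >= min_length]
    history = {
        'filter': f'population_{population_type}',
        'params': {'min_length': min_length},
        'removed': len(segments) - len(filtered),
    }
    return filtered, history

toy_lengths = pd.DataFrame({'gen_seg_len': [5.0, 6.5, 7.5, 9.0, 12.0]})
removed_by_population = {}
for pop in MIN_LENGTH_CM:
    _, entry = apply_length_threshold(toy_lengths, pop)
    removed_by_population[pop] = entry['removed']
    print(pop, entry['params'], 'removed:', entry['removed'])
```

The extra endogamous and admixed steps (segment count and outlier filtering) are left out here to keep the sketch short.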

## 3. Handling Missing Data in Pedigree Reconstruction

### 3.1 Types of Missing Data

In genetic genealogy, missing data can take various forms:

- **Missing individuals**: Key ancestors or relatives who connect branches of a family may not have genetic data available
- **Missing segments**: IBD segments that go unobserved because of sparse testing coverage, IBD detection errors, or filtering
- **Missing chromosomes**: Some testing platforms may not cover certain chromosomes (e.g., Y or X)
- **Missing metadata**: Important context like birth years, locations, or relationships

Let's explore how missing data impacts pedigree reconstruction and strategies to address these issues.
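
Two of these categories can be checked directly against a segment table. The sketch below is a rough audit under stated assumptions: `audit_missing_data` is a hypothetical helper, and `roster` is an assumed list of everyone expected in the study, which the lab data may not provide. An individual who appears in no segment is not necessarily missing; they may simply share no detectable IBD with anyone else.

```python
import pandas as pd

def audit_missing_data(segments, roster, chromosomes=range(1, 23)):
    """Summarize two kinds of gaps in an IBD segment table.

    Hypothetical helper; assumes the columns sample1, sample2 and chrom.
    """
    observed = set(segments['sample1']) | set(segments['sample2'])
    # Roster members who never appear in any segment
    missing_individuals = sorted(set(roster) - observed)
    # Chromosomes on which no pair shares a segment
    missing_chromosomes = sorted(set(chromosomes) - set(segments['chrom']))
    return {
        'missing_individuals': missing_individuals,
        'missing_chromosomes': missing_chromosomes,
    }

toy_segments = pd.DataFrame({
    'sample1': ['A', 'A'],
    'sample2': ['B', 'C'],
    'chrom': [1, 2],
})
report = audit_missing_data(toy_segments, roster=['A', 'B', 'C', 'D'],
                            chromosomes=range(1, 4))
print(report)
```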

In [ ]:
# Create a simulated pedigree network with missing data
def create_simulated_pedigree(num_individuals=30, missing_rate=0.2):
    """
    Create a simulated pedigree with some missing individuals.
    
    Args:
        num_individuals: Number of individuals in the complete pedigree
        missing_rate: Fraction of individuals to mark as missing
        
    Returns:
        G: NetworkX graph of the pedigree
        missing_ids: IDs of individuals marked as missing
    """
    # Create a directed graph
    G = nx.DiGraph()
    
    # Add individuals
    for i in range(num_individuals):
        birth_year = np.random.randint(1900, 2000)
        G.add_node(i, id=i, birth_year=birth_year, has_genetic_data=True)
    
    # Add family relationships - simplified model
    # Each person has up to 2 parents and some number of children
    # Start from the oldest generation
    individuals = list(range(num_individuals))
    individuals.sort(key=lambda x: G.nodes[x]['birth_year'])
    
    for i, person_id in enumerate(individuals):
        birth_year = G.nodes[person_id]['birth_year']
        
        # Add parents for everyone except the oldest generation
        if i >= num_individuals // 3:  # Skip oldest ~1/3 of individuals
            # Find potential parents (at least 20 years older)
            potential_parents = [
                p for p in individuals[:i] 
                if G.nodes[p]['birth_year'] <= birth_year - 20
            ]
            
            # Add one or two parents if available
            if len(potential_parents) >= 2:
                # Add father and mother
                father = np.random.choice(potential_parents)
                potential_parents.remove(father)
                mother = np.random.choice(potential_parents)
                
                G.add_edge(father, person_id, relationship='parent-child')
                G.add_edge(mother, person_id, relationship='parent-child')
            elif len(potential_parents) == 1:
                # Add one parent
                parent = potential_parents[0]
                G.add_edge(parent, person_id, relationship='parent-child')
    
    # Mark some individuals as missing (no genetic data)
    num_missing = int(num_individuals * missing_rate)
    missing_ids = np.random.choice(individuals, size=num_missing, replace=False)
    
    for person_id in missing_ids:
        G.nodes[person_id]['has_genetic_data'] = False
    
    return G, missing_ids

# Create a pedigree with some missing individuals
np.random.seed(42)
pedigree, missing_ids = create_simulated_pedigree(num_individuals=30, missing_rate=0.2)

# Analyze the impact of missing individuals
print(f"Total individuals in pedigree: {len(pedigree.nodes())}")
print(f"Missing individuals: {len(missing_ids)} ({len(missing_ids)/len(pedigree.nodes())*100:.1f}%)")

# Check how many connections are affected by missing individuals
total_edges = len(pedigree.edges())
affected_edges = 0

for u, v in pedigree.edges():
    if u in missing_ids or v in missing_ids:
        affected_edges += 1

print(f"Total relationships: {total_edges}")
print(f"Relationships affected by missing individuals: {affected_edges} ({affected_edges/total_edges*100:.1f}%)")

# Visualize the pedigree with missing individuals highlighted
plt.figure(figsize=(12, 10))

# Start from a spring layout; the y-coordinates are replaced by birth year below (older individuals at the top)
pos = nx.spring_layout(pedigree, seed=42)

# Adjust y-coordinate based on birth year
for node in pedigree.nodes():
    birth_year = pedigree.nodes[node]['birth_year']
    # Normalize to 0-1 range
    normalized_year = (birth_year - 1900) / 100
    # Invert so older people are at the top
    pos[node][1] = 1 - normalized_year

# Draw nodes
node_colors = ['red' if node in missing_ids else 'lightblue' for node in pedigree.nodes()]
nx.draw_networkx_nodes(pedigree, pos, node_color=node_colors, node_size=500, alpha=0.8)

# Draw edges
edge_colors = ['red' if u in missing_ids or v in missing_ids else 'black' for u, v in pedigree.edges()]
nx.draw_networkx_edges(pedigree, pos, edge_color=edge_colors, width=1.5, alpha=0.7)

# Add labels
labels = {node: f"{node}\n({pedigree.nodes[node]['birth_year']})" for node in pedigree.nodes()}
nx.draw_networkx_labels(pedigree, pos, labels=labels, font_size=10)

plt.title('Simulated Pedigree with Missing Individuals (in red)')
plt.axis('off')
plt.tight_layout()
plt.show()

# Function to simulate the impact of missing data on IBD segment detection
def simulate_missing_data_impact(pedigree, missing_ids, num_segments_per_relationship=10):
    """
    Simulate IBD segments for all genetic relationships and measure the impact of missing individuals.
    
    Args:
        pedigree: NetworkX graph of the pedigree
        missing_ids: IDs of individuals marked as missing
        num_segments_per_relationship: Number of segments to simulate per relationship
        
    Returns:
        segments_df: DataFrame of simulated IBD segments
        missing_segments_df: DataFrame showing segments lost due to missing individuals
    """
    segments = []
    missing_segments = []
    
    # Get all pairs with genetic relationships (up to 3 degrees of separation)
    for start_node in pedigree.nodes():
        if start_node in missing_ids:
            continue  # Skip missing individuals
            
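        # Find all related individuals within 3 steps. nx.all_simple_paths needs an
        # explicit target, and a single cutoff=3 call avoids re-counting shorter paths.
        for target in pedigree.nodes():
            for path in nx.all_simple_paths(pedigree, start_node, target, cutoff=3):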
                if len(path) > 1:
                    end_node = path[-1]
                    
                    # Skip if end node is missing
                    if end_node in missing_ids:
                        continue
                    
                    # Calculate a relationship coefficient (simplified)
                    # For parent-child: 0.5, grandparent-grandchild: 0.25, etc.
                    relationship_coef = 0.5 ** (len(path) - 1)
                    
                    # Check if any intermediary is missing
                    has_missing_intermediary = any(node in missing_ids for node in path[1:-1])
                    
                    # Simulate segments for this relationship
                    for _ in range(num_segments_per_relationship):
                        # Segment length depends on relationship (closer = longer)
                        avg_length = max(5, 50 * relationship_coef)
                        length_cm = np.random.exponential(avg_length)
                        
                        # Skip very short segments
                        if length_cm < 3:
                            continue
                            
                        segment = {
                            'sample1': f"ind_{start_node}",
                            'sample2': f"ind_{end_node}",
                            'chrom': np.random.randint(1, 23),
                            'gen_start': np.random.uniform(0, 250),
                            'gen_seg_len': length_cm,
                            'relationship_type': f"{len(path) - 1}_degrees",
                            'relationship_coef': relationship_coef
                        }
                        segment['gen_end'] = segment['gen_start'] + segment['gen_seg_len']
                        
                        # Record if this segment is lost due to missing intermediaries
                        if has_missing_intermediary:
                            missing_segments.append(segment)
                        else:
                            segments.append(segment)
    
    # Convert to DataFrames
    segments_df = pd.DataFrame(segments)
    missing_segments_df = pd.DataFrame(missing_segments)
    
    return segments_df, missing_segments_df

# Simulate IBD segments and measure impact of missing individuals
segments_df, missing_segments_df = simulate_missing_data_impact(pedigree, missing_ids)

print(f"Detected segments: {len(segments_df)}")
print(f"Missing segments (due to missing intermediaries): {len(missing_segments_df)}")
print(f"Percentage of segments lost: {len(missing_segments_df)/(len(segments_df) + len(missing_segments_df))*100:.1f}%")

# Compare segment length distributions for detected vs. missing segments
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(segments_df['gen_seg_len'], bins=30, alpha=0.7, label='Detected')
plt.hist(missing_segments_df['gen_seg_len'], bins=30, alpha=0.7, label='Missing')
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Segment Length Distribution')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(0, 100)

plt.subplot(1, 2, 2)
plt.hist(segments_df['relationship_coef'], bins=[0, 0.125, 0.25, 0.5, 1.0], alpha=0.7, label='Detected')
plt.hist(missing_segments_df['relationship_coef'], bins=[0, 0.125, 0.25, 0.5, 1.0], alpha=0.7, label='Missing')
plt.xlabel('Relationship Coefficient')
plt.ylabel('Count')
plt.title('Relationship Distribution')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate the percentage of segments lost by relationship degree
relationship_impact = pd.DataFrame({
    'detected': segments_df.groupby('relationship_type').size(),
    'missing': missing_segments_df.groupby('relationship_type').size()
}).fillna(0)

relationship_impact['total'] = relationship_impact['detected'] + relationship_impact['missing']
relationship_impact['percent_lost'] = relationship_impact['missing'] / relationship_impact['total'] * 100

print("Impact by relationship degree:")
print(relationship_impact)
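
The simulation's length model can be worked through by hand. For a path of `d` edges the coefficient is `0.5 ** d`, the mean segment length is `max(5, 50 * coef)`, and lengths are drawn from an exponential distribution with that mean, so the chance that a draw falls under the 3 cM cutoff is `1 - exp(-3 / mean)`. The sketch below tabulates these values for the three degrees used above; it restates the simulation's own assumptions and is not a model of real IBD sharing.

```python
import math

cutoff_cm = 3  # same "skip very short segments" cutoff as the simulation
length_model = {}
for degrees in (1, 2, 3):
    coef = 0.5 ** degrees
    mean_length = max(5, 50 * coef)
    # CDF of an exponential distribution with the given mean, evaluated at the cutoff
    p_skipped = 1 - math.exp(-cutoff_cm / mean_length)
    length_model[degrees] = (coef, mean_length, p_skipped)
    print(f"{degrees} degree(s): coef={coef}, mean={mean_length} cM, "
          f"P(length < {cutoff_cm} cM)={p_skipped:.3f}")
```

More distant relationships have shorter mean lengths, so a larger share of their simulated segments is discarded by the cutoff before missing intermediaries are even considered.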

### 3.2 Strategies for Handling Missing Data

In [ ]:
# Implement strategies for handling missing data in pedigree reconstruction

class MissingDataHandler:
    """
    Class for implementing strategies to handle missing data in pedigree reconstruction.
    """
    
    def __init__(self, pedigree, segments_df):
        """
        Initialize with a pedigree and IBD segments.
        
        Args:
            pedigree: NetworkX graph representing the pedigree structure
            segments_df: DataFrame of IBD segments
        """
        self.pedigree = pedigree
        self.segments_df = segments_df
        
        # Extract individual IDs from segment data
        self.individuals = set(segments_df['sample1']) | set(segments_df['sample2'])
    
    def identify_missing_connections(self):
        """
        Identify potential missing connections based on transitive relationships.
        
        Returns:
            DataFrame of potential missing connections
        """
        # Create an undirected graph of observed IBD connections
        G = nx.Graph()
        
        # Add all individuals as nodes
        for ind in self.individuals:
            G.add_node(ind)
        
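        # Add edges for observed segments. Repeated add_edge calls overwrite the
        # attributes, so a pair with several segments keeps only the last segment's
        # length and count stays 1.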
        for _, row in self.segments_df.iterrows():
            G.add_edge(row['sample1'], row['sample2'], 
                      length=row['gen_seg_len'], 
                      count=1)
        
        # Find potential missing connections through transitive relationships
        missing_connections = []
        
        # Check all pairs of individuals with distance 2 (common connection)
        for ind1 in G.nodes():
            for ind2 in G.nodes():
                if ind1 >= ind2:  # Skip self-connections and duplicates
                    continue
                    
                # Skip if direct connection already exists
                if G.has_edge(ind1, ind2):
                    continue
                
                # Check if there's a path of length 2
                paths = list(nx.all_simple_paths(G, ind1, ind2, cutoff=2))
                if len(paths) > 0:
                    # Found at least one common connection
                    for path in paths:
                        if len(path) == 3:  # Path of length 2 (3 nodes)
                            intermediary = path[1]
                            
                            # Get segment lengths for both connections
                            len1 = G[ind1][intermediary]['length']
                            len2 = G[intermediary][ind2]['length']
                            
                            # Estimate expected segment length for missing connection
                            # This is a simplification - in reality it depends on the relationship
                            expected_length = min(len1, len2) / 2
                            
                            missing_connections.append({
                                'sample1': ind1,
                                'sample2': ind2,
                                'intermediary': intermediary,
                                'expected_length': expected_length
                            })
        
        return pd.DataFrame(missing_connections)
    
    def infer_phantom_ancestors(self):
        """
        Infer the existence of "phantom" ancestors who might explain observed IBD patterns.
        
        Returns:
            List of inferred phantom ancestors with their likely descendants
        """
        # Create a graph of observed IBD connections
        G = nx.Graph()
        
        # Add edges for observed segments
        for _, row in self.segments_df.iterrows():
            if G.has_edge(row['sample1'], row['sample2']):
                # Update existing edge
                G[row['sample1']][row['sample2']]['segments'].append({
                    'length': row['gen_seg_len'],
                    'chrom': row['chrom'],
                    'start': row['gen_start'],
                    'end': row['gen_end']
                })
                G[row['sample1']][row['sample2']]['total_cm'] += row['gen_seg_len']
                G[row['sample1']][row['sample2']]['count'] += 1
            else:
                # Add new edge
                G.add_edge(row['sample1'], row['sample2'], 
                          segments=[{
                              'length': row['gen_seg_len'],
                              'chrom': row['chrom'],
                              'start': row['gen_start'],
                              'end': row['gen_end']
                          }],
                          total_cm=row['gen_seg_len'],
                          count=1)
        
        # Use community detection to find clusters that might share a common ancestor
        communities = list(nx.community.greedy_modularity_communities(G))
        
        phantom_ancestors = []
        
        for i, community in enumerate(communities):
            if len(community) < 3:
                continue  # Need at least 3 individuals to infer a common ancestor
                
            # Check if all pairs share IBD
            is_fully_connected = True
            for ind1 in community:
                for ind2 in community:
                    if ind1 != ind2 and not G.has_edge(ind1, ind2):
                        is_fully_connected = False
                        break
                if not is_fully_connected:
                    break
            
            # If not fully connected, create a phantom ancestor
            if not is_fully_connected:
                # Calculate average IBD sharing within community
                total_cm = 0
                pair_count = 0
                for ind1 in community:
                    for ind2 in community:
                        if ind1 < ind2 and G.has_edge(ind1, ind2):
                            total_cm += G[ind1][ind2]['total_cm']
                            pair_count += 1
                
                if pair_count > 0:
                    avg_cm = total_cm / pair_count
                    
                    # Infer ancestor generation based on average sharing (coarse, illustrative cutoffs)
                    if avg_cm > 100:
                        gen_distance = 1  # Parent
                    elif avg_cm > 50:
                        gen_distance = 2  # Grandparent
                    else:
                        gen_distance = 3  # Great-grandparent
                    
                    phantom_ancestors.append({
                        'id': f"phantom_{i}",
                        'descendants': list(community),
                        'avg_sharing_cm': avg_cm,
                        'generation_distance': gen_distance
                    })
        
        return phantom_ancestors
    
    def predict_missing_segments(self, min_confidence=0.7):
        """
        Predict IBD segments that might exist but weren't detected.
        
        Args:
            min_confidence: Minimum confidence threshold for predictions
            
        Returns:
            DataFrame of predicted missing segments
        """
        # Group segments by pair
        pair_segments = self.segments_df.groupby(['sample1', 'sample2'])
        
        predicted_segments = []
        
        for (sample1, sample2), group in pair_segments:
            # Skip pairs with only one segment
            if len(group) < 2:
                continue
                
            # Count segments per chromosome
            chrom_counts = group['chrom'].value_counts()
            
            # Identify chromosomes with missing segments
            for chrom in range(1, 23):
                if chrom not in chrom_counts.index:
                    # This chromosome has no segments
                    # Calculate probability of missing segment based on pair's other segments
                    other_chroms_avg = group['gen_seg_len'].mean()
                    
                    # Simple confidence model: higher if pair has many segments
                    confidence = 1 - (1 / (0.5 * len(group)))
                    
                    # Only include predictions with sufficient confidence
                    if confidence >= min_confidence:
                        predicted_segments.append({
                            'sample1': sample1,
                            'sample2': sample2,
                            'chrom': chrom,
                            'predicted_length': other_chroms_avg,
                            'confidence': confidence
                        })
        
        return pd.DataFrame(predicted_segments)

# Apply missing data strategies to our simulated data
handler = MissingDataHandler(pedigree, segments_df)

# 1. Identify potentially missing connections
missing_connections = handler.identify_missing_connections()
print(f"Identified {len(missing_connections)} potentially missing connections")
display(missing_connections.head(10))

# 2. Infer phantom ancestors
phantom_ancestors = handler.infer_phantom_ancestors()
print(f"\nInferred {len(phantom_ancestors)} phantom ancestors")
for i, ancestor in enumerate(phantom_ancestors[:3]):  # Show first 3
    print(f"\nPhantom ancestor {i+1}:")
    print(f"  ID: {ancestor['id']}")
    print(f"  Generation distance: {ancestor['generation_distance']}")
    print(f"  Average sharing: {ancestor['avg_sharing_cm']:.2f} cM")
    print(f"  Number of descendants: {len(ancestor['descendants'])}")
    print(f"  Descendants: {', '.join(ancestor['descendants'][:5])}{'...' if len(ancestor['descendants']) > 5 else ''}")

# 3. Predict missing segments
predicted_segments = handler.predict_missing_segments()
print(f"\nPredicted {len(predicted_segments)} potentially missing segments")
display(predicted_segments.head(10))

# Visualize predicted segments by chromosome
plt.figure(figsize=(12, 6))
chrom_counts = predicted_segments['chrom'].value_counts().sort_index()
plt.bar(chrom_counts.index, chrom_counts.values)
plt.xlabel('Chromosome')
plt.ylabel('Number of Predicted Segments')
plt.title('Distribution of Predicted Missing Segments by Chromosome')
plt.xticks(range(1, 23))
plt.grid(True, alpha=0.3)
plt.show()

# Visualize relationship between prediction confidence and predicted length
plt.figure(figsize=(10, 6))
plt.scatter(predicted_segments['confidence'], predicted_segments['predicted_length'], alpha=0.5)
plt.xlabel('Prediction Confidence')
plt.ylabel('Predicted Segment Length (cM)')
plt.title('Relationship Between Prediction Confidence and Segment Length')
plt.grid(True, alpha=0.3)
plt.xlim(0.7, 1.0)
plt.ylim(0, predicted_segments['predicted_length'].max() * 1.1)
plt.show()
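
The confidence heuristic in `predict_missing_segments` simplifies to `1 - 2/n`, where `n` is the number of segments detected for a pair. The sketch below works through what that implies for the thresholds used in this lab. The helper `min_segments_for_confidence` is a hypothetical name introduced only for illustration; it is not part of the lab code.

```python
import math

def heuristic_confidence(n_segments: int) -> float:
    """Same formula as in predict_missing_segments: 1 - 1 / (0.5 * n)."""
    return 1 - (1 / (0.5 * n_segments))

def min_segments_for_confidence(threshold: float) -> int:
    """Smallest n with 1 - 2/n >= threshold (hypothetical helper)."""
    # 1 - 2/n >= t  <=>  n >= 2 / (1 - t); the small epsilon guards
    # against floating-point noise when 2 / (1 - t) is a whole number.
    return math.ceil(2 / (1 - threshold) - 1e-9)

for n in (2, 4, 7, 10, 20):
    print(f"{n:>2} segments -> confidence {heuristic_confidence(n):.3f}")

print("Segments needed for 0.7:", min_segments_for_confidence(0.7))
print("Segments needed for 0.8:", min_segments_for_confidence(0.8))
```

Under this heuristic, a pair needs at least 7 segments to pass the default `min_confidence=0.7`, so the scatter plot above can only contain pairs that already share many segments.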

## 4. Building Data Preprocessing Pipelines for Bonsai

### 4.1 Designing an End-to-End Preprocessing Pipeline

Now that we've explored various filtering techniques and strategies for handling missing data, let's design a comprehensive preprocessing pipeline for Bonsai. This pipeline will integrate the different components we've developed and prepare IBD data for pedigree reconstruction.
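
Before reading the pipeline, it may help to see the shape of the table it operates on. The sketch below builds a toy segment table using the column names that `load_data` assigns to `.seg` files, then applies the same pair-ordering convention as `validate_data`. All sample IDs and values are invented for illustration.

```python
import pandas as pd

# Toy IBD segment table (made-up values) with the columns the pipeline expects
toy_segments = pd.DataFrame({
    "sample1": ["B", "A", "C"],
    "sample2": ["A", "C", "B"],
    "chrom": [1, 1, 2],
    "phys_start": [1_000_000, 5_000_000, 2_000_000],
    "phys_end": [9_000_000, 20_000_000, 12_000_000],
    "ibd_type": ["IBD1", "IBD1", "IBD1"],
    "gen_start": [2.0, 8.0, 3.5],
    "gen_end": [14.0, 30.0, 15.5],
    "gen_seg_len": [12.0, 22.0, 12.0],
})

# Enforce sample1 < sample2 so that each pair has exactly one representation
swap = toy_segments["sample1"] > toy_segments["sample2"]
toy_segments.loc[swap, ["sample1", "sample2"]] = (
    toy_segments.loc[swap, ["sample2", "sample1"]].values
)

# Total sharing per pair, the quantity most downstream filters rely on
pair_totals = toy_segments.groupby(["sample1", "sample2"])["gen_seg_len"].sum()
print(toy_segments[["sample1", "sample2", "chrom", "gen_seg_len"]])
print(pair_totals)
```

Without the ordering step, `("A", "B")` and `("B", "A")` would be treated as different pairs by every `groupby` in the pipeline.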

In [ ]:
# Implement an end-to-end preprocessing pipeline for Bonsai

class BonsaiPreprocessingPipeline:
    """
    End-to-end pipeline for preprocessing IBD data for Bonsai.
    
    This pipeline handles:
    1. Data loading and validation
    2. Quality control and filtering
    3. Missing data handling
    4. Data transformation for Bonsai input
    """
    
    def __init__(self, population_type='outbred'):
        """
        Initialize the pipeline.
        
        Args:
            population_type: Type of population ('outbred', 'endogamous', 'bottleneck', 'admixed')
        """
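        self.population_type = population_type
        self.logs = []
        self.original_segments = None
        self.filtered_segments = None
        self.predicted_segments = None
        self.pedigree_data = None
        # Initialized here so later steps can rely on these attributes existing
        self.metadata = None
        self.phantom_ancestors = []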
    
    def log(self, message):
        """Add message to log and print it."""
        self.logs.append(message)
        print(message)
    
    def load_data(self, segments_file, metadata_file=None):
        """
        Load IBD segment data and optional metadata.
        
        Args:
            segments_file: Path to the IBD segments file
            metadata_file: Optional path to metadata file
            
        Returns:
            self for method chaining
        """
        self.log(f"Loading IBD segments from {segments_file}")
        
        # Load segments (accept both str and pathlib.Path inputs)
        segments_file = str(segments_file)
        if segments_file.endswith('.seg'):
            # Load in the format from previous IBD detection steps
            self.original_segments = pd.read_csv(segments_file, sep="\t", header=None)
            self.original_segments.columns = ["sample1", "sample2", "chrom", "phys_start", 
                                              "phys_end", "ibd_type", "gen_start", "gen_end", "gen_seg_len"]
        elif segments_file.endswith('.csv'):
            # Assume standard CSV format
            self.original_segments = pd.read_csv(segments_file)
        else:
            # Try to infer format: tab-separated first, then comma-separated
            try:
                self.original_segments = pd.read_csv(segments_file, sep="\t")
            except Exception:
                self.original_segments = pd.read_csv(segments_file)
        
        # Load metadata if provided
        self.metadata = None
        if metadata_file:
            self.log(f"Loading metadata from {metadata_file}")
            try:
                self.metadata = pd.read_csv(metadata_file)
            except Exception as e:
                self.log(f"Warning: Could not load metadata from {metadata_file}: {e}")
        
        self.log(f"Loaded {len(self.original_segments)} IBD segments involving "
                f"{len(set(self.original_segments['sample1']).union(set(self.original_segments['sample2'])))} individuals")
        
        # Make a copy for filtering
        self.filtered_segments = self.original_segments.copy()
        
        return self
    
    def validate_data(self):
        """
        Validate the loaded data for required columns and formats.
        
        Returns:
            self for method chaining
        """
        self.log("Validating data formats...")
        
        # Check required columns
        required_columns = [
            "sample1", "sample2", "chrom", "gen_seg_len"
        ]
        
        missing_columns = [col for col in required_columns if col not in self.original_segments.columns]
        if missing_columns:
            self.log(f"Warning: Missing required columns: {', '.join(missing_columns)}")
            
            # Try to infer or rename columns
            if "gen_seg_len" not in self.original_segments.columns and "cM" in self.original_segments.columns:
                self.original_segments["gen_seg_len"] = self.original_segments["cM"]
                self.log("Inferred 'gen_seg_len' from 'cM' column")
        
        # Check data types
        try:
            self.original_segments["gen_seg_len"] = self.original_segments["gen_seg_len"].astype(float)
            self.original_segments["chrom"] = self.original_segments["chrom"].astype(int)
        except (KeyError, ValueError, TypeError) as e:
            self.log(f"Warning: Could not convert segment length or chromosome to numeric values ({e})")
        
        # Check for duplicates
        duplicates = self.original_segments.duplicated().sum()
        if duplicates > 0:
            self.log(f"Warning: Found {duplicates} duplicate rows")
        
        # Standardize individual IDs
        sample_cols = ["sample1", "sample2"]
        for col in sample_cols:
            if col in self.original_segments.columns:
                self.original_segments[col] = self.original_segments[col].astype(str)
        
        # Ensure sample1 < sample2 for consistency
        swap_mask = self.original_segments["sample1"] > self.original_segments["sample2"]
        if swap_mask.sum() > 0:
            self.log(f"Standardizing {swap_mask.sum()} rows to ensure sample1 < sample2")
            self.original_segments.loc[swap_mask, sample_cols] = self.original_segments.loc[swap_mask, sample_cols[::-1]].values
        
        # Update filtered segments
        self.filtered_segments = self.original_segments.copy()
        
        return self
    
    def apply_qc_and_filtering(self, min_length=7, min_confidence=None, consistency_filter=True):
        """
        Apply quality control and filtering steps.
        
        Args:
            min_length: Minimum segment length (cM)
            min_confidence: Minimum confidence score (if available)
            consistency_filter: Whether to apply consistency-based filtering
            
        Returns:
            self for method chaining
        """
        self.log(f"Applying quality control and filtering (population type: {self.population_type})...")
        
        # Create a filter based on population type.
        # Note: min_length is only used in the outbred branch below.
        if self.population_type in ('endogamous', 'bottleneck', 'admixed'):
            filter_obj = PopulationAwareFilter(self.filtered_segments, self.population_type)
            self.filtered_segments = filter_obj.apply_population_specific_filters().get_filtered_segments()
        else:  # outbred
            # Apply standard filtering for outbred populations
            filter_obj = IBDSegmentFilter(self.filtered_segments)
            self.filtered_segments = filter_obj.filter_by_length(min_length=min_length).get_filtered_segments()
        
        # Apply confidence filtering if specified
        if min_confidence is not None and 'confidence' in self.filtered_segments.columns:
            self.filtered_segments = self.filtered_segments[self.filtered_segments['confidence'] >= min_confidence]
            self.log(f"Applied confidence threshold of {min_confidence}")
        
        # Apply consistency filtering if specified
        if consistency_filter:
            consistency_obj = ConsistencyFilter(self.filtered_segments)
            self.filtered_segments = consistency_obj.filter_by_segment_count(min_segments=2).get_filtered_segments()
            self.log("Applied consistency filtering (min 2 segments per pair)")
        
        self.log(f"After filtering: {len(self.filtered_segments)} segments remaining "
                f"({len(self.filtered_segments)/len(self.original_segments)*100:.1f}% of original)")
        
        return self
    
    def handle_missing_data(self, predict_segments=True, infer_ancestors=True):
        """
        Apply strategies to handle missing data.
        
        Args:
            predict_segments: Whether to predict potentially missing segments
            infer_ancestors: Whether to infer phantom ancestors
            
        Returns:
            self for method chaining
        """
        self.log("Handling missing data...")
        
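        # Create a simple IBD-sharing graph from the filtered segments.
        # Segments are aggregated per pair first: adding one edge per row would
        # overwrite the edge attributes and keep only the last segment.
        pair_stats = (
            self.filtered_segments
            .groupby(['sample1', 'sample2'])['gen_seg_len']
            .agg(['sum', 'count'])
        )
        
        # add_edge creates any missing nodes, so individuals need not be added separately
        G = nx.Graph()
        for (ind1, ind2), stats_row in pair_stats.iterrows():
            total = float(stats_row['sum'])
            G.add_edge(ind1, ind2,
                      length=total,    # total shared cM across all segments
                      total_cm=total,  # same value under the attribute name read in infer_phantom_ancestors
                      count=int(stats_row['count']))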
        
        # Create a handler
        handler = MissingDataHandler(G, self.filtered_segments)
        
        # Predict missing segments if specified
        if predict_segments:
            self.predicted_segments = handler.predict_missing_segments(min_confidence=0.8)
            self.log(f"Predicted {len(self.predicted_segments)} potentially missing segments")
        
        # Infer phantom ancestors if specified
        if infer_ancestors:
            self.phantom_ancestors = handler.infer_phantom_ancestors()
            self.log(f"Inferred {len(self.phantom_ancestors)} phantom ancestors")
        
        return self
    
    def prepare_bonsai_input(self, output_file=None, include_predicted=True):
        """
        Prepare data in the format required by Bonsai.
        
        Args:
            output_file: Optional path to save the Bonsai input file
            include_predicted: Whether to include predicted segments
            
        Returns:
            Dict with Bonsai input data
        """
        self.log("Preparing Bonsai input data...")
        
        # Combine filtered segments with predicted segments if specified
        segments_for_bonsai = self.filtered_segments.copy()
        
        if include_predicted and self.predicted_segments is not None and len(self.predicted_segments) > 0:
            # Convert predicted segments to the same format
            predicted_for_bonsai = self.predicted_segments.copy()
            predicted_for_bonsai['gen_start'] = 0  # Placeholder
            predicted_for_bonsai['gen_end'] = predicted_for_bonsai['predicted_length']
            predicted_for_bonsai['phys_start'] = 0  # Placeholder
            predicted_for_bonsai['phys_end'] = 0  # Placeholder
            predicted_for_bonsai['ibd_type'] = 'IBD1'  # Assuming IBD1
            predicted_for_bonsai['is_predicted'] = True
            
            # Rename columns to match
            predicted_for_bonsai = predicted_for_bonsai.rename(
                columns={'predicted_length': 'gen_seg_len'}
            )
            
            # Add required columns
            required_cols = set(segments_for_bonsai.columns) - set(predicted_for_bonsai.columns)
            for col in required_cols:
                if col != 'is_predicted':
                    predicted_for_bonsai[col] = 0  # Placeholder
            
            # Mark original segments
            segments_for_bonsai['is_predicted'] = False
            
            # Combine
            segments_for_bonsai = pd.concat([segments_for_bonsai, predicted_for_bonsai[segments_for_bonsai.columns]])
            
            self.log(f"Added {len(predicted_for_bonsai)} predicted segments")
        
        # Create a mapping of individuals to metadata if available
        individuals = {}
        if self.metadata is not None:
            for ind in set(segments_for_bonsai['sample1']).union(set(segments_for_bonsai['sample2'])):
                # Try to find this individual in metadata
                ind_metadata = self.metadata[self.metadata['id'] == ind]
                if len(ind_metadata) > 0:
                    individuals[ind] = ind_metadata.iloc[0].to_dict()
                else:
                    individuals[ind] = {'id': ind}
        else:
            # Create basic entries for each individual
            for ind in set(segments_for_bonsai['sample1']).union(set(segments_for_bonsai['sample2'])):
                individuals[ind] = {'id': ind}
        
        # Prepare Bonsai input format
        bonsai_input = {
            'individuals': individuals,
            'segments': segments_for_bonsai.to_dict(orient='records'),
            'phantom_ancestors': self.phantom_ancestors if hasattr(self, 'phantom_ancestors') else [],
            'metadata': {
                'pipeline_version': '1.0.0',
                'population_type': self.population_type,
                'filtering_info': {
                    'original_count': len(self.original_segments),
                    'filtered_count': len(self.filtered_segments),
                    'predicted_count': len(self.predicted_segments) if self.predicted_segments is not None else 0
                }
            }
        }
        
        # Save to file if specified
        if output_file:
            try:
                with open(output_file, 'w') as f:
                    json.dump(bonsai_input, f, indent=2)
                self.log(f"Saved Bonsai input to {output_file}")
            except Exception as e:
                self.log(f"Error saving to {output_file}: {e}")
        
        self.pedigree_data = bonsai_input
        return bonsai_input
    
    def run_pipeline(self, segments_file, output_file=None, metadata_file=None, min_length=7):
        """
        Run the complete pipeline in one go.
        
        Args:
            segments_file: Path to the IBD segments file
            output_file: Optional path to save the Bonsai input file
            metadata_file: Optional path to metadata file
            min_length: Minimum segment length (cM)
            
        Returns:
            Dict with Bonsai input data
        """
        return (self
                .load_data(segments_file, metadata_file)
                .validate_data()
                .apply_qc_and_filtering(min_length=min_length)
                .handle_missing_data()
                .prepare_bonsai_input(output_file))

# Demonstrate the complete pipeline
pipeline = BonsaiPreprocessingPipeline(population_type='outbred')
bonsai_input = pipeline.run_pipeline(
    segments_file=os.path.join(data_directory, "class_data/ped_sim_run2.seg"),
    min_length=7
)

# Display summary statistics
individuals = set(pipeline.filtered_segments['sample1']).union(set(pipeline.filtered_segments['sample2']))
print(f"\nSummary:")
print(f"- {len(individuals)} individuals")
print(f"- {len(pipeline.filtered_segments)} filtered segments")
print(f"- {len(pipeline.predicted_segments) if pipeline.predicted_segments is not None else 0} predicted segments")
print(f"- {len(pipeline.phantom_ancestors) if hasattr(pipeline, 'phantom_ancestors') else 0} inferred phantom ancestors")

# Plot the distribution of segments by chromosome
plt.figure(figsize=(12, 6))
chrom_counts = pipeline.filtered_segments['chrom'].value_counts().sort_index()
plt.bar(chrom_counts.index, chrom_counts.values)
plt.xlabel('Chromosome')
plt.ylabel('Number of Segments')
plt.title('Distribution of IBD Segments by Chromosome')
plt.xticks(range(1, 23))
plt.grid(True, alpha=0.3)
plt.show()

# Plot the distribution of segment lengths
plt.figure(figsize=(12, 6))
plt.hist(pipeline.filtered_segments['gen_seg_len'], bins=30, alpha=0.7)
plt.axvline(x=7, color='red', linestyle='--', label='7 cM threshold')
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Distribution of IBD Segment Lengths')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)
plt.show()
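
`prepare_bonsai_input` writes its result with `json.dump`. Depending on how the dictionaries are built, NumPy scalar types can end up in the output, and the standard `json` module rejects some of them (for example `numpy.int64`). One possible safeguard, sketched here with a hypothetical helper named `to_builtin`, is to pass a `default` converter. This is an optional hardening step, not something the pipeline above does.

```python
import json
import numpy as np

def to_builtin(obj):
    """Fallback for json.dump(s): convert NumPy scalars and arrays to plain Python objects."""
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# A record with NumPy types, similar to what a segment row might contain
record = {
    "chrom": np.int64(3),
    "gen_seg_len": np.float64(12.5),
    "is_predicted": np.bool_(False),
}

encoded = json.dumps(record, default=to_builtin)
print(encoded)
```

The same `default=to_builtin` argument can be passed to `json.dump` when writing to a file.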

## 5. Exercises

Complete the following exercises to test your understanding of data quality and preprocessing for Bonsai:

### Exercise 1: Modify the Population-Aware Filter
Extend the `PopulationAwareFilter` class to handle a new population type called "island": an isolated population with high levels of endogamy plus strong founder effects, where many pairs share IBD segments in the same genomic regions.

### Exercise 2: Evaluate Filter Performance
Design a function that evaluates the performance of different filtering strategies by comparing the filtered segments to a "ground truth" set of known relationships.

### Exercise 3: Create a Custom Preprocessing Pipeline
Create a custom preprocessing pipeline for a specific use case (e.g., ancient DNA analysis, multi-ethnic dataset, or a specific family history question).

In [ ]:
# Example solution for Exercise 1: Extend the PopulationAwareFilter for "island" populations

class ExtendedPopulationFilter(PopulationAwareFilter):
    """Extended filter that handles additional population types including island populations."""
    
    def apply_population_specific_filters(self):
        """Apply filters appropriate for the specified population."""
        n_before = len(self.filtered_segments)
        
        if self.population_type == 'island':
            # Island populations tend to have:
            # 1. High background IBD due to isolation
            # 2. Distinct patterns of sharing across specific chromosomes due to founder effects
            # 3. Need for careful consistency checks to distinguish true from background IBD
            
            # Step 1: Apply a higher length threshold to reduce false positives from background IBD
            self.filter_by_length(min_length=12)  # More stringent than endogamous
            
            # Step 2: Identify "common segments" that appear frequently in the population
            # (in a real scenario, these would be pre-identified based on population-wide analysis)
            # For this example, we'll simulate by counting the most frequent segment regions
            if len(self.filtered_segments) > 0:
                # Group segments by chromosome and divide into regions
                region_counts = Counter()
                bin_size = 10  # in cM
                
                for _, row in self.filtered_segments.iterrows():
                    chrom = row['chrom']
                    start_bin = int(row['gen_start'] / bin_size)
                    end_bin = int(row['gen_end'] / bin_size)
                    
                    # Count each bin the segment overlaps
                    for bin_idx in range(start_bin, end_bin + 1):
                        region_counts[f"{chrom}_{bin_idx}"] += 1
                
                # Identify common regions (top 10% most frequent)
                sorted_regions = region_counts.most_common()
                common_regions = set()  # set gives constant-time membership tests below
                if len(sorted_regions) > 10:
                    threshold_idx = max(1, int(len(sorted_regions) * 0.1))
                    common_regions = {r[0] for r in sorted_regions[:threshold_idx]}
                
                # Filter out segments that are entirely in common regions
                if common_regions:
                    filtered_idx = []
                    for i, row in self.filtered_segments.iterrows():
                        chrom = row['chrom']
                        start_bin = int(row['gen_start'] / bin_size)
                        end_bin = int(row['gen_end'] / bin_size)
                        
                        # Check if all bins are in common regions
                        all_common = True
                        for bin_idx in range(start_bin, end_bin + 1):
                            region_key = f"{chrom}_{bin_idx}"
                            if region_key not in common_regions:
                                all_common = False
                                break
                        
                        # Keep if not all bins are common
                        if not all_common:
                            filtered_idx.append(i)
                    
                    # Apply the filter
                    self.filtered_segments = self.filtered_segments.loc[filtered_idx]
            
            # Step 3: Apply consistency filtering - pairs should have multiple segments
            # Count segments per pair
            pair_counts = self.filtered_segments.groupby(['sample1', 'sample2']).size().reset_index(name='count')
            valid_pairs = pair_counts[pair_counts['count'] >= 3]  # Higher than standard
            
            # Filter segments (a set makes each membership test constant-time)
            valid_pairs_set = set(zip(valid_pairs['sample1'], valid_pairs['sample2']))
            pair_tuples = zip(self.filtered_segments['sample1'], self.filtered_segments['sample2'])
            
            keep_mask = [p in valid_pairs_set for p in pair_tuples]
            self.filtered_segments = self.filtered_segments[keep_mask]
            
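        else:
            # Use the parent class implementation for other population types.
            # Return directly so the filter is not recorded a second time below
            # (assumes the parent method appends its own filter_history entry).
            return super().apply_population_specific_filters()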
        
        n_after = len(self.filtered_segments)
        self.filter_history.append({
            'filter': f'population_{self.population_type}',
            'params': {},
            'removed': n_before - n_after
        })
        
        return self

# Test the extended filter with an island population
np.random.seed(42)
# Create a simulated island population - extreme version of endogamous
island_df = simulate_population_effects(seg_df, 'endogamous', num_individuals=30)
# Add some common segments that appear in many pairs (simulating founder effect)
common_segments = []
# Enumerate all unordered pairs once; the list is the same for every iteration
all_pairs = [(i, j) for i in range(30) for j in range(i+1, 30)]
for _ in range(50):  # Add 50 common segments
    # Randomly select 30-60% of pairs to have this common segment
    num_pairs = int(len(all_pairs) * np.random.uniform(0.3, 0.6))
    pairs = np.random.choice(len(all_pairs), size=num_pairs, replace=False)
    
    # Create a segment on the same region
    chrom = np.random.randint(1, 23)
    start = np.random.uniform(0, 200)
    length = np.random.uniform(15, 30)  # Longer segments
    
    for idx in pairs:
        i, j = all_pairs[idx]
        common_segments.append({
            'sample1': f"ind_{i}",
            'sample2': f"ind_{j}",
            'chrom': chrom,
            'phys_start': np.random.randint(1000000, 200000000),
            'phys_end': np.random.randint(200000000, 300000000),
            'ibd_type': 'IBD1',
            'gen_start': start,
            'gen_end': start + length,
            'gen_seg_len': length
        })

# Add common segments to the dataset
island_df = pd.concat([island_df, pd.DataFrame(common_segments)], ignore_index=True)

# Apply both standard and extended filter for comparison
standard_filter = PopulationAwareFilter(island_df.copy(), 'endogamous')
extended_filter = ExtendedPopulationFilter(island_df.copy(), 'island')

# Apply filters
standard_filtered = standard_filter.apply_population_specific_filters().get_filtered_segments()
extended_filtered = extended_filter.apply_population_specific_filters().get_filtered_segments()

# Compare results
print(f"Original island population segments: {len(island_df)}")
print(f"After standard endogamous filtering: {len(standard_filtered)} ({len(standard_filtered)/len(island_df)*100:.1f}%)")
print(f"After island-specific filtering: {len(extended_filtered)} ({len(extended_filtered)/len(island_df)*100:.1f}%)")

# Visualize the distributions
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.hist(island_df['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Original Island Population')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.subplot(1, 3, 2)
plt.hist(standard_filtered['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Standard Endogamous Filtering')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.subplot(1, 3, 3)
plt.hist(extended_filtered['gen_seg_len'], bins=30, alpha=0.7)
plt.xlabel('Segment Length (cM)')
plt.ylabel('Count')
plt.title('Island-Specific Filtering')
plt.grid(True, alpha=0.3)
plt.xlim(0, 50)

plt.tight_layout()
plt.show()

# Visualize the distribution of segments by chromosome
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
island_chrom_counts = island_df['chrom'].value_counts().sort_index()
plt.bar(island_chrom_counts.index, island_chrom_counts.values)
plt.xlabel('Chromosome')
plt.ylabel('Number of Segments')
plt.title('Original Island Population')
plt.xticks(range(1, 23))
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 2)
standard_chrom_counts = standard_filtered['chrom'].value_counts().sort_index()
plt.bar(standard_chrom_counts.index, standard_chrom_counts.values)
plt.xlabel('Chromosome')
plt.ylabel('Number of Segments')
plt.title('Standard Endogamous Filtering')
plt.xticks(range(1, 23))
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 3)
extended_chrom_counts = extended_filtered['chrom'].value_counts().sort_index()
plt.bar(extended_chrom_counts.index, extended_chrom_counts.values)
plt.xlabel('Chromosome')
plt.ylabel('Number of Segments')
plt.title('Island-Specific Filtering')
plt.xticks(range(1, 23))
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
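
The island filter above assigns each segment to every 10 cM bin it overlaps and then treats the most frequently hit bins as "common regions". The sketch below isolates that binning step on three invented segments. The function name `bins_covered` is hypothetical and used only for this illustration.

```python
from collections import Counter

BIN_SIZE_CM = 10  # same bin width as in the island filter

def bins_covered(chrom, gen_start, gen_end, bin_size=BIN_SIZE_CM):
    """Return the '<chrom>_<bin>' keys of every bin a segment overlaps."""
    start_bin = int(gen_start / bin_size)
    end_bin = int(gen_end / bin_size)
    return [f"{chrom}_{b}" for b in range(start_bin, end_bin + 1)]

# (chrom, gen_start, gen_end) for three made-up segments
toy = [(1, 12.0, 31.5), (1, 18.0, 27.0), (2, 5.0, 9.0)]

region_counts = Counter()
for chrom, start, end in toy:
    region_counts.update(bins_covered(chrom, start, end))

print(region_counts.most_common())
```

Here bins `1_1` and `1_2` are each hit twice because both chromosome 1 segments overlap them. In a real island population the same logic would surface regions shared by a large fraction of pairs.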

In [ ]:
# Example solution for Exercise 2: Evaluate filter performance against ground truth

def evaluate_filter_performance(true_relationships, filtered_segments, relationship_thresholds=None):
    """
    Evaluate the performance of filtering strategies by comparing to known relationships.
    
    Args:
        true_relationships: DataFrame with columns 'sample1', 'sample2', 'relationship_degree'
        filtered_segments: DataFrame of filtered IBD segments
        relationship_thresholds: Dict mapping relationship degrees to min total cM thresholds
        
    Returns:
        Dict with performance metrics
    """
    if relationship_thresholds is None:
        # Default thresholds: minimum total cM expected for each relationship degree
        relationship_thresholds = {
            1: 1500,  # Parent-child: ~1500+ cM
            2: 500,   # Sibling, grandparent: ~500+ cM
            3: 200,   # Aunt/uncle, half-sibling: ~200+ cM
            4: 90,    # First cousin: ~90+ cM
            5: 45,    # First cousin once removed: ~45+ cM
            6: 20     # Second cousin: ~20+ cM
        }
    
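    # Calculate total sharing between each pair in the filtered segments.
    # Pair order is normalized first (smaller ID in sample1) so that lookups
    # match the standardized true-relationship pairs used below.
    ordered = filtered_segments[['sample1', 'sample2', 'gen_seg_len']].copy()
    swap = ordered['sample1'] > ordered['sample2']
    if swap.any():
        ordered.loc[swap, ['sample1', 'sample2']] = ordered.loc[swap, ['sample2', 'sample1']].values
    
    # Create a lookup dictionary keyed by (sample1, sample2) for quick access
    sharing_dict = ordered.groupby(['sample1', 'sample2'])['gen_seg_len'].sum().to_dict()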
    
    # Initialize counters for evaluation
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    true_negatives_unknown = 0  # We can't know true negatives without all possible pairs
    
    # Check each true relationship
    for _, rel in true_relationships.iterrows():
        sample1, sample2 = rel['sample1'], rel['sample2']
        degree = rel['relationship_degree']
        threshold = relationship_thresholds.get(degree, 0)
        
        # Standardize pair order for lookup
        if sample1 > sample2:
            sample1, sample2 = sample2, sample1
        
        # Check if this pair has sufficient sharing
        pair_key = (sample1, sample2)
        if pair_key in sharing_dict:
            total_sharing = sharing_dict[pair_key]
            if total_sharing >= threshold:
                # Correctly identified relationship
                true_positives += 1
            else:
                # Relationship detected but below threshold
                false_negatives += 1
        else:
            # Relationship not detected at all
            false_negatives += 1
    
    # Check for false positives: detected pairs that are absent from the truth set.
    # A set of sorted pairs avoids rescanning the DataFrame for every detected pair.
    true_pairs = {
        tuple(sorted(pair))
        for pair in zip(true_relationships['sample1'], true_relationships['sample2'])
    }
    min_threshold = min(relationship_thresholds.values())
    for pair_key, total_sharing in sharing_dict.items():
        if tuple(sorted(pair_key)) not in true_pairs:
            # This pair isn't in the true relationships - check if it exceeds any threshold
            if total_sharing >= min_threshold:
                false_positives += 1
            else:
                true_negatives_unknown += 1
    
    # Calculate metrics
    total_true = true_positives + false_negatives
    
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / total_true if total_true > 0 else 0
    f1_score = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    
    # Prepare results
    results = {
        'true_positives': true_positives,
        'false_positives': false_positives,
        'false_negatives': false_negatives,
        'precision': precision,
        'recall': recall,
        'f1_score': f1_score
    }
    
    return results

# Create a set of true relationships for evaluation
def generate_true_relationships(pedigree):
    """Generate true relationships from a pedigree graph."""
    true_relationships = []
    
    # Convert to undirected graph for path finding
    undirected = pedigree.to_undirected()
    
    # Find paths between all pairs
    for source in pedigree.nodes():
        for target in pedigree.nodes():
            if source >= target:
                continue  # Skip self and duplicates
                
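            # Find shortest path between these individuals
            try:
                path = nx.shortest_path(undirected, source, target)
                # Degree = number of edges on the path (node count - 1).
                # Caveat: an undirected path can pass through a shared child,
                # so two unrelated parents of the same child appear as degree 2.
                degree = len(path) - 1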
                
                if degree <= 6:  # Only consider up to 6th degree relationships
                    true_relationships.append({
                        'sample1': f"ind_{source}",
                        'sample2': f"ind_{target}",
                        'relationship_degree': degree,
                        'path': path
                    })
            except nx.NetworkXNoPath:
                # No path between these individuals
                pass
    
    return pd.DataFrame(true_relationships)

# Generate true relationships from our simulated pedigree
true_relationships = generate_true_relationships(pedigree)
print(f"Generated {len(true_relationships)} true relationships")

# Simulate segments for these relationships
segments_df_for_eval, _ = simulate_missing_data_impact(pedigree, missing_ids, num_segments_per_relationship=20)

# Apply different filtering strategies
filters_to_evaluate = {
    'minimal': IBDSegmentFilter(segments_df_for_eval.copy()).filter_by_length(min_length=3).get_filtered_segments(),
    'standard': IBDSegmentFilter(segments_df_for_eval.copy()).filter_by_length(min_length=7).get_filtered_segments(),
    'strict': IBDSegmentFilter(segments_df_for_eval.copy()).filter_by_length(min_length=15).get_filtered_segments(),
    'consistency': ConsistencyFilter(segments_df_for_eval.copy()).filter_by_segment_count(min_segments=2).get_filtered_segments(),
    'combined': ConsistencyFilter(segments_df_for_eval.copy()).filter_by_length(min_length=7).filter_by_segment_count(min_segments=2).get_filtered_segments()
}

# Evaluate each filter
results = {}
for filter_name, filtered_segments in filters_to_evaluate.items():
    results[filter_name] = evaluate_filter_performance(true_relationships, filtered_segments)
    
# Display results
results_df = pd.DataFrame.from_dict(results, orient='index')
print("Filter Performance Metrics:")
display(results_df)

# Visualize precision, recall, and F1 score
plt.figure(figsize=(12, 6))
metrics = ['precision', 'recall', 'f1_score']
x = np.arange(len(filters_to_evaluate))
width = 0.25

for i, metric in enumerate(metrics):
    values = [results[filter_name][metric] for filter_name in filters_to_evaluate]
    plt.bar(x + i*width - width, values, width, label=metric)

plt.xlabel('Filtering Strategy')
plt.ylabel('Score')
plt.title('Performance Metrics by Filtering Strategy')
plt.xticks(x, list(filters_to_evaluate.keys()), rotation=45)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Analyze performance by relationship degree
def evaluate_by_degree(true_relationships, filtered_segments, relationship_thresholds=None):
    """Evaluate filter performance broken down by relationship degree."""
    if relationship_thresholds is None:
        # Default thresholds (as above)
        relationship_thresholds = {
            1: 1500, 2: 500, 3: 200, 4: 90, 5: 45, 6: 20
        }
    
    # Calculate total sharing per pair under an order-independent key, so that
    # (A, B) and (B, A) rows are pooled and match the sorted lookup used below
    sharing_dict = {}
    for s1, s2, seg_len in zip(filtered_segments['sample1'],
                               filtered_segments['sample2'],
                               filtered_segments['gen_seg_len']):
        key = tuple(sorted((s1, s2)))
        sharing_dict[key] = sharing_dict.get(key, 0.0) + seg_len
    
    # Evaluate each degree separately
    degree_results = {}
    for degree in range(1, 7):
        # Filter relationships to this degree
        degree_rels = true_relationships[true_relationships['relationship_degree'] == degree]
        
        if len(degree_rels) == 0:
            continue  # Skip if no relationships of this degree
            
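        # Fall back to 0 when a custom threshold dict omits this degree,
        # matching the behavior of evaluate_filter_performance
        threshold = relationship_thresholds.get(degree, 0)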
        
        # Count true positives and false negatives
        tp = 0
        fn = 0
        
        for _, rel in degree_rels.iterrows():
            sample1, sample2 = rel['sample1'], rel['sample2']
            
            # Standardize pair order
            if sample1 > sample2:
                sample1, sample2 = sample2, sample1
            
            # Check if this pair has sufficient sharing
            pair_key = (sample1, sample2)
            if pair_key in sharing_dict:
                total_sharing = sharing_dict[pair_key]
                if total_sharing >= threshold:
                    tp += 1
                else:
                    fn += 1
            else:
                fn += 1
        
        # Calculate recall for this degree
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        
        degree_results[degree] = {
            'true_positives': tp,
            'false_negatives': fn,
            'recall': recall,
            'total_relationships': len(degree_rels)
        }
    
    return degree_results

# Evaluate breakdown by degree for each filter
degree_results = {}
for filter_name, filtered_segments in filters_to_evaluate.items():
    degree_results[filter_name] = evaluate_by_degree(true_relationships, filtered_segments)

# Visualize recall by relationship degree
plt.figure(figsize=(12, 6))
x = np.arange(1, 7)
width = 0.15
# Use a distinct loop variable so the earlier `results` dict is not overwritten
for i, (filter_name, per_degree) in enumerate(degree_results.items()):
    recall_values = [per_degree.get(degree, {}).get('recall', 0) for degree in range(1, 7)]
    plt.bar(x + i*width - 2*width, recall_values, width, label=filter_name)

plt.xlabel('Relationship Degree')
plt.ylabel('Recall')
plt.title('Recall by Relationship Degree and Filtering Strategy')
plt.xticks(x)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 6. Summary and Next Steps

In this lab, we explored data quality issues and preprocessing techniques for Bonsai. Let's summarize what we've learned:

### Key Concepts Covered

1. **Common Data Quality Issues**
   - IBD detection errors (false positives, false negatives, boundary errors)
   - Phasing and genotyping errors
   - Population-specific challenges (endogamy, bottlenecks, admixture)

2. **Filtering Techniques**
   - Length-based filtering
   - Confidence-based filtering
   - Consistency-based filtering
   - Population-aware filtering

3. **Handling Missing Data**
   - Understanding the impact of missing individuals
   - Predicting missing segments
   - Inferring phantom ancestors

4. **Building Preprocessing Pipelines**
   - End-to-end data processing
   - Data validation and transformation
   - Integration of multiple filtering strategies
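
As a compact recap of how two of these strategies compose, the sketch below chains a length-based filter with a consistency-based (segment-count) filter on a toy segment table. It assumes the `sample1`, `sample2`, and `gen_seg_len` columns used throughout this lab. The function name `filter_segments` and its default cutoffs are illustrative choices, not part of Bonsai or of the `IBDSegmentFilter` class defined earlier.

```python
import pandas as pd

def filter_segments(segments, min_length=7.0, min_segments=2):
    """Hypothetical helper: length filter, then a per-pair segment-count filter."""
    # Length-based step: drop segments shorter than min_length cM
    kept = segments[segments["gen_seg_len"] >= min_length]
    # Consistency step: keep only pairs that still have at least min_segments segments
    counts = kept.groupby(["sample1", "sample2"])["gen_seg_len"].transform("count")
    return kept[counts >= min_segments].reset_index(drop=True)

# Toy data: one pair with three segments, one pair with two
toy = pd.DataFrame({
    "sample1": ["ind_1", "ind_1", "ind_1", "ind_2", "ind_2"],
    "sample2": ["ind_2", "ind_2", "ind_2", "ind_3", "ind_3"],
    "gen_seg_len": [25.0, 12.0, 4.0, 9.0, 5.0],
})
filtered = filter_segments(toy)
print(filtered)
```

The order of the two steps matters in this sketch: counting segments after the length filter means a pair supported only by short segments is dropped, whereas counting first would let it through.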

### Next Steps

1. **Advanced Pipeline Customization**
   - Create domain-specific preprocessing pipelines
   - Integrate demographic and historical information

2. **Performance Analysis**
   - Conduct systematic evaluations of filtering strategies
   - Benchmark against known pedigrees
   - Measure impact on Bonsai accuracy

3. **Further Learning**
   - Explore specialized techniques for challenging populations
   - Study phasing improvements and their impact on IBD detection
   - Investigate machine learning approaches for segment filtering
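
One possible starting point for the systematic evaluation listed under Performance Analysis is to sweep the minimum segment length and record precision and recall at each setting. The sketch below does this on a small hand-made dataset. The helper `sweep_min_length`, the toy pairs, and the single 20 cM detection threshold are all assumptions made for illustration; a pair counts as detected only if its total retained sharing reaches that threshold.

```python
import pandas as pd

def sweep_min_length(segments, true_pairs, min_lengths, detect_cm=20.0):
    """Hypothetical sweep: precision/recall of pair detection per length cutoff."""
    rows = []
    for min_len in min_lengths:
        kept = segments[segments["gen_seg_len"] >= min_len]
        # Sum sharing per pair with order-independent keys
        totals = {}
        for s1, s2, cm in zip(kept["sample1"], kept["sample2"], kept["gen_seg_len"]):
            key = tuple(sorted((s1, s2)))
            totals[key] = totals.get(key, 0.0) + cm
        detected = {pair for pair, cm in totals.items() if cm >= detect_cm}
        tp = len(detected & true_pairs)
        fp = len(detected - true_pairs)
        fn = len(true_pairs - detected)
        rows.append({
            "min_length": min_len,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        })
    return pd.DataFrame(rows)

# Toy data: two true pairs (stored in mixed order) and one spurious pair
segments = pd.DataFrame({
    "sample1": ["ind_1", "ind_2", "ind_1", "ind_3", "ind_4", "ind_4"],
    "sample2": ["ind_2", "ind_1", "ind_3", "ind_1", "ind_5", "ind_5"],
    "gen_seg_len": [30.0, 12.0, 16.0, 6.0, 11.0, 10.0],
})
true_pairs = {("ind_1", "ind_2"), ("ind_1", "ind_3")}  # keys must be sorted tuples
sweep = sweep_min_length(segments, true_pairs, min_lengths=[3, 7, 15])
print(sweep)
```

On this toy input, the strictest cutoff removes the spurious pair but also loses one true pair, which is the precision/recall trade-off the filtering comparison in this lab is meant to expose.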

In the next lab, we'll explore how to use Bonsai for multi-sample relationship inference and examine strategies for building larger pedigrees from IBD data.