# Lab 21: Bonsai - Real-World Applications

## Introduction

In the previous lab, we explored various optimization algorithms for reconstructing pedigrees using the Bonsai framework. Now, we'll examine how these techniques can be applied to real-world genetic genealogy datasets.

Real-world applications of pedigree reconstruction present unique challenges not encountered in controlled simulations, including:

- Incomplete or missing data
- Genotyping and IBD detection errors
- Complex genetic structures (endogamy, multiple family lines, etc.)
- Computational efficiency requirements for large datasets
- Privacy and ethical considerations

This lab will demonstrate how to address these challenges using Bonsai's optimization frameworks.

## Setup

First, let's import the necessary libraries and set up our environment.

In [None]:
import os
import time
import random
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from collections import defaultdict
import scipy.stats as stats
from IPython.display import display, HTML
from typing import List, Dict, Tuple, Set, Optional

# Set random seed for reproducibility
random.seed(42)
np.random.seed(42)

# Configure visualization settings
plt.style.use('ggplot')
sns.set_context('notebook')
%matplotlib inline

Let's also import our Bonsai framework components:

In [None]:
# Import Bonsai components from previous labs
# Note: In a real application, these would be properly packaged modules

# Pedigree data structures
class Individual:
    """Represents an individual in a pedigree."""
    def __init__(self, id_str, father_id=None, mother_id=None, sex=None):
        self.id = id_str
        self.father_id = father_id
        self.mother_id = mother_id
        self.sex = sex  # 'M' or 'F'
        self.ibd_segments = []  # List of IBD segments with other individuals

class Pedigree:
    """Represents a pedigree as a collection of individuals."""
    def __init__(self):
        self.individuals = {}  # Dictionary mapping IDs to Individual objects
        self.graph = nx.DiGraph()  # Directed graph representing the pedigree

    def add_individual(self, individual):
        """Add an individual to the pedigree."""
        self.individuals[individual.id] = individual
        self.graph.add_node(individual.id)
        # Add edges for parental relationships if they exist
        if individual.father_id and individual.father_id in self.individuals:
            self.graph.add_edge(individual.father_id, individual.id)
        if individual.mother_id and individual.mother_id in self.individuals:
            self.graph.add_edge(individual.mother_id, individual.id)

    def visualize(self, figsize=(12, 8)):
        """Visualize the pedigree graph."""
        plt.figure(figsize=figsize)
        pos = nx.nx_agraph.graphviz_layout(self.graph, prog='dot')
        nx.draw(self.graph, pos, with_labels=True, node_color='lightblue', 
                node_size=2000, arrowsize=20, font_size=10)
        plt.title("Pedigree Graph")
        plt.show()

## 1. Working with Real Genetic Genealogy Datasets

Real genetic genealogy datasets differ from simulations in several important ways. Let's explore how to load, clean, and prepare real-world data for pedigree reconstruction.

In [None]:
def load_ibd_segments(file_path):
    """Load IBD segments from a real dataset file.
    
    Args:
        file_path: Path to the IBD segments file (e.g., from IBIS, hap-IBD, etc.)
        
    Returns:
        DataFrame containing IBD segments
    """
    # In practice, this would handle various file formats from IBD detection tools
    # For this lab, we'll use a simplified format
    
    # Example: Reading from a CSV or similar format
    try:
        ibd_df = pd.read_csv(file_path)
        # Ensure required columns exist
        required_cols = ['id1', 'id2', 'chromosome', 'start_position', 'end_position', 'cM_length']
        for col in required_cols:
            if col not in ibd_df.columns:
                print(f"Warning: Missing required column {col}")
        return ibd_df
    except Exception as e:
        print(f"Error loading IBD data: {e}")
        # Create a sample dataframe for demonstration
        return pd.DataFrame({
            'id1': ['sample1', 'sample1', 'sample2'],
            'id2': ['sample2', 'sample3', 'sample3'],
            'chromosome': [1, 2, 1],
            'start_position': [1000000, 2000000, 1500000],
            'end_position': [2000000, 3000000, 2500000],
            'cM_length': [10.5, 15.2, 8.7]
        })

In [None]:
# For this lab, we'll use the ped_sim data as a realistic example
# In a real application, you would replace this with your actual data path
sample_ibd_file = "../data/class_data/ped_sim_run2.seg"

# Check if the file exists, otherwise use sample data
if os.path.exists(sample_ibd_file):
    # Load the ped-sim segment data (simulated, but realistic in structure)
    print(f"Loading IBD segments from {sample_ibd_file}")
    raw_ibd_segments = pd.read_csv(sample_ibd_file, sep='\t')
    # Display the first few rows and the column names
    print("\nColumn names:")
    print(raw_ibd_segments.columns.tolist())
    display(raw_ibd_segments.head())
else:
    print(f"File {sample_ibd_file} not found. Using synthetic data for demonstration.")
    # Create synthetic data for demonstration
    raw_ibd_segments = load_ibd_segments("dummy_path")
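
If `ped_sim_run2.seg` comes straight from ped-sim, it may have no header row, in which case `pd.read_csv(..., sep='\t')` would consume the first segment as column names. The sketch below assumes the headerless, nine-column layout described in the ped-sim documentation (two IDs, chromosome, physical start and end, IBD type, genetic start and end, genetic length). The function name `load_ped_sim_segments` and the column names `ibd_type`, `start_cm`, and `end_cm` are illustrative choices, so check the layout against your own file before relying on it.

```python
import io
import pandas as pd

# Assumed column layout for a headerless ped-sim .seg file (verify against your data)
PED_SIM_COLUMNS = [
    'id1', 'id2', 'chromosome', 'start_position', 'end_position',
    'ibd_type', 'start_cm', 'end_cm', 'cM_length',
]

def load_ped_sim_segments(path_or_buffer):
    """Read a tab-separated, headerless segment file into the lab's column names."""
    return pd.read_csv(path_or_buffer, sep='\t', header=None, names=PED_SIM_COLUMNS)

# Two made-up rows, used only to show the call
example_text = (
    "A1\tB1\t1\t1000000\t9000000\tIBD1\t2.5\t14.0\t11.5\n"
    "A1\tC1\t2\t500000\t4000000\tIBD1\t1.0\t9.2\t8.2\n"
)
example_segments = load_ped_sim_segments(io.StringIO(example_text))
print(example_segments[['id1', 'id2', 'chromosome', 'cM_length']])
```

Because the columns are named on load, the result can be passed to the preprocessing step without any renaming.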

### 1.1 Data Cleaning and Preprocessing

Real-world data often contains errors, inconsistencies, and missing values. Let's implement some common preprocessing steps:

In [None]:
def preprocess_ibd_data(ibd_df, min_cm_length=7.0, max_cm_length=None, remove_duplicates=True):
    """Clean and preprocess IBD segment data.
    
    Args:
        ibd_df: DataFrame containing IBD segments
        min_cm_length: Minimum centiMorgan length to keep (filter out short segments)
        max_cm_length: Maximum centiMorgan length to keep (filter out potential errors)
        remove_duplicates: Whether to remove duplicate segments
        
    Returns:
        Cleaned DataFrame
    """
    # Make a copy to avoid modifying the original
    df = ibd_df.copy()
    
    # Initial row count
    initial_count = len(df)
    print(f"Initial IBD segment count: {initial_count}")
    
    # Standardize column names if needed
    column_mapping = {
        # Map various possible column names to standardized names
        # Example: 'ID1': 'id1', 'sample1': 'id1', etc.
    }
    df = df.rename(columns=column_mapping)
    
    # Handle missing values
    print(f"Missing values before cleaning:\n{df.isnull().sum()}")
    df = df.dropna(subset=['id1', 'id2', 'cM_length'])  # Drop rows with missing critical data
    print(f"Rows after dropping missing values: {len(df)}")
    
    # Filter by segment length
    if min_cm_length is not None:
        df = df[df['cM_length'] >= min_cm_length]
        print(f"Rows after filtering segments < {min_cm_length} cM: {len(df)}")
    
    if max_cm_length is not None:
        df = df[df['cM_length'] <= max_cm_length]
        print(f"Rows after filtering segments > {max_cm_length} cM: {len(df)}")
    
    # Remove duplicate segments
    if remove_duplicates:
        # Sort id1 and id2 to treat (A,B) and (B,A) as the same pair
        # (assign returns a new DataFrame, which also works when df is empty)
        df = df.assign(
            sorted_id1=[min(a, b) for a, b in zip(df['id1'], df['id2'])],
            sorted_id2=[max(a, b) for a, b in zip(df['id1'], df['id2'])],
        )
        # A duplicate has the same (sorted) pair, chromosome, and positions
        dup_cols = ['sorted_id1', 'sorted_id2', 'chromosome', 'start_position', 'end_position']
        dupe_mask = df.duplicated(subset=dup_cols, keep='first')
        print(f"Duplicate segments found: {dupe_mask.sum()}")
        df = df[~dupe_mask]
        # Remove the temporary sorting columns
        df = df.drop(columns=['sorted_id1', 'sorted_id2'])
    
    # Calculate reduction percentage
    final_count = len(df)
    reduction = (initial_count - final_count) / initial_count * 100 if initial_count > 0 else 0
    print(f"Final IBD segment count: {final_count} ({reduction:.1f}% reduction)")
    
    return df
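
The `column_mapping` dictionary above is left empty. One way to fill it is to look up each standard name in a table of known aliases, as in the sketch below. The aliases `seg_cm`, `indiv1`, and `indiv2` are the ones this lab's preprocessing cell checks for; the other aliases and the helper name `build_column_mapping` are hypothetical.

```python
# Hypothetical alias table: 'seg_cm', 'indiv1', 'indiv2' are used in this lab;
# the remaining aliases are guesses at names other tools might emit
COLUMN_ALIASES = {
    'id1': ['indiv1', 'ID1', 'sample1'],
    'id2': ['indiv2', 'ID2', 'sample2'],
    'cM_length': ['seg_cm', 'length_cm', 'cM'],
    'chromosome': ['chrom', 'chr'],
}

def build_column_mapping(columns):
    """Return {existing_name: standard_name} for columns that match a known alias."""
    present = set(columns)
    mapping = {}
    for standard, aliases in COLUMN_ALIASES.items():
        if standard in present:
            continue  # already standardized; leave it alone
        for alias in aliases:
            if alias in present:
                mapping[alias] = standard
                break  # use the first alias found
    return mapping

example_mapping = build_column_mapping(['indiv1', 'indiv2', 'chromosome', 'seg_cm'])
print(example_mapping)
```

The returned dictionary can be passed directly to `DataFrame.rename(columns=...)`.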

In [None]:
# Apply preprocessing to our IBD data
# Let's adapt the function to match the column names in our actual data
if 'cM_length' not in raw_ibd_segments.columns and 'seg_cm' in raw_ibd_segments.columns:
    raw_ibd_segments = raw_ibd_segments.rename(columns={'seg_cm': 'cM_length'})
if 'id1' not in raw_ibd_segments.columns and 'indiv1' in raw_ibd_segments.columns:
    raw_ibd_segments = raw_ibd_segments.rename(columns={'indiv1': 'id1', 'indiv2': 'id2'})

# Process IBD data with appropriate thresholds for real-world analysis
cleaned_ibd_segments = preprocess_ibd_data(
    raw_ibd_segments,
    min_cm_length=7.0,         # Standard threshold for meaningful IBD segments
    max_cm_length=300.0,       # Filter out extremely long segments that might be errors
    remove_duplicates=True
)

# Display the cleaned data
display(cleaned_ibd_segments.head())
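
On large segment tables, computing the sorted pair row by row can dominate preprocessing time. A column-wise alternative is sketched below; the helper name `add_canonical_pair` and the columns `pair_a` and `pair_b` are illustrative, and the IDs are compared as strings, which is an assumption about how they are stored.

```python
import numpy as np
import pandas as pd

def add_canonical_pair(df):
    """Return a copy with pair_a <= pair_b for every row, computed column-wise."""
    id1 = df['id1'].astype(str)
    id2 = df['id2'].astype(str)
    swap = (id1 > id2).to_numpy()  # True where the two IDs must be exchanged
    return df.assign(
        pair_a=np.where(swap, id2.to_numpy(), id1.to_numpy()),
        pair_b=np.where(swap, id1.to_numpy(), id2.to_numpy()),
    )

# Made-up rows: the first two describe the same segment with the IDs reversed
example = pd.DataFrame({
    'id1': ['B', 'A', 'A'],
    'id2': ['A', 'B', 'C'],
    'chromosome': [1, 1, 2],
    'start_position': [100, 100, 500],
    'end_position': [900, 900, 800],
})
keyed = add_canonical_pair(example)
deduplicated = keyed.drop_duplicates(
    subset=['pair_a', 'pair_b', 'chromosome', 'start_position', 'end_position']
)
print(len(example), '->', len(deduplicated))
```

Keeping `pair_a` and `pair_b` in the table is also convenient later, since any per-pair grouping can use them directly.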

### 1.2 IBD Segment Statistics and Quality Assessment

Let's analyze the distribution of IBD segments to assess data quality and identify potential issues.

In [None]:
def analyze_ibd_distribution(ibd_df):
    """Analyze the distribution of IBD segments to identify potential issues.
    
    Args:
        ibd_df: DataFrame containing cleaned IBD segments
    """
    # Create a figure with multiple subplots
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # 1. Distribution of segment lengths
    ax1 = axes[0, 0]
    sns.histplot(ibd_df['cM_length'], bins=50, kde=True, ax=ax1)
    ax1.set_title('Distribution of IBD Segment Lengths')
    ax1.set_xlabel('Length (cM)')
    ax1.set_ylabel('Count')
    
    # 2. Distribution by chromosome
    ax2 = axes[0, 1]
    if 'chromosome' in ibd_df.columns:
        chr_counts = ibd_df['chromosome'].value_counts().sort_index()
        chr_counts.plot(kind='bar', ax=ax2)
        ax2.set_title('IBD Segments by Chromosome')
        ax2.set_xlabel('Chromosome')
        ax2.set_ylabel('Count')
    else:
        ax2.text(0.5, 0.5, 'Chromosome data not available', ha='center', va='center')
    
    # 3. Number of segments per individual
    ax3 = axes[1, 0]
    # Combine id1 and id2 to count segments per individual
    all_ids = pd.concat([ibd_df['id1'], ibd_df['id2']])
    id_counts = all_ids.value_counts()
    sns.histplot(id_counts, bins=30, kde=True, ax=ax3)
    ax3.set_title('Distribution of IBD Segments per Individual')
    ax3.set_xlabel('Number of IBD Segments')
    ax3.set_ylabel('Count of Individuals')
    
    # 4. Average segment length by individual
    ax4 = axes[1, 1]
    
    # Create a dictionary to store segments for each individual
    individual_segments = defaultdict(list)
    
    # Populate the dictionary with segment lengths
    for _, row in ibd_df.iterrows():
        individual_segments[row['id1']].append(row['cM_length'])
        individual_segments[row['id2']].append(row['cM_length'])
    
    # Calculate average segment length for each individual
    avg_lengths = {ind: np.mean(lengths) for ind, lengths in individual_segments.items()}
    
    # Plot the distribution of average segment lengths
    sns.histplot(list(avg_lengths.values()), bins=30, kde=True, ax=ax4)
    ax4.set_title('Average IBD Segment Length per Individual')
    ax4.set_xlabel('Average Length (cM)')
    ax4.set_ylabel('Count of Individuals')
    
    plt.tight_layout()
    plt.show()
    
    # Print summary statistics
    print("\nSummary Statistics:")
    print(f"Total IBD segments: {len(ibd_df)}")
    print(f"Unique individuals: {len(set(all_ids))}")
    print(f"Average segments per individual: {all_ids.value_counts().mean():.2f}")
    print(f"Average segment length: {ibd_df['cM_length'].mean():.2f} cM")
    print(f"Median segment length: {ibd_df['cM_length'].median():.2f} cM")
    print(f"Longest segment: {ibd_df['cM_length'].max():.2f} cM")
    
    # Identify potential outliers or data quality issues
    print("\nPotential Data Quality Issues:")
    
    # Very long segments (potential errors or close relationships)
    long_threshold = ibd_df['cM_length'].quantile(0.99)
    long_segments = ibd_df[ibd_df['cM_length'] > long_threshold]
    print(f"Unusually long segments (>{long_threshold:.2f} cM): {len(long_segments)}")
    
    # Individuals with unusually many segments (potential quality issues or highly connected)
    many_segments_threshold = id_counts.quantile(0.95)
    individuals_many_segments = id_counts[id_counts > many_segments_threshold]
    print(f"Individuals with unusually many segments (>{many_segments_threshold:.0f}): {len(individuals_many_segments)}")

In [None]:
# Analyze the distribution of IBD segments
analyze_ibd_distribution(cleaned_ibd_segments)
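
The plots above summarize segments and individuals, but relationships are decided per pair. A small sketch of a per-pair summary follows, producing the same three quantities that the relationship heuristic in Section 2.1 uses (total cM, segment count, longest segment). The function name `summarize_pairs` is illustrative, and the grouping assumes each pair always appears with the same `id1`/`id2` order.

```python
import pandas as pd

def summarize_pairs(ibd_df):
    """Aggregate segments into one row per pair (assumes a consistent id order)."""
    return (
        ibd_df.groupby(['id1', 'id2'])['cM_length']
        .agg(total_cm='sum', segment_count='count', longest_segment='max')
        .reset_index()
        .sort_values('total_cm', ascending=False)
        .reset_index(drop=True)
    )

# Made-up segments, used only to show the output shape
example_ibd = pd.DataFrame({
    'id1': ['A', 'A', 'A'],
    'id2': ['B', 'B', 'C'],
    'cM_length': [10.0, 20.0, 8.5],
})
pair_summary = summarize_pairs(example_ibd)
print(pair_summary)
```

Sorting by `total_cm` puts the closest candidate relatives first, which makes implausible pairs easy to spot by eye.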

## 2. Adapting Bonsai for Real-World Challenges

Now that we've explored the characteristics of real-world IBD data, let's adapt our Bonsai framework to handle common challenges:

### 2.1 Handling Incomplete Data

Real-world datasets often have incomplete data. Let's implement strategies to handle missing or incomplete information during pedigree reconstruction:

In [None]:
class RobustPedigreeReconstructor:
    """A pedigree reconstructor designed to handle incomplete or noisy data."""
    
    def __init__(self, ibd_segments, min_confidence=0.7):
        """Initialize the reconstructor.
        
        Args:
            ibd_segments: DataFrame of IBD segments
            min_confidence: Minimum confidence threshold for making relationship inferences
        """
        self.ibd_segments = ibd_segments
        self.min_confidence = min_confidence
        self.pedigree = Pedigree()
        self.relationship_confidences = defaultdict(float)  # Store confidence for each inferred relationship
        
    def calculate_relationship_likelihood(self, ind1, ind2):
        """Calculate likelihood of different relationships based on IBD patterns.
        
        Args:
            ind1, ind2: IDs of the two individuals
            
        Returns:
            Dictionary mapping relationship types to confidence scores
        """
        # Filter segments between these two individuals
        segments = self.ibd_segments[
            ((self.ibd_segments['id1'] == ind1) & (self.ibd_segments['id2'] == ind2)) |
            ((self.ibd_segments['id1'] == ind2) & (self.ibd_segments['id2'] == ind1))
        ]
        
        if len(segments) == 0:
            return {'no_relationship': 1.0}
        
        # Calculate key metrics
        total_cm = segments['cM_length'].sum()
        segment_count = len(segments)
        longest_segment = segments['cM_length'].max() if not segments.empty else 0
        
        # Simple heuristic rules for relationship inference
        # In a real implementation, this would use more sophisticated statistical models
        # trained on real data distributions for different relationship types
        
        likelihoods = {
            'parent-child': 0.0,
            'sibling': 0.0,
            'half-sibling': 0.0,
            'first-cousin': 0.0,
            'second-cousin': 0.0,
            'distant': 0.0,
            'no_relationship': 0.0
        }
        
        # Parent-child: typically share ~3400-3800 cM, few long segments spanning entire chromosomes
        if total_cm > 3300 and longest_segment > 180:
            likelihoods['parent-child'] = 0.95
        
        # Siblings: typically share ~2400-2800 cM, many segments of varied lengths
        elif total_cm > 2300 and segment_count > 30:
            likelihoods['sibling'] = 0.90
        
        # Half-siblings: typically share ~1700-2200 cM
        # (the bounds from here down are contiguous, so no total between
        # 180 and 2300 cM is left unclassified)
        elif 1600 < total_cm <= 2300:
            likelihoods['half-sibling'] = 0.85
        
        # First cousins: typically share ~800-1300 cM
        elif 700 < total_cm <= 1600:
            likelihoods['first-cousin'] = 0.80
        
        # Second cousins: typically share ~200-600 cM
        elif 180 < total_cm <= 700:
            likelihoods['second-cousin'] = 0.75
        
        # Distant relationships
        elif total_cm > 40:
            likelihoods['distant'] = 0.70
        
        # No clear relationship
        else:
            likelihoods['no_relationship'] = 0.60
        
        # Keep the raw scores. Each rule sets a single nonzero value, so dividing
        # by the maximum would turn every score into 1.0 and make both the
        # min_confidence threshold and sorting by confidence meaningless.
        
        return likelihoods
    
    def infer_relationships(self):
        """Infer relationships between all pairs of individuals."""
        # Get unique individuals
        all_individuals = set(self.ibd_segments['id1']).union(set(self.ibd_segments['id2']))
        print(f"Inferring relationships among {len(all_individuals)} individuals")
        
        # Initialize progress tracking
        total_pairs = len(all_individuals) * (len(all_individuals) - 1) // 2
        processed = 0
        
        # Store inferred relationships
        self.inferred_relationships = []
        
        # Iterate through all pairs
        for ind1, ind2 in itertools.combinations(all_individuals, 2):
            # Calculate relationship likelihoods
            likelihoods = self.calculate_relationship_likelihood(ind1, ind2)
            
            # Get most likely relationship
            most_likely_rel = max(likelihoods.items(), key=lambda x: x[1])
            relationship, confidence = most_likely_rel
            
            # Store if confidence exceeds threshold and it's not 'no_relationship'
            if confidence >= self.min_confidence and relationship != 'no_relationship':
                self.inferred_relationships.append({
                    'id1': ind1,
                    'id2': ind2,
                    'relationship': relationship,
                    'confidence': confidence
                })
            
            # Update progress (in a real implementation, use a proper progress bar)
            processed += 1
            if processed % 1000 == 0 or processed == total_pairs:
                print(f"Processed {processed}/{total_pairs} pairs ({(processed/total_pairs*100):.1f}%)")
        
        # Convert to DataFrame for easier analysis
        self.relationship_df = pd.DataFrame(self.inferred_relationships)
        print(f"Inferred {len(self.relationship_df)} relationships with confidence >= {self.min_confidence}")
        
        return self.relationship_df
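
The heuristic above assigns one hard label per pair. As a rough illustration of what a softer statistical score could look like, the sketch below places a Gaussian over total shared cM for each relationship and normalizes the results. The ranges are copied from the comments in `calculate_relationship_likelihood`; treating each range as the mean plus or minus two standard deviations is an assumption made purely for illustration, and this is not Bonsai's likelihood model.

```python
import math

# Ranges copied from the comments in calculate_relationship_likelihood; reading each
# range as mean +/- 2 standard deviations is an illustrative assumption
TYPICAL_CM_RANGES = {
    'parent-child': (3400, 3800),
    'sibling': (2400, 2800),
    'half-sibling': (1700, 2200),
    'first-cousin': (800, 1300),
    'second-cousin': (200, 600),
}

def soft_relationship_scores(total_cm):
    """Return scores that sum to 1, from Gaussian log-densities over total shared cM."""
    log_density = {}
    for relationship, (low, high) in TYPICAL_CM_RANGES.items():
        mean = (low + high) / 2
        sd = (high - low) / 4
        z = (total_cm - mean) / sd
        # Log of the normal density, dropping the constant shared by all relationships
        log_density[relationship] = -0.5 * z * z - math.log(sd)
    # Subtract the maximum before exponentiating to avoid underflow (log-sum-exp trick)
    peak = max(log_density.values())
    weights = {rel: math.exp(value - peak) for rel, value in log_density.items()}
    total = sum(weights.values())
    return {rel: weight / total for rel, weight in weights.items()}

scores = soft_relationship_scores(2000)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Unlike the rule-based version, these scores vary smoothly with `total_cm`, so a `min_confidence` threshold separates clear cases from pairs that sit between two relationship types.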

In [None]:
# Create a reduced dataset for faster demonstration
# In a real application, you would use the full dataset
if len(cleaned_ibd_segments) > 1000:
    # Get the top 20 individuals with the most IBD segments for a meaningful subset
    top_individuals = pd.concat([cleaned_ibd_segments['id1'], cleaned_ibd_segments['id2']])\
                        .value_counts().head(20).index.tolist()
    
    # Filter segments to only include these individuals
    demo_segments = cleaned_ibd_segments[
        (cleaned_ibd_segments['id1'].isin(top_individuals)) & 
        (cleaned_ibd_segments['id2'].isin(top_individuals))
    ]
    print(f"Created demonstration dataset with {len(demo_segments)} segments among {len(top_individuals)} individuals")
else:
    demo_segments = cleaned_ibd_segments
    print(f"Using full dataset with {len(demo_segments)} segments")

# Create and run the robust reconstructor
reconstructor = RobustPedigreeReconstructor(demo_segments, min_confidence=0.7)
relationship_df = reconstructor.infer_relationships()

# Display the inferred relationships
if not relationship_df.empty:
    display(relationship_df.sort_values('confidence', ascending=False).head(10))
else:
    print("No relationships were inferred with sufficient confidence.")

### 2.2 Dealing with Conflicting Evidence

In real-world data, we often encounter conflicting evidence about relationships. Let's implement a strategy to resolve conflicts:

In [None]:
def resolve_relationship_conflicts(relationship_df):
    """Resolve conflicts in inferred relationships.
    
    Args:
        relationship_df: DataFrame of inferred relationships
        
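    Returns:
        Tuple of (DataFrame with conflicts resolved, DataFrame describing the conflicts)
    """
    if relationship_df.empty:
        # Return the same two-element shape as the non-empty case
        return relationship_df, pd.DataFrame()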
    
    print("Checking for conflicts in relationship inferences...")
    
    # Create a copy to avoid modifying the original
    resolved_df = relationship_df.copy()
    
    # Get unique individuals
    all_individuals = set(relationship_df['id1']).union(set(relationship_df['id2']))
    
    # List of conflict records (one dictionary per removed relationship)
    conflicts = []
    
    # Check for logical inconsistencies
    for ind in all_individuals:
        # Get all relationships involving this individual
        ind_rels = relationship_df[
            (relationship_df['id1'] == ind) | (relationship_df['id2'] == ind)
        ]
        
        if len(ind_rels) < 2:
            continue  # Need at least 2 relationships to have a conflict
        
        # Check for parent-child relationship conflicts
        parent_child_rels = ind_rels[ind_rels['relationship'] == 'parent-child']
        
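        # Example conflict: person has more than 2 parents
        # Note: these relationships are undirected, so the count also includes the
        # individual's children; a real implementation would orient each pair first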
        if len(parent_child_rels) > 2:
            print(f"Conflict detected: Individual {ind} has more than 2 parent-child relationships")
            
            # Resolve by keeping the 2 highest-confidence relationships
            sorted_rels = parent_child_rels.sort_values('confidence', ascending=False)
            to_keep = sorted_rels.iloc[:2].index
            to_remove = sorted_rels.iloc[2:].index
            
            # Log conflicts
            for idx in to_remove:
                conflicts.append({
                    'individual': ind, 
                    'relationship_index': idx,
                    'conflict_type': 'too_many_parents',
                    'resolution': 'removed_lower_confidence'
                })
            
            # Update the resolved dataframe
            resolved_df.loc[to_remove, 'relationship'] = 'conflicted_removed'
    
    # Check for sibling relationship conflicts
    # Similar conflict resolution logic would be implemented here
    
    # Check for generational conflicts (e.g., someone can't be both your sibling and your grandparent)
    # Complex conflict resolution logic would be implemented here
    
    # Remove conflicted relationships
    resolved_df = resolved_df[resolved_df['relationship'] != 'conflicted_removed']
    
    # Report conflicts
    conflicts_df = pd.DataFrame(conflicts) if conflicts else pd.DataFrame()
    print(f"Resolved {len(conflicts)} conflicts")
    
    return resolved_df, conflicts_df
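
The sibling check is left as a placeholder above. One simple rule it could apply: full siblings share both parents, so the combined set of known parents for a sibling pair can never contain more than two people. The sketch below implements only that rule; the function name and the `known_parents` input (a dictionary from individual ID to a set of parent IDs) are hypothetical and assume parent-child pairs have already been oriented.

```python
def find_sibling_parent_conflicts(sibling_pairs, known_parents):
    """Flag full-sibling pairs whose combined known parents exceed two people.

    sibling_pairs: iterable of (id_a, id_b) tuples inferred as full siblings
    known_parents: dict mapping an individual ID to a set of parent IDs
    """
    conflicts = []
    for id_a, id_b in sibling_pairs:
        parents_a = known_parents.get(id_a, set())
        parents_b = known_parents.get(id_b, set())
        # Full siblings share both parents, so the union can hold at most two IDs
        if len(parents_a | parents_b) > 2:
            conflicts.append((id_a, id_b))
    return conflicts

# Made-up example: kid3 has a parent that kid1 does not
example_parents = {
    'kid1': {'dad', 'mom'},
    'kid2': {'dad', 'mom'},
    'kid3': {'dad', 'other'},
}
example_pairs = [('kid1', 'kid2'), ('kid1', 'kid3')]
print(find_sibling_parent_conflicts(example_pairs, example_parents))
```

A flagged pair does not say which inference is wrong; it could be the sibling label (the pair may be half-siblings) or one of the parent assignments, so the confidence scores would still be needed to choose what to drop.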

In [None]:
# Resolve conflicts in our inferred relationships
if not relationship_df.empty:
    resolved_relationships, conflicts = resolve_relationship_conflicts(relationship_df)
    
    print(f"\nRelationships after conflict resolution: {len(resolved_relationships)}")
    display(resolved_relationships.head())
    
    # Display conflicts if any
    if not conflicts.empty:
        print("\nConflicts detected and resolved:")
        display(conflicts)
else:
    print("No relationships to check for conflicts.")

## 3. Building Pedigrees from Real-World Data

Now let's apply our robust framework to construct pedigrees from real-world IBD data:

In [None]:
class RealWorldPedigreeBuilder:
    """Builds pedigrees from real-world relationship inferences."""
    
    def __init__(self, relationship_df):
        """Initialize the pedigree builder.
        
        Args:
            relationship_df: DataFrame of inferred relationships
        """
        self.relationship_df = relationship_df
        self.pedigree = Pedigree()
    
    def build_pedigree(self, method='greedy'):
        """Build a pedigree using the specified method.
        
        Args:
            method: Reconstruction method to use ('greedy', 'simulated_annealing', etc.)
            
        Returns:
            Constructed Pedigree object
        """
        if self.relationship_df.empty:
            print("No relationships provided. Cannot build pedigree.")
            return self.pedigree
        
        print(f"Building pedigree using {method} method...")
        
        # Get all individuals
        all_individuals = set(self.relationship_df['id1']).union(set(self.relationship_df['id2']))
        print(f"Creating pedigree with {len(all_individuals)} individuals")
        
        # Add all individuals to the pedigree
        for ind_id in all_individuals:
            individual = Individual(id_str=ind_id)
            self.pedigree.add_individual(individual)
        
        # Sort relationships by confidence (highest first)
        sorted_rels = self.relationship_df.sort_values('confidence', ascending=False)
        
        if method == 'greedy':
            self._greedy_reconstruction(sorted_rels)
        elif method == 'simulated_annealing':
            self._simulated_annealing_reconstruction(sorted_rels)
        else:
            print(f"Method {method} not implemented. Using greedy reconstruction.")
            self._greedy_reconstruction(sorted_rels)
        
        print(f"Pedigree construction complete. Added {len(self.pedigree.graph.edges)} relationships.")
        return self.pedigree
    
    def _greedy_reconstruction(self, sorted_rels):
        """Greedily add relationships to the pedigree based on confidence."""
        # Track used relationships to avoid conflicts
        used_relationships = set()
        
        # Keep track of parents for each individual
        parents = {ind_id: [] for ind_id in self.pedigree.individuals}
        
        # Process relationships in order of confidence
        for _, rel in sorted_rels.iterrows():
            id1, id2 = rel['id1'], rel['id2']
            relationship = rel['relationship']
            
            # Skip if this pair has already been used (in either order)
            if (id1, id2) in used_relationships or (id2, id1) in used_relationships:
                continue
            
            # Handle parent-child relationships
            if relationship == 'parent-child':
                # Determine which is parent and which is child
                # In a real implementation, this would use additional evidence like age
                # For demo, we'll just assign id1 as parent and id2 as child
                parent_id, child_id = id1, id2
                
                # Check if child already has 2 parents
                if len(parents[child_id]) >= 2:
                    continue
                
                # Add parent-child relationship
                parents[child_id].append(parent_id)
                
                # Update individual objects and graph
                individual = self.pedigree.individuals[child_id]
                
                # If this is the first parent, assign as father (in real world would use evidence)
                if len(parents[child_id]) == 1:
                    individual.father_id = parent_id
                    # Record the sex on the parent, not the child (assume father for simplicity)
                    self.pedigree.individuals[parent_id].sex = 'M'
                # If this is the second parent, assign as mother
                else:
                    individual.mother_id = parent_id
                    # Update the sex of the parent (for simplicity)
                    parent_individual = self.pedigree.individuals[parent_id]
                    parent_individual.sex = 'F'  # Assume mother for simplicity
                
                # Add edge to the graph
                self.pedigree.graph.add_edge(parent_id, child_id)
                
                # Mark as used
                used_relationships.add((id1, id2))
            
            # Handle sibling relationships
            elif relationship == 'sibling':
                # In a real implementation, would ensure both share parents
                # For simplicity, just note the sibling relationship without modifying the graph
                used_relationships.add((id1, id2))
            
            # Other relationship types would have specific handling here
    
    def _simulated_annealing_reconstruction(self, sorted_rels):
        """Use simulated annealing to find optimal pedigree structure."""
        # In a real implementation, this would use a more sophisticated approach
        # For this lab, we'll use a simplified placeholder implementation
        print("Simulated annealing reconstruction (simplified version)")
        self._greedy_reconstruction(sorted_rels)
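
The greedy builder assigns `id1` as the parent arbitrarily and notes that real implementations use evidence such as age. A minimal sketch of that idea follows, assuming a hypothetical `birth_years` dictionary is available; the function name and the `min_parent_age` default of 12 years are illustrative choices, not values from Bonsai.

```python
def orient_parent_child(id1, id2, birth_years, min_parent_age=12):
    """Return (parent_id, child_id), or None when the ages cannot decide."""
    year1 = birth_years.get(id1)
    year2 = birth_years.get(id2)
    if year1 is None or year2 is None:
        return None  # missing age information for at least one individual
    if year2 - year1 >= min_parent_age:
        return id1, id2  # id1 is old enough to be id2's parent
    if year1 - year2 >= min_parent_age:
        return id2, id1  # id2 is old enough to be id1's parent
    return None  # age gap too small for a parent-child relationship

# Made-up birth years for illustration
example_birth_years = {'anna': 1950, 'ben': 1978, 'cara': 1980}
print(orient_parent_child('ben', 'anna', example_birth_years))
print(orient_parent_child('ben', 'cara', example_birth_years))
```

Returning `None` instead of guessing lets the caller skip the edge or fall back to other evidence, which avoids introducing cycles through arbitrary orientation.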

In [None]:
# Build a pedigree from our resolved relationships
if 'resolved_relationships' in locals() and not resolved_relationships.empty:
    # Create the pedigree builder
    pedigree_builder = RealWorldPedigreeBuilder(resolved_relationships)
    
    # Build the pedigree using greedy reconstruction
    pedigree = pedigree_builder.build_pedigree(method='greedy')
    
    # Visualize the resulting pedigree
    try:
        pedigree.visualize(figsize=(15, 10))
    except Exception as e:
        print(f"Error visualizing pedigree: {e}")
        print("Pedigree structure:")
        print(f"- Individuals: {len(pedigree.individuals)}")
        print(f"- Relationships: {len(pedigree.graph.edges)}")
        
        # Show the first few relationships as text
        print("\nSample relationships:")
        for i, edge in enumerate(list(pedigree.graph.edges())[:10]):
            print(f"{edge[0]} -> {edge[1]}")
else:
    print("No resolved relationships available for building a pedigree.")

## 4. Evaluating Real-World Pedigree Reconstruction

In real-world applications, we often don't have the ground truth to evaluate our reconstructed pedigrees. Let's explore methods for evaluating reconstruction quality without ground truth:

In [None]:
def evaluate_pedigree_without_ground_truth(pedigree, ibd_segments):
    """Evaluate a reconstructed pedigree without access to ground truth.
    
    Args:
        pedigree: Reconstructed Pedigree object
        ibd_segments: DataFrame of IBD segments used for reconstruction
        
    Returns:
        Dictionary of evaluation metrics
    """
    # Initialize metrics dictionary
    metrics = {}
    
    # 1. Structural metrics
    graph = pedigree.graph
    metrics['node_count'] = len(graph.nodes)
    metrics['edge_count'] = len(graph.edges)
    metrics['connected_components'] = nx.number_connected_components(graph.to_undirected())
    
    # Calculate density as ratio of actual edges to maximum possible edges
    n = metrics['node_count']
    metrics['density'] = metrics['edge_count'] / (n * (n - 1) / 2) if n > 1 else 0
    
    # Check for cycles, which shouldn't exist in a valid pedigree
    try:
        cycles = list(nx.simple_cycles(graph))
        metrics['has_cycles'] = len(cycles) > 0
        metrics['cycle_count'] = len(cycles)
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        metrics['has_cycles'] = "Error checking cycles"
        metrics['cycle_count'] = "Unknown"
    
    # 2. IBD consistency metrics
    # Check if the pedigree structure is consistent with IBD patterns
    
    # Count relationships in the pedigree that are supported by IBD segments
    ibd_consistent = 0
    ibd_inconsistent = 0
    
    # Get unique pairs of individuals with IBD segments
    ibd_pairs = set()
    for _, row in ibd_segments.iterrows():
        id1, id2 = row['id1'], row['id2']
        ibd_pairs.add((id1, id2) if id1 < id2 else (id2, id1))
    
    # Check if related individuals in the pedigree share IBD segments
    for edge in graph.edges:
        parent, child = edge
        edge_pair = (parent, child) if parent < child else (child, parent)
        
        if edge_pair in ibd_pairs:
            ibd_consistent += 1
        else:
            ibd_inconsistent += 1
    
    metrics['ibd_consistent_relationships'] = ibd_consistent
    metrics['ibd_inconsistent_relationships'] = ibd_inconsistent
    metrics['ibd_consistency_ratio'] = ibd_consistent / len(graph.edges) if len(graph.edges) > 0 else 0
    
    # 3. Biological plausibility metrics
    
    # Check if each individual has at most two parents.
    # In a parent -> child directed graph, in-degree is the number of parents.
    invalid_parent_count = sum(1 for _, in_deg in graph.in_degree() if in_deg > 2)
    
    metrics['invalid_parent_count'] = invalid_parent_count
    
    # 4. Cross-validation
    # Not implemented here. A real implementation would include methods to:
    # - Split IBD data
    # - Reconstruct on partial data
    # - Validate consistency across reconstructions
    
    # Return evaluation metrics
    return metrics

In [None]:
# Evaluate the reconstructed pedigree
if 'pedigree' in locals() and 'demo_segments' in locals():
    evaluation_metrics = evaluate_pedigree_without_ground_truth(pedigree, demo_segments)
    
    # Display metrics
    print("Pedigree Evaluation Metrics:")
    for metric, value in evaluation_metrics.items():
        print(f"{metric}: {value}")
    
    # Visualize key metrics
    fig, ax = plt.subplots(figsize=(10, 6))
    
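    # Select numeric metrics for visualization. bool is a subclass of int,
    # so exclude it explicitly to keep flags such as has_cycles off the chart.
    numeric_metrics = {k: v for k, v in evaluation_metrics.items()
                       if isinstance(v, (int, float)) and not isinstance(v, bool)
                       and k not in ('node_count', 'edge_count')}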
    
    # Bar chart
    bars = ax.bar(numeric_metrics.keys(), numeric_metrics.values())
    
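    # Add value labels. Bar heights are always floats, so show whole numbers
    # without decimals and everything else with two decimals.
    for bar in bars:
        height = bar.get_height()
        label = f'{int(height)}' if float(height).is_integer() else f'{height:.2f}'
        ax.text(bar.get_x() + bar.get_width()/2., height, label,
                ha='center', va='bottom', rotation=0)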
    
    ax.set_title('Pedigree Evaluation Metrics')
    ax.set_ylabel('Value')
    # bar() has already set the category labels; only rotate and align them
    plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
    
    plt.tight_layout()
    plt.show()
else:
    print("No pedigree or IBD segments available for evaluation.")
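
The cross-validation step left as a comment in `evaluate_pedigree_without_ground_truth` could be approached along the following lines. This is only a minimal sketch: `edge_stability` and `toy_reconstruct` are hypothetical names, and the reconstruction step is passed in as a callable because the real Bonsai reconstruction routine is not assumed here. The idea is to reconstruct repeatedly on random subsets of the IBD pairs and report how often each edge reappears.

```python
import random
from typing import Callable, Dict, Hashable, List, Set, Tuple

Pair = Tuple[Hashable, Hashable]


def edge_stability(
    ibd_pairs: List[Pair],
    reconstruct_fn: Callable[[List[Pair]], Set[Pair]],
    n_rounds: int = 5,
    keep_fraction: float = 0.8,
    seed: int = 42,
) -> Dict[Pair, float]:
    """Return, for each edge, the fraction of subsampled runs containing it."""
    rng = random.Random(seed)
    n_keep = max(1, int(len(ibd_pairs) * keep_fraction))
    counts: Dict[Pair, int] = {}
    for _ in range(n_rounds):
        # Reconstruct on a random subset of the IBD evidence
        subset = rng.sample(ibd_pairs, n_keep)
        for edge in reconstruct_fn(subset):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: count / n_rounds for edge, count in counts.items()}


def toy_reconstruct(pairs: List[Pair]) -> Set[Pair]:
    # Placeholder (assumption): treats every IBD pair as a pedigree edge.
    # Swap in the actual reconstruction routine in practice.
    return {tuple(sorted(pair)) for pair in pairs}


pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "E")]
stability = edge_stability(pairs, toy_reconstruct, n_rounds=10, keep_fraction=0.8)
for edge, score in sorted(stability.items()):
    print(edge, f"{score:.2f}")
```

Edges with low stability scores depend on a small part of the evidence and are natural candidates for manual review.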

## 5. Case Study: Success Stories and Challenges

Let's examine a few real-world scenarios where pedigree reconstruction has been successful or challenging.

### 5.1 Success Story: Identifying Unknown Relationships

One common application of computational genetic genealogy is identifying unknown biological relationships. Let's consider a simplified case study:

#### Case Study: Locating Biological Family

In this scenario, an adoptee is searching for biological family members. They have:
- Their own DNA test results
- Access to a database of genetic matches
- No prior knowledge of their biological family

The approach would typically involve:
1. Identifying close genetic matches through IBD detection
2. Using the Bonsai algorithm to reconstruct a plausible pedigree
3. Combining the genetic evidence with available non-genetic information (age, location, etc.)
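
Step 1 above can be sketched as a simple aggregation over an IBD segment table. The `id1` and `id2` columns follow the segment DataFrame used earlier in this lab; the `length_cm` column name, the `summarize_matches` helper, and the cutoff value are assumptions for illustration, not calibrated settings.

```python
import pandas as pd


def summarize_matches(segments: pd.DataFrame, target_id: str,
                      min_total_cm: float = 200.0) -> pd.DataFrame:
    """Rank a target's matches by total shared IBD length."""
    # Keep only segments that involve the target individual
    involved = segments[(segments["id1"] == target_id) |
                        (segments["id2"] == target_id)].copy()
    # The match is whichever side of the pair is not the target
    involved["match_id"] = involved["id2"].where(
        involved["id1"] == target_id, involved["id1"])
    summary = (involved.groupby("match_id")["length_cm"]
               .agg(total_cm="sum", n_segments="count", longest_cm="max")
               .reset_index())
    # Illustrative cutoff only; tune it for the dataset at hand
    summary = summary[summary["total_cm"] >= min_total_cm]
    return summary.sort_values("total_cm", ascending=False).reset_index(drop=True)


demo = pd.DataFrame({
    "id1": ["adoptee", "adoptee", "m2", "adoptee", "m3"],
    "id2": ["m1", "m1", "adoptee", "m3", "m1"],
    "length_cm": [120.0, 95.5, 60.0, 8.0, 300.0],
})
matches = summarize_matches(demo, "adoptee", min_total_cm=50.0)
print(matches)
```

The resulting shortlist is what would be handed to the pedigree reconstruction step, together with any non-genetic information available for those matches.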

Success factors in such cases include:
- Multiple close relatives in the database (2nd cousins or closer)
- High-quality IBD detection with minimal errors
- Robust pedigree reconstruction that handles missing individuals
- Integration of non-genetic data to resolve ambiguities

### 5.2 Challenge: Endogamy and Complex Pedigrees

One of the most significant challenges in genetic genealogy is handling endogamy, where individuals marry within a closed community over generations, leading to complex and interrelated pedigrees.

#### Challenge: Reconstructing Endogamous Pedigrees

In endogamous populations, the standard assumptions about IBD sharing break down because:
- Individuals share multiple recent common ancestors
- IBD segments may come from multiple ancestral paths
- Predicted relationships are often closer than the true ones, because sharing through several ancestral paths inflates total IBD
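
The effect of multiple ancestral paths can be illustrated with the coefficient of relationship, which (assuming the common ancestors are themselves not inbred) sums (1/2)^L over every path of L meioses connecting two individuals through a common ancestor. The helper name below is hypothetical; this is a small worked example, not part of the Bonsai code.

```python
def coefficient_of_relationship(path_lengths):
    """Sum (1/2)**L over ancestral paths (assumes non-inbred common ancestors)."""
    return sum(0.5 ** length for length in path_lengths)


# First cousins: two shared grandparents, each path has 4 meioses
first_cousins = coefficient_of_relationship([4, 4])
# Double first cousins: four shared grandparents
double_first_cousins = coefficient_of_relationship([4, 4, 4, 4])
# Half siblings: one shared parent, path of 2 meioses
half_siblings = coefficient_of_relationship([2])

print(first_cousins, double_first_cousins, half_siblings)
```

Double first cousins have the same expected sharing as half siblings, so a predictor that relies on total IBD alone will tend to place such pairs too close together.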

Approaches to address endogamy include:
1. Adjusting IBD thresholds based on population-specific models
2. Using more sophisticated likelihood models that account for endogamy
3. Incorporating historical records and documentary evidence alongside genetics
4. Focusing on identifying specific branches rather than the entire pedigree
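
As a rough sketch of the first approach, total sharing per pair can be recomputed after discarding segments below a population-specific minimum length, since short segments are the ones most likely to reflect distant background relatedness. The `length_cm` column name, the `total_ibd_by_pair` helper, and both threshold values are assumptions chosen for illustration.

```python
from collections import defaultdict

import pandas as pd


def total_ibd_by_pair(segments: pd.DataFrame, min_segment_cm: float) -> dict:
    """Total shared cM per unordered pair, ignoring segments below the threshold."""
    kept = segments[segments["length_cm"] >= min_segment_cm]
    totals = defaultdict(float)
    for id1, id2, length in zip(kept["id1"], kept["id2"], kept["length_cm"]):
        # Sort the ids so (a, b) and (b, a) count as the same pair
        totals[tuple(sorted((id1, id2)))] += length
    return dict(totals)


demo = pd.DataFrame({
    "id1": ["A", "A", "B", "A", "B", "C"],
    "id2": ["B", "B", "A", "B", "C", "B"],
    "length_cm": [40.0, 9.0, 8.0, 7.5, 12.0, 6.0],
})

default_totals = total_ibd_by_pair(demo, min_segment_cm=7.0)   # illustrative default
strict_totals = total_ibd_by_pair(demo, min_segment_cm=10.0)   # illustrative endogamy setting
print(default_totals)
print(strict_totals)
```

Comparing the two totals for each pair gives a quick sense of how much of the apparent relatedness rests on short segments.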

Recent advances in the field have shown promise in handling endogamy by:
- Developing population-specific genetic models
- Using graph theory to identify and handle complex pedigree structures
- Incorporating statistical methods that account for endogamy in relationship prediction

## 6. Future Directions and Ethical Considerations

As computational genetic genealogy continues to advance, several important future directions and ethical considerations emerge:

### 6.1 Future Directions

1. **Integration with other data types:**
   - Combining genetic data with documentary evidence (census, vital records)
   - Incorporating epigenetic data for age estimation
   - Using demographic and historical context to improve reconstructions

2. **Scalability improvements:**
   - Algorithms that can handle millions of individuals
   - Distributed computing approaches for large-scale pedigree reconstruction
   - Efficient data structures for very large pedigrees

3. **Advanced statistical models:**
   - Better modeling of recombination and genetic inheritance
   - Population-specific IBD distribution models
   - Machine learning approaches for relationship prediction
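
One simple step toward the scalability goals above is to partition the data before reconstruction: individuals who share no IBD with each other, directly or through a chain of matches, cannot be linked by IBD evidence alone, so each connected component can be processed independently (and in parallel). The sketch below uses NetworkX; `split_into_components` is a hypothetical helper, not an existing Bonsai function.

```python
import networkx as nx


def split_into_components(ibd_pairs):
    """Group IBD pairs by connected component, largest component first."""
    graph = nx.Graph()
    graph.add_edges_from(ibd_pairs)
    components = sorted(nx.connected_components(graph), key=len, reverse=True)
    # For each component, return its members and the pairs that fall inside it
    return [
        (sorted(nodes), [(a, b) for a, b in ibd_pairs if a in nodes])
        for nodes in components
    ]


demo_pairs = [("A", "B"), ("B", "C"), ("D", "E")]
batches = split_into_components(demo_pairs)
for members, pairs in batches:
    print(members, pairs)
```

Each batch could then be passed to a separate worker, with the resulting sub-pedigrees collected at the end.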

### 6.2 Ethical Considerations

1. **Privacy concerns:**
   - Reconstructed pedigrees can reveal relationships that individuals may not be aware of
   - Genetic privacy extends beyond the tested individual to biological relatives
   - Need for clear consent processes and privacy protections

2. **Unexpected findings:**
   - Misattributed parentage
   - Previously unknown siblings or other close relatives
   - Medical information that could impact family members

3. **Equity and representation:**
   - Most genetic databases have limited diversity
   - Algorithms may perform differently across population groups
   - Need for inclusive approaches that work for all ancestries

## Conclusion

In this lab, we've explored the real-world applications of the Bonsai algorithm for pedigree reconstruction. We've seen how to:

1. Process and clean real genetic genealogy datasets
2. Adapt our algorithms to handle incomplete and noisy data
3. Resolve conflicts in relationship evidence
4. Build and evaluate pedigrees without ground truth
5. Consider success stories, challenges, and ethical implications

As computational methods continue to improve, the field of genetic genealogy will increasingly rely on sophisticated algorithms like Bonsai to make sense of complex genetic relationships. By combining these computational approaches with traditional genealogical research and ethical considerations, we can advance our understanding of human relationships while respecting privacy and promoting equity.

## Exercises

1. **Data Quality Assessment:**
   - Modify the `preprocess_ibd_data` function to include additional quality checks specific to real-world data.
   - Implement a visualization to identify potential errors in IBD segments.

2. **Relationship Inference Improvements:**
   - Enhance the `calculate_relationship_likelihood` function to use more sophisticated criteria for relationship classification.
   - Implement a confusion matrix to evaluate the accuracy of relationship predictions.

3. **Advanced Pedigree Reconstruction:**
   - Implement a more sophisticated version of simulated annealing for pedigree reconstruction.
   - Add support for incorporating known relationships from documentary sources.

4. **Case Study Development:**
   - Choose a specific genealogical scenario and develop a detailed approach using the Bonsai algorithm.
   - Document the challenges and proposed solutions for your chosen scenario.

5. **Ethical Framework:**
   - Develop a set of ethical guidelines for the application of computational genealogy.
   - Create a privacy-preserving version of the Bonsai algorithm that minimizes exposure of sensitive relationship information.