# Lab 3: IBD Data Formats and Preprocessing (ibd.py)

## Overview

This lab focuses on the IBD data formats used in Bonsai v3 and the preprocessing steps that transform raw IBD detector output into usable data structures. You'll learn:

1. The structure and meaning of phased and unphased IBD formats
2. How to convert between different IBD representations
3. How to extract meaningful statistics from IBD segments
4. Key functions in Bonsai v3's `ibd.py` module

In [None]:
# Standard imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import inspect
import importlib
from IPython.display import display, HTML, Markdown

sys.path.append(os.path.dirname(os.getcwd()))

# Cross-compatibility setup
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite, save_results, save_plot

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [None]:
# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
def display_module_functions(module_name):
    """Display functions and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all functions
        functions = inspect.getmembers(module, inspect.isfunction)
        
        # Filter functions defined in this module (not imported)
        functions = [(name, func) for name, func in functions if func.__module__ == module_name]
        
        # Print info for each function
        for name, func in functions:
            if name.startswith('_'):  # Skip private functions
                continue
                
            print(f"\n## {name}")
            
            # Get signature
            sig = inspect.signature(func)
            print(f"Signature: {name}{sig}")
            
            # Get docstring
            doc = inspect.getdoc(func)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def view_function_source(module_name, function_name):
    """Display the source code of a function"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Get the function
        func = getattr(module, function_name)
        
        # Get the source code
        source = inspect.getsource(func)
        
        # Render the source as a Markdown code block
        # (display and Markdown are already imported at the top of the notebook)
        display(Markdown(f"```python\n{source}\n```"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except AttributeError:
        print(f"Function {function_name} not found in module {module_name}")
    except Exception as e:
        print(f"Error processing function {function_name}: {e}")

## Introduction to IBD Data Formats

Bonsai v3 works with two primary IBD data formats:

1. **Unphased Format**: Output from IBD detectors like IBIS, containing segment information without haplotype specificity
2. **Phased Format**: Contains haplotype-specific information about which chromosomal copy contains each IBD segment

Let's examine these formats in detail.

### Unphased IBD Format

The unphased IBD format is the primary input to Bonsai v3 and typically comes from IBD detection tools. It has the following structure:

```
[id1, id2, chromosome, start_bp, end_bp, is_full_ibd, seg_cm]
```

Where:
- `id1`, `id2`: Individual identifiers
- `chromosome`: Chromosome number (1-22)
- `start_bp`, `end_bp`: Segment boundaries in base pairs
- `is_full_ibd`: Flag that is 0 for IBD1 (half-identical: one haplotype shared) and 1 for IBD2 (fully identical: both haplotypes shared)
- `seg_cm`: Segment length in centiMorgans

Let's create a sample dataset in this format:

In [None]:
# Create a sample unphased IBD dataset
def create_sample_unphased_ibd():
    """Create a small sample of unphased IBD segments"""
    np.random.seed(42)  # For reproducibility
    
    # Create sample individuals
    individuals = [f"ind_{i}" for i in range(1, 6)]
    
    # Generate IBD segments
    unphased_segments = []
    
    # For each pair of individuals
    for i in range(len(individuals)):
        for j in range(i+1, len(individuals)):
            id1 = individuals[i]
            id2 = individuals[j]
            
            # Determine number of segments (more for closer relationships)
            if abs(i - j) == 1:  # Adjacent IDs - simulate close relationship
                num_segments = np.random.randint(5, 15)
                full_ibd_prob = 0.2  # Probability of IBD2 segment
            else:  # Non-adjacent IDs - simulate distant relationship
                num_segments = np.random.randint(1, 5)
                full_ibd_prob = 0.05  # Lower probability of IBD2 segment
            
            # Generate segments
            for _ in range(num_segments):
                chromosome = np.random.randint(1, 23)
                
                # Generate segment position (in Mb; the cM length is derived below)
                start_mb = np.random.randint(10, 200)
                length_mb = np.random.randint(1, 20) 
                end_mb = start_mb + length_mb
                
                # Convert to bp
                start_bp = start_mb * 1_000_000
                end_bp = end_mb * 1_000_000
                
                # Approximate cM (assuming 1 cM ≈ 1 Mb on average)
                seg_cm = length_mb * (0.8 + 0.4 * np.random.random())  # Add some variability to the cM/Mb ratio
                
                # Determine if this is a full IBD segment
                is_full_ibd = 1 if np.random.random() < full_ibd_prob else 0
                
                # Create the segment
                segment = [id1, id2, chromosome, start_bp, end_bp, is_full_ibd, seg_cm]
                unphased_segments.append(segment)
    
    return unphased_segments

# Generate sample data
unphased_ibd_segments = create_sample_unphased_ibd()

# Create a DataFrame for easier viewing
unphased_df = pd.DataFrame(unphased_ibd_segments, 
                          columns=['id1', 'id2', 'chromosome', 'start_bp', 'end_bp', 'is_full_ibd', 'seg_cm'])

# Display the first few rows
print(f"Generated {len(unphased_df)} unphased IBD segments")
unphased_df.head(10)
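
Before passing records in this layout to downstream code, it can help to check that each one matches the field descriptions above. The following is a minimal sketch of such a check; `validate_unphased_segment` is a hypothetical helper written for this lab, not a function from Bonsai v3.

```python
def validate_unphased_segment(segment):
    """Raise ValueError if a record does not match the 7-field unphased layout."""
    if len(segment) != 7:
        raise ValueError(f"expected 7 fields, got {len(segment)}")
    id1, id2, chromosome, start_bp, end_bp, is_full_ibd, seg_cm = segment
    if id1 == id2:
        raise ValueError("id1 and id2 must differ")
    if chromosome not in range(1, 23):
        raise ValueError(f"chromosome must be 1-22, got {chromosome}")
    if not 0 <= start_bp < end_bp:
        raise ValueError("need 0 <= start_bp < end_bp")
    if is_full_ibd not in (0, 1):
        raise ValueError("is_full_ibd must be 0 (IBD1) or 1 (IBD2)")
    if seg_cm <= 0:
        raise ValueError("seg_cm must be positive")
    return segment


# A well-formed record passes through unchanged
good = ["ind_1", "ind_2", 3, 10_000_000, 25_000_000, 0, 14.2]
validate_unphased_segment(good)

# A record whose start lies beyond its end is rejected
bad = ["ind_1", "ind_2", 3, 25_000_000, 10_000_000, 0, 14.2]
try:
    validate_unphased_segment(bad)
except ValueError as err:
    print(f"rejected: {err}")
```

The checks mirror the field list only; a real pipeline would likely add detector-specific rules such as a minimum segment length.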

### Phased IBD Format

The phased IBD format contains haplotype-specific information, indicating which copies of the chromosome contain the shared segment. It has the following structure:

```
[id1, id2, hap1, hap2, chromosome, start_cm, end_cm, seg_cm]
```

Where:
- `id1`, `id2`: Individual identifiers
- `hap1`, `hap2`: Haplotype index (0 or 1) within `id1` and `id2`, respectively, on which the segment is shared
- `chromosome`: Chromosome number (1-22)
- `start_cm`, `end_cm`: Segment boundaries in centiMorgans
- `seg_cm`: Segment length in centiMorgans

Let's create a sample dataset in this format as well:

In [None]:
# Create a sample phased IBD dataset
def create_sample_phased_ibd():
    """Create a small sample of phased IBD segments"""
    np.random.seed(43)  # For reproducibility
    
    # Create sample individuals
    individuals = [f"ind_{i}" for i in range(1, 6)]
    
    # Generate IBD segments
    phased_segments = []
    
    # For each pair of individuals
    for i in range(len(individuals)):
        for j in range(i+1, len(individuals)):
            id1 = individuals[i]
            id2 = individuals[j]
            
            # Determine number of segments (more for closer relationships)
            if abs(i - j) == 1:  # Adjacent IDs - simulate close relationship
                num_segments = np.random.randint(10, 20)
            else:  # Non-adjacent IDs - simulate distant relationship
                num_segments = np.random.randint(2, 8)
            
            # Generate segments
            for _ in range(num_segments):
                chromosome = np.random.randint(1, 23)
                
                # Randomly select haplotypes
                hap1 = np.random.randint(0, 2)
                hap2 = np.random.randint(0, 2)
                
                # Generate segment position (in cM)
                start_cm = np.random.randint(10, 150)
                length_cm = np.random.randint(1, 20) 
                end_cm = start_cm + length_cm
                seg_cm = length_cm
                
                # Create the segment
                segment = [id1, id2, hap1, hap2, chromosome, start_cm, end_cm, seg_cm]
                phased_segments.append(segment)
    
    return phased_segments

# Generate sample data
phased_ibd_segments = create_sample_phased_ibd()

# Create a DataFrame for easier viewing
phased_df = pd.DataFrame(phased_ibd_segments, 
                         columns=['id1', 'id2', 'hap1', 'hap2', 'chromosome', 'start_cm', 'end_cm', 'seg_cm'])

# Display the first few rows
print(f"Generated {len(phased_df)} phased IBD segments")
phased_df.head(10)
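
To make the role of the haplotype columns more concrete, here is a small sketch that wraps phased records in a named tuple and tabulates segments per `(hap1, hap2)` combination. The `PhasedSegment` class and the three hand-written rows are illustrative assumptions, not part of Bonsai v3.

```python
from collections import Counter
from typing import NamedTuple


class PhasedSegment(NamedTuple):
    """Hypothetical named wrapper around the 8-field phased layout."""
    id1: str
    id2: str
    hap1: int
    hap2: int
    chromosome: int
    start_cm: float
    end_cm: float
    seg_cm: float


# Hand-written example rows in the phased layout
rows = [
    ["ind_1", "ind_2", 0, 0, 1, 10.0, 25.0, 15.0],
    ["ind_1", "ind_2", 1, 1, 1, 12.0, 20.0, 8.0],
    ["ind_1", "ind_3", 0, 1, 7, 40.0, 47.5, 7.5],
]
segments = [PhasedSegment(*row) for row in rows]

# Count segments per (hap1, hap2) combination
hap_counts = Counter((s.hap1, s.hap2) for s in segments)

# In the sample data, seg_cm equals end_cm - start_cm; verify that here
consistent = all(abs((s.end_cm - s.start_cm) - s.seg_cm) < 1e-9 for s in segments)

print(hap_counts)
print(consistent)
```

Note that the first two rows overlap on chromosome 1 with complementary haplotype pairs `(0, 0)` and `(1, 1)`, which is the pattern that signals an IBD2 region.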

## Exploring Bonsai v3's ibd.py Module

Now that we understand the IBD data formats, let's explore the `ibd.py` module in Bonsai v3 to see how it processes these formats.

In [None]:
try:
    # Display functions in the ibd module
    display_module_functions('bonsaitree.bonsaitree.v3.ibd')
except Exception as e:
    print(f"Could not display ibd module functions: {e}")
    print("\nKey functions in the ibd.py module include:")
    print("1. get_phased_to_unphased(): Convert phased IBD segments to unphased format")
    print("2. get_unphased_to_phased(): Convert unphased IBD segments to phased format")
    print("3. get_ibd_stats_unphased(): Extract statistical summaries from unphased IBD data")
    print("4. filter_ibd_segments(): Apply quality filters to IBD segments")
    print("5. normalize_ibd_segments(): Standardize IBD segment representations")

Let's look at each of the key functions in more detail:

In [None]:
try:
    # View source of the get_phased_to_unphased function
    view_function_source('bonsaitree.bonsaitree.v3.ibd', 'get_phased_to_unphased')
except Exception as e:
    print(f"Could not display function source: {e}")
    print("\nThe get_phased_to_unphased() function converts phased IBD segments to unphased format.")
    print("It takes a list of phased segments and combines segments that overlap on different haplotypes.")
    print("This is needed because unphased IBD detectors don't distinguish between haplotypes.")

In [None]:
try:
    # View source of the get_unphased_to_phased function
    view_function_source('bonsaitree.bonsaitree.v3.ibd', 'get_unphased_to_phased')
except Exception as e:
    print(f"Could not display function source: {e}")
    print("\nThe get_unphased_to_phased() function creates pseudo-phased segments from unphased IBD data.")
    print("It assigns segments to random haplotypes since the true haplotype information is missing.")
    print("This is useful when algorithms require phased data but only unphased data is available.")

In [None]:
try:
    # View source of the get_ibd_stats_unphased function
    view_function_source('bonsaitree.bonsaitree.v3.ibd', 'get_ibd_stats_unphased')
except Exception as e:
    print(f"Could not display function source: {e}")
    print("\nThe get_ibd_stats_unphased() function extracts statistical summaries from unphased IBD data.")
    print("It computes key statistics like total IBD length, segment counts, and maximum segment length.")
    print("These statistics are used for relationship inference and pedigree construction.")

## Implementing IBD Format Conversion

Now that we've seen how Bonsai v3 handles IBD data conversion, let's implement simplified versions of these conversion functions and test them on our sample data.

### Phased to Unphased Conversion

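Converting from phased to unphased format means dropping the haplotype labels and deciding, for each segment, whether it is IBD1 or IBD2. A region is IBD2 when two phased segments for the same pair overlap on complementary haplotypes, so the conversion has to detect those overlaps.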

In [None]:
def phased_to_unphased(phased_segments):
    """Convert phased IBD segments to unphased format
    
    Args:
        phased_segments: List of phased segments in format
            [id1, id2, hap1, hap2, chromosome, start_cm, end_cm, seg_cm]
    
    Returns:
        List of unphased segments in format
            [id1, id2, chromosome, start_bp, end_bp, is_full_ibd, seg_cm]
    """
    # Group segments by individual pair and chromosome
    segment_groups = {}
    
    for segment in phased_segments:
        id1, id2, hap1, hap2, chrom, start_cm, end_cm, seg_cm = segment
        
        # Ensure id1 < id2 for consistent ordering
        if id1 > id2:
            id1, id2 = id2, id1
            hap1, hap2 = hap2, hap1
        
        # Create a key for grouping
        key = (id1, id2, chrom)
        
        # Initialize group if it doesn't exist
        if key not in segment_groups:
            segment_groups[key] = []
        
        # Add segment to group
        segment_groups[key].append((hap1, hap2, start_cm, end_cm, seg_cm))
    
    # Process each group to identify overlaps and merge segments
    unphased_segments = []
    
    for (id1, id2, chrom), segments in segment_groups.items():
        # Sort segments by start position
        segments.sort(key=lambda x: x[2])
        
        # Process each segment
        i = 0
        while i < len(segments):
            hap1, hap2, start_cm, end_cm, seg_cm = segments[i]
            
            # Check for overlapping segments on different haplotypes
            full_ibd_regions = []
            j = i + 1
            while j < len(segments):
                next_hap1, next_hap2, next_start, next_end, next_seg_cm = segments[j]
                
                # Check if segments overlap
                if next_start <= end_cm and next_end >= start_cm:
                    # IBD2 requires the complementary haplotype pair: the two
                    # segments must differ in the haplotype of BOTH individuals
                    if hap1 != next_hap1 and hap2 != next_hap2:
                        # Calculate overlap region
                        overlap_start = max(start_cm, next_start)
                        overlap_end = min(end_cm, next_end)
                        full_ibd_regions.append((overlap_start, overlap_end))
                
                # If next segment starts beyond current end, no more overlaps possible
                if next_start > end_cm:
                    break
                    
                j += 1
            
            # Simplistic approach: mark the entire segment as full IBD if there's significant overlap
            is_full_ibd = 0
            if full_ibd_regions:
                # Calculate total length of full IBD regions
                full_ibd_length = sum(end - start for start, end in full_ibd_regions)
                if full_ibd_length / seg_cm > 0.5:  # If more than 50% is full IBD
                    is_full_ibd = 1
            
            # Convert cM positions to bp (simplified; in practice, use a genetic map)
            # Assuming 1 cM ≈ 1 Mb (very approximate)
            start_bp = int(start_cm * 1_000_000)
            end_bp = int(end_cm * 1_000_000)
            
            # Create unphased segment
            unphased_segment = [id1, id2, chrom, start_bp, end_bp, is_full_ibd, seg_cm]
            unphased_segments.append(unphased_segment)
            
            i += 1
    
    return unphased_segments

In [None]:
# Test the phased to unphased conversion
converted_unphased = phased_to_unphased(phased_ibd_segments)

# Create a DataFrame for the converted segments
converted_unphased_df = pd.DataFrame(converted_unphased, 
                                    columns=['id1', 'id2', 'chromosome', 'start_bp', 'end_bp', 'is_full_ibd', 'seg_cm'])

# Display the first few rows
print(f"Converted {len(converted_unphased_df)} phased segments to unphased format")
converted_unphased_df.head(10)
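
The `phased_to_unphased` function above assigns a single flag to each whole input segment based on a 50% overlap threshold. As an alternative sketch (not Bonsai's implementation), we can split the chromosome at every segment boundary and mark an interval as IBD2 only where a haplotype pair and its complement both cover it. The function name `split_ibd1_ibd2` and its simplified input tuples are assumptions made for this example.

```python
def split_ibd1_ibd2(segments):
    """Split phased segments for ONE pair on ONE chromosome into IBD1/IBD2 intervals.

    segments: list of (hap1, hap2, start_cm, end_cm)
    Returns a list of non-overlapping (start_cm, end_cm, is_full_ibd) tuples.
    """
    # Every segment boundary is a potential change of IBD state
    points = sorted({p for _, _, start, end in segments for p in (start, end)})

    intervals = []
    for left, right in zip(points, points[1:]):
        # Haplotype pairs whose segments cover this elementary interval
        active = {(h1, h2) for h1, h2, start, end in segments
                  if start <= left and end >= right}
        if not active:
            continue  # gap with no IBD at all

        # IBD2 if some pair (h1, h2) and its complement (1-h1, 1-h2) are both present
        is_full = int(any((1 - h1, 1 - h2) in active for h1, h2 in active))

        # Merge with the previous interval when contiguous and in the same state
        if intervals and intervals[-1][1] == left and intervals[-1][2] == is_full:
            intervals[-1] = (intervals[-1][0], right, is_full)
        else:
            intervals.append((left, right, is_full))
    return intervals


example = [(0, 0, 10, 30), (1, 1, 20, 40)]
print(split_ibd1_ibd2(example))
# [(10, 20, 0), (20, 30, 1), (30, 40, 0)]
```

This version rescans all segments for each elementary interval, which is fine for a handful of segments per pair; a sweep over sorted start and end events would scale better.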

### Unphased to Phased Conversion

Converting from unphased to phased format is less straightforward since we don't know which haplotypes the segments belong to. We'll implement a simplified version that assigns segments to random haplotypes.

In [None]:
def unphased_to_phased(unphased_segments):
    """Convert unphased IBD segments to phased format (using random haplotype assignment)
    
    Args:
        unphased_segments: List of unphased segments in format
            [id1, id2, chromosome, start_bp, end_bp, is_full_ibd, seg_cm]
    
    Returns:
        List of phased segments in format
            [id1, id2, hap1, hap2, chromosome, start_cm, end_cm, seg_cm]
    """
    np.random.seed(42)  # For reproducibility
    phased_segments = []
    
    for segment in unphased_segments:
        id1, id2, chrom, start_bp, end_bp, is_full_ibd, seg_cm = segment
        
        # Convert bp positions to cM (simplified conversion)
        # Assuming 1 Mb ≈ 1 cM (very approximate)
        start_cm = start_bp / 1_000_000
        end_cm = end_bp / 1_000_000
        
        if is_full_ibd:  # IBD2 - both haplotypes match
            # Create two phased segments (one for each haplotype pair)
            phased_segments.append([id1, id2, 0, 0, chrom, start_cm, end_cm, seg_cm])
            phased_segments.append([id1, id2, 1, 1, chrom, start_cm, end_cm, seg_cm])
        else:  # IBD1 - only one haplotype matches
            # Randomly assign to haplotypes
            hap1 = np.random.randint(0, 2)
            hap2 = np.random.randint(0, 2)
            phased_segments.append([id1, id2, hap1, hap2, chrom, start_cm, end_cm, seg_cm])
    
    return phased_segments

In [None]:
# Test the unphased to phased conversion
converted_phased = unphased_to_phased(unphased_ibd_segments)

# Create a DataFrame for the converted segments
converted_phased_df = pd.DataFrame(converted_phased, 
                                  columns=['id1', 'id2', 'hap1', 'hap2', 'chromosome', 'start_cm', 'end_cm', 'seg_cm'])

# Display the first few rows
print(f"Converted {len(unphased_ibd_segments)} unphased segments into {len(converted_phased_df)} phased segments")
converted_phased_df.head(10)
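
Both conversion functions above use a fixed rate of 1 cM per Mb. In practice, recombination rates vary along the chromosome, so positions are usually converted by interpolating in a genetic map. The sketch below shows the idea with a made-up four-point map; `GENETIC_MAP` and `bp_to_cm` are hypothetical names, and the map values are invented for illustration.

```python
from bisect import bisect_right

# Hypothetical genetic map for one chromosome: (position in bp, position in cM).
# Real maps have thousands of rows; these numbers are made up.
GENETIC_MAP = [
    (0, 0.0),
    (10_000_000, 5.0),
    (20_000_000, 25.0),
    (30_000_000, 30.0),
]


def bp_to_cm(pos_bp, genetic_map=GENETIC_MAP):
    """Linearly interpolate a cM position; clamp positions outside the map."""
    bps = [bp for bp, _ in genetic_map]
    cms = [cm for _, cm in genetic_map]
    if pos_bp <= bps[0]:
        return cms[0]
    if pos_bp >= bps[-1]:
        return cms[-1]
    i = bisect_right(bps, pos_bp)  # first map point strictly to the right
    bp0, bp1 = bps[i - 1], bps[i]
    cm0, cm1 = cms[i - 1], cms[i]
    return cm0 + (cm1 - cm0) * (pos_bp - bp0) / (bp1 - bp0)


start_cm = bp_to_cm(5_000_000)
end_cm = bp_to_cm(15_000_000)
print(start_cm, end_cm, end_cm - start_cm)
# 2.5 15.0 12.5
```

With this map, a 10 Mb segment spans 12.5 cM rather than the 10 cM the fixed-rate conversion would give. With NumPy arrays, `np.interp(pos_bp, bps, cms)` performs the same clamped linear interpolation.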

## Extracting IBD Statistics

One of the most important functions in the `ibd.py` module is `get_ibd_stats_unphased()`, which extracts statistical summaries from IBD segments. Let's implement a simplified version of this function.

In [None]:
def get_ibd_stats(unphased_segments):
    """Extract key statistics from unphased IBD segments
    
    Args:
        unphased_segments: List of unphased segments in format
            [id1, id2, chromosome, start_bp, end_bp, is_full_ibd, seg_cm]
    
    Returns:
        Dictionary mapping (id1, id2) pairs to dictionaries of statistics
    """
    stats_dict = {}
    
    for segment in unphased_segments:
        id1, id2, chrom, start_bp, end_bp, is_full_ibd, seg_cm = segment
        
        # Ensure id1 < id2 for consistent ordering
        if id1 > id2:
            id1, id2 = id2, id1
        
        # Create key for the pair
        key = (id1, id2)
        
        # Initialize statistics if this is the first segment for this pair
        if key not in stats_dict:
            stats_dict[key] = {
                'total_half': 0.0,  # Total IBD1 length in cM
                'total_full': 0.0,  # Total IBD2 length in cM
                'num_half': 0,      # Number of IBD1 segments
                'num_full': 0,      # Number of IBD2 segments
                'max_seg_cm': 0.0   # Length of largest segment
            }
        
        # Update statistics
        if is_full_ibd:  # IBD2 segment
            stats_dict[key]['total_full'] += seg_cm
            stats_dict[key]['num_full'] += 1
        else:  # IBD1 segment
            stats_dict[key]['total_half'] += seg_cm
            stats_dict[key]['num_half'] += 1
        
        # Update max segment length
        stats_dict[key]['max_seg_cm'] = max(stats_dict[key]['max_seg_cm'], seg_cm)
    
    return stats_dict

In [None]:
# Extract statistics from our unphased segments
ibd_stats = get_ibd_stats(unphased_ibd_segments)

# Convert the stats to a DataFrame for easier viewing
stats_rows = []
for (id1, id2), stats in ibd_stats.items():
    stats_row = {'id1': id1, 'id2': id2}
    stats_row.update(stats)
    stats_rows.append(stats_row)

stats_df = pd.DataFrame(stats_rows)

# Add a total IBD column (half + 2 * full)
stats_df['total_ibd'] = stats_df['total_half'] + 2 * stats_df['total_full']

# Sort by total IBD (descending)
stats_df = stats_df.sort_values('total_ibd', ascending=False)

# Display the results
stats_df
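
To check the aggregation by hand, here is a small worked example with three segments for a single pair, one of which lists the IDs in reverse order. It is a compact restatement of the same logic as `get_ibd_stats()` above, using made-up segments.

```python
from collections import defaultdict

# Three hand-written unphased segments for the pair A/B
segments = [
    ["A", "B", 1, 10_000_000, 20_000_000, 0, 10.0],  # IBD1
    ["A", "B", 2, 5_000_000, 10_000_000, 1, 5.0],    # IBD2
    ["B", "A", 3, 30_000_000, 50_000_000, 0, 20.0],  # IBD1, IDs reversed
]

stats = defaultdict(lambda: {"total_half": 0.0, "total_full": 0.0,
                             "num_half": 0, "num_full": 0, "max_seg_cm": 0.0})

for id1, id2, _chrom, _start, _end, is_full_ibd, seg_cm in segments:
    pair = tuple(sorted((id1, id2)))  # same key regardless of ID order
    kind = "full" if is_full_ibd else "half"
    stats[pair][f"total_{kind}"] += seg_cm
    stats[pair][f"num_{kind}"] += 1
    stats[pair]["max_seg_cm"] = max(stats[pair]["max_seg_cm"], seg_cm)

ab = stats[("A", "B")]
# IBD2 regions count twice because both haplotypes are shared
total_ibd = ab["total_half"] + 2 * ab["total_full"]
print(ab)
print(total_ibd)  # 30.0 + 2 * 5.0 = 40.0
```

All three segments land under the key `("A", "B")`, which shows why normalizing the ID order matters before grouping.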

### Visualizing IBD Statistics

Let's create some visualizations to better understand the IBD statistics we've extracted.

In [None]:
# Plot total IBD sharing between pairs
plt.figure(figsize=(12, 6))

# Create pair labels
pair_labels = [f"{row['id1']}-{row['id2']}" for _, row in stats_df.iterrows()]

# Create a stacked bar chart of IBD1 and IBD2 sharing
plt.bar(pair_labels, stats_df['total_half'], label='IBD1 (Half-identical)')
plt.bar(pair_labels, 2 * stats_df['total_full'], bottom=stats_df['total_half'], label='IBD2 (Fully identical)')

plt.xlabel('Individual Pair')
plt.ylabel('Total IBD Sharing (cM)')
plt.title('IBD Sharing Between Individual Pairs')
plt.xticks(rotation=45, ha='right')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Plot segment count distribution
plt.figure(figsize=(12, 6))

# Create a stacked bar chart of IBD1 and IBD2 segment counts
plt.bar(pair_labels, stats_df['num_half'], label='IBD1 Segments')
plt.bar(pair_labels, stats_df['num_full'], bottom=stats_df['num_half'], label='IBD2 Segments')

plt.xlabel('Individual Pair')
plt.ylabel('Number of Segments')
plt.title('IBD Segment Counts Between Individual Pairs')
plt.xticks(rotation=45, ha='right')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Plot maximum segment length
plt.figure(figsize=(12, 6))

plt.bar(pair_labels, stats_df['max_seg_cm'])

plt.xlabel('Individual Pair')
plt.ylabel('Length (cM)')
plt.title('Maximum IBD Segment Length Between Individual Pairs')
plt.xticks(rotation=45, ha='right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

## Using IBD Statistics for Relationship Inference

The IBD statistics extracted by `get_ibd_stats()` form the foundation for relationship inference in Bonsai v3. Let's see how these statistics compare with the values expected for different relationship types.

In [None]:
# Create a reference table of expected IBD sharing for different relationships
relationship_sharing = [
    {"relationship": "Identical twins", "meiotic_distance": 0, "expected_total_ibd": 6800.0, "expected_segments": 1, "expected_max_segment": 6800.0},
    {"relationship": "Parent-Child", "meiotic_distance": 1, "expected_total_ibd": 3400.0, "expected_segments": 23, "expected_max_segment": 279.0},
    {"relationship": "Full siblings", "meiotic_distance": 2, "expected_total_ibd": 2550.0, "expected_segments": 50, "expected_max_segment": 160.0},
    {"relationship": "Half-siblings", "meiotic_distance": 2, "expected_total_ibd": 1700.0, "expected_segments": 28, "expected_max_segment": 160.0},
    {"relationship": "Grandparent-Grandchild", "meiotic_distance": 2, "expected_total_ibd": 1700.0, "expected_segments": 28, "expected_max_segment": 160.0},
    {"relationship": "Uncle/Aunt-Nephew/Niece", "meiotic_distance": 3, "expected_total_ibd": 1700.0, "expected_segments": 28, "expected_max_segment": 154.0},
    {"relationship": "First cousins", "meiotic_distance": 4, "expected_total_ibd": 850.0, "expected_segments": 15, "expected_max_segment": 123.0},
    {"relationship": "First cousins once removed", "meiotic_distance": 5, "expected_total_ibd": 425.0, "expected_segments": 8, "expected_max_segment": 98.0},
    {"relationship": "Second cousins", "meiotic_distance": 6, "expected_total_ibd": 212.5, "expected_segments": 4, "expected_max_segment": 78.2},
    {"relationship": "Second cousins once removed", "meiotic_distance": 7, "expected_total_ibd": 106.3, "expected_segments": 2, "expected_max_segment": 62.4},
    {"relationship": "Third cousins", "meiotic_distance": 8, "expected_total_ibd": 53.1, "expected_segments": 1, "expected_max_segment": 49.8},
    {"relationship": "Fourth cousins", "meiotic_distance": 10, "expected_total_ibd": 13.3, "expected_segments": 0.3, "expected_max_segment": 31.8},
]

# Convert to DataFrame
rel_df = pd.DataFrame(relationship_sharing)

# Display the table
rel_df
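
Most of the `expected_total_ibd` values in this table follow a halving pattern: each additional meiosis halves the expected sharing, and a relationship through two common ancestors shares twice as much as one through a single ancestor at the same distance. The sketch below reproduces several rows under two assumptions that are not columns of the table: an autosomal genome length of 3400 cM (the value in the parent-child row) and a hand-supplied number of common ancestors. The helper name `expected_ibd_cm` is hypothetical.

```python
GENOME_CM = 3400.0  # assumed autosomal length, matching the parent-child row


def expected_ibd_cm(meiotic_distance, num_common_ancestors=1):
    """Expected total IBD in cM: a * G * 2**(1 - d), for relationships without IBD2."""
    return num_common_ancestors * GENOME_CM * 2.0 ** (1 - meiotic_distance)


# relationship -> (meiotic distance d, number of common ancestors a)
examples = {
    "Parent-Child": (1, 1),
    "Half-siblings": (2, 1),
    "Uncle/Aunt-Nephew/Niece": (3, 2),
    "First cousins": (4, 2),
    "Second cousins": (6, 2),
}

for name, (d, a) in examples.items():
    print(f"{name}: {expected_ibd_cm(d, a):.1f} cM")
```

Full siblings and identical twins are left out on purpose: they share IBD2 regions, so the halving rule alone does not reproduce their rows in the table.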

In [None]:
# Infer approximate relationships from IBD statistics
def infer_relationship(total_ibd, num_segments, max_segment):
    """Infer an approximate relationship based on IBD statistics
    
    Args:
        total_ibd: Total IBD sharing in cM
        num_segments: Number of IBD segments
        max_segment: Length of the largest segment in cM
        
    Returns:
        Tuple of (relationship, confidence)
    """
    # Calculate differences from expected values for each relationship
    # Note: This is a very simplified approach compared to Bonsai's sophisticated models
    differences = []
    
    for _, row in rel_df.iterrows():
        # Calculate relative differences
        total_diff = abs(total_ibd - row['expected_total_ibd']) / max(row['expected_total_ibd'], 1.0)
        segment_diff = abs(num_segments - row['expected_segments']) / max(row['expected_segments'], 1.0)
        max_diff = abs(max_segment - row['expected_max_segment']) / max(row['expected_max_segment'], 1.0)
        
        # Combine differences (weighted)
        combined_diff = 0.5 * total_diff + 0.3 * segment_diff + 0.2 * max_diff
        
        differences.append((row['relationship'], combined_diff))
    
    # Sort by difference (ascending)
    differences.sort(key=lambda x: x[1])
    
    # Return best match and a confidence score
    best_relationship, best_diff = differences[0]
    confidence = max(0, 1 - best_diff)  # Simple confidence score
    
    return best_relationship, confidence

# Apply to our IBD statistics
inferred_relationships = []
for _, row in stats_df.iterrows():
    total_ibd = row['total_half'] + 2 * row['total_full']
    num_segments = row['num_half'] + row['num_full']
    max_segment = row['max_seg_cm']
    
    relationship, confidence = infer_relationship(total_ibd, num_segments, max_segment)
    
    inferred_relationships.append({
        'id1': row['id1'],
        'id2': row['id2'],
        'total_ibd': total_ibd,
        'num_segments': num_segments,
        'max_segment': max_segment,
        'inferred_relationship': relationship,
        'confidence': confidence
    })

# Convert to DataFrame
inferred_df = pd.DataFrame(inferred_relationships)

# Sort by confidence (descending)
inferred_df = inferred_df.sort_values('confidence', ascending=False)

# Display the results
inferred_df
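
One limitation of this nearest-match scoring is worth making explicit: relationships with identical expected statistics cannot be told apart. Because `list.sort` is stable, whichever of the tied rows appears first in the reference table always wins. The sketch below finds such ties in a few rows copied from the table; the variable names are illustrative.

```python
from collections import defaultdict

# (relationship, expected_total_ibd, expected_segments, expected_max_segment)
# Rows copied from the reference table above
reference = [
    ("Half-siblings", 1700.0, 28, 160.0),
    ("Grandparent-Grandchild", 1700.0, 28, 160.0),
    ("Uncle/Aunt-Nephew/Niece", 1700.0, 28, 154.0),
    ("First cousins", 850.0, 15, 123.0),
]

# Group relationships by their full tuple of expected statistics
by_signature = defaultdict(list)
for name, total, n_segments, max_segment in reference:
    by_signature[(total, n_segments, max_segment)].append(name)

# Any group with more than one name is indistinguishable to infer_relationship()
ambiguous = [names for names in by_signature.values() if len(names) > 1]
print(ambiguous)
# [['Half-siblings', 'Grandparent-Grandchild']]
```

Separating such relationships requires information beyond these three summary statistics, which is one reason this lab's scoring should be read as a rough illustration only.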

## Summary

In this lab, we've explored IBD data formats and preprocessing in Bonsai v3:

1. **IBD Data Formats**: We examined unphased and phased IBD formats, understanding their structure and meaning
2. **Format Conversion**: We implemented simplified versions of Bonsai v3's functions for converting between phased and unphased formats
3. **IBD Statistics**: We extracted key statistical summaries from IBD segments, including total length, segment counts, and maximum segment length
4. **Relationship Inference**: We used IBD statistics to infer approximate relationships between individuals

These concepts form the foundation for Bonsai v3's more sophisticated relationship inference and pedigree reconstruction algorithms, which we'll explore in subsequent labs.