# Lab 15: Model Calibration in Bonsai

Building upon our exploration of the data structures in Bonsai in Lab 14, we now examine the process of model calibration in pedigree reconstruction. This lab focuses on how to adjust Bonsai's statistical models to better match real-world genetic inheritance patterns.

> **Why This Matters:** The accuracy of pedigree reconstruction depends critically on how well statistical models match the biological reality of genetic inheritance. Model calibration bridges the gap between theoretical expectations and observed data patterns, ensuring that Bonsai can handle real-world genetic data with its inherent noise and biases.

**Learning Objectives**:
- Understand the importance of model calibration in pedigree reconstruction
- Master techniques for calibrating IBD detection thresholds
- Learn how to estimate parameters for relationship likelihood models
- Develop methods to validate and refine calibration using known relationships
- Apply cross-validation techniques to evaluate calibration effectiveness
- Implement custom calibration approaches for specific populations or datasets

## Environment Setup

In [None]:
!poetry install --no-root

In [None]:
import os
import sys
import math
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from scipy.stats import poisson, expon, norm, multivariate_normal
from collections import defaultdict, deque
from pathlib import Path
from IPython.display import display, HTML
from dotenv import load_dotenv
import json

In [None]:
# Environment setup code removed for JupyterLite compatibility
# In JupyterLite, files are accessed directly from the files directory


## 1. The Need for Model Calibration

Real-world genetic data rarely follows the expected theoretical distributions exactly. Several factors introduce systematic biases that must be accounted for:

- **IBD Detection Errors:** False positives and false negatives in IBD detection
- **Population-Specific Recombination:** Variation in recombination rates across populations
- **Genotyping Technology:** Systematic biases introduced by different genotyping platforms
- **Background Relatedness:** Low-level shared ancestry in endogamous populations
- **Ascertainment Bias:** Non-random sampling of individuals for analysis

Calibration adjusts Bonsai's internal models to account for these discrepancies, ensuring accurate relationship inference despite these real-world complications.

Let's visualize how calibration adjusts theoretical models to match observed data:

In [None]:
# Simulate theoretical vs. observed IBD sharing
np.random.seed(42)

# Theoretical expectations for segment counts by relationship
relationships = ['Parent-Child', 'Full Siblings', 'Half Siblings', 'First Cousins', 'Second Cousins']
theoretical_counts = [38, 26, 15, 7, 3]  # theoretical expected segment counts

# Simulated "observed" data with systematic biases
observed_counts = [34, 22, 12, 5, 2]  # systematically lower due to false negatives
observed_var = [3, 4, 3, 2, 1]  # spread of observations (used as the standard deviation in np.random.normal)

# Create data points for each relationship
observed_data = []
for i, rel in enumerate(relationships):
    # Create multiple observations for each relationship
    observations = np.random.normal(observed_counts[i], observed_var[i], 20)
    for obs in observations:
        observed_data.append({'relationship': rel, 'segment_count': max(1, int(obs))})

# Convert to DataFrame
observed_df = pd.DataFrame(observed_data)

# Calculate calibration factors
calibration_factors = [theoretical_counts[i] / observed_counts[i] for i in range(len(relationships))]

# Apply calibration to get calibrated observations
observed_df['calibrated_count'] = observed_df.apply(
    lambda row: row['segment_count'] * calibration_factors[relationships.index(row['relationship'])],
    axis=1
)

# Plot the comparison
plt.figure(figsize=(12, 6))

# Boxplots of observed data
plt.subplot(1, 2, 1)
sns.boxplot(x='relationship', y='segment_count', data=observed_df, order=relationships)
# Add theoretical counts as red X marks
for i, count in enumerate(theoretical_counts):
    plt.scatter(i, count, color='red', marker='x', s=100)
plt.title('Before Calibration')
plt.ylabel('IBD Segment Count')
plt.xticks(rotation=45)

# Boxplots of calibrated data
plt.subplot(1, 2, 2)
sns.boxplot(x='relationship', y='calibrated_count', data=observed_df, order=relationships)
# Add theoretical counts as red X marks
for i, count in enumerate(theoretical_counts):
    plt.scatter(i, count, color='red', marker='x', s=100)
plt.title('After Calibration')
plt.ylabel('Calibrated IBD Segment Count')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()
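
The calibration factors in the cell above are taken directly from the simulation's known means. With real data, one option is to estimate them from pairs whose relationships are already known. The following is a minimal sketch under that assumption; the function name `estimate_calibration_factors` and the sample counts are hypothetical and not part of Bonsai.

```python
from statistics import mean

# Illustrative theoretical expectations (same example values as in the cell above)
theoretical = {'Parent-Child': 38, 'First Cousins': 7}

# Hypothetical observed segment counts for pairs with known relationships
observed = {
    'Parent-Child': [34, 35, 33, 34],
    'First Cousins': [5, 6, 4, 5],
}

def estimate_calibration_factors(theoretical, observed):
    """Return theoretical / mean(observed) for each relationship type."""
    factors = {}
    for rel, counts in observed.items():
        obs_mean = mean(counts)
        if obs_mean <= 0:
            continue  # a zero mean cannot be rescaled
        factors[rel] = theoretical[rel] / obs_mean
    return factors

factors = estimate_calibration_factors(theoretical, observed)
for rel, factor in factors.items():
    print(f"{rel}: calibration factor = {factor:.3f}")
```

A simple ratio of means like this assumes the bias is multiplicative within each relationship type; it corrects the average count but does not change the spread relative to the mean.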

## 2. Calibrating IBD Detection Thresholds

The first step in Bonsai calibration is determining appropriate thresholds for IBD segment detection.

### Minimum Segment Length Calibration

The choice of minimum segment length threshold balances sensitivity and specificity:

In [None]:
class IBDSegment:
    def __init__(self, ind1, ind2, chrom, start_pos, end_pos, is_ibd2, length_cm, snp_count=None, lod_score=None):
        self.ind1 = ind1          # First individual ID
        self.ind2 = ind2          # Second individual ID
        self.chrom = chrom        # Chromosome number
        self.start_pos = start_pos  # Start position (base pairs)
        self.end_pos = end_pos    # End position (base pairs)
        self.is_ibd2 = is_ibd2    # Whether this is an IBD2 segment
        self.length_cm = length_cm  # Genetic length in centiMorgans
        self.snp_count = snp_count  # Number of SNPs in the segment
        self.lod_score = lod_score  # LOD score for the segment
    
    def __repr__(self):
        return f"IBDSegment({self.ind1}, {self.ind2}, chr{self.chrom}, {self.length_cm:.2f}cM, {'IBD2' if self.is_ibd2 else 'IBD1'})"

def calibrate_min_segment_length(ibd_segments, known_relationships=None):
    """Calibrate the minimum segment length threshold.
    
    Args:
        ibd_segments: List of detected IBD segments
        known_relationships: Dictionary mapping pairs to known relationship types (optional)
        
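    Returns:
        Tuple of (recommended minimum segment length in cM, per-threshold metrics).
        The metrics entry is None when no known relationships are supplied.
    """
    # Approach 1: precision/recall sweep over candidate thresholds (if known relationships available)
    if known_relationships:
        thresholds = np.arange(3, 15, 0.5)  # Test thresholds from 3 to 14.5 cM in 0.5 cM steps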
        results = []
        
        for threshold in thresholds:
            # Filter segments by threshold
            filtered_segments = [seg for seg in ibd_segments if seg.length_cm >= threshold]
            
            # Create pair-based index
            pair_index = {}
            for seg in filtered_segments:
                pair = tuple(sorted([seg.ind1, seg.ind2]))
                if pair not in pair_index:
                    pair_index[pair] = []
                pair_index[pair].append(seg)
            
            # Evaluate accuracy on known relationships
            # Pairs with at least one segment at or above the threshold
            detected_pairs = set(pair_index.keys())
            
            # Pairs expected to share IBD; relationship types with no expected
            # sharing (e.g. spouses) are excluded so they do not count as misses
            known_pairs = {
                tuple(sorted(pair))
                for pair, rel in known_relationships.items()
                if rel not in ('spouses', 'unrelated')
            }
            
            # Calculate metrics
            true_positives = len(detected_pairs & known_pairs)
            false_positives = len(detected_pairs - known_pairs)
            false_negatives = len(known_pairs - detected_pairs)
            
            precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
            recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
            f1_score = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
            
            results.append({
                'threshold': threshold,
                'precision': precision,
                'recall': recall,
                'f1_score': f1_score
            })
        
        # Find threshold with best F1 score
        best_threshold = max(results, key=lambda x: x['f1_score'])['threshold']
        return best_threshold, results
    
    # Approach 2: Distribution-based calibration (if no known relationships)
    else:
        # Analyze the distribution of segment lengths
        lengths = [seg.length_cm for seg in ibd_segments]
        
        # Simple heuristic standing in for a proper inflection-point search:
        # use the 5th percentile of segment lengths, with a floor of 7 cM
        return max(7, np.percentile(lengths, 5)), None
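
To make the pair-level metrics concrete, here is a small hand-checkable sketch of the same precision, recall, and F1 calculation at a single threshold. The pairs, segment lengths, and the helper name `pair_metrics` are hypothetical and chosen only for illustration.

```python
# Hypothetical toy data: (pair, segment length in cM)
segments = [
    (('A', 'B'), 40.0),
    (('A', 'B'), 12.0),
    (('C', 'D'), 9.0),
    (('E', 'F'), 4.0),  # short segment between an unrelated pair
    (('G', 'H'), 5.0),  # related pair whose only segment is short
]
related_pairs = {('A', 'B'), ('C', 'D'), ('G', 'H')}

def pair_metrics(segments, related_pairs, threshold):
    """Precision, recall, and F1 for pairs detected at a length threshold."""
    detected = {pair for pair, length in segments if length >= threshold}
    tp = len(detected & related_pairs)   # related and detected
    fp = len(detected - related_pairs)   # detected but not related
    fn = len(related_pairs - detected)   # related but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

for t in (3, 7):
    p, r, f = pair_metrics(segments, related_pairs, t)
    print(f"threshold {t} cM: precision={p:.2f}, recall={r:.2f}, F1={f:.2f}")
```

Raising the threshold from 3 cM to 7 cM removes the unrelated pair (precision rises) but also drops the related pair whose only segment is short (recall falls), which is the tradeoff the sweep is designed to expose.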

Let's generate some synthetic IBD segments and known relationships to test the calibration function:

In [None]:
# Generate synthetic IBD segments
np.random.seed(42)

# Create a set of individuals
individuals = list(range(1000, 1020))

# Create some known relationships
known_relationships = {
    (1000, 1001): 'parent-child',
    (1000, 1002): 'parent-child',
    (1001, 1002): 'spouses',       # No IBD expected
    (1003, 1004): 'siblings',
    (1000, 1003): 'grandparent',
    (1005, 1006): 'first-cousins',
    (1007, 1008): 'second-cousins'
}

# Generate synthetic IBD segments
synthetic_segments = []

# Helper function to generate segments for a pair with noise
def generate_segments_for_pair(ind1, ind2, relationship, num_segments, length_mean, length_std, noise=0.2):
    segments = []
    for _ in range(int(np.random.normal(num_segments, num_segments * noise))):
        length = max(1, np.random.normal(length_mean, length_std))
        chrom = np.random.randint(1, 23)
        start_pos = np.random.randint(1, 200000000)
        end_pos = start_pos + int(length * 1000000)  # Rough bp conversion
        is_ibd2 = relationship == 'siblings' and np.random.random() < 0.25  # 25% chance for siblings
        snp_count = int(length * 100)  # Approximate SNP count
        lod_score = np.random.normal(length/2, 1)  # Simple LOD score approximation
        
        segments.append(IBDSegment(ind1, ind2, chrom, start_pos, end_pos, is_ibd2, length, snp_count, lod_score))
    return segments

# Generate segments for known relationships
for (ind1, ind2), rel in known_relationships.items():
    if rel == 'parent-child':
        synthetic_segments.extend(generate_segments_for_pair(ind1, ind2, rel, 38, 90, 30))
    elif rel == 'siblings':
        synthetic_segments.extend(generate_segments_for_pair(ind1, ind2, rel, 26, 70, 25))
    elif rel == 'grandparent':
        synthetic_segments.extend(generate_segments_for_pair(ind1, ind2, rel, 15, 40, 15))
    elif rel == 'first-cousins':
        synthetic_segments.extend(generate_segments_for_pair(ind1, ind2, rel, 7, 25, 10))
    elif rel == 'second-cousins':
        synthetic_segments.extend(generate_segments_for_pair(ind1, ind2, rel, 3, 15, 5))
    # No segments for spouses

# Add some random segments for unrelated individuals (false positives)
for _ in range(50):
    # Choose random pairs
    ind1, ind2 = np.random.choice(individuals, 2, replace=False)
    pair = tuple(sorted([ind1, ind2]))
    if pair not in known_relationships:
        # Generate a small number of short segments
        synthetic_segments.extend(generate_segments_for_pair(ind1, ind2, 'unrelated', 2, 6, 2))

# Display summary of generated segments
print(f"Generated {len(synthetic_segments)} synthetic IBD segments")
print(f"Length range: {min([s.length_cm for s in synthetic_segments]):.2f} - {max([s.length_cm for s in synthetic_segments]):.2f} cM")

In [None]:
# Apply the calibration function
best_threshold, results = calibrate_min_segment_length(synthetic_segments, known_relationships)

print(f"Best minimum segment length threshold: {best_threshold:.2f} cM")

# Visualize the precision-recall tradeoff
if results:
    results_df = pd.DataFrame(results)
    
    plt.figure(figsize=(12, 5))
    
    # Plot precision, recall, and F1 score
    plt.subplot(1, 2, 1)
    plt.plot(results_df['threshold'], results_df['precision'], 'b-', label='Precision')
    plt.plot(results_df['threshold'], results_df['recall'], 'r-', label='Recall')
    plt.plot(results_df['threshold'], results_df['f1_score'], 'g-', label='F1 Score')
    plt.axvline(x=best_threshold, color='black', linestyle='--', label=f'Best Threshold: {best_threshold:.2f} cM')
    plt.xlabel('Minimum Segment Length Threshold (cM)')
    plt.ylabel('Score')
    plt.title('Precision-Recall Tradeoff')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Plot precision vs recall
    plt.subplot(1, 2, 2)
    plt.plot(results_df['recall'], results_df['precision'], 'b-')
    # Mark points for various thresholds
    for t in [3, 5, 7, 10, 12, 14]:
        # Use a tolerant comparison because thresholds are floats
        match = results_df[np.isclose(results_df['threshold'], t)]
        if not match.empty:
            row = match.iloc[0]
            plt.scatter(row['recall'], row['precision'], color='red')
            plt.annotate(f"{t} cM", (row['recall'], row['precision']), textcoords="offset points", xytext=(0,10), ha='center')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision vs Recall for Different Thresholds')
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

Appropriate thresholds can vary based on:
- The IBD detection algorithm used (IBIS, Refined-IBD, HAP-IBD, etc.)
- Population structure of the dataset (for example, the degree of endogamy)
- Genotyping density and quality
- The specific relationships of interest

### False Positive Control

Additional filtering can be applied to reduce false positives:

In [None]:
def filter_ibd_segments(segments):
    """Filter IBD segments to reduce false positives."""
    filtered = []
    
    for seg in segments:
        # Filter by minimum genetic length
        if seg.length_cm < 7:
            continue
            
        # Filter by minimum physical length
        if (seg.end_pos - seg.start_pos) < 500000:  # 500kb minimum
            continue
            
        # Filter by SNP count, when one was recorded (None means "not available")
        snp_count = getattr(seg, 'snp_count', None)
        if snp_count is not None and snp_count < 400:  # Minimum SNPs for confidence
            continue
            
        # Filter by LOD score, when one was recorded
        lod_score = getattr(seg, 'lod_score', None)
        if lod_score is not None and lod_score < 5:
            continue
            
        # Apply a stricter length threshold on sex and mitochondrial chromosomes
        if str(seg.chrom) in ['X', 'Y', 'MT']:
            if seg.length_cm < 10:  # Higher threshold than the 7 cM used for autosomes
                continue
        
        filtered.append(seg)
    
    return filtered

# Apply filtering to our synthetic segments
filtered_segments = filter_ibd_segments(synthetic_segments)

print(f"Before filtering: {len(synthetic_segments)} segments")
print(f"After filtering: {len(filtered_segments)} segments")
print(f"Removed {len(synthetic_segments) - len(filtered_segments)} segments ({(1 - len(filtered_segments)/len(synthetic_segments))*100:.1f}%)")
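
When tuning these filters, it can help to see which rule removes each segment rather than only the total removed. The sketch below is one way to tally that, assuming segments expose the same attribute names as `IBDSegment`. The helper `rejection_reason` and the demo segments are hypothetical, and the stricter sex-chromosome rule is omitted for brevity.

```python
from collections import Counter
from types import SimpleNamespace

def rejection_reason(seg, min_cm=7, min_bp=500_000, min_snps=400, min_lod=5):
    """Return the name of the first filter the segment fails, or None if it passes."""
    if seg.length_cm < min_cm:
        return 'genetic_length'
    if seg.end_pos - seg.start_pos < min_bp:
        return 'physical_length'
    if seg.snp_count is not None and seg.snp_count < min_snps:
        return 'snp_count'
    if seg.lod_score is not None and seg.lod_score < min_lod:
        return 'lod_score'
    return None

# Hypothetical segments standing in for IBDSegment objects
demo_segments = [
    SimpleNamespace(length_cm=5.0, start_pos=0, end_pos=5_000_000, snp_count=500, lod_score=6.0),
    SimpleNamespace(length_cm=9.0, start_pos=0, end_pos=300_000, snp_count=900, lod_score=8.0),
    SimpleNamespace(length_cm=12.0, start_pos=0, end_pos=12_000_000, snp_count=None, lod_score=3.0),
    SimpleNamespace(length_cm=20.0, start_pos=0, end_pos=20_000_000, snp_count=2000, lod_score=None),
]

# Tally how many segments each rule removes; passing segments are counted as 'kept'
reasons = Counter(rejection_reason(seg) or 'kept' for seg in demo_segments)
for reason, count in reasons.items():
    print(f"{reason}: {count}")
```

Because only the first failing rule is recorded, the tally depends on the order of the checks; a segment that fails several filters is attributed to the earliest one.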