# Lab 4: IBD Statistics Extraction and Analysis

## Overview

In this lab, we'll explore how Bonsai v3 extracts and analyzes statistics from Identity-by-Descent (IBD) segments. These statistics form the foundation for relationship inference and pedigree reconstruction. 

Key topics include:

1. Extracting core IBD statistics (total sharing, segment counts)
2. Calculating IBD1 and IBD2 proportions
3. Analyzing segment length distributions
4. Building IBD networks and community detection
5. Using IBD statistics for relationship inference

By the end of this lab, you'll understand how raw IBD segment data is transformed into meaningful statistics that can be used to infer relationships between individuals.

In [None]:
# 🧬 Google Colab Setup - Run this cell first!
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown

def is_colab():
    '''Check if running in Google Colab'''
    try:
        import google.colab
        return True
    except ImportError:
        return False

if is_colab():
    print("🔬 Setting up Google Colab environment...")
    
    # Install dependencies (system libraries first: pygraphviz needs graphviz-dev to build)
    print("📦 Installing packages...")
    !apt-get update -qq && apt-get install -qq samtools bcftools tabix graphviz-dev
    !pip install -q pysam biopython scikit-allel networkx pygraphviz seaborn plotly
    
    # Create directories
    !mkdir -p /content/class_data /content/results
    
    # Download essential class data
    print("📥 Downloading class data...")
    S3_BASE = "https://computational-genetic-genealogy.s3.us-east-2.amazonaws.com/class_data/"
    data_files = [
        "pedigree.fam", "pedigree.def", 
        "merged_opensnps_autosomes_ped_sim.seg",
        "merged_opensnps_autosomes_ped_sim-everyone.fam",
        "ped_sim_run2.seg", "ped_sim_run2-everyone.fam"
    ]
    
    for file in data_files:
        !wget -q -O /content/class_data/{file} {S3_BASE}{file}
        print(f"  ✅ {file}")
    
    # Define utility functions
    def setup_environment():
        return "/content/class_data", "/content/results"
    
    def save_results(dataframe, filename, description="results"):
        os.makedirs("/content/results", exist_ok=True)
        full_path = f"/content/results/{filename}"
        dataframe.to_csv(full_path, index=False)
        display(HTML(f'''
        <div style="padding: 10px; background-color: #e3f2fd; border-left: 4px solid #2196f3; margin: 10px 0;">
            <p><strong>💾 Results saved!</strong> To download: 
            <code>from google.colab import files; files.download('{full_path}')</code></p>
        </div>
        '''))
        return full_path
    
    def save_plot(plt, filename, description="plot"):
        os.makedirs("/content/results", exist_ok=True)
        full_path = f"/content/results/{filename}"
        plt.savefig(full_path, dpi=300, bbox_inches='tight')
        plt.show()
        display(HTML(f'''
        <div style="padding: 10px; background-color: #e8f5e8; border-left: 4px solid #4caf50; margin: 10px 0;">
            <p><strong>📊 Plot saved!</strong> To download: 
            <code>from google.colab import files; files.download('{full_path}')</code></p>
        </div>
        '''))
        return full_path
    
    print("✅ Colab setup complete! Ready to explore genetic genealogy.")
    
else:
    print("🏠 Local environment detected")
    def setup_environment():
        return "class_data", "results"
    def save_results(df, filename, description=""):
        os.makedirs("results", exist_ok=True)
        path = f"results/{filename}"
        df.to_csv(path, index=False)
        return path
    def save_plot(plt, filename, description=""):
        os.makedirs("results", exist_ok=True)
        path = f"results/{filename}"
        plt.savefig(path, dpi=300, bbox_inches='tight')
        plt.show()
        return path

# Set up paths and configure visualization
DATA_DIR, RESULTS_DIR = setup_environment()
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [None]:
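# Setup Bonsai module paths
def is_jupyterlite():
    '''Check if running in JupyterLite (assumes a Pyodide kernel, where sys.platform is "emscripten")'''
    return sys.platform == 'emscripten'

if not is_jupyterlite():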
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
import importlib
import inspect

def display_module_classes(module_name):
    """Display classes and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all classes
        classes = inspect.getmembers(module, inspect.isclass)
        
        # Filter classes defined in this module (not imported)
        classes = [(name, cls) for name, cls in classes if cls.__module__ == module_name]
        
        # Print info for each class
        for name, cls in classes:
            print(f"\n## {name}")
            
            # Get docstring
            doc = inspect.getdoc(cls)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
            
            # Get methods
            methods = inspect.getmembers(cls, inspect.isfunction)
            if methods:
                print("\nMethods:")
                for method_name, method in methods:
                    if not method_name.startswith('_'):  # Skip private methods
                        print(f"- {method_name}")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def display_module_functions(module_name):
    """Display functions and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all functions
        functions = inspect.getmembers(module, inspect.isfunction)
        
        # Filter functions defined in this module (not imported)
        functions = [(name, func) for name, func in functions if func.__module__ == module_name]
        
        # Print info for each function
        for name, func in functions:
            if name.startswith('_'):  # Skip private functions
                continue
                
            print(f"\n## {name}")
            
            # Get signature
            sig = inspect.signature(func)
            print(f"Signature: {name}{sig}")
            
            # Get docstring
            doc = inspect.getdoc(func)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def view_function_source(module_name, function_name):
    """Display the source code of a function"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Get the function
        func = getattr(module, function_name)
        
        # Get the source code
        source = inspect.getsource(func)
        
        # Print the source code
        from IPython.display import display, Markdown
        display(Markdown(f"```python\n{source}\n```"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except AttributeError:
        print(f"Function {function_name} not found in module {module_name}")
    except Exception as e:
        print(f"Error processing function {function_name}: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [None]:
try:
    from utils.bonsaitree.bonsaitree import v3
    print("✅ Successfully imported Bonsai v3 module")
except ImportError as e:
    print(f"❌ Failed to import Bonsai v3 module: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")
    print("Make sure you've properly set up your environment with the Bonsai repository.")

## Part 1: Understanding IBD Statistics

Identity-by-Descent (IBD) segments are the fundamental units of genetic relatedness in computational genetic genealogy. Before we can infer relationships or build pedigrees, we need to extract meaningful statistics from these segments.

Let's explore the key IBD statistics used in Bonsai v3:

### Core IBD Statistics

Bonsai v3 calculates five core IBD statistics for each pair of individuals:

1. **Total Half-IBD (IBD1)**: Total genetic length (in cM) of segments where individuals share one allele
2. **Total Full-IBD (IBD2)**: Total genetic length (in cM) of segments where individuals share both alleles
3. **Number of Half-IBD Segments**: Count of IBD1 segments
4. **Number of Full-IBD Segments**: Count of IBD2 segments
5. **Maximum Segment Length**: Length of the longest IBD segment (in cM)

Let's create a few synthetic IBD segments to demonstrate how these statistics are calculated:

In [None]:
# Create synthetic IBD segments for demonstration
from collections import defaultdict

# Format: [id1, id2, chromosome, start_position, end_position, is_full_ibd, segment_cm]
synthetic_segments = [
    # Parent-child relationship (1001-1002)
    [1001, 1002, 1, 10000, 50000000, False, 50.0],  # IBD1, chromosome 1
    [1001, 1002, 2, 5000,  60000000, False, 55.0],  # IBD1, chromosome 2
    [1001, 1002, 3, 20000, 40000000, False, 40.0],  # IBD1, chromosome 3
    
    # Full siblings relationship (1003-1004)
    [1003, 1004, 1, 10000, 30000000, False, 30.0],  # IBD1, chromosome 1
    [1003, 1004, 1, 40000, 50000000, True,  20.0],  # IBD2, chromosome 1
    [1003, 1004, 2, 5000,  25000000, False, 25.0],  # IBD1, chromosome 2
    [1003, 1004, 3, 20000, 35000000, False, 25.0],  # IBD1, chromosome 3
    [1003, 1004, 4, 15000, 45000000, True,  30.0],  # IBD2, chromosome 4
    
    # First cousins relationship (1005-1006)
    [1005, 1006, 1, 15000, 25000000, False, 15.0],  # IBD1, chromosome 1
    [1005, 1006, 2, 10000, 15000000, False, 10.0],  # IBD1, chromosome 2
    [1005, 1006, 3, 20000, 30000000, False, 12.0],  # IBD1, chromosome 3
]

# Define a function to calculate IBD statistics (similar to Bonsai's implementation)
def get_ibd_stats_unphased(unphased_ibd_segs):
    """Get IBD statistics from unphased segment data (v3 format)"""
    ibd_stats = defaultdict(lambda: {
        'total_half': 0,
        'total_full': 0,
        'num_half': 0,
        'num_full': 0,
        'max_seg_cm': 0})

    for s in unphased_ibd_segs:
        id1, id2, chromosome, start, end, is_full_ibd, seg_cm = s
        key = frozenset({id1, id2})
        ibd_stats[key]['total_half'] += (seg_cm if not is_full_ibd else 0)
        ibd_stats[key]['total_full'] += (seg_cm if is_full_ibd else 0)
        ibd_stats[key]['num_half'] += int(not is_full_ibd)
        ibd_stats[key]['num_full'] += int(is_full_ibd)
        ibd_stats[key]['max_seg_cm'] = max(ibd_stats[key]['max_seg_cm'], seg_cm)

    return dict(ibd_stats)

# Calculate IBD statistics for our synthetic segments
ibd_stats = get_ibd_stats_unphased(synthetic_segments)

# Display the IBD statistics in a DataFrame for easier viewing
ibd_stats_rows = []
for pair, stats in ibd_stats.items():
    id_pair = sorted(pair)  # frozenset order is arbitrary; sort for stable labels
    row = {
        'Pair': f"{id_pair[0]}-{id_pair[1]}",
        'Total IBD1 (cM)': stats['total_half'],
        'Total IBD2 (cM)': stats['total_full'],
        'IBD1 Segments': stats['num_half'],
        'IBD2 Segments': stats['num_full'],
        'Max Segment (cM)': stats['max_seg_cm'],
        'Total IBD (cM)': stats['total_half'] + stats['total_full'],
    }
    ibd_stats_rows.append(row)

ibd_stats_df = pd.DataFrame(ibd_stats_rows)
display(ibd_stats_df)

### Analyzing IBD Statistics for Different Relationships

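The toy segments above are far too few to reach realistic totals, but they mirror the patterns that real relationships produce. Assuming an autosomal genome of about 3400 cM, the expected values are:

1. **Parent-Child**: IBD1 across the whole genome (around 3400 cM), no IBD2
2. **Full Siblings**: IBD1 over about 50% of the genome (around 1700 cM) and IBD2 over about 25% (around 850 cM)
3. **First Cousins**: IBD1 over about 25% of the genome (around 850 cM), no IBD2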
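
Totals in cM can also be expressed as proportions of the genome, which is how expected values are usually stated. The following is a minimal sketch, assuming the 3400 cM genome length used in this lab; `ibd_proportions` is a hypothetical helper written for illustration, not a Bonsai function.

```python
GENOME_CM = 3400.0  # assumed autosomal genome length used throughout this lab

def ibd_proportions(total_half, total_full, genome_cm=GENOME_CM):
    """Convert IBD1/IBD2 totals (cM) into genome proportions.

    Hypothetical helper for illustration only.
    """
    p_ibd1 = total_half / genome_cm
    p_ibd2 = total_full / genome_cm
    p_ibd0 = 1.0 - p_ibd1 - p_ibd2
    # Fraction of DNA shared: IBD1 regions match on one of the two
    # haplotypes, IBD2 regions match on both.
    shared_dna = 0.5 * p_ibd1 + p_ibd2
    return {'ibd0': p_ibd0, 'ibd1': p_ibd1, 'ibd2': p_ibd2, 'shared_dna': shared_dna}

# Expected full-sibling totals: half the genome IBD1, a quarter IBD2
print(ibd_proportions(total_half=1700, total_full=850))

# Expected parent-child totals: the whole genome IBD1
print(ibd_proportions(total_half=3400, total_full=0))
```

Both pairs come out at the same fraction of shared DNA, which is why the IBD1/IBD2 split, and not the combined total alone, is needed to tell full siblings from a parent and child.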

Now let's use these statistics to visualize the relationships:

In [None]:
# Create a visualization of the different relationship types based on IBD statistics
plt.figure(figsize=(12, 8))

# Add expected values for known relationships for comparison
relationship_examples = ibd_stats_rows.copy()
expected_relationships = [
    {'Pair': 'Parent-Child (Expected)', 'Total IBD1 (cM)': 3400, 'Total IBD2 (cM)': 0, 'IBD1 Segments': 'Many', 'IBD2 Segments': 0},
    {'Pair': 'Full Siblings (Expected)', 'Total IBD1 (cM)': 1700, 'Total IBD2 (cM)': 850, 'IBD1 Segments': 'Many', 'IBD2 Segments': 'Many'},
    {'Pair': 'First Cousins (Expected)', 'Total IBD1 (cM)': 850, 'Total IBD2 (cM)': 0, 'IBD1 Segments': 'Several', 'IBD2 Segments': 0},
]

# Prepare data for plotting
pairs = [row['Pair'] for row in relationship_examples]
ibd1_values = [row['Total IBD1 (cM)'] for row in relationship_examples]
ibd2_values = [row['Total IBD2 (cM)'] for row in relationship_examples]

exp_pairs = [row['Pair'] for row in expected_relationships]
exp_ibd1 = [row['Total IBD1 (cM)'] for row in expected_relationships]
exp_ibd2 = [row['Total IBD2 (cM)'] for row in expected_relationships]

# Create plot
width = 0.35
x = np.arange(len(pairs))
ex = np.arange(len(pairs), len(pairs) + len(exp_pairs))

plt.bar(x, ibd1_values, width, label='IBD1 (Half-identical)', color='#3182bd')
plt.bar(x, ibd2_values, width, bottom=ibd1_values, label='IBD2 (Fully identical)', color='#6baed6')

plt.bar(ex, exp_ibd1, width, color='#31a354', alpha=0.7, label='Expected IBD1')
plt.bar(ex, exp_ibd2, width, bottom=exp_ibd1, color='#74c476', alpha=0.7, label='Expected IBD2')

# Add labels and formatting
plt.xlabel('Related Individual Pairs')
plt.ylabel('Total IBD Sharing (cM)')
plt.title('IBD Sharing Patterns for Different Relationships')
plt.xticks(np.concatenate([x, ex]), np.concatenate([pairs, exp_pairs]), rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)

# Add a line for the human genome size (about 3400 cM)
plt.axhline(y=3400, color='red', linestyle='--', alpha=0.7, label='Total Human Genome (≈3400 cM)')

# Build the legend last so it includes the genome-size line
plt.legend()

plt.tight_layout()
plt.show()

### Implementing an IBD Index for Efficient Statistics Access

When working with large datasets, Bonsai v3 uses efficient data structures to store and access IBD statistics. Let's implement a simplified version of the `IBDIndex` class from Bonsai v3:

In [None]:
class IBDIndex:
    """Class for efficiently indexing and retrieving IBD statistics."""
    
    def __init__(self, segments):
        """Initialize the IBD index with a list of IBD segments."""
        self.segments = segments
        self.segments_by_pair = defaultdict(list)
        self.pair_stats = {}
        
        # Index segments by pair
        for seg in segments:
            id1, id2 = seg[0], seg[1]
            pair_key = frozenset({id1, id2})
            self.segments_by_pair[pair_key].append(seg)
        
        # Compute statistics for each pair
        self._compute_pair_stats()
    
    def _compute_pair_stats(self):
        """Compute IBD statistics for all pairs."""
        self.pair_stats = get_ibd_stats_unphased(self.segments)
    
    def get_stats_for_pair(self, id1, id2):
        """Get IBD statistics for a specific pair of individuals."""
        pair_key = frozenset({id1, id2})
        return self.pair_stats.get(pair_key, {
            'total_half': 0,
            'total_full': 0,
            'num_half': 0,
            'num_full': 0,
            'max_seg_cm': 0
        })
    
    def get_segments_for_pair(self, id1, id2):
        """Get all IBD segments for a specific pair of individuals."""
        pair_key = frozenset({id1, id2})
        return self.segments_by_pair.get(pair_key, [])
    
    def get_all_pairs(self):
        """Get all pairs of individuals with IBD segments."""
        return list(self.pair_stats.keys())
    
    def get_total_ibd_between_id_sets(self, id_set1, id_set2):
        """Calculate total IBD shared between two sets of IDs."""
        total_ibd = 0
        
        for id1 in id_set1:
            for id2 in id_set2:
                if id1 == id2:
                    continue
                
                pair_key = frozenset({id1, id2})
                if pair_key in self.pair_stats:
                    stats = self.pair_stats[pair_key]
                    total_ibd += stats['total_half'] + stats['total_full']
        
        return total_ibd

# Create an IBD index for our synthetic segments
ibd_index = IBDIndex(synthetic_segments)

# Test the IBD index functions
print("All IBD pairs in the dataset:")
for pair in ibd_index.get_all_pairs():
    pair_list = sorted(pair)  # frozenset order is arbitrary; sort for stable labels
    print(f"- {pair_list[0]}-{pair_list[1]}")

print("\nIBD statistics for pair 1003-1004 (full siblings):")
stats_1003_1004 = ibd_index.get_stats_for_pair(1003, 1004)
for stat, value in stats_1003_1004.items():
    print(f"- {stat}: {value}")

print("\nIBD segments for pair 1003-1004:")
segments_1003_1004 = ibd_index.get_segments_for_pair(1003, 1004)
for i, seg in enumerate(segments_1003_1004):
    seg_type = "IBD2" if seg[5] else "IBD1"
    print(f"- Segment {i+1}: {seg_type}, Chromosome {seg[2]}, Length {seg[6]} cM")

## Part 2: IBD Segment Length Distributions

The distributions of IBD segment lengths provide valuable information for relationship inference. Different relationships have characteristic distributions of segment lengths.
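
A common simplification, which is an assumption here and not something taken from Bonsai's code, models the length of an IBD segment passed through m meioses as exponentially distributed with mean 100/m cM (ignoring chromosome ends). The sketch below uses that model to show how expected segment length shrinks as relationships become more distant; the function names are hypothetical.

```python
import random

def expected_segment_cm(num_meioses):
    """Mean IBD segment length (cM) under the exponential model: 100 / m."""
    return 100.0 / num_meioses

def sample_segment_lengths(num_meioses, n, seed=None):
    """Draw n segment lengths (cM) from an exponential with rate m / 100."""
    rng = random.Random(seed)  # local generator, leaves global random state alone
    return [rng.expovariate(num_meioses / 100.0) for _ in range(n)]

# Number of meioses separating each pair through a shared ancestor
for name, m in [("half siblings", 2), ("first cousins", 4), ("second cousins", 6)]:
    lengths = sample_segment_lengths(m, n=10000, seed=1)
    simulated_mean = sum(lengths) / len(lengths)
    print(f"{name}: expected {expected_segment_cm(m):.1f} cM, "
          f"simulated mean {simulated_mean:.1f} cM")
```

Under this model, each additional pair of meioses shortens the typical segment, which is the effect the histograms in this part are meant to show.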

Let's generate more realistic synthetic IBD segments for different relationships and analyze their length distributions:

In [None]:
# Function to generate synthetic IBD segments for a relationship
import random

def generate_synthetic_segments(id1, id2, relationship_type, num_segments=20, seed=None):
    """Generate synthetic IBD segments for a specific relationship type."""
    if seed is not None:
        random.seed(seed)
    
    segments = []
    
    # Set parameters based on relationship type
    if relationship_type == "parent-child":
        # Parent-child: long segments, covering entire genome, all IBD1
        total_cm = 3400
        seg_params = {
            'min_length': 80,
            'max_length': 180,
            'ibd2_prob': 0  # No IBD2 for parent-child
        }
    elif relationship_type == "full-siblings":
        # Full siblings: mix of IBD1 and IBD2, covering about 75% of genome
        total_cm = 2550  # 75% of genome is IBD1 or IBD2
        seg_params = {
            'min_length': 20,
            'max_length': 150,
            'ibd2_prob': 1 / 3  # IBD2 is ~25% of the genome, i.e. about a third of the shared 75%
        }
    elif relationship_type == "half-siblings":
        # Half siblings: only IBD1, covering about 50% of genome (25% of DNA shared)
        total_cm = 1700  # 50% of genome
        seg_params = {
            'min_length': 15,
            'max_length': 80,
            'ibd2_prob': 0  # No IBD2
        }
    elif relationship_type == "first-cousins":
        # First cousins: only IBD1, covering about 25% of genome (12.5% of DNA shared)
        total_cm = 850  # 25% of genome
        seg_params = {
            'min_length': 10,
            'max_length': 50,
            'ibd2_prob': 0  # No IBD2
        }
    elif relationship_type == "second-cousins":
        # Second cousins: only IBD1, covering about 6.25% of genome (3.125% of DNA shared)
        total_cm = 212  # 6.25% of genome
        seg_params = {
            'min_length': 8,
            'max_length': 30,
            'ibd2_prob': 0  # No IBD2
        }
    else:
        raise ValueError(f"Unknown relationship type: {relationship_type}")
    
    # Draw raw lengths, then rescale them so they sum exactly to total_cm.
    # (Adjusting only the last segment cannot fix the total when the earlier
    # draws already exceed it.) Rescaled lengths may fall outside the
    # min/max range.
    raw_lengths = [random.uniform(seg_params['min_length'], seg_params['max_length'])
                   for _ in range(num_segments)]
    scale = total_cm / sum(raw_lengths)
    
    # Generate segments
    for raw_length in raw_lengths:
        # Determine chromosome (1-22)
        chromosome = random.randint(1, 22)
        
        # Determine if this is an IBD2 segment
        is_full_ibd = random.random() < seg_params['ibd2_prob']
        
        # Rescaled segment length in cM
        segment_cm = raw_length * scale
        
        # Generate arbitrary start and end positions
        start_pos = random.randint(10000, 1000000)
        end_pos = start_pos + int(segment_cm * 1000000)  # Rough conversion from cM to bp
        
        # Create the segment
        segment = [id1, id2, chromosome, start_pos, end_pos, is_full_ibd, segment_cm]
        segments.append(segment)
    
    return segments

# Generate segments for different relationship types
all_segments = []

# Generate for parent-child (1001-1002)
parent_child_segments = generate_synthetic_segments(1001, 1002, "parent-child", num_segments=30, seed=42)
all_segments.extend(parent_child_segments)

# Generate for full siblings (1003-1004)
full_siblings_segments = generate_synthetic_segments(1003, 1004, "full-siblings", num_segments=40, seed=43)
all_segments.extend(full_siblings_segments)

# Generate for half siblings (1005-1006)
half_siblings_segments = generate_synthetic_segments(1005, 1006, "half-siblings", num_segments=25, seed=44)
all_segments.extend(half_siblings_segments)

# Generate for first cousins (1007-1008)
first_cousins_segments = generate_synthetic_segments(1007, 1008, "first-cousins", num_segments=20, seed=45)
all_segments.extend(first_cousins_segments)

# Generate for second cousins (1009-1010)
second_cousins_segments = generate_synthetic_segments(1009, 1010, "second-cousins", num_segments=15, seed=46)
all_segments.extend(second_cousins_segments)

# Create an IBD index for all segments
all_ibd_index = IBDIndex(all_segments)

# Display summary statistics
print("Relationship IBD Statistics:")
# Look up relationships by unordered pair; frozenset iteration order is arbitrary,
# so comparing pair_list[0] and pair_list[1] against fixed IDs is unreliable
relationship_names = {
    frozenset({1001, 1002}): "Parent-Child",
    frozenset({1003, 1004}): "Full Siblings",
    frozenset({1005, 1006}): "Half Siblings",
    frozenset({1007, 1008}): "First Cousins",
    frozenset({1009, 1010}): "Second Cousins",
}

for pair in all_ibd_index.get_all_pairs():
    pair_list = sorted(pair)
    stats = all_ibd_index.get_stats_for_pair(pair_list[0], pair_list[1])
    total_ibd = stats['total_half'] + stats['total_full']
    relationship = relationship_names.get(pair, "Unknown")
        
    print(f"\n{relationship} ({pair_list[0]}-{pair_list[1]})")
    print(f"- Total IBD: {total_ibd:.1f} cM")
    print(f"- IBD1: {stats['total_half']:.1f} cM ({stats['num_half']} segments)")
    print(f"- IBD2: {stats['total_full']:.1f} cM ({stats['num_full']} segments)")
    print(f"- Longest segment: {stats['max_seg_cm']:.1f} cM")

In [None]:
# Analyze segment length distributions
def plot_segment_length_distributions(ibd_index):
    """Plot segment length distributions for different relationships."""
    plt.figure(figsize=(14, 10))
    
    # Define relationships and their corresponding IDs
    relationships = [
        {"name": "Parent-Child", "ids": (1001, 1002), "color": "#1f77b4"},
        {"name": "Full Siblings", "ids": (1003, 1004), "color": "#ff7f0e"},
        {"name": "Half Siblings", "ids": (1005, 1006), "color": "#2ca02c"},
        {"name": "First Cousins", "ids": (1007, 1008), "color": "#d62728"},
        {"name": "Second Cousins", "ids": (1009, 1010), "color": "#9467bd"}
    ]
    
    # Plot segment length distributions
    plt.subplot(2, 1, 1)
    
    for rel in relationships:
        id1, id2 = rel["ids"]
        segments = ibd_index.get_segments_for_pair(id1, id2)
        
        # Extract segment lengths
        lengths = [seg[6] for seg in segments]
        
        # Plot histogram
        sns.histplot(lengths, bins=15, alpha=0.5, label=rel["name"], color=rel["color"], kde=True)
    
    plt.xlabel("Segment Length (cM)")
    plt.ylabel("Frequency")
    plt.title("IBD Segment Length Distributions by Relationship")
    plt.legend()
    plt.grid(alpha=0.3)
    
    # Plot segment count vs total IBD
    plt.subplot(2, 1, 2)
    
    counts = []
    totals = []
    labels = []
    colors = []
    
    for rel in relationships:
        id1, id2 = rel["ids"]
        stats = ibd_index.get_stats_for_pair(id1, id2)
        
        total = stats["total_half"] + stats["total_full"]
        count = stats["num_half"] + stats["num_full"]
        
        counts.append(count)
        totals.append(total)
        labels.append(rel["name"])
        colors.append(rel["color"])
    
    plt.scatter(counts, totals, s=100, c=colors)
    
    # Add labels to points
    for i, label in enumerate(labels):
        plt.annotate(label, (counts[i], totals[i]), 
                     textcoords="offset points", xytext=(0,10), ha='center')
    
    plt.xlabel("Number of IBD Segments")
    plt.ylabel("Total IBD (cM)")
    plt.title("Relationship Patterns: Total IBD vs. Segment Count")
    plt.grid(alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Plot the segment length distributions
plot_segment_length_distributions(all_ibd_index)

### Observations from the Segment Length Distributions

The plots above reveal several important patterns that Bonsai v3 leverages for relationship inference:

1. **Segment Length Patterns**:
   - Parent-child pairs have the longest segments (generated from an 80-180 cM range in this simulation)
   - Siblings have a mixture of medium and long segments
   - More distant relationships (cousins) have progressively shorter segments

2. **Segment Count vs Total IBD**:
   - In this simulation, close relationships have both high segment counts and high total IBD
   - As relationships become more distant, both metrics decrease
   - The relationship between these two metrics is not linear

These patterns allow Bonsai to distinguish between relationship types when inferring pedigrees.
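
As a rough illustration of how totals alone can separate some of these relationship types, the sketch below picks the relationship whose expected (IBD1, IBD2) totals are closest to an observed pair. The expected values assume a 3400 cM genome, and the nearest-match rule is a deliberately simple stand-in, not Bonsai's inference method; `EXPECTED_TOTALS` and `closest_relationship` are hypothetical names.

```python
# Expected (total IBD1 cM, total IBD2 cM), assuming a 3400 cM genome
EXPECTED_TOTALS = {
    "parent-child": (3400.0, 0.0),
    "full-siblings": (1700.0, 850.0),
    "half-siblings": (1700.0, 0.0),
    "first-cousins": (850.0, 0.0),
    "second-cousins": (212.5, 0.0),
}

def closest_relationship(total_half, total_full):
    """Return the relationship whose expected totals are nearest (Euclidean distance)."""
    def distance(expected):
        exp_half, exp_full = expected
        return ((total_half - exp_half) ** 2 + (total_full - exp_full) ** 2) ** 0.5

    return min(EXPECTED_TOTALS, key=lambda rel: distance(EXPECTED_TOTALS[rel]))

print(closest_relationship(total_half=1650, total_full=900))
print(closest_relationship(total_half=800, total_full=0))
```

Totals cannot separate relationships with the same expected sharing (half siblings, grandparent-grandchild, and avuncular pairs all sit near 1700 cM of IBD1), which is where segment counts and lengths become useful.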

## Part 3: Building IBD Networks and Community Detection

An important application of IBD statistics is the construction of IBD networks for community detection. Bonsai v3 uses these networks to identify groups of related individuals before detailed pedigree reconstruction.
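
Before any graph library is involved, the grouping idea can be sketched with a plain union-find: link two individuals whenever their total IBD meets a threshold, then read off the connected groups. This is a simplified stand-in for illustration (Bonsai's own clustering is not shown in this lab), and `group_by_ibd` and the `pair_totals` input are hypothetical names.

```python
def group_by_ibd(pair_totals, min_total_ibd=10):
    """Group individuals connected by IBD sharing >= min_total_ibd (cM).

    pair_totals: dict mapping (id1, id2) tuples to total IBD in cM.
    Returns a sorted list of sorted ID lists, one per connected group.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)  # register unseen IDs as their own root
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for (id1, id2), total_cm in pair_totals.items():
        root1, root2 = find(id1), find(id2)  # also registers both IDs
        if total_cm >= min_total_ibd and root1 != root2:
            parent[root2] = root1

    groups = {}
    for ind in parent:
        groups.setdefault(find(ind), []).append(ind)
    return sorted(sorted(members) for members in groups.values())

pair_totals = {
    (1001, 1002): 3400.0,
    (1002, 1011): 212.0,
    (2001, 2002): 3400.0,
    (1011, 2001): 4.0,  # below threshold: does not link the two families
}
print(group_by_ibd(pair_totals))
```

Raising `min_total_ibd` splits groups held together only by weak sharing, which is the same role the threshold plays when edges are added to the network.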

Let's build an IBD network from our synthetic segments:

In [None]:
# Function to build an IBD network
def build_ibd_network(ibd_index, min_total_ibd=10):
    """Build a network from IBD sharing data."""
    G = nx.Graph()
    
    # Add nodes for all individuals
    all_individuals = set()
    for pair in ibd_index.get_all_pairs():
        all_individuals.update(pair)
    
    for ind in all_individuals:
        G.add_node(ind)
    
    # Add edges for IBD sharing above threshold
    for pair in ibd_index.get_all_pairs():
        pair_list = list(pair)
        id1, id2 = pair_list[0], pair_list[1]
        
        stats = ibd_index.get_stats_for_pair(id1, id2)
        total_ibd = stats['total_half'] + stats['total_full']
        
        if total_ibd >= min_total_ibd:
            G.add_edge(id1, id2, weight=total_ibd, stats=stats)
    
    return G

# Function to visualize the IBD network
def visualize_ibd_network(G, title="IBD Network"):
    """Visualize the IBD network."""
    plt.figure(figsize=(12, 10))
    
    # Compute layout
    pos = nx.spring_layout(G, seed=42)
    
    # Get edge weights for line thickness
    edge_weights = [G[u][v]['weight'] / 100 for u, v in G.edges()]
    
    # Draw the network
    nx.draw_networkx_nodes(G, pos, node_size=300, node_color='lightblue')
    nx.draw_networkx_edges(G, pos, width=edge_weights, alpha=0.7)
    nx.draw_networkx_labels(G, pos, font_size=10)
    
    # Add edge labels with IBD amounts
    edge_labels = {(u, v): f"{G[u][v]['weight']:.0f} cM" for u, v in G.edges()}
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=8)
    
    plt.title(title)
    plt.axis('off')
    plt.tight_layout()
    plt.show()

# Build and visualize the IBD network
ibd_network = build_ibd_network(all_ibd_index, min_total_ibd=10)
visualize_ibd_network(ibd_network, title="IBD Sharing Network (min 10 cM)")

### Community Detection in IBD Networks

Bonsai v3 uses community detection algorithms to identify clusters of related individuals in IBD networks. Let's apply community detection to our synthetic network:

In [None]:
# Function to detect communities in the IBD network
def detect_communities(G):
    """Detect communities in the IBD network using Louvain algorithm."""
    # Import community detection algorithm
    try:
        import community as community_louvain
        
        # Apply Louvain method
        partition = community_louvain.best_partition(G)
        
        # Group nodes by community
        communities = defaultdict(list)
        for node, comm_id in partition.items():
            communities[comm_id].append(node)
        
        return list(communities.values())
    except ImportError:
        print("Community detection library not available.")
        print("Using a simple approach based on connected components.")
        
        # Simple approach using connected components
        return list(nx.connected_components(G))

# Function to visualize communities in the IBD network
def visualize_communities(G, communities, title="IBD Network Communities"):
    """Visualize communities in the IBD network."""
    plt.figure(figsize=(12, 10))
    
    # Compute layout
    pos = nx.spring_layout(G, seed=42)
    
    # Generate colors for communities
    # plt.cm.get_cmap was removed in recent Matplotlib releases; plt.get_cmap is still available
    color_map = plt.get_cmap('tab10', len(communities))
    
    # Draw each community with a different color
    for i, community in enumerate(communities):
        nx.draw_networkx_nodes(G, pos, nodelist=list(community), 
                            node_color=[color_map(i)], node_size=300)
    
    # Draw edges
    edge_weights = [G[u][v]['weight'] / 100 for u, v in G.edges()]
    nx.draw_networkx_edges(G, pos, width=edge_weights, alpha=0.7)
    
    # Draw labels
    nx.draw_networkx_labels(G, pos, font_size=10)
    
    # Create a legend for communities
    import matplotlib.patches as mpatches
    patches = [mpatches.Patch(color=color_map(i), label=f"Community {i+1}") 
               for i in range(len(communities))]
    plt.legend(handles=patches)
    
    plt.title(title)
    plt.axis('off')
    plt.tight_layout()
    plt.show()

# Add more relationships to create additional communities
# Community 1: Include distant cousins of our first group
additional_segments = generate_synthetic_segments(1002, 1011, "second-cousins", num_segments=10, seed=47)
all_segments.extend(additional_segments)
additional_segments = generate_synthetic_segments(1011, 1012, "half-siblings", num_segments=15, seed=48)
all_segments.extend(additional_segments)

# Community 2: Create a separate family
additional_segments = generate_synthetic_segments(2001, 2002, "parent-child", num_segments=25, seed=49)
all_segments.extend(additional_segments)
additional_segments = generate_synthetic_segments(2001, 2003, "parent-child", num_segments=25, seed=50)
all_segments.extend(additional_segments)
additional_segments = generate_synthetic_segments(2002, 2003, "full-siblings", num_segments=35, seed=51)
all_segments.extend(additional_segments)
additional_segments = generate_synthetic_segments(2003, 2004, "first-cousins", num_segments=18, seed=52)
all_segments.extend(additional_segments)

# Re-create IBD index and network with all segments
expanded_ibd_index = IBDIndex(all_segments)
expanded_ibd_network = build_ibd_network(expanded_ibd_index, min_total_ibd=10)

# Detect and visualize communities
communities = detect_communities(expanded_ibd_network)
visualize_communities(expanded_ibd_network, communities, title="IBD Network Communities")

### Community Analysis

Let's analyze the IBD sharing patterns within and between communities:
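
As a warm-up, the core of that analysis (deciding whether an edge joins two members of the same community) can be sketched on a toy edge list. This is a minimal sketch, not Bonsai code: the node names, weights, and the `node_to_community` lookup are all hypothetical, and the edge triples have the same `(u, v, weight)` shape that `G.edges(data="weight")` yields in networkx.

```python
# Toy edge list: two families plus one weak cross-family link (hypothetical values, in cM)
toy_edges = [
    ("A1", "A2", 3400.0),
    ("A2", "A3", 1700.0),
    ("B1", "B2", 2550.0),
    ("A3", "B1", 15.0),
]
toy_communities = [{"A1", "A2", "A3"}, {"B1", "B2"}]

# Map each node to the index of its community so each edge needs only two lookups
node_to_community = {
    node: index
    for index, members in enumerate(toy_communities)
    for node in members
}

within, between = [], []
for u, v, weight in toy_edges:
    if node_to_community[u] == node_to_community[v]:
        within.append(weight)   # both endpoints in the same community
    else:
        between.append(weight)  # edge crosses a community boundary

print("Within:", sorted(within))
print("Between:", between)
```

Building the node-to-community map once avoids rescanning every community for every edge. The cell below applies the same split to the full network and plots the result.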

In [None]:
# Function to analyze IBD patterns within and between communities
def analyze_community_ibd(G, communities):
    """Analyze IBD sharing patterns within and between communities."""
    within_ibd = []
    between_ibd = []
    
    for u, v, data in G.edges(data=True):
        # Check if edge is within a community or between communities
        is_within = False
        for community in communities:
            if u in community and v in community:
                within_ibd.append(data['weight'])
                is_within = True
                break
        
        if not is_within:
            between_ibd.append(data['weight'])
    
    # Plot the distributions
    plt.figure(figsize=(12, 8))
    
    plt.subplot(2, 1, 1)
    # Size the bins from whichever values exist; fall back to 100 cM only if there are no edges
    all_ibd = within_ibd + between_ibd
    bins = np.linspace(0, max(all_ibd) if all_ibd else 100, 20)
    plt.hist(within_ibd, bins=bins, alpha=0.7, label='Within Communities')
    plt.hist(between_ibd, bins=bins, alpha=0.7, label='Between Communities')
    plt.xlabel('Total IBD Sharing (cM)')
    plt.ylabel('Frequency')
    plt.title('Distribution of IBD Sharing Within vs. Between Communities')
    plt.legend()
    plt.grid(alpha=0.3)
    
    # Create a table with statistics
    plt.subplot(2, 1, 2)
    plt.axis('off')
    
    # Use a distinct name so the edge-attribute variable `data` from the loop above is not reused
    table_rows = [
        ['Within Communities', len(within_ibd), 
         f"{np.mean(within_ibd):.1f}" if within_ibd else "N/A", 
         f"{np.min(within_ibd):.1f}" if within_ibd else "N/A", 
         f"{np.max(within_ibd):.1f}" if within_ibd else "N/A"],
        ['Between Communities', len(between_ibd), 
         f"{np.mean(between_ibd):.1f}" if between_ibd else "N/A", 
         f"{np.min(between_ibd):.1f}" if between_ibd else "N/A", 
         f"{np.max(between_ibd):.1f}" if between_ibd else "N/A"]
    ]
    
    table = plt.table(cellText=table_rows, colLabels=['', 'Count', 'Mean IBD (cM)', 'Min IBD (cM)', 'Max IBD (cM)'],
                     loc='center', cellLoc='center')
    table.auto_set_font_size(False)
    table.set_fontsize(12)
    table.scale(1, 2)
    
    plt.tight_layout()
    plt.show()

    # Print community compositions
    print("Community Compositions:")
    for i, community in enumerate(communities):
        print(f"\nCommunity {i+1}: {len(community)} individuals")
        print(", ".join(str(member) for member in sorted(community)))

# Analyze IBD patterns in communities
analyze_community_ibd(expanded_ibd_network, communities)

## Part 4: Using IBD Statistics for Relationship Inference

The final step in IBD statistics analysis is using these statistics to infer relationships. Bonsai v3 implements sophisticated statistical models in the `PwLogLike` class to calculate relationship likelihoods.
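
To make the idea of a relationship likelihood concrete, here is a minimal sketch of the kind of score the simplified class below uses: an unnormalized Gaussian log-likelihood, which penalizes the squared distance between an observed statistic and its expected value. The helper name `gaussian_log_score` and the observed value of 1650 cM are hypothetical; the expected totals and standard deviations are the half-sibling and first-cousin values used later in this lab. This is not Bonsai's actual model, and it drops the normalization term that a full Gaussian likelihood would include.

```python
def gaussian_log_score(observed, expected, std_dev):
    """Unnormalized Gaussian log-likelihood: -(x - mu)^2 / (2 * sigma^2)."""
    return -((observed - expected) ** 2) / (2 * std_dev ** 2)

observed_total = 1650.0  # hypothetical observed total IBD sharing, in cM

# (expected total IBD in cM, standard deviation in cM) for two candidate relationships
candidates = {
    "half siblings": (1700, 200),
    "first cousins": (850, 150),
}

scores = {
    name: gaussian_log_score(observed_total, expected, std_dev)
    for name, (expected, std_dev) in candidates.items()
}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Best match:", best)
```

The observed total is only 50 cM from the half-sibling expectation but 800 cM from the first-cousin expectation, so the half-sibling score is much closer to zero and wins the comparison.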

Let's implement a simplified version of this class to demonstrate the concept:

In [None]:
class SimplifiedPwLogLike:
    """Simplified class for computing pairwise relationship likelihoods."""
    
    def __init__(self, ibd_index):
        """Initialize with an IBD index."""
        self.ibd_index = ibd_index
        
        # Expected IBD statistics for each relationship (approximate values for illustration).
        # Key:   (up, down, num_ancs) = generations up from id1 to the common ancestor(s),
        #        generations down from there to id2, and the number of common ancestors.
        # Value: [expected_total_ibd (cM), expected_ibd2_proportion, expected_segments, std_dev (cM)]
        self.relationship_params = {
            (0, 0, 2): [3400, 1.0, 1, 0],           # Self
            (0, 1, 1): [3400, 0.0, 30, 200],        # Parent-child
            (1, 0, 1): [3400, 0.0, 30, 200],        # Child-parent
            (1, 1, 2): [3400, 0.25, 40, 300],       # Full siblings
            (1, 1, 1): [1700, 0.0, 25, 200],        # Half siblings
            (0, 2, 1): [1700, 0.0, 25, 200],        # Grandparent-grandchild
            (2, 0, 1): [1700, 0.0, 25, 200],        # Grandchild-grandparent
            (2, 2, 2): [850, 0.0, 20, 150],         # Full first cousins
            (2, 2, 1): [425, 0.0, 15, 100],         # Half first cousins
            (3, 3, 2): [212, 0.0, 10, 75],          # Full second cousins
            (3, 3, 1): [106, 0.0, 8, 50],           # Half second cousins
        }
    
    def get_relationship_options(self, max_deg=3):
        """Get all possible relationship options up to a given degree."""
        options = []
        
        for up in range(max_deg + 1):
            for down in range(max_deg + 1):
                if up == 0 and down == 0:
                    # Self relationship
                    options.append((0, 0, 2))
                elif up == 0 or down == 0:
                    # Direct lineage - only one common ancestor
                    if up + down <= max_deg:
                        options.append((up, down, 1))
                else:
                    # Not direct lineage - could be full or half
                    if up + down <= max_deg * 2:
                        options.append((up, down, 2))  # Full relationship
                        options.append((up, down, 1))  # Half relationship
        
        return options
    
    def get_relationship_likelihood(self, id1, id2, rel_tuple):
        """Compute the likelihood of a specific relationship between two individuals."""
        # Get IBD statistics for the pair
        stats = self.ibd_index.get_stats_for_pair(id1, id2)
        total_ibd = stats['total_half'] + stats['total_full']
        ibd2_prop = stats['total_full'] / total_ibd if total_ibd > 0 else 0
        num_segments = stats['num_half'] + stats['num_full']
        
        # Get expected parameters for this relationship type
        if rel_tuple not in self.relationship_params:
            return float('-inf')  # Unsupported relationship
        
        exp_total, exp_ibd2_prop, exp_segments, std_dev = self.relationship_params[rel_tuple]
        
        # Compute log likelihood based on deviation from expected values.
        # Floor the std dev at 1 cM so the "Self" entry (std_dev = 0) cannot divide by zero.
        total_sd = max(std_dev, 1.0)
        total_ibd_ll = -((total_ibd - exp_total) ** 2) / (2 * total_sd ** 2)
        ibd2_prop_ll = -((ibd2_prop - exp_ibd2_prop) ** 2) / 0.02  # 2 * sigma^2 with sigma = 0.1
        segments_ll = -((num_segments - exp_segments) ** 2) / (2 * (exp_segments / 2) ** 2)
        
        # Combine the three terms (unweighted sum in log space)
        combined_ll = total_ibd_ll + ibd2_prop_ll + segments_ll
        
        return combined_ll
    
    def infer_relationship(self, id1, id2, max_deg=3):
        """Infer the most likely relationship between two individuals."""
        # Get all possible relationship options
        options = self.get_relationship_options(max_deg)
        
        # Compute likelihood for each option
        likelihoods = []
        for rel_tuple in options:
            ll = self.get_relationship_likelihood(id1, id2, rel_tuple)
            likelihoods.append((rel_tuple, ll))
        
        # Sort by likelihood (highest first)
        likelihoods.sort(key=lambda x: x[1], reverse=True)
        
        return likelihoods

# Create a utility function to describe relationships
def describe_relationship(rel_tuple):
    """Convert a relationship tuple to a human-readable description of id1's role."""
    # Full avuncular pairs share two common ancestors (num_ancs = 2);
    # a single common ancestor makes the relationship "half".
    names = {
        (0, 0, 2): "Self",
        (0, 1, 1): "Parent",
        (1, 0, 1): "Child",
        (1, 1, 2): "Full Sibling",
        (1, 1, 1): "Half Sibling",
        (0, 2, 1): "Grandparent",
        (2, 0, 1): "Grandchild",
        (1, 2, 2): "Full Aunt/Uncle",
        (1, 2, 1): "Half Aunt/Uncle",
        (2, 1, 2): "Full Niece/Nephew",
        (2, 1, 1): "Half Niece/Nephew",
        (2, 2, 2): "Full First Cousin",
        (2, 2, 1): "Half First Cousin",
        (3, 3, 2): "Full Second Cousin",
        (3, 3, 1): "Half Second Cousin",
    }
    up, down, num_ancs = rel_tuple
    return names.get((up, down, num_ancs),
                     f"Complex Relationship ({up}, {down}, {num_ancs})")

# Create a simplified likelihood calculator
pw_likelihood = SimplifiedPwLogLike(expanded_ibd_index)

# Test relationship inference on several pairs
pairs_to_test = [
    (1001, 1002),  # Parent-child
    (1003, 1004),  # Full siblings
    (1005, 1006),  # Half siblings
    (1007, 1008),  # First cousins
    (1009, 1010),  # Second cousins
    (1001, 1011),  # Unknown (distant relation)
    (1001, 2001),  # Unrelated
]

print("Relationship Inference Results:")
for id1, id2 in pairs_to_test:
    print(f"\nRelationship between {id1} and {id2}:")
    
    # Get relationship likelihoods
    likelihoods = pw_likelihood.infer_relationship(id1, id2)
    
    # Display top 3 most likely relationships
    print("Top 3 most likely relationships:")
    for i, (rel_tuple, ll) in enumerate(likelihoods[:3]):
        rel_desc = describe_relationship(rel_tuple)
        print(f"{i+1}. {rel_desc} ({rel_tuple}): Log-likelihood = {ll:.2f}")

## Summary

In this lab, we've explored how Bonsai v3 extracts and analyzes statistics from IBD segments to infer relationships and build pedigrees. Key takeaways include:

1. **Core IBD Statistics**: Bonsai tracks five key statistics for each pair of individuals: total IBD1, total IBD2, number of IBD1 segments, number of IBD2 segments, and maximum segment length.

2. **Relationship Patterns**: Different relationship types show characteristic patterns of IBD sharing, with both the amount and distribution of segments providing information about the relationship.

3. **IBD Networks**: Building networks based on IBD sharing helps identify communities of related individuals, which can serve as starting points for pedigree reconstruction.

4. **Relationship Inference**: By comparing observed IBD statistics to expected patterns for different relationships, Bonsai can compute likelihoods for different possible relationships.

5. **Statistical Models**: The `PwLogLike` class in Bonsai v3 implements sophisticated statistical models that account for the stochastic nature of genetic inheritance.
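
As a recap of the first takeaway, the five per-pair statistics can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Bonsai's implementation: the segment format `(id1, id2, ibd_type, length_cm)`, the function name `pair_stats`, and the key `max_seg_cm` are hypothetical, while the other four keys match the ones used by the likelihood class in Part 4.

```python
from collections import defaultdict

def pair_stats(segments):
    """Aggregate per-pair IBD statistics from (id1, id2, ibd_type, length_cm) tuples."""
    stats = defaultdict(lambda: {
        "total_half": 0.0,  # total IBD1 length (cM)
        "total_full": 0.0,  # total IBD2 length (cM)
        "num_half": 0,      # number of IBD1 segments
        "num_full": 0,      # number of IBD2 segments
        "max_seg_cm": 0.0,  # longest single segment (cM)
    })
    for id1, id2, ibd_type, length_cm in segments:
        pair = (min(id1, id2), max(id1, id2))  # treat the pair as unordered
        entry = stats[pair]
        if ibd_type == 2:
            entry["total_full"] += length_cm
            entry["num_full"] += 1
        else:
            entry["total_half"] += length_cm
            entry["num_half"] += 1
        entry["max_seg_cm"] = max(entry["max_seg_cm"], length_cm)
    return dict(stats)

# Hypothetical toy segments; note the second one lists the same pair in reverse order
toy_segments = [
    (1001, 1002, 1, 50.0),
    (1002, 1001, 1, 30.0),
    (1001, 1002, 2, 20.0),
    (1003, 1004, 1, 12.5),
]
toy_stats = pair_stats(toy_segments)
print(toy_stats[(1001, 1002)])
```

Sorting the two IDs into a canonical order means segments reported as `(1001, 1002)` and `(1002, 1001)` accumulate into the same entry.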

In the next lab, we'll explore how Bonsai builds on these IBD statistics to implement probabilistic relationship inference using statistical models.

In [None]:
# Convert this notebook to PDF using poetry
!poetry run jupyter nbconvert --to pdf Lab04_IBD_Statistics_Extraction.ipynb

# Note: PDF conversion requires LaTeX (with XeTeX) and Pandoc to be installed on your system
# If you encounter errors, you may need to install them:
# On Ubuntu/Debian: sudo apt-get install pandoc texlive-xetex texlive-fonts-recommended texlive-plain-generic
# On macOS with Homebrew: brew install pandoc texlive