# Lab 12: Fundamentals of Genetic Genealogy and IBD Segments

In this lab, we will explore the foundational concepts of Identity-By-Descent (IBD) segments and their role in genetic genealogy and pedigree reconstruction. Building upon the introduction to Bonsai in Lab 11, we will dive deeper into the theoretical aspects of IBD segments, analyze their statistical properties, and understand how they relate to genealogical relationships.

**Learning Objectives**:
- Develop a comprehensive understanding of IBD segments and their role in genetic genealogy
- Differentiate between IBD1 and IBD2 segments and understand their significance
- Analyze the mathematical models that describe IBD segment inheritance patterns
- Explore the statistical distributions of IBD segments across different relationship degrees
- Apply visualization techniques to interpret IBD sharing patterns
- Connect theoretical IBD segment models to practical pedigree reconstruction applications

## Environment Setup

In [None]:
!poetry install --no-root

In [None]:
import ast
import importlib.util
import json
import logging
import os
import subprocess
import sys
from collections import Counter
from pathlib import Path

import boto3
import IPython
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import pygraphviz as pgv
import seaborn as sns
from IPython.display import display, HTML
from scipy.stats import poisson, expon

from dotenv import load_dotenv

In [None]:
def find_comp_gen_dir():
    """Find the computational_genetic_genealogy directory by searching up from current directory."""
    current = Path.cwd()
    
    # Search up through parent directories
    while current != current.parent:
        # Check if target directory exists in current path
        target = current / 'computational_genetic_genealogy'
        if target.is_dir():
            return target
        # Move up one directory
        current = current.parent
    
    raise FileNotFoundError("Could not find computational_genetic_genealogy directory")

def load_env_file():
    """Find and load the .env file from the computational_genetic_genealogy directory."""
    try:
        # Find the computational_genetic_genealogy directory
        comp_gen_dir = find_comp_gen_dir()
        
        # Look for .env file
        env_path = comp_gen_dir / '.env'
        if not env_path.exists():
            print(f"Warning: No .env file found in {comp_gen_dir}")
            return None
        
        # Load the .env file
        load_dotenv(env_path, override=True)
        print(f"Loaded environment variables from: {env_path}")
        return env_path
        
    except FileNotFoundError as e:
        print(f"Error: {e}")
        return None

# Use the function
env_path = load_env_file()

working_directory = os.getenv('PROJECT_WORKING_DIR', default=None)
data_directory = os.getenv('PROJECT_DATA_DIR', default=None)
references_directory = os.getenv('PROJECT_REFERENCES_DIR', default=None)
results_directory = os.getenv('PROJECT_RESULTS_DIR', default=None)
utils_directory = os.getenv('PROJECT_UTILS_DIR', default=None)

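# Export the directories under shorter names, skipping any that are unset
# (assigning None to os.environ raises a TypeError)
for env_name, env_value in [
    ("WORKING_DIRECTORY", working_directory),
    ("DATA_DIRECTORY", data_directory),
    ("REFERENCES_DIRECTORY", references_directory),
    ("RESULTS_DIRECTORY", results_directory),
    ("UTILS_DIRECTORY", utils_directory),
]:
    if env_value is not None:
        os.environ[env_name] = env_value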

print(f"Working Directory: {working_directory}")
print(f"Data Directory: {data_directory}")
print(f"References Directory: {references_directory}")
print(f"Results Directory: {results_directory}")
print(f"Utils Directory: {utils_directory}")

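# Only change directory when PROJECT_WORKING_DIR was actually set
if working_directory is not None:
    os.chdir(working_directory)
print(f"The current directory is {os.getcwd()}")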

## 1. Understanding IBD Segments

Identity-By-Descent (IBD) segments are stretches of DNA that are identical between two individuals because both inherited them from a common ancestor. These segments form the foundation for computational genetic genealogy and pedigree reconstruction.

### Loading IBD Segment Data

Let's start by loading the IBD segments we detected in previous labs. We'll use these segments to analyze IBD patterns and understand their relationship to genealogical connections.

In [None]:
# Load IBD segments from our previous detection
seg_file = os.path.join(data_directory, "class_data/ped_sim_run2.seg")
seg_df = pd.read_csv(seg_file, sep="\t", header=None)
seg_df.columns = ["sample1", "sample2", "chrom", "phys_start", "phys_end", "ibd_type", "gen_start", "gen_end", "gen_seg_len"]

# Display first few rows
seg_df.head()

In [None]:
# Simply extract the numeric part for statistics
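# expand=False returns a Series (one capture group) rather than a one-column DataFrame
seg_df['ibd_type_numeric'] = seg_df['ibd_type'].str.extract(r'IBD(\d+)', expand=False).astype(int)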

# Basic statistics
print(f"Total number of IBD segments: {len(seg_df)}")
print(f"Number of IBD1 segments: {len(seg_df[seg_df['ibd_type_numeric'] == 1])}")
print(f"Number of IBD2 segments: {len(seg_df[seg_df['ibd_type_numeric'] == 2])}")
print(f"Average segment length: {seg_df['gen_seg_len'].mean():.2f} cM")
print(f"Median segment length: {seg_df['gen_seg_len'].median():.2f} cM")
print(f"Min segment length: {seg_df['gen_seg_len'].min():.2f} cM")
print(f"Max segment length: {seg_df['gen_seg_len'].max():.2f} cM")

### Segment Length Distribution

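The distribution of IBD segment lengths carries information about how closely two individuals are related: every additional meiosis between them gives recombination another chance to break up a shared segment, so closer relatives tend to share longer segments. Let's visualize this distribution.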
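
The 7 cM threshold drawn on the histogram does more than hide noise: it changes the summary statistics of whatever remains. The following is a minimal simulation, assuming segment lengths are exponential with mean 100/r cM (the model introduced in Section 2) and using a hypothetical value of r = 5. The function name and its parameters are illustrative and are not part of the lab's utilities.

```python
import random

def simulate_truncated_mean(r, threshold_cm=7.0, n=200_000, seed=0):
    """Return (mean of all lengths, mean of lengths >= threshold), in cM.

    Assumes exponential segment lengths with rate r/100 per cM.
    """
    rng = random.Random(seed)
    lengths = [rng.expovariate(r / 100.0) for _ in range(n)]
    kept = [length for length in lengths if length >= threshold_cm]
    return sum(lengths) / len(lengths), sum(kept) / len(kept)

mean_all, mean_kept = simulate_truncated_mean(r=5)
print(f"Mean of all simulated segments: {mean_all:.1f} cM")
print(f"Mean of segments >= 7 cM:       {mean_kept:.1f} cM")
```

Because the exponential distribution is memoryless, the mean of the retained segments should sit close to the threshold plus 100/r, here about 27 cM rather than 20 cM. Averages computed on thresholded data are therefore not directly comparable to the untruncated model mean.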

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(seg_df['gen_seg_len'], bins=50, kde=True)
plt.title('Distribution of IBD Segment Lengths')
plt.xlabel('Segment Length (cM)')
plt.ylabel('Frequency')
plt.axvline(x=7, color='red', linestyle='--', label='Common 7 cM threshold')
plt.legend()
plt.xlim(0, 200)
plt.show()

In [None]:
# Create a new label column based on the numeric values
seg_df['ibd_type_label'] = seg_df['ibd_type_numeric'].map({1: 'IBD1', 2: 'IBD2'})

# Check if we have data for both IBD1 and IBD2
print(f"Number of IBD1 segments: {len(seg_df[seg_df['ibd_type_numeric'] == 1])}")
print(f"Number of IBD2 segments: {len(seg_df[seg_df['ibd_type_numeric'] == 2])}")

# Only create the plot if we have data to plot
if len(seg_df['ibd_type_label'].unique()) > 1:
    # Standard histogram plot
    plt.figure(figsize=(10, 6))
    sns.histplot(data=seg_df, x='gen_seg_len', hue='ibd_type_label', 
                bins=50, kde=True, alpha=0.6)
    plt.title('Distribution of IBD1 vs IBD2 Segment Lengths')
    plt.xlabel('Segment Length (cM)')
    plt.ylabel('Frequency')
    plt.xlim(0, 200)
    plt.show()
else:
    # Alternative: If you only have one IBD type, just plot that one
    ibd_type = seg_df['ibd_type_label'].unique()[0]
    plt.figure(figsize=(10, 6))
    sns.histplot(data=seg_df, x='gen_seg_len', bins=50, kde=True, alpha=0.6, color='blue')
    plt.title(f'Distribution of {ibd_type} Segment Lengths')
    plt.xlabel('Segment Length (cM)')
    plt.ylabel('Frequency')
    plt.xlim(0, 200)
    plt.show()
    
    print(f"NOTE: Only {ibd_type} segments were found in the data.")

## 2. Mathematical Models of IBD Segment Inheritance

The inheritance of IBD segments follows specific mathematical models. Let's implement and visualize these models to understand how IBD segments are distributed for different relationship types.

### The Exponential Model of Segment Length

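For distant relationships, IBD segment lengths are approximately exponentially distributed. The probability that an IBD segment is longer than x centiMorgans is:

$P(L > x) = e^{-rx/100}$

The factor of 100 converts centiMorgans to Morgans, so the rate is r/100 per cM and the mean segment length is 100/r cM.

where:
- L is the length of an IBD segment (in centiMorgans)
- x is a specific length threshold (in centiMorgans)
- r is the number of meioses separating the two individuals (for two relatives in the same generation who descend from one common ancestor, twice the number of generations back to that ancestor)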
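
To make the model concrete, here is a small worked sketch that evaluates the survival probability with lengths in cM, so the rate is r/100 per cM as in `segment_length_pdf`. The helper names are illustrative, and the values of r are arbitrary examples rather than specific relationships.

```python
import math

def survival_probability(x_cm, r):
    """P(L > x) for an exponential length with rate r/100 per cM."""
    return math.exp(-r * x_cm / 100.0)

def mean_length_cm(r):
    """Mean segment length implied by the same model."""
    return 100.0 / r

for r in (2, 4, 6, 8, 10):
    print(
        f"r={r:2d}  mean={mean_length_cm(r):5.1f} cM  "
        f"P(L > 7 cM)={survival_probability(7, r):.3f}"
    )
```

As r grows, the mean length shrinks and a larger share of segments falls below a 7 cM detection threshold, which is why distant relationships are harder to detect.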

Let's implement this model and visualize the expected segment length distributions for different relationship types.

In [None]:
def segment_length_pdf(x, r):
    """Probability density function for IBD segment length.
    
    Args:
        x: Segment length in centiMorgans
        r: Number of meioses (2 * generations to common ancestor)
    
    Returns:
        Probability density at length x
    """
    # According to Bonsai v3 likelihoods.py, exponential distribution
    # with rate r/100 is used for distant relationships
    return (r/100) * np.exp(-(r/100) * x)

def expected_number_segments(relationship_coefficient, min_segment_length=7, genome_length=3500):
    """Calculate expected number of IBD segments above min_segment_length.
    
    This follows the Bonsai v3 implementation (likelihoods.py).
    
    Args:
        relationship_coefficient: Coefficient of relatedness (e.g., 0.5 for parent-child)
        min_segment_length: Minimum segment length in cM to consider
        genome_length: Length of the genome in cM
    
    Returns:
        Expected number of segments
    """
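    # r is the relationship degree, -log2(coefficient). It equals the number of
    # meioses only for pairs connected through a single common ancestor.
    # np.log2 avoids round-off from dividing two logarithms, which matters
    # for the equality checks below.
    r = -np.log2(relationship_coefficient)
    
    # For close relationships (r <= 6), Bonsai uses empirically determined values
    if r == 1:  # Parent-child or full siblings
        return 22  # Full siblings are overridden when the results table is built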
    elif r == 2:  # Grandparent/Half-siblings
        return 17
    elif r == 3:  # First cousins
        return 10
    elif r == 4:  # First cousins once removed
        return 5
    elif r == 5:  # Second cousins
        return 3
    elif r == 6:  # Second cousins once removed
        return 1
    else:
        # For distant relationships, Bonsai uses this formula derived from theory
        # Bonsai constants: R=35.5, C=22, where R is recombination events per meiosis 
        # and C is number of autosomes
        R = 35.5
        C = 22
        a = 2 if relationship_coefficient == 0.5 else 1  # Number of common ancestors
        expected_segments = (2**(1-r)) * a * (R*r + C) * np.exp(-r * min_segment_length/100)
        return expected_segments

def expected_total_length(relationship_coefficient, min_segment_length=7):
    """Calculate expected total length of IBD segments.
    
    This follows the Bonsai v3 implementation (likelihoods.py).
    
    Args:
        relationship_coefficient: Coefficient of relatedness
        min_segment_length: Minimum segment length in cM to consider
    
    Returns:
        Expected total length in cM
    """
    # r is the relationship degree, -log2(coefficient). It equals the number of
    # meioses only for pairs connected through a single common ancestor.
    r = -np.log2(relationship_coefficient)
    genome_length = 3545  # From Bonsai v3 constants.py AUTO_GENOME_LENGTH
    
    # Different expectations for different relationship types
    if r == 1:  # Parent-child (full siblings are overridden when the results table is built)
        return genome_length  # Full autosomal genome, ~3545 cM
    elif r == 2:  # Grandparent/Half-siblings
        # Mean segment length is ~38cM for r=2 (from Bonsai)
        # 17 segments * 38cM = 646cM
        return 646
    elif r == 3:  # First cousins
        return 280  # 10 segments * 28cM
    elif r == 4:  # First cousins once removed
        return 110  # 5 segments * 22cM
    elif r == 5:  # Second cousins
        return 54   # 3 segments * 18cM
    elif r == 6:  # Second cousins once removed
        return 15   # 1 segment * 15cM
    else:
        # For distant relationships (r > 6):
        # Expected total length = expected number of segments * mean segment length
        # Mean segment length is approximately 100/r cM
        return expected_number_segments(relationship_coefficient) * (100/r)

In [None]:
# Define relationship types and their coefficients
relationships = {
    'Parent-Child': 0.5,
    'Full Siblings': 0.5,
    'Grandparent-Grandchild': 0.25,
    'Half Siblings': 0.25,
    'First Cousins': 0.125,
    'First Cousins Once Removed': 0.0625,
    'Second Cousins': 0.03125,
    'Third Cousins': 0.0078125,
}

# Calculate expected segments and total length for each relationship
results = []
for rel_type, coef in relationships.items():
    # Get base calculations from our functions
    expected_segs = expected_number_segments(coef)
    expected_len = expected_total_length(coef)
    
    # Override for full siblings (since they have the same coefficient as parent-child)
    if rel_type == 'Full Siblings':
        expected_segs = 22  # Similar to parent-child due to recombination patterns
        expected_len = 2600  # ~2600 cM, less than parent-child due to regions of no sharing
    
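    # Degree is -log2(coefficient); it matches the meiosis count only for
    # single-ancestor relationships (e.g. first cousins: degree 3, 4 meioses)
    r = -np.log2(coef)
    results.append({
        'Relationship': rel_type,
        'Coefficient': coef,
        'Degree (-log2 coefficient)': r,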
        'Expected Segments (>7cM)': expected_segs,
        'Expected Total Length (cM)': expected_len
    })

# Create a DataFrame and display the results
results_df = pd.DataFrame(results)
results_df

In [None]:
# Visualize segment length distributions for different relationships
plt.figure(figsize=(12, 8))

# Define range of segment lengths
x = np.linspace(0, 100, 1000)

# Plot PDF for each relationship type
for rel_type, coef in list(relationships.items())[:5]:  # Plot just the first 5 for clarity
    r = -np.log(coef) / np.log(2)
    y = segment_length_pdf(x, r)
    plt.plot(x, y, label=f"{rel_type} (r={r:.1f})")

plt.title('IBD Segment Length Distributions by Relationship Type')
plt.xlabel('Segment Length (cM)')
plt.ylabel('Probability Density')
plt.axvline(x=7, color='black', linestyle='--', label='Common 7 cM threshold')
plt.legend()
plt.xlim(0, 100)
plt.grid(True, alpha=0.3)
plt.show()

### Expected vs. Observed IBD Sharing

Let's compare the theoretical expectations with our observed data. We'll need to first identify pairs of individuals with known relationships in our dataset.
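
The idea behind pair identification can be sketched without a graph library: every label follows from looking up parents and grandparents. The toy pedigree below is hypothetical (the IDs are made up), and the `classify` helper is an illustration, not the function used later in this notebook.

```python
# Hypothetical toy pedigree: child -> (father, mother). Founders have no entry.
parents = {
    "C1": ("GF", "GM"),
    "C2": ("GF", "GM"),
    "K1": ("S1", "C1"),
    "K2": ("C2", "S2"),
    "K3": ("C2", "S3"),
}

def parent_set(person):
    """Known parents of a person (empty for founders)."""
    return set(parents.get(person, ()))

def grandparent_set(person):
    """Known grandparents of a person."""
    return {gp for p in parent_set(person) for gp in parent_set(p)}

def classify(a, b):
    """Label a pair using only parent and grandparent lookups."""
    if a in parent_set(b) or b in parent_set(a):
        return "Parent-Child"
    shared_parents = parent_set(a) & parent_set(b)
    if len(shared_parents) == 2:
        return "Full Siblings"
    if len(shared_parents) == 1:
        return "Half Siblings"
    if a in grandparent_set(b) or b in grandparent_set(a):
        return "Grandparent-Grandchild"
    # Exactly two shared grandparents: ordinary first cousins in this toy setting
    if len(grandparent_set(a) & grandparent_set(b)) == 2:
        return "First Cousins"
    return None

for pair in [("GF", "C1"), ("C1", "C2"), ("K2", "K3"), ("GM", "K1"), ("K1", "K2")]:
    print(pair, "->", classify(*pair))
```

Requiring two shared grandparents separates full first cousins from half first cousins, who share only one grandparent and, on average, half as much DNA.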

In [None]:
# Load the fam file to get pedigree information
fam_file = os.path.join(data_directory, "class_data/ped_sim_run2-everyone.fam")
fam_df = pd.read_csv(fam_file, sep=r'\s+', header=None)  # raw string avoids an invalid-escape warning
fam_df.columns = ["family_id", "individual_id", "father_id", "mother_id", "sex", "phenotype"]
fam_df.head()

In [None]:
# Map individual IDs to Bonsai IDs
dict_file = os.path.join(data_directory, "class_data/ped_sim_run2.seg_dict.txt")
id_map_df = pd.read_csv(dict_file, sep='\t', header=None)
id_map_df.columns = ["individual_id", "bonsai_id"]
id_map_df.head()

In [None]:
# Create dictionaries for mapping between IDs
individual_to_bonsai = dict(zip(id_map_df['individual_id'], id_map_df['bonsai_id']))
bonsai_to_individual = dict(zip(id_map_df['bonsai_id'], id_map_df['individual_id']))

# Add Bonsai IDs to the fam DataFrame
fam_df['bonsai_id'] = fam_df['individual_id'].map(individual_to_bonsai)
fam_df['father_bonsai_id'] = fam_df['father_id'].map(individual_to_bonsai)
fam_df['mother_bonsai_id'] = fam_df['mother_id'].map(individual_to_bonsai)

# Replace missing values with NaN
fam_df['father_bonsai_id'] = fam_df['father_bonsai_id'].replace('0', np.nan)
fam_df['mother_bonsai_id'] = fam_df['mother_bonsai_id'].replace('0', np.nan)

fam_df.head()

In [None]:
def identify_relationships(fam_df):
    """Identify different types of relationships in the pedigree.
    
    Args:
        fam_df: DataFrame with pedigree information
        
    Returns:
        Dictionary with relationship pairs grouped by type
    """
    relationships = {
        'Parent-Child': [],
        'Full Siblings': [],
        'Grandparent-Grandchild': [],
        'Half Siblings': [],
        'First Cousins': [],
    }
    
    # Create a directed graph from the pedigree
    G = nx.DiGraph()
    
    # Add nodes and edges
    for _, row in fam_df.iterrows():
        indiv_id = row['bonsai_id']
        if pd.notna(indiv_id):
            G.add_node(indiv_id)
            
            # Add edges from parents to child
            if pd.notna(row['father_bonsai_id']):
                G.add_edge(row['father_bonsai_id'], indiv_id)
            if pd.notna(row['mother_bonsai_id']):
                G.add_edge(row['mother_bonsai_id'], indiv_id)
    
    # Find parent-child relationships
    for edge in G.edges():
        parent, child = edge
        relationships['Parent-Child'].append((parent, child))
    
    # Find siblings (full and half)
    for node in G.nodes():
        # Get parents of this node
        parents = list(G.predecessors(node))
        if len(parents) == 0:
            continue
            
        # Get other children of these parents
        for parent in parents:
            siblings = [child for child in G.successors(parent) if child != node]
            for sibling in siblings:
                # Check if they share both parents or just one
                sibling_parents = list(G.predecessors(sibling))
                common_parents = set(parents) & set(sibling_parents)
                
                if len(common_parents) == 2:  # Full siblings
                    # Only add once (avoid duplicates)
                    pair = tuple(sorted([node, sibling]))
                    if pair not in relationships['Full Siblings']:
                        relationships['Full Siblings'].append(pair)
                elif len(common_parents) == 1:  # Half siblings
                    pair = tuple(sorted([node, sibling]))
                    if pair not in relationships['Half Siblings']:
                        relationships['Half Siblings'].append(pair)
    
    # Find grandparent-grandchild relationships
    for grandparent in G.nodes():
        children = list(G.successors(grandparent))
        for child in children:
            grandchildren = list(G.successors(child))
            for grandchild in grandchildren:
                relationships['Grandparent-Grandchild'].append((grandparent, grandchild))
    
    # Find first cousins (share grandparents)
    for indiv1 in G.nodes():
        # Get parents of individual 1
        parents1 = list(G.predecessors(indiv1))
        
        # Get grandparents of individual 1
        grandparents1 = []
        for parent in parents1:
            grandparents1.extend(list(G.predecessors(parent)))
        
        if not grandparents1:
            continue
            
        # For each grandparent, find other grandchildren
        for grandparent in grandparents1:
            # Get children of grandparent (aunts/uncles)
            children = list(G.successors(grandparent))
            for child in children:
                # Make sure this is not a parent of indiv1
                if child in parents1:
                    continue
                    
                # Get children of the aunt/uncle (the cousins)
                cousins = list(G.successors(child))
                for cousin in cousins:
                    # Avoid duplicates
                    pair = tuple(sorted([indiv1, cousin]))
                    if pair not in relationships['First Cousins'] and indiv1 != cousin:
                        relationships['First Cousins'].append(pair)
    
    return relationships

# Identify relationship pairs
relationship_pairs = identify_relationships(fam_df)

# Print the first few pairs of each relationship type
for rel_type, pairs in relationship_pairs.items():
    print(f"{rel_type}: {len(pairs)} pairs found")
    if pairs:
        print(f"  Example pairs: {pairs[:3]}")
    print()

In [None]:
def calculate_ibd_sharing(pairs, seg_df, min_cm=0):
    """Calculate IBD sharing statistics for each pair.
    
    Args:
        pairs: List of pairs (tuples) to analyze
        seg_df: DataFrame with IBD segment information
        min_cm: Minimum segment size to consider
    
    Returns:
        DataFrame with IBD sharing statistics for each pair
    """
    pair_stats = []
    
    for pair in pairs:
        id1, id2 = pair
        
        # Find all segments between this pair
        pair_segments = seg_df[
            ((seg_df['sample1'] == id1) & (seg_df['sample2'] == id2)) |
            ((seg_df['sample1'] == id2) & (seg_df['sample2'] == id1))
        ]
        
        # Filter by minimum size if needed
        if min_cm > 0:
            pair_segments = pair_segments[pair_segments['gen_seg_len'] >= min_cm]
        
        if len(pair_segments) == 0:
            continue
        
        # Calculate statistics
        total_segments = len(pair_segments)
        total_length = pair_segments['gen_seg_len'].sum()
        avg_length = pair_segments['gen_seg_len'].mean() if total_segments > 0 else 0
        
        # Count IBD1 and IBD2 segments
        # 'ibd_type' holds strings such as 'IBD1', so compare the numeric column
        ibd1_segments = len(pair_segments[pair_segments['ibd_type_numeric'] == 1])
        ibd2_segments = len(pair_segments[pair_segments['ibd_type_numeric'] == 2])
        
        pair_stats.append({
            'ID1': id1,
            'ID2': id2,
            'Total Segments': total_segments,
            'Total Length (cM)': total_length,
            'Average Length (cM)': avg_length,
            'IBD1 Segments': ibd1_segments,
            'IBD2 Segments': ibd2_segments
        })
    
    return pd.DataFrame(pair_stats)

# Calculate statistics for each relationship type
relationship_stats = {}
for rel_type, pairs in relationship_pairs.items():
    if pairs:
        stats = calculate_ibd_sharing(pairs, seg_df, min_cm=7)
        stats['Relationship'] = rel_type
        relationship_stats[rel_type] = stats

# Combine all statistics
all_stats = pd.concat(relationship_stats.values())
all_stats.head()

In [None]:
# Visualize the IBD sharing by relationship type
plt.figure(figsize=(12, 8))

sns.boxplot(x='Relationship', y='Total Length (cM)', data=all_stats)
plt.title('Total IBD Sharing by Relationship Type')
plt.ylabel('Total IBD Length (cM)')
plt.xlabel('Relationship Type')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Segment count by relationship type
plt.figure(figsize=(12, 8))

sns.boxplot(x='Relationship', y='Total Segments', data=all_stats)
plt.title('Number of IBD Segments by Relationship Type')
plt.ylabel('Number of Segments (>7cM)')
plt.xlabel('Relationship Type')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Compare Observed vs. Expected IBD Sharing

Let's compare our observed IBD sharing with the theoretical expectations for each relationship type.
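
For relationships beyond the tabulated close ones, the expected values come from the closed-form expression in `expected_number_segments`. As a sanity check, the arithmetic for third cousins can be reproduced by hand. This sketch mirrors the notebook's formula and its constants R = 35.5 and C = 22, and it uses the notebook's convention r = -log2(1/128) = 7; the helper name is illustrative.

```python
import math

def expected_segments_distant(r, a=1, min_cm=7.0, R=35.5, C=22):
    """Expected count of segments longer than min_cm, mirroring the
    notebook's formula for distant relationships."""
    return 2 ** (1 - r) * a * (R * r + C) * math.exp(-r * min_cm / 100.0)

r = 7  # third cousins under the notebook's convention
n_segments = expected_segments_distant(r)
total_cm = n_segments * (100.0 / r)  # the notebook's approximation of the mean length
print(f"Expected segments: {n_segments:.2f}")
print(f"Expected total:    {total_cm:.1f} cM")
```

This works out to roughly 2.6 segments and about 37 cM in total. Note that 100/r is the mean of all segments under the exponential model, so multiplying it by a count of segments above 7 cM is an approximation.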

In [None]:
# Aggregate statistics by relationship type
agg_stats = all_stats.groupby('Relationship').agg({
    'Total Segments': ['mean', 'median', 'std'],
    'Total Length (cM)': ['mean', 'median', 'std'],
    'Average Length (cM)': ['mean', 'median']
}).reset_index()

# Flatten the column names
agg_stats.columns = ['_'.join(col).strip('_') for col in agg_stats.columns.values]
agg_stats

In [None]:
# Merge with expected values
expected_stats = results_df[['Relationship', 'Expected Segments (>7cM)', 'Expected Total Length (cM)']]
comparison = pd.merge(agg_stats, expected_stats, on='Relationship', how='left')

# Calculate ratio of observed to expected
comparison['Segments_Ratio'] = comparison['Total Segments_mean'] / comparison['Expected Segments (>7cM)']
comparison['Length_Ratio'] = comparison['Total Length (cM)_mean'] / comparison['Expected Total Length (cM)']

comparison[['Relationship', 'Total Segments_mean', 'Expected Segments (>7cM)', 'Segments_Ratio',
           'Total Length (cM)_mean', 'Expected Total Length (cM)', 'Length_Ratio']]

In [None]:
# Visualize comparison of observed vs expected
plt.figure(figsize=(12, 6))

# Plot observed vs expected segment counts
plt.subplot(1, 2, 1)
plt.bar(comparison['Relationship'], comparison['Total Segments_mean'], alpha=0.6, label='Observed')
plt.bar(comparison['Relationship'], comparison['Expected Segments (>7cM)'], alpha=0.6, label='Expected')
plt.title('IBD Segment Count: Observed vs Expected')
plt.ylabel('Number of Segments (>7cM)')
plt.xlabel('Relationship Type')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True, alpha=0.3)

# Plot observed vs expected total length
plt.subplot(1, 2, 2)
plt.bar(comparison['Relationship'], comparison['Total Length (cM)_mean'], alpha=0.6, label='Observed')
plt.bar(comparison['Relationship'], comparison['Expected Total Length (cM)'], alpha=0.6, label='Expected')
plt.title('Total IBD Length: Observed vs Expected')
plt.ylabel('Total Length (cM)')
plt.xlabel('Relationship Type')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Predicting Relationships from IBD Patterns

Let's create a simple function that predicts relationship types based on IBD sharing patterns.
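
Before looking at the range-based rules in the next cell, it can help to see the simplest possible baseline. Fixed ranges leave gaps (a pair sharing 400 cM falls between the first-cousin and half-sibling bands), so one hypothetical alternative is to pick the relationship whose expected total is nearest on a log scale. The expected totals below are copied from the table built earlier in this notebook; the function itself is an illustration, not part of Bonsai.

```python
import math

# Expected totals (cM) copied from the table built earlier in this notebook.
EXPECTED_TOTAL_CM = {
    "Parent-Child": 3545,
    "Full Siblings": 2600,
    "Grandparent-Grandchild/Half Siblings": 646,
    "First Cousins": 280,
    "First Cousins Once Removed": 110,
    "Second Cousins": 54,
}

def nearest_relationship(total_cm):
    """Return the label whose expected total is closest in log space."""
    if total_cm <= 0:
        raise ValueError("total_cm must be positive")
    return min(
        EXPECTED_TOTAL_CM,
        key=lambda rel: abs(math.log(total_cm) - math.log(EXPECTED_TOTAL_CM[rel])),
    )

for total in (3500, 400, 60):
    print(f"{total:5d} cM -> {nearest_relationship(total)}")
```

The log scale is a design choice: IBD sharing roughly halves with each degree, so ratios are more meaningful than absolute differences. Total length alone cannot separate parent-child from full siblings reliably, which is where IBD2 counts come in.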

In [None]:
def predict_relationship(total_segments, total_length, include_ibd2=False, ibd2_segments=0):
    """Predict relationship type based on IBD sharing statistics.
    
    Args:
        total_segments: Number of IBD segments (>7cM)
        total_length: Total IBD sharing in cM
        include_ibd2: Whether to include IBD2 in the prediction
        ibd2_segments: Number of IBD2 segments
    
    Returns:
        Predicted relationship type and confidence score
    """
    # First, check for parent-child relationship (characterized by IBD1 across entire genome)
    if 15 <= total_segments <= 30 and 3300 <= total_length <= 3700 and ibd2_segments < 5:
        return "Parent-Child", 0.95
    
    # Check for full siblings (mix of IBD0, IBD1, and IBD2)
    if include_ibd2 and ibd2_segments > 10 and 15 <= total_segments <= 30 and 2200 <= total_length <= 3000:
        return "Full Siblings", 0.9
    
    # Other relationships based on Bonsai v3 expected values
    if 2200 <= total_length <= 2800:
        return "Full Siblings", 0.8
    elif 550 <= total_length <= 750:
        return "Grandparent-Grandchild/Half Siblings", 0.75
    elif 250 <= total_length <= 350:
        return "First Cousins", 0.7
    elif 100 <= total_length <= 150:
        return "First Cousins Once Removed", 0.65
    elif 50 <= total_length <= 70:
        return "Second Cousins", 0.6
    elif 30 <= total_length <= 45:
        return "Third Cousins", 0.5
    else:
        return "Unknown or More Distant", 0.3

# Test the prediction function on some known relationships
test_pairs = []
for rel_type, pairs in relationship_pairs.items():
    if pairs:
        # Sample up to 5 pairs from each relationship type
        for pair in pairs[:5]:
            test_pairs.append((pair, rel_type))

# Predict relationships for test pairs
prediction_results = []
for (id1, id2), true_rel in test_pairs:
    # Get IBD statistics for this pair
    pair_segments = seg_df[
        ((seg_df['sample1'] == id1) & (seg_df['sample2'] == id2)) |
        ((seg_df['sample1'] == id2) & (seg_df['sample2'] == id1))
    ]
    
    # Filter by minimum size
    pair_segments = pair_segments[pair_segments['gen_seg_len'] >= 7]
    
    if len(pair_segments) == 0:
        continue
    
    # Calculate statistics
    total_segments = len(pair_segments)
    total_length = pair_segments['gen_seg_len'].sum()
    ibd2_segments = len(pair_segments[pair_segments['ibd_type_numeric'] == 2])
    
    # Predict relationship
    predicted_rel, confidence = predict_relationship(
        total_segments, total_length, include_ibd2=True, ibd2_segments=ibd2_segments
    )
    
    prediction_results.append({
        'ID1': id1,
        'ID2': id2,
        'True Relationship': true_rel,
        'Predicted Relationship': predicted_rel,
        'Confidence': confidence,
        'Total Segments': total_segments,
        'Total Length (cM)': total_length,
        'IBD2 Segments': ibd2_segments
    })

# Create a DataFrame and display the results
pred_df = pd.DataFrame(prediction_results)
pred_df

In [None]:
# Calculate prediction accuracy
def evaluate_predictions(pred_df):
    """Evaluate relationship prediction accuracy."""
    # For simplicity, exact match or match within the predicted category
    correct = 0
    partial = 0
    incorrect = 0
    
    for _, row in pred_df.iterrows():
        true_rel = row['True Relationship']
        pred_rel = row['Predicted Relationship']
        
        if true_rel == pred_rel:
            correct += 1
        # Combined labels such as "A/B" count as a partial match for either A or B
        elif true_rel in pred_rel.split('/'):
            partial += 1
        else:
            incorrect += 1
    
    total = len(pred_df)
    if total == 0:
        print("No predictions to evaluate.")
        return
    print(f"Exact matches: {correct} ({correct/total:.1%})")
    print(f"Partial matches: {partial} ({partial/total:.1%})")
    print(f"Incorrect: {incorrect} ({incorrect/total:.1%})")
    print(f"Overall accuracy: {(correct + partial)/total:.1%}")

evaluate_predictions(pred_df)

## 4. Connecting to Bonsai: The IBD Segment Foundation

Now let's see how these IBD concepts relate to the Bonsai algorithm for pedigree reconstruction.

In [None]:
# Import Bonsai from utils
import sys
sys.path.append(utils_directory)
from bonsaitree.bonsaitree.v3 import bonsai

# View the docstring for the build_pedigree function
help(bonsai.build_pedigree)

### IBD Segments as Input to Bonsai

The Bonsai algorithm requires IBD segments as input for pedigree reconstruction. Let's prepare a subset of our data for Bonsai processing.
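
The conversion later in this section builds one list per segment with the layout `[id1, id2, chrom, start_bp, end_bp, is_full, len_cm]`. The sketch below shows that layout on a single made-up record. The dictionary keys follow this notebook's column names, the values are hypothetical, and `to_segment_row` is an illustrative helper rather than a Bonsai function.

```python
def to_segment_row(record):
    """Convert one segment record (a dict) to the list layout
    [id1, id2, chrom, start_bp, end_bp, is_full, len_cm]."""
    if record["phys_end"] < record["phys_start"]:
        raise ValueError("segment end precedes start")
    return [
        int(record["sample1"]),
        int(record["sample2"]),
        str(record["chrom"]),
        float(record["phys_start"]),
        float(record["phys_end"]),
        record["ibd_type"] == "IBD2",  # True only when both haplotypes are shared
        float(record["gen_seg_len"]),
    ]

# Hypothetical example record
example = {
    "sample1": "1001",
    "sample2": "1002",
    "chrom": 1,
    "phys_start": 1_500_000,
    "phys_end": 9_800_000,
    "ibd_type": "IBD1",
    "gen_seg_len": 12.4,
}
print(to_segment_row(example))
```

Note that the `ibd_type` column in this dataset holds strings such as `IBD1`, so the full-IBD flag has to be derived from the string or from the numeric column, not by comparing the raw value to the integer 2.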

In [None]:
# Get a subset of individuals for analysis (e.g., first 20 individuals)
unique_individuals = set(seg_df["sample1"]).union(set(seg_df["sample2"]))
subset_individuals = sorted(list(unique_individuals))[:20]
print(f"Analyzing {len(subset_individuals)} individuals: {subset_individuals}")

In [None]:
# Filter IBD segments to include only the subset individuals
subset_segments = seg_df[
    (seg_df['sample1'].isin(subset_individuals)) & 
    (seg_df['sample2'].isin(subset_individuals))
]
print(f"Using {len(subset_segments)} IBD segments between these individuals")

In [None]:
# Create bioinfo for Bonsai
# This would typically include age and sex information
import random

bioinfo = []
for indiv_id in subset_individuals:
    # Assign random age and sex for demonstration
    age = random.randint(20, 80)
    sex = random.choice(['M', 'F'])
    bioinfo.append({'genotype_id': int(indiv_id), 'age': age, 'sex': sex})

# Convert to unphased IBD segment list format
def create_unphased_ibd_seg_list(segments):
    """Creates an unphased IBD segment list from the given DataFrame."""
    unphased_ibd_seg_list = []

    for _, row in segments.iterrows():
        try:
            id1 = int(row['sample1'])
            id2 = int(row['sample2'])
            chrom = str(row['chrom'])  # Convert chromosome to string if necessary
            start_bp = float(row['phys_start'])
            end_bp = float(row['phys_end'])
            # 'ibd_type' is a string like 'IBD2'; use the numeric column for the comparison
            is_full = int(row['ibd_type_numeric']) == 2  # IBD2 indicates "full"
            len_cm = float(row['gen_seg_len'])

            unphased_ibd_seg_list.append([id1, id2, chrom, start_bp, end_bp, is_full, len_cm])
        except Exception as e:
            print(f"Error processing row: {e}")

    return unphased_ibd_seg_list

unphased_ibd_seg_list = create_unphased_ibd_seg_list(subset_segments)

print("First 5 segments in Bonsai format:")
for i in range(min(5, len(unphased_ibd_seg_list))):
    print(unphased_ibd_seg_list[i])

In [None]:
# Run Bonsai on this subset (this may take a few minutes)
min_segment_length = 7  # Use 7cM as minimum segment length

# Note: We're only running this as a demonstration
# In a real analysis, you would use more individuals and tune the parameters
# `bonsai` was already imported in the first cell of this section

try:
    # Run with a timeout to avoid running too long (signal.SIGALRM is Unix-only)
    import signal
    class TimeoutException(Exception): pass
    
    def timeout_handler(signum, frame):
        raise TimeoutException("Timed out!")
    
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(300)  # 5 minute timeout
    
    try:
        up_dict_log_like_list = bonsai.build_pedigree(
            bio_info=bioinfo,
            unphased_ibd_seg_list=unphased_ibd_seg_list,
            min_seg_len=min_segment_length
        )
    finally:
        signal.alarm(0)  # Always cancel the alarm, even if Bonsai raises
    
    # Display the results
    if up_dict_log_like_list:
        for i, (pedigree, log_like) in enumerate(up_dict_log_like_list):
            print(f"Pedigree {i+1} log likelihood: {log_like}")
            
            # Count types of nodes
            real_individuals = [node for node in pedigree.keys() if isinstance(node, int) and node > 0]
            inferred_ancestors = [node for node in pedigree.keys() if isinstance(node, int) and node < 0]
            
            print(f"  Real individuals: {len(real_individuals)}")
            print(f"  Inferred ancestors: {len(inferred_ancestors)}")
except TimeoutException:
    print("Bonsai execution timed out. This is expected in the notebook demonstration.")
    print("For a full analysis, consider running Bonsai with more carefully selected parameters.")
except Exception as e:
    print(f"Error running Bonsai: {e}")

## 5. Exercises

Complete the following exercises to deepen your understanding of IBD segments and their role in pedigree reconstruction.

### Exercise 1: Segment Length Distributions by Relationship

Create histograms of segment length distributions for each relationship type (Parent-Child, Full Siblings, etc.) and discuss the differences you observe.

In [None]:
# Your code for Exercise 1
# Hint: Group segments by relationship type and plot histograms

### Exercise 2: IBD Segment Count vs Length Correlation

Create a scatter plot showing the relationship between the number of segments and total IBD length for different relationship types. Discuss what this reveals about how IBD is distributed in different relationships.

In [None]:
# Your code for Exercise 2
# Hint: Use the all_stats DataFrame and create a scatter plot with one point per pair, colored by relationship

### Exercise 3: Improve the Relationship Prediction Function

Modify the `predict_relationship` function to improve its accuracy. Consider using more sophisticated methods such as decision trees or logistic regression.

In [None]:
# Your code for Exercise 3
# Hint: You could use scikit-learn to build a classifier

### Exercise 4: IBD2 Analysis

The presence of IBD2 segments (where both chromosomes are shared IBD) provides strong evidence for certain relationships. Analyze the distribution of IBD2 segments in your dataset and discuss how they can help in relationship inference.
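
As a theoretical baseline for this exercise: at any autosomal locus, full siblings receive the same paternal haplotype with probability 1/2 and, independently, the same maternal haplotype with probability 1/2. Assuming the 3545 cM autosomal length used earlier in this notebook and no inbreeding, the expected IBD state fractions and the expected length covered by any IBD are:

$P(\mathrm{IBD2}) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}, \quad P(\mathrm{IBD1}) = 2 \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{2}, \quad P(\mathrm{IBD0}) = \tfrac{1}{4}$

$E[\text{length in IBD1 or IBD2}] = \left(\tfrac{1}{2} + \tfrac{1}{4}\right) \times 3545 \approx 2659 \text{ cM}$

This is in line with the roughly 2600 cM used for full siblings earlier in the lab. Parent-child pairs, by contrast, are IBD1 across the whole genome and are not expected to show IBD2, so a substantial IBD2 fraction is what separates full siblings from parent-child pairs.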

In [None]:
# Your code for Exercise 4
# Hint: Filter for IBD2 segments and analyze their distribution

### Exercise 5: Simulate IBD Patterns

Use the mathematical models we discussed to simulate IBD patterns for different relationship types. Compare these simulations with your observed data.

In [None]:
# Your code for Exercise 5
# Hint: Use the exponential distribution to simulate segment lengths

## Conclusion

In this lab, we explored the fundamental concepts of IBD segments and their role in genetic genealogy and pedigree reconstruction. We analyzed the mathematical models that describe IBD segment inheritance, visualized IBD sharing patterns, and connected these theoretical concepts to practical pedigree reconstruction using the Bonsai algorithm.

Key takeaways:
- IBD segments follow predictable statistical distributions based on relationship type
- The length and number of IBD segments provide complementary information for relationship inference
- Understanding IBD segment patterns is essential for effective pedigree reconstruction
- The Bonsai algorithm leverages these patterns to infer pedigree structures from genetic data

In the next lab, we will delve deeper into the mathematical foundations of Bonsai, exploring how it calculates likelihoods and optimizes pedigree structures.