# Lab 1: Introduction to Genetic Genealogy and IBD Concepts

## Overview

This lab introduces the foundational concepts of genetic genealogy and Identity-by-Descent (IBD) that underpin the Bonsai v3 system. Through practical exercises, we'll explore:

1. The biological foundations of genetic inheritance
2. Identity-by-Descent (IBD) concepts and detection
3. How IBD patterns reveal genealogical relationships
4. The fundamental principles of pedigree reconstruction

In [ ]:
# Standard imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown
import inspect
import importlib

sys.path.append(os.path.dirname(os.getcwd()))

# Check if running in Google Colab
def is_colab():
    try:
        import google.colab
        return True
    except ImportError:
        return False

# Cross-compatibility setup
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite, save_results, save_plot

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [ ]:
# Environment-specific setup
if is_colab():
    print("Running in Google Colab environment")
    
    # Mount Google Drive (optional)
    from google.colab import drive
    
    # Uncomment to mount Google Drive
    # drive.mount('/content/drive')
    
    # Create local directories
    !mkdir -p class_data
    !mkdir -p results
    
    # Download sample data from S3
    S3_BASE = "https://computational-genetic-genealogy.s3.us-east-2.amazonaws.com/class_data/"
    !wget -q -O class_data/pedigree.fam {S3_BASE}pedigree.fam
    print("Downloaded pedigree.fam from S3")
    
elif is_jupyterlite():
    print("Running in JupyterLite environment - using pre-bundled data")
else:
    print("Running in local environment - using persistent storage in", RESULTS_DIR)

# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

## Introduction to Genetic Genealogy

Genetic genealogy is the application of genetic analysis to traditional genealogical methods. By analyzing patterns of DNA sharing between individuals, we can infer their biological relationships and reconstruct family trees.

The key principle underlying genetic genealogy is that DNA is inherited from our biological parents, with specific patterns of transmission that allow us to identify and quantify relationships.

### Biological Foundations: DNA Inheritance

Let's visualize the basic pattern of DNA inheritance through generations:

In [None]:
# Create a simple visualization of DNA inheritance through generations
plt.figure(figsize=(12, 8))

# Colors for different ancestral contributions
colors = ['#E41A1C', '#377EB8', '#4DAF4A', '#984EA3', '#FF7F00', '#FFFF33', '#A65628', '#F781BF']

# Set up the generations
generations = 4
individuals_per_gen = [2**(generations-i) for i in range(1, generations+1)]
total_individuals = sum(individuals_per_gen)

# Create a DataFrame to store the ancestral DNA contribution
contribution_data = []

# Generate data
for gen in range(generations):
    for i in range(individuals_per_gen[gen]):
        person_id = f"G{gen+1}_{i+1}"
        
        if gen == 0:  # First generation (great-grandparents)
            # Each ancestor contributes their unique DNA
            contributions = {f"Ancestor_{i+1}": 1.0}
        else:  # Later generations
            # Each individual descends from two distinct parents in the previous
            # generation: parents 2i and 2i+1 (0-based) for individual i
            parent1_idx = 2 * i
            parent2_idx = 2 * i + 1
            
            # Get parents' contribution data (IDs use a single underscore, e.g. "G1_3")
            parent1_data = next(d for d in contribution_data if d["person_id"] == f"G{gen}_{parent1_idx+1}")
            parent2_data = next(d for d in contribution_data if d["person_id"] == f"G{gen}_{parent2_idx+1}")
            
            # Combine parents' contributions (half from each parent)
            contributions = {}
            for ancestor, contrib in parent1_data["contributions"].items():
                contributions[ancestor] = contrib / 2
            for ancestor, contrib in parent2_data["contributions"].items():
                contributions[ancestor] = contributions.get(ancestor, 0) + contrib / 2
        
        contribution_data.append({
            "person_id": person_id,
            "generation": gen+1,
            "individual": i+1,
            "contributions": contributions
        })

# Plot the data
for gen in range(generations):
    y_pos = generations - gen - 1
    
    # Get data for this generation
    gen_data = [d for d in contribution_data if d["generation"] == gen+1]
    
    # Plot each individual
    for i, person in enumerate(gen_data):
        x_start = i / len(gen_data)
        x_width = 1 / len(gen_data)
        
        # Sort contributions for consistent coloring
        sorted_contribs = sorted(person["contributions"].items())
        
        # Plot the DNA segments side by side, each sized by its contribution
        x_offset = 0
        for ancestor, contrib in sorted_contribs:
            color_idx = int(ancestor.split('_')[1]) - 1
            plt.bar(x_start + x_offset, 0.8, width=x_width * contrib, bottom=y_pos - 0.4,
                   color=colors[color_idx % len(colors)], align='edge')
            x_offset += x_width * contrib
        
        # Add labels
        if gen == generations - 1:  # Only label the last generation
            plt.text(x_start + x_width/2, y_pos - 0.7, f"Person {i+1}", ha='center')

# Add generation labels
for gen in range(generations):
    y_pos = generations - gen - 1
    gen_name = ["Great-grandparents", "Grandparents", "Parents", "Children"][gen]
    plt.text(-0.05, y_pos, gen_name, ha='right', va='center', fontweight='bold')

# Create a legend for ancestors
ancestor_patches = [plt.Rectangle((0, 0), 1, 1, color=colors[i % len(colors)]) for i in range(individuals_per_gen[0])]
plt.legend(ancestor_patches, [f"Ancestor {i+1}" for i in range(individuals_per_gen[0])], 
          loc='center left', bbox_to_anchor=(1, 0.5))

# Set up the plot
plt.xlim(-0.1, 1.1)
plt.ylim(-1, generations)
plt.axis('off')
plt.title('DNA Inheritance Through Generations', fontsize=16)

plt.tight_layout()
plt.show()

### Key Principles of Genetic Inheritance

1. **Autosomal DNA**: Each person inherits approximately 50% of their autosomal DNA from each parent
2. **Recombination**: During meiosis, chromosomes exchange segments, creating unique combinations of ancestral DNA
3. **Random Inheritance**: Which 50% you inherit from each parent is largely random
4. **Dilution Over Generations**: The amount of DNA you share with an ancestor is approximately halved each generation back

These principles create predictable patterns of DNA sharing between relatives that form the basis for genetic genealogy.
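
The dilution principle can be written as a one-line formula: the expected autosomal fraction inherited from a single ancestor `g` generations back is `0.5 ** g`. The sketch below is a minimal illustration; the function name `expected_ancestor_fraction` is our own and is not part of Bonsai. Keep in mind that this is an expectation: for grandparents and more distant ancestors, the realized fraction varies around it.

```python
def expected_ancestor_fraction(generations_back: int) -> float:
    """Expected fraction of autosomal DNA inherited from one ancestor.

    generations_back: 1 = parent, 2 = grandparent, 3 = great-grandparent, ...
    """
    if generations_back < 0:
        raise ValueError("generations_back must be non-negative")
    return 0.5 ** generations_back


labels = ["self", "parent", "grandparent", "great-grandparent"]
for g, label in enumerate(labels):
    print(f"{label:>17}: {expected_ancestor_fraction(g):.4f}")
```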

## Identity by Descent (IBD)

Identity by Descent (IBD) refers to segments of DNA that two individuals have both inherited, unbroken by recombination, from a common ancestor. These segments are identical because they are copies of the exact same ancestral DNA segment.

IBD is the fundamental unit of genetic relatedness detection and is central to Bonsai's approach to pedigree reconstruction.

### IBD Types: IBD1 vs IBD2

There are two important types of IBD sharing:

1. **IBD1 (Half-Identical Regions)**: DNA segments where two individuals share exactly one of their two chromosome copies IBD (one allele at each position)
2. **IBD2 (Fully Identical Regions)**: DNA segments where two individuals share both chromosome copies IBD (both alleles at each position)
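
One way to make the distinction concrete is to count, at a single position, how many chromosome copies two people have in common. The sketch below assumes each copy carries a label naming the ancestral haplotype it descends from; the labels (`"h1"`, `"h2"`, ...) and the function `ibd_state` are hypothetical and exist only for illustration.

```python
from collections import Counter


def ibd_state(haps_a, haps_b):
    """Return 0, 1, or 2: the number of chromosome copies shared IBD at one locus.

    Each argument is a pair of labels identifying the ancestral haplotype
    that each of the individual's two chromosome copies descends from.
    """
    shared = Counter(haps_a) & Counter(haps_b)  # multiset intersection
    return sum(shared.values())


print(ibd_state(("h1", "h2"), ("h3", "h4")))  # 0: no IBD sharing
print(ibd_state(("h1", "h2"), ("h2", "h3")))  # 1: IBD1
print(ibd_state(("h1", "h2"), ("h2", "h1")))  # 2: IBD2
```

Real detectors do not observe such labels directly; they infer the IBD state from genotype or haplotype data.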

Let's visualize these two types of IBD:

In [None]:
# Create a visualization of IBD1 vs IBD2
plt.figure(figsize=(12, 6))

# Define the chromosome length
chrom_length = 100

# Set up the plots
plt.subplot(2, 1, 1)

# Create person A's chromosomes
plt.barh(2, chrom_length, height=0.4, color='#E41A1C', alpha=0.8)
plt.barh(1, chrom_length, height=0.4, color='#377EB8', alpha=0.8)
plt.text(-5, 1.5, "Person A", ha='right', va='center', fontweight='bold')

# Create person B's chromosomes with IBD1 region
plt.barh(0, chrom_length, height=0.4, color='#4DAF4A', alpha=0.8)
plt.barh(-1, chrom_length, height=0.4, color='#984EA3', alpha=0.8)  # Unrelated outside the IBD1 segment highlighted below
plt.text(-5, -0.5, "Person B", ha='right', va='center', fontweight='bold')

# Highlight the IBD1 region
ibd1_start = 25
ibd1_length = 30
plt.barh(-1, ibd1_length, left=ibd1_start, height=0.4, color='#377EB8', alpha=1.0, edgecolor='black', linewidth=2)
plt.barh(1, ibd1_length, left=ibd1_start, height=0.4, color='#377EB8', alpha=1.0, edgecolor='black', linewidth=2)

# Add text labels
plt.text(ibd1_start + ibd1_length/2, 0, "IBD1 (Half-Identical Region)", ha='center', va='center', bbox=dict(facecolor='white', alpha=0.7))

plt.xlim(-10, chrom_length + 10)
plt.ylim(-1.5, 2.5)
plt.xticks([])
plt.yticks([])
plt.title('IBD1: Half-Identical Region')

# Second plot for IBD2
plt.subplot(2, 1, 2)

# Create person C's chromosomes
plt.barh(2, chrom_length, height=0.4, color='#E41A1C', alpha=0.8)
plt.barh(1, chrom_length, height=0.4, color='#377EB8', alpha=0.8)
plt.text(-5, 1.5, "Person C", ha='right', va='center', fontweight='bold')

# Create person D's chromosomes with IBD2 region
plt.barh(0, chrom_length, height=0.4, color='#4DAF4A', alpha=0.8)  # Unrelated outside the IBD2 segment highlighted below
plt.barh(-1, chrom_length, height=0.4, color='#984EA3', alpha=0.8)  # Unrelated outside the IBD2 segment highlighted below
plt.text(-5, -0.5, "Person D", ha='right', va='center', fontweight='bold')

# Highlight the IBD2 region
ibd2_start = 35
ibd2_length = 25
plt.barh(0, ibd2_length, left=ibd2_start, height=0.4, color='#E41A1C', alpha=1.0, edgecolor='black', linewidth=2)
plt.barh(2, ibd2_length, left=ibd2_start, height=0.4, color='#E41A1C', alpha=1.0, edgecolor='black', linewidth=2)
plt.barh(-1, ibd2_length, left=ibd2_start, height=0.4, color='#377EB8', alpha=1.0, edgecolor='black', linewidth=2)
plt.barh(1, ibd2_length, left=ibd2_start, height=0.4, color='#377EB8', alpha=1.0, edgecolor='black', linewidth=2)

# Add text labels
plt.text(ibd2_start + ibd2_length/2, 0, "IBD2 (Fully Identical Region)", ha='center', va='center', bbox=dict(facecolor='white', alpha=0.7))

plt.xlim(-10, chrom_length + 10)
plt.ylim(-1.5, 2.5)
plt.xticks([])
plt.yticks([])
plt.title('IBD2: Fully Identical Region')

plt.tight_layout()
plt.show()

### Measuring IBD: centiMorgans (cM)

IBD segments are typically measured in **centiMorgans (cM)**, a unit of genetic distance that accounts for recombination rates. This measurement is more relevant for inheritance analysis than physical length measures like base pairs.

Key characteristics of centiMorgans:
- 1 cM corresponds to roughly a 1% chance that a crossover occurs between two positions in a single generation (one meiosis)
- The human genetic map is about 3400 cM long
- The number of cM per megabase varies along the genome, because recombination is concentrated in hotspots
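
In practice, a segment's length in cM is obtained by looking up its endpoints on a genetic map. The sketch below assumes a tiny, made-up map (`map_mb`, `map_cm`) and uses linear interpolation via `np.interp`; real maps have many more points, and the helper name `segment_length_cm` is our own.

```python
import numpy as np

# Hypothetical genetic map: physical position (Mb) -> cumulative genetic position (cM).
# The steep 10-20 Mb interval stands in for a hotspot-rich region.
map_mb = np.array([0.0, 10.0, 20.0, 30.0])
map_cm = np.array([0.0, 5.0, 25.0, 30.0])


def segment_length_cm(start_mb, end_mb):
    """Genetic length of a segment, by linear interpolation on the map."""
    start_cm, end_cm = np.interp([start_mb, end_mb], map_mb, map_cm)
    return float(end_cm - start_cm)


print(segment_length_cm(0.0, 10.0))   # 5.0 cM over 10 Mb
print(segment_length_cm(10.0, 20.0))  # 20.0 cM over the same physical length
```

Two segments of equal physical length can therefore differ several-fold in genetic length, which is why cM is the preferred unit for relationship inference.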

In [None]:
# Visualize the relationship between physical distance (Mb) and genetic distance (cM)
# Using simplified synthetic data based on real chromosome patterns

# Create synthetic data for chromosome 1
physical_pos = np.linspace(0, 250, 1000)  # Approximate length of chromosome 1 in Mb
np.random.seed(42)  # For reproducibility

# Create genetic positions with variable recombination rates
base_rate = 1  # Base recombination rate (cM/Mb)
hotspots = np.random.choice(len(physical_pos) - 50, 10, replace=False)  # Positions of recombination hotspots

# Generate genetic positions using variable recombination rates
genetic_pos = np.zeros_like(physical_pos)
for i in range(1, len(physical_pos)):
    # Check if we're in a hotspot region
    in_hotspot = any(abs(i - h) < 20 for h in hotspots)
    
    # Apply appropriate recombination rate
    if in_hotspot:
        rate = base_rate * 5  # Hotspot has 5x the recombination rate
    else:
        rate = base_rate
    
    genetic_pos[i] = genetic_pos[i-1] + rate * (physical_pos[i] - physical_pos[i-1])

# Plot the data
plt.figure(figsize=(12, 8))

plt.subplot(2, 1, 1)
plt.plot(physical_pos, genetic_pos, 'b-')
plt.xlabel('Physical Position (Mb)')
plt.ylabel('Genetic Position (cM)')
plt.title('Relationship Between Physical and Genetic Distance')
plt.grid(alpha=0.3)

# Plot recombination rate
plt.subplot(2, 1, 2)
# Local rate = change in genetic position / change in physical position
recomb_rate = np.diff(genetic_pos) / np.diff(physical_pos)

plt.plot(physical_pos[:-1], recomb_rate, 'r-')
plt.xlabel('Physical Position (Mb)')
plt.ylabel('Recombination Rate (cM/Mb)')
plt.title('Recombination Rate Across the Chromosome')
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

### IBD Detection in Practice

IBD segments are detected using specialized algorithms that analyze DNA data from genotyping arrays or sequencing. Common IBD detection tools include:

1. **IBIS**: Fast IBD segment detection that works directly on unphased genotype data
2. **Refined IBD**: Haplotype-based detection that requires phased data
3. **hap-IBD**: A fast detector for phased data, designed for large cohorts

Bonsai v3 is designed to work with the output from these IBD detectors, using the detected segments to infer relationships.

## IBD Patterns in Different Relationships

Different types of relationships show characteristic patterns of IBD sharing. Let's explore the expected IBD sharing for various relationship types:

In [None]:
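# Create a table of expected IBD sharing for different relationships
# ibd1_pct / ibd2_pct: percent of the genome expected to be in IBD1 / IBD2
# expected_sharing: percent of total DNA shared = ibd1_pct / 2 + ibd2_pct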
relationships = [
    {"relationship": "Identical twins", "meiotic_distance": 0, "expected_sharing": 100.0, "ibd1_pct": 0, "ibd2_pct": 100},
    {"relationship": "Parent-Child", "meiotic_distance": 1, "expected_sharing": 50.0, "ibd1_pct": 100, "ibd2_pct": 0},
    {"relationship": "Full siblings", "meiotic_distance": 2, "expected_sharing": 50.0, "ibd1_pct": 50, "ibd2_pct": 25},
    {"relationship": "Half-siblings", "meiotic_distance": 2, "expected_sharing": 25.0, "ibd1_pct": 50, "ibd2_pct": 0},
    {"relationship": "Grandparent-Grandchild", "meiotic_distance": 2, "expected_sharing": 25.0, "ibd1_pct": 50, "ibd2_pct": 0},
    {"relationship": "Uncle/Aunt-Nephew/Niece", "meiotic_distance": 3, "expected_sharing": 25.0, "ibd1_pct": 50, "ibd2_pct": 0},
    {"relationship": "First cousins", "meiotic_distance": 4, "expected_sharing": 12.5, "ibd1_pct": 25, "ibd2_pct": 0},
    {"relationship": "First cousins once removed", "meiotic_distance": 5, "expected_sharing": 6.25, "ibd1_pct": 12.5, "ibd2_pct": 0},
    {"relationship": "Second cousins", "meiotic_distance": 6, "expected_sharing": 3.125, "ibd1_pct": 6.25, "ibd2_pct": 0},
    {"relationship": "Second cousins once removed", "meiotic_distance": 7, "expected_sharing": 1.563, "ibd1_pct": 3.125, "ibd2_pct": 0},
    {"relationship": "Third cousins", "meiotic_distance": 8, "expected_sharing": 0.781, "ibd1_pct": 1.563, "ibd2_pct": 0},
    {"relationship": "Fourth cousins", "meiotic_distance": 10, "expected_sharing": 0.195, "ibd1_pct": 0.391, "ibd2_pct": 0},
]

# Convert to DataFrame
rel_df = pd.DataFrame(relationships)

# Display the table
rel_df

In [None]:
# Visualize the expected total IBD sharing by relationship
plt.figure(figsize=(14, 8))

# Plot total sharing
plt.subplot(2, 1, 1)
bars = plt.bar(rel_df['relationship'], rel_df['expected_sharing'], color='#4292c6')
plt.xticks(rotation=45, ha='right')
plt.ylabel('Expected Total IBD Sharing (%)')
plt.title('Expected IBD Sharing by Relationship Type')
plt.grid(axis='y', alpha=0.3)

# Add labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.5,
            f'{height:.2f}%', ha='center', va='bottom')

# Plot IBD1 vs IBD2 distribution
plt.subplot(2, 1, 2)
width = 0.35
x = np.arange(len(rel_df))

plt.bar(x - width/2, rel_df['ibd1_pct'], width, label='IBD1 (Half-identical)', color='#6baed6')
plt.bar(x + width/2, rel_df['ibd2_pct'], width, label='IBD2 (Fully identical)', color='#08519c')

plt.xlabel('Relationship Type')
plt.ylabel('Genome Coverage (%)')
plt.title('IBD1 vs IBD2 Distribution by Relationship Type')
plt.xticks(x, rel_df['relationship'], rotation=45, ha='right')
plt.legend()
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()
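
The `expected_sharing` values in the table follow a simple rule for non-inbred relationships: each meiosis halves the expected sharing, and a pair related through two common ancestors (for example, a couple) shares twice as much as a pair related through one. The sketch below is a minimal check of that rule. The `common_ancestors` argument is our own addition (it is not a column in the table), and for direct-line relationships such as parent-child it is taken to be 1.

```python
def expected_sharing_pct(meiotic_distance: int, common_ancestors: int) -> float:
    """Expected total IBD sharing (%) for a simple, non-inbred relationship.

    meiotic_distance: number of meioses separating the two individuals
    common_ancestors: 1 (half-siblings, direct-line relatives) or
                      2 (full siblings, aunts/uncles, cousins)
    """
    return 100.0 * common_ancestors * 0.5 ** meiotic_distance


# (relationship, meiotic distance, number of common ancestors)
examples = [
    ("Parent-Child", 1, 1),
    ("Full siblings", 2, 2),
    ("Half-siblings", 2, 1),
    ("Uncle/Aunt-Nephew/Niece", 3, 2),
    ("First cousins", 4, 2),
    ("Second cousins", 6, 2),
]

for name, m, a in examples:
    print(f"{name:<25} {expected_sharing_pct(m, a):7.3f}%")
```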

### Stochasticity in IBD Sharing

While we've described the expected IBD sharing for different relationships, it's important to understand that actual IBD sharing is stochastic due to the random nature of recombination and inheritance.

Let's simulate the distribution of IBD sharing for several relationship types to illustrate this variability:

In [None]:
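# Simulate IBD sharing for several relationship types
np.random.seed(42)
num_simulations = 10000

# (name, mean, std) of total sharing as a fraction of the genome.
# The std values are illustrative round numbers chosen to show the spread.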
relationships_to_simulate = [
    ("Full siblings", 0.5, 0.1),  # mean=50%, std=10%
    ("Half-siblings/Grandparent-Grandchild/Avuncular", 0.25, 0.05),  # mean=25%, std=5%
    ("First cousins", 0.125, 0.035)  # mean=12.5%, std=3.5%
]

# Create a figure to display distributions
plt.figure(figsize=(14, 10))

# Plot each relationship's distribution
for i, (rel_name, mean, std) in enumerate(relationships_to_simulate):
    plt.subplot(len(relationships_to_simulate), 1, i+1)
    
    # Generate simulated sharing values (using gamma distribution for positive skew)
    shape = (mean / std) ** 2
    scale = std ** 2 / mean
    sharing = np.random.gamma(shape, scale, num_simulations)
    
    # Plot the distribution
    sns.histplot(sharing * 100, bins=50, kde=True)
    plt.axvline(mean * 100, color='red', linestyle='--', label=f'Expected: {mean*100:.1f}%')
    
    # Add 95% interval
    lower_95 = np.percentile(sharing * 100, 2.5)
    upper_95 = np.percentile(sharing * 100, 97.5)
    plt.axvline(lower_95, color='green', linestyle=':', label=f'95% Interval: [{lower_95:.1f}%, {upper_95:.1f}%]')
    plt.axvline(upper_95, color='green', linestyle=':')
    
    plt.title(f'Distribution of IBD Sharing for {rel_name}')
    plt.xlabel('IBD Sharing (%)')
    plt.ylabel('Frequency')
    plt.legend()
    plt.grid(alpha=0.3)
    
plt.tight_layout()
plt.show()
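
The `shape` and `scale` expressions in the simulation are a method-of-moments fit: a gamma distribution has mean `shape * scale` and variance `shape * scale**2`, so choosing `shape = (mean / std)**2` and `scale = std**2 / mean` reproduces the requested mean and standard deviation. The short check below confirms this for the first-cousin values; the helper name `gamma_params` is our own.

```python
def gamma_params(mean: float, std: float):
    """Method-of-moments gamma parameters (shape, scale) for a target mean and std."""
    shape = (mean / std) ** 2
    scale = std ** 2 / mean
    return shape, scale


mean, std = 0.125, 0.035  # first-cousin values used in the simulation
shape, scale = gamma_params(mean, std)

# Recover the moments from the fitted parameters
recovered_mean = shape * scale
recovered_std = (shape * scale ** 2) ** 0.5

print(f"shape={shape:.3f}, scale={scale:.5f}")
print(f"mean={recovered_mean:.4f}, std={recovered_std:.4f}")
```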

### Using Bonsai v3 for Relationship Inference

Bonsai v3 infers relationships by computing the likelihood of the observed IBD patterns under candidate relationships. Let's briefly look at the module that handles this:

In [None]:
try:
    from utils.bonsaitree.bonsaitree.v3 import likelihoods
    print("✅ Successfully imported Bonsai v3 likelihoods module")
    
    # Display module information
    print("\nBonsai v3 uses the PwLogLike class for pairwise relationship likelihood calculation.")
    print("This class implements sophisticated statistical models that account for:")
    print("  - IBD segment counts and lengths")
    print("  - Age differences between individuals")
    print("  - Background IBD sharing in the population")
    print("  - Variance in IBD sharing for different relationships")
    
    # Show methods in PwLogLike class if available
    if hasattr(likelihoods, 'PwLogLike'):
        print("\nKey methods in the PwLogLike class:")
        methods = [m for m in dir(likelihoods.PwLogLike) if not m.startswith('_')]
        for method in methods[:10]:  # Show up to 10 methods
            print(f"  - {method}")
        if len(methods) > 10:
            print(f"  - ... and {len(methods) - 10} more methods")
except ImportError as e:
    print(f"❌ Could not import Bonsai v3 likelihoods module: {e}")
    print("This section requires access to the Bonsai v3 codebase.")
    print("\nBonsai v3 uses sophisticated statistical models for relationship inference, including:")
    print("  - Expected IBD segment length distributions for different relationship types")
    print("  - Expected IBD segment counts for different relationship types")
    print("  - Age difference distributions for different relationship types")
    print("  - Combined likelihood models that integrate genetic and demographic evidence")
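
To build intuition for likelihood-based inference, here is a deliberately simplified toy model. It is not Bonsai's `PwLogLike` and makes strong assumptions: it scores only the total sharing fraction, treats it as normally distributed, and reuses the illustrative means and standard deviations from the simulation above. The names `CANDIDATES`, `normal_logpdf`, and `rank_relationships` are hypothetical.

```python
import math

# Illustrative (mean, std) of total sharing fraction per candidate relationship.
CANDIDATES = {
    "Full siblings": (0.50, 0.10),
    "Half-siblings/Grandparent-Grandchild/Avuncular": (0.25, 0.05),
    "First cousins": (0.125, 0.035),
}


def normal_logpdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def rank_relationships(observed_fraction):
    """Candidates sorted from most to least likely under the toy model."""
    scores = {
        name: normal_logpdf(observed_fraction, mean, std)
        for name, (mean, std) in CANDIDATES.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


for name, log_like in rank_relationships(0.13):
    print(f"{log_like:8.3f}  {name}")
```

Total sharing alone cannot separate half-siblings, grandparent-grandchild, and avuncular pairs, which is one reason a fuller model also draws on segment counts, segment lengths, and age differences.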

## Pedigree Reconstruction Fundamentals

Pedigree reconstruction is the process of inferring family tree structures from genetic data. This is a challenging computational problem due to:

1. The combinatorial explosion of possible pedigree structures
2. The stochastic nature of genetic inheritance
3. The presence of missing or incomplete data
4. The need to integrate multiple types of evidence

Bonsai v3 addresses these challenges through:

1. A hierarchical bottom-up construction approach
2. Statistical modeling of relationship likelihoods
3. Efficient data structures for pedigree representation
4. Optimized algorithms for pedigree operations

### Bonsai v3 Pedigree Representation: The Up-Node Dictionary

Bonsai v3 represents pedigrees with an "up-node dictionary": a dictionary that maps each individual's ID to a dictionary of that individual's parents. Founders map to an empty dictionary, and negative IDs denote inferred (latent) individuals:

In [None]:
# Example of a simple pedigree using Bonsai's up-node dictionary structure
example_pedigree = {
    1000: {1001: 1, 1002: 1},  # Individual 1000 has parents 1001 and 1002
    1003: {1001: 1, 1002: 1},  # Individual 1003 has the same parents (siblings)
    1004: {-1: 1, -2: 1},      # Individual 1004 has inferred parents -1 and -2
    -1: {1005: 1, 1006: 1},    # Inferred individual -1 has parents 1005 and 1006
    # Empty dictionaries represent founder individuals with no recorded parents
    1001: {},
    1002: {},
    1005: {},
    1006: {},
    -2: {}
}

# Create a visualization of this pedigree
def visualize_up_node_dict(up_node_dict):
    """Visualize a pedigree from an up-node dictionary"""
    # Create a directed graph
    G = nx.DiGraph()
    
    # Add nodes and edges
    for child, parents in up_node_dict.items():
        # Add the child node
        is_inferred = child < 0
        G.add_node(child, inferred=is_inferred)
        
        # Add parent nodes and edges
        for parent in parents:
            is_parent_inferred = parent < 0
            G.add_node(parent, inferred=is_parent_inferred)
            G.add_edge(parent, child)  # Edge from parent to child
    
    # Position nodes using hierarchical layout
    pos = nx.nx_agraph.graphviz_layout(G, prog='dot')
    
    # Draw the pedigree
    plt.figure(figsize=(10, 8))
    
    # Draw regular and inferred nodes differently
    regular_nodes = [node for node in G.nodes() if not G.nodes[node]['inferred']]
    inferred_nodes = [node for node in G.nodes() if G.nodes[node]['inferred']]
    
    nx.draw_networkx_nodes(G, pos, nodelist=regular_nodes, node_color='skyblue', 
                         node_size=1000, node_shape='o')
    nx.draw_networkx_nodes(G, pos, nodelist=inferred_nodes, node_color='lightgray', 
                         node_size=1000, node_shape='s')
    
    # Draw edges
    nx.draw_networkx_edges(G, pos, arrows=True, arrowsize=20)
    
    # Add labels
    labels = {node: f"ID: {node}" for node in G.nodes()}
    nx.draw_networkx_labels(G, pos, labels=labels, font_size=10)
    
    plt.axis('off')
    plt.title('Bonsai v3 Pedigree Representation (Up-Node Dictionary)')
    
    # Add a legend
    plt.figtext(0.15, 0.02, "● Regular individual (observed)", color='black', backgroundcolor='skyblue')
    plt.figtext(0.55, 0.02, "■ Inferred individual (latent)", color='black', backgroundcolor='lightgray')
    
    plt.tight_layout()
    plt.show()

# Visualize the example pedigree
try:
    visualize_up_node_dict(example_pedigree)
except ImportError:
    print("⚠️ Could not visualize pedigree: required libraries not available")
    print("This visualization requires the pygraphviz library.")
    print("\nThe up-node dictionary represents the following relationships:")
    for child, parents in example_pedigree.items():
        parent_str = ", ".join([str(p) for p in parents]) if parents else "(founder)"
        print(f"ID: {child} → Parents: {parent_str}")
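
Because every entry points "up" to an individual's parents, common pedigree queries reduce to short traversals. The helpers below (`get_ancestors`, `get_siblings`) are hypothetical names written for this lab, not functions from the Bonsai codebase; they operate on a copy of the example pedigree so the block runs on its own.

```python
def get_ancestors(up_node_dict, node):
    """All ancestors of `node`, found by walking parent links upward."""
    ancestors = set()
    stack = list(up_node_dict.get(node, {}))
    while stack:
        current = stack.pop()
        if current not in ancestors:
            ancestors.add(current)
            stack.extend(up_node_dict.get(current, {}))
    return ancestors


def get_siblings(up_node_dict, node):
    """Individuals that share at least one parent with `node`."""
    parents = set(up_node_dict.get(node, {}))
    return {
        other
        for other, other_parents in up_node_dict.items()
        if other != node and parents & set(other_parents)
    }


pedigree = {
    1000: {1001: 1, 1002: 1},
    1003: {1001: 1, 1002: 1},
    1004: {-1: 1, -2: 1},
    -1: {1005: 1, 1006: 1},
    1001: {}, 1002: {}, 1005: {}, 1006: {}, -2: {},
}

print(sorted(get_ancestors(pedigree, 1004)))  # [-2, -1, 1005, 1006]
print(get_siblings(pedigree, 1000))           # {1003}
```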

### The Bonsai v3 Pedigree Reconstruction Workflow

The complete Bonsai v3 workflow follows these steps:

1. Process raw IBD segments from detectors like IBIS
2. Extract IBD statistics (counts, lengths) for all individual pairs
3. Compute pairwise relationship likelihoods using genetic and age data
4. Build small, highly confident pedigrees for closely related individuals
5. Iteratively combine smaller pedigrees into larger ones
6. Add individuals incrementally when building large pedigrees
7. Return the final pedigree with log-likelihood scores

This modular approach allows Bonsai to scale effectively to large datasets while maintaining accuracy.
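
Step 2 of the workflow, extracting per-pair IBD statistics, can be sketched as follows. The segment tuples, the 7 cM minimum length, and the function `pair_statistics` are all assumptions made for illustration; they do not reflect any detector's actual output format or Bonsai's internal data structures.

```python
from collections import defaultdict

# Hypothetical segments: (id1, id2, chromosome, start_cM, end_cM)
segments = [
    (1000, 1003, 1, 10.0, 45.0),
    (1003, 1000, 2, 5.0, 30.0),
    (1000, 1004, 1, 50.0, 62.0),
    (1000, 1004, 7, 3.0, 9.5),   # 6.5 cM: dropped by the length filter
]


def pair_statistics(segments, min_length_cm=7.0):
    """Segment count and total cM per unordered pair, ignoring short segments."""
    stats = defaultdict(lambda: {"count": 0, "total_cm": 0.0})
    for id1, id2, _chrom, start_cm, end_cm in segments:
        length = end_cm - start_cm
        if length < min_length_cm:
            continue
        pair = tuple(sorted((id1, id2)))  # (a, b) and (b, a) are the same pair
        stats[pair]["count"] += 1
        stats[pair]["total_cm"] += length
    return dict(stats)


stats = pair_statistics(segments)
for pair, values in sorted(stats.items()):
    print(pair, values)
```

Summaries like these, one per pair, are the kind of input that the pairwise likelihood step then consumes.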

## Summary and Key Concepts

In this lab, we've explored the foundational concepts of genetic genealogy and Identity-by-Descent (IBD) that underpin the Bonsai v3 system.

Key takeaways include:

1. **DNA Inheritance**: Genetic material is inherited from parents in predictable patterns that allow relationship inference
2. **Identity by Descent (IBD)**: IBD segments are regions of DNA inherited from a common ancestor
3. **IBD Types**: IBD1 (half-identical) and IBD2 (fully identical) regions provide different evidence for relationships
4. **Relationship Patterns**: Different relationship types show characteristic patterns of IBD sharing
5. **Pedigree Representation**: Bonsai v3 uses an efficient up-node dictionary structure to represent family trees
6. **Pedigree Reconstruction**: Bonsai v3 implements a hierarchical approach to building pedigrees from IBD data

In the next lab session, we'll explore the Bonsai v3 architecture in more detail, examining its core components and data structures.

In [None]:
# Convert this notebook to PDF using poetry
!poetry run jupyter nbconvert --to pdf Lab01_IBD_and_Genealogy_Intro.ipynb

# Note: PDF conversion requires LaTeX to be installed on your system
# If you encounter errors, you may need to install it:
# On Ubuntu/Debian: sudo apt-get install texlive-xetex
# On macOS with Homebrew: brew install texlive