# Lab 6: Probabilistic Relationship Inference

## Overview

This lab explores the probabilistic models used in Bonsai v3 for relationship inference. We'll focus on how statistical moments of IBD distributions are used to calculate likelihoods for different possible relationships.

Key topics include:

1. The mathematical framework for relationship inference
2. Computing statistical moments for IBD segment distributions
3. Building likelihood functions for different relationship types
4. Handling uncertainty and stochasticity in relationship inference
5. Calibrating models with real-world data

By the end of this lab, you'll understand how Bonsai v3 quantifies the probability of different relationships given observed IBD patterns.

In [None]:
# 🧬 Google Colab Setup - Run this cell first!
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown

def is_colab():
    '''Check if running in Google Colab'''
    try:
        import google.colab
        return True
    except ImportError:
        return False

if is_colab():
    print("🔬 Setting up Google Colab environment...")
    
    # Install dependencies
    print("📦 Installing packages...")
    !pip install -q pysam biopython scikit-allel networkx pygraphviz seaborn plotly
    !apt-get update -qq && apt-get install -qq samtools bcftools tabix graphviz-dev
    
    # Create directories
    !mkdir -p /content/class_data /content/results
    
    # Download essential class data
    print("📥 Downloading class data...")
    S3_BASE = "https://computational-genetic-genealogy.s3.us-east-2.amazonaws.com/class_data/"
    data_files = [
        "pedigree.fam", "pedigree.def", 
        "merged_opensnps_autosomes_ped_sim.seg",
        "merged_opensnps_autosomes_ped_sim-everyone.fam",
        "ped_sim_run2.seg", "ped_sim_run2-everyone.fam"
    ]
    
    for file in data_files:
        !wget -q -O /content/class_data/{file} {S3_BASE}{file}
        print(f"  ✅ {file}")
    
    # Define utility functions
    def setup_environment():
        return "/content/class_data", "/content/results"
    
    def save_results(dataframe, filename, description="results"):
        os.makedirs("/content/results", exist_ok=True)
        full_path = f"/content/results/{filename}"
        dataframe.to_csv(full_path, index=False)
        display(HTML(f'''
        <div style="padding: 10px; background-color: #e3f2fd; border-left: 4px solid #2196f3; margin: 10px 0;">
            <p><strong>💾 Results saved!</strong> To download: 
            <code>from google.colab import files; files.download('{full_path}')</code></p>
        </div>
        '''))
        return full_path
    
    def save_plot(plt, filename, description="plot"):
        os.makedirs("/content/results", exist_ok=True)
        full_path = f"/content/results/{filename}"
        plt.savefig(full_path, dpi=300, bbox_inches='tight')
        plt.show()
        display(HTML(f'''
        <div style="padding: 10px; background-color: #e8f5e8; border-left: 4px solid #4caf50; margin: 10px 0;">
            <p><strong>📊 Plot saved!</strong> To download: 
            <code>from google.colab import files; files.download('{full_path}')</code></p>
        </div>
        '''))
        return full_path
    
    print("✅ Colab setup complete! Ready to explore genetic genealogy.")
    
else:
    print("🏠 Local environment detected")
    def setup_environment():
        return "class_data", "results"
    def save_results(df, filename, description=""):
        os.makedirs("results", exist_ok=True)
        path = f"results/{filename}"
        df.to_csv(path, index=False)
        return path
    def save_plot(plt, filename, description=""):
        os.makedirs("results", exist_ok=True)
        path = f"results/{filename}"
        plt.savefig(path, dpi=300, bbox_inches='tight')
        plt.show()
        return path

# Set up paths and configure visualization
DATA_DIR, RESULTS_DIR = setup_environment()
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [None]:
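# Setup Bonsai module paths
def is_jupyterlite():
    '''Check if running in JupyterLite (assumption: Pyodide reports sys.platform as "emscripten")'''
    return sys.platform == "emscripten"

if not is_jupyterlite():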
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
import importlib
import inspect

def view_function_source(module_name, function_name):
    """Display the source code of a function"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Get the function
        func = getattr(module, function_name)
        
        # Get the source code
        source = inspect.getsource(func)
        
        # Print the source code
        from IPython.display import display, Markdown
        display(Markdown(f"```python\n{source}\n```"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except AttributeError:
        print(f"Function {function_name} not found in module {module_name}")
    except Exception as e:
        print(f"Error processing function {function_name}: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [None]:
import inspect  # needed below to list the classes in the likelihoods module

try:
    from utils.bonsaitree.bonsaitree import v3
    print("✅ Successfully imported Bonsai v3 module")
    
    # Check if moments module is available
    try:
        from utils.bonsaitree.bonsaitree.v3 import moments, likelihoods
        print("✅ Successfully imported Bonsai v3 moments and likelihoods modules")
        
        # Display available methods in modules
        print("\nAvailable functions in moments module:")
        for name in dir(moments):
            if not name.startswith('_') and callable(getattr(moments, name)):
                print(f"- {name}")
                
        print("\nAvailable classes in likelihoods module:")
        for name in dir(likelihoods):
            if not name.startswith('_') and inspect.isclass(getattr(likelihoods, name)):
                print(f"- {name}")
                
    except ImportError as e:
        print(f"❌ Failed to import Bonsai v3 moments module: {e}")
except ImportError as e:
    print(f"❌ Failed to import Bonsai v3 module: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")
    print("Make sure you've properly set up your environment with the Bonsai repository.")

## Part 1: Relationship Representation and Mathematical Framework

Let's start by exploring how Bonsai v3 represents relationships and the mathematical framework it uses for relationship inference.

### 1.1 Relationship Representation

Bonsai v3 uses a tuple representation for relationships. There are two equivalent formats used in different parts of the code:

1. **Five-value tuple**: `(degree1, removal1, degree2, removal2, half)`
   - `degree1`: Genealogical degree of person 1 (ancestors to common ancestor)
   - `removal1`: Removal for person 1 (generational distance within degree)
   - `degree2`: Genealogical degree of person 2 (ancestors to common ancestor)
   - `removal2`: Removal for person 2 (generational distance within degree)
   - `half`: 1 for half relationships (one common ancestor), 0 for full relationships (two common ancestors)

2. **Three-value tuple**: `(up, down, num_ancs)`
   - `up`: Generations from person 1 to common ancestor
   - `down`: Generations from common ancestor to person 2
   - `num_ancs`: Number of common ancestors (1 for half-relationships, 2 for full relationships)

The three-value representation is more commonly used in the moments module. Let's create a function to map these tuples to familiar relationship terms:

In [None]:
def describe_relationship(rel_tuple):
    """Convert a relationship tuple (up, down, num_ancs) to a human-readable description."""
    up, down, num_ancs = rel_tuple
    
    if up == 0 and down == 0 and num_ancs == 2:
        return "Self"
    elif up == 0 and down == 1 and num_ancs == 1:
        return "Parent"
    elif up == 1 and down == 0 and num_ancs == 1:
        return "Child"
    elif up == 1 and down == 1 and num_ancs == 2:
        return "Full Sibling"
    elif up == 1 and down == 1 and num_ancs == 1:
        return "Half Sibling"
    elif up == 0 and down == 2 and num_ancs == 1:
        return "Grandparent"
    elif up == 2 and down == 0 and num_ancs == 1:
        return "Grandchild"
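    # A full aunt/uncle shares two common ancestors with the niece/nephew;
    # a single common ancestor gives the half relationship
    elif up == 1 and down == 2 and num_ancs in (1, 2):
        return "Aunt/Uncle" if num_ancs == 2 else "Half Aunt/Uncle"
    elif up == 2 and down == 1 and num_ancs in (1, 2):
        return "Niece/Nephew" if num_ancs == 2 else "Half Niece/Nephew"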
    elif up == 2 and down == 2 and num_ancs == 2:
        return "Full First Cousin"
    elif up == 2 and down == 2 and num_ancs == 1:
        return "Half First Cousin"
    elif up == 3 and down == 3 and num_ancs == 2:
        return "Full Second Cousin"
    elif up == 3 and down == 3 and num_ancs == 1:
        return "Half Second Cousin"
    elif up == 4 and down == 4 and num_ancs == 2:
        return "Full Third Cousin"
    elif up == 4 and down == 4 and num_ancs == 1:
        return "Half Third Cousin"
    # (1, 3) is a great-aunt/uncle; a first cousin once removed would be (2, 3) or (3, 2)
    elif up == 1 and down == 3 and num_ancs == 1:
        return "Half Great-Aunt/Uncle"
    elif up == 3 and down == 1 and num_ancs == 1:
        return "Half Great-Niece/Nephew"
    else:
        return f"Complex Relationship (up={up}, down={down}, num_ancs={num_ancs})"

# Create a table of common relationships and their tuple representations
relationships = [
    ((0, 0, 2), "Self"),
    ((0, 1, 1), "Parent"),
    ((1, 0, 1), "Child"),
    ((1, 1, 2), "Full Sibling"),
    ((1, 1, 1), "Half Sibling"),
    ((0, 2, 1), "Grandparent"),
    ((2, 0, 1), "Grandchild"),
    ((1, 2, 1), "Half Aunt/Uncle"),
    ((2, 1, 1), "Half Niece/Nephew"),
    ((2, 2, 2), "Full First Cousin"),
    ((2, 2, 1), "Half First Cousin"),
    ((3, 3, 2), "Full Second Cousin"),
    ((3, 3, 1), "Half Second Cousin"),
    ((4, 4, 2), "Full Third Cousin"),
    ((4, 4, 1), "Half Third Cousin")
]

# Create a DataFrame for easier viewing
rel_df = pd.DataFrame([(rel[0][0], rel[0][1], rel[0][2], rel[1], rel[0][0] + rel[0][1]) 
                       for rel in relationships],
                     columns=['Up', 'Down', 'Num Ancestors', 'Relationship', 'Meiotic Distance'])
display(rel_df)

### 1.2 The Mathematical Framework for Relationship Inference

The core mathematical framework for relationship inference in Bonsai v3 is Bayesian. For a pair of individuals with observed IBD data, we want to find the most likely relationship:

$$P(R|D) = \frac{P(D|R) \cdot P(R)}{P(D)}$$

Where:
- $P(R|D)$ is the probability of relationship $R$ given the observed IBD data $D$
- $P(D|R)$ is the likelihood of observing IBD data $D$ given relationship $R$
- $P(R)$ is the prior probability of relationship $R$
- $P(D)$ is the probability of the observed IBD data across all possible relationships

Since we're comparing different possible relationships for the same IBD data, $P(D)$ is constant and we can focus on:

$$P(R|D) \propto P(D|R) \cdot P(R)$$

Often, we work in log space to avoid numerical issues with very small probabilities:

$$\log P(R|D) = \log P(D|R) + \log P(R) + \text{const}$$
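
To compare candidate relationships as probabilities, the unnormalized log posteriors have to be exponentiated and normalized. The sketch below shows one numerically stable way to do that with the log-sum-exp trick. The function name `normalize_log_posteriors` and the example numbers are hypothetical and chosen only for illustration; they are not taken from Bonsai v3.

```python
import numpy as np

def normalize_log_posteriors(log_likelihoods, log_priors):
    """Return posterior probabilities from per-relationship log-likelihoods and log-priors."""
    log_post = np.asarray(log_likelihoods, dtype=float) + np.asarray(log_priors, dtype=float)
    # Subtract the maximum before exponentiating (log-sum-exp trick) to avoid underflow
    shifted = log_post - np.max(log_post)
    weights = np.exp(shifted)
    return weights / weights.sum()

# Hypothetical values for three candidate relationships (illustration only)
example_log_likelihoods = [-1050.0, -1052.0, -1060.0]
example_log_priors = [-1.0, -0.5, -2.0]
posteriors = normalize_log_posteriors(example_log_likelihoods, example_log_priors)
print(posteriors)
```

Exponentiating values near -1050 directly would underflow to zero, which is why the maximum is subtracted first; the subtraction cancels in the normalization.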

Let's implement this framework in simplified form:

In [None]:
def log_prior(relationship_tuple):
    """Calculate the log prior probability of a relationship.
    
    Args:
        relationship_tuple: (up, down, num_ancs) tuple
        
    Returns:
        Log prior probability
    """
    up, down, num_ancs = relationship_tuple
    
    # Calculate meiotic distance
    m = up + down
    
    # Simple prior model: probability decreases exponentially with meiotic distance
    # and is higher for full relationships than half relationships
    if m == 0:  # Self
        return -10  # Very low prior for self-comparisons
    
    # Base prior depends on meiotic distance
    base_prior = -0.5 * m
    
    # Adjust for full vs half relationships
    if num_ancs == 2:  # Full relationship
        base_prior += 0.5  # Boost for full relationships
    
    # Penalize highly asymmetric relationships (large difference between up and down)
    asymmetry_penalty = -0.2 * abs(up - down)
    
    return base_prior + asymmetry_penalty

# Calculate log priors for common relationships
priors = [(rel[0], rel[1], log_prior(rel[0])) for rel in relationships]

# Convert to DataFrame for easier viewing
prior_df = pd.DataFrame([(p[0][0], p[0][1], p[0][2], p[1], p[2], np.exp(p[2])) 
                         for p in priors],
                      columns=['Up', 'Down', 'Num Ancestors', 'Relationship', 'Log Prior', 'Prior'])
prior_df['Prior'] = prior_df['Prior'] / prior_df['Prior'].sum()  # Normalize
display(prior_df.sort_values('Prior', ascending=False))

# Visualize relationship priors
plt.figure(figsize=(12, 6))
bars = plt.bar(prior_df['Relationship'], prior_df['Prior'] * 100)
plt.xticks(rotation=45, ha='right')
plt.ylabel('Prior Probability (%)')
plt.title('Prior Probabilities for Different Relationships')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

## Part 2: Statistical Moments for IBD Distributions

The core of Bonsai's relationship inference algorithm lies in calculating statistical moments for IBD distributions. Let's explore the key moments and how they're calculated.

### 2.1 Lambda Parameter (Segment Length Distribution)

The lambda parameter controls the distribution of IBD segment lengths. For a relationship with meiotic distance $m$, the lambda parameter is approximately $\frac{m}{100}$ per cM, since each of the $m$ meioses contributes on average one crossover per 100 cM. This means:

- The expected length of IBD segments is $\frac{100}{m}$ cM
- Segments follow an exponential distribution
- The probability that a segment exceeds length $x$ is $e^{-\lambda x} = e^{-\frac{m}{100}x}$
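
As a quick numerical illustration of the three points above, the following sketch evaluates the mean length and the tail probability for a few meiotic distances. The helper name `prob_segment_exceeds` and the 7 cM cutoff are assumptions made for this example.

```python
import math

def prob_segment_exceeds(x_cm, m):
    """P(segment length > x_cm) under an exponential model with rate m/100 per cM."""
    return math.exp(-m * x_cm / 100)

for m in (2, 4, 6):
    mean_len = 100 / m
    p_tail = prob_segment_exceeds(7, m)
    print(f"m={m}: mean length = {mean_len:.1f} cM, P(length > 7 cM) = {p_tail:.3f}")
```

More distant relationships (larger $m$) give shorter segments on average, so a smaller fraction of them survives any fixed length cutoff.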

Let's implement a function to calculate lambda values for different relationships:

In [None]:
def get_lambda(relationship_tuple):
    """Calculate the lambda parameter for a relationship.
    
    Args:
        relationship_tuple: (up, down, num_ancs) tuple
        
    Returns:
        Lambda parameter for segment length distribution
    """
    up, down, num_ancs = relationship_tuple
    
    # Calculate meiotic distance
    m = up + down
    
    # Handle special case for self-comparison
    if m == 0 and num_ancs == 2:  # Self
        return 0  # No segments (all DNA is identical)
    
    # Lambda is meiotic distance / 100
    return m / 100

# Calculate lambda values for common relationships
lambda_values = [(rel[0], rel[1], get_lambda(rel[0]), 100 / (rel[0][0] + rel[0][1]) if rel[0][0] + rel[0][1] > 0 else float('inf')) 
                 for rel in relationships]

# Convert to DataFrame for easier viewing
lambda_df = pd.DataFrame([(l[0][0], l[0][1], l[0][2], l[1], l[2], l[3]) 
                         for l in lambda_values],
                       columns=['Up', 'Down', 'Num Ancestors', 'Relationship', 'Lambda', 'Mean Segment Length (cM)'])
display(lambda_df)

# Visualize lambda values and mean segment lengths
plt.figure(figsize=(15, 10))

plt.subplot(2, 1, 1)
bars = plt.bar(lambda_df['Relationship'], lambda_df['Lambda'])
plt.xticks(rotation=45, ha='right')
plt.ylabel('Lambda Parameter')
plt.title('Lambda Parameters for Different Relationships')
plt.grid(axis='y', alpha=0.3)

plt.subplot(2, 1, 2)
# Filter out infinite values (self)
filtered_df = lambda_df[lambda_df['Mean Segment Length (cM)'] < 1000]
bars = plt.bar(filtered_df['Relationship'], filtered_df['Mean Segment Length (cM)'])
plt.xticks(rotation=45, ha='right')
plt.ylabel('Mean Segment Length (cM)')
plt.title('Expected Mean Segment Lengths for Different Relationships')
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

### 2.2 Eta Parameter (Segment Count Distribution)

The eta parameter controls the expected number of IBD segments for a relationship. It's calculated as:

$$\eta = \frac{a(rm+c)}{2^{m-1}} \cdot e^{-\frac{m \cdot t}{100}}$$

Where:
- $a$ is the number of common ancestors (1 for half-relationships, 2 for full relationships)
- $r$ is the expected number of crossovers per meiosis (~34, the genome length in Morgans)
- $m$ is the meiotic distance
- $c$ is the number of chromosomes (22 autosomes)
- $t$ is the minimum detectable segment length
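
To make the formula concrete, here is the arithmetic for full first cousins worked step by step. The values $r = 34$, $c = 22$ and $t = 7$ cM are the ones assumed in this lab; the variable names are only for this illustration.

```python
import math

a, m = 2, 4          # full first cousins: two common ancestors, four meioses
r, c, t = 34, 22, 7  # crossovers per meiosis, autosomes, threshold in cM (assumed values)

base = a * (r * m + c) / 2 ** (m - 1)   # 2 * (136 + 22) / 8 = 39.5
p_detect = math.exp(-m * t / 100)       # exp(-0.28), about 0.756
eta_first_cousins = base * p_detect     # about 29.9 expected detectable segments
print(f"base = {base}, p_detect = {p_detect:.3f}, eta = {eta_first_cousins:.2f}")
```

The exponential factor removes roughly a quarter of the segments here, because that share is expected to fall below the 7 cM threshold.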

Let's implement a function to calculate eta values:

In [None]:
def get_eta(relationship_tuple, min_seg_len=7, genome_length=3400):
    """Calculate the eta parameter (expected number of segments) for a relationship.
    
    Args:
        relationship_tuple: (up, down, num_ancs) tuple
        min_seg_len: Minimum detectable segment length (cM)
        genome_length: Total genome length (cM)
        
    Returns:
        Eta parameter for segment count distribution
    """
    up, down, num_ancs = relationship_tuple
    
    # Calculate meiotic distance
    m = up + down
    
    # Number of common ancestors
    a = num_ancs
    
    # Handle special case for self-comparison
    if m == 0 and num_ancs == 2:  # Self
        return 0  # No segments (all DNA is identical)
    
    # Parameters
    r = genome_length / 100  # Number of recombinations per meiosis (~34)
    c = 22  # Number of chromosomes
    
    # Calculate the base expected number of segments
    # eta = a * (r*m + c) / (2^(m-1))
    if m <= 1:
        base_eta = a * (r * m + c)
    else:
        base_eta = a * (r * m + c) / (2 ** (m - 1))
    
    # Probability of a segment exceeding the minimum threshold
    p_obs = np.exp(-m * min_seg_len / 100)
    
    # Final expected segment count
    eta = base_eta * p_obs
    
    return eta

# Calculate eta values for common relationships
eta_values = [(rel[0], rel[1], get_eta(rel[0])) for rel in relationships]

# Convert to DataFrame for easier viewing
eta_df = pd.DataFrame([(e[0][0], e[0][1], e[0][2], e[1], e[2]) 
                      for e in eta_values],
                     columns=['Up', 'Down', 'Num Ancestors', 'Relationship', 'Eta (Expected Segments)'])
display(eta_df)

# Visualize eta values
plt.figure(figsize=(12, 6))
bars = plt.bar(eta_df['Relationship'], eta_df['Eta (Expected Segments)'])
plt.xticks(rotation=45, ha='right')
plt.ylabel('Expected Number of Segments')
plt.title('Expected Segment Counts for Different Relationships')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

### 2.3 Total IBD Sharing

The expected total IBD sharing for a relationship is calculated from the eta and lambda parameters. For a relationship with meiotic distance $m$ and expected segment count $\eta$, the expected total IBD sharing is:

$$E[\text{Total IBD}] = \eta \cdot (\frac{100}{m} + t)$$

Where $t$ is the minimum detectable segment length. Because the exponential distribution is memoryless, a segment known to exceed $t$ has expected length $t + \frac{100}{m}$, which is how the formula accounts for truncation at the minimum threshold.
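
The per-segment term $\frac{100}{m} + t$ can be checked by simulation. The sketch below draws exponential lengths, discards those at or below the threshold, and compares the mean of the survivors with the analytic value. The seed, sample size and parameter values are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, t = 4, 7.0                      # meiotic distance and threshold (cM); example values
raw = rng.exponential(scale=100 / m, size=400_000)
detected = raw[raw > t]            # keep only segments above the detection threshold

simulated_mean = detected.mean()
analytic_mean = 100 / m + t
print(f"simulated {simulated_mean:.2f} cM vs analytic {analytic_mean:.2f} cM")
```

The two values should agree to within sampling noise, which supports treating detected segments as the threshold plus an exponential excess.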

Let's implement a function to calculate expected total IBD sharing:

In [None]:
def get_expected_total_ibd(relationship_tuple, min_seg_len=7, genome_length=3400):
    """Calculate expected total IBD sharing for a relationship.
    
    Args:
        relationship_tuple: (up, down, num_ancs) tuple
        min_seg_len: Minimum detectable segment length (cM)
        genome_length: Total genome length (cM)
        
    Returns:
        Expected total IBD (cM)
    """
    up, down, num_ancs = relationship_tuple
    
    # Calculate meiotic distance
    m = up + down
    
    # Handle special case for self-comparison
    if m == 0 and num_ancs == 2:  # Self
        return genome_length  # All DNA is identical
    
    # Get expected segment count
    eta = get_eta(relationship_tuple, min_seg_len, genome_length)
    
    # Calculate expected total IBD
    # This formula accounts for the truncation of the exponential distribution
    if m > 0:
        expected_total = eta * (100 / m + min_seg_len)
    else:
        expected_total = eta * min_seg_len  # Fallback for m=0
    
    return expected_total

# Calculate expected total IBD for common relationships
total_ibd_values = [(rel[0], rel[1], get_expected_total_ibd(rel[0])) for rel in relationships]

# Convert to DataFrame for easier viewing
total_ibd_df = pd.DataFrame([(t[0][0], t[0][1], t[0][2], t[1], t[2], t[2]/3400*100) 
                           for t in total_ibd_values],
                          columns=['Up', 'Down', 'Num Ancestors', 'Relationship', 'Expected Total IBD (cM)', 'Percent of Genome'])
display(total_ibd_df.sort_values('Expected Total IBD (cM)', ascending=False))

# Visualize expected total IBD
plt.figure(figsize=(12, 8))
bars = plt.bar(total_ibd_df['Relationship'], total_ibd_df['Percent of Genome'])
plt.xticks(rotation=45, ha='right')
plt.ylabel('Expected IBD Sharing (% of Genome)')
plt.title('Expected Total IBD Sharing for Different Relationships')
plt.grid(axis='y', alpha=0.3)

# Add a reference line for the genome length
plt.axhline(100, color='red', linestyle='--', alpha=0.7, label='Total Genome')
plt.legend()

plt.tight_layout()
plt.show()

## Part 3: Building Likelihood Functions

Now let's build likelihood functions that calculate the probability of observed IBD patterns given different relationships. 

### 3.1 Likelihood Function for Segment Count

In this model, the number of IBD segments follows a Poisson distribution with mean $\eta$. The probability of observing $k$ segments given relationship $R$ is:

$$P(k|R) = \frac{\eta^k e^{-\eta}}{k!}$$
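
Taking the logarithm gives $\log P(k|R) = k \log \eta - \eta - \log k!$. As a dependency-free sketch, this can be evaluated with the standard library, using `math.lgamma(k + 1)` for $\log k!$. The function name `poisson_log_pmf` is hypothetical, and the handling of $\eta = 0$ is an assumption for this illustration.

```python
import math

def poisson_log_pmf(k, eta):
    """log P(k | eta) = k*log(eta) - eta - log(k!), with log(k!) computed via lgamma."""
    if eta <= 0:
        # Degenerate case: with eta = 0 only k = 0 is possible
        return 0.0 if k == 0 else float("-inf")
    return k * math.log(eta) - eta - math.lgamma(k + 1)

# Example: log-probabilities of a few counts when 30 segments are expected
for k in (10, 30, 50):
    print(k, round(poisson_log_pmf(k, 30.0), 3))
```

Working with `lgamma` avoids computing large factorials explicitly, which matters once segment counts reach the dozens.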

Let's implement this likelihood function:

In [None]:
from scipy import stats  # provides the Poisson log-PMF used below

def segment_count_likelihood(observed_count, relationship_tuple, min_seg_len=7, genome_length=3400):
    """Calculate the likelihood of observing a specific segment count given a relationship.
    
    Args:
        observed_count: Number of observed IBD segments
        relationship_tuple: (up, down, num_ancs) tuple
        min_seg_len: Minimum detectable segment length (cM)
        genome_length: Total genome length (cM)
        
    Returns:
        Log-likelihood of the observed count
    """
    up, down, num_ancs = relationship_tuple
    
    # Special case for self-comparison
    if up == 0 and down == 0 and num_ancs == 2:  # Self
        return 0.0 if observed_count == 0 else float('-inf')
    
    # Get expected segment count (eta)
    eta = get_eta(relationship_tuple, min_seg_len, genome_length)
    
    # Calculate log-likelihood using Poisson PMF
    log_likelihood = stats.poisson.logpmf(observed_count, eta)
    
    return log_likelihood

# Test the likelihood function with different segment counts
test_counts = [5, 10, 20, 30, 40, 50]
test_relationships = [
    ((0, 1, 1), "Parent-Child"),
    ((1, 1, 2), "Full Sibling"),
    ((1, 1, 1), "Half Sibling"),
    ((2, 2, 2), "Full First Cousin")
]

# Calculate log-likelihoods for each combination
likelihood_data = []

for count in test_counts:
    row_data = {'observed_count': count}
    
    for rel_tuple, rel_name in test_relationships:
        log_ll = segment_count_likelihood(count, rel_tuple)
        row_data[rel_name] = log_ll
    
    likelihood_data.append(row_data)

# Convert to DataFrame and display
likelihood_df = pd.DataFrame(likelihood_data)
display(likelihood_df)

# Visualize the likelihood function
plt.figure(figsize=(12, 8))

counts = np.arange(0, 100, 1)
for rel_tuple, rel_name in test_relationships:
    log_likelihoods = [segment_count_likelihood(c, rel_tuple) for c in counts]
    
    # Convert to regular likelihoods for better visualization
    likelihoods = np.exp(log_likelihoods)
    
    plt.plot(counts, likelihoods, '-', linewidth=2, label=rel_name)

plt.title("Likelihood of Observed Segment Counts by Relationship")
plt.xlabel("Number of Observed IBD Segments")
plt.ylabel("Likelihood")
plt.legend()
plt.grid(alpha=0.3)
plt.xlim(0, 75)
plt.tight_layout()
plt.show()

### 3.2 Likelihood Function for Segment Lengths

The distribution of segment lengths provides additional information for relationship inference. For a relationship with meiotic distance $m$, the probability density function for a segment of length $x$ (given that it exceeds the minimum threshold $t$) is:

$$f(x|R, x > t) = \frac{m}{100} e^{-\frac{m}{100}(x-t)}$$
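
To see how this density discriminates between meiotic distances, the sketch below scores one set of synthetic lengths against several candidate values of $m$ and reports the best-scoring one. The data are generated with a fixed seed purely for illustration, and the names `length_log_likelihood`, `true_m` and `candidates` are assumptions of this example rather than part of Bonsai v3.

```python
import numpy as np

def length_log_likelihood(lengths, m, t=7.0):
    """Sum of log f(x | m, x > t) over segments, using the shifted exponential density."""
    rate = m / 100
    lengths = np.asarray(lengths, dtype=float)
    return float(np.sum(np.log(rate) - rate * (lengths - t)))

rng = np.random.default_rng(1)
true_m, t = 4, 7.0
# Synthetic segments: threshold plus an exponential excess (illustrative data only)
lengths = t + rng.exponential(scale=100 / true_m, size=2000)

candidates = range(1, 11)
scores = {m: length_log_likelihood(lengths, m, t) for m in candidates}
best_m = max(scores, key=scores.get)
print(f"best-scoring meiotic distance: {best_m}")
```

With this many segments the length likelihood alone separates neighboring values of $m$; with the few dozen segments typical of a real pair, the separation is much weaker.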

Let's implement this likelihood function:

In [None]:
def segment_length_likelihood(observed_lengths, relationship_tuple, min_seg_len=7):
    """Calculate the likelihood of observing specific segment lengths given a relationship.
    
    Args:
        observed_lengths: List of observed segment lengths (cM)
        relationship_tuple: (up, down, num_ancs) tuple
        min_seg_len: Minimum detectable segment length (cM)
        
    Returns:
        Log-likelihood of the observed lengths
    """
    up, down, num_ancs = relationship_tuple
    
    # Calculate meiotic distance
    m = up + down
    
    # Special case for self-comparison
    if m == 0 and num_ancs == 2:  # Self
        return 0.0 if not observed_lengths else float('-inf')
    
    # Calculate rate parameter (lambda)
    rate = m / 100
    
    # Calculate log-likelihood for each segment
    log_likelihoods = []
    for length in observed_lengths:
        # Segments below the detection threshold are skipped: they would not
        # have been reported, so they contribute nothing to the log-likelihood
        if length < min_seg_len:
            continue
            
        # Log-likelihood for this segment using the truncated exponential distribution
        log_lik = np.log(rate) - rate * (length - min_seg_len)
        log_likelihoods.append(log_lik)
    
    # Sum log-likelihoods to get total log-likelihood
    if log_likelihoods:
        return sum(log_likelihoods)
    else:
        return 0.0  # No segments

# Generate synthetic segment lengths for different relationships
def generate_segments(relationship_tuple, num_segments=20, min_seg_len=7, seed=None):
    """Generate synthetic IBD segments for a relationship.
    
    Args:
        relationship_tuple: (up, down, num_ancs) tuple
        num_segments: Number of segments to generate
        min_seg_len: Minimum segment length (cM)
        seed: Random seed
        
    Returns:
        List of segment lengths (cM)
    """
    up, down, num_ancs = relationship_tuple
    
    # Calculate meiotic distance
    m = up + down
    
    # Set random seed if provided
    if seed is not None:
        np.random.seed(seed)
    
    # Handle special case for self-comparison
    if m == 0 and num_ancs == 2:  # Self
        return []  # No segments
    
    # Calculate rate parameter
    rate = m / 100
    
    # Generate segment lengths using truncated exponential distribution
    segments = []
    for _ in range(num_segments):
        # Generate from truncated exponential distribution
        # First generate from standard exponential
        u = np.random.exponential() / rate
        # Then add the minimum threshold
        length = u + min_seg_len
        segments.append(length)
    
    return segments

# Generate segments for different relationships
segments_by_relationship = {}
for rel_tuple, rel_name in test_relationships:
    # Get expected segment count for this relationship
    eta = get_eta(rel_tuple)
    # Generate roughly the expected number of segments
    num_segments = max(1, int(eta) + 1)
    segments = generate_segments(rel_tuple, num_segments=num_segments, seed=42+len(segments_by_relationship))
    segments_by_relationship[rel_name] = segments

# Calculate log-likelihoods for each set of segments against each relationship
length_likelihood_data = []

for true_rel_name, segments in segments_by_relationship.items():
    row_data = {'true_relationship': true_rel_name, 'num_segments': len(segments)}
    
    for eval_rel_tuple, eval_rel_name in test_relationships:
        log_ll = segment_length_likelihood(segments, eval_rel_tuple)
        row_data[eval_rel_name] = log_ll
    
    length_likelihood_data.append(row_data)

# Convert to DataFrame and display
length_likelihood_df = pd.DataFrame(length_likelihood_data)
display(length_likelihood_df)

# Visualize the segment length distributions
plt.figure(figsize=(12, 8))

for rel_name, segments in segments_by_relationship.items():
    if segments:  # Skip if no segments
        sns.kdeplot(segments, label=rel_name)

plt.title("Synthetic IBD Segment Length Distributions by Relationship")
plt.xlabel("Segment Length (cM)")
plt.ylabel("Density")
plt.legend()
plt.grid(alpha=0.3)
plt.xlim(0, 200)
plt.tight_layout()
plt.show()

### 3.3 Combining Likelihoods: The `get_log_seg_pdf` Function

In Bonsai v3, the `get_log_seg_pdf` function in the moments module combines the likelihoods from segment count and segment lengths into a single likelihood score. Let's implement a simplified version of this function:

In [None]:
def get_log_seg_pdf(segments, relationship_tuple, min_seg_len=7, genome_length=3400):
    """Calculate the log-likelihood of observing a set of segments given a relationship.
    
    Args:
        segments: List of segment lengths (cM)
        relationship_tuple: (up, down, num_ancs) tuple
        min_seg_len: Minimum detectable segment length (cM)
        genome_length: Total genome length (cM)
        
    Returns:
        Log-likelihood of the observed segments
    """
    # Filter segments below threshold
    segments = [s for s in segments if s >= min_seg_len]
    
    # Get segment count
    observed_count = len(segments)
    
    # Calculate likelihood for segment count
    count_log_likelihood = segment_count_likelihood(observed_count, relationship_tuple, min_seg_len, genome_length)
    
    # Calculate likelihood for segment lengths
    length_log_likelihood = segment_length_likelihood(segments, relationship_tuple, min_seg_len)
    
    # Combine likelihoods
    # In this simplified version, we give equal weight to count and length likelihoods
    combined_log_likelihood = count_log_likelihood + length_log_likelihood
    
    return combined_log_likelihood

# Calculate combined log-likelihoods for each set of segments against each relationship
combined_likelihood_data = []

for true_rel_name, segments in segments_by_relationship.items():
    row_data = {'true_relationship': true_rel_name, 'num_segments': len(segments)}
    
    for eval_rel_tuple, eval_rel_name in test_relationships:
        log_ll = get_log_seg_pdf(segments, eval_rel_tuple)
        row_data[eval_rel_name] = log_ll
    
    combined_likelihood_data.append(row_data)

# Convert to DataFrame and display
combined_likelihood_df = pd.DataFrame(combined_likelihood_data)
display(combined_likelihood_df)

# For each set of segments, highlight the most likely relationship
for i, row in combined_likelihood_df.iterrows():
    true_rel = row['true_relationship']
    
    # Extract log-likelihoods for each relationship
    rel_log_lls = [(rel, row[rel]) for rel in [r[1] for r in test_relationships]]
    
    # Sort by log-likelihood (highest first)
    rel_log_lls.sort(key=lambda x: x[1], reverse=True)
    
    top_rel, top_log_ll = rel_log_lls[0]
    
    print(f"True relationship: {true_rel}")
    print(f"Most likely relationship: {top_rel} (log-likelihood = {top_log_ll:.2f})")
    
    # Calculate evidence ratio (Bayes factor) between top and second relationships
    if len(rel_log_lls) > 1:
        second_rel, second_log_ll = rel_log_lls[1]
        evidence_ratio = np.exp(top_log_ll - second_log_ll)
        print(f"Evidence ratio vs. second-best ({second_rel}): {evidence_ratio:.2f}x")
        
    print()

## Part 4: Handling Uncertainty and Stochasticity

One of the key challenges in genetic relationship inference is handling the inherent stochasticity of genetic inheritance. Let's explore how stochasticity affects relationship inference and how Bonsai v3 addresses this challenge.

### 4.1 Simulating IBD Segment Patterns

Let's generate multiple simulations of IBD segment patterns for different relationships to understand the variability in IBD sharing:

In [None]:
def simulate_ibd_data(relationship_tuple, num_simulations=100, min_seg_len=7, genome_length=3400):
    """Simulate IBD segment data for a relationship.
    
    Args:
        relationship_tuple: (up, down, num_ancs) tuple
        num_simulations: Number of simulations to generate
        min_seg_len: Minimum detectable segment length (cM)
        genome_length: Total genome length (cM)
        
    Returns:
        List of simulated segment sets
    """
    # Calculate expected segment count (eta)
    eta = get_eta(relationship_tuple, min_seg_len, genome_length)
    
    simulations = []
    for i in range(num_simulations):
        # Generate random segment count from Poisson distribution
        segment_count = np.random.poisson(eta)
        
        # Generate segment lengths
        segments = generate_segments(relationship_tuple, num_segments=segment_count, min_seg_len=min_seg_len, seed=i)
        
        simulations.append(segments)
    
    return simulations

# Simulate IBD data for different relationships
simulation_results = {}
for rel_tuple, rel_name in test_relationships:
    simulations = simulate_ibd_data(rel_tuple, num_simulations=100)
    simulation_results[rel_name] = simulations

# Calculate statistics for each set of simulations
simulation_stats = []

for rel_name, simulations in simulation_results.items():
    # Calculate statistics across simulations
    segment_counts = [len(sim) for sim in simulations]
    total_lengths = [sum(sim) if sim else 0 for sim in simulations]
    
    simulation_stats.append({
        'relationship': rel_name,
        'mean_segment_count': np.mean(segment_counts),
        'std_segment_count': np.std(segment_counts),
        'mean_total_length': np.mean(total_lengths),
        'std_total_length': np.std(total_lengths),
        'cv_total_length': np.std(total_lengths) / np.mean(total_lengths) if np.mean(total_lengths) > 0 else 0
    })

# Convert to DataFrame and display
simulation_stats_df = pd.DataFrame(simulation_stats)
display(simulation_stats_df)

# Visualize the distribution of total IBD sharing
plt.figure(figsize=(12, 8))

for rel_name, simulations in simulation_results.items():
    # Calculate total length for each simulation
    total_lengths = [sum(sim) if sim else 0 for sim in simulations]
    
    # Plot distribution
    sns.kdeplot(total_lengths, label=rel_name)

plt.title("Distribution of Total IBD Sharing by Relationship")
plt.xlabel("Total IBD (cM)")
plt.ylabel("Density")
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

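The spread in the simulated totals can be cross-checked against compound-Poisson moments. If the segment count is Poisson with mean η and the segment lengths X are independent and identically distributed, then the total T satisfies E[T] = η·E[X] and Var[T] = η·E[X²]. The sketch below assumes, purely for illustration, that each length is the minimum detectable length plus an exponential excess. This may differ from what `generate_segments` does, and the parameter values are made up.

```python
import numpy as np

def compound_poisson_moments(eta, mean_excess, min_seg_len):
    """Analytic mean and variance of total IBD under the assumed model.

    Assumes X = min_seg_len + Exponential(mean=mean_excess) and N ~ Poisson(eta).
    """
    mean_len = min_seg_len + mean_excess            # E[X]
    second_moment = mean_excess**2 + mean_len**2    # E[X^2] = Var[X] + E[X]^2
    return eta * mean_len, eta * second_moment

def simulate_totals(eta, mean_excess, min_seg_len, num_simulations=20000, seed=0):
    """Simulate total IBD by drawing a Poisson count, then that many lengths."""
    rng = np.random.default_rng(seed)
    counts = rng.poisson(eta, size=num_simulations)
    return np.array([
        (min_seg_len + rng.exponential(mean_excess, size=n)).sum()
        for n in counts
    ])

# Illustrative parameter values (not calibrated to any relationship)
eta, mean_excess, min_seg_len = 12.0, 15.0, 7.0
mean_total, var_total = compound_poisson_moments(eta, mean_excess, min_seg_len)
totals = simulate_totals(eta, mean_excess, min_seg_len)

print(f"Analytic:  mean = {mean_total:.1f} cM, sd = {np.sqrt(var_total):.1f} cM, "
      f"CV = {np.sqrt(var_total) / mean_total:.3f}")
print(f"Simulated: mean = {totals.mean():.1f} cM, sd = {totals.std():.1f} cM, "
      f"CV = {totals.std() / totals.mean():.3f}")
```

Under this model the coefficient of variation is sqrt(E[X²]) / (sqrt(η)·E[X]), so it shrinks as 1/sqrt(η). This is one way to see why relationships with few expected segments show relatively more variable total sharing.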
### 4.2 Impact of Stochasticity on Relationship Inference

Let's assess how often the correct relationship is inferred given the stochasticity of IBD sharing:

In [None]:
from collections import defaultdict

def infer_relationship(segments, relationship_options):
    """Infer the most likely relationship from a set of segments.
    
    Args:
        segments: List of segment lengths (cM)
        relationship_options: List of (tuple, name) pairs for possible relationships
        
    Returns:
        Tuple of (most likely relationship tuple, relationship name, log-likelihood)
    """
    # Calculate log-likelihood for each relationship
    log_likelihoods = []
    for rel_tuple, rel_name in relationship_options:
        log_ll = get_log_seg_pdf(segments, rel_tuple)
        log_likelihoods.append((rel_tuple, rel_name, log_ll))
    
    # Sort by log-likelihood (highest first)
    log_likelihoods.sort(key=lambda x: x[2], reverse=True)
    
    # Return the most likely relationship
    return log_likelihoods[0]

# Assess relationship inference accuracy
accuracy_results = []

for true_rel_tuple, true_rel_name in test_relationships:
    # Get simulations for this relationship
    simulations = simulation_results[true_rel_name]
    
    # Count correct inferences
    correct_count = 0
    confusion_counts = defaultdict(int)
    
    for segments in simulations:
        # Infer relationship
        inferred_tuple, inferred_name, log_ll = infer_relationship(segments, test_relationships)
        
        # Check if correct
        if inferred_tuple == true_rel_tuple:
            correct_count += 1
        else:
            confusion_counts[inferred_name] += 1
    
    # Calculate accuracy
    accuracy = correct_count / len(simulations)
    
    # Find the most common confusion
    most_confused = max(confusion_counts.items(), key=lambda x: x[1]) if confusion_counts else ("None", 0)
    
    accuracy_results.append({
        'true_relationship': true_rel_name,
        'accuracy': accuracy,
        'most_confused_with': most_confused[0],
        'confusion_rate': most_confused[1] / len(simulations)
    })

# Convert to DataFrame and display
accuracy_df = pd.DataFrame(accuracy_results)
display(accuracy_df)

# Visualize accuracy
plt.figure(figsize=(12, 6))
bars = plt.bar(accuracy_df['true_relationship'], accuracy_df['accuracy'] * 100)
plt.xlabel('True Relationship')
plt.ylabel('Inference Accuracy (%)')
plt.title('Relationship Inference Accuracy')
plt.grid(axis='y', alpha=0.3)

# Add accuracy labels
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 1,
             f'{height:.1f}%', ha='center', va='bottom')

# Add confusion information
for i, row in accuracy_df.iterrows():
    if row['confusion_rate'] > 0:
        plt.text(i, row['accuracy'] * 100 / 2,
                 f"Confused with\n{row['most_confused_with']}\n({row['confusion_rate']*100:.1f}%)",
                 ha='center', va='center', color='white', fontweight='bold')

plt.ylim(0, 105)
plt.tight_layout()
plt.show()

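The table above keeps only the single most common confusion for each true relationship. A full confusion matrix shows every misclassification at once. The sketch below uses `pd.crosstab` on two parallel lists of labels. The helper name `confusion_matrix` and the example labels are hypothetical and stand in for labels collected during the inference loop.

```python
import pandas as pd

def confusion_matrix(true_labels, inferred_labels, normalize=True):
    """Cross-tabulate true vs. inferred relationship labels.

    Hypothetical helper for illustration. With normalize=True each row sums
    to 1, so the diagonal holds the per-relationship accuracy.
    """
    return pd.crosstab(
        pd.Series(true_labels, name="true_relationship"),
        pd.Series(inferred_labels, name="inferred_relationship"),
        normalize="index" if normalize else False,
    )

# Made-up labels, for illustration only
true_labels = ["Half Siblings"] * 4 + ["First Cousins"] * 4
inferred_labels = [
    "Half Siblings", "Half Siblings", "Half Siblings", "First Cousins",
    "First Cousins", "First Cousins", "Half Siblings", "First Cousins",
]

cm = confusion_matrix(true_labels, inferred_labels)
print(cm.round(2))
```

In the lab, the two lists could be filled inside the loop over `simulations` by appending `true_rel_name` and `inferred_name` on each iteration.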
### 4.3 Strategies for Handling Uncertainty

Given the stochasticity in IBD sharing, Bonsai v3 uses several strategies to handle uncertainty in relationship inference:

1. **Likelihood Ratios**: Compare the likelihood of the top relationship to alternatives
2. **Confidence Intervals**: Report uncertainty in relationship estimates
3. **Integration with Age Data**: Use demographic information to resolve ambiguity
4. **Multiple Relationship Paths**: Consider all possible ways individuals might be related

Let's implement a simplified version of the likelihood ratio approach:

In [None]:
def calculate_likelihood_ratios(segments, relationship_options):
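    """Calculate each relationship's likelihood ratio relative to the most likely one.
    
    Args:
        segments: List of segment lengths (cM)
        relationship_options: List of (tuple, name) pairs for possible relationships
        
    Returns:
        DataFrame sorted by log-likelihood (highest first); the top relationship
        has a likelihood ratio of 1.0 and all others fall between 0 and 1
    """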
    # Calculate log-likelihood for each relationship
    log_likelihoods = []
    for rel_tuple, rel_name in relationship_options:
        log_ll = get_log_seg_pdf(segments, rel_tuple)
        log_likelihoods.append((rel_tuple, rel_name, log_ll))
    
    # Sort by log-likelihood (highest first)
    log_likelihoods.sort(key=lambda x: x[2], reverse=True)
    
    # Calculate likelihood ratios
    top_log_ll = log_likelihoods[0][2]
    
    likelihood_ratios = []
    for rel_tuple, rel_name, log_ll in log_likelihoods:
        # Calculate likelihood ratio vs. top relationship
        likelihood_ratio = np.exp(log_ll - top_log_ll)
        
        likelihood_ratios.append({
            'relationship': rel_name,
            'log_likelihood': log_ll,
            'likelihood_ratio': likelihood_ratio
        })
    
    return pd.DataFrame(likelihood_ratios)

# Create a test case with ambiguous relationship
# Generate half-sibling segments with an unusually high segment count to create ambiguity
ambiguous_segments = generate_segments((1, 1, 1), num_segments=40, seed=200)

# Calculate likelihood ratios
likelihood_ratios_df = calculate_likelihood_ratios(ambiguous_segments, test_relationships)
display(likelihood_ratios_df)

# Visualize likelihood ratios
plt.figure(figsize=(12, 6))
bars = plt.bar(likelihood_ratios_df['relationship'], likelihood_ratios_df['likelihood_ratio'])
plt.xlabel('Relationship')
plt.ylabel('Likelihood Ratio (relative to most likely)')
plt.title('Likelihood Ratios for Possible Relationships')
plt.grid(axis='y', alpha=0.3)

# Add likelihood ratio labels
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.01,
             f'{height:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

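Likelihood ratios can also be used to report a set of plausible relationships instead of a single best guess. The sketch below keeps every relationship whose likelihood is at least `min_ratio` times that of the best candidate, working in log space to avoid underflow. The helper name `plausible_relationships`, the threshold of 0.1, and the example log-likelihoods are illustrative assumptions, not Bonsai v3 defaults.

```python
import numpy as np

def plausible_relationships(log_lls, min_ratio=0.1):
    """Return relationships whose likelihood ratio vs. the best is >= min_ratio.

    Hypothetical helper for illustration.

    Args:
        log_lls: dict mapping relationship name -> log-likelihood
        min_ratio: smallest likelihood ratio (relative to the best) to keep

    Returns:
        List of relationship names, most likely first
    """
    best = max(log_lls.values())
    log_threshold = np.log(min_ratio)
    # Compare log-likelihood differences instead of exponentiating
    kept = {
        name: ll - best
        for name, ll in log_lls.items()
        if ll - best >= log_threshold
    }
    return sorted(kept, key=kept.get, reverse=True)

# Made-up log-likelihoods, for illustration only
example_log_lls = {
    "Full Siblings": -210.4,
    "Half Siblings": -201.2,
    "Grandparent": -202.0,
    "First Cousins": -215.9,
}
print(plausible_relationships(example_log_lls))
```

A stricter threshold (closer to 1) returns fewer candidates, and a looser one returns more. The appropriate value depends on how costly a wrong call is in the downstream pedigree reconstruction.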
## Summary

In this lab, we've explored the probabilistic relationship inference methods used in Bonsai v3. Key takeaways include:

1. **Relationship Representation**: Bonsai v3 uses a tuple representation for relationships, which captures the genealogical structure, meiotic distance, and number of common ancestors.

2. **Statistical Moments**: Three key parameters characterize IBD distributions for different relationships: lambda (segment length distribution), eta (expected segment count), and total IBD sharing.

3. **Likelihood Functions**: Bonsai v3 builds likelihood functions to calculate the probability of observed IBD patterns given different relationships, combining evidence from segment counts and lengths.

4. **Handling Stochasticity**: Due to the random nature of genetic inheritance, relationship inference includes quantifying uncertainty through likelihood ratios and confidence intervals.

5. **Bayesian Framework**: The relationship inference process follows a Bayesian framework, combining likelihoods with priors to calculate posterior probabilities for different relationships.

These probabilistic methods allow Bonsai v3 to make robust inferences about relationships from IBD data, providing the foundation for pedigree reconstruction even with incomplete and noisy genetic data.

In [None]:
# Convert this notebook to PDF using poetry
# (the filename below should match this notebook's actual filename; adjust it if needed)
!poetry run jupyter nbconvert --to pdf Lab04_IBD_Statistics_Extraction.ipynb

# Note: PDF conversion requires LaTeX to be installed on your system
# If you encounter errors, you may need to install it:
# On Ubuntu/Debian: sudo apt-get install texlive-xetex
# On macOS with Homebrew: brew install texlive