# Lab 8: Age-Based Relationship Modeling

## Overview

This lab explores how Bonsai v3 incorporates age information to enhance relationship inference. While genetic data provides strong evidence for biological relationships, age information can help resolve ambiguities and improve accuracy.

Key topics include:

1. Age difference distributions for different relationship types
2. Age constraints and biological impossibilities
3. Computing age-based relationship likelihoods
4. Combining age and genetic evidence
5. Handling missing or uncertain age information

By the end of this lab, you'll understand how age modeling complements genetic data to create more accurate pedigree reconstructions.

In [None]:
# Standard imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown
import inspect
import importlib
from scipy import stats
import math
import random
from collections import defaultdict

sys.path.append(os.path.dirname(os.getcwd()))

# Cross-compatibility setup
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite, save_results, save_plot

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [None]:
# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
def view_source(obj):
    """Display the source code of an object."""
    try:
        source = inspect.getsource(obj)
        display(Markdown(f"```python\n{source}\n```"))
    except Exception as e:
        print(f"Error retrieving source: {e}")

## Import Bonsai Modules

Let's start by importing the necessary Bonsai v3 modules, particularly the ones that handle age-based relationship modeling:

In [None]:
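# Defaults so that later cells can test these names even if the import fails
likelihoods = None
age_methods = []

try:
    from utils.bonsaitree.bonsaitree.v3 import likelihoods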
    print("✅ Successfully imported Bonsai v3 likelihoods module")
    
    # Check if PwLogLike class is available (contains age-based modeling functions)
    if hasattr(likelihoods, 'PwLogLike'):
        print("✅ PwLogLike class is available")
        # Look at age-related methods
        age_methods = [name for name, method in inspect.getmembers(likelihoods.PwLogLike, predicate=inspect.isfunction) 
                      if 'age' in name.lower() and not name.startswith('_')]
        print(f"\nAge-related methods in PwLogLike class:")
        for method in age_methods:
            print(f"- {method}")
    else:
        print("❌ PwLogLike class not found in likelihoods module")
except ImportError as e:
    print(f"❌ Failed to import Bonsai modules: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")

## Part 1: Age Difference Distributions

Different relationship types have characteristic age differences. For example, parents are typically 20-35 years older than their children, while siblings are usually close in age. Bonsai v3 uses these patterns to improve relationship inference.

### 1.1 Age Difference Models

Bonsai v3 models age differences using normal (Gaussian) distributions with specific mean and standard deviation parameters for each relationship type. Let's examine how these distributions are defined and used.
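
As a minimal sketch of the idea (not Bonsai's own code), a Gaussian age-difference model scores an observed difference by its log-density under each relationship's distribution. The table `ILLUSTRATIVE_AGE_MODEL` and the helper `gaussian_log_density` are hypothetical names, and the means and standard deviations are placeholder values assumed for illustration.

```python
import math

# Illustrative (mean, std_dev) pairs in years.
# These are assumptions for this sketch, not Bonsai's fitted values.
ILLUSTRATIVE_AGE_MODEL = {
    "parent": (30.0, 10.0),
    "sibling": (0.0, 10.0),
    "grandparent": (60.0, 15.0),
}

def gaussian_log_density(x, mean, std_dev):
    """Log of the normal probability density at x."""
    z = (x - mean) / std_dev
    return -0.5 * z * z - math.log(std_dev) - 0.5 * math.log(2.0 * math.pi)

observed_diff = 28.0  # id1 is 28 years older than id2

# Score the same observation under every candidate relationship
scores = {
    name: gaussian_log_density(observed_diff, mean, std_dev)
    for name, (mean, std_dev) in ILLUSTRATIVE_AGE_MODEL.items()
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

A 28-year gap sits close to the assumed parent mean and far from the sibling and grandparent means, so the parent hypothesis gets the highest score.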

First, let's look at the method in `PwLogLike` that calculates age parameters for different relationships:

In [None]:
# Examine the age-related methods in PwLogLike
if not is_jupyterlite() and hasattr(likelihoods, 'PwLogLike'):
    for method_name in age_methods:
        method = getattr(likelihoods.PwLogLike, method_name)
        print(f"\n## {method_name}")
        view_source(method)

Let's implement a function to get the expected age difference parameters for common relationship types, based on Bonsai's approach:

In [None]:
def get_age_parameters(relationship_tuple):
    """Get expected age difference parameters for a relationship.
    
    Args:
        relationship_tuple: (up, down, num_ancs) tuple representing the relationship
        
    Returns:
        Dictionary with mean, standard deviation, and direction parameters
    """
    up, down, num_ancs = relationship_tuple
    
    # Default values
    result = {
        'mean': 0,
        'std_dev': 10,
        'direction': 0  # 0: no constraint, 1: id1 older, -1: id2 older
    }
    
    # Handle special cases first
    if up == 0 and down == 0 and num_ancs == 2:  # Self
        result['mean'] = 0
        result['std_dev'] = 0.1  # Almost no variation
        return result
    
    # Parent-child relationships
    if up == 0 and down == 1:  # id1 is parent of id2
        result['mean'] = 30
        result['std_dev'] = 10
        result['direction'] = 1  # id1 should be older
        return result
    
    if up == 1 and down == 0:  # id1 is child of id2
        result['mean'] = -30
        result['std_dev'] = 10
        result['direction'] = -1  # id1 should be younger
        return result
    
    # Sibling relationships
    if up == 1 and down == 1:  # siblings (full or half)
        result['mean'] = 0
        result['std_dev'] = 10
        return result
    
    # Grandparent-grandchild relationships
    if up == 0 and down == 2:  # id1 is grandparent of id2
        result['mean'] = 60
        result['std_dev'] = 15
        result['direction'] = 1  # id1 should be older
        return result
    
    if up == 2 and down == 0:  # id1 is grandchild of id2
        result['mean'] = -60
        result['std_dev'] = 15
        result['direction'] = -1  # id1 should be younger
        return result
    
    # Avuncular relationships (aunt/uncle - niece/nephew)
    if up == 1 and down == 2:  # id1 is aunt/uncle of id2
        result['mean'] = 20
        result['std_dev'] = 15
        result['direction'] = 1  # id1 should be older
        return result
    
    if up == 2 and down == 1:  # id1 is niece/nephew of id2
        result['mean'] = -20
        result['std_dev'] = 15
        result['direction'] = -1  # id1 should be younger
        return result
    
    # Cousin relationships
    if up >= 2 and down >= 2:  # cousins of some degree
        result['mean'] = 0
        result['std_dev'] = 20  # Wider variance for cousins
        return result
    
    # For other relationships, generalize from the generation gap.
    # `up` counts meioses from id1 up to the common ancestor and `down` counts
    # meioses from that ancestor down to id2, so down > up places id1 in the
    # older generation (consistent with the parent case (0, 1) above).
    total_meiotic = up + down
    if total_meiotic > 0:
        if down > up:  # id1 is in an older generation
            result['mean'] = 30 * (down - up)
            result['direction'] = 1
        elif up > down:  # id1 is in a younger generation
            result['mean'] = 30 * (down - up)  # Will be negative
            result['direction'] = -1
        else:  # Same generation
            result['mean'] = 0
        
        result['std_dev'] = 10 + 5 * (total_meiotic - 1)  # Widen std dev for more distant relationships
    
    return result

# Define common relationships to test
relationships = [
    ((0, 0, 2), "Self"),
    ((0, 1, 1), "Parent"),
    ((1, 0, 1), "Child"),
    ((1, 1, 2), "Full Sibling"),
    ((1, 1, 1), "Half Sibling"),
    ((0, 2, 1), "Grandparent"),
    ((2, 0, 1), "Grandchild"),
    ((1, 2, 1), "Aunt/Uncle"),
    ((2, 1, 1), "Niece/Nephew"),
    ((2, 2, 2), "Full First Cousin"),
    ((2, 2, 1), "Half First Cousin"),
    ((3, 3, 2), "Full Second Cousin"),
    ((1, 3, 1), "Half Great-Aunt/Uncle"),
    ((3, 1, 1), "Half Great-Niece/Nephew")
]

# Get age parameters for all relationships
age_params = []
for rel_tuple, rel_name in relationships:
    params = get_age_parameters(rel_tuple)
    age_params.append({
        'relationship': rel_name,
        'relationship_tuple': rel_tuple,
        'mean_age_diff': params['mean'],
        'std_dev': params['std_dev'],
        'direction': params['direction']
    })

# Convert to DataFrame and display
age_params_df = pd.DataFrame(age_params)
display(age_params_df)

### 1.2 Visualizing Age Difference Distributions

Let's visualize the age difference distributions for different relationship types:

In [None]:
# Function to get distribution values
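def get_age_diff_distribution(relationship_tuple, x_range):
    """Get density values for age differences.
    
    Note: the disallowed side is zeroed without renormalizing, so curves for
    directional relationships integrate to somewhat less than 1.
    """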
    params = get_age_parameters(relationship_tuple)
    mean = params['mean']
    std_dev = params['std_dev']
    
    # Calculate normal PDF values
    pdf_values = stats.norm.pdf(x_range, mean, std_dev)
    
    # Apply directional constraint if needed
    if params['direction'] == 1:  # id1 should be older
        pdf_values[x_range < 0] = 0
    elif params['direction'] == -1:  # id1 should be younger
        pdf_values[x_range > 0] = 0
    
    return pdf_values

# Create visualization
def plot_age_distributions(relationship_tuples):
    """Plot age difference distributions for multiple relationships."""
    plt.figure(figsize=(14, 8))
    
    # Define x range for age differences
    x_range = np.linspace(-80, 80, 500)
    
    # Plot each relationship
    for rel_tuple, rel_name in relationship_tuples:
        # Skip self relationship (for better scaling)
        if rel_tuple == (0, 0, 2):
            continue
            
        pdf_values = get_age_diff_distribution(rel_tuple, x_range)
        plt.plot(x_range, pdf_values, label=rel_name, linewidth=2)
    
    plt.title('Age Difference Distributions by Relationship Type')
    plt.xlabel('Age Difference (years) - Positive means ID1 is older')
    plt.ylabel('Probability Density')
    plt.legend()
    plt.grid(alpha=0.3)
    plt.axvline(x=0, color='black', linestyle='--', alpha=0.5)
    plt.xlim(-80, 80)
    plt.tight_layout()
    plt.show()

# Visualize direct lineage relationships
direct_lineage = [
    ((0, 1, 1), "Parent"),
    ((1, 0, 1), "Child"),
    ((0, 2, 1), "Grandparent"),
    ((2, 0, 1), "Grandchild")
]
print("Direct Lineage Relationships:")
plot_age_distributions(direct_lineage)

# Visualize collateral relationships
collateral = [
    ((1, 1, 2), "Full Sibling"),
    ((1, 1, 1), "Half Sibling"),
    ((1, 2, 1), "Aunt/Uncle"),
    ((2, 1, 1), "Niece/Nephew"),
    ((2, 2, 2), "Full First Cousin"),
    ((3, 3, 2), "Full Second Cousin")
]
print("\nCollateral Relationships:")
plot_age_distributions(collateral)
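
Zeroing one side of a normal curve leaves a function that no longer integrates to 1. If a properly normalized density is wanted, one option is `scipy.stats.truncnorm`. The sketch below uses assumed aunt/uncle parameters (mean 20, standard deviation 15) purely for illustration.

```python
import numpy as np
from scipy import stats

mean, std_dev = 20.0, 15.0   # assumed illustrative parameters
lower = 0.0                  # id1 must be older, so differences below 0 are excluded

# truncnorm takes its bounds in standard-deviation units relative to loc/scale
a = (lower - mean) / std_dev
truncated = stats.truncnorm(a, np.inf, loc=mean, scale=std_dev)

# Probability mass the plain normal keeps on the allowed side of zero
mass_kept = stats.norm.sf(lower, loc=mean, scale=std_dev)

# On the allowed side, the truncated density is the plain density rescaled by that mass
x = 25.0
rescaled = stats.norm.pdf(x, mean, std_dev) / mass_kept
print(truncated.pdf(x), rescaled, truncated.pdf(-5.0))
```

The narrower the gap between the mean and zero relative to the standard deviation, the more mass is cut off and the more the rescaling matters.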

### 1.3 Age Difference Simulation

Let's simulate age differences for different relationships to see the variation in practice:

In [None]:
def simulate_age_differences(relationship_tuple, num_samples=1000):
    """Simulate age differences for a given relationship."""
    params = get_age_parameters(relationship_tuple)
    mean = params['mean']
    std_dev = params['std_dev']
    direction = params['direction']
    
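    # Non-directional relationships: sample from the plain normal distribution
    if direction == 0:
        return np.random.normal(mean, std_dev, num_samples)
    
    # Directional relationships: sample from the normal truncated at zero.
    # (np.abs would fold the wrong-sign tail onto the allowed side instead of
    # excluding it, which distorts the distribution near zero.)
    # truncnorm expects its bounds in standard-deviation units.
    zero_bound = (0 - mean) / std_dev
    lower, upper = (zero_bound, np.inf) if direction == 1 else (-np.inf, zero_bound)
    age_diffs = stats.truncnorm.rvs(lower, upper, loc=mean, scale=std_dev, size=num_samples)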
    
    return age_diffs

# Simulate and visualize age differences for select relationships
relationships_to_simulate = [
    ((0, 1, 1), "Parent"),
    ((1, 1, 2), "Full Sibling"),
    ((0, 2, 1), "Grandparent"),
    ((1, 2, 1), "Aunt/Uncle"),
    ((2, 2, 2), "Full First Cousin")
]

plt.figure(figsize=(14, 10))

for i, (rel_tuple, rel_name) in enumerate(relationships_to_simulate, 1):
    # Simulate age differences
    age_diffs = simulate_age_differences(rel_tuple, num_samples=5000)
    
    # Plot distribution
    plt.subplot(len(relationships_to_simulate), 1, i)
    sns.histplot(age_diffs, bins=50, kde=True)
    
    # Add mean and standard deviation lines
    params = get_age_parameters(rel_tuple)
    plt.axvline(params['mean'], color='red', linestyle='--', label=f"Mean: {params['mean']} years")
    plt.axvline(params['mean'] + params['std_dev'], color='green', linestyle=':', 
                label=f"±1 SD: {params['mean']} ± {params['std_dev']} years")
    plt.axvline(params['mean'] - params['std_dev'], color='green', linestyle=':')
    
    plt.title(f"{rel_name} Age Difference Distribution")
    plt.xlabel('Age Difference (years)')
    plt.ylabel('Frequency')
    plt.legend()
    plt.grid(alpha=0.3)
    
plt.tight_layout()
plt.show()

## Part 2: Age-Based Likelihood Calculation

Now that we understand the age difference distributions, let's see how Bonsai v3 uses them to calculate relationship likelihoods based on age information.

### 2.1 Computing Age-Based Likelihoods

Bonsai v3 calculates age-based likelihoods by comparing observed age differences to the expected distributions for different relationships. This is implemented in the `get_pw_age_ll` method of the `PwLogLike` class.
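
Under the Gaussian model used in this lab, the score for relationship $r$ with mean $\mu_r$ and standard deviation $\sigma_r$, given the observed difference $d = \text{age}_1 - \text{age}_2$, is the normal log-density. Whether `get_pw_age_ll` uses exactly this form is not confirmed here; the formula describes the simplified version in this notebook.

$$
\log L_r(d) = -\frac{(d - \mu_r)^2}{2\sigma_r^2} - \log \sigma_r - \tfrac{1}{2}\log(2\pi)
$$

For example, with $d = 30$ and the parent parameters $\mu_r = 30$, $\sigma_r = 10$, the first term vanishes and $\log L_r \approx -2.303 - 0.919 = -3.222$. Under the sibling parameters $\mu_r = 0$, $\sigma_r = 10$, the first term is $-900/200 = -4.5$, giving $\log L_r \approx -7.722$. The two hypotheses differ by 4.5 log units on age alone.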

Let's implement a simplified version of this method:

In [None]:
def calculate_age_likelihood(age1, age2, relationship_tuple):
    """Calculate the likelihood of a relationship based on age difference.
    
    Args:
        age1: Age of first individual
        age2: Age of second individual
        relationship_tuple: (up, down, num_ancs) tuple representing the relationship
        
    Returns:
        Log-likelihood score
    """
    # If either age is missing, return a neutral log-likelihood of 0 (no age evidence)
    if age1 is None or age2 is None:
        return 0.0
    
    # Calculate age difference (id1 - id2)
    age_diff = age1 - age2
    
    # Get expected age difference parameters
    params = get_age_parameters(relationship_tuple)
    mean = params['mean']
    std_dev = params['std_dev']
    direction = params['direction']
    
    # Check for biological impossibility
    if (direction == 1 and age_diff < 0) or (direction == -1 and age_diff > 0):
        return float('-inf')  # Biologically impossible
    
    # Calculate log-likelihood using normal distribution
    log_likelihood = stats.norm.logpdf(age_diff, mean, std_dev)
    
    return log_likelihood

# Example individuals with ages
individuals = [
    {'id': 1001, 'age': 70, 'sex': 'M'},
    {'id': 1002, 'age': 40, 'sex': 'F'},
    {'id': 1003, 'age': 45, 'sex': 'M'},
    {'id': 1004, 'age': 15, 'sex': 'F'},
    {'id': 1005, 'age': 10, 'sex': 'M'}
]

# Create a dictionary for easy lookup
id_to_info = {person['id']: person for person in individuals}

# Define pairs to test
test_pairs = [
    (1001, 1002),  # 70 vs 40 (30 year difference)
    (1002, 1003),  # 40 vs 45 (-5 year difference)
    (1002, 1004),  # 40 vs 15 (25 year difference)
    (1001, 1004),  # 70 vs 15 (55 year difference)
    (1004, 1005)   # 15 vs 10 (5 year difference)
]

# Relationships to test
test_relationships = [
    ((0, 1, 1), "Parent"),
    ((1, 0, 1), "Child"),
    ((1, 1, 2), "Full Sibling"),
    ((0, 2, 1), "Grandparent"),
    ((1, 2, 1), "Aunt/Uncle"),
    ((2, 2, 2), "Full First Cousin")
]

# Calculate age likelihoods for each pair
print("Age-Based Relationship Likelihoods:")
for id1, id2 in test_pairs:
    age1 = id_to_info[id1]['age']
    age2 = id_to_info[id2]['age']
    age_diff = age1 - age2
    
    print(f"\nPair {id1}-{id2}: Ages {age1} and {age2} (difference: {age_diff} years)")
    
    # Calculate likelihoods for each test relationship
    # (named `ranked` so it does not shadow the imported `likelihoods` module)
    ranked = []
    for rel_tuple, rel_name in test_relationships:
        log_ll = calculate_age_likelihood(age1, age2, rel_tuple)
        ranked.append((rel_tuple, rel_name, log_ll))
    
    # Sort by likelihood (highest first)
    ranked.sort(key=lambda x: x[2], reverse=True)
    
    # Display results
    print("  Most likely relationships based on age:")
    for rel_tuple, rel_name, log_ll in ranked[:3]:  # Top 3
        # Convert to regular likelihood for better readability
        likelihood = np.exp(log_ll) if log_ll > float('-inf') else 0
        print(f"    {rel_name}: {likelihood:.6f} (log-likelihood: {log_ll:.2f})")

### 2.2 Visualizing Age-Based Likelihoods

Let's visualize how age-based likelihoods change with different age differences for various relationships:

In [None]:
# Function to calculate likelihood curves
def calc_likelihood_curve(relationship_tuple, age_diffs):
    """Calculate likelihood values for a range of age differences."""
    log_lls = []
    for diff in age_diffs:
        # Arbitrary base age of 50
        log_ll = calculate_age_likelihood(50, 50 - diff, relationship_tuple)
        log_lls.append(log_ll)
    
    # Convert to regular likelihoods (capping at a minimum value for plotting)
    min_log_ll = -20  # Set a minimum log-likelihood for better visualization
    log_lls = [max(ll, min_log_ll) for ll in log_lls]
    lls = np.exp(log_lls)
    
    # Normalize for better comparison
    if max(lls) > 0:
        lls = lls / max(lls)
    
    return lls

# Age differences to test
age_diffs = np.linspace(-80, 80, 321)

# Calculate likelihood curves for different relationships
likelihood_curves = {}
for rel_tuple, rel_name in test_relationships:
    likelihood_curves[rel_name] = calc_likelihood_curve(rel_tuple, age_diffs)

# Plot likelihood curves
plt.figure(figsize=(14, 8))

for rel_name, lls in likelihood_curves.items():
    plt.plot(age_diffs, lls, label=rel_name, linewidth=2)

plt.title('Normalized Age-Based Relationship Likelihoods')
plt.xlabel('Age Difference (years) - Positive means ID1 is older')
plt.ylabel('Normalized Likelihood')
plt.axvline(x=0, color='black', linestyle='--', alpha=0.5)
plt.grid(alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

## Part 3: Handling Missing and Uncertain Age Information

In real-world applications, age information is often missing, uncertain, or approximate. Let's explore how Bonsai v3 handles these scenarios.

### 3.1 Strategies for Missing Age Data

When age information is missing for one or both individuals, Bonsai v3 uses several strategies:

1. **Neutral Contribution**: If age is missing, the age-based component doesn't contribute to the final likelihood (neither positively nor negatively).

2. **Age Inference**: Sometimes, ages can be inferred from other relationships in the pedigree (e.g., if A is the parent of B and B is 30, then A is probably about 60).

3. **Partial Evidence**: If only one individual's age is known, this can sometimes still provide weak evidence for or against certain relationships.
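
The age-inference strategy can be sketched as follows. The helper `infer_missing_age` is hypothetical, and the 30-year generation gap and 10-year spread are assumed values. Treating successive generation gaps as independent is a further simplifying assumption, which makes the uncertainty grow with the square root of the number of generations.

```python
def infer_missing_age(known_age, generations_up, generation_gap=30.0, gap_std=10.0):
    """Estimate a relative's age from a known age and a generation offset.

    generations_up > 0: the unknown person is that many generations above
    the known person; generations_up < 0: that many generations below.
    Returns (estimated_age, std_dev).
    """
    estimate = known_age + generations_up * generation_gap
    # Independent gaps: variances add, so the std dev scales with sqrt(n)
    std_dev = gap_std * abs(generations_up) ** 0.5
    return estimate, std_dev

# B is 30, so B's parent A is estimated at about 60
parent_age, parent_std = infer_missing_age(30, generations_up=1)

# A 70-year-old's grandchild is estimated at about 10, with wider uncertainty
grandchild_age, grandchild_std = infer_missing_age(70, generations_up=-2)

print(parent_age, parent_std)
print(grandchild_age, round(grandchild_std, 2))
```

An estimate like this could then be passed on as an uncertain age rather than treated as an exact value.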

Let's implement an extended version of our age likelihood function that handles missing data:

In [None]:
def calculate_age_likelihood_robust(age1, age2, relationship_tuple):
    """Calculate relationship likelihood based on age, handling missing data robustly.
    
    Args:
        age1: Age of first individual (can be None)
        age2: Age of second individual (can be None)
        relationship_tuple: (up, down, num_ancs) tuple representing the relationship
        
    Returns:
        Log-likelihood score and confidence flag
    """
    # Get relationship parameters
    params = get_age_parameters(relationship_tuple)
    
    # Case 1: Both ages missing
    if age1 is None and age2 is None:
        return 0.0, "No age data"
    
    # Case 2: One age missing
    if age1 is None or age2 is None:
        # With a directional constraint, this lab applies a small fixed penalty.
        # The value -5.0 is an arbitrary placeholder chosen for illustration;
        # it is not derived from the known age.
        if params['direction'] != 0:
            return -5.0, "Partial age data"
        else:
            # No directional constraint, so a single age tells us nothing
            return 0.0, "No age data"
    
    # Case 3: Both ages present - calculate full likelihood
    age_diff = age1 - age2
    
    # Check for biological impossibility
    if (params['direction'] == 1 and age_diff < 0) or (params['direction'] == -1 and age_diff > 0):
        return float('-inf'), "Biological impossibility"
    
    # Calculate log-likelihood using normal distribution
    log_likelihood = stats.norm.logpdf(age_diff, params['mean'], params['std_dev'])
    
    return log_likelihood, "Complete age data"

# Example with some missing ages
individuals_with_missing = [
    {'id': 1001, 'age': 70, 'sex': 'M'},
    {'id': 1002, 'age': 40, 'sex': 'F'},
    {'id': 1003, 'age': None, 'sex': 'M'},  # Missing age
    {'id': 1004, 'age': 15, 'sex': 'F'},
    {'id': 1005, 'age': None, 'sex': 'M'}   # Missing age
]

# Create a dictionary for easy lookup
id_to_info_missing = {person['id']: person for person in individuals_with_missing}

# Define pairs to test
test_pairs_missing = [
    (1001, 1002),  # Both ages known
    (1001, 1003),  # One age missing
    (1003, 1005)   # Both ages missing
]

# Calculate age likelihoods for each pair
print("Age-Based Relationship Likelihoods with Missing Data:")
for id1, id2 in test_pairs_missing:
    age1 = id_to_info_missing[id1]['age']
    age2 = id_to_info_missing[id2]['age']
    
    age1_str = str(age1) if age1 is not None else "Unknown"
    age2_str = str(age2) if age2 is not None else "Unknown"
    
    print(f"\nPair {id1}-{id2}: Ages {age1_str} and {age2_str}")
    
    # Calculate likelihoods for each test relationship
    # (named `ranked` so it does not shadow the imported `likelihoods` module)
    ranked = []
    for rel_tuple, rel_name in test_relationships:
        log_ll, confidence = calculate_age_likelihood_robust(age1, age2, rel_tuple)
        ranked.append((rel_tuple, rel_name, log_ll, confidence))
    
    # Sort by likelihood (highest first)
    ranked.sort(key=lambda x: x[2], reverse=True)
    
    # Display results
    print("  Most likely relationships based on age:")
    for rel_tuple, rel_name, log_ll, confidence in ranked[:3]:  # Top 3
        # Convert to regular likelihood for better readability
        likelihood = np.exp(log_ll) if log_ll > float('-inf') else 0
        print(f"    {rel_name}: {likelihood:.6f} (log-likelihood: {log_ll:.2f}) - {confidence}")

### 3.2 Handling Uncertain Age Information

In addition to missing age information, Bonsai v3 can also handle uncertain or approximate age information, such as birth year ranges or decade estimates. This is typically done by widening the standard deviation in the likelihood calculation: the variance of each age estimate is added to the variance of the relationship's age-difference model.
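
To see the effect of a wider standard deviation, the sketch below combines variances under the assumption that the age errors and the model spread are independent, then scores one off-mean observation with a precise and a vague age. The helper names and all numeric values are assumptions for illustration.

```python
import math

def gaussian_log_density(x, mean, std_dev):
    """Log of the normal probability density at x."""
    z = (x - mean) / std_dev
    return -0.5 * z * z - math.log(std_dev) - 0.5 * math.log(2.0 * math.pi)

def combined_std(model_std, age1_std, age2_std):
    """Standard deviation of the age difference when independent variances add."""
    return math.sqrt(model_std**2 + age1_std**2 + age2_std**2)

model_mean, model_std = 30.0, 10.0   # assumed parent-style model
observed_diff = 10.0                 # 20 years below the model mean

precise_std = combined_std(model_std, 1.0, 2.0)    # both ages well known
vague_std = combined_std(model_std, 1.0, 10.0)     # second age known only roughly

ll_precise = gaussian_log_density(observed_diff, model_mean, precise_std)
ll_vague = gaussian_log_density(observed_diff, model_mean, vague_std)
print(round(precise_std, 2), round(ll_precise, 2))
print(round(vague_std, 2), round(ll_vague, 2))
```

With the vague age, the same 20-year deviation is penalized less, because part of it can be attributed to measurement error.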

Let's implement a version of our age likelihood function that handles uncertain ages:

In [None]:
def calculate_age_likelihood_uncertain(age1, age1_uncertainty, age2, age2_uncertainty, relationship_tuple):
    """Calculate relationship likelihood based on age with uncertainty.
    
    Args:
        age1: Age of first individual
        age1_uncertainty: Uncertainty of first individual's age (standard deviation)
        age2: Age of second individual
        age2_uncertainty: Uncertainty of second individual's age (standard deviation)
        relationship_tuple: (up, down, num_ancs) tuple representing the relationship
        
    Returns:
        Log-likelihood score
    """
    # If either age is missing, return a neutral log-likelihood of 0 (no age evidence)
    if age1 is None or age2 is None:
        return 0.0
    
    # Calculate age difference
    age_diff = age1 - age2
    
    # Get expected age difference parameters
    params = get_age_parameters(relationship_tuple)
    mean = params['mean']
    
    # Combine model standard deviation with age uncertainties
    # Using sum of variances for independent random variables
    combined_var = params['std_dev']**2 + age1_uncertainty**2 + age2_uncertainty**2
    combined_std = np.sqrt(combined_var)
    
    # Check for biological impossibility with more leniency due to uncertainty.
    # Only the measurement uncertainty of the two ages can excuse a wrong-sign
    # difference; the model's own spread (params['std_dev']) should not.
    if params['direction'] != 0:
        measurement_std = np.sqrt(age1_uncertainty**2 + age2_uncertainty**2)
        # How far the observed difference lies on the wrong side of zero (years)
        wrong_side = -params['direction'] * age_diff
        if wrong_side > 2 * measurement_std:  # More than 2 std devs in wrong direction
            return float('-inf')  # Biologically very unlikely
    
    # Calculate log-likelihood using normal distribution with combined uncertainty
    log_likelihood = stats.norm.logpdf(age_diff, mean, combined_std)
    
    return log_likelihood

# Example with uncertain ages
individuals_uncertain = [
    {'id': 1001, 'age': 70, 'uncertainty': 1},    # Precisely known age
    {'id': 1002, 'age': 40, 'uncertainty': 2},    # Fairly certain age
    {'id': 1003, 'age': 45, 'uncertainty': 5},    # Approximate age
    {'id': 1004, 'age': 15, 'uncertainty': 0.5},  # Very precise age
    {'id': 1005, 'age': 60, 'uncertainty': 10}    # Very uncertain age (e.g., only decade known)
]

# Create a dictionary for easy lookup
id_to_info_uncertain = {person['id']: person for person in individuals_uncertain}

# Define pairs to test
test_pairs_uncertain = [
    (1001, 1002),  # Both ages fairly certain
    (1002, 1003),  # One age more uncertain
    (1001, 1005),  # One age very uncertain
]

# Calculate age likelihoods for each pair
print("Age-Based Relationship Likelihoods with Uncertain Ages:")
for id1, id2 in test_pairs_uncertain:
    info1 = id_to_info_uncertain[id1]
    info2 = id_to_info_uncertain[id2]
    
    age1 = info1['age']
    age2 = info2['age']
    uncertainty1 = info1['uncertainty']
    uncertainty2 = info2['uncertainty']
    
    print(f"\nPair {id1}-{id2}: Ages {age1}±{uncertainty1} and {age2}±{uncertainty2}")
    
    # Calculate likelihoods for each test relationship
    # (named `ranked` so it does not shadow the imported `likelihoods` module)
    ranked = []
    for rel_tuple, rel_name in test_relationships:
        log_ll = calculate_age_likelihood_uncertain(age1, uncertainty1, age2, uncertainty2, rel_tuple)
        ranked.append((rel_tuple, rel_name, log_ll))
    
    # Sort by likelihood (highest first)
    ranked.sort(key=lambda x: x[2], reverse=True)
    
    # Display results
    print("  Most likely relationships based on age:")
    for rel_tuple, rel_name, log_ll in ranked[:3]:  # Top 3
        # Convert to regular likelihood for better readability
        likelihood = np.exp(log_ll) if log_ll > float('-inf') else 0
        print(f"    {rel_name}: {likelihood:.6f} (log-likelihood: {log_ll:.2f})")

## Part 4: Combining Age and Genetic Evidence

In Bonsai v3, age-based likelihoods are combined with genetic likelihoods to provide a more complete picture of relationship probabilities. Let's explore how this combination works.

### 4.1 The Combined Likelihood Approach

Bonsai v3 combines age and genetic evidence using a weighted sum approach in log space:

```python
combined_ll = genetic_ll + age_weight * age_ll
```

Where:
- `genetic_ll` is the log-likelihood based on IBD segments
- `age_ll` is the log-likelihood based on age differences
- `age_weight` is a parameter that controls the influence of age evidence (typically between 0.1 and 0.5)
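
A small worked example shows how `age_weight` can change the outcome. The two hypotheses and their log-likelihoods below are made-up numbers, and `combine` and `best_hypothesis` are hypothetical helpers for this sketch only.

```python
import math

def combine(genetic_ll, age_ll, age_weight):
    """Weighted sum in log space."""
    return genetic_ll + age_weight * age_ll

# Made-up (genetic_ll, age_ll) pairs for two competing hypotheses:
# A is slightly better genetically, B fits the ages much better.
hypotheses = {"A": (-8.0, -12.0), "B": (-9.0, -4.0)}

def best_hypothesis(age_weight):
    """Name of the hypothesis with the highest combined log-likelihood."""
    return max(hypotheses, key=lambda name: combine(*hypotheses[name], age_weight))

winner_low = best_hypothesis(0.1)    # age barely counts
winner_high = best_hypothesis(0.25)  # age counts more

# In probability space the same rule reads L_gen * L_age ** age_weight
g, a = hypotheses["A"]
same = math.isclose(math.exp(combine(g, a, 0.25)), math.exp(g) * math.exp(a) ** 0.25)
print(winner_low, winner_high, same)
```

At weight 0.1 the genetic advantage of A decides the ranking. At 0.25 the age evidence is strong enough to put B ahead. A weight below 1 is equivalent to raising the age likelihood to a fractional power, which flattens it relative to the genetic term.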

Let's look at how this combination affects relationship inference:

In [None]:
# Define example genetic log-likelihoods for different relationships
# These would normally come from IBD analysis
example_genetic_lls = {
    ((0, 1, 1), "Parent"): -10,
    ((1, 0, 1), "Child"): -15,
    ((1, 1, 2), "Full Sibling"): -5,  # Strongest genetic evidence for siblings
    ((0, 2, 1), "Grandparent"): -18,
    ((1, 2, 1), "Aunt/Uncle"): -12,
    ((2, 2, 2), "Full First Cousin"): -20
}

# Function to combine genetic and age likelihoods
def combine_likelihoods(genetic_ll, age_ll, age_weight=0.25):
    """Combine genetic and age-based log-likelihoods."""
    # Handle biological impossibilities in age
    if age_ll == float('-inf'):
        return float('-inf')  # Age makes relationship impossible
    
    # Combine likelihoods with weighting
    return genetic_ll + age_weight * age_ll

# Example case: siblings vs parent-child distinction
test_individual_pair = (1002, 1003)  # Ages 40 and 45
id1, id2 = test_individual_pair
age1 = id_to_info[id1]['age']
age2 = id_to_info[id2]['age']

print(f"\nExamining pair {id1}-{id2}: Ages {age1} and {age2} (difference: {age1 - age2} years)")

# Calculate combined likelihoods
combined_results = []
for rel_tuple, rel_name in test_relationships:
    genetic_ll = example_genetic_lls.get((rel_tuple, rel_name), -25)  # Default value for missing relationships
    age_ll = calculate_age_likelihood(age1, age2, rel_tuple)
    
    # Try different age weights
    combined_results.append({
        'relationship': rel_name,
        'genetic_ll': genetic_ll,
        'age_ll': age_ll,
        'combined_ll_w0.1': combine_likelihoods(genetic_ll, age_ll, 0.1),
        'combined_ll_w0.25': combine_likelihoods(genetic_ll, age_ll, 0.25),
        'combined_ll_w0.5': combine_likelihoods(genetic_ll, age_ll, 0.5)
    })

# Convert to DataFrame and display
combined_df = pd.DataFrame(combined_results)
display(combined_df)

# Find the most likely relationship for each weighting
for weight, col in [(0.1, 'combined_ll_w0.1'), (0.25, 'combined_ll_w0.25'), (0.5, 'combined_ll_w0.5')]:
    top_rel = combined_df.sort_values(col, ascending=False).iloc[0]
    print(f"With age weight {weight}: Most likely relationship is {top_rel['relationship']} (combined LL: {top_rel[col]:.2f})")

### 4.2 Impact of Age Evidence on Relationship Disambiguation

One of the key benefits of incorporating age information is the ability to disambiguate relationships that have similar genetic signatures. Let's explore a case where age information helps distinguish between relationships:

In [None]:
# Example case: Relationships with similar genetic signatures
# These relationships can be hard to distinguish based on genetics alone
ambiguous_relationships = [
    ((1, 1, 1), "Half Sibling"),           # ~25% IBD sharing
    ((1, 2, 2), "Aunt/Uncle"),             # ~25% IBD sharing (full: two common ancestors)
    ((2, 1, 2), "Niece/Nephew"),           # ~25% IBD sharing (full: two common ancestors)
    ((0, 2, 1), "Grandparent"),            # ~25% IBD sharing
    ((2, 0, 1), "Grandchild")              # ~25% IBD sharing
]

# Define similar genetic log-likelihoods for these relationships
ambiguous_genetic_lls = {
    ((1, 1, 1), "Half Sibling"): -8.0,
    ((1, 2, 2), "Aunt/Uncle"): -8.5,
    ((2, 1, 2), "Niece/Nephew"): -8.5,
    ((0, 2, 1), "Grandparent"): -9.0,
    ((2, 0, 1), "Grandchild"): -9.0
}

# Define test cases with different age differences
test_cases = [
    {'id1': 'A', 'id2': 'B', 'age1': 40, 'age2': 40, 'desc': "Same age (0 year difference)"},
    {'id1': 'C', 'id2': 'D', 'age1': 60, 'age2': 40, 'desc': "Older-younger (20 year difference)"},
    {'id1': 'E', 'id2': 'F', 'age1': 70, 'age2': 10, 'desc': "Much older-younger (60 year difference)"}
]

# Analyze each test case
print("\nDisambiguating Relationships with Similar Genetic Signatures:")
for case in test_cases:
    print(f"\nCase: {case['desc']} - Ages {case['age1']} and {case['age2']}")
    
    # Calculate combined likelihoods
    results = []
    for rel_tuple, rel_name in ambiguous_relationships:
        genetic_ll = ambiguous_genetic_lls[(rel_tuple, rel_name)]
        age_ll = calculate_age_likelihood(case['age1'], case['age2'], rel_tuple)
        combined_ll = combine_likelihoods(genetic_ll, age_ll, 0.25)
        
        results.append({
            'relationship': rel_name,
            'genetic_ll': genetic_ll,
            'age_ll': age_ll,
            'combined_ll': combined_ll
        })
    
    # Convert to DataFrame and sort by combined likelihood
    results_df = pd.DataFrame(results).sort_values('combined_ll', ascending=False)
    
    # Display results
    print("  Relationships ranked by combined likelihood:")
    for i, row in results_df.iterrows():
        print(f"    {row['relationship']}: Combined LL = {row['combined_ll']:.2f} (Genetic: {row['genetic_ll']:.2f}, Age: {row['age_ll']:.2f})")
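
Ranked log-likelihoods are easier to compare once they are normalized into relative probabilities. The sketch below does this with `scipy.special.logsumexp`, assuming equal prior weight for every candidate relationship. The input values are made up, and an impossible relationship (log-likelihood of negative infinity) ends up with probability 0.

```python
import numpy as np
from scipy.special import logsumexp

# Made-up combined log-likelihoods for three candidate relationships
combined_lls = {
    "Half Sibling": -8.8,
    "Aunt/Uncle": -9.6,
    "Grandparent": float("-inf"),  # ruled out by age
}

names = list(combined_lls)
values = np.array([combined_lls[name] for name in names])

# Subtracting the log of the total likelihood normalizes in log space,
# which avoids underflow when the log-likelihoods are very negative
log_probs = values - logsumexp(values)
probs = dict(zip(names, np.exp(log_probs)))

for name, p in probs.items():
    print(f"{name}: {p:.3f}")
```

These values are relative weights among the listed candidates only. They say nothing about relationships that were left out of the comparison.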

## Summary

In this lab, we've explored how Bonsai v3 uses age information to enhance relationship inference. Key takeaways include:

1. **Age Difference Distributions**: Different relationship types have characteristic age difference distributions, modeled as normal distributions with specific mean and standard deviation parameters.

2. **Age-Based Likelihood Calculation**: Bonsai v3 calculates the likelihood of a relationship based on the observed age difference compared to the expected distribution.

3. **Handling Missing and Uncertain Data**: The system can handle missing age information and incorporates uncertainty in age estimates through adjusted standard deviations.

4. **Combining Evidence**: Age-based likelihoods are combined with genetic likelihoods to provide a more complete assessment of relationship probabilities.

5. **Disambiguation Power**: Age information is particularly valuable for disambiguating relationships with similar genetic signatures, such as half-siblings, avuncular relationships, and grandparent-grandchild relationships.

Age-based relationship modeling allows Bonsai v3 to leverage all available evidence for relationship inference, producing more accurate and comprehensive pedigree reconstructions than would be possible from genetic data alone.

In [None]:
# Convert this notebook to PDF using poetry
!poetry run jupyter nbconvert --to pdf Lab08_Age_Based_Relationship_Modeling.ipynb

# Note: PDF conversion requires LaTeX to be installed on your system
# If you encounter errors, you may need to install it:
# On Ubuntu/Debian: sudo apt-get install texlive-xetex
# On macOS with Homebrew: brew install texlive