# Lab 17: Advanced Likelihood Calculations in Bonsai

In this lab, we will explore the mathematical foundation for relationship inference in genetic genealogy using the Bonsai algorithm. Building upon our understanding of Bonsai's architecture from Lab 16, we'll dive deeper into the statistical models and likelihood calculations that power accurate pedigree reconstruction.

## Why This Matters

Understanding the mathematical models behind Bonsai is crucial for computational genetic genealogists who want to:
- Interpret likelihood scores and evaluate confidence in relationships
- Customize likelihood models for specific populations or research questions
- Develop new methods for relationship inference
- Properly handle complex scenarios like endogamy or admixture
- Evaluate competing pedigree hypotheses with statistical rigor

**Learning Objectives**:
- Understand the statistical foundations of likelihood calculations
- Implement core relationship likelihood models for different relationship types
- Apply advanced techniques for more accurate relationship inference
- Calculate pedigree-level likelihoods from pairwise relationships
- Develop specialized models for complex genealogical scenarios
- Interpret likelihood results and confidence scores

## Environment Setup

In [None]:
import os
import math
import logging
import sys
from pathlib import Path
import subprocess
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
import pandas as pd
import numpy as np
import networkx as nx
from scipy import stats
from scipy.special import comb
from scipy.stats import poisson, expon, norm, gamma, multivariate_normal
import random
from collections import defaultdict, Counter
import time
from dotenv import load_dotenv

## 1. Statistical Foundations of Likelihood Calculations

At its core, Bonsai uses likelihood functions to evaluate how well different relationship hypotheses explain observed IBD sharing patterns. Let's begin by exploring the statistical foundations of these likelihood calculations.

### 1.1 Likelihood and Maximum Likelihood Estimation

In statistics, the likelihood function measures how well a statistical model explains observed data. For relationship inference, we calculate the likelihood of observing a particular pattern of IBD sharing given different relationship hypotheses.

The maximum likelihood principle states that we should choose the hypothesis that maximizes the likelihood of the observed data. Let's formalize this concept:
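
In symbols, this can be sketched as follows. The notation is an assumption made for illustration (it is not taken from Bonsai itself): $x_1, \dots, x_n$ are independent observations and $\theta$ is a hypothesis or parameter set.

$$L(\theta) = \prod_{i=1}^{n} P(x_i \mid \theta), \qquad \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta), \qquad \hat{\theta} = \arg\max_{\theta} \ell(\theta)$$

Because the logarithm is strictly increasing, maximizing $\ell(\theta)$ selects the same $\hat{\theta}$ as maximizing $L(\theta)$, which is why the product form and the sum-of-logs form below are interchangeable for ranking hypotheses.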

In [None]:
def calculate_likelihood(model_parameters, observed_data, probability_function):
    """Calculate the likelihood of observing data given model parameters.
    
    Args:
        model_parameters: Parameters of the model
        observed_data: Observed data points
        probability_function: Function that calculates probability of data given parameters
        
    Returns:
        Likelihood value
    """
    # Initialize likelihood to 1 (we'll multiply probabilities)
    likelihood = 1.0
    
    # For each observed data point
    for data_point in observed_data:
        # Calculate probability of this data point given model parameters
        probability = probability_function(data_point, model_parameters)
        
        # Multiply likelihood by this probability
        likelihood *= probability
    
    return likelihood

def calculate_log_likelihood(model_parameters, observed_data, log_probability_function):
    """Calculate the log-likelihood of observing data given model parameters.
    
    Args:
        model_parameters: Parameters of the model
        observed_data: Observed data points
        log_probability_function: Function that calculates log-probability of data given parameters
        
    Returns:
        Log-likelihood value
    """
    # Initialize log-likelihood to 0 (we'll add log-probabilities)
    log_likelihood = 0.0
    
    # For each observed data point
    for data_point in observed_data:
        # Calculate log-probability of this data point given model parameters
        log_probability = log_probability_function(data_point, model_parameters)
        
        # Add log-probability to log-likelihood
        log_likelihood += log_probability
    
    return log_likelihood

def find_maximum_likelihood_parameters(observed_data, log_probability_function, parameter_space):
    """Find model parameters that maximize likelihood of observed data.
    
    Args:
        observed_data: Observed data points
        log_probability_function: Function that calculates log-probability of data given parameters
        parameter_space: List of possible parameter values to try
        
    Returns:
        Tuple of (best_parameters, maximum_log_likelihood)
    """
    best_parameters = None
    max_log_likelihood = float('-inf')
    
    # Try each set of parameters in parameter space
    for parameters in parameter_space:
        # Calculate log-likelihood for these parameters
        log_likelihood = calculate_log_likelihood(parameters, observed_data, log_probability_function)
        
        # Update best parameters if this is better
        if log_likelihood > max_log_likelihood:
            max_log_likelihood = log_likelihood
            best_parameters = parameters
    
    return best_parameters, max_log_likelihood
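
The reason for preferring `calculate_log_likelihood` over `calculate_likelihood` is numerical: a product of many small probabilities underflows to zero in floating point, while the sum of their logarithms stays finite. The following is a minimal sketch of that effect using made-up data (2,000 draws from a standard normal); the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data: 2,000 draws from a standard normal distribution
rng = np.random.default_rng(0)
observations = rng.normal(loc=0.0, scale=1.0, size=2000)

# Direct product of densities: every factor is below 0.4, so the product
# falls far below the smallest positive float and underflows to 0.0
likelihood = np.prod(norm.pdf(observations, loc=0.0, scale=1.0))

# Sum of log-densities: carries the same information but stays finite
log_likelihood = np.sum(norm.logpdf(observations, loc=0.0, scale=1.0))

print(f"Product of densities: {likelihood}")
print(f"Sum of log-densities: {log_likelihood:.2f}")
```

Once the product has collapsed to zero, every hypothesis looks equally bad and comparison is impossible, whereas log-likelihoods can still be ranked.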

### 1.2 Applying Likelihood to Relationship Inference

In the context of genetic genealogy, our observed data consists of IBD segments shared between pairs of individuals. Our hypotheses are different possible relationships between these individuals. Let's illustrate this with a simple example:

In [None]:
# Define a simple model for total IBD sharing in centiMorgans
# Mean and standard deviation for different relationships
ibd_sharing_models = {
    'parent-child': (3500, 100),    # Mean of 3500 cM, SD of 100 cM
    'full-siblings': (2550, 200),   # Mean of 2550 cM, SD of 200 cM
    'half-siblings': (1700, 300),   # Mean of 1700 cM, SD of 300 cM
    'first-cousins': (850, 200),    # Mean of 850 cM, SD of 200 cM
    'second-cousins': (212, 100),   # Mean of 212 cM, SD of 100 cM
    'unrelated': (30, 50)           # Mean of 30 cM, SD of 50 cM
}

def log_probability_total_ibd(observed_total_ibd, relationship):
    """Calculate log-probability of observing total IBD given relationship.
    
    Args:
        observed_total_ibd: Observed total IBD sharing in centiMorgans
        relationship: Relationship type (key in ibd_sharing_models)
        
    Returns:
        Log-probability value
    """
    # Get model parameters for this relationship
    mean, sd = ibd_sharing_models[relationship]
    
    # Calculate log-probability using normal distribution
    log_prob = norm.logpdf(observed_total_ibd, mean, sd)
    
    return log_prob

def infer_relationship_from_total_ibd(observed_total_ibd):
    """Infer most likely relationship based on total IBD sharing.
    
    Args:
        observed_total_ibd: Observed total IBD sharing in centiMorgans
        
    Returns:
        Tuple of (most_likely_relationship, log_likelihood, all_log_likelihoods)
    """
    all_log_likelihoods = {}
    
    # Calculate log-likelihood for each relationship type
    for relationship in ibd_sharing_models.keys():
        log_likelihood = log_probability_total_ibd(observed_total_ibd, relationship)
        all_log_likelihoods[relationship] = log_likelihood
    
    # Find relationship with maximum log-likelihood
    most_likely_relationship = max(all_log_likelihoods.items(), key=lambda x: x[1])[0]
    max_log_likelihood = all_log_likelihoods[most_likely_relationship]
    
    return most_likely_relationship, max_log_likelihood, all_log_likelihoods

# Test with some example values
example_total_ibd_values = [3450, 2600, 1650, 900, 250, 50]

print("Relationship inference from total IBD sharing:")
print("-" * 60)
print(f"{'Total IBD (cM)':<15} {'Inferred Relationship':<25} {'Log-Likelihood':<15}")
print("-" * 60)

for total_ibd in example_total_ibd_values:
    relationship, log_likelihood, all_log_likelihoods = infer_relationship_from_total_ibd(total_ibd)
    print(f"{total_ibd:<15} {relationship:<25} {log_likelihood:<15.2f}")

# Visualize the model distributions
plt.figure(figsize=(12, 6))
x = np.linspace(0, 4000, 1000)

for relationship, (mean, sd) in ibd_sharing_models.items():
    y = norm.pdf(x, mean, sd)
    plt.plot(x, y, label=relationship)

plt.xlabel('Total IBD Sharing (cM)')
plt.ylabel('Probability Density')
plt.title('Relationship Models Based on Total IBD Sharing')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
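
A raw log-likelihood is hard to interpret on its own; what matters is how far the best hypothesis is ahead of the next one. As a rough sketch, the difference between the top two log-likelihoods (the log-likelihood ratio) can serve as a confidence margin. The code below redefines the same illustrative means and SDs used above so that it runs on its own; `rank_hypotheses` and `log_likelihood_margin` are hypothetical helper names, not part of Bonsai.

```python
from scipy.stats import norm

# Same illustrative (mean, SD) values in cM as ibd_sharing_models above
models = {
    'parent-child': (3500, 100),
    'full-siblings': (2550, 200),
    'half-siblings': (1700, 300),
    'first-cousins': (850, 200),
    'second-cousins': (212, 100),
    'unrelated': (30, 50),
}

def rank_hypotheses(observed_total_ibd):
    """Return (relationship, log-likelihood) pairs sorted from best to worst."""
    scores = {
        rel: norm.logpdf(observed_total_ibd, mean, sd)
        for rel, (mean, sd) in models.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def log_likelihood_margin(observed_total_ibd):
    """Return the best hypothesis, the runner-up, and their log-likelihood difference."""
    ranked = rank_hypotheses(observed_total_ibd)
    (best, best_ll), (runner_up, runner_up_ll) = ranked[0], ranked[1]
    return best, runner_up, best_ll - runner_up_ll

# 3450 cM sits close to one model mean; 1250 cM falls between two models
for total_ibd in [3450, 1250]:
    best, runner_up, margin = log_likelihood_margin(total_ibd)
    print(f"{total_ibd} cM: {best} vs {runner_up}, log-likelihood margin = {margin:.2f}")
```

A large margin means the data strongly favor the top hypothesis; a margin below about 1 means the two hypotheses are nearly indistinguishable from total IBD alone, which motivates the additional features introduced next.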

### 1.3 Beyond Total IBD: Multiple Features

Using only total IBD sharing is a simplistic approach. In reality, Bonsai uses multiple features derived from IBD segments to improve relationship inference accuracy. These features might include:

1. Total IBD sharing in centiMorgans
2. Number of IBD segments
3. Length distribution of IBD segments
4. Presence of specific chromosomal patterns (e.g., full IBD2 on X chromosome)

Let's extend our model to include both total IBD and number of segments:

In [None]:
# Define a more complex model with two features: total IBD and number of segments
# Format: (mean_total_ibd, sd_total_ibd, mean_num_segments, sd_num_segments)
multi_feature_models = {
    'parent-child': (3500, 100, 40, 5),      # High total IBD, fairly consistent number of segments
    'full-siblings': (2550, 200, 35, 5),     # High total IBD, fairly consistent number of segments
    'half-siblings': (1700, 300, 25, 5),     # Medium total IBD, medium number of segments
    'first-cousins': (850, 200, 15, 5),      # Lower total IBD, fewer segments
    'second-cousins': (212, 100, 5, 3),      # Low total IBD, very few segments
    'unrelated': (30, 50, 1, 1)              # Very low total IBD, typically 0-1 segments
}

def log_probability_multi_feature(observed_features, relationship):
    """Calculate log-probability of observing features given relationship.
    
    Args:
        observed_features: Tuple of (total_ibd, num_segments)
        relationship: Relationship type (key in multi_feature_models)
        
    Returns:
        Log-probability value
    """
    total_ibd, num_segments = observed_features
    
    # Get model parameters for this relationship
    mean_total_ibd, sd_total_ibd, mean_num_segments, sd_num_segments = multi_feature_models[relationship]
    
    # Calculate log-probability for each feature independently
    # (assuming features are independent, which is a simplification)
    log_prob_total_ibd = norm.logpdf(total_ibd, mean_total_ibd, sd_total_ibd)
    log_prob_num_segments = norm.logpdf(num_segments, mean_num_segments, sd_num_segments)
    
    # Sum log-probabilities (equivalent to multiplying probabilities)
    return log_prob_total_ibd + log_prob_num_segments

def infer_relationship_from_multi_feature(observed_features):
    """Infer most likely relationship based on multiple features.
    
    Args:
        observed_features: Tuple of (total_ibd, num_segments)
        
    Returns:
        Tuple of (most_likely_relationship, log_likelihood, all_log_likelihoods)
    """
    all_log_likelihoods = {}
    
    # Calculate log-likelihood for each relationship type
    for relationship in multi_feature_models.keys():
        log_likelihood = log_probability_multi_feature(observed_features, relationship)
        all_log_likelihoods[relationship] = log_likelihood
    
    # Find relationship with maximum log-likelihood
    most_likely_relationship = max(all_log_likelihoods.items(), key=lambda x: x[1])[0]
    max_log_likelihood = all_log_likelihoods[most_likely_relationship]
    
    return most_likely_relationship, max_log_likelihood, all_log_likelihoods

# Test with some example values: (total_ibd, num_segments)
example_multi_feature_values = [
    (3450, 38),  # Likely parent-child
    (2600, 36),  # Likely full-siblings
    (1650, 15),  # Might be half-siblings but fewer segments than expected
    (1650, 24),  # Likely half-siblings with expected number of segments
    (900, 16),   # Likely first-cousins
    (250, 3)     # Likely second-cousins
]

print("Relationship inference from multiple features:")
print("-" * 80)
print(f"{'Total IBD (cM)':<15} {'Num Segments':<15} {'Inferred Relationship':<25} {'Log-Likelihood':<15}")
print("-" * 80)

for features in example_multi_feature_values:
    total_ibd, num_segments = features
    relationship, log_likelihood, all_log_likelihoods = infer_relationship_from_multi_feature(features)
    print(f"{total_ibd:<15} {num_segments:<15} {relationship:<25} {log_likelihood:<15.2f}")

# Visualize the multi-feature model
plt.figure(figsize=(10, 8))

# Create a scatter plot for each relationship type
for i, (relationship, (mean_total_ibd, sd_total_ibd, mean_num_segments, sd_num_segments)) in enumerate(multi_feature_models.items()):
    # Generate random samples from this model
    # Seed from the loop index: the built-in hash() of a string varies between Python sessions
    np.random.seed(42 + i)  # Different, reproducible seed for each relationship
    total_ibd_samples = np.random.normal(mean_total_ibd, sd_total_ibd, 100)
    num_segments_samples = np.random.normal(mean_num_segments, sd_num_segments, 100)
    
    # Ensure values are reasonable (non-negative)
    total_ibd_samples = np.maximum(0, total_ibd_samples)
    num_segments_samples = np.maximum(0, num_segments_samples)
    
    plt.scatter(total_ibd_samples, num_segments_samples, alpha=0.5, label=relationship)

plt.xlabel('Total IBD Sharing (cM)')
plt.ylabel('Number of IBD Segments')
plt.title('Relationship Models Using Multiple Features')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
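
The independence assumption in `log_probability_multi_feature` can be relaxed by modeling the two features jointly. The sketch below uses `scipy.stats.multivariate_normal` with a covariance matrix built from the half-sibling SDs above and a correlation of 0.6; that correlation value is purely an assumption for illustration, not an estimate from data.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Half-sibling parameters from the table above: (total IBD in cM, number of segments)
mean = np.array([1700.0, 25.0])
sd = np.array([300.0, 5.0])
correlation = 0.6  # Assumed for illustration only

# Covariance matrix: variances on the diagonal, correlation * sd1 * sd2 off the diagonal
off_diagonal = correlation * sd[0] * sd[1]
cov = np.array([
    [sd[0] ** 2, off_diagonal],
    [off_diagonal, sd[1] ** 2],
])

def log_prob_independent(features):
    """Sum of two univariate normal log-densities (features treated as independent)."""
    return (norm.logpdf(features[0], mean[0], sd[0])
            + norm.logpdf(features[1], mean[1], sd[1]))

def log_prob_correlated(features):
    """Joint bivariate normal log-density that accounts for the correlation."""
    return multivariate_normal.logpdf(features, mean=mean, cov=cov)

concordant = (2000, 30)  # Both features one SD above their means
discordant = (2000, 20)  # High total IBD but one SD fewer segments than the mean

for label, features in [("concordant", concordant), ("discordant", discordant)]:
    print(f"{label}: independent = {log_prob_independent(features):.2f}, "
          f"correlated = {log_prob_correlated(features):.2f}")
```

The independent model scores both observations identically because each feature is one SD from its mean. The joint model penalizes the discordant pair, since high total IBD with few segments is less plausible when the features are positively correlated.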

### 1.4 Bayesian Approach: Incorporating Prior Knowledge

The likelihood approaches we've discussed so far don't incorporate prior knowledge about relationships. In many cases, we may have prior information (e.g., age constraints, known populations with higher rates of certain relationships) that can improve inference.

Bayesian statistics provides a framework for incorporating prior knowledge through Bayes' Theorem:

$$P(R|D) = \frac{P(D|R) \times P(R)}{P(D)}$$

Where:
- $P(R|D)$ is the posterior probability of relationship $R$ given data $D$
- $P(D|R)$ is the likelihood of observing data $D$ given relationship $R$
- $P(R)$ is the prior probability of relationship $R$
- $P(D)$ is the marginal probability of data $D$
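
When the candidate relationships are treated as mutually exclusive and exhaustive (an assumption of this simplified setup), the marginal expands into a sum over hypotheses, which is exactly the normalization step:

$$P(D) = \sum_{R'} P(D|R') \times P(R')$$

As a worked example with made-up numbers, take two hypotheses with likelihoods $P(D|R_1) = 0.02$ and $P(D|R_2) = 0.01$ and priors $P(R_1) = 0.3$ and $P(R_2) = 0.7$:

$$P(D) = 0.02 \times 0.3 + 0.01 \times 0.7 = 0.013, \qquad P(R_1|D) = \frac{0.006}{0.013} \approx 0.46, \qquad P(R_2|D) = \frac{0.007}{0.013} \approx 0.54$$

Here $R_1$ has the higher likelihood, but the prior shifts the posterior in favor of $R_2$.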

Let's implement a Bayesian relationship inference approach:

In [None]:
# Define prior probabilities for different relationships
# These could be based on demographic information, pedigree structure, etc.
# For simplicity, we'll use a uniform prior here
prior_probabilities = {
    'parent-child': 1/6,
    'full-siblings': 1/6,
    'half-siblings': 1/6,
    'first-cousins': 1/6,
    'second-cousins': 1/6,
    'unrelated': 1/6
}

def infer_relationship_bayesian(observed_features):
    """Infer most likely relationship using Bayesian approach.
    
    Args:
        observed_features: Tuple of (total_ibd, num_segments)
        
    Returns:
        Tuple of (most_likely_relationship, posterior_probability, all_posteriors)
    """
    all_log_likelihoods = {}
    all_posteriors = {}
    
    # Calculate log-likelihood and unnormalized log-posterior for each relationship
    for relationship in multi_feature_models.keys():
        # Calculate log-likelihood
        log_likelihood = log_probability_multi_feature(observed_features, relationship)
        all_log_likelihoods[relationship] = log_likelihood
        
        # Add log-prior to get unnormalized log-posterior
        log_prior = np.log(prior_probabilities[relationship])
        log_posterior = log_likelihood + log_prior
        all_posteriors[relationship] = log_posterior
    
    # Normalize posteriors
    # First convert log-posteriors to posteriors
    max_log_posterior = max(all_posteriors.values())
    posteriors = {r: np.exp(lp - max_log_posterior) for r, lp in all_posteriors.items()}
    
    # Normalize
    sum_posteriors = sum(posteriors.values())
    posteriors = {r: p / sum_posteriors for r, p in posteriors.items()}
    
    # Find relationship with maximum posterior probability
    most_likely_relationship = max(posteriors.items(), key=lambda x: x[1])[0]
    max_posterior = posteriors[most_likely_relationship]
    
    return most_likely_relationship, max_posterior, posteriors

# Define a non-uniform prior that reflects our belief that siblings and cousins are more common
custom_prior = {
    'parent-child': 0.05,    # Relatively uncommon in genetic databases
    'full-siblings': 0.20,   # Common
    'half-siblings': 0.10,   # Less common
    'first-cousins': 0.30,   # Very common in genetic databases
    'second-cousins': 0.25,  # Very common in genetic databases
    'unrelated': 0.10        # We're usually looking at related pairs
}

def infer_relationship_bayesian_custom_prior(observed_features, prior=custom_prior):
    """Infer most likely relationship using Bayesian approach with custom prior.
    
    Args:
        observed_features: Tuple of (total_ibd, num_segments)
        prior: Dictionary of prior probabilities
        
    Returns:
        Tuple of (most_likely_relationship, posterior_probability, all_posteriors)
    """
    all_log_likelihoods = {}
    all_posteriors = {}
    
    # Calculate log-likelihood and unnormalized log-posterior for each relationship
    for relationship in multi_feature_models.keys():
        # Calculate log-likelihood
        log_likelihood = log_probability_multi_feature(observed_features, relationship)
        all_log_likelihoods[relationship] = log_likelihood
        
        # Add log-prior to get unnormalized log-posterior
        log_prior = np.log(prior[relationship])
        log_posterior = log_likelihood + log_prior
        all_posteriors[relationship] = log_posterior
    
    # Normalize posteriors
    # First convert log-posteriors to posteriors
    max_log_posterior = max(all_posteriors.values())
    posteriors = {r: np.exp(lp - max_log_posterior) for r, lp in all_posteriors.items()}
    
    # Normalize
    sum_posteriors = sum(posteriors.values())
    posteriors = {r: p / sum_posteriors for r, p in posteriors.items()}
    
    # Find relationship with maximum posterior probability
    most_likely_relationship = max(posteriors.items(), key=lambda x: x[1])[0]
    max_posterior = posteriors[most_likely_relationship]
    
    return most_likely_relationship, max_posterior, posteriors

# Compare standard likelihood, uniform prior Bayesian, and custom prior Bayesian
print("Comparison of inference methods:")
print("-" * 100)
print(f"{'Total IBD':<10} {'Segments':<10} {'ML Relationship':<20} {'Bayesian (Uniform)':<25} {'Bayesian (Custom)':<25}")
print("-" * 100)

for features in example_multi_feature_values:
    total_ibd, num_segments = features
    
    # Maximum likelihood
    ml_relationship, _, _ = infer_relationship_from_multi_feature(features)
    
    # Bayesian with uniform prior
    bayes_uniform_relationship, bayes_uniform_prob, _ = infer_relationship_bayesian(features)
    
    # Bayesian with custom prior
    bayes_custom_relationship, bayes_custom_prob, _ = infer_relationship_bayesian_custom_prior(features)
    
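    # Build the "relationship (probability)" strings first so they can be padded as a unit
    bayes_uniform_text = f"{bayes_uniform_relationship} ({bayes_uniform_prob:.2f})"
    bayes_custom_text = f"{bayes_custom_relationship} ({bayes_custom_prob:.2f})"
    print(f"{total_ibd:<10} {num_segments:<10} {ml_relationship:<20} "
          f"{bayes_uniform_text:<25} {bayes_custom_text:<25}")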
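
The normalization step in the functions above (subtract the maximum, exponentiate, divide by the sum) is the log-sum-exp trick written out by hand. As an alternative sketch, `scipy.special.logsumexp` performs the same computation in one call. The function name `normalize_log_posteriors` and the example numbers below are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp

def normalize_log_posteriors(log_likelihoods, priors):
    """Convert log-likelihoods and priors into normalized posterior probabilities."""
    relationships = list(log_likelihoods)
    # Unnormalized log-posterior: log P(D|R) + log P(R)
    log_unnormalized = np.array(
        [log_likelihoods[r] + np.log(priors[r]) for r in relationships]
    )
    # log P(D), computed stably even when every term is very negative
    log_evidence = logsumexp(log_unnormalized)
    posteriors = np.exp(log_unnormalized - log_evidence)
    return dict(zip(relationships, posteriors))

# Made-up log-likelihoods and priors for three hypotheses
example_log_likelihoods = {'half-siblings': -9.0, 'first-cousins': -10.0, 'unrelated': -1200.0}
example_priors = {'half-siblings': 0.2, 'first-cousins': 0.6, 'unrelated': 0.2}

example_posteriors = normalize_log_posteriors(example_log_likelihoods, example_priors)
for relationship, probability in example_posteriors.items():
    print(f"{relationship:<15} {probability:.4f}")
```

In this example the likelihood alone favors half-siblings, but the larger prior on first cousins is enough to make first cousins the maximum a posteriori choice. This is the kind of disagreement between the ML and Bayesian columns to look for in the comparison table.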

## 2. Core Relationship Likelihood Models

Now that we understand the statistical foundations, let's implement detailed likelihood models for specific relationship types. These models will incorporate knowledge about the expected patterns of IBD sharing for different genetic relationships.

### 2.1 Parent-Child Relationship Model

Parent-child relationships have a very distinctive IBD sharing pattern:
- A child inherits exactly one copy of each autosome from each parent, so parent and child are IBD1 across the entire autosomal genome
- Total IBD sharing is approximately 3400-3600 cM
- The variation in total IBD is much lower than for other relationships
- Sex-chromosome inheritance follows specific patterns (for example, a father passes his X chromosome only to daughters)

Let's implement a specialized likelihood model for parent-child relationships:

In [None]:
class ParentChildModel:
    """Likelihood model for parent-child relationships."""
    
    def __init__(self):
        """Initialize parent-child model parameters."""
        # Parameters for total IBD sharing
        self.expected_total_ibd = 3540
        self.total_ibd_sd = 60  # Very low variance
        
        # Parameters for number of segments
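        # In principle a parent and child are IBD1 along each whole autosome;
        # detected segment counts can be higher when IBD callers split segments
        # (the value below is illustrative)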
        self.expected_segments = 40
        self.segments_sd = 5
        
        # Expected fraction of genome with IBD1 sharing
        self.expected_ibd1_fraction = 1.0
        self.ibd1_fraction_sd = 0.02  # Allow for small measurement errors
        
        # Expected fraction of genome with IBD2 sharing
        self.expected_ibd2_fraction = 0.0
        self.ibd2_fraction_sd = 0.01  # Allow for small measurement errors
    
    def calculate_log_likelihood(self, features):
        """Calculate log-likelihood of features under parent-child model.
        
        Args:
            features: Dictionary with keys 'total_ibd', 'num_segments', 'ibd1_fraction', 'ibd2_fraction'
        
        Returns:
            Log-likelihood value
        """
        log_likelihood = 0.0
        
        # Component for total IBD
        if 'total_ibd' in features:
            log_likelihood += norm.logpdf(features['total_ibd'], 
                                         self.expected_total_ibd, 
                                         self.total_ibd_sd)
        
        # Component for number of segments
        if 'num_segments' in features:
            log_likelihood += norm.logpdf(features['num_segments'], 
                                         self.expected_segments, 
                                         self.segments_sd)
        
        # Component for IBD1 fraction
        if 'ibd1_fraction' in features:
            # Use truncated normal to keep within [0, 1]
            # Simplified here with regular normal, but should be truncated in practice
            log_likelihood += norm.logpdf(features['ibd1_fraction'], 
                                         self.expected_ibd1_fraction, 
                                         self.ibd1_fraction_sd)
        
        # Component for IBD2 fraction
        if 'ibd2_fraction' in features:
            # Use truncated normal to keep within [0, 1]
            # Simplified here with regular normal, but should be truncated in practice
            log_likelihood += norm.logpdf(features['ibd2_fraction'], 
                                         self.expected_ibd2_fraction, 
                                         self.ibd2_fraction_sd)
        
        # Additional components for sex-chromosome patterns could be added here
        
        return log_likelihood
    
    def is_consistent_with_sex(self, sex1, sex2):
        """Check if the sexes are consistent with a parent-child relationship.
        
        Args:
            sex1: Sex of individual 1 ('M', 'F', or 'U' for unknown)
            sex2: Sex of individual 2 ('M', 'F', or 'U' for unknown)
            
        Returns:
            Boolean indicating whether sexes are consistent
        """
        # Any combination of M/F/U is consistent with parent-child
        # Two males could be father-son
        # Two females could be mother-daughter
        # Male-female could be father-daughter or mother-son
        return True
    
    def is_consistent_with_age(self, age1, age2, min_parent_age=15):
        """Check if the ages are consistent with a parent-child relationship.
        
        Args:
            age1: Age of individual 1 (or None if unknown)
            age2: Age of individual 2 (or None if unknown)
            min_parent_age: Minimum age for a parent at child's birth
            
        Returns:
            Boolean indicating whether ages are consistent
        """
        if age1 is None or age2 is None:
            # If ages are unknown, we can't constrain
            return True
        
        # Check both possible directions (either could be parent)
        age_diff = abs(age1 - age2)
        
        # Parent should be at least min_parent_age years older than child
        return age_diff >= min_parent_age

# Example parent-child features
parent_child_features = {
    'total_ibd': 3520,
    'num_segments': 42,
    'ibd1_fraction': 0.99,
    'ibd2_fraction': 0.01
}

# Not parent-child features
not_parent_child_features = {
    'total_ibd': 2500,
    'num_segments': 35,
    'ibd1_fraction': 0.7,
    'ibd2_fraction': 0.3
}

# Create model and calculate likelihoods
pc_model = ParentChildModel()
pc_log_likelihood = pc_model.calculate_log_likelihood(parent_child_features)
non_pc_log_likelihood = pc_model.calculate_log_likelihood(not_parent_child_features)

print(f"Parent-child model log-likelihood for likely parent-child: {pc_log_likelihood:.2f}")
print(f"Parent-child model log-likelihood for non-parent-child: {non_pc_log_likelihood:.2f}")
print(f"Likelihood ratio: {np.exp(pc_log_likelihood - non_pc_log_likelihood):.2e}")

# Test age consistency
age_pairs = [(40, 20), (20, 40), (25, 20), (20, 15)]
print("\nAge consistency for parent-child relationship:")
for age1, age2 in age_pairs:
    is_consistent = pc_model.is_consistent_with_age(age1, age2)
    print(f"Ages {age1} and {age2}: {'Consistent' if is_consistent else 'Inconsistent'}")
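
The comments in `calculate_log_likelihood` note that the IBD1 and IBD2 fractions should really use a truncated normal restricted to [0, 1]. One possible way to do that is `scipy.stats.truncnorm`, sketched below. The helper name `truncated_normal_logpdf` is hypothetical; the mean of 1.0 and SD of 0.02 are the IBD1 parameters from `ParentChildModel`.

```python
import math
from scipy.stats import norm, truncnorm

def truncated_normal_logpdf(x, mean, sd, lower=0.0, upper=1.0):
    """Log-density of a normal(mean, sd) distribution truncated to [lower, upper]."""
    # truncnorm expects the bounds in standard-deviation units relative to the mean
    a = (lower - mean) / sd
    b = (upper - mean) / sd
    return truncnorm.logpdf(x, a, b, loc=mean, scale=sd)

observed_ibd1_fraction = 0.99
plain = norm.logpdf(observed_ibd1_fraction, 1.0, 0.02)
truncated = truncated_normal_logpdf(observed_ibd1_fraction, 1.0, 0.02)

print(f"Plain normal log-density:     {plain:.3f}")
print(f"Truncated normal log-density: {truncated:.3f}")
print(f"Difference: {truncated - plain:.3f} (log 2 = {math.log(2):.3f})")

# Values outside [0, 1] are impossible under the truncated model
print(f"Log-density at 1.05: {truncated_normal_logpdf(1.05, 1.0, 0.02)}")
```

Because the mean sits exactly on the upper bound, half of the untruncated normal's mass lies above 1. Truncation reassigns that mass to the valid range, which raises the log-density by log 2 everywhere inside [0, 1] and makes values outside it impossible. The correction matters when comparing against models whose means are far from the boundaries, since they receive almost no such adjustment.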

### 2.2 Full Siblings Relationship Model

Full siblings have a distinctive IBD sharing pattern that differs from parent-child:
- They share on average 50% of their DNA (like parent-child)
- But they have a mix of IBD0, IBD1, and IBD2 regions
- The expected proportions are 25% IBD0, 50% IBD1, and 25% IBD2
- However, there is significant variation in these proportions

Let's implement a specialized likelihood model for full siblings:

In [None]:
class FullSiblingsModel:
    """Likelihood model for full sibling relationships."""
    
    def __init__(self):
        """Initialize full siblings model parameters."""
        # Parameters for total IBD sharing
        self.expected_total_ibd = 2550
        self.total_ibd_sd = 200  # Higher variance than parent-child
        
        # Parameters for number of segments
        self.expected_segments = 35
        self.segments_sd = 5
        
        # Expected fraction of genome with IBD1 sharing
        self.expected_ibd1_fraction = 0.5
        self.ibd1_fraction_sd = 0.1
        
        # Expected fraction of genome with IBD2 sharing
        self.expected_ibd2_fraction = 0.25
        self.ibd2_fraction_sd = 0.1
    
    def calculate_log_likelihood(self, features):
        """Calculate log-likelihood of features under full siblings model.
        
        Args:
            features: Dictionary with keys 'total_ibd', 'num_segments', 'ibd1_fraction', 'ibd2_fraction'
        
        Returns:
            Log-likelihood value
        """
        log_likelihood = 0.0
        
        # Component for total IBD
        if 'total_ibd' in features:
            log_likelihood += norm.logpdf(features['total_ibd'], 
                                         self.expected_total_ibd, 
                                         self.total_ibd_sd)
        
        # Component for number of segments
        if 'num_segments' in features:
            log_likelihood += norm.logpdf(features['num_segments'], 
                                         self.expected_segments, 
                                         self.segments_sd)
        
        # Component for IBD1 fraction
        if 'ibd1_fraction' in features:
            log_likelihood += norm.logpdf(features['ibd1_fraction'], 
                                         self.expected_ibd1_fraction, 
                                         self.ibd1_fraction_sd)
        
        # Component for IBD2 fraction
        if 'ibd2_fraction' in features:
            log_likelihood += norm.logpdf(features['ibd2_fraction'], 
                                         self.expected_ibd2_fraction, 
                                         self.ibd2_fraction_sd)
        
        # IBD0 fraction is redundant (1 - IBD1 - IBD2)
        # but we could add a check for consistency
        if 'ibd1_fraction' in features and 'ibd2_fraction' in features:
            ibd0_fraction = 1 - features['ibd1_fraction'] - features['ibd2_fraction']
            expected_ibd0 = 1 - self.expected_ibd1_fraction - self.expected_ibd2_fraction
            
            # Add a consistency check (should be close to 0.25 for siblings)
            if ibd0_fraction < 0 or ibd0_fraction > 1:
                # Invalid fractions, highly unlikely to be siblings
                log_likelihood -= 100  # Large penalty
            else:
                # Check if IBD0 fraction is consistent with siblings
                log_likelihood += norm.logpdf(ibd0_fraction, expected_ibd0, 0.1)
        
        return log_likelihood
    
    def is_consistent_with_sex(self, sex1, sex2):
        """Check if the sexes are consistent with a full sibling relationship.
        
        Args:
            sex1: Sex of individual 1 ('M', 'F', or 'U' for unknown)
            sex2: Sex of individual 2 ('M', 'F', or 'U' for unknown)
            
        Returns:
            Boolean indicating whether sexes are consistent
        """
        # Any combination of M/F/U is consistent with full siblings
        return True
    
    def is_consistent_with_age(self, age1, age2, max_sibling_age_diff=30):
        """Check if the ages are consistent with a full sibling relationship.
        
        Args:
            age1: Age of individual 1 (or None if unknown)
            age2: Age of individual 2 (or None if unknown)
            max_sibling_age_diff: Maximum plausible age difference between siblings
            
        Returns:
            Boolean indicating whether ages are consistent
        """
        if age1 is None or age2 is None:
            # If ages are unknown, we can't constrain
            return True
        
        # Siblings should be within a reasonable age range of each other
        age_diff = abs(age1 - age2)
        
        return age_diff <= max_sibling_age_diff

# Example full siblings features
full_siblings_features = {
    'total_ibd': 2600,
    'num_segments': 36,
    'ibd1_fraction': 0.48,
    'ibd2_fraction': 0.27
}

# Not full siblings features (more like parent-child)
not_full_siblings_features = {
    'total_ibd': 3520,
    'num_segments': 42,
    'ibd1_fraction': 0.99,
    'ibd2_fraction': 0.01
}

# Create model and calculate likelihoods
fs_model = FullSiblingsModel()
fs_log_likelihood = fs_model.calculate_log_likelihood(full_siblings_features)
non_fs_log_likelihood = fs_model.calculate_log_likelihood(not_full_siblings_features)

print(f"Full siblings model log-likelihood for likely full siblings: {fs_log_likelihood:.2f}")
print(f"Full siblings model log-likelihood for non-full siblings: {non_fs_log_likelihood:.2f}")
print(f"Likelihood ratio: {np.exp(fs_log_likelihood - non_fs_log_likelihood):.2e}")

# Test age consistency
age_pairs = [(40, 35), (20, 10), (25, 65), (5, 5)]
print("\nAge consistency for full sibling relationship:")
for age1, age2 in age_pairs:
    is_consistent = fs_model.is_consistent_with_age(age1, age2)
    print(f"Ages {age1} and {age2}: {'Consistent' if is_consistent else 'Inconsistent'}")
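
As the comment in `calculate_log_likelihood` notes, the IBD0 fraction is fully determined by the other two, so scoring all three with separate normal terms counts the same information more than once. One alternative, sketched here as an assumption rather than as Bonsai's method, is to model the three fractions jointly with a Dirichlet distribution, which is defined only on vectors that sum to 1. The concentration parameters below are hypothetical and chosen so that the mean is (0.25, 0.50, 0.25).

```python
import numpy as np
from scipy.stats import dirichlet

# Hypothetical concentration parameters for (IBD0, IBD1, IBD2);
# the mean is alpha / alpha.sum() = (0.25, 0.50, 0.25)
FULL_SIBLING_ALPHA = np.array([5.0, 10.0, 5.0])

def ibd_state_log_likelihood(ibd1_fraction, ibd2_fraction,
                             alpha=FULL_SIBLING_ALPHA, eps=1e-6):
    """Joint log-density of the (IBD0, IBD1, IBD2) fractions under a Dirichlet model."""
    ibd0_fraction = 1.0 - ibd1_fraction - ibd2_fraction
    fractions = np.array([ibd0_fraction, ibd1_fraction, ibd2_fraction])

    # Reject clearly invalid inputs (allowing for floating-point rounding)
    if np.any(fractions < -1e-9):
        return float('-inf')

    # Keep values strictly inside the simplex, then renormalize so they sum to 1
    fractions = np.clip(fractions, eps, None)
    fractions = fractions / fractions.sum()
    return dirichlet.logpdf(fractions, alpha)

sibling_like = ibd_state_log_likelihood(0.48, 0.27)
parent_child_like = ibd_state_log_likelihood(0.99, 0.01)
invalid = ibd_state_log_likelihood(0.80, 0.50)

print(f"Sibling-like fractions:      {sibling_like:.2f}")
print(f"Parent-child-like fractions: {parent_child_like:.2f}")
print(f"Invalid fractions:           {invalid}")
```

The sum of the concentration parameters controls how tightly the fractions cluster around the mean, so in practice it would need to be calibrated against simulated or empirical sibling pairs.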

### 2.3 Half Siblings, Grandparent-Grandchild, and Avuncular Relationships

Half siblings, grandparent-grandchild, and avuncular (aunt/uncle-niece/nephew) relationships all share approximately 25% of their DNA (coefficient of relatedness r = 0.25). However, there are subtle differences in their IBD patterns that can sometimes help distinguish them:

- All share approximately 1700 cM of DNA on average
- All three have IBD1 regions only (no IBD2); sharing 25% of the DNA corresponds to IBD1 over roughly half of the genome
- The number and length distribution of segments differ slightly between these relationships

Let's implement a model for these relationships:

In [ ]:
class HalfSiblingGrandparentAvuncularModel:
    """Likelihood model for r=0.25 relationships (half siblings, grandparent-grandchild, avuncular)."""
    
    def __init__(self, relationship_type='half-siblings'):
        """Initialize model parameters.
        
        Args:
            relationship_type: One of 'half-siblings', 'grandparent-grandchild', or 'avuncular'
        """
        self.relationship_type = relationship_type
        
        # Common parameters for all r=0.25 relationships
        self.expected_total_ibd = 1700
        self.total_ibd_sd = 300
        
        # Parameters specific to each relationship type
        if relationship_type == 'half-siblings':
            self.expected_segments = 25
            self.segments_sd = 5
            self.expected_segment_mean_length = 68  # Expected mean length of segments
            self.segment_mean_length_sd = 15
        elif relationship_type == 'grandparent-grandchild':
            self.expected_segments = 30
            self.segments_sd = 5
            self.expected_segment_mean_length = 57  # Slightly shorter segments on average
            self.segment_mean_length_sd = 12
        elif relationship_type == 'avuncular':
            self.expected_segments = 28
            self.segments_sd = 5
            self.expected_segment_mean_length = 61  # Intermediate
            self.segment_mean_length_sd = 13
        else:
            raise ValueError(f"Unknown relationship type: {relationship_type}")
        
        # Common to all r=0.25 relationships
        self.expected_ibd1_fraction = 0.25
        self.ibd1_fraction_sd = 0.05
        self.expected_ibd2_fraction = 0.0
        self.ibd2_fraction_sd = 0.01  # Small to allow for measurement error
    
    def calculate_log_likelihood(self, features):
        """Calculate log-likelihood of features under this model.
        
        Args:
            features: Dictionary with keys 'total_ibd', 'num_segments', 'ibd1_fraction', 
                     'ibd2_fraction', and optionally 'mean_segment_length'
        
        Returns:
            Log-likelihood value
        """
        log_likelihood = 0.0
        
        # Component for total IBD
        if 'total_ibd' in features:
            log_likelihood += norm.logpdf(features['total_ibd'], 
                                         self.expected_total_ibd, 
                                         self.total_ibd_sd)
        
        # Component for number of segments
        if 'num_segments' in features:
            log_likelihood += norm.logpdf(features['num_segments'], 
                                         self.expected_segments, 
                                         self.segments_sd)
        
        # Component for IBD1 fraction
        if 'ibd1_fraction' in features:
            log_likelihood += norm.logpdf(features['ibd1_fraction'], 
                                         self.expected_ibd1_fraction, 
                                         self.ibd1_fraction_sd)
        
        # Component for IBD2 fraction (should be close to 0)
        if 'ibd2_fraction' in features:
            log_likelihood += norm.logpdf(features['ibd2_fraction'], 
                                         self.expected_ibd2_fraction, 
                                         self.ibd2_fraction_sd)
        
        # Component for mean segment length (helpful for distinguishing between r=0.25 relationships)
        if 'mean_segment_length' in features:
            log_likelihood += norm.logpdf(features['mean_segment_length'], 
                                         self.expected_segment_mean_length, 
                                         self.segment_mean_length_sd)
        
        return log_likelihood
    
    def is_consistent_with_sex(self, sex1, sex2):
        """Check if the sexes are consistent with this relationship.
        
        Args:
            sex1: Sex of individual 1 ('M', 'F', or 'U' for unknown)
            sex2: Sex of individual 2 ('M', 'F', or 'U' for unknown)
            
        Returns:
            Boolean indicating whether sexes are consistent
        """
        # None of the r=0.25 relationships constrains the sexes of the pair:
        # half siblings, grandparent-grandchild, and avuncular pairs can each
        # involve any combination of sexes. The relationship type is already
        # validated in __init__, so no branching is needed here.
        return True
    
    def is_consistent_with_age(self, age1, age2):
        """Check if the ages are consistent with this relationship.
        
        Args:
            age1: Age of individual 1 (or None if unknown)
            age2: Age of individual 2 (or None if unknown)
            
        Returns:
            Boolean indicating whether ages are consistent
        """
        if age1 is None or age2 is None:
            return True
        
        # For half-siblings, similar constraints as full siblings
        if self.relationship_type == 'half-siblings':
            return abs(age1 - age2) <= 30  # Arbitrary threshold
        
        # For grandparent-grandchild, should be at least 30 years apart typically
        elif self.relationship_type == 'grandparent-grandchild':
            return abs(age1 - age2) >= 30
        
        # For avuncular, require at least 12 years apart (a heuristic threshold;
        # these pairs can be closer in age than grandparent-grandchild pairs)
        elif self.relationship_type == 'avuncular':
            return abs(age1 - age2) >= 12
        
        return False  # Should not reach here

# Example half-sibling features
half_sibling_features = {
    'total_ibd': 1750,
    'num_segments': 26,
    'ibd1_fraction': 0.26,
    'ibd2_fraction': 0.01,
    'mean_segment_length': 67
}

# Example grandparent-grandchild features
grandparent_features = {
    'total_ibd': 1680,
    'num_segments': 31,
    'ibd1_fraction': 0.25,
    'ibd2_fraction': 0.0,
    'mean_segment_length': 54
}

# Example avuncular features
avuncular_features = {
    'total_ibd': 1720,
    'num_segments': 29,
    'ibd1_fraction': 0.25,
    'ibd2_fraction': 0.005,
    'mean_segment_length': 59
}

# Create models
half_sibling_model = HalfSiblingGrandparentAvuncularModel('half-siblings')
grandparent_model = HalfSiblingGrandparentAvuncularModel('grandparent-grandchild')
avuncular_model = HalfSiblingGrandparentAvuncularModel('avuncular')

# Compare all models on all examples
models = [half_sibling_model, grandparent_model, avuncular_model]
examples = [
    ('Half-sibling', half_sibling_features),
    ('Grandparent-Grandchild', grandparent_features),
    ('Avuncular', avuncular_features)
]

print("Log-likelihoods for r=0.25 relationships:")
print("-" * 80)
print(f"{'Example Data':<25} {'Half-Sibling Model':<20} {'Grandparent Model':<20} {'Avuncular Model':<20}")
print("-" * 80)

for example_name, features in examples:
    likelihoods = []
    for model in models:
        log_likelihood = model.calculate_log_likelihood(features)
        likelihoods.append(log_likelihood)
    
    print(f"{example_name:<25} {likelihoods[0]:<20.2f} {likelihoods[1]:<20.2f} {likelihoods[2]:<20.2f}")

print("\nMost likely relationship for each example:")
for example_name, features in examples:
    best_model_idx = np.argmax([model.calculate_log_likelihood(features) for model in models])
    best_model = models[best_model_idx]
    print(f"{example_name}: Most likely {best_model.relationship_type}")

# Analyze age consistency
age_pairs = [(40, 35), (60, 25), (30, 10), (25, 15)]
print("\nAge consistency by relationship type:")
for age1, age2 in age_pairs:
    hs_consistent = half_sibling_model.is_consistent_with_age(age1, age2)
    gp_consistent = grandparent_model.is_consistent_with_age(age1, age2)
    av_consistent = avuncular_model.is_consistent_with_age(age1, age2)
    
    print(f"Ages {age1} and {age2}:")
    print(f"  Half-sibling: {'Consistent' if hs_consistent else 'Inconsistent'}")
    print(f"  Grandparent-Grandchild: {'Consistent' if gp_consistent else 'Inconsistent'}")
    print(f"  Avuncular: {'Consistent' if av_consistent else 'Inconsistent'}")
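
Raw log-likelihoods are hard to compare by eye when the three r = 0.25 models score within a few units of each other. One common way to summarise them, sketched here under the assumption that each relationship has equal prior probability, is to normalise them into posterior probabilities. The function `posterior_from_log_likelihoods` and the input values are hypothetical and chosen only for illustration.

```python
import math

def posterior_from_log_likelihoods(log_likelihoods):
    """Convert {hypothesis: log-likelihood} into posterior probabilities.

    Assumes a uniform prior over hypotheses. Subtracting the maximum before
    exponentiating (the log-sum-exp trick) avoids numerical underflow.
    """
    max_ll = max(log_likelihoods.values())
    weights = {name: math.exp(ll - max_ll) for name, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Hypothetical log-likelihoods for one pair under the three models
example_lls = {
    'half-siblings': -12.1,
    'grandparent-grandchild': -13.0,
    'avuncular': -12.4,
}

posteriors = posterior_from_log_likelihoods(example_lls)
for name, prob in sorted(posteriors.items(), key=lambda kv: -kv[1]):
    print(f"{name:<25} {prob:.3f}")
```

When the posteriors are close to each other, as they are for these inputs, the IBD features alone do not separate the three relationships well, and age or other metadata has to carry more of the decision.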

### 2.4 Distant Relationship Models

For more distant relationships (first cousins, second cousins, etc.), the amount of IBD sharing decreases and becomes more variable. The length distribution of IBD segments also becomes more important for accurately identifying these relationships.

Let's implement models for distant relationships that incorporate both the total amount of sharing and the segment length distribution:

In [ ]:
class DistantRelationshipModel:
    """Likelihood model for distant relationships."""
    
    def __init__(self, relationship_type):
        """Initialize distant relationship model parameters.
        
        Args:
            relationship_type: One of 'first-cousins', 'first-cousins-once-removed', 
                              'second-cousins', 'second-cousins-once-removed', or 'third-cousins'
        """
        self.relationship_type = relationship_type
        
        # Set parameters based on relationship type
        if relationship_type == 'first-cousins':
            self.coef_relatedness = 0.125  # r = 1/8
            self.expected_total_ibd = 850
            self.total_ibd_sd = 200
            self.expected_segments = 15
            self.segments_sd = 5
            self.expected_longest_segment = 75
            self.longest_segment_sd = 20
        elif relationship_type == 'first-cousins-once-removed':
            self.coef_relatedness = 0.0625  # r = 1/16
            self.expected_total_ibd = 425
            self.total_ibd_sd = 150
            self.expected_segments = 10
            self.segments_sd = 4
            self.expected_longest_segment = 55
            self.longest_segment_sd = 15
        elif relationship_type == 'second-cousins':
            self.coef_relatedness = 0.03125  # r = 1/32
            self.expected_total_ibd = 212
            self.total_ibd_sd = 100
            self.expected_segments = 5
            self.segments_sd = 3
            self.expected_longest_segment = 45
            self.longest_segment_sd = 15
        elif relationship_type == 'second-cousins-once-removed':
            self.coef_relatedness = 0.015625  # r = 1/64
            self.expected_total_ibd = 106
            self.total_ibd_sd = 70
            self.expected_segments = 3
            self.segments_sd = 2
            self.expected_longest_segment = 35
            self.longest_segment_sd = 12
        elif relationship_type == 'third-cousins':
            self.coef_relatedness = 0.0078125  # r = 1/128
            self.expected_total_ibd = 53
            self.total_ibd_sd = 50
            self.expected_segments = 2
            self.segments_sd = 1.5
            self.expected_longest_segment = 28
            self.longest_segment_sd = 10
        else:
            raise ValueError(f"Unknown relationship type: {relationship_type}")
        
        # Expected segment length
        # For distant relationships, segments get shorter as relationship distance increases
        # This is due to more recombination events between the common ancestor and the relatives
        self.expected_mean_segment_length = self.expected_total_ibd / max(1, self.expected_segments)
        self.mean_segment_length_sd = self.expected_mean_segment_length * 0.3  # 30% variation
        
        # All distant relationships should have only IBD1 sharing (no IBD2)
        self.expected_ibd1_fraction = self.coef_relatedness
        self.ibd1_fraction_sd = self.coef_relatedness * 0.4  # 40% variation
        self.expected_ibd2_fraction = 0.0
        self.ibd2_fraction_sd = 0.005  # Very small, just for measurement error
    
    def calculate_log_likelihood(self, features):
        """Calculate log-likelihood of features under this model.
        
        Args:
            features: Dictionary with keys 'total_ibd', 'num_segments', 'mean_segment_length',
                     'longest_segment', 'ibd1_fraction', 'ibd2_fraction'
        
        Returns:
            Log-likelihood value
        """
        log_likelihood = 0.0
        
        # Component for total IBD
        if 'total_ibd' in features:
            log_likelihood += norm.logpdf(features['total_ibd'], 
                                         self.expected_total_ibd, 
                                         self.total_ibd_sd)
        
        # Component for number of segments
        if 'num_segments' in features:
            # For very distant relationships, use Poisson distribution
            if self.expected_segments < 5:
                log_likelihood += poisson.logpmf(features['num_segments'], self.expected_segments)
            else:
                log_likelihood += norm.logpdf(features['num_segments'], 
                                             self.expected_segments, 
                                             self.segments_sd)
        
        # Component for IBD1 fraction
        if 'ibd1_fraction' in features:
            log_likelihood += norm.logpdf(features['ibd1_fraction'], 
                                         self.expected_ibd1_fraction, 
                                         self.ibd1_fraction_sd)
        
        # Component for IBD2 fraction (should be close to 0)
        if 'ibd2_fraction' in features:
            log_likelihood += norm.logpdf(features['ibd2_fraction'], 
                                         self.expected_ibd2_fraction, 
                                         self.ibd2_fraction_sd)
        
        # Component for mean segment length
        if 'mean_segment_length' in features:
            log_likelihood += norm.logpdf(features['mean_segment_length'], 
                                         self.expected_mean_segment_length, 
                                         self.mean_segment_length_sd)
        
        # Component for longest segment (very informative for distant relationships)
        if 'longest_segment' in features:
            log_likelihood += norm.logpdf(features['longest_segment'], 
                                         self.expected_longest_segment, 
                                         self.longest_segment_sd)
        
        return log_likelihood
    
    def is_consistent_with_age(self, age1, age2):
        """Check if the ages are consistent with this relationship.
        
        Args:
            age1: Age of individual 1 (or None if unknown)
            age2: Age of individual 2 (or None if unknown)
            
        Returns:
            Boolean indicating whether ages are consistent
        """
        # For distant relationships, almost any age difference is plausible
        # We'll just apply some very loose constraints
        if age1 is None or age2 is None:
            return True
        
        # Calculate minimum plausible age difference based on relationship
        if self.relationship_type == 'first-cousins':
            min_plausible_age_diff = 0  # Can be the same age
        elif self.relationship_type == 'first-cousins-once-removed':
            min_plausible_age_diff = 15  # Should be generational difference
        elif self.relationship_type == 'second-cousins':
            min_plausible_age_diff = 0  # Can be the same age
        elif self.relationship_type == 'second-cousins-once-removed':
            min_plausible_age_diff = 15  # Should be generational difference
        elif self.relationship_type == 'third-cousins':
            min_plausible_age_diff = 0  # Can be the same age
        else:
            return True  # Unknown relationship, be conservative
        
        # Only the once-removed relationships have a non-zero minimum, so
        # same-generation cousins always pass this check
        return abs(age1 - age2) >= min_plausible_age_diff

# Create example data for distant relationships
distant_relationship_examples = [
    ('First Cousins', {
        'total_ibd': 870,
        'num_segments': 16,
        'mean_segment_length': 54,
        'longest_segment': 80,
        'ibd1_fraction': 0.13,
        'ibd2_fraction': 0.0
    }),
    ('First Cousins Once Removed', {
        'total_ibd': 410,
        'num_segments': 9,
        'mean_segment_length': 45,
        'longest_segment': 60,
        'ibd1_fraction': 0.06,
        'ibd2_fraction': 0.0
    }),
    ('Second Cousins', {
        'total_ibd': 220,
        'num_segments': 6,
        'mean_segment_length': 37,
        'longest_segment': 48,
        'ibd1_fraction': 0.03,
        'ibd2_fraction': 0.0
    }),
    ('Second Cousins Once Removed', {
        'total_ibd': 105,
        'num_segments': 3,
        'mean_segment_length': 35,
        'longest_segment': 40,
        'ibd1_fraction': 0.015,
        'ibd2_fraction': 0.0
    }),
    ('Third Cousins', {
        'total_ibd': 60,
        'num_segments': 2,
        'mean_segment_length': 30,
        'longest_segment': 35,
        'ibd1_fraction': 0.009,
        'ibd2_fraction': 0.0
    })
]

# Create models for distant relationships
distant_models = [
    DistantRelationshipModel('first-cousins'),
    DistantRelationshipModel('first-cousins-once-removed'),
    DistantRelationshipModel('second-cousins'),
    DistantRelationshipModel('second-cousins-once-removed'),
    DistantRelationshipModel('third-cousins')
]

# Calculate log-likelihoods for each example using each model
print("Log-likelihoods for distant relationships:")
print("-" * 100)
header = f"{'Example Data':<25}"
for model in distant_models:
    header += f"{model.relationship_type:<20}"
print(header)
print("-" * 100)

for example_name, features in distant_relationship_examples:
    line = f"{example_name:<25}"
    for model in distant_models:
        log_likelihood = model.calculate_log_likelihood(features)
        line += f"{log_likelihood:<20.2f}"
    print(line)

print("\nMost likely relationship for each example:")
for example_name, features in distant_relationship_examples:
    log_likelihoods = [model.calculate_log_likelihood(features) for model in distant_models]
    best_model_idx = np.argmax(log_likelihoods)
    best_model = distant_models[best_model_idx]
    print(f"{example_name}: Most likely {best_model.relationship_type}")

# Visualize the expected total IBD for different relationships
plt.figure(figsize=(12, 6))
relationships = [
    'Parent-Child', 'Full Siblings', 'Half Siblings/\nGrandparent/\nAvuncular', 
    'First Cousins', '1C1R', '2C', '2C1R', '3C'
]
expected_ibd = [3540, 2550, 1700, 850, 425, 212, 106, 53]
std_dev = [60, 200, 300, 200, 150, 100, 70, 50]

plt.errorbar(range(len(relationships)), expected_ibd, yerr=std_dev, fmt='o', capsize=5)
plt.xticks(range(len(relationships)), relationships, rotation=45)
plt.ylim(0, 4000)
plt.grid(True, alpha=0.3)
plt.title('Expected Total IBD Sharing by Relationship Type')
plt.ylabel('Expected Total IBD (cM)')
plt.tight_layout()
plt.show()

# Visualize segment length distribution for different relationships
plt.figure(figsize=(10, 6))

# Parameters for normal distribution of segment lengths
relationships = [
    'First Cousins', 'First Cousins Once Removed', 
    'Second Cousins', 'Second Cousins Once Removed', 'Third Cousins'
]
segment_means = [54, 45, 37, 35, 30]
segment_sds = [20, 15, 12, 10, 8]

x = np.linspace(0, 100, 1000)

for i, rel in enumerate(relationships):
    y = norm.pdf(x, segment_means[i], segment_sds[i])
    plt.plot(x, y, label=rel)

plt.xlabel('Segment Length (cM)')
plt.ylabel('Probability Density')
plt.title('Segment Length Distributions by Relationship Type')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
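
The expected totals used in the distant-relationship models follow a simple halving pattern. As a rough sketch, assuming two shared ancestors, no inbreeding, and a combined map length of about 6800 cM across both haplotypes (an assumed round figure, chosen because it reproduces the 850 cM first-cousin value used above), the coefficient of relatedness for cousins separated by d meioses is 2 · (1/2)^d. The helper names below are hypothetical.

```python
def coefficient_of_relatedness(meioses, num_common_ancestors=2):
    """r = (number of common ancestors) * (1/2) ** meioses, assuming no inbreeding."""
    return num_common_ancestors * 0.5 ** meioses

def expected_total_ibd(meioses, num_common_ancestors=2, diploid_map_length_cm=6800):
    """Expected total IBD in cM; the 6800 cM map length is an assumed round figure."""
    r = coefficient_of_relatedness(meioses, num_common_ancestors)
    return r * diploid_map_length_cm

# Number of meioses separating each pair of cousins
cousin_meioses = {
    'first-cousins': 4,
    'first-cousins-once-removed': 5,
    'second-cousins': 6,
    'second-cousins-once-removed': 7,
    'third-cousins': 8,
}

for name, d in cousin_meioses.items():
    r = coefficient_of_relatedness(d)
    print(f"{name:<30} d={d}  r={r:.7f}  expected IBD={expected_total_ibd(d):.1f} cM")
```

Each additional meiosis halves both r and the expected total, which is why the relative spread of observed sharing grows quickly for more distant cousins.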

## 3. Advanced Likelihood Calculation Techniques

So far, we've implemented relationship models that score each feature independently, mostly with normal distributions. This approach ignores dependencies between features (for example, total IBD and the number of segments are correlated) and may not capture the full complexity of IBD sharing patterns. Let's explore more advanced techniques for likelihood calculation.

### 3.1 Multivariate Probability Distributions

Instead of modeling each feature independently, we can use multivariate probability distributions to capture the dependencies between features. The multivariate normal distribution is a common choice:

In [ ]:
class MultivariateRelationshipModel:
    """Relationship model using multivariate normal distribution."""
    
    def __init__(self, relationship_type):
        """Initialize multivariate relationship model.
        
        Args:
            relationship_type: Type of relationship to model
        """
        self.relationship_type = relationship_type
        
        # Define mean vectors and covariance matrices for different relationships
        if relationship_type == 'parent-child':
            # Features: [total_ibd, num_segments, ibd1_fraction, ibd2_fraction]
            self.mean = np.array([3540, 40, 1.0, 0.0])
            
            # Covariance matrix - positive correlation between total_ibd and num_segments
            self.cov = np.array([
                [3600, 100, 0.01, 0],     # total_ibd variance and covariances
                [100, 25, 0.01, 0],       # num_segments variance and covariances
                [0.01, 0.01, 0.0004, 0],  # ibd1_fraction variance and covariances
                [0, 0, 0, 0.0001]         # ibd2_fraction variance and covariances
            ])
            
        elif relationship_type == 'full-siblings':
            # Features: [total_ibd, num_segments, ibd1_fraction, ibd2_fraction]
            self.mean = np.array([2550, 35, 0.5, 0.25])
            
            # Covariance matrix - more complex correlations for siblings
            self.cov = np.array([
                [40000, 500, 0.2, 0.1],   # total_ibd variance and covariances
                [500, 25, 0.05, 0.02],    # num_segments variance and covariances
                [0.2, 0.05, 0.01, -0.005], # ibd1_fraction variance and covariances (negative correlation with ibd2)
                [0.1, 0.02, -0.005, 0.01]  # ibd2_fraction variance and covariances
            ])
            
        elif relationship_type == 'half-siblings':
            # Features: [total_ibd, num_segments, ibd1_fraction, ibd2_fraction]
            self.mean = np.array([1700, 25, 0.25, 0.0])
            
            # Covariance matrix
            self.cov = np.array([
                [90000, 400, 0.15, 0],    # total_ibd variance and covariances
                [400, 25, 0.04, 0],       # num_segments variance and covariances
                [0.15, 0.04, 0.0025, 0],  # ibd1_fraction variance and covariances
                [0, 0, 0, 0.0001]         # ibd2_fraction variance and covariances
            ])
            
        elif relationship_type == 'first-cousins':
            # Features: [total_ibd, num_segments, ibd1_fraction, ibd2_fraction]
            self.mean = np.array([850, 15, 0.125, 0.0])
            
            # Covariance matrix
            self.cov = np.array([
                [40000, 300, 0.1, 0],     # total_ibd variance and covariances
                [300, 25, 0.03, 0],       # num_segments variance and covariances
                [0.1, 0.03, 0.0016, 0],   # ibd1_fraction variance and covariances
                [0, 0, 0, 0.0001]         # ibd2_fraction variance and covariances
            ])
            
        elif relationship_type == 'second-cousins':
            # Features: [total_ibd, num_segments, ibd1_fraction, ibd2_fraction]
            self.mean = np.array([212, 5, 0.03125, 0.0])
            
            # Covariance matrix
            self.cov = np.array([
                [10000, 150, 0.05, 0],    # total_ibd variance and covariances
                [150, 9, 0.01, 0],        # num_segments variance and covariances
                [0.05, 0.01, 0.0004, 0],  # ibd1_fraction variance and covariances
                [0, 0, 0, 0.0001]         # ibd2_fraction variance and covariances
            ])
            
        elif relationship_type == 'unrelated':
            # Features: [total_ibd, num_segments, ibd1_fraction, ibd2_fraction]
            self.mean = np.array([30, 1, 0.004, 0.0])
            
            # Covariance matrix
            self.cov = np.array([
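                # Covariance of 25 (correlation 0.5) between total_ibd and num_segments;
                # a value of 50 (correlation 1.0) would make the matrix non-positive-semidefinite
                [2500, 25, 0.01, 0],      # total_ibd variance and covariances
                [25, 1, 0.005, 0],        # num_segments variance and covariances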
                [0.01, 0.005, 0.0001, 0], # ibd1_fraction variance and covariances
                [0, 0, 0, 0.0001]         # ibd2_fraction variance and covariances
            ])
            
        else:
            raise ValueError(f"Unknown relationship type: {relationship_type}")
    
    def calculate_log_likelihood(self, features):
        """Calculate log-likelihood of features using multivariate normal distribution.
        
        Args:
            features: Dictionary with keys 'total_ibd', 'num_segments', 'ibd1_fraction', 'ibd2_fraction'
            
        Returns:
            Log-likelihood value
        """
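        # Extract features as vector in the correct order.
        # A missing feature falls back to the model mean, so it is treated as a
        # perfect match; this can overstate the likelihood when features are absent.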
        feature_vector = np.array([
            features.get('total_ibd', self.mean[0]),
            features.get('num_segments', self.mean[1]),
            features.get('ibd1_fraction', self.mean[2]),
            features.get('ibd2_fraction', self.mean[3])
        ])
        
        # Calculate log-likelihood using multivariate normal distribution
        log_likelihood = multivariate_normal.logpdf(feature_vector, mean=self.mean, cov=self.cov)
        
        return log_likelihood

# Create multivariate models for different relationships
mv_models = {
    'parent-child': MultivariateRelationshipModel('parent-child'),
    'full-siblings': MultivariateRelationshipModel('full-siblings'),
    'half-siblings': MultivariateRelationshipModel('half-siblings'),
    'first-cousins': MultivariateRelationshipModel('first-cousins'),
    'second-cousins': MultivariateRelationshipModel('second-cousins'),
    'unrelated': MultivariateRelationshipModel('unrelated')
}

# Test with some example data
test_examples = [
    ('Likely Parent-Child', {
        'total_ibd': 3520,
        'num_segments': 42,
        'ibd1_fraction': 0.99,
        'ibd2_fraction': 0.01
    }),
    ('Likely Full Siblings', {
        'total_ibd': 2600,
        'num_segments': 36,
        'ibd1_fraction': 0.48,
        'ibd2_fraction': 0.27
    }),
    ('Likely Half Siblings', {
        'total_ibd': 1750,
        'num_segments': 26,
        'ibd1_fraction': 0.26,
        'ibd2_fraction': 0.01
    }),
    ('Likely First Cousins', {
        'total_ibd': 870,
        'num_segments': 16,
        'ibd1_fraction': 0.13,
        'ibd2_fraction': 0.0
    }),
    ('Likely Second Cousins', {
        'total_ibd': 220,
        'num_segments': 6,
        'ibd1_fraction': 0.03,
        'ibd2_fraction': 0.0
    }),
    ('Likely Unrelated', {
        'total_ibd': 35,
        'num_segments': 1,
        'ibd1_fraction': 0.005,
        'ibd2_fraction': 0.0
    })
]

# Calculate log-likelihoods for each example using each model
print("Multivariate model log-likelihoods:")
print("-" * 100)
header = f"{'Example Data':<25}"
for model_name in mv_models.keys():
    header += f"{model_name:<20}"
print(header)
print("-" * 100)

for example_name, features in test_examples:
    line = f"{example_name:<25}"
    for model_name, model in mv_models.items():
        log_likelihood = model.calculate_log_likelihood(features)
        line += f"{log_likelihood:<20.2f}"
    print(line)

print("\nMost likely relationship for each example:")
for example_name, features in test_examples:
    log_likelihoods = {model_name: model.calculate_log_likelihood(features) 
                      for model_name, model in mv_models.items()}
    best_relationship = max(log_likelihoods.items(), key=lambda x: x[1])[0]
    max_log_likelihood = log_likelihoods[best_relationship]
    print(f"{example_name}: Most likely {best_relationship} (log-likelihood: {max_log_likelihood:.2f})")

# Visualize feature correlations for full siblings
full_sib_model = mv_models['full-siblings']

# Generate random samples from full sibling model
np.random.seed(42)
full_sib_samples = np.random.multivariate_normal(full_sib_model.mean, full_sib_model.cov, 1000)

# Create a pandas DataFrame for plotting
full_sib_df = pd.DataFrame(full_sib_samples, columns=['total_ibd', 'num_segments', 'ibd1_fraction', 'ibd2_fraction'])

# Scatterplot matrix (scatter_matrix creates its own figure, so a separate
# plt.figure call would only leave an empty figure behind)
pd.plotting.scatter_matrix(full_sib_df, alpha=0.3, figsize=(12, 10), diagonal='kde')
plt.suptitle('Feature Correlations for Full Siblings Model', fontsize=16)
plt.tight_layout()
plt.subplots_adjust(top=0.95)
plt.show()
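
Hand-written covariance matrices are easy to get wrong: a valid covariance matrix must be symmetric and positive semidefinite, and SciPy's `multivariate_normal` raises an error for matrices that are not. One way to reduce the risk, sketched below with hypothetical helper names, is to build the matrix from standard deviations and a correlation matrix and to check its eigenvalues before use. The numbers are illustrative and are not fitted to data.

```python
import numpy as np

def covariance_from_correlation(sds, corr):
    """Build a covariance matrix D @ R @ D from standard deviations and correlations."""
    sds = np.asarray(sds, dtype=float)
    corr = np.asarray(corr, dtype=float)
    return np.outer(sds, sds) * corr

def is_positive_semidefinite(cov, tol=1e-10):
    """Return True if cov is symmetric and no eigenvalue falls below a small
    negative tolerance (scaled by the largest eigenvalue)."""
    cov = np.asarray(cov, dtype=float)
    if not np.allclose(cov, cov.T):
        return False
    eigenvalues = np.linalg.eigvalsh(cov)
    return bool(eigenvalues.min() >= -tol * max(1.0, abs(eigenvalues.max())))

# Features: [total_ibd, num_segments, ibd1_fraction, ibd2_fraction]
sds = [200, 5, 0.1, 0.1]
corr = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, -0.5],
    [0.0, 0.0, -0.5, 1.0],
]
cov = covariance_from_correlation(sds, corr)
print("Valid covariance:", is_positive_semidefinite(cov))

# A correlation of 1.0 between two features, combined with different
# correlations to a third feature, is not a valid covariance structure.
bad_corr = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]
bad_cov = covariance_from_correlation([50, 1, 0.01], bad_corr)
print("Valid covariance:", is_positive_semidefinite(bad_cov))
```

Specifying correlations rather than raw covariances also makes the intended strength of each dependency easier to read, since features such as total IBD (in cM) and IBD fractions live on very different scales.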

### 3.2 Segment-Based Likelihood Models

Another advanced approach is to model the likelihood based on the properties of individual IBD segments rather than summary statistics. This can provide more detailed information about the relationship, especially for distant relationships where the number of segments is small.
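
To see why individual segment lengths carry information, consider a simplified model in which each IBD segment length is exponentially distributed with mean 100/d cM, where d is the number of meioses separating the two relatives. Under that assumption (which ignores chromosome ends and the minimum detectable segment length), the maximum-likelihood estimate of d is 100 divided by the mean observed length. The function names and segment lengths below are hypothetical.

```python
import math

def segment_lengths_log_likelihood(segment_lengths_cm, meioses):
    """Log-likelihood of segment lengths under an exponential model
    with rate meioses / 100 per cM."""
    rate = meioses / 100.0
    return sum(math.log(rate) - rate * length for length in segment_lengths_cm)

def estimate_meioses(segment_lengths_cm):
    """Maximum-likelihood estimate of d under the exponential model: 100 / mean length."""
    mean_length = sum(segment_lengths_cm) / len(segment_lengths_cm)
    return 100.0 / mean_length

# Hypothetical segment lengths in cM for one pair of relatives
segments = [42.0, 18.5, 31.0, 9.5, 24.0]

d_hat = estimate_meioses(segments)
print(f"Estimated meioses: {d_hat:.2f}")

# Compare integer candidates for d; the best one should sit next to d_hat
for d in range(1, 9):
    ll = segment_lengths_log_likelihood(segments, d)
    print(f"d={d}: log-likelihood = {ll:.2f}")
```

With only a handful of segments the likelihood surface is fairly flat around its maximum, so neighbouring values of d are often hard to separate from segment lengths alone.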

In [ ]:
class SegmentBasedModel:
    """Likelihood model based on individual IBD segments."""
    
    def __init__(self, relationship_type):
        """Initialize segment-based relationship model.
        
        Args:
            relationship_type: Type of relationship to model
        """
        self.relationship_type = relationship_type
        
        # Number of meioses (recombination events) between relatives
        # This affects the length distribution of segments
        if relationship_type == 'parent-child':
            self.meioses = 1
        elif relationship_type == 'full-siblings':
            self.meioses = 2  # Complex, but effectively 2 for IBD1 regions
        elif relationship_type == 'half-siblings':
            self.meioses = 2
        elif relationship_type in ('grandparent', 'grandparent-grandchild'):
            self.meioses = 2
        elif relationship_type == 'avuncular':
            self.meioses = 3
        elif relationship_type == 'first-cousins':
            self.meioses = 4
        elif relationship_type == 'first-cousins-once-removed':
            self.meioses = 5
        elif relationship_type == 'second-cousins':
            self.meioses = 6
        elif relationship_type == 'second-cousins-once-removed':
            self.meioses = 7
        elif relationship_type == 'third-cousins':
            self.meioses = 8
        else:
            raise ValueError(f"Unknown relationship type: {relationship_type}")
        
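        # Expected number of segments
        # Approximation per shared ancestor: (r * d + c) / 2**(d - 1), where
        # r is the expected number of crossovers per meiosis (genome_length / 100),
        # d is the number of meioses, and c is the number of autosomes.
        # The count falls with distance because each extra meiosis halves the
        # chance that a given segment is transmitted.
        genome_length = 3500  # cM
        num_autosomes = 22
        if relationship_type == 'parent-child':
            self.expected_segments = 23  # One per chromosome (simplified)
        else:
            crossovers_per_meiosis = 0.01 * genome_length
            self.expected_segments = (
                (crossovers_per_meiosis * self.meioses + num_autosomes)
                / 2 ** (self.meioses - 1)
            )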
    
    def segment_length_pdf(self, length):
        """Calculate probability density of observing an IBD segment of given length.
        
        Args:
            length: Length of IBD segment in centiMorgans
            
        Returns:
            Probability density value
        """
        # For parent-child, all chromosomes are shared entirely
        if self.relationship_type == 'parent-child':
            # Use a normal distribution centered on chromosome lengths
            # This is a simplification
            mean_chrom_length = 150  # Average chromosome length in cM
            sd = 50  # Standard deviation
            return norm.pdf(length, mean_chrom_length, sd)
        
        # For other relationships, segment lengths follow an exponential distribution
        # The rate parameter depends on the number of meioses
        rate = self.meioses / 100  # Rate in cM^-1
        return expon.pdf(length, scale=1/rate)
    
    def calculate_log_likelihood_segments(self, segments):
        """Calculate log-likelihood for a list of IBD segments.
        
        Args:
            segments: List of segment lengths in centiMorgans
            
        Returns:
            Log-likelihood value
        """
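        # With no segments, only the count component contributes:
        # the Poisson probability of observing zero segments
        if not segments:
            return poisson.logpmf(0, self.expected_segments)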
        
        # Component for number of segments
        # Use Poisson distribution for the count
        num_segments = len(segments)
        log_likelihood = poisson.logpmf(num_segments, self.expected_segments)
        
        # Component for segment lengths
        # Assumes independence of segment lengths given the relationship
        for length in segments:
            log_likelihood += np.log(self.segment_length_pdf(length))
        
        return log_likelihood
    
    def calculate_log_likelihood(self, features):
        """Calculate log-likelihood based on features.
        
        Args:
            features: Dictionary with segment information
            
        Returns:
            Log-likelihood value
        """
        # If we have individual segments, use those
        if 'segments' in features:
            segments = features['segments']
            return self.calculate_log_likelihood_segments(segments)
        
        # If we only have summary statistics, make an approximation
        elif 'total_ibd' in features and 'num_segments' in features:
            # Estimate average segment length
            num_segments = features['num_segments']
            total_ibd = features['total_ibd']
            
            if num_segments > 0:
                avg_segment_length = total_ibd / num_segments
            else:
                avg_segment_length = 0
            
            # Create approximate segment list (all segments same length)
            segments = [avg_segment_length] * num_segments
            
            # Calculate likelihood using this approximation
            return self.calculate_log_likelihood_segments(segments)
        
        else:
            raise ValueError("Missing required features: provide 'segments', or both 'total_ibd' and 'num_segments'")

# Sample data with individual segments
individual_segment_data = [
    ('Parent-Child', {
        'segments': [156, 85, 143, 201, 78, 198, 134, 99, 149, 177, 168, 129, 95, 103, 114, 67, 81, 92, 156, 75, 69, 52]
    }),
    ('Full Siblings', {
        'segments': [142, 65, 98, 128, 73, 42, 89, 57, 32, 78, 115, 98, 67, 54, 49, 83, 72, 94, 61, 58, 55, 48, 51]
    }),
    ('Half Siblings', {
        'segments': [97, 78, 63, 54, 87, 92, 65, 71, 59, 48, 55, 41, 38, 44]
    }),
    ('First Cousins', {
        'segments': [85, 67, 54, 48, 42, 39, 35, 33, 29, 27]
    }),
    ('Second Cousins', {
        'segments': [63, 42, 36, 29, 25]
    }),
    ('Third Cousins', {
        'segments': [38, 24]
    })
]

# Create segment-based models
segment_models = [
    SegmentBasedModel('parent-child'),
    SegmentBasedModel('full-siblings'),
    SegmentBasedModel('half-siblings'),
    SegmentBasedModel('first-cousins'),
    SegmentBasedModel('second-cousins'),
    SegmentBasedModel('third-cousins')
]

# Calculate log-likelihoods for each example using each model
print("Segment-based model log-likelihoods:")
header = f"{'Example Data':<25}"
for model in segment_models:
    header += f"{model.relationship_type:<20}"
# Size the rules to the header so they span every column
print("-" * len(header))
print(header)
print("-" * len(header))

for example_name, features in individual_segment_data:
    line = f"{example_name:<25}"
    for model in segment_models:
        log_likelihood = model.calculate_log_likelihood(features)
        line += f"{log_likelihood:<20.2f}"
    print(line)

print("\nMost likely relationship for each example:")
for example_name, features in individual_segment_data:
    log_likelihoods = [model.calculate_log_likelihood(features) for model in segment_models]
    best_model_idx = np.argmax(log_likelihoods)
    best_model = segment_models[best_model_idx]
    print(f"{example_name}: Most likely {best_model.relationship_type}")

# Visualize the segment length distributions for different relationship types
plt.figure(figsize=(12, 6))
x = np.linspace(1, 200, 1000)

for model in segment_models:
    y = [model.segment_length_pdf(length) for length in x]
    plt.plot(x, y, label=model.relationship_type)

plt.xlabel('Segment Length (cM)')
plt.ylabel('Probability Density')
plt.title('Theoretical IBD Segment Length Distributions by Relationship Type')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Compare segment length histograms from example data
plt.figure(figsize=(15, 10))

for i, (example_name, features) in enumerate(individual_segment_data):
    plt.subplot(2, 3, i+1)
    segments = features['segments']
    plt.hist(segments, bins=15, alpha=0.7, density=True)
    
    # Overlay theoretical distribution for this relationship
    x = np.linspace(1, max(segments) + 10, 1000)
    y = [segment_models[i].segment_length_pdf(length) for length in x]
    plt.plot(x, y, 'r-', linewidth=2)
    
    plt.title(f"{example_name}\n{len(segments)} segments, total {sum(segments):.0f} cM")
    plt.xlabel('Segment Length (cM)')
    plt.ylabel('Density')
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
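
The exponential segment-length model above has two consequences that are easy to check by hand: with `m` meioses the mean segment length is `100 / m` cM, and the probability that a segment is longer than `t` cM is `exp(-m * t / 100)`. The sketch below tabulates both. The helper names and the 7 cM cutoff are illustrative assumptions, not values taken from Bonsai.

```python
import math

def mean_segment_length_cm(meioses):
    """Mean of an exponential distribution with rate meioses/100 per cM."""
    return 100.0 / meioses

def prob_segment_longer_than(threshold_cm, meioses):
    """Survival function of the same exponential distribution."""
    return math.exp(-meioses * threshold_cm / 100.0)

# Assumed detection cutoff, chosen only for illustration
threshold_cm = 7.0

for name, meioses in [('first-cousins', 4), ('second-cousins', 6), ('third-cousins', 8)]:
    mean_len = mean_segment_length_cm(meioses)
    p_long = prob_segment_longer_than(threshold_cm, meioses)
    print(f"{name:<16} mean length {mean_len:5.1f} cM, "
          f"P(length > {threshold_cm:.0f} cM) = {p_long:.3f}")
```

More distant relationships have shorter expected segments, so a larger share of their segments falls below any fixed detection cutoff.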

### 3.3 Bayesian Integration of Prior Knowledge

We introduced a simple Bayesian approach earlier, but now let's explore a more sophisticated implementation that integrates prior knowledge from multiple sources:

In [ ]:
class BayesianRelationshipInference:
    """Bayesian relationship inference incorporating multiple sources of prior knowledge."""
    
    def __init__(self):
        """Initialize Bayesian relationship inference model."""
        # Define relationship types
        self.relationship_types = [
            'parent-child', 'full-siblings', 'half-siblings', 'grandparent',
            'avuncular', 'first-cousins', 'second-cousins', 'third-cousins', 'unrelated'
        ]
        
        # Base prior probabilities (uniform by default)
        self.base_priors = {rel: 1.0 / len(self.relationship_types) for rel in self.relationship_types}
        
        # Likelihood models for different features
        self.likelihood_models = {
            # Create models for IBD features
            'genetic': {rel: MultivariateRelationshipModel(rel) if rel in mv_models else None 
                      for rel in self.relationship_types}
        }
    
    def calculate_prior(self, relationship, demographic_info=None):
        """Calculate prior probability for a relationship given demographic information.
        
        Args:
            relationship: Relationship type
            demographic_info: Dictionary with demographic information (ages, populations, etc.)
            
        Returns:
            Prior probability
        """
        # Start with base prior
        prior = self.base_priors[relationship]
        
        # If demographic information is provided, adjust the prior
        if demographic_info:
            # Age constraints
            if 'age1' in demographic_info and 'age2' in demographic_info:
                age1 = demographic_info['age1']
                age2 = demographic_info['age2']
                age_diff = abs(age1 - age2)
                
                # Apply age-based adjustments
                if relationship == 'parent-child' and age_diff < 15:
                    prior *= 0.01  # Very unlikely to be parent-child if age difference < 15
                elif relationship == 'grandparent' and age_diff < 30:
                    prior *= 0.1  # Unlikely to be grandparent if age difference < 30
                elif relationship in ['full-siblings', 'half-siblings'] and age_diff > 25:
                    prior *= 0.5  # Less likely to be siblings if age difference > 25
                
            # Population information (e.g., endogamy)
            if 'population' in demographic_info:
                population = demographic_info['population']
                
                if population == 'general':
                    # Standard priors, no adjustment
                    pass
                elif population == 'endogamous':
                    # In endogamous populations, distant cousins are more common
                    if relationship in ['first-cousins', 'second-cousins', 'third-cousins']:
                        prior *= 2.0  # Double the prior for cousin relationships
                    elif relationship == 'unrelated':
                        prior *= 0.5  # Reduce prior for truly unrelated individuals
            
            # Known family structure
            if 'family_structure' in demographic_info:
                family_structure = demographic_info['family_structure']
                
                if 'siblings_known' in family_structure and family_structure['siblings_known']:
                    # If all siblings are already known, reduce prior for sibling relationships
                    if relationship in ['full-siblings', 'half-siblings']:
                        prior *= 0.1
        
        return prior
    
    def calculate_likelihood(self, relationship, features):
        """Calculate likelihood of observing features given relationship.
        
        Args:
            relationship: Relationship type
            features: Dictionary with observed features
            
        Returns:
            Log-likelihood value
        """
        log_likelihood = 0.0
        
        # Genetic features (IBD patterns)
        if 'genetic' in features and relationship in self.likelihood_models['genetic']:
            model = self.likelihood_models['genetic'][relationship]
            if model:
                log_likelihood += model.calculate_log_likelihood(features['genetic'])
            else:
                # If no model for this relationship, use a default value
                log_likelihood += -100  # Very low likelihood
        
        return log_likelihood
    
    def infer_relationship(self, features, demographic_info=None):
        """Infer most likely relationship using Bayesian inference.
        
        Args:
            features: Dictionary with observed features
            demographic_info: Optional dictionary with demographic information
            
        Returns:
            Dictionary mapping relationships to posterior probabilities
        """
        # Calculate prior and likelihood for each relationship type
        log_posteriors = {}
        
        for relationship in self.relationship_types:
            # Calculate prior
            prior = self.calculate_prior(relationship, demographic_info)
            log_prior = np.log(prior) if prior > 0 else -1000
            
            # Calculate likelihood
            log_likelihood = self.calculate_likelihood(relationship, features)
            
            # Calculate unnormalized log posterior
            log_posteriors[relationship] = log_prior + log_likelihood
        
        # Normalize posteriors
        # First convert log-posteriors to posteriors
        max_log_posterior = max(log_posteriors.values())
        posteriors = {r: np.exp(lp - max_log_posterior) for r, lp in log_posteriors.items()}
        
        # Normalize
        sum_posteriors = sum(posteriors.values())
        posteriors = {r: p / sum_posteriors for r, p in posteriors.items()}
        
        return posteriors

# Example data with genetic features
genetic_features = [
    ('Parent-Child Example', {
        'genetic': {
            'total_ibd': 3520,
            'num_segments': 42,
            'ibd1_fraction': 0.99,
            'ibd2_fraction': 0.01
        }
    }),
    ('Full Siblings Example', {
        'genetic': {
            'total_ibd': 2600,
            'num_segments': 36,
            'ibd1_fraction': 0.48,
            'ibd2_fraction': 0.27
        }
    }),
    ('Half Siblings Example', {
        'genetic': {
            'total_ibd': 1750,
            'num_segments': 26,
            'ibd1_fraction': 0.26,
            'ibd2_fraction': 0.01
        }
    })
]

# Example demographic information
demographic_examples = [
    ('No Demographic Info', None),
    ('Parent-Child Demographics', {
        'age1': 45,
        'age2': 20,
        'population': 'general'
    }),
    ('Sibling Demographics', {
        'age1': 25,
        'age2': 22,
        'population': 'general'
    }),
    ('Endogamous Population', {
        'age1': 35,
        'age2': 40,
        'population': 'endogamous'
    })
]

# Create Bayesian inference model
bayes_model = BayesianRelationshipInference()

# Test with different combinations of features and demographic info
print("Bayesian relationship inference with multiple sources of prior knowledge:")
print("=" * 100)

for genetic_name, genetic_data in genetic_features:
    print(f"\nGenetic data: {genetic_name}")
    print("-" * 80)
    
    for demo_name, demographic_info in demographic_examples:
        print(f"\nWith demographic info: {demo_name}")
        
        # Infer relationship
        posteriors = bayes_model.infer_relationship(genetic_data, demographic_info)
        
        # Sort by probability (descending)
        sorted_posteriors = {k: v for k, v in sorted(posteriors.items(), key=lambda item: item[1], reverse=True)}
        
        # Print top 3 most likely relationships
        top_3 = list(sorted_posteriors.items())[:3]
        for rel, prob in top_3:
            print(f"  {rel}: {prob:.4f}")

# Show how demographic information affects the results for ambiguous cases
ambiguous_case = {
    'genetic': {
        'total_ibd': 1680,
        'num_segments': 28,
        'ibd1_fraction': 0.24,
        'ibd2_fraction': 0.0
    }
}

print("\n\nAnalysis of ambiguous case (could be half-sibling, grandparent, or avuncular):")
print("=" * 100)

demographic_scenarios = [
    ('No demographic info', None),
    ('Ages consistent with half-siblings', {'age1': 35, 'age2': 32, 'population': 'general'}),
    ('Ages consistent with grandparent', {'age1': 70, 'age2': 25, 'population': 'general'}),
    ('Ages consistent with avuncular', {'age1': 45, 'age2': 20, 'population': 'general'})
]

for scenario_name, demographic_info in demographic_scenarios:
    print(f"\nScenario: {scenario_name}")
    
    # Infer relationship
    posteriors = bayes_model.infer_relationship(ambiguous_case, demographic_info)
    
    # Sort by probability (descending)
    sorted_posteriors = {k: v for k, v in sorted(posteriors.items(), key=lambda item: item[1], reverse=True)}
    
    # Print top 5 most likely relationships
    top_5 = list(sorted_posteriors.items())[:5]
    for rel, prob in top_5:
        print(f"  {rel}: {prob:.4f}")

# Visualize how demographic information affects posteriors for the ambiguous case
plt.figure(figsize=(12, 6))

# Relationships of interest for the ambiguous case
rel_focus = ['half-siblings', 'grandparent', 'avuncular', 'first-cousins']

# Get posteriors for each scenario
scenarios = []
posteriors_list = []

for scenario_name, demographic_info in demographic_scenarios:
    scenarios.append(scenario_name)
    posteriors = bayes_model.infer_relationship(ambiguous_case, demographic_info)
    posteriors_list.append([posteriors.get(rel, 0) for rel in rel_focus])

# Create grouped bar chart
x = np.arange(len(scenarios))
width = 0.2

for i, relationship in enumerate(rel_focus):
    plt.bar(x + i * width, [values[i] for values in posteriors_list], width, label=relationship)

plt.ylabel('Posterior Probability')
plt.title('Effect of Demographic Information on Relationship Inference')
# Center each tick label under its group of bars
plt.xticks(x + width * (len(rel_focus) - 1) / 2, scenarios, rotation=45, ha='right')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
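
To see why demographic priors matter for the ambiguous case above, it helps to work through one scenario by hand. Half-siblings, grandparent, and avuncular pairs all share about the same amount of DNA, so their genetic likelihoods are nearly equal and the posterior is driven by the prior. The sketch below assumes identical (hypothetical) log-likelihoods for the three relationships and applies the 0.1 grandparent multiplier that `calculate_prior` uses when the age difference is under 30. The function name and the log-likelihood value are illustrative assumptions.

```python
import math

def posterior_from_logs(log_priors, log_likelihoods):
    """Combine log-priors and log-likelihoods into normalized posteriors."""
    log_post = {r: log_priors[r] + log_likelihoods[r] for r in log_priors}
    # Subtract the maximum before exponentiating to avoid underflow
    peak = max(log_post.values())
    weights = {r: math.exp(v - peak) for r, v in log_post.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}

# Hypothetical: the genetic data cannot separate these three relationships
log_likelihoods = {'half-siblings': -42.0, 'grandparent': -42.0, 'avuncular': -42.0}

# Uniform base prior; grandparent is scaled by 0.1 because the pair is 3 years apart
priors = {'half-siblings': 1 / 3, 'grandparent': (1 / 3) * 0.1, 'avuncular': 1 / 3}
log_priors = {r: math.log(p) for r, p in priors.items()}

posteriors = posterior_from_logs(log_priors, log_likelihoods)
for rel, prob in sorted(posteriors.items(), key=lambda item: item[1], reverse=True):
    print(f"{rel}: {prob:.4f}")
```

With equal likelihoods the posterior is simply the renormalized prior, so the age information alone moves grandparent from one third of the probability mass to under 5%.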

## 4. Pedigree-Level Likelihood Calculations

So far, we've focused on pairwise relationship inference. However, Bonsai's power comes from its ability to reason about entire pedigrees, considering consistency among all relationships simultaneously. Let's explore how pedigree-level likelihoods are calculated.

### 4.1 From Pairwise to Global Likelihood

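Bonsai computes the likelihood of an entire pedigree by combining the likelihoods of all pairwise relationships implied by the pedigree. The simplified version here treats the pairs as independent, so the pedigree log-likelihood is the sum of the pairwise log-likelihoods. Let's implement it: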

In [ ]:
class PedigreeLikelihood:
    """Calculator for pedigree-level likelihoods."""
    
    def __init__(self, relationship_model=None):
        """Initialize pedigree likelihood calculator.
        
        Args:
            relationship_model: Model for calculating pairwise relationship likelihoods
        """
        self.relationship_model = relationship_model or MultivariateRelationshipModel('parent-child')
        
        # Cache for inferred relationships to avoid redundant computation
        self.inferred_relationship_cache = {}
    
    def infer_relationship_from_pedigree(self, pedigree, id1, id2):
        """Infer relationship between two individuals based on pedigree structure.
        
        Args:
            pedigree: Dictionary representing pedigree structure
            id1: ID of first individual
            id2: ID of second individual
            
        Returns:
            Inferred relationship type
        """
        # Check if result is in cache
        cache_key = (min(id1, id2), max(id1, id2))
        if cache_key in self.inferred_relationship_cache:
            return self.inferred_relationship_cache[cache_key]
        
        # Helper function to get parents
        def get_parents(indiv_id):
            if indiv_id not in pedigree:
                return (None, None)  # Individual not in pedigree
            
            father = pedigree[indiv_id].get('father')
            mother = pedigree[indiv_id].get('mother')
            return (father, mother)
        
        # Count the generations from an ancestor down to a descendant
        # (returns None if ancestor_id is not an ancestor of descendant_id)
        def generations_between(ancestor_id, descendant_id, depth=1):
            if depth > 10:  # Prevent excessive recursion
                return None
            
            for parent in get_parents(descendant_id):
                if parent is None:
                    continue
                if parent == ancestor_id:
                    return depth
                found = generations_between(ancestor_id, parent, depth + 1)
                if found is not None:
                    return found
            return None
        
        # Identify direct-line relationships (check both directions)
        generations = generations_between(id1, id2) or generations_between(id2, id1)
        if generations == 1:
            result = 'parent-child'
        elif generations == 2:
            result = 'grandparent'
        elif generations is not None:
            # More distant direct ancestors are not modeled separately
            result = self._infer_distant_relationship(pedigree, id1, id2)
        
        # Identify siblings
        else:
            father1, mother1 = get_parents(id1)
            father2, mother2 = get_parents(id2)
            
            # A parent counts as shared only if it is actually recorded
            shared_father = father1 is not None and father1 == father2
            shared_mother = mother1 is not None and mother1 == mother2
            
            if shared_father and shared_mother:
                result = 'full-siblings'
            elif shared_father or shared_mother:
                # Only one shared parent is recorded in the pedigree
                result = 'half-siblings'
            else:
                # Check for more distant relationships
                result = self._infer_distant_relationship(pedigree, id1, id2)
        
        # Store result in cache
        self.inferred_relationship_cache[cache_key] = result
        return result
    
    def _infer_distant_relationship(self, pedigree, id1, id2):
        """Infer distant relationship between individuals.
        
        Simplified version - in a real implementation, would trace through
        the pedigree to determine exact relationship.
        """
        # For demonstration, we'll return a default value
        return 'unrelated'
    
    def calculate_pedigree_likelihood(self, pedigree, ibd_segments):
        """Calculate the likelihood of a pedigree given observed IBD segments.
        
        Args:
            pedigree: Dictionary representing pedigree structure
            ibd_segments: Dictionary mapping pairs to IBD segments
            
        Returns:
            Log-likelihood value
        """
        total_log_likelihood = 0.0
        
        # The cache is keyed by pair only, so clear it before scoring a new pedigree
        self.inferred_relationship_cache.clear()
        
        # Build one model per relationship, using the class of the configured model
        model_class = type(self.relationship_model)
        
        # Calculate likelihood for each pair of individuals with IBD segments
        for (id1, id2), segments in ibd_segments.items():
            # Infer relationship from pedigree
            relationship = self.infer_relationship_from_pedigree(pedigree, id1, id2)
            
            # Score the pair under the relationship this pedigree implies
            try:
                model = model_class(relationship)
            except ValueError:
                # Assumption: relationship types the model class does not support
                # (e.g. 'unrelated') are scored with the most distant supported model
                model = model_class('third-cousins')
            
            # Calculate likelihood of observed segments given inferred relationship
            if hasattr(model, 'calculate_log_likelihood_segments'):
                # If model supports segment-based calculation
                pair_likelihood = model.calculate_log_likelihood_segments(segments)
            else:
                # Create features from segments
                features = {
                    'total_ibd': sum(segments),
                    'num_segments': len(segments),
                    'mean_segment_length': sum(segments) / max(1, len(segments))
                }
                pair_likelihood = model.calculate_log_likelihood(features)
            
            # Add to total likelihood
            total_log_likelihood += pair_likelihood
        
        return total_log_likelihood

# Example pedigree structure
"""
        A1(-1)     A2(-2)   A3(-3)     A4(-4)
          |  \______/         |  \______/
          |         |         |         |
        P1(1)     P2(2)     P3(3)     P4(4)
          |         |         |         |
          |         |         |         |
        C1(5) -- C2(6)      C3(7) -- C4(8)
                  |                    |
                  |                    |
                GC1(9)              GC2(10)
"""
example_pedigree = {
    # First generation (grandparents)
    -1: {'father': None, 'mother': None, 'sex': 'M'},  # A1
    -2: {'father': None, 'mother': None, 'sex': 'F'},  # A2
    -3: {'father': None, 'mother': None, 'sex': 'M'},  # A3
    -4: {'father': None, 'mother': None, 'sex': 'F'},  # A4
    
    # Second generation (parents)
    1: {'father': -1, 'mother': -2, 'sex': 'M'},  # P1
    2: {'father': -1, 'mother': -2, 'sex': 'F'},  # P2
    3: {'father': -3, 'mother': -4, 'sex': 'M'},  # P3
    4: {'father': -3, 'mother': -4, 'sex': 'F'},  # P4
    
    # Third generation (children)
    5: {'father': 1, 'mother': None, 'sex': 'M'},  # C1
    6: {'father': None, 'mother': 2, 'sex': 'F'},  # C2
    7: {'father': 3, 'mother': None, 'sex': 'M'},  # C3
    8: {'father': None, 'mother': 4, 'sex': 'F'},  # C4
    
    # Fourth generation (grandchildren)
    9: {'father': None, 'mother': 6, 'sex': 'M'},   # GC1
    10: {'father': None, 'mother': 8, 'sex': 'F'},  # GC2
}

# Visualize the pedigree
def visualize_pedigree(pedigree):
    """Create a visualization of a pedigree."""
    G = nx.DiGraph()
    
    # Add nodes
    for indiv_id, info in pedigree.items():
        # Label for the node (the ID ranges match the example pedigree)
        if indiv_id < 0:
            label = f"A{abs(indiv_id)}"
        elif indiv_id <= 4:
            label = f"P{indiv_id}"
        elif indiv_id <= 8:
            label = f"C{indiv_id-4}"
        else:
            label = f"GC{indiv_id-8}"
        
        # Node color and shape based on sex
        if info.get('sex') == 'M':
            color = 'lightblue'
            shape = 's'  # square
        elif info.get('sex') == 'F':
            color = 'pink'
            shape = 'o'  # circle
        else:
            color = 'lightgray'
            shape = 'd'  # diamond
        
        G.add_node(indiv_id, label=label, color=color, shape=shape)
    
    # Add edges (directed from parent to child)
    for child_id, info in pedigree.items():
        if 'father' in info and info['father'] is not None:
            G.add_edge(info['father'], child_id)
        if 'mother' in info and info['mother'] is not None:
            G.add_edge(info['mother'], child_id)
    
    # Create layout (graphviz_layout needs Graphviz and the pygraphviz package installed)
    pos = nx.nx_agraph.graphviz_layout(G, prog="dot")
    
    # Draw the graph
    plt.figure(figsize=(12, 8))
    
    # Draw nodes grouped by shape (each call accepts a single marker shape)
    for shape in ['s', 'o', 'd']:
        node_list = [n for n, d in G.nodes(data=True) if d.get('shape') == shape]
        node_colors = [G.nodes[n].get('color', 'lightgray') for n in node_list]
        nx.draw_networkx_nodes(G, pos, nodelist=node_list, node_shape=shape,
                               node_color=node_colors, node_size=500)
    
    # Draw edges
    nx.draw_networkx_edges(G, pos, arrows=True)
    
    # Draw labels
    labels = {n: d.get('label', str(n)) for n, d in G.nodes(data=True)}
    nx.draw_networkx_labels(G, pos, labels=labels)
    
    plt.title("Example Pedigree Structure")
    plt.axis('off')
    plt.show()

# Visualize the example pedigree
visualize_pedigree(example_pedigree)

# Create example IBD segments
example_ibd_data = {
    # Full siblings
    (1, 2): [120, 105, 98, 87, 76, 65, 58, 52, 49, 45, 40, 38, 35, 33, 32, 30, 29, 28, 27, 26],
    
    # Parent-child
    (2, 6): [175, 163, 155, 142, 130, 125, 112, 105, 98, 95, 88, 85, 82, 78, 75, 72, 68, 65, 55, 50, 45, 42],
    
    # Cousins
    (5, 6): [75, 62, 58, 52, 45, 38, 35, 30, 28, 25],
    
    # Unrelated (some small segments due to background IBD or false positives)
    (5, 10): [15, 10]
}

# Create a pedigree likelihood calculator
pedigree_calculator = PedigreeLikelihood(relationship_model=SegmentBasedModel('parent-child'))

# Calculate the likelihood of the example pedigree
example_likelihood = pedigree_calculator.calculate_pedigree_likelihood(example_pedigree, example_ibd_data)
print(f"Log-likelihood of example pedigree: {example_likelihood:.2f}")

# Let's also demonstrate how the likelihood changes with different pedigrees
print("\nDemonstrating how likelihood changes with incorrect pedigrees:")

# Create a modified pedigree with an error (C2 assigned to the wrong parent)
incorrect_pedigree_1 = example_pedigree.copy()  # Shallow copy is enough: whole entries are replaced
incorrect_pedigree_1[6] = {'father': 1, 'mother': None, 'sex': 'F'}  # C2 is now a child of P1 instead of P2

# Calculate likelihood
incorrect_likelihood_1 = pedigree_calculator.calculate_pedigree_likelihood(incorrect_pedigree_1, example_ibd_data)
print(f"Log-likelihood of pedigree with incorrect parent: {incorrect_likelihood_1:.2f}")

# Create another modified pedigree with a different error (breaking sibling relationship)
incorrect_pedigree_2 = example_pedigree.copy()
incorrect_pedigree_2[2] = {'father': -3, 'mother': -4, 'sex': 'F'}  # Changed parents to be different from sibling

# Calculate likelihood
incorrect_likelihood_2 = pedigree_calculator.calculate_pedigree_likelihood(incorrect_pedigree_2, example_ibd_data)
print(f"Log-likelihood of pedigree with incorrect sibling relationship: {incorrect_likelihood_2:.2f}")

# Compare likelihoods on a log10 scale (np.exp of a large log-likelihood difference overflows)
print(f"\nLog10 likelihood ratio (correct vs incorrect parent): {(example_likelihood - incorrect_likelihood_1) / np.log(10):.2f}")
print(f"Log10 likelihood ratio (correct vs incorrect sibling): {(example_likelihood - incorrect_likelihood_2) / np.log(10):.2f}")
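
The `_infer_distant_relationship` stub above always returns `'unrelated'`, so cousins in the pedigree are never recognized. One way to fill that gap, sketched here under simplifying assumptions (no inbreeding loops, string IDs, hypothetical function names), is to find the closest common ancestors of two individuals and count the meioses on the path between them. This is an illustration of the idea, not Bonsai's implementation.

```python
from collections import deque

def ancestor_depths(pedigree, indiv_id):
    """Map indiv_id and each recorded ancestor to its generation distance."""
    depths = {indiv_id: 0}
    queue = deque([indiv_id])
    while queue:
        current = queue.popleft()
        record = pedigree.get(current, {})
        for parent in (record.get('father'), record.get('mother')):
            if parent is not None and parent not in depths:
                depths[parent] = depths[current] + 1
                queue.append(parent)
    return depths

def count_meioses(pedigree, id1, id2):
    """Return (meioses, number of closest common ancestors), or None if none is recorded."""
    depths1 = ancestor_depths(pedigree, id1)
    depths2 = ancestor_depths(pedigree, id2)
    common = set(depths1) & set(depths2)
    if not common:
        return None
    path_lengths = {a: depths1[a] + depths2[a] for a in common}
    meioses = min(path_lengths.values())
    closest = [a for a, length in path_lengths.items() if length == meioses]
    return meioses, len(closest)

# Hypothetical mini-pedigree: two full siblings (p1, p2) and one child of each
toy_pedigree = {
    'gf': {'father': None, 'mother': None},
    'gm': {'father': None, 'mother': None},
    'p1': {'father': 'gf', 'mother': 'gm'},
    'p2': {'father': 'gf', 'mother': 'gm'},
    'c1': {'father': 'p1', 'mother': None},
    'c2': {'father': None, 'mother': 'p2'},
}

for pair in [('p1', 'p2'), ('p2', 'c2'), ('c1', 'c2')]:
    print(pair, count_meioses(toy_pedigree, *pair))
```

A result of four meioses through two common ancestors corresponds to first cousins. Two meioses through one ancestor is ambiguous between half-siblings and grandparent, so a direct-line check is still needed to separate those cases.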

### 4.2 Relationship Consistency Constraints

One of the key advantages of pedigree-level likelihood calculations is the ability to enforce consistency constraints among relationships. For example, if A is a parent of B, and B is a parent of C, then A must be a grandparent of C (not some other relationship).

Let's implement a function that flags pairs whose observed IBD sharing deviates too far from what the pedigree-implied relationship predicts:

In [ ]:
def check_pedigree_consistency(pedigree, ibd_data, tolerance=0.3):
    """Check if a pedigree is consistent with observed IBD data.
    
    Args:
        pedigree: Dictionary representing pedigree structure
        ibd_data: Dictionary mapping pairs to IBD segments
        tolerance: Maximum allowed relative deviation from expected IBD sharing (0.3 = 30%)
        
    Returns:
        Tuple of (is_consistent, list_of_inconsistencies)
    """
    inconsistencies = []
    
    # Get all pairs with IBD data
    pairs = list(ibd_data.keys())
    
    # Helper function to calculate expected IBD sharing
    def calculate_expected_ibd(relationship_type):
        """Calculate expected total IBD sharing for a relationship type."""
        if relationship_type == 'parent-child':
            return 3540
        elif relationship_type == 'full-siblings':
            return 2550
        elif relationship_type in ['half-siblings', 'grandparent', 'avuncular']:
            return 1700
        elif relationship_type == 'first-cousins':
            return 850
        elif relationship_type == 'second-cousins':
            return 212
        elif relationship_type == 'third-cousins':
            return 53
        else:
            return 0  # Unknown or more distant
    
    # Create a relationship calculator
    calculator = PedigreeLikelihood()
    
    # Check each pair
    for pair in pairs:
        id1, id2 = pair
        segments = ibd_data[pair]
        total_ibd = sum(segments)
        
        # Infer relationship from pedigree
        inferred_relationship = calculator.infer_relationship_from_pedigree(pedigree, id1, id2)
        
        # Calculate expected IBD for this relationship
        expected_ibd = calculate_expected_ibd(inferred_relationship)
        
        # Check if observed IBD is consistent with expected IBD
        if expected_ibd > 0:  # Skip if relationship is unknown or very distant
            # Calculate deviation from expected
            deviation = abs(total_ibd - expected_ibd) / expected_ibd
            
            if deviation > tolerance:
                inconsistency = {
                    'pair': pair,
                    'inferred_relationship': inferred_relationship,
                    'expected_ibd': expected_ibd,
                    'observed_ibd': total_ibd,
                    'deviation': deviation
                }
                inconsistencies.append(inconsistency)
    
    # Return results
    is_consistent = len(inconsistencies) == 0
    return is_consistent, inconsistencies

def report_inconsistencies(label, is_consistent, inconsistencies):
    """Print the consistency verdict for a pedigree, then details for each flagged pair."""
    print(f"Is {label} consistent with IBD data? {is_consistent}")
    if is_consistent:
        return
    print(f"\nInconsistencies found in {label}:")
    for i, inconsistency in enumerate(inconsistencies, start=1):
        print(f"Inconsistency {i}:")
        print(f"  Pair: {inconsistency['pair']}")
        print(f"  Inferred relationship: {inconsistency['inferred_relationship']}")
        print(f"  Expected IBD: {inconsistency['expected_ibd']:.2f} cM")
        print(f"  Observed IBD: {inconsistency['observed_ibd']:.2f} cM")
        print(f"  Deviation: {inconsistency['deviation']*100:.2f}%")

# Check consistency of our example pedigree
is_consistent, inconsistencies = check_pedigree_consistency(example_pedigree, example_ibd_data)
report_inconsistencies("pedigree", is_consistent, inconsistencies)

# Check consistency of an incorrect pedigree
is_consistent_2, inconsistencies_2 = check_pedigree_consistency(incorrect_pedigree_1, example_ibd_data)
print()  # Blank line between the two reports
report_inconsistencies("incorrect pedigree", is_consistent_2, inconsistencies_2)
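
The check above flags a pair when the relative deviation between observed and expected IBD exceeds `tolerance`. As a minimal sketch of that arithmetic, the snippet below recomputes the statistic for two made-up pairs. The helper `relative_deviation`, the example totals, and the 0.5 threshold are assumptions for illustration and are not taken from `check_pedigree_consistency`.

```python
def relative_deviation(observed_ibd, expected_ibd):
    """Return |observed - expected| / expected, the statistic compared to the tolerance."""
    if expected_ibd <= 0:
        raise ValueError("expected_ibd must be positive")
    return abs(observed_ibd - expected_ibd) / expected_ibd

# Hypothetical (expected, observed) totals in cM, chosen only for illustration
examples = {
    "first-cousins": (850, 1000),
    "third-cousins": (53, 95),
}
assumed_tolerance = 0.5  # Assumption: not necessarily the function's default

for name, (expected, observed) in examples.items():
    dev = relative_deviation(observed, expected)
    flagged = dev > assumed_tolerance
    print(f"{name}: deviation = {dev:.1%}, flagged = {flagged}")
```

In this example a 150 cM excess leaves the first-cousin pair within tolerance, while a 42 cM excess flags the third-cousin pair, so a single relative threshold is much stricter in absolute terms for distant relationships.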

## 5. Specialized Models for Complex Scenarios

Standard relationship models work well for most cases, but some scenarios require specialized models. Let's explore models for several complex scenarios that are common in genetic genealogy.

### 5.1 Models for Endogamous Populations

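Endogamous populations (those with a long history of marriage within a relatively small group) present unique challenges. Because any two individuals are typically related through many distant lines at once, they share more IBD segments and more total IBD than a single genealogical relationship would predict, so standard models tend to overestimate how closely they are related.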
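
To make the effect of multiple relationship paths concrete, here is a small sketch that adds up expected IBD across several paths connecting the same pair. It assumes the paths run through distinct ancestors, so that expected sharing is approximately additive, and it uses the unadjusted baseline totals that appear in this lab's models (850, 212, and 53 cM). The function name and the example pair are hypothetical.

```python
# Expected total IBD (cM) for a single path of each type, using the
# unadjusted baseline values from this lab's relationship models
SINGLE_PATH_EXPECTED_CM = {
    "first-cousins": 850,
    "second-cousins": 212,
    "third-cousins": 53,
}

def expected_ibd_multiple_paths(paths):
    """Sum expected IBD over relationship paths through distinct ancestors.

    paths: dict mapping relationship name -> number of distinct paths.
    """
    return sum(SINGLE_PATH_EXPECTED_CM[rel] * count for rel, count in paths.items())

# Hypothetical pair: second cousins on one line, third cousins on three other lines
pair_paths = {"second-cousins": 1, "third-cousins": 3}
total = expected_ibd_multiple_paths(pair_paths)
print(f"Expected total IBD: {total} cM")  # 212 + 3 * 53 = 371 cM
```

A pair like this would look considerably closer than second cousins to a model that assumes a single relationship path.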

In [ ]:
class EndogamousPopulationModel:
    """Specialized model for relationship inference in endogamous populations."""
    
    def __init__(self, endogamy_factor=2.0, background_ibd=50):
        """Initialize model for endogamous populations.
        
        Args:
            endogamy_factor: Multiplier applied to expected IBD and segment counts for
                first cousins and more distant relationships; closer relationships
                use fixed adjustments
            background_ibd: Expected background IBD sharing in cM between 'unrelated' individuals
        """
        self.endogamy_factor = endogamy_factor
        self.background_ibd = background_ibd
        
        # Define relationship models with adjusted parameters
        self.relationship_models = {}
        
        # Create models for each relationship type
        # Close relationships are less affected by endogamy
        self.relationship_models['parent-child'] = {
            'expected_total_ibd': 3540,
            'total_ibd_sd': 100,
            'expected_segments': 40,
            'segments_sd': 5
        }
        
        self.relationship_models['full-siblings'] = {
            'expected_total_ibd': 2550,
            'total_ibd_sd': 250,  # Increased variance due to endogamy
            'expected_segments': 35,
            'segments_sd': 6  # Increased variance
        }
        
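        # More distant relationships are increasingly affected by endogamy:
        # half-siblings get a small fixed adjustment, while first cousins and
        # beyond scale with endogamy_factor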
        self.relationship_models['half-siblings'] = {
            'expected_total_ibd': 1700 * 1.1,  # Slight increase
            'total_ibd_sd': 350,  # Increased variance
            'expected_segments': 25 * 1.2,  # More segments due to endogamy
            'segments_sd': 7
        }
        
        self.relationship_models['first-cousins'] = {
            'expected_total_ibd': 850 * endogamy_factor * 0.9,  # Significantly increased
            'total_ibd_sd': 300,
            'expected_segments': 15 * endogamy_factor * 0.8,
            'segments_sd': 8
        }
        
        self.relationship_models['second-cousins'] = {
            'expected_total_ibd': 212 * endogamy_factor,
            'total_ibd_sd': 150,
            'expected_segments': 5 * endogamy_factor,
            'segments_sd': 4
        }
        
        self.relationship_models['third-cousins'] = {
            'expected_total_ibd': 53 * endogamy_factor * 1.2,
            'total_ibd_sd': 100,
            'expected_segments': 2 * endogamy_factor * 1.3,
            'segments_sd': 3
        }
        
        self.relationship_models['unrelated'] = {
            'expected_total_ibd': background_ibd,
            'total_ibd_sd': background_ibd * 0.8,
            'expected_segments': background_ibd / 25,  # Rough estimate: assumes ~25 cM of background IBD per segment
            'segments_sd': 2
        }
    
    def calculate_relationship_likelihood(self, relationship, features):
        """Calculate log-likelihood of features under a specific relationship model.
        
        Args:
            relationship: Type of relationship
            features: Dictionary with observed features
            
        Returns:
            Log-likelihood value
        """
        if relationship not in self.relationship_models:
            return float('-inf')  # Unknown relationship
        
        model = self.relationship_models[relationship]
        log_likelihood = 0.0
        
        # Component for total IBD
        if 'total_ibd' in features:
            log_likelihood += norm.logpdf(features['total_ibd'], 
                                         model['expected_total_ibd'], 
                                         model['total_ibd_sd'])
        
        # Component for number of segments
        if 'num_segments' in features:
            log_likelihood += norm.logpdf(features['num_segments'], 
                                         model['expected_segments'], 
                                         model['segments_sd'])
        
        return log_likelihood
    
    def infer_relationship(self, features):
        """Infer most likely relationship type for an endogamous population.
        
        Args:
            features: Dictionary with observed features
            
        Returns:
            Tuple of (most_likely_relationship, log_likelihood, all_likelihoods)
        """
        # Calculate log-likelihood for each relationship type
        all_likelihoods = {
            relationship: self.calculate_relationship_likelihood(relationship, features)
            for relationship in self.relationship_models
        }
        
        # Find most likely relationship (highest log-likelihood)
        most_likely_relationship = max(all_likelihoods, key=all_likelihoods.get)
        max_log_likelihood = all_likelihoods[most_likely_relationship]
        
        return most_likely_relationship, max_log_likelihood, all_likelihoods

# Example data for endogamous population
endogamous_examples = [
    ('Parent-Child Example', {
        'total_ibd': 3520,
        'num_segments': 42
    }),
    ('Full Siblings Example', {
        'total_ibd': 2600,
        'num_segments': 37
    }),
    ('Half Siblings Example', {
        'total_ibd': 1900,  # Higher than expected in non-endogamous
        'num_segments': 30  # More segments than expected in non-endogamous
    }),
    ('First Cousins Example', {
        'total_ibd': 1450,  # Much higher than expected in non-endogamous
        'num_segments': 25  # Many more segments than expected in non-endogamous
    }),
    ('Second Cousins Example', {
        'total_ibd': 650,  # Much higher than expected in non-endogamous
        'num_segments': 16  # Many more segments than expected in non-endogamous
    }),
    ('Third Cousins Example', {
        'total_ibd': 250,  # Much higher than expected in non-endogamous
        'num_segments': 8  # Many more segments than expected in non-endogamous
    }),
    ('Unrelated Example', {
        'total_ibd': 95,  # Background IBD in endogamous population
        'num_segments': 4
    })
]

# Create models with different endogamy factors
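# endogamy_factor=1.0 serves as the "no endogamy" baseline; the fixed
# per-relationship multipliers in __init__ still apply
standard_model = EndogamousPopulationModel(endogamy_factor=1.0, background_ibd=10)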
moderate_endogamy_model = EndogamousPopulationModel(endogamy_factor=2.0, background_ibd=50)
high_endogamy_model = EndogamousPopulationModel(endogamy_factor=3.5, background_ibd=100)

# Compare results using different models
print("Relationship inference in endogamous populations:")
print("=" * 100)

for example_name, features in endogamous_examples:
    print(f"\n{example_name}:")
    print(f"  Total IBD: {features['total_ibd']} cM, Segments: {features['num_segments']}")
    
    # Standard model
    rel_std, ll_std, _ = standard_model.infer_relationship(features)
    print(f"  Standard model (no endogamy): {rel_std} (log-likelihood: {ll_std:.2f})")
    
    # Moderate endogamy model
    rel_mod, ll_mod, _ = moderate_endogamy_model.infer_relationship(features)
    print(f"  Moderate endogamy model: {rel_mod} (log-likelihood: {ll_mod:.2f})")
    
    # High endogamy model
    rel_high, ll_high, _ = high_endogamy_model.infer_relationship(features)
    print(f"  High endogamy model: {rel_high} (log-likelihood: {ll_high:.2f})")

# Visualize how the expected IBD changes with endogamy
plt.figure(figsize=(12, 6))

# Define relationships to show
relationships = ['first-cousins', 'second-cousins', 'third-cousins', 'unrelated']
endogamy_factors = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]

for relationship in relationships:
    expected_ibd = []
    for factor in endogamy_factors:
        model = EndogamousPopulationModel(endogamy_factor=factor, background_ibd=25*factor)
        expected_ibd.append(model.relationship_models[relationship]['expected_total_ibd'])
    
    plt.plot(endogamy_factors, expected_ibd, marker='o', label=relationship)

plt.xlabel('Endogamy Factor')
plt.ylabel('Expected Total IBD (cM)')
plt.title('Effect of Endogamy on Expected IBD Sharing')
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()
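
Raw log-likelihoods such as those printed above are hard to compare across relationship types. One common way to read them is to normalize them into probabilities. The sketch below does this with `scipy.special.logsumexp`, assuming a uniform prior over relationship types. The function name and the example log-likelihood values are hypothetical; the input dictionary has the same shape as the third element returned by `infer_relationship`.

```python
import numpy as np
from scipy.special import logsumexp

def normalize_log_likelihoods(all_likelihoods):
    """Convert a dict of log-likelihoods into probabilities that sum to 1.

    Assumes a uniform prior over relationship types (an illustrative choice).
    """
    names = list(all_likelihoods)
    log_l = np.array([all_likelihoods[name] for name in names], dtype=float)
    log_post = log_l - logsumexp(log_l)  # Numerically stable normalization
    return dict(zip(names, np.exp(log_post)))

# Hypothetical log-likelihoods for one pair
example_log_likelihoods = {
    "first-cousins": -12.3,
    "second-cousins": -11.1,
    "third-cousins": -15.8,
}
posteriors = normalize_log_likelihoods(example_log_likelihoods)
for name, p in sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {p:.3f}")
```

Working in log space before exponentiating avoids underflow when the log-likelihoods are very negative, which is typical when several features are combined.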

## 6. Exercises

Now that we've explored advanced likelihood calculations in Bonsai, let's complete some exercises to deepen our understanding.

### Exercise 1: Implement a Custom Relationship Model

Create a custom relationship model for a specific scenario that we haven't covered yet. For example, you could implement a model for:
- Double first cousins (share both sets of grandparents)
- Complex relationships (e.g., half-sibling and also first cousin)
- Populations with unusual recombination rates
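
As a possible starting point for the first option, the expected total IBD can be estimated from the coefficient of relationship, which sums (1/2)^n over every path of n meioses connecting the pair through a common ancestor. The sketch below assumes expected total IBD is roughly r × 6800 cM, a scale read off this lab's baseline values for half-siblings through third cousins. The constant and the function names are assumptions for illustration.

```python
# Assumption: expected total IBD is roughly r * 6800 cM, a scale inferred from
# this lab's baseline values (for example, first cousins: r = 1/8 -> 850 cM)
ASSUMED_CM_SCALE = 6800

def coefficient_of_relationship(paths):
    """Sum (1/2)**n over paths, where n is the number of meioses on each path."""
    return sum(0.5 ** meioses for meioses in paths)

def expected_total_ibd(paths):
    """Approximate expected total IBD in cM from the coefficient of relationship."""
    return coefficient_of_relationship(paths) * ASSUMED_CM_SCALE

# First cousins: two shared grandparents, 4 meioses through each
first_cousins = [4, 4]
# Double first cousins: four shared grandparents, 4 meioses through each
double_first_cousins = [4, 4, 4, 4]

print(f"First cousins: r = {coefficient_of_relationship(first_cousins)}, "
      f"expected IBD = {expected_total_ibd(first_cousins):.0f} cM")
print(f"Double first cousins: r = {coefficient_of_relationship(double_first_cousins)}, "
      f"expected IBD = {expected_total_ibd(double_first_cousins):.0f} cM")
```

Under this approximation double first cousins have the same expected total as half-siblings (about 1700 cM), so a custom model would need additional features, such as segment counts or IBD2 sharing, to separate the two.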

In [ ]:
# Exercise 1: Your solution here
# Hint: Start by defining expected IBD sharing for your chosen relationship type

### Exercise 2: Create a More Sophisticated Bayesian Model

Enhance the Bayesian relationship inference model to incorporate additional types of prior knowledge, such as:
- Known or suspected family structures
- Historical demographic information
- Geographical proximity information
- Shared surnames or other metadata

In [ ]:
# Exercise 2: Your solution here
# Hint: Modify the calculate_prior method of the BayesianRelationshipInference class

### Exercise 3: Implement a Segment-Based Model for Endogamous Populations

Combine the segment-based model and the endogamous population model to create a more accurate model for endogamous populations that considers the distribution of segment lengths.

In [ ]:
# Exercise 3: Your solution here
# Hint: Modify the segment_length_pdf method to account for endogamy

### Exercise 4: Improve the Pedigree Consistency Check

Enhance the pedigree consistency checking function to handle more complex pedigrees and provide more detailed diagnostic information.

In [ ]:
# Exercise 4: Your solution here
# Hint: Add checks for specific relationships like parent-child, siblings, etc.

### Exercise 5: Evaluate Model Performance

Design and implement a simulation to evaluate the performance of different relationship inference models. Compare their accuracy on simulated data with known true relationships.
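
One way to structure such a simulation is sketched below: draw total IBD values from a known relationship, classify each draw by maximum likelihood, and record how often the true relationship is recovered. The means and standard deviations are illustrative values loosely based on the models in this lab, the Normal sampling is a simplification, and `TOY_MODELS` and `simulate_and_score` are hypothetical names. It is a starting skeleton, not a complete evaluation.

```python
import numpy as np
from scipy.stats import norm

# Illustrative (mean, sd) of total IBD in cM for each relationship type
TOY_MODELS = {
    "first-cousins": (850, 300),
    "second-cousins": (212, 150),
    "third-cousins": (53, 100),
}

def simulate_and_score(models, n_per_class=500, seed=0):
    """Return per-relationship accuracy of a maximum-likelihood classifier."""
    rng = np.random.default_rng(seed)
    names = list(models)
    means = np.array([models[name][0] for name in names], dtype=float)
    sds = np.array([models[name][1] for name in names], dtype=float)
    accuracy = {}
    for true_idx, name in enumerate(names):
        # Simulate total IBD for the true relationship; IBD cannot be negative
        samples = np.clip(rng.normal(means[true_idx], sds[true_idx], n_per_class), 0, None)
        # Log-likelihood of every sample under every model: shape (n_per_class, n_models)
        log_l = norm.logpdf(samples[:, None], loc=means[None, :], scale=sds[None, :])
        predicted = log_l.argmax(axis=1)
        accuracy[name] = float((predicted == true_idx).mean())
    return accuracy

accuracy = simulate_and_score(TOY_MODELS)
for name, acc in accuracy.items():
    print(f"{name}: accuracy = {acc:.2f}")
```

Extending this skeleton to additional features, other noise levels, and the other models from this lab is the substance of the exercise.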

In [ ]:
# Exercise 5: Your solution here
# Hint: Generate simulated data with varying noise levels and test each model

## Conclusion

In this lab, we've explored advanced likelihood calculations in Bonsai, from the statistical foundations to specialized models for complex scenarios. We implemented and tested several relationship-inference models and saw how pairwise results can be combined to evaluate entire pedigrees.

Key takeaways:
- Likelihood functions provide a principled way to evaluate genetic relationship hypotheses
- Advanced likelihood models incorporate multiple features and their correlations
- Bayesian approaches allow us to integrate prior knowledge with genetic evidence
- Pedigree-level analysis considers consistency among all relationships
- Specialized models are needed for complex scenarios like endogamous populations

In the next lab, we'll explore optimization techniques used in Bonsai to efficiently search the space of possible pedigrees.