# Lab 7: The PwLogLike Class and Likelihood Computation

## Overview

This lab explores the `PwLogLike` class, a core component of Bonsai v3 responsible for calculating relationship likelihoods based on IBD data. The class combines genetic evidence with demographic information to infer relationships between pairs of individuals.

Key topics include:

1. Structure and initialization of the `PwLogLike` class
2. Methods for calculating genetic likelihoods from IBD data
3. Age-based relationship likelihood computation
4. Integration of multiple evidence sources
5. Practical applications in relationship inference

By the end of this lab, you'll understand how the `PwLogLike` class serves as the quantitative foundation for Bonsai's pedigree reconstruction capabilities.

In [None]:
# 🧬 Google Colab Setup - Run this cell first!
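import os
import sys
import inspect                       # used by the source-inspection helpers below
from collections import defaultdict  # used when aggregating IBD statistics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown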

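def is_colab():
    '''Check if running in Google Colab'''
    try:
        import google.colab
        return True
    except ImportError:
        return False

def is_jupyterlite():
    '''Check if running in JupyterLite (Pyodide kernel, where sys.platform is "emscripten")'''
    return sys.platform == "emscripten"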

if is_colab():
    print("🔬 Setting up Google Colab environment...")
    
    # Install dependencies
    print("📦 Installing packages...")
    !pip install -q pysam biopython scikit-allel networkx pygraphviz seaborn plotly
    !apt-get update -qq && apt-get install -qq samtools bcftools tabix graphviz-dev
    
    # Create directories
    !mkdir -p /content/class_data /content/results
    
    # Download essential class data
    print("📥 Downloading class data...")
    S3_BASE = "https://computational-genetic-genealogy.s3.us-east-2.amazonaws.com/class_data/"
    data_files = [
        "pedigree.fam", "pedigree.def", 
        "merged_opensnps_autosomes_ped_sim.seg",
        "merged_opensnps_autosomes_ped_sim-everyone.fam",
        "ped_sim_run2.seg", "ped_sim_run2-everyone.fam"
    ]
    
    for file in data_files:
        !wget -q -O /content/class_data/{file} {S3_BASE}{file}
        print(f"  ✅ {file}")
    
    # Define utility functions
    def setup_environment():
        return "/content/class_data", "/content/results"
    
    def save_results(dataframe, filename, description="results"):
        os.makedirs("/content/results", exist_ok=True)
        full_path = f"/content/results/{filename}"
        dataframe.to_csv(full_path, index=False)
        display(HTML(f'''
        <div style="padding: 10px; background-color: #e3f2fd; border-left: 4px solid #2196f3; margin: 10px 0;">
            <p><strong>💾 Results saved!</strong> To download: 
            <code>from google.colab import files; files.download('{full_path}')</code></p>
        </div>
        '''))
        return full_path
    
    def save_plot(plt, filename, description="plot"):
        os.makedirs("/content/results", exist_ok=True)
        full_path = f"/content/results/{filename}"
        plt.savefig(full_path, dpi=300, bbox_inches='tight')
        plt.show()
        display(HTML(f'''
        <div style="padding: 10px; background-color: #e8f5e8; border-left: 4px solid #4caf50; margin: 10px 0;">
            <p><strong>📊 Plot saved!</strong> To download: 
            <code>from google.colab import files; files.download('{full_path}')</code></p>
        </div>
        '''))
        return full_path
    
    print("✅ Colab setup complete! Ready to explore genetic genealogy.")
    
else:
    print("🏠 Local environment detected")
    def setup_environment():
        return "class_data", "results"
    def save_results(df, filename, description=""):
        os.makedirs("results", exist_ok=True)
        path = f"results/{filename}"
        df.to_csv(path, index=False)
        return path
    def save_plot(plt, filename, description=""):
        os.makedirs("results", exist_ok=True)
        path = f"results/{filename}"
        plt.savefig(path, dpi=300, bbox_inches='tight')
        plt.show()
        return path

# Set up paths and configure visualization
DATA_DIR, RESULTS_DIR = setup_environment()
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [None]:
# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
def view_class_methods(cls):
    """Display methods of a class with their docstrings."""
    for name, method in inspect.getmembers(cls, predicate=inspect.isfunction):
        if not name.startswith('_'):  # Skip private methods
            print(f"\n## {name}")
            sig = inspect.signature(method)
            print(f"Signature: {name}{sig}")
            
            doc = inspect.getdoc(method)
            if doc:
                print(f"Documentation: {doc}")
            else:
                print("No documentation available.")

def view_source(obj):
    """Display the source code of an object."""
    try:
        source = inspect.getsource(obj)
        display(Markdown(f"```python\n{source}\n```"))
    except Exception as e:
        print(f"Error retrieving source: {e}")

## Import Bonsai Modules

Let's import the necessary Bonsai v3 modules, particularly the `likelihoods` module that contains the `PwLogLike` class:

In [None]:
try:
    from utils.bonsaitree.bonsaitree.v3 import likelihoods, moments
    print("✅ Successfully imported Bonsai v3 likelihoods and moments modules")
    
    # Check if the PwLogLike class is available
    if hasattr(likelihoods, 'PwLogLike'):
        print("✅ PwLogLike class is available")
        # Look at class attributes and methods
        PwLogLike = likelihoods.PwLogLike
        print(f"\nMethods in PwLogLike class:")
        methods = [name for name, method in inspect.getmembers(PwLogLike, predicate=inspect.isfunction) 
                  if not name.startswith('_')]
        for method in methods:
            print(f"- {method}")
    else:
        print("❌ PwLogLike class not found in likelihoods module")
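except ImportError as e:
    # Define the names anyway so later hasattr(likelihoods, ...) checks fail gracefully
    likelihoods = moments = None
    print(f"❌ Failed to import Bonsai modules: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")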

## Part 1: The PwLogLike Class Structure

The `PwLogLike` class is the workhorse of relationship inference in Bonsai v3. It calculates the likelihood of different possible relationships between pairs of individuals based on their IBD sharing patterns and demographic information.

Let's examine the class structure and initialization:

In [None]:
# Examine the class initialization method
if not is_jupyterlite() and hasattr(likelihoods, 'PwLogLike'):
    view_source(likelihoods.PwLogLike.__init__)

### 1.1 Class Initialization and Parameters

The `PwLogLike` class is initialized with several key parameters:

1. **`bio_info`**: Biographical information for individuals, including:
   - `genotype_id`: ID of the genotyped individual
   - `age`: Age of the individual (for age-based likelihood computation)
   - `sex`: Sex of the individual (for biological constraint validation)
   - Coverage information for the individual's genome

2. **`unphased_ibd_seg_list`**: List of IBD segments in the format:
   ```
   [id1, id2, chromosome, start_pos, end_pos, is_full_ibd, length_cm]
   ```
   Here `is_full_ibd` is `True` for an IBD2 segment (both haplotypes shared identical by descent) and `False` for an IBD1 segment.

3. **`condition_pair_set`**: Set of pairs to condition on (optional)
   - Used for conditional likelihood calculations

4. **`mean_bgd_num`** and **`mean_bgd_len`**: Background IBD parameters
   - Model the chance sharing between supposedly unrelated individuals

Let's create an example of the input data format that `PwLogLike` expects:

In [None]:
# Example bio_info format for PwLogLike
example_bio_info = [
    {
        'genotype_id': 1001,
        'age': 70,
        'sex': 'M',
        'coverage': 0.9  # 90% coverage of the genome
    },
    {
        'genotype_id': 1002,
        'age': 40,
        'sex': 'F',
        'coverage': 0.85  # 85% coverage of the genome
    },
    {
        'genotype_id': 1003,
        'age': 45,
        'sex': 'M',
        'coverage': 0.92  # 92% coverage of the genome
    },
    {
        'genotype_id': 1004,
        'age': 15,
        'sex': 'F',
        'coverage': 0.88  # 88% coverage of the genome
    }
]

# Example unphased IBD segment format for PwLogLike
# Format: [id1, id2, chromosome, start_pos, end_pos, is_full_ibd, length_cm]
example_ibd_segments = [
    # Parent-child relationship between 1001 and 1002
    [1001, 1002, 1, 10000, 50000000, False, 50.0],  # IBD1 segment on chr1
    [1001, 1002, 2, 5000, 60000000, False, 55.0],   # IBD1 segment on chr2
    [1001, 1002, 3, 20000, 40000000, False, 40.0],  # IBD1 segment on chr3
    
    # Full siblings relationship between 1002 and 1003
    [1002, 1003, 1, 10000, 30000000, False, 30.0],  # IBD1 segment on chr1
    [1002, 1003, 1, 40000, 50000000, True, 20.0],   # IBD2 segment on chr1
    [1002, 1003, 2, 5000, 25000000, False, 25.0],   # IBD1 segment on chr2
    
    # Parent-child relationship between 1003 and 1004
    [1003, 1004, 1, 15000, 45000000, False, 45.0],  # IBD1 segment on chr1
    [1003, 1004, 2, 10000, 50000000, False, 50.0],  # IBD1 segment on chr2
    [1003, 1004, 3, 5000, 35000000, False, 35.0]    # IBD1 segment on chr3
]

# Display the example data
print("Example biographical information:")
display(pd.DataFrame(example_bio_info))

print("\nExample IBD segments:")
ibd_df = pd.DataFrame(example_ibd_segments, 
                     columns=['id1', 'id2', 'chromosome', 'start_pos', 'end_pos', 'is_full_ibd', 'length_cm'])
display(ibd_df)

### 1.2 Internal Data Structures

When initialized, the `PwLogLike` class creates several internal data structures:

1. **`ibd_stat_dict`**: Dictionary of IBD statistics for each pair of individuals
   - Maps pairs to statistics like total IBD1, total IBD2, segment counts, etc.

2. **`id_to_info`**: Maps individual IDs to their biographical information
   - Quick lookup for age, sex, and coverage information

3. **`age_diff_dict`**, **`sex_dict`**, etc.: Derived dictionaries for quick access to specific attributes
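
As a rough illustration of the lookup structures described above, the following sketch builds an ID-to-record map and a signed age-difference map from records shaped like `bio_info`. The function name `build_lookups` and the tuple-keyed `age_diff` dictionary are assumptions made for this example; they are not taken from the Bonsai source, whose internal layout may differ.

```python
from itertools import combinations

# Hypothetical records in the same shape as the lab's bio_info entries.
bio_info = [
    {"genotype_id": 1001, "age": 70, "sex": "M", "coverage": 0.90},
    {"genotype_id": 1002, "age": 40, "sex": "F", "coverage": 0.85},
    {"genotype_id": 1003, "age": None, "sex": "M", "coverage": 0.92},
]

def build_lookups(records):
    """Return (id_to_info, age_diff) lookup dictionaries.

    age_diff maps an ordered pair (id1, id2) to age(id1) - age(id2).
    Pairs with a missing age are left out.
    """
    id_to_info = {rec["genotype_id"]: rec for rec in records}
    age_diff = {}
    for id1, id2 in combinations(sorted(id_to_info), 2):
        age1 = id_to_info[id1].get("age")
        age2 = id_to_info[id2].get("age")
        if age1 is None or age2 is None:
            continue  # no age evidence for this pair
        age_diff[(id1, id2)] = age1 - age2
        age_diff[(id2, id1)] = age2 - age1
    return id_to_info, age_diff

id_to_info, age_diff = build_lookups(bio_info)
print(age_diff[(1001, 1002)])  # signed difference, first ID minus second
```

Storing both orderings keeps the sign of the difference unambiguous, which matters for directional relationships such as parent-child.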

Let's see how these are created, focusing on the IBD statistics extraction and computation:

In [None]:
# Function to extract IBD statistics (similar to what PwLogLike does internally)
def get_ibd_stats(ibd_segments):
    """Calculate IBD statistics from a list of IBD segments."""
    # Initialize a dictionary to store statistics
    ibd_stats = defaultdict(lambda: {
        'total_half': 0,    # Total cM of IBD1 (half-identical) segments
        'total_full': 0,    # Total cM of IBD2 (fully identical) segments
        'num_half': 0,      # Number of IBD1 segments
        'num_full': 0,      # Number of IBD2 segments
        'max_seg_cm': 0     # Length of largest segment
    })
    
    # Process each segment
    for segment in ibd_segments:
        id1, id2, _, _, _, is_full_ibd, seg_cm = segment
        
        # Create a frozenset key for the pair (order-independent)
        pair_key = frozenset([id1, id2])
        
        # Update statistics based on segment type
        if is_full_ibd:  # IBD2 segment
            ibd_stats[pair_key]['total_full'] += seg_cm
            ibd_stats[pair_key]['num_full'] += 1
        else:  # IBD1 segment
            ibd_stats[pair_key]['total_half'] += seg_cm
            ibd_stats[pair_key]['num_half'] += 1
        
        # Update max segment length if this segment is larger
        ibd_stats[pair_key]['max_seg_cm'] = max(ibd_stats[pair_key]['max_seg_cm'], seg_cm)
    
    return dict(ibd_stats)

# Calculate IBD statistics from our example data
example_ibd_stats = get_ibd_stats(example_ibd_segments)

# Display the results
print("IBD Statistics by Pair:")
for pair, pair_stats in example_ibd_stats.items():
    ids = sorted(pair)  # frozenset iteration order is arbitrary, so sort for stable output
    print(f"\nPair {ids[0]}-{ids[1]}:")
    print(f"  Total IBD1: {pair_stats['total_half']:.1f} cM in {pair_stats['num_half']} segments")
    print(f"  Total IBD2: {pair_stats['total_full']:.1f} cM in {pair_stats['num_full']} segments")
    print(f"  Total IBD: {pair_stats['total_half'] + pair_stats['total_full']:.1f} cM")
    print(f"  Max segment: {pair_stats['max_seg_cm']:.1f} cM")

## Part 2: Genetic Likelihood Calculation

The `PwLogLike` class calculates genetic likelihoods using the IBD statistics we just examined. Let's explore the core methods used for genetic likelihood computation.

In [None]:
# Examine the get_log_like method, which is the main entry point for likelihood calculations
if not is_jupyterlite() and hasattr(likelihoods, 'PwLogLike'):
    view_source(likelihoods.PwLogLike.get_log_like)

### 2.1 The `get_log_like` Method

The `get_log_like` method calculates the likelihood of a specific relationship between two individuals. Let's break down its key components:

1. **Input Parameters**:
   - `id1`, `id2`: IDs of the two individuals
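   - `relationship_tuple`: an `(up, down, num_ancs)` tuple, where `up` is the number of meioses from `id1` up to the common ancestor(s), `down` is the number of meioses from there down to `id2`, and `num_ancs` is the number of common ancestors (1 or 2)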
   - `condition`: Whether to condition on observing IBD

2. **Processing Steps**:
   - Calculates the genetic likelihood (`get_pw_gen_ll`)
   - Optionally calculates the age-based likelihood (`get_pw_age_ll`)
   - Combines the likelihoods with appropriate weights

3. **Return Value**:
   - Log-likelihood score for the specified relationship

Let's implement a simplified version of this method to understand the core logic:

In [None]:
def describe_relationship(rel_tuple):
    """Convert a relationship tuple (up, down, num_ancs) to a human-readable description."""
    up, down, num_ancs = rel_tuple
    
    if up == 0 and down == 0 and num_ancs == 2:
        return "Self"
    elif up == 0 and down == 1 and num_ancs == 1:
        return "Parent"
    elif up == 1 and down == 0 and num_ancs == 1:
        return "Child"
    elif up == 1 and down == 1 and num_ancs == 2:
        return "Full Sibling"
    elif up == 1 and down == 1 and num_ancs == 1:
        return "Half Sibling"
    elif up == 0 and down == 2 and num_ancs == 1:
        return "Grandparent"
    elif up == 2 and down == 0 and num_ancs == 1:
        return "Grandchild"
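    elif up == 1 and down == 2:
        # A full aunt/uncle shares two common ancestors with the niece/nephew
        return "Aunt/Uncle" if num_ancs == 2 else "Half Aunt/Uncle"
    elif up == 2 and down == 1:
        return "Niece/Nephew" if num_ancs == 2 else "Half Niece/Nephew"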
    elif up == 2 and down == 2 and num_ancs == 2:
        return "Full First Cousin"
    elif up == 2 and down == 2 and num_ancs == 1:
        return "Half First Cousin"
    else:
        return f"Complex Relationship (up={up}, down={down}, num_ancs={num_ancs})"

# Let's see if we can create and use a PwLogLike instance with our example data
if not is_jupyterlite() and hasattr(likelihoods, 'PwLogLike'):
    try:
        # Create a PwLogLike instance using our example data
        pw_ll = likelihoods.PwLogLike(
            bio_info=example_bio_info,
            unphased_ibd_seg_list=example_ibd_segments
        )
        print("✅ Successfully created PwLogLike instance")
        
        # Test some relationship likelihoods
        test_pairs = [
            (1001, 1002),  # Parent-child
            (1002, 1003),  # Full siblings
            (1003, 1004)   # Parent-child
        ]
        
        test_relationships = [
            (0, 1, 1),  # Parent (up=0, down=1)
            (1, 0, 1),  # Child (up=1, down=0)
            (1, 1, 2),  # Full sibling (up=1, down=1, num_ancs=2)
            (1, 1, 1),  # Half sibling (up=1, down=1, num_ancs=1)
            (2, 2, 2)   # Full first cousin (up=2, down=2, num_ancs=2)
        ]
        
        print("\nCalculating likelihoods for different relationships:")
        for id1, id2 in test_pairs:
            print(f"\nPair {id1}-{id2}:")
            
            # Calculate likelihoods for each test relationship
            likelihoods_results = []
            for rel_tuple in test_relationships:
                log_ll = pw_ll.get_log_like(id1, id2, rel_tuple)
                likelihoods_results.append((rel_tuple, log_ll))
            
            # Sort by likelihood (highest first)
            likelihoods_results.sort(key=lambda x: x[1], reverse=True)
            
            # Display results
            for rel_tuple, log_ll in likelihoods_results:
                rel_desc = describe_relationship(rel_tuple)
                print(f"  {rel_desc}: {log_ll:.2f}")
    except Exception as e:
        print(f"Error using PwLogLike: {e}")

# Simplified helper functions for likelihood calculation.
# They are defined at the top level (not only in a fallback branch) because
# later cells reuse them even when the real PwLogLike class is available.
def get_expected_ibd(relationship_tuple, min_seg_len=7, genome_length=3400):
    """Get expected IBD statistics (in cM) for a relationship."""
    up, down, num_ancs = relationship_tuple
    meiotic_distance = up + down
    
    # Handle special case for self
    if meiotic_distance == 0 and num_ancs == 2:
        return {'total_half': 0, 'total_full': genome_length, 'num_half': 0, 'num_full': 1}
    
    # Expected fraction of the genome covered by IBD1 and IBD2 segments
    if num_ancs == 1:  # One common ancestor (direct-line or half relationship)
        # Parent-child share IBD1 across the whole genome; each further meiosis halves it
        expected_half_fraction = 0.5 ** max(meiotic_distance - 1, 0)
        expected_full_fraction = 0
    else:  # Two common ancestors (full relationship)
        if meiotic_distance == 2:  # Full siblings
            expected_half_fraction = 0.5
            expected_full_fraction = 0.25
        else:
            # Two ancestors double the one-ancestor coverage (e.g. first cousins: 0.25)
            expected_half_fraction = 2 * (0.5 ** (meiotic_distance - 1))
            expected_full_fraction = 0
    
    expected_total_half = expected_half_fraction * genome_length
    expected_total_full = expected_full_fraction * genome_length
    
    # Approximate segment numbers
    if meiotic_distance > 0:
        # Rough approximation of segment count
        num_segments = num_ancs * (meiotic_distance * 34 + 22) / (2 ** (meiotic_distance - 1))
        
        if expected_total_half > 0:
            num_half = int(num_segments * (expected_total_half / (expected_total_half + expected_total_full)))
        else:
            num_half = 0
            
        if expected_total_full > 0:
            num_full = int(num_segments * (expected_total_full / (expected_total_half + expected_total_full)))
        else:
            num_full = 0
    else:
        num_half = 0
        num_full = 1 if expected_total_full > 0 else 0
    
    return {
        'total_half': expected_total_half,
        'total_full': expected_total_full,
        'num_half': num_half,
        'num_full': num_full
    }

def calculate_simplified_likelihood(observed_stats, relationship_tuple):
    """Calculate a simplified likelihood of observed IBD given a relationship."""
    # Get expected statistics for this relationship
    expected_stats = get_expected_ibd(relationship_tuple)
    
    # Calculate log-likelihood components (Gaussian-style penalties on the mismatch)
    # 1. Total IBD1 similarity
    total_half_ll = -((observed_stats['total_half'] - expected_stats['total_half']) ** 2) / (2 * (expected_stats['total_half'] + 1) * 100)
    
    # 2. Total IBD2 similarity
    total_full_ll = -((observed_stats['total_full'] - expected_stats['total_full']) ** 2) / (2 * (expected_stats['total_full'] + 1) * 100)
    
    # 3. Segment count similarity
    num_half_ll = -((observed_stats['num_half'] - expected_stats['num_half']) ** 2) / (2 * (expected_stats['num_half'] + 1))
    num_full_ll = -((observed_stats['num_full'] - expected_stats['num_full']) ** 2) / (2 * (expected_stats['num_full'] + 1))
    
    # Combine components
    combined_ll = total_half_ll + total_full_ll + num_half_ll + num_full_ll
    
    return combined_ll

# Run the simplified demonstration only when the real class is unavailable
if is_jupyterlite() or not hasattr(likelihoods, 'PwLogLike'):
    print("Cannot demonstrate PwLogLike directly. Showing simplified calculation instead.")
    
    # Test the simplified likelihood calculation with our example data
    test_pairs = [(1001, 1002), (1002, 1003), (1003, 1004)]
    test_relationships = [
        (0, 1, 1),  # Parent (up=0, down=1)
        (1, 0, 1),  # Child (up=1, down=0)
        (1, 1, 2),  # Full sibling
        (1, 1, 1),  # Half sibling
        (2, 2, 2)   # Full first cousin
    ]
    
    print("\nSimplified likelihood calculation:")
    for id1, id2 in test_pairs:
        pair_key = frozenset((id1, id2))  # order-independent lookup key
        if pair_key in example_ibd_stats:
            stats = example_ibd_stats[pair_key]
            
            print(f"\nPair {id1}-{id2}:")
            
            # Calculate likelihoods for each test relationship
            likelihoods_results = []
            for rel_tuple in test_relationships:
                log_ll = calculate_simplified_likelihood(stats, rel_tuple)
                likelihoods_results.append((rel_tuple, log_ll))
            
            # Sort by likelihood (highest first)
            likelihoods_results.sort(key=lambda x: x[1], reverse=True)
            
            # Display results
            for rel_tuple, log_ll in likelihoods_results:
                rel_desc = describe_relationship(rel_tuple)
                print(f"  {rel_desc}: {log_ll:.2f}")

### 2.2 Genetic Likelihood Components

The genetic likelihood calculation in `PwLogLike` is built upon several components:

1. **Segment Count Likelihood**: How well the observed number of segments matches the expected count for a relationship

2. **Length Distribution Likelihood**: How well the distribution of segment lengths matches the expected distribution

3. **IBD2 Proportion Likelihood**: How well the proportion of IBD2 segments matches expectations

4. **Total IBD Sharing Likelihood**: How well the total amount of IBD matches expectations

The actual implementation in Bonsai v3 uses more sophisticated statistical models from the `moments` module, which we explored in previous labs. These models have been calibrated based on real genetic data to provide accurate likelihood estimations.
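
To make the first two components concrete, here is a minimal sketch that scores a list of segment lengths under a textbook model: the segment count is treated as Poisson and the segment lengths as exponential. This model, the function name `segment_log_likelihood`, and the parameter values are assumptions chosen for illustration; they are not the calibrated distributions used by Bonsai's `moments` module.

```python
import math

def segment_log_likelihood(seg_lengths_cm, expected_count, mean_length_cm):
    """Log-likelihood of segments under an illustrative Poisson/exponential model.

    count   ~ Poisson(expected_count)
    lengths ~ Exponential(mean = mean_length_cm), independent per segment
    """
    n = len(seg_lengths_cm)
    # Poisson log-pmf: n*log(lambda) - lambda - log(n!)
    count_ll = n * math.log(expected_count) - expected_count - math.lgamma(n + 1)
    # Exponential log-pdf summed over segments: -log(mean) - x/mean
    length_ll = sum(-math.log(mean_length_cm) - x / mean_length_cm
                    for x in seg_lengths_cm)
    return count_ll + length_ll

# Three long segments, scored under two hypothetical parameter sets.
observed = [50.0, 55.0, 40.0]
close = segment_log_likelihood(observed, expected_count=3.0, mean_length_cm=50.0)
distant = segment_log_likelihood(observed, expected_count=0.5, mean_length_cm=10.0)
print(f"close-relative model:   {close:.2f}")
print(f"distant-relative model: {distant:.2f}")
```

A few long segments fit the parameter set with more and longer expected segments better, so `close` exceeds `distant`. The same intuition drives the real model, with more careful distributions.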

Let's check the `get_pw_gen_ll` method, which is responsible for the genetic likelihood calculation:

In [None]:
# Examine the get_pw_gen_ll method
if not is_jupyterlite() and hasattr(likelihoods, 'PwLogLike'):
    view_source(likelihoods.PwLogLike.get_pw_gen_ll)

## Part 3: Age-Based Likelihood Calculation

A notable feature of the `PwLogLike` class is its ability to incorporate age information into relationship inference. Let's explore how age-based likelihoods are calculated.

In [None]:
# Examine the get_pw_age_ll method
if not is_jupyterlite() and hasattr(likelihoods, 'PwLogLike'):
    view_source(likelihoods.PwLogLike.get_pw_age_ll)

### 3.1 Age-Based Likelihood Calculation

The `get_pw_age_ll` method calculates the likelihood of a relationship based on the age difference between individuals. This provides an additional source of evidence that complements the genetic evidence.

Key aspects of age-based likelihood calculation:

1. **Expected Age Differences**: Different relationships have different expected age differences
   - Parent-child: ~25-30 years
   - Siblings: ~0-10 years
   - Grandparent-grandchild: ~50-60 years

2. **Normal Distribution Model**: Age differences for each relationship type are modeled using normal distributions
   - Mean and standard deviation parameters are calibrated based on demographic data

3. **Directional Relationships**: For directional relationships (e.g., parent-child), the sign of the age difference matters
   - Parents should be older than children
   - Aunts/uncles should be older than nieces/nephews

4. **Biological Constraints**: Some age differences are biologically impossible
   - E.g., a "parent" cannot be younger than a "child"
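
The normal model described above can be written out explicitly. As a sketch, let $\Delta$ be the age of the first individual minus the age of the second, and let $\mu$ and $\sigma$ be the mean and standard deviation assumed for a relationship type (these symbols are illustrative, not names from the Bonsai code). The log-likelihood is then:

$$
\log p(\Delta \mid \mu, \sigma) = -\frac{(\Delta - \mu)^2}{2\sigma^2} - \log\left(\sigma\sqrt{2\pi}\right)
$$

For example, with illustrative values $\mu = 30$ and $\sigma = 10$ for a parent-child pair, an observed difference of $\Delta = 30$ years gives $-\log(10\sqrt{2\pi}) \approx -3.22$, while $\Delta = 10$ years gives $-\frac{400}{200} - 3.22 \approx -5.22$. Under a hard biological constraint, a negative difference for a proposed parent would instead be assigned $-\infty$.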

Let's implement a simplified version of age-based likelihood calculation:

In [None]:
def get_expected_age_diff(relationship_tuple):
    """Get expected age difference parameters for a relationship.
    
    Returns:
        (mean_diff, std_dev, sign): Mean difference, standard deviation, and expected sign
        Sign is 1 if id1 should be older, -1 if id2 should be older, 0 if no constraint
    """
    up, down, num_ancs = relationship_tuple
    
    # Self relationship
    if up == 0 and down == 0:
        return 0, 0.1, 0  # Exact same age
    
    # Parent-child relationships
    if up == 0 and down == 1:  # id1 is parent of id2
        return 30, 10, 1  # id1 should be older
    if up == 1 and down == 0:  # id1 is child of id2
        return -30, 10, -1  # id1 should be younger
    
    # Sibling relationships
    if up == 1 and down == 1:  # siblings
        return 0, 10, 0  # No strong constraint on age difference
    
    # Grandparent relationships
    if up == 0 and down == 2:  # id1 is grandparent of id2
        return 60, 15, 1  # id1 should be much older
    if up == 2 and down == 0:  # id1 is grandchild of id2
        return -60, 15, -1  # id1 should be much younger
    
    # Avuncular relationships (aunt/uncle, niece/nephew)
    if up == 1 and down == 2:  # id1 is aunt/uncle of id2
        return 20, 15, 1  # id1 should be older
    if up == 2 and down == 1:  # id1 is niece/nephew of id2
        return -20, 15, -1  # id1 should be younger
    
    # Cousin relationships
    if up == 2 and down == 2:  # cousins
        return 0, 20, 0  # No strong constraint on age difference
    
    # Default for other relationships
    return 0, 30, 0  # No specific expectation

# Import norm by name: `stats` is reused as a variable name elsewhere in this
# notebook, which would shadow the scipy.stats module.
from scipy.stats import norm

def calculate_age_likelihood(age1, age2, relationship_tuple):
    """Calculate the log-likelihood of a relationship based on age difference."""
    # Calculate age difference (id1 - id2)
    age_diff = age1 - age2
    
    # Get expected age difference parameters
    expected_diff, std_dev, expected_sign = get_expected_age_diff(relationship_tuple)
    
    # Check for biological impossibility
    if (expected_sign > 0 and age_diff < 0) or (expected_sign < 0 and age_diff > 0):
        return float('-inf')  # Biologically impossible
    
    # Calculate log-likelihood using normal distribution
    log_likelihood = norm.logpdf(age_diff, expected_diff, std_dev)
    
    return log_likelihood

# Create an age dictionary from our example data
age_dict = {person['genotype_id']: person['age'] for person in example_bio_info}

# Test the age-based likelihood calculation
test_pairs = [(1001, 1002), (1002, 1003), (1003, 1004)]
test_relationships = [
    (0, 1, 1),  # Parent (up=0, down=1)
    (1, 0, 1),  # Child (up=1, down=0)
    (1, 1, 2),  # Full sibling
    (1, 1, 1),  # Half sibling
    (2, 2, 2)   # Full first cousin
]

print("Age-based likelihood calculation:")
for id1, id2 in test_pairs:
    age1 = age_dict[id1]
    age2 = age_dict[id2]
    age_diff = age1 - age2
    
    print(f"\nPair {id1}-{id2}: Ages {age1} and {age2} (difference: {age_diff} years)")
    
    # Calculate likelihoods for each test relationship
    likelihoods_results = []
    for rel_tuple in test_relationships:
        log_ll = calculate_age_likelihood(age1, age2, rel_tuple)
        likelihoods_results.append((rel_tuple, log_ll))
    
    # Sort by likelihood (highest first)
    likelihoods_results.sort(key=lambda x: x[1], reverse=True)
    
    # Display results
    for rel_tuple, log_ll in likelihoods_results:
        rel_desc = describe_relationship(rel_tuple)
        print(f"  {rel_desc}: {log_ll:.2f}")

## Part 4: Combined Likelihood and Relationship Inference

Now that we've explored both genetic and age-based likelihood calculations, let's see how `PwLogLike` combines these sources of evidence to infer relationships.

### 4.1 Combining Likelihoods

The `get_log_like` method combines genetic and age-based likelihoods using weights that reflect their relative reliability:

```python
combined_ll = gen_ll + age_weight * age_ll
```

Where:
- `gen_ll` is the genetic likelihood
- `age_ll` is the age-based likelihood
- `age_weight` controls the influence of age information (typically between 0.1 and 0.5)

Let's implement a simple combined likelihood calculation:

In [None]:
def combined_likelihood(genetic_ll, age_ll, age_weight=0.25):
    """Combine genetic and age-based likelihoods."""
    return genetic_ll + age_weight * age_ll

# Let's only do this if we didn't already do it with the real PwLogLike class
if is_jupyterlite() or not hasattr(likelihoods, 'PwLogLike'):
    # For each pair, calculate both genetic and age likelihoods, then combine them
    print("Combined likelihood calculation:")
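    for id1, id2 in test_pairs:
        # Keep the original (id1, id2) order: the age likelihood is direction-sensitive,
        # and unpacking a frozenset would return the IDs in arbitrary order.
        pair_key = frozenset((id1, id2))
        if pair_key in example_ibd_stats:
            stats = example_ibd_stats[pair_key]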
            age1 = age_dict[id1]
            age2 = age_dict[id2]
            
            print(f"\nPair {id1}-{id2}: Ages {age1} and {age2}")
            
            # Calculate likelihoods for each test relationship
            likelihoods_results = []
            for rel_tuple in test_relationships:
                genetic_ll = calculate_simplified_likelihood(stats, rel_tuple)
                age_ll = calculate_age_likelihood(age1, age2, rel_tuple)
                combined_ll = combined_likelihood(genetic_ll, age_ll)
                
                likelihoods_results.append((rel_tuple, genetic_ll, age_ll, combined_ll))
            
            # Sort by combined likelihood (highest first)
            likelihoods_results.sort(key=lambda x: x[3], reverse=True)
            
            # Display results
            print(f"{'Relationship':<20} {'Genetic LL':<12} {'Age LL':<12} {'Combined LL':<12}")
            for rel_tuple, genetic_ll, age_ll, combined_ll in likelihoods_results:
                rel_desc = describe_relationship(rel_tuple)
                print(f"{rel_desc:<20} {genetic_ll:<12.2f} {age_ll:<12.2f} {combined_ll:<12.2f}")
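
Raw log-likelihoods are easier to compare once they are normalized across the candidate relationships. The sketch below converts a set of scores into relative probabilities with the log-sum-exp trick. It assumes a uniform prior over the candidates, and the helper name `normalize_log_likelihoods` and the example scores are hypothetical; nothing here is taken from the Bonsai API.

```python
import math

def normalize_log_likelihoods(log_likes):
    """Convert {label: log-likelihood} into relative probabilities.

    Assumes a uniform prior over the listed hypotheses. Subtracting the
    maximum before exponentiating avoids numerical underflow.
    """
    finite = [v for v in log_likes.values() if v != float("-inf")]
    if not finite:
        raise ValueError("all hypotheses have zero likelihood")
    max_ll = max(finite)
    # exp(-inf) evaluates to 0.0, so impossible hypotheses get zero weight
    weights = {label: math.exp(v - max_ll) for label, v in log_likes.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

# Hypothetical combined scores for one pair
scores = {"Parent": -3.0, "Full Sibling": -5.0, "Child": float("-inf")}
probs = normalize_log_likelihoods(scores)
for label, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
    print(f"{label:<15} {p:.3f}")
```

A difference of 2 log-units between the top two candidates corresponds to a likelihood ratio of about 7.4, which is often a more useful summary than either score alone.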

### 4.2 Inferring the Most Likely Relationship

Using the combined likelihoods, we can now infer the most likely relationship for a pair of individuals. The `get_most_likely_rel` method in `PwLogLike` does this by evaluating a range of possible relationships and returning the one with the highest likelihood.

Let's examine this method:

In [None]:
# Examine the get_most_likely_rel method
if not is_jupyterlite() and hasattr(likelihoods, 'PwLogLike'):
    view_source(likelihoods.PwLogLike.get_most_likely_rel)

Let's implement a simplified version of this method to understand the core logic:

In [None]:
def get_most_likely_relationship(id1, id2, ibd_stats, age_dict, max_degree=4):
    """Find the most likely relationship between two individuals."""
    # Generate all possible relationship tuples up to max_degree
    relationship_options = []
    
    # Add direct (ancestor/descendant) relationships
    for deg in range(1, max_degree + 1):
        relationship_options.append((0, deg, 1))  # id1 is ancestor of id2
        relationship_options.append((deg, 0, 1))  # id1 is descendant of id2
    
    # Add collateral relationships (up and down each range from 1 to max_degree)
    for up in range(1, max_degree + 1):
        for down in range(1, max_degree + 1):
            relationship_options.append((up, down, 2))  # Full relationship
            relationship_options.append((up, down, 1))  # Half relationship
    
    # Get age information
    age1 = age_dict.get(id1)
    age2 = age_dict.get(id2)
    
    # Get IBD statistics for this pair once; they do not depend on the hypothesis
    pair_key = frozenset([id1, id2])
    stats = ibd_stats.get(pair_key, {'total_half': 0, 'total_full': 0, 'num_half': 0, 'num_full': 0, 'max_seg_cm': 0})
    
    # Calculate likelihood for each relationship option
    likelihoods_results = []
    for rel_tuple in relationship_options:
        
        # Calculate genetic likelihood
        genetic_ll = calculate_simplified_likelihood(stats, rel_tuple)
        
        # Calculate age likelihood if ages are available
        if age1 is not None and age2 is not None:
            age_ll = calculate_age_likelihood(age1, age2, rel_tuple)
        else:
            age_ll = 0.0
        
        # Combine likelihoods
        combined_ll = combined_likelihood(genetic_ll, age_ll)
        
        likelihoods_results.append((rel_tuple, genetic_ll, age_ll, combined_ll))
    
    # Sort by combined likelihood (highest first)
    likelihoods_results.sort(key=lambda x: x[3], reverse=True)
    
    # Return the most likely relationship and its likelihood
    return likelihoods_results[0]

# Test the relationship inference with our example data
test_pairs = [(1001, 1002), (1002, 1003), (1003, 1004), (1001, 1003), (1001, 1004), (1002, 1004)]

print("Relationship inference results:")
for id1, id2 in test_pairs:
    pair_key = frozenset([id1, id2])
    
    if pair_key in example_ibd_stats:
        # Infer relationship
        best_rel, genetic_ll, age_ll, combined_ll = get_most_likely_relationship(id1, id2, example_ibd_stats, age_dict)
        rel_desc = describe_relationship(best_rel)
        
        # Display result
        print(f"\nPair {id1}-{id2}:")
        print(f"  Most likely relationship: {rel_desc} {best_rel}")
        print(f"  Genetic likelihood: {genetic_ll:.2f}")
        print(f"  Age likelihood: {age_ll:.2f}")
        print(f"  Combined likelihood: {combined_ll:.2f}")
    else:
        print(f"\nPair {id1}-{id2}: No IBD data available")
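
The function above returns only the top-ranked hypothesis, which hides how decisively it beat the alternatives. One common way to express this, sketched here under the assumption of a uniform prior over the candidate relationships, is to normalize the combined log-likelihoods into relative probabilities with the log-sum-exp trick. The helper name `normalize_log_likelihoods` and the toy values are illustrative and are not part of Bonsai.

```python
import numpy as np

def normalize_log_likelihoods(log_likelihoods):
    """Convert log-likelihoods into relative probabilities (uniform prior assumed)."""
    ll = np.asarray(log_likelihoods, dtype=float)
    shifted = ll - ll.max()  # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()

# Toy combined log-likelihoods for three hypothetical relationship tuples
toy_results = {
    (0, 1, 1): -12.0,
    (1, 1, 2): -14.0,
    (1, 1, 1): -30.0,
}
rel_tuples = list(toy_results)
probs = normalize_log_likelihoods([toy_results[r] for r in rel_tuples])

# Rank hypotheses and measure the log-likelihood gap to the runner-up
ranked = sorted(zip(rel_tuples, probs), key=lambda x: x[1], reverse=True)
best_rel, best_prob = ranked[0]
margin = toy_results[ranked[0][0]] - toy_results[ranked[1][0]]

for rel, p in ranked:
    print(f"{rel}: {p:.3f}")
print(f"Log-likelihood margin over runner-up: {margin:.2f}")
```

A small margin (or a best probability well below 1) suggests the pair should be treated as ambiguous rather than assigned a single relationship.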

## Part 5: Practical Applications

Now that we understand how the `PwLogLike` class works, let's explore how it's used in practical applications within Bonsai v3.

### 5.1 Relationship Inference in Pedigree Construction

The `PwLogLike` class plays a crucial role in Bonsai's pedigree construction process:

1. **Initial Relationship Assessment**: At the start of pedigree construction, `PwLogLike` is used to assess pairwise relationships between all individuals.

2. **Connection Point Evaluation**: When considering how to connect individuals or merge pedigrees, `PwLogLike` evaluates the likelihood of different connection configurations.

3. **Incremental Pedigree Building**: As the pedigree grows, `PwLogLike` continually assesses relationships between new individuals and the existing pedigree.

4. **Validation and Refinement**: `PwLogLike` helps validate and refine the pedigree structure by identifying unlikely or inconsistent relationships.

The example below shows how `PwLogLike` might be used in the pedigree construction process:

In [None]:
if not is_jupyterlite() and hasattr(likelihoods, 'PwLogLike'):
    # Demonstrate how PwLogLike is used in pedigree construction
    try:
        from utils.bonsaitree.bonsaitree.v3 import pedigrees, connections
        print("✅ Successfully imported additional Bonsai modules")
        
        # Create a simple initial pedigree
        initial_pedigree = {
            1001: {},       # Founder
            1002: {1001: 1} # Child of 1001
        }
        
        # Create PwLogLike instance with our example data
        pw_ll = likelihoods.PwLogLike(
            bio_info=example_bio_info,
            unphased_ibd_seg_list=example_ibd_segments
        )
        
        # 1. Find the most likely relationship between 1003 and existing individuals
        print("\nStep 1: Find the most likely relationship for individual 1003")
        
        for existing_id in [1001, 1002]:
            most_likely_rel, log_ll = pw_ll.get_most_likely_rel(existing_id, 1003)
            rel_desc = describe_relationship(most_likely_rel)
            print(f"  Relationship between {existing_id} and 1003: {rel_desc} (log-likelihood = {log_ll:.2f})")
        
        # 2. Add individual 1003 to the pedigree based on the most likely relationship
        print("\nStep 2: Add individual 1003 to the pedigree")
        
        # For illustration, we hard-code 1003 as a sibling of 1002 (child of 1001)
        # rather than deriving the placement from the likelihoods printed in Step 1
        updated_pedigree = initial_pedigree.copy()
        updated_pedigree[1003] = {1001: 1}
        
        print("  Updated pedigree:")
        for individual, parents in updated_pedigree.items():
            parent_list = ", ".join([str(parent) for parent in parents])
            if parent_list:
                print(f"    Individual {individual} has parents: {parent_list}")
            else:
                print(f"    Individual {individual} is a founder")
        
        # 3. Find the most likely relationship between 1004 and existing individuals
        print("\nStep 3: Find the most likely relationship for individual 1004")
        
        for existing_id in [1001, 1002, 1003]:
            most_likely_rel, log_ll = pw_ll.get_most_likely_rel(existing_id, 1004)
            rel_desc = describe_relationship(most_likely_rel)
            print(f"  Relationship between {existing_id} and 1004: {rel_desc} (log-likelihood = {log_ll:.2f})")
        
        # 4. Add individual 1004 to the pedigree based on the most likely relationship
        print("\nStep 4: Add individual 1004 to the pedigree")
        
        # For illustration, we hard-code 1004 as a child of 1003
        # rather than deriving the placement from the likelihoods printed in Step 3
        final_pedigree = updated_pedigree.copy()
        final_pedigree[1004] = {1003: 1}
        
        print("  Final pedigree:")
        for individual, parents in final_pedigree.items():
            parent_list = ", ".join([str(parent) for parent in parents])
            if parent_list:
                print(f"    Individual {individual} has parents: {parent_list}")
            else:
                print(f"    Individual {individual} is a founder")
    except Exception as e:
        print(f"Error demonstrating PwLogLike in pedigree construction: {e}")
else:
    print("Cannot demonstrate PwLogLike in pedigree construction: PwLogLike is unavailable in this environment (e.g., JupyterLite).")
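
The pedigree dictionaries in the cell above map each individual to its parents. A small helper can invert that mapping to list each parent's children, which makes it easier to check a placement against the pairwise relationship calls. This is a minimal sketch on the same toy pedigree; `children_of` is a hypothetical helper, not a Bonsai function.

```python
from collections import defaultdict

def children_of(pedigree):
    """Invert a {child: {parent: degree}} mapping into {parent: [children]}."""
    children = defaultdict(list)
    for child, parents in pedigree.items():
        for parent in parents:
            children[parent].append(child)
    return {parent: sorted(kids) for parent, kids in children.items()}

# Same toy pedigree as the final step in the cell above
final_pedigree = {
    1001: {},
    1002: {1001: 1},
    1003: {1001: 1},
    1004: {1003: 1},
}

child_map = children_of(final_pedigree)
for parent, kids in sorted(child_map.items()):
    print(f"Individual {parent} has children: {kids}")
```

Individuals that appear under the same parent (here 1002 and 1003) are the pairs for which a sibling or half-sibling call would be consistent with this pedigree.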

### 5.2 The Relationship Inference Pipeline

In practice, the relationship inference pipeline in Bonsai v3 involves several steps:

1. **Data Preparation**:
   - Process IBD segments from IBD detection tools
   - Gather demographic information (age, sex)
   - Organize data into the formats expected by `PwLogLike`

2. **PwLogLike Initialization**:
   - Create a `PwLogLike` instance with the prepared data
   - Configure parameters like background IBD and likelihood weights

3. **Pairwise Relationship Inference**:
   - For each pair of individuals, assess the most likely relationship
   - Calculate confidence measures for inferred relationships

4. **Pedigree Construction**:
   - Use inferred relationships to build a consistent pedigree structure
   - Resolve conflicting relationship evidence
   - Iteratively refine the pedigree

5. **Validation and Evaluation**:
   - Validate the final pedigree against all available evidence
   - Identify areas of uncertainty or potential errors

This pipeline transforms raw genetic and demographic data into a comprehensive family tree structure.
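
As a concrete illustration of the data preparation step, the sketch below aggregates a list of IBD segments into per-pair summary statistics with the same keys used by the simplified likelihood code earlier in this lab (`total_half`, `total_full`, `num_half`, `num_full`, `max_seg_cm`). The segment layout `[id1, id2, chromosome, start, end, is_full_ibd, seg_cm]`, the choice to count full-IBD segments separately from half-IBD segments, and the function name `summarize_ibd_segments` are assumptions made for illustration; check them against the format your IBD detector and Bonsai version actually use.

```python
from collections import defaultdict

def summarize_ibd_segments(segments):
    """Aggregate segments into per-pair IBD statistics keyed by frozenset({id1, id2})."""
    stats = defaultdict(lambda: {
        'total_half': 0.0, 'total_full': 0.0,
        'num_half': 0, 'num_full': 0, 'max_seg_cm': 0.0,
    })
    for id1, id2, _chrom, _start, _end, is_full, seg_cm in segments:
        pair = stats[frozenset([id1, id2])]  # order of the two IDs does not matter
        if is_full:
            pair['total_full'] += seg_cm
            pair['num_full'] += 1
        else:
            pair['total_half'] += seg_cm
            pair['num_half'] += 1
        pair['max_seg_cm'] = max(pair['max_seg_cm'], seg_cm)
    return dict(stats)

# Toy segments (assumed layout); values are made up for illustration
toy_segments = [
    [1001, 1002, '1', 1_000_000, 50_000_000, False, 55.0],
    [1002, 1001, '2', 5_000_000, 30_000_000, False, 28.5],
    [1002, 1003, '1', 2_000_000, 20_000_000, True, 21.0],
]
pair_stats = summarize_ibd_segments(toy_segments)
for pair, s in pair_stats.items():
    print(sorted(pair), s)
```

Keying by `frozenset` mirrors the lookup used in `get_most_likely_relationship`, so segments reported as `(1001, 1002)` and `(1002, 1001)` land in the same entry.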

## Summary

In this lab, we've explored the `PwLogLike` class, which is the foundation of relationship inference in Bonsai v3. Key takeaways include:

1. **Class Structure**: The `PwLogLike` class combines IBD statistics and demographic information to calculate relationship likelihoods.

2. **Genetic Likelihood**: Using statistical models from the `moments` module, `PwLogLike` calculates the likelihood of observed IBD patterns under different relationship hypotheses.

3. **Age-Based Likelihood**: By modeling expected age differences for different relationships, `PwLogLike` incorporates demographic evidence into relationship inference.

4. **Combined Inference**: `PwLogLike` combines genetic and age-based evidence to determine the most likely relationship between individuals.

5. **Applications**: The `PwLogLike` class is used throughout the pedigree construction process, from initial relationship assessment to final validation.

Understanding the `PwLogLike` class is crucial for working with Bonsai v3, as it forms the quantitative foundation for transforming genetic data into family tree structures.

In [None]:
# Convert this notebook to PDF using poetry
!poetry run jupyter nbconvert --to pdf Lab07_PwLogLike_Class.ipynb

# Note: PDF conversion requires LaTeX to be installed on your system
# If you encounter errors, you may need to install it:
# On Ubuntu/Debian: sudo apt-get install texlive-xetex
# On macOS with Homebrew: brew install texlive