# Lab 12: Relationship Assessment and Validation

## Overview

In this lab, we'll explore how Bonsai v3 assesses and validates relationships between individuals in a pedigree. Understanding how to evaluate the plausibility of potential relationships is crucial for accurate pedigree reconstruction from genetic data.

In [None]:
# 🧬 Google Colab Setup - Run this cell first!
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown

def is_colab():
    '''Check if running in Google Colab'''
    try:
        import google.colab
        return True
    except ImportError:
        return False

if is_colab():
    print("🔬 Setting up Google Colab environment...")
    
    # Install dependencies
    print("📦 Installing packages...")
    !pip install -q pysam biopython scikit-allel networkx pygraphviz seaborn plotly
    !apt-get update -qq && apt-get install -qq samtools bcftools tabix graphviz-dev
    
    # Create directories
    !mkdir -p /content/class_data /content/results
    
    # Download essential class data
    print("📥 Downloading class data...")
    S3_BASE = "https://computational-genetic-genealogy.s3.us-east-2.amazonaws.com/class_data/"
    data_files = [
        "pedigree.fam", "pedigree.def", 
        "merged_opensnps_autosomes_ped_sim.seg",
        "merged_opensnps_autosomes_ped_sim-everyone.fam",
        "ped_sim_run2.seg", "ped_sim_run2-everyone.fam"
    ]
    
    for file in data_files:
        !wget -q -O /content/class_data/{file} {S3_BASE}{file}
        print(f"  ✅ {file}")
    
    # Define utility functions
    def setup_environment():
        return "/content/class_data", "/content/results"
    
    def save_results(dataframe, filename, description="results"):
        os.makedirs("/content/results", exist_ok=True)
        full_path = f"/content/results/{filename}"
        dataframe.to_csv(full_path, index=False)
        display(HTML(f'''
        <div style="padding: 10px; background-color: #e3f2fd; border-left: 4px solid #2196f3; margin: 10px 0;">
            <p><strong>💾 Results saved!</strong> To download: 
            <code>from google.colab import files; files.download('{full_path}')</code></p>
        </div>
        '''))
        return full_path
    
    def save_plot(plt, filename, description="plot"):
        os.makedirs("/content/results", exist_ok=True)
        full_path = f"/content/results/{filename}"
        plt.savefig(full_path, dpi=300, bbox_inches='tight')
        plt.show()
        display(HTML(f'''
        <div style="padding: 10px; background-color: #e8f5e8; border-left: 4px solid #4caf50; margin: 10px 0;">
            <p><strong>📊 Plot saved!</strong> To download: 
            <code>from google.colab import files; files.download('{full_path}')</code></p>
        </div>
        '''))
        return full_path
    
    print("✅ Colab setup complete! Ready to explore genetic genealogy.")
    
else:
    print("🏠 Local environment detected")
    def setup_environment():
        return "class_data", "results"
    def save_results(df, filename, description=""):
        os.makedirs("results", exist_ok=True)
        path = f"results/{filename}"
        df.to_csv(path, index=False)
        return path
    def save_plot(plt, filename, description=""):
        os.makedirs("results", exist_ok=True)
        path = f"results/{filename}"
        plt.savefig(path, dpi=300, bbox_inches='tight')
        plt.show()
        return path

# Set up paths and configure visualization
DATA_DIR, RESULTS_DIR = setup_environment()
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [None]:
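# Setup Bonsai module paths
def is_jupyterlite():
    '''Check if running in JupyterLite (Pyodide reports sys.platform as "emscripten")'''
    return sys.platform == "emscripten"

if not is_jupyterlite():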
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
import importlib
import inspect
def display_module_classes(module_name):
    """Display classes and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all classes
        classes = inspect.getmembers(module, inspect.isclass)
        
        # Filter classes defined in this module (not imported)
        classes = [(name, cls) for name, cls in classes if cls.__module__ == module_name]
        
        # Print info for each class
        for name, cls in classes:
            print(f"\n## {name}")
            
            # Get docstring
            doc = inspect.getdoc(cls)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
            
            # Get methods
            methods = inspect.getmembers(cls, inspect.isfunction)
            if methods:
                print("\nMethods:")
                for method_name, method in methods:
                    if not method_name.startswith('_'):  # Skip private methods
                        print(f"- {method_name}")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def display_module_functions(module_name):
    """Display functions and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all functions
        functions = inspect.getmembers(module, inspect.isfunction)
        
        # Filter functions defined in this module (not imported)
        functions = [(name, func) for name, func in functions if func.__module__ == module_name]
        
        # Print info for each function
        for name, func in functions:
            if name.startswith('_'):  # Skip private functions
                continue
                
            print(f"\n## {name}")
            
            # Get signature
            sig = inspect.signature(func)
            print(f"Signature: {name}{sig}")
            
            # Get docstring
            doc = inspect.getdoc(func)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def view_source(obj):
    """Display the source code of an object (function or class)"""
    try:
        source = inspect.getsource(obj)
        display(Markdown(f"```python\n{source}\n```"))
    except Exception as e:
        print(f"Error retrieving source: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [None]:
try:
    from utils.bonsaitree.bonsaitree import v3
    print("✅ Successfully imported Bonsai v3 module")
except ImportError as e:
    print(f"❌ Failed to import Bonsai v3 module: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")
    print("Make sure you've properly set up your environment with the Bonsai repository.")

## Lab 12: Relationship Assessment and Validation

In genetic genealogy, assessing and validating potential relationships between individuals is critical. This process involves:

1. Evaluating if a proposed relationship is consistent with observed genetic sharing
2. Comparing multiple relationship hypotheses to determine the most likely one
3. Validating that a relationship satisfies biological constraints (age, sex, etc.)

The `connections.py` module in Bonsai v3 contains the core functions for relationship assessment and validation. Let's explore these functions and see how they work in practice.
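
The code in this lab encodes a relationship as a tuple `(up, down, num_ancs)`: the number of generations from individual 1 up to the common ancestor(s), the number of generations down to individual 2, and the number of common ancestors. As a rough sketch of what "consistent with observed genetic sharing" means, the textbook coefficient of relatedness gives the expected fraction of autosomal DNA shared as `num_ancs * (1/2) ** (up + down)`. The helper name `expected_relatedness` is hypothetical and is not part of Bonsai.

```python
def expected_relatedness(rel_tuple):
    """Expected fraction of autosomal DNA shared for an (up, down, num_ancs) tuple."""
    if rel_tuple is None:
        return 0.0  # no relationship
    up, down, num_ancs = rel_tuple
    if up + down == 0:
        return 1.0  # self
    # Each common ancestor contributes one path of up + down meioses
    return num_ancs * 0.5 ** (up + down)

examples = {
    "Parent-Child": (0, 1, 1),
    "Full Siblings": (1, 1, 2),
    "Half Siblings": (1, 1, 1),
    "Grandparent-Grandchild": (0, 2, 1),
    "Cousins, both grandparents shared": (2, 2, 2),
    "Cousins, one grandparent shared": (2, 2, 1),
}

for name, rel in examples.items():
    print(f"{name:36s} {rel}  r = {expected_relatedness(rel):.4f}")
```

Note that `num_ancs` matters at every degree: cousins who share both grandparents are expected to share twice as much DNA as cousins who share only one.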

## Part 1: Understanding Relationship Assessment

Let's start by examining the core functions in the `connections.py` module that assess relationships between individuals.

In [None]:
# Let's explore the connections module in Bonsai v3
if not is_jupyterlite():
    print("Looking at functions in utils.bonsaitree.bonsaitree.v3.connections:")
    display_module_functions("utils.bonsaitree.bonsaitree.v3.connections")
else:
    print("Cannot access the Bonsai v3 codebase directly in JupyterLite environment.")

In [ ]:
# Import key relationship assessment functions for examination.
# These imports require the Bonsai codebase, so they are skipped in JupyterLite.
if not is_jupyterlite():
    from utils.bonsaitree.bonsaitree.v3.connections import (
        is_valid_relationship as actual_is_valid_relationship,
        passes_age_check as actual_passes_age_check,
        assess_connections as actual_assess_connections,
        get_connection_log_like,
        find_closest_pedigrees,
        infer_anc_id_age
    )

    # Let's examine these functions
    print("Actual Bonsai v3 relationship validation and assessment functions:")
    print("\n1. is_valid_relationship function:")
    view_source(actual_is_valid_relationship)

    print("\n2. passes_age_check function:")
    view_source(actual_passes_age_check)

    print("\n3. assess_connections function:")
    view_source(actual_assess_connections)

    # Now import the functions for actual use
    from utils.bonsaitree.bonsaitree.v3.connections import (
        is_valid_relationship,
        passes_age_check,
        assess_connections
    )
else:
    # For JupyterLite compatibility, provide simplified implementations
    def is_valid_relationship(rel_tuple, sex1, sex2, age1, age2, min_age_of_fertility=16, max_age_of_fertility=50):
        """Check if a relationship is biologically valid based on sex and age.
        
        Args:
            rel_tuple: (up, down, num_ancs) tuple representing the relationship
            sex1: Sex of individual 1 ('M', 'F', or None)
            sex2: Sex of individual 2 ('M', 'F', or None)
            age1: Age of individual 1 (in years) or None
            age2: Age of individual 2 (in years) or None
            min_age_of_fertility: Minimum age for having children
            max_age_of_fertility: Maximum age for having children
            
        Returns:
            is_valid: True if the relationship is biologically valid
        """
        if rel_tuple is None:
            return True  # No relationship to validate
        
        up, down, num_ancs = rel_tuple
        
        # Parent-child relationships have specific sex requirements for biological relationships
        if up == 1 and down == 0:  # Individual 1 is child of individual 2
            if sex2 is not None and sex2 == 'M' and num_ancs == 2:
                return False  # A male can't be a full biological parent (needs a female)
            if sex2 is not None and sex2 == 'F' and num_ancs == 2:
                return False  # A female can't be a full biological parent (needs a male)
        elif up == 0 and down == 1:  # Individual 1 is parent of individual 2
            if sex1 is not None and sex1 == 'M' and num_ancs == 2:
                return False  # A male can't be a full biological parent (needs a female)
            if sex1 is not None and sex1 == 'F' and num_ancs == 2:
                return False  # A female can't be a full biological parent (needs a male)
        
        # Check ages for parent-child relationships
        if up == 1 and down == 0 and age1 is not None and age2 is not None:
            return age2 - age1 >= min_age_of_fertility  # Parent should be older by at least min_age_of_fertility
        elif up == 0 and down == 1 and age1 is not None and age2 is not None:
            return age1 - age2 >= min_age_of_fertility  # Parent should be older by at least min_age_of_fertility
        
        return True  # Default to valid if we don't have specific checks
    
    def passes_age_check(rel_tuple, age1, age2, min_age_of_fertility=16, max_age_of_fertility=50):
        """Check if a relationship passes the age constraints."""
        if rel_tuple is None or age1 is None or age2 is None:
            return True  # Can't validate without all information
        
        up, down, num_ancs = rel_tuple
        
        # For parent-child relationships
        if up == 1 and down == 0:  # Individual 1 is child of individual 2
            return age2 - age1 >= min_age_of_fertility and age2 - age1 <= max_age_of_fertility
        elif up == 0 and down == 1:  # Individual 1 is parent of individual 2
            return age1 - age2 >= min_age_of_fertility and age1 - age2 <= max_age_of_fertility
        
        # For other relationships, we'd need a more complex model
        # This is simplified for JupyterLite
        return True
    
    def assess_connections(rel_tuple, ibd_df, demography=None, sex1=None, sex2=None, age1=None, age2=None):
        """Assess whether a relationship is consistent with observed IBD."""
        # This is a simplified implementation for JupyterLite
        if rel_tuple is None:
            return 0.0  # No relationship
        
        # Check if the relationship is biologically valid
        if not is_valid_relationship(rel_tuple, sex1, sex2, age1, age2):
            return 0.0  # Invalid relationship
        
        # In a real implementation, this would compute a likelihood based on IBD
        # Here we'll just return a simple score based on the relationship type
        up, down, num_ancs = rel_tuple
        
        # Simplified assessment - assign scores based on relationship degree
        degree = up + down
        if degree == 0:  # Self
            return 1.0
        elif degree == 1:  # Parent-child
            return 0.9
        elif degree == 2 and num_ancs == 2:  # Full siblings
            return 0.8
        elif degree == 2 and num_ancs == 1:  # Half siblings/grandparents
            return 0.7
        elif degree == 3:  # Aunt/uncle-niece/nephew, great-grandparents, etc.
            return 0.6
        elif degree == 4:  # First cousins, etc.
            return 0.5
        elif degree == 5:  # First cousins once removed, etc.
            return 0.4
        elif degree == 6:  # Second cousins, etc.
            return 0.3
        elif degree > 6:  # Distant relatives
            return 0.2
        else:
            return 0.1  # Fallback

### 1.1 Examining Core Assessment Functions

Let's examine the source code of the key functions for relationship assessment:

In [None]:
# View the source code of is_valid_relationship (if not in JupyterLite)
if not is_jupyterlite():
    print("Source code for is_valid_relationship:")
    view_source(is_valid_relationship)
else:
    print("Using simplified is_valid_relationship in JupyterLite environment")

In [None]:
# View the source code of passes_age_check (if not in JupyterLite)
if not is_jupyterlite():
    print("Source code for passes_age_check:")
    view_source(passes_age_check)
else:
    print("Using simplified passes_age_check in JupyterLite environment")

In [None]:
# Define a function to visualize pedigrees
def visualize_pedigree(up_node_dict, title="Pedigree", highlight_nodes=None, individual_metadata=None):
    """Visualize a pedigree from an up_node_dict using networkx.
    
    Args:
        up_node_dict: Dictionary mapping individuals to their parents
        title: Title for the visualization
        highlight_nodes: Set of nodes to highlight
        individual_metadata: Dictionary mapping individuals to their metadata (age, sex, etc.)
    """
    # Create a directed graph (edges point from child to parent)
    G = nx.DiGraph()
    
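    # Collect all IDs (from keys and values) and add them as nodes up front,
    # so the graph's node order matches the order used for the color map below
    all_ids = set(up_node_dict.keys())
    for parents in up_node_dict.values():
        all_ids.update(parents.keys())
    all_ids = list(all_ids)
    G.add_nodes_from(all_ids)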
    
    # Create node labels
    node_labels = {}
    for node_id in all_ids:
        label = str(node_id)
        if individual_metadata and node_id in individual_metadata:
            metadata = individual_metadata[node_id]
            if 'sex' in metadata and metadata['sex']:
                label += f" ({metadata['sex']})"
            if 'age' in metadata and metadata['age'] is not None:
                label += f"\nAge: {metadata['age']}"
        node_labels[node_id] = label
    
    # Create a color map - blue for males, pink for females, gray for unknown
    highlight_nodes = highlight_nodes or set()
    color_map = []
    for node_id in all_ids:
        if node_id in highlight_nodes:
            color_map.append('red')
        elif individual_metadata and node_id in individual_metadata and 'sex' in individual_metadata[node_id]:
            if individual_metadata[node_id]['sex'] == 'M':
                color_map.append('lightblue')
            elif individual_metadata[node_id]['sex'] == 'F':
                color_map.append('pink')
            else:
                color_map.append('lightgray')
        else:
            color_map.append('lightgray')
    
    # Add edges (from child to parent)
    edges = []
    for child, parents in up_node_dict.items():
        for parent in parents:
            edges.append((child, parent))
    
    G.add_edges_from(edges)
    
    # Create plot
    plt.figure(figsize=(10, 6))
    plt.title(title)
    
    # Layout: spring_layout does not order nodes by generation; a fixed seed keeps it reproducible
    pos = nx.spring_layout(G, seed=42)
    
    # Draw nodes
    nx.draw(G, pos, with_labels=True, labels=node_labels, node_color=color_map, 
            node_size=800, font_weight='bold')
    
    # Draw edges
    nx.draw_networkx_edges(G, pos, width=1.0, alpha=0.5, arrows=True)
    
    plt.tight_layout()
    plt.show()

### 1.2 Validating Relationships based on Sex and Age
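
The simplified age check used in this lab only constrains direct parent-child pairs. One way to extend the same idea to other relationships, offered here as a sketch rather than as Bonsai's method, is to bound the age difference implied by `up` and `down` parent-child links. The function names `age_gap_bounds` and `plausible_age_gap` are hypothetical, the 16 and 50 year limits mirror the defaults of the simplified functions above, and ages are assumed to be measured at the same point in time.

```python
def age_gap_bounds(up, down, min_age=16, max_age=50):
    """Bounds on (age1 - age2) implied by `up` and `down` parent-child links."""
    # The common ancestor is older than individual 1 by up*min_age .. up*max_age
    # and older than individual 2 by down*min_age .. down*max_age.
    lowest = down * min_age - up * max_age
    highest = down * max_age - up * min_age
    return lowest, highest

def plausible_age_gap(rel_tuple, age1, age2, min_age=16, max_age=50):
    """Return True if the age difference is compatible with the relationship."""
    if rel_tuple is None or age1 is None or age2 is None:
        return True  # nothing to check
    up, down, _ = rel_tuple
    lowest, highest = age_gap_bounds(up, down, min_age, max_age)
    return lowest <= age1 - age2 <= highest

print(age_gap_bounds(0, 1))                       # parent of individual 2
print(plausible_age_gap((0, 1, 1), 35, 15))       # mother 35, daughter 15
print(plausible_age_gap((1, 0, 1), 20, 30))       # "parent" only 10 years older
print(plausible_age_gap((0, 2, 1), 70, 20))       # grandmother 70, grandson 20
```

For siblings `(1, 1, *)` the bounds are symmetric around zero, which matches the intuition that siblings cannot be born further apart than one parent's fertile window.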

Let's demonstrate how to use the `is_valid_relationship` function to validate relationships based on sex and age constraints:

In [None]:
# Define a function to test various relationships
def test_relationship_validity(rel_tuple, sex1, sex2, age1, age2):
    """Test if a relationship is valid given sex and age constraints."""
    # Convert the relationship tuple to a name
    relationship_names = {
        (0, 0, 2): "Self",
        (1, 0, 1): "Child-Parent",
        (0, 1, 1): "Parent-Child",
        (1, 1, 2): "Full Siblings",
        (1, 1, 1): "Half Siblings",
        (2, 0, 1): "Grandchild-Grandparent",
        (0, 2, 1): "Grandparent-Grandchild",
        (2, 1, 1): "Niece/Nephew-Aunt/Uncle",
        (1, 2, 1): "Aunt/Uncle-Niece/Nephew",
        (2, 2, 1): "First Cousins"
    }
    rel_name = relationship_names.get(rel_tuple, f"Unknown: {rel_tuple}")
    
    # Check if the relationship is valid
    is_valid = is_valid_relationship(rel_tuple, sex1, sex2, age1, age2)
    
    # Check if the relationship passes the age check
    passes_age = passes_age_check(rel_tuple, age1, age2)
    
    return {
        "Relationship": rel_name,
        "Individual 1 Sex": sex1,
        "Individual 2 Sex": sex2,
        "Individual 1 Age": age1,
        "Individual 2 Age": age2,
        "Is Valid": is_valid,
        "Passes Age Check": passes_age
    }

# Test various relationship scenarios
test_cases = [
    # Valid parent-child: Mother (35) - Daughter (15)
    ((0, 1, 1), "F", "F", 35, 15),
    
    # Invalid parent-child: Son (20) - Father (30) [Age difference too small]
    ((1, 0, 1), "M", "M", 20, 30),
    
    # Valid full siblings: Brother (25) - Brother (27)
    ((1, 1, 2), "M", "M", 25, 27),
    
    # Valid grandparent-grandchild: Grandmother (70) - Grandson (20)
    ((0, 2, 1), "F", "M", 70, 20),
    
    # Invalid parent-child: Male (40) claiming to be full biological mother
    ((0, 1, 2), "M", "F", 40, 10),
    
    # Valid aunt-niece: Aunt (45) - Niece (20)
    ((1, 2, 1), "F", "F", 45, 20),
    
    # Valid first cousins: Male (30) - Female (28)
    ((2, 2, 1), "M", "F", 30, 28)
]

# Run the tests
test_results = [test_relationship_validity(*args) for args in test_cases]

# Display the results as a DataFrame
results_df = pd.DataFrame(test_results)
display(results_df)

Let's visualize a pedigree with some of these relationships to better understand the validation process:

In [None]:
# Create a pedigree with various relationships
sample_pedigree = {
    # Generation 3 (children/grandchildren)
    7: {5: 1, 6: 1},  # Child of 5 and 6
    8: {5: 1, 6: 1},  # Child of 5 and 6 (sibling of 7)
    9: {4: 1, -1: 1}, # Child of 4 and -1
    
    # Generation 2 (parents/aunts/uncles)
    4: {1: 1, 2: 1},  # Child of 1 and 2 (sibling of 5)
    5: {1: 1, 2: 1},  # Child of 1 and 2 (sibling of 4)
    6: {3: 1, -2: 1}, # Child of 3 and -2
    -1: {},           # Ungenotyped individual
    -2: {},           # Ungenotyped individual
    
    # Generation 1 (grandparents)
    1: {},            # Founder
    2: {},            # Founder
    3: {}             # Founder
}

# Create metadata for each individual
individual_metadata = {
    # Generation 3
    7: {"sex": "M", "age": 15},  # Male, 15 years old
    8: {"sex": "F", "age": 12},  # Female, 12 years old
    9: {"sex": "M", "age": 17},  # Male, 17 years old
    
    # Generation 2
    4: {"sex": "M", "age": 40},  # Male, 40 years old
    5: {"sex": "F", "age": 38},  # Female, 38 years old
    6: {"sex": "M", "age": 41},  # Male, 41 years old
    -1: {"sex": "F", "age": 39}, # Female, 39 years old
    -2: {"sex": "F", "age": 65}, # Female, 65 years old (24 years older than child 6)
    
    # Generation 1
    1: {"sex": "M", "age": 65},  # Male, 65 years old
    2: {"sex": "F", "age": 63},  # Female, 63 years old
    3: {"sex": "M", "age": 70}   # Male, 70 years old
}

# Visualize the pedigree with metadata
visualize_pedigree(sample_pedigree, title="Sample Pedigree with Age and Sex Information", individual_metadata=individual_metadata)

In [None]:
# Let's analyze the relationships in this pedigree
if not is_jupyterlite():
    from utils.bonsaitree.bonsaitree.v3.pedigrees import get_simple_rel_tuple
else:
    # For JupyterLite compatibility, provide a simplified fallback
    def get_simple_rel_tuple(up_node_dict, i, j):
        """Get relationship tuple (up, down, num_ancs) between individuals i and j."""
        if i == j:
            return (0, 0, 2)
        
        # Simple implementation for JupyterLite - this would be more complex in reality
        if j in up_node_dict.get(i, {}):
            return (1, 0, 1)  # i is child of j
        elif i in up_node_dict.get(j, {}):
            return (0, 1, 1)  # i is parent of j
        
        # Check for siblings/cousins (simplified)
        i_parents = set(up_node_dict.get(i, {}).keys())
        j_parents = set(up_node_dict.get(j, {}).keys())
        common_parents = i_parents.intersection(j_parents)
        
        if common_parents:
            if len(common_parents) == 2:
                return (1, 1, 2)  # Full siblings
            else:
                return (1, 1, 1)  # Half siblings
        
        # Default - no relationship found
        return None

# Select pairs of individuals to analyze
individual_pairs = [
    (7, 8),   # Siblings
    (7, 5),   # Child-Parent
    (7, 9),   # Cousins
    (5, 6),   # Partners
    (4, 5),   # Siblings
    (7, 1),   # Grandchild-Grandparent
    (6, -2)   # Child-Parent (mother is 24 years older than child)
]

# Analyze each pair
relationship_analysis = []
for i, j in individual_pairs:
    # Get the relationship tuple
    rel_tuple = get_simple_rel_tuple(sample_pedigree, i, j)
    
    # Convert to a name
    relationship_names = {
        (0, 0, 2): "Self",
        (1, 0, 1): "Child-Parent",
        (0, 1, 1): "Parent-Child",
        (1, 1, 2): "Full Siblings",
        (1, 1, 1): "Half Siblings",
        (2, 0, 1): "Grandchild-Grandparent",
        (0, 2, 1): "Grandparent-Grandchild",
        (2, 1, 1): "Niece/Nephew-Aunt/Uncle",
        (1, 2, 1): "Aunt/Uncle-Niece/Nephew",
        (2, 2, 1): "First Cousins"
    }
    rel_name = relationship_names.get(rel_tuple, f"Unknown: {rel_tuple}")
    
    # Get metadata for validation
    sex1 = individual_metadata[i]["sex"] if i in individual_metadata else None
    sex2 = individual_metadata[j]["sex"] if j in individual_metadata else None
    age1 = individual_metadata[i]["age"] if i in individual_metadata else None
    age2 = individual_metadata[j]["age"] if j in individual_metadata else None
    
    # Check validity
    is_valid = is_valid_relationship(rel_tuple, sex1, sex2, age1, age2)
    passes_age = passes_age_check(rel_tuple, age1, age2)
    
    relationship_analysis.append({
        "Individual 1": i,
        "Individual 2": j,
        "Individual 1 Info": f"{sex1}, {age1} years" if sex1 and age1 else "Unknown",
        "Individual 2 Info": f"{sex2}, {age2} years" if sex2 and age2 else "Unknown",
        "Relationship": rel_name,
        "Is Valid": is_valid,
        "Passes Age Check": passes_age
    })

# Display the analysis
analysis_df = pd.DataFrame(relationship_analysis)
display(analysis_df)

### 1.3 Visualizing Problematic Relationships

Let's highlight the relationships in our pedigree that fail validation:

In [None]:
# Find invalid relationships
invalid_relationships = analysis_df[~analysis_df["Is Valid"]]
invalid_pairs = list(zip(invalid_relationships["Individual 1"], invalid_relationships["Individual 2"]))

# Extract the nodes involved in invalid relationships
problematic_nodes = set()
for i, j in invalid_pairs:
    problematic_nodes.add(i)
    problematic_nodes.add(j)

# Visualize the pedigree with problematic nodes highlighted
if problematic_nodes:
    print(f"Found {len(problematic_nodes)} individuals involved in biologically implausible relationships:")
    print(", ".join(str(node) for node in problematic_nodes))
    visualize_pedigree(sample_pedigree, title="Pedigree with Problematic Relationships Highlighted",
                      highlight_nodes=problematic_nodes, individual_metadata=individual_metadata)
else:
    print("No problematic relationships found in the pedigree.")

## Part 2: Assessing Relationships Based on IBD Data

Now let's explore how Bonsai uses IBD (Identity by Descent) segments to assess relationships. IBD segments are stretches of DNA that are identical between two individuals due to inheritance from a common ancestor.
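
Before scoring a relationship, it often helps to reduce a pair's segments to a few summary statistics: the number of segments, their total length, and the longest segment. The sketch below assumes each segment is a dictionary with `start_cm` and `end_cm` fields (the same field names used by the simulated data later in this lab); `summarize_ibd` is a hypothetical helper, not a Bonsai function, and the toy segments are made up for illustration.

```python
def summarize_ibd(segments):
    """Summarize a list of IBD segments given as dicts with start_cm and end_cm."""
    lengths = [seg["end_cm"] - seg["start_cm"] for seg in segments]
    return {
        "num_segments": len(lengths),
        "total_cm": sum(lengths),
        "longest_cm": max(lengths, default=0.0),  # 0.0 when no segments are shared
    }

# Made-up segments for illustration only
toy_segments = [
    {"chromosome": 1, "start_cm": 10.0, "end_cm": 45.0},
    {"chromosome": 7, "start_cm": 80.0, "end_cm": 92.5},
    {"chromosome": 15, "start_cm": 5.0, "end_cm": 30.0},
]

summary = summarize_ibd(toy_segments)
print(summary)
```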

### 2.1 Understanding the assess_connections Function

Let's examine the core function for assessing relationships based on IBD data:

In [ ]:
# Let's examine the source code of the actual Bonsai v3 function for assessing relationships based on IBD data
if not is_jupyterlite():
    print("Source code for get_connection_log_like:")
    view_source(get_connection_log_like)
    
    print("\nSource code for assess_connections:")
    view_source(actual_assess_connections)
else:
    print("Using simplified assess_connections in JupyterLite environment")

The `assess_connections` function takes a relationship tuple and IBD data, and returns a score indicating how well the relationship explains the observed IBD sharing. This score is based on comparing the observed IBD sharing to what would be expected for the given relationship.
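
To make the idea of "comparing observed to expected sharing" concrete, here is a toy scoring sketch. It is not Bonsai's likelihood: it assumes a 3400 cM autosomal genome, an expected total of `2 * r * 3400` cM for coefficient of relatedness `r`, and a Gaussian spread of 20% around that expectation. All names (`expected_total_cm`, `toy_log_like`) and the observed value are hypothetical. Full siblings are left out because their sharing on both chromosome copies does not fit this simple formula.

```python
import math

GENOME_CM = 3400.0  # assumed autosomal genome length

def expected_total_cm(rel_tuple):
    """Expected total shared cM under the simple 2 * r * genome assumption."""
    up, down, num_ancs = rel_tuple
    if up + down == 0:
        return GENOME_CM  # self
    r = num_ancs * 0.5 ** (up + down)
    return min(GENOME_CM, 2 * r * GENOME_CM)

def toy_log_like(rel_tuple, observed_cm, rel_sd=0.2):
    """Gaussian log-likelihood of the observed total cM for one hypothesis."""
    mu = expected_total_cm(rel_tuple)
    sd = rel_sd * mu  # assumed spread: 20% of the expected value
    z = (observed_cm - mu) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2 * math.pi))

hypotheses = {
    "Parent-Child": (0, 1, 1),
    "Half Siblings": (1, 1, 1),
    "Full first cousins": (2, 2, 2),
    "Full second cousins": (3, 3, 2),
}

observed_cm = 900.0  # made-up observation
scores = {name: toy_log_like(rel, observed_cm) for name, rel in hypotheses.items()}
best = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} log-likelihood = {score:8.2f}")
print("Best hypothesis:", best)
```

The hypothesis whose expected total lies closest to the observation, relative to its assumed spread, receives the highest score.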

Let's create some simulated IBD data to demonstrate how this works:

In [None]:
# Simulate IBD segments between pairs of individuals
import random

def simulate_ibd_segments(rel_tuple, num_segments=10, noise_level=0.1):
    """Simulate IBD segments for a given relationship tuple.
    
    Args:
        rel_tuple: (up, down, num_ancs) tuple representing the relationship
        num_segments: Number of segments to simulate
        noise_level: Level of noise to add to segment lengths
        
    Returns:
        segments: List of simulated IBD segments
    """
    if rel_tuple is None:
        return []  # No relationship, no IBD
    
    up, down, num_ancs = rel_tuple
    degree = up + down
    
    # Different relationships have different expected amounts of IBD
    if degree == 0:  # Self
        expected_total_cm = 3400  # Entire genome
        avg_segment_cm = 340
    elif degree == 1:  # Parent-child
        expected_total_cm = 3400 / 2  # Half the genome
        avg_segment_cm = 170
    elif degree == 2 and num_ancs == 2:  # Full siblings
        expected_total_cm = 2550  # ~75% of the genome
        avg_segment_cm = 85
    elif degree == 2 and num_ancs == 1:  # Half siblings/grandparents
        expected_total_cm = 1700  # ~50% of the genome
        avg_segment_cm = 42.5
    elif degree == 3:  # Aunt/uncle-niece/nephew, great-grandparents
        expected_total_cm = 850  # ~25% of the genome
        avg_segment_cm = 21.25
    elif degree == 4:  # First cousins
        expected_total_cm = 425  # ~12.5% of the genome
        avg_segment_cm = 10.6
    elif degree == 5:  # First cousins once removed
        expected_total_cm = 212.5  # ~6.25% of the genome
        avg_segment_cm = 5.3
    elif degree == 6:  # Second cousins
        expected_total_cm = 106.25  # ~3.125% of the genome
        avg_segment_cm = 5.3 / 2
    else:  # More distant
        expected_total_cm = 53.125  # ~1.5625% of the genome
        avg_segment_cm = 5.3 / 4
    
    # Generate simulated segments
    segments = []
    chromosomes = list(range(1, 23))  # Chromosomes 1-22
    
    for _ in range(num_segments):
        # Select a random chromosome
        chromosome = random.choice(chromosomes)
        
        # Approximate chromosome length in cM (chromosome 1 is the longest, 22 among the shortest)
        max_pos = 290 - 10 * chromosome

        # Generate a segment length with some noise, capped at the chromosome length
        segment_cm = min(avg_segment_cm * (1 + noise_level * (random.random() - 0.5)), max_pos)

        # Generate random start and end positions (in genetic distance)
        start_cm = random.uniform(0, max_pos - segment_cm)
        end_cm = start_cm + segment_cm
        
        segments.append({
            "chromosome": chromosome,
            "start_cm": start_cm,
            "end_cm": end_cm,
            "length_cm": segment_cm
        })
    
    return segments

# Define relationships to simulate
relationships_to_simulate = [
    ((0, 0, 2), "Self"),
    ((0, 1, 1), "Parent-Child"),
    ((1, 1, 2), "Full Siblings"),
    ((1, 1, 1), "Half Siblings"),
    ((0, 2, 1), "Grandparent-Grandchild"),
    ((2, 2, 1), "First Cousins"),
    ((2, 3, 1), "First Cousins Once Removed"),
    ((3, 3, 1), "Second Cousins")
]

# Simulate IBD for each relationship
simulated_ibd = {}
for rel_tuple, rel_name in relationships_to_simulate:
    segments = simulate_ibd_segments(rel_tuple)
    simulated_ibd[rel_name] = segments
    total_cm = sum(seg["length_cm"] for seg in segments)
    num_segments = len(segments)
    print(f"{rel_name}: {num_segments} segments, total {total_cm:.2f} cM")

In [None]:
# Visualize the simulated IBD segments for different relationships
plt.figure(figsize=(12, 8))

# Set up the plot
relationships = list(simulated_ibd.keys())
y_positions = range(len(relationships))
plt.yticks(y_positions, relationships)
plt.xlabel('Chromosome Position (cM)')
plt.title('Simulated IBD Segments by Relationship Type')

# Plot segments for each relationship
for i, rel_name in enumerate(relationships):
    segments = simulated_ibd[rel_name]
    for seg in segments:
        chrom = seg['chromosome']
        start = seg['start_cm']
        end = seg['end_cm']
        # Offset each chromosome to display them side by side
        offset = chrom * 300
        plt.plot([offset + start, offset + end], [i, i], linewidth=5, alpha=0.7)

# Add chromosome labels
chrom_positions = [chrom * 300 + 150 for chrom in range(1, 23)]
plt.xticks(chrom_positions[::2], [str(c) for c in range(1, 23)][::2])

plt.grid(axis='x', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()

### 2.2 Using IBD Data to Assess Relationships

Now let's demonstrate how to use the `assess_connections` function to evaluate how well a relationship explains the observed IBD data:

In [None]:
# Convert the simulated data to the format expected by assess_connections
def convert_segments_to_dataframe(segments):
    """Convert a list of segment dictionaries to a DataFrame."""
    return pd.DataFrame(segments)

# Assess each relationship against each set of IBD data
assessment_results = []

for actual_rel_tuple, actual_rel_name in relationships_to_simulate:
    # Get the IBD data for this relationship
    ibd_df = convert_segments_to_dataframe(simulated_ibd[actual_rel_name])
    
    # Test against each possible relationship
    for test_rel_tuple, test_rel_name in relationships_to_simulate:
        # Assess how well the test relationship explains the actual IBD data
        score = assess_connections(test_rel_tuple, ibd_df)
        
        assessment_results.append({
            "Actual Relationship": actual_rel_name,
            "Test Relationship": test_rel_name,
            "Score": score
        })

# Convert to DataFrame
assessment_df = pd.DataFrame(assessment_results)

# Reshape for a heatmap
assessment_pivot = assessment_df.pivot(index="Actual Relationship", columns="Test Relationship", values="Score")

# Plot a heatmap of scores
plt.figure(figsize=(12, 10))
sns.heatmap(assessment_pivot, annot=True, cmap="YlGnBu", fmt=".2f")
plt.title("Relationship Assessment Scores: Actual vs. Test Relationships")
plt.tight_layout()
plt.show()

In this heatmap, higher scores (darker colors) indicate a better match between the test relationship and the observed IBD data. The diagonal shows how well each relationship explains its own simulated data, so those values should be high. Off-diagonal elements show how easily one relationship can be mistaken for another based on IBD data alone.
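
One way to turn this visual reading into a number is to compute, for each row, the gap between the diagonal score and the best off-diagonal score. The following is a minimal sketch under stated assumptions: `diagonal_margins` is a hypothetical helper, and the scores in `toy_scores` are made up for illustration rather than produced by `assess_connections`.

```python
def diagonal_margins(score_table):
    """Return {actual: diagonal score minus best off-diagonal score}.

    score_table maps actual relationship -> {test relationship: score}.
    A negative margin means another relationship outscored the true one.
    """
    margins = {}
    for actual, row in score_table.items():
        others = [score for test, score in row.items() if test != actual]
        if actual not in row or not others:
            continue  # nothing to compare against
        margins[actual] = row[actual] - max(others)
    return margins


# Made-up scores, for illustration only
toy_scores = {
    "Parent-Child": {"Parent-Child": 0.95, "Full Siblings": 0.90, "Half Siblings": 0.40},
    "Full Siblings": {"Parent-Child": 0.88, "Full Siblings": 0.93, "Half Siblings": 0.45},
    "Half Siblings": {"Parent-Child": 0.35, "Full Siblings": 0.50, "Half Siblings": 0.48},
}

margins = diagonal_margins(toy_scores)
for name, margin in margins.items():
    status = "separable" if margin > 0 else "confusable"
    print(f"{name}: margin {margin:+.2f} ({status})")
```

With the notebook's data, the same function could be fed `assessment_pivot.to_dict(orient="index")`. Small or negative margins flag relationship pairs that total IBD alone cannot distinguish.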

Let's extract the best-scoring relationship for each set of IBD data:

In [None]:
# Find the best-scoring relationship for each set of IBD data
best_relationships = []

for actual_rel_name in assessment_pivot.index:
    # Get scores for this actual relationship
    scores = assessment_pivot.loc[actual_rel_name]
    
    # Find the best-scoring test relationship
    best_rel = scores.idxmax()
    best_score = scores.max()
    
    best_relationships.append({
        "Actual Relationship": actual_rel_name,
        "Best Match": best_rel,
        "Score": best_score,
        "Correct": actual_rel_name == best_rel
    })

# Display the results
best_rel_df = pd.DataFrame(best_relationships)
display(best_rel_df)

In [None]:
# Examine the actual Bonsai v3 functions that handle pedigree building and assessment.
# The import lives inside the guard so this cell also runs in JupyterLite.
if not is_jupyterlite():
    from utils.bonsaitree.bonsaitree.v3.connections import (
        combine_pedigrees as actual_combine_pedigrees,
        combine_up_dicts as actual_combine_up_dicts,
        get_sharing_ids as actual_get_sharing_ids
    )

    print("Source code for combine_pedigrees function:")
    view_source(actual_combine_pedigrees)
    
    print("\nSource code for combine_up_dicts function:")
    view_source(actual_combine_up_dicts)
    
    print("\nSource code for get_sharing_ids function:")
    view_source(actual_get_sharing_ids)
else:
    print("Cannot display actual Bonsai v3 functions in JupyterLite environment.")

### 3.1 Understanding Bonsai v3's Pedigree Building Workflow

The actual Bonsai v3 library integrates relationship assessment into its pedigree building workflow through several steps:

1. **Relationship Validation**: Functions like `is_valid_relationship` and `passes_age_check` ensure biological plausibility
2. **Connection Assessment**: `assess_connections` evaluates how well relationships explain observed IBD data
3. **Pedigree Combination**: Higher-level functions like `combine_pedigrees` and `combine_up_dicts` merge relationship fragments
4. **IBD Analysis**: Functions like `get_sharing_ids` identify which individuals share genetic segments between pedigrees
5. **Likelihood Evaluation**: The `get_connection_log_like` function computes the composite likelihood of connecting pedigrees

In the real Bonsai v3 implementation, these functions use sophisticated statistical models and optimizations not shown in our simplified versions. For this lab, we'll use simplified implementations that provide the essential functionality while being compatible with JupyterLite.
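
To make the validation step concrete, here is a rough sketch of what an age-based plausibility check might look like. It is not the Bonsai v3 implementation of `passes_age_check`: the function name `age_gap_plausible`, its signature, and the `min_parent_age` threshold are all assumptions chosen for illustration. The tuple is read as `(up, down, num_ancs)`, with `(0, 1, 1)` meaning individual 1 is the parent of individual 2.

```python
def age_gap_plausible(rel_tuple, age1, age2, min_parent_age=12):
    """Hypothetical age check for a direct-line relationship.

    Assumptions: rel_tuple is (up, down, num_ancs); up == 0 means
    individual 1 is an ancestor of individual 2; each generation needs
    at least `min_parent_age` years. Collateral relationships
    (siblings, cousins) are not constrained in this sketch.
    """
    if age1 is None or age2 is None:
        return True  # no age information, nothing to reject
    up, down, _ = rel_tuple
    if up == 0 and down > 0:
        # Individual 1 is the ancestor, so they must be older
        return age1 - age2 >= down * min_parent_age
    if down == 0 and up > 0:
        # Individual 1 is the descendant, so they must be younger
        return age2 - age1 >= up * min_parent_age
    return True


print(age_gap_plausible((0, 1, 1), 45, 20))  # parent aged 45, child aged 20
print(age_gap_plausible((0, 1, 1), 20, 45))  # direction reversed
print(age_gap_plausible((0, 2, 1), 45, 43))  # grandparent only 2 years older
```

A check like this is cheap, so it makes sense to run it before any likelihood computation and discard impossible candidates early.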

In [None]:
# Look at how Bonsai v3 handles IBD data and sharing between individuals.
# The import lives inside the guard so this cell also runs in JupyterLite.
if not is_jupyterlite():
    from utils.bonsaitree.bonsaitree.v3.ibd import (
        get_id_to_shared_ibd as actual_get_id_to_shared_ibd,
        get_total_ibd_between_id_sets as actual_get_total_ibd_between_id_sets,
        get_closest_pair as actual_get_closest_pair
    )

    print("Source code for get_id_to_shared_ibd function:")
    view_source(actual_get_id_to_shared_ibd)
    
    print("\nSource code for get_total_ibd_between_id_sets function:")
    view_source(actual_get_total_ibd_between_id_sets)
    
    print("\nSource code for get_closest_pair function:")
    view_source(actual_get_closest_pair)
else:
    print("Cannot display actual Bonsai v3 functions in JupyterLite environment.")

In [None]:
# Now, let's create our dataset for this lab
# Define individuals in our dataset
individuals = [
    {"id": 1, "sex": "M", "age": 70},  # Grandfather
    {"id": 2, "sex": "F", "age": 68},  # Grandmother
    {"id": 3, "sex": "M", "age": 45},  # Father
    {"id": 4, "sex": "F", "age": 43},  # Mother
    {"id": 5, "sex": "M", "age": 20},  # Son
    {"id": 6, "sex": "F", "age": 18},  # Daughter
    {"id": 7, "sex": "M", "age": 42},  # Uncle
    {"id": 8, "sex": "F", "age": 19}   # Cousin
]

# Define ground truth relationships
true_relationships = [
    (1, 3, (0, 1, 1)),  # 1 is parent of 3
    (2, 3, (0, 1, 1)),  # 2 is parent of 3
    (1, 7, (0, 1, 1)),  # 1 is parent of 7
    (2, 7, (0, 1, 1)),  # 2 is parent of 7
    (3, 5, (0, 1, 1)),  # 3 is parent of 5
    (4, 5, (0, 1, 1)),  # 4 is parent of 5
    (3, 6, (0, 1, 1)),  # 3 is parent of 6
    (4, 6, (0, 1, 1)),  # 4 is parent of 6
    (7, 8, (0, 1, 1)),  # 7 is parent of 8
    (3, 7, (1, 1, 2)),  # 3 and 7 are siblings
    (5, 6, (1, 1, 2)),  # 5 and 6 are siblings
    (5, 8, (2, 2, 1)),  # 5 and 8 are first cousins
    (6, 8, (2, 2, 1))   # 6 and 8 are first cousins
]

# Create IBD data for each pair
ibd_data = {}
for id1, id2, rel_tuple in true_relationships:
    # Use a sorted key so each pair is stored once, regardless of direction
    pair_key = tuple(sorted((id1, id2)))
    ibd_data[pair_key] = simulate_ibd_segments(rel_tuple)

# Create metadata dictionary
metadata = {ind["id"]: {"sex": ind["sex"], "age": ind["age"]} for ind in individuals}

# Visualize the true pedigree
true_pedigree = {}
for id1, id2, rel_tuple in true_relationships:
    if rel_tuple[0] == 0 and rel_tuple[1] == 1:  # Parent-child
        if id2 not in true_pedigree:
            true_pedigree[id2] = {}
        true_pedigree[id2][id1] = 1

visualize_pedigree(true_pedigree, title="Ground Truth Pedigree", individual_metadata=metadata)

### 3.2 Building a Pedigree from IBD Data Using Bonsai v3's Approach

In Bonsai v3, pedigree building from IBD data follows this general workflow:

1. **Data Preparation**:
   - Identify individuals who share genetic segments
   - Calculate the total amount of genetic sharing between each pair

2. **Relationship Inference**:
   - For each pair of individuals sharing DNA, infer the most likely relationship
   - Filter out biologically implausible relationships using `is_valid_relationship` and `passes_age_check`
   - Score relationship candidates using `assess_connections` based on IBD patterns

3. **Pedigree Construction**:
   - Start with the closest relationships (parent-child, siblings)
   - Iteratively add more distant relationships
   - Resolve conflicts when different relationships are incompatible
   - Use `combine_pedigrees` to join relationship fragments into coherent structures

4. **Optimization**:
   - Evaluate different possible pedigrees using likelihood scoring
   - Select the pedigree configuration that best explains the observed IBD data
   - Use `combine_up_dicts` to construct the final pedigree
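
The data preparation step and the "closest relationships first" ordering can be sketched with a few lines of plain Python. This is an illustration under assumptions, not Bonsai's own code: `rank_pairs_by_total_ibd` is a hypothetical helper, the input mirrors the `{(id1, id2): [segment dicts]}` layout used in this lab, and the segment lengths in `toy_ibd` are made up.

```python
def rank_pairs_by_total_ibd(ibd_data):
    """Return [(pair, total_cm, n_segments)] sorted by total sharing, largest first.

    ibd_data maps (id1, id2) -> list of segment dicts with a "length_cm" key.
    """
    summary = []
    for pair, segments in ibd_data.items():
        total_cm = sum(seg["length_cm"] for seg in segments)
        summary.append((pair, total_cm, len(segments)))
    return sorted(summary, key=lambda item: item[1], reverse=True)


# Made-up segments, for illustration only
toy_ibd = {
    (1, 3): [{"length_cm": 120.0}, {"length_cm": 95.5}],
    (5, 8): [{"length_cm": 40.0}],
    (3, 7): [{"length_cm": 110.0}, {"length_cm": 60.0}, {"length_cm": 30.0}],
}

ranked = rank_pairs_by_total_ibd(toy_ibd)
for pair, total_cm, n_segments in ranked:
    print(f"{pair}: {total_cm:.1f} cM across {n_segments} segments")
```

Processing pairs in this order means the most informative (closest) pairs anchor the pedigree before weaker evidence is added.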

Let's implement a simplified version of this workflow to build a pedigree from our simulated IBD data:

In [None]:
# Define possible relationship types to test
possible_relationships = [
    ((0, 1, 1), "Parent-Child"),
    ((1, 0, 1), "Child-Parent"),
    ((1, 1, 2), "Full Siblings"),
    ((1, 1, 1), "Half Siblings"),
    ((0, 2, 1), "Grandparent-Grandchild"),
    ((2, 0, 1), "Grandchild-Grandparent"),
    ((1, 2, 1), "Aunt/Uncle-Niece/Nephew"),
    ((2, 1, 1), "Niece/Nephew-Aunt/Uncle"),
    ((2, 2, 1), "First Cousins")
]

# For all pairs with IBD data, assess each possible relationship
inferred_relationships = []

for (id1, id2), segments in ibd_data.items():
    # Convert segments to DataFrame
    ibd_df = pd.DataFrame(segments)
    
    # Get demographic information
    sex1 = metadata[id1]["sex"]
    sex2 = metadata[id2]["sex"]
    age1 = metadata[id1]["age"]
    age2 = metadata[id2]["age"]
    
    # Assess each possible relationship
    relationship_scores = []
    for rel_tuple, rel_name in possible_relationships:
        # Check if the relationship is valid based on sex and age
        is_valid = is_valid_relationship(rel_tuple, sex1, sex2, age1, age2)
        
        # Only consider valid relationships
        if is_valid:
            # Assess how well the relationship explains the IBD data
            score = assess_connections(rel_tuple, ibd_df, sex1=sex1, sex2=sex2, age1=age1, age2=age2)
            relationship_scores.append((rel_tuple, rel_name, score))
    
    # Find the best-scoring valid relationship
    if relationship_scores:
        best_rel = max(relationship_scores, key=lambda x: x[2])
        rel_tuple, rel_name, score = best_rel
        
        inferred_relationships.append({
            "Individual 1": id1,
            "Individual 2": id2,
            "Individual 1 Info": f"{sex1}, {age1} years",
            "Individual 2 Info": f"{sex2}, {age2} years",
            "Inferred Relationship": rel_name,
            "Score": score,
            "Relationship Tuple": str(rel_tuple)
        })

# Convert to DataFrame
inferred_df = pd.DataFrame(inferred_relationships).sort_values(by="Score", ascending=False)
display(inferred_df)

Now let's build a pedigree from these inferred relationships, prioritizing the highest-scoring ones:

In [None]:
# Convert relationship tuples from string back to tuples for processing
import ast
inferred_df["Rel Tuple"] = inferred_df["Relationship Tuple"].apply(ast.literal_eval)

# Start with an empty pedigree
inferred_pedigree = {}

# Process relationships in order of confidence (highest score first)
for _, row in inferred_df.iterrows():
    id1 = row["Individual 1"]
    id2 = row["Individual 2"]
    rel_tuple = row["Rel Tuple"]
    
    # Initialize missing individuals in the pedigree
    if id1 not in inferred_pedigree:
        inferred_pedigree[id1] = {}
    if id2 not in inferred_pedigree:
        inferred_pedigree[id2] = {}
    
    # Parent-child relationships
    if rel_tuple == (0, 1, 1):  # id1 is parent of id2
        inferred_pedigree[id2][id1] = 1
    elif rel_tuple == (1, 0, 1):  # id1 is child of id2
        inferred_pedigree[id1][id2] = 1
    
    # For sibling relationships, we'd need to add common parents
    # This is a simplified approach and would need more complex logic
    # in a real implementation

# Visualize the inferred pedigree
visualize_pedigree(inferred_pedigree, title="Inferred Pedigree from IBD Data", individual_metadata=metadata)
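
The loop above skips sibling relationships because they require shared parents that may not be in the dataset. One possible approach, sketched here as an assumption rather than as Bonsai's method, is to link full siblings through placeholder parents. The helper `add_full_siblings` is hypothetical. It works on the same `{child: {parent: 1}}` dictionary, assumes integer IDs, and uses negative IDs for the unobserved parents.

```python
def add_full_siblings(pedigree, sib1, sib2):
    """Link sib1 and sib2 as full siblings in a {child: {parent: 1}} dict.

    Known parents of either sibling are shared with the other, and
    missing parents are filled with negative placeholder IDs
    (an assumption of this sketch). Modifies and returns `pedigree`.
    """
    parents1 = pedigree.setdefault(sib1, {})
    parents2 = pedigree.setdefault(sib2, {})

    shared = set(parents1) | set(parents2)
    if len(shared) > 2:
        raise ValueError(f"{sib1} and {sib2} would need more than two parents")

    # Pick placeholder IDs below every ID already in use (and below zero)
    all_ids = set(pedigree) | {p for ps in pedigree.values() for p in ps}
    next_placeholder = min(min(all_ids), 0) - 1
    while len(shared) < 2:
        shared.add(next_placeholder)
        next_placeholder -= 1

    for parent in shared:
        parents1[parent] = 1
        parents2[parent] = 1
        pedigree.setdefault(parent, {})  # make sure the parent has a node
    return pedigree


toy_pedigree = {5: {3: 1}}  # individual 3 is a known parent of 5
add_full_siblings(toy_pedigree, 5, 6)
print(toy_pedigree)
```

The `ValueError` guards against merging siblings whose known parents already conflict, which is the kind of inconsistency a full implementation would need to resolve rather than reject.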

### 3.3 Comparing the Inferred Pedigree to Ground Truth

Let's analyze how well our inferred pedigree matches the ground truth pedigree:

In [None]:
# Extract parent-child relationships from both pedigrees
def extract_parent_child_pairs(pedigree):
    """Extract all parent-child pairs from a pedigree."""
    pairs = set()
    for child, parents in pedigree.items():
        for parent in parents:
            pairs.add((parent, child))
    return pairs

true_pairs = extract_parent_child_pairs(true_pedigree)
inferred_pairs = extract_parent_child_pairs(inferred_pedigree)

# Find true positives, false positives, and false negatives
true_positives = true_pairs.intersection(inferred_pairs)
false_positives = inferred_pairs - true_pairs
false_negatives = true_pairs - inferred_pairs

# Calculate precision, recall, and F1 score
precision = len(true_positives) / len(inferred_pairs) if inferred_pairs else 0
recall = len(true_positives) / len(true_pairs) if true_pairs else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

print("Pedigree Reconstruction Accuracy:\n")
print(f"Total true parent-child relationships: {len(true_pairs)}")
print(f"Total inferred parent-child relationships: {len(inferred_pairs)}\n")
print(f"True positives: {len(true_positives)}")
print(f"False positives: {len(false_positives)}")
print(f"False negatives: {len(false_negatives)}\n")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

# Print examples of errors if any
if false_positives:
    print("\nExamples of false positive relationships (incorrectly inferred):")
    for parent, child in list(false_positives)[:3]:  # Show up to 3 examples
        print(f"  - Individual {parent} is not actually the parent of individual {child}")
        
if false_negatives:
    print("\nExamples of false negative relationships (missed):")
    for parent, child in list(false_negatives)[:3]:  # Show up to 3 examples
        print(f"  - Failed to identify that individual {parent} is the parent of individual {child}")

## Summary

In this lab, we've explored how Bonsai v3 assesses and validates relationships between individuals in a pedigree. We've examined the actual implementation of key functions and provided simplified versions for educational purposes. Key takeaways include:

1. **Understanding Bonsai v3's Core Functions**: We examined the source code of crucial functions like `is_valid_relationship`, `passes_age_check`, and `assess_connections` to understand how Bonsai validates and evaluates relationships.

2. **Biological Validation**: Bonsai validates relationships using biological constraints such as sex and age. Functions like `is_valid_relationship` and `passes_age_check` ensure that proposed relationships are biologically plausible.

3. **IBD-Based Assessment**: The `assess_connections` function evaluates how well a proposed relationship explains observed IBD sharing. This involves comparing observed IBD to expected values based on relationship type.

4. **Relationship Inference**: By testing multiple possible relationships and selecting the one with the highest score, Bonsai can infer the most likely relationship between two individuals based on their genetic sharing.

5. **Pedigree Building**: These relationship assessment techniques form the foundation of Bonsai's pedigree reconstruction capabilities, allowing it to build pedigrees that best explain the observed genetic data.

6. **Conflict Resolution**: When multiple relationship hypotheses are plausible, Bonsai can rank them by likelihood, helping users identify the most probable relationships.

7. **Integration in Bonsai's Workflow**: We saw how these functions are integrated into Bonsai v3's larger workflow, with functions like `combine_pedigrees` and `combine_up_dicts` handling the actual construction of complete pedigrees.

Understanding these relationship assessment mechanisms is crucial for effective pedigree reconstruction in genetic genealogy, as they determine which relationships are included in the final pedigree and how conflicting evidence is resolved.

In [None]:
# Convert this notebook to PDF using poetry
!poetry run jupyter nbconvert --to pdf Lab12_Relationship_Assessment.ipynb

# Note: PDF conversion requires LaTeX to be installed on your system
# If you encounter errors, you may need to install it:
# On Ubuntu/Debian: sudo apt-get install texlive-xetex
# On macOS with Homebrew: brew install texlive