# Lab 23: Handling Twins and Close Relatives

## Overview

This notebook explores Bonsai v3's specialized handling of identical twins and other extremely close genetic relatives that present unique challenges for genetic genealogy. We'll examine how the `twins.py` module implements algorithms to detect twin-like patterns and ensures appropriate treatment in pedigree construction.

**Learning Objectives:**
- Understand the twin detection algorithms in Bonsai v3
- Learn how to identify and classify twin relationships using genetic data
- Explore the implementation of twin sets in pedigree structures
- Analyze the threshold parameters for twin detection
- Apply twin detection to real-world genetic data scenarios

**Prerequisites:**
- Completion of Lab 9: Pedigree Data Structures
- Familiarity with IBD segments and genomic sharing metrics

**Estimated completion time:** 60-90 minutes

In [None]:
# Standard imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown
import inspect
import importlib

sys.path.append(os.path.dirname(os.getcwd()))

# Cross-compatibility setup
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite, save_results, save_plot

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")
sns.set_palette("colorblind")  # Improve accessibility with colorblind-friendly palette

# Configure plot defaults for better readability
plt.rcParams.update({
    'figure.figsize': (10, 6),
    'font.size': 12,
    'axes.labelsize': 12,
    'axes.titlesize': 14,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10
})

In [None]:
# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
def display_module_classes(module_name):
    """Display classes and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all classes
        classes = inspect.getmembers(module, inspect.isclass)
        
        # Filter classes defined in this module (not imported)
        classes = [(name, cls) for name, cls in classes if cls.__module__ == module_name]
        
        if not classes:
            print(f"No classes found in module {module_name}")
            return
            
        # Print info for each class
        for name, cls in classes:
            display(Markdown(f"### Class: {name}"))
            
            # Get docstring
            doc = inspect.getdoc(cls)
            if doc:
                display(Markdown(f"**Documentation:**\n{doc}"))
            else:
                display(Markdown("*No documentation available*"))
            
            # Get methods
            methods = inspect.getmembers(cls, inspect.isfunction)
            public_methods = [(method_name, method) for method_name, method in methods 
                             if not method_name.startswith('_')]
            
            if public_methods:
                display(Markdown("**Public Methods:**"))
                for method_name, method in public_methods:
                    sig = inspect.signature(method)
                    display(Markdown(f"- `{method_name}{sig}`"))
            else:
                display(Markdown("*No public methods*"))
            
            display(Markdown("---"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def display_module_functions(module_name):
    """Display functions and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all functions
        functions = inspect.getmembers(module, inspect.isfunction)
        
        # Filter functions defined in this module (not imported)
        functions = [(name, func) for name, func in functions if func.__module__ == module_name]
        
        if not functions:
            print(f"No functions found in module {module_name}")
            return
            
        # Filter public functions
        public_functions = [(name, func) for name, func in functions if not name.startswith('_')]
        
        if not public_functions:
            print(f"No public functions found in module {module_name}")
            return
            
        # Print info for each function
        for name, func in public_functions:                
            display(Markdown(f"### Function: {name}"))
            
            # Get signature
            sig = inspect.signature(func)
            display(Markdown(f"**Signature:** `{name}{sig}`"))
            
            # Get docstring
            doc = inspect.getdoc(func)
            if doc:
                display(Markdown(f"**Documentation:**\n{doc}"))
            else:
                display(Markdown("*No documentation available*"))
                
            display(Markdown("---"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def view_function_source(module_name, function_name):
    """Display the source code of a function"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Get the function
        func = getattr(module, function_name)
        
        # Get the source code
        source = inspect.getsource(func)
        
        # Print the source code with syntax highlighting
        display(Markdown(f"### Source code for `{function_name}`\n```python\n{source}\n```"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except AttributeError:
        print(f"Function {function_name} not found in module {module_name}")
    except Exception as e:
        print(f"Error processing function {function_name}: {e}")

def view_class_source(module_name, class_name):
    """Display the source code of a class"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Get the class
        cls = getattr(module, class_name)
        
        # Get the source code
        source = inspect.getsource(cls)
        
        # Print the source code with syntax highlighting
        display(Markdown(f"### Source code for class `{class_name}`\n```python\n{source}\n```"))
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except AttributeError:
        print(f"Class {class_name} not found in module {module_name}")
    except Exception as e:
        print(f"Error processing class {class_name}: {e}")

def explore_module(module_name):
    """Display a comprehensive overview of a module with classes and functions"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Module docstring
        doc = inspect.getdoc(module)
        display(Markdown(f"# Module: {module_name}"))
        
        if doc:
            display(Markdown(f"**Module Documentation:**\n{doc}"))
        else:
            display(Markdown("*No module documentation available*"))
            
        display(Markdown("---"))
        
        # Display classes
        display(Markdown("## Classes"))
        display_module_classes(module_name)
        
        # Display functions
        display(Markdown("## Functions"))
        display_module_functions(module_name)
        
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error exploring module {module_name}: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [None]:
try:
    from bonsaitree import v3
    print("✅ Successfully imported Bonsai v3 module")
    
    # Print Bonsai version information if available
    if hasattr(v3, "__version__"):
        print(f"Bonsai v3 version: {v3.__version__}")
    
    # List key submodules
    print("\nAvailable Bonsai submodules:")
    for module_name in dir(v3):
        if not module_name.startswith("_"):
            print(f"- {module_name}")
except ImportError as e:
    print(f"❌ Failed to import Bonsai v3 module: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")
    print("Make sure you've properly set up your environment with the Bonsai repository.")

## Introduction

Identifying twins and extremely close genetic relatives presents unique challenges in genetic genealogy. While most relationships show varying degrees of genetic sharing, twins share nearly identical genetic profiles, making them a special case for relationship inference algorithms.

In this lab, we'll explore how Bonsai v3 handles twin detection and classification through its `twins.py` module. This module identifies pairs whose genomes are nearly identical, which in practice means identical twins, and groups them so they can be handled consistently. Fraternal twins share about as much DNA as ordinary full siblings, so they are not captured by the same criteria.

**Key concepts we'll cover:**
- Twin detection algorithms and thresholds
- Distinguishing twins from parent-child relationships
- Building twin sets within pedigree structures
- The role of non-genetic information in twin identification

## Part 1: The Twin Detection Challenge

### Theory and Background

Twins present a unique challenge in genetic genealogy for several reasons:

1. **Identical Twins (Monozygotic)**: Share virtually 100% of their DNA because they develop from a single fertilized egg that splits into two embryos. They are always the same sex and typically share very similar physical characteristics.

2. **Fraternal Twins (Dizygotic)**: Develop from two separate fertilized eggs and share approximately 50% of their DNA, similar to non-twin siblings. They can be the same or different sexes.

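3. **Genetic Ambiguity**: Identical twins and parent-child pairs both share DNA across essentially the entire genome, so total sharing alone can make them look alike. The difference is that identical twins share both copies of each chromosome (full IBD), while a parent and child share only one copy (half IBD).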

4. **Pedigree Placement**: When constructing a family tree, twins require special handling to ensure they are both placed as children of the same parents with the correct relationships to other relatives.

Twin detection in genetic genealogy relies on a combination of genetic and non-genetic information:

- **Genetic Criteria**: The proportion of DNA shared between individuals
- **Demographic Criteria**: Age, sex, and other personal information
- **Statistical Thresholds**: Cutoff values that distinguish twins from other close relatives

Bonsai v3 implements these concepts in its `twins.py` module, establishing a systematic approach to twin detection and pedigree integration.
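
To make the genetic criterion concrete, the sketch below tabulates approximate expected sharing for a few close relationships. The fractions are standard expectations, not Bonsai outputs, and `EXPECTED_SHARING` and `describe_pair` are hypothetical names used only for illustration. It also assumes that total half-IBD counts every region where at least one haplotype is shared.

```python
# Approximate expected sharing as fractions of the autosomal genome.
# Assumption: "half" counts any region sharing at least one haplotype
# (IBD1 or IBD2); "full" counts regions sharing both haplotypes (IBD2).
GENOME_LENGTH_CM = 3545  # autosomal length in cM, as used in this lab

EXPECTED_SHARING = {
    "identical twins": {"half": 1.00, "full": 1.00},
    "parent-child": {"half": 1.00, "full": 0.00},
    "full siblings": {"half": 0.75, "full": 0.25},
}

def describe_pair(relationship):
    """Return expected (total_half, total_full) in cM for a relationship."""
    fractions = EXPECTED_SHARING[relationship]
    return (fractions["half"] * GENOME_LENGTH_CM,
            fractions["full"] * GENOME_LENGTH_CM)

for relationship in EXPECTED_SHARING:
    total_half, total_full = describe_pair(relationship)
    print(f"{relationship:16s} half={total_half:7.1f} cM  full={total_full:7.1f} cM")
```

Under these assumptions, identical twins and parent-child pairs have the same half-IBD total, so only the full-IBD total separates them. Fraternal twins are expected to look like ordinary full siblings.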

### Implementation in Bonsai v3

Let's examine how Bonsai v3 implements twin detection in its codebase. First, we'll explore the `twins.py` module structure and its key functions:

In [None]:
try:
    # Import the twins module from Bonsai v3
    from bonsaitree.v3 import twins
    print("✅ Successfully imported the twins module")
    
    # Examine the module structure
    explore_module("bonsaitree.v3.twins")
except ImportError as e:
    print(f"❌ Failed to import twins module: {e}")
    print("Using alternative approach to explore module...")
    
    # Alternative approach: import directly from path
    try:
        sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'utils'))
        from bonsaitree.bonsaitree.v3 import twins
        print("✅ Successfully imported the twins module using alternative path")
        
        # Examine the module structure
        explore_module("bonsaitree.bonsaitree.v3.twins")
    except ImportError as e:
        print(f"❌ Failed to import twins module using alternative path: {e}")
        print("We'll continue with manual explanation of the module.")

In case the module import was unsuccessful, the key functions in the `twins.py` module are reproduced below so we can examine them directly:

In [None]:
# Display the key functions from twins.py
twins_module_functions = """
### Function: is_twin_pair

**Signature:** `is_twin_pair(total_half: float, total_full: float, age1: int, age2: int, sex1: str, sex2: str)`

**Documentation:**
Determine if a pair of individuals are twins.

Args:
    total_half: The total length of half segments shared by the two individuals.
    total_full: The total length of full segments shared by the two individuals.
    age1: The age of the first individual.
    age2: The age of the second individual.
    sex1: Sex of the first individual.
    sex2: Sex of the second individual.

Returns:
    True if the individuals are twins, False otherwise.

### Function: get_twin_sets

**Signature:** `get_twin_sets(ibd_stat_dict: dict[frozenset, dict[str, int]], age_dict: dict[int, float], sex_dict: dict[int, str])`

**Documentation:**
Find all sets of twins.

Args:
    ibd_stat_dict: A dictionary of IBD statistics.
    age_dict: A dictionary mapping ID to age
    sex_dict: A dictionary mapping ID to sex ('M' or 'F')

Returns:
    idx_to_twin_set: A dictionary mapping an index to a set of node IDs
        that form a twin set.
    id_to_idx: A dict mapping each twin ID to its index.
"""

display(Markdown(twins_module_functions))

In [None]:
# Display the implementation of the is_twin_pair function
# The outer string uses ''' so the embedded docstring can keep its """ delimiters
is_twin_pair_source = '''
def is_twin_pair(
    total_half: float,
    total_full: float,
    age1: int,
    age2: int,
    sex1: str,
    sex2: str,
):
    """
    Determine if a pair of individuals are twins.

    Args:
        total_half: The total length of half segments shared by the two individuals.
        total_full: The total length of full segments shared by the two individuals.
        age1: The age of the first individual.
        age2: The age of the second individual.
        sex1: Sex of the first individual.
        sex2: Sex of the second individual.

    Returns:
        True if the individuals are twins, False otherwise.
    """

    # handle unrelated people
    if total_half is None:
        return False
    if total_full is None:
        return False

    if total_half < TWIN_THRESHOLD:
        return False
    elif total_full < TWIN_THRESHOLD:
        return False
    elif sex1 != sex2:
        return False
    elif age1 and age2 and age1 != age2:
        return False

    return True
'''

display(Markdown(f"```python\n{is_twin_pair_source}\n```"))

In [None]:
# Display the implementation of the get_twin_sets function
# The outer string uses ''' so the embedded docstring can keep its """ delimiters
get_twin_sets_source = '''
def get_twin_sets(
    ibd_stat_dict: dict[frozenset, dict[str, int]],
    age_dict: dict[int, float],
    sex_dict: dict[int, str],
):
    """
    Find all sets of twins.

    Args:
        ibd_stat_dict: A dictionary of IBD statistics.
        age_dict: A dictionary mapping ID to age
        sex_dict: A dictionary mapping ID to sex ('M' or 'F')

    Returns:
        idx_to_twin_set: A dictionary mapping an index to a set of node IDs
            that form a twin set.
        id_to_idx: A dict mapping each twin ID to its index.
    """
    idx_to_twin_set = {}
    id_to_idx = {}
    ctr = 0
    for k,v in ibd_stat_dict.items():
        id1, id2 = [*k]
        age1 = age_dict.get(id1)
        age2 = age_dict.get(id2)
        sex1 = sex_dict.get(id1)
        sex2 = sex_dict.get(id2)

        total_half = v.get("total_half")
        total_full = v.get("total_full")

        is_twin = is_twin_pair(
            total_half = total_half,
            total_full = total_full,
            age1 = age1,
            age2 = age2,
            sex1 = sex1,
            sex2 = sex2,
        )

        if is_twin:
            idx1 = id_to_idx.get(id1, ctr)
            idx2 = id_to_idx.get(id2, ctr)

            # Add the IDs
            idx_to_twin_set.setdefault(idx1, set()).add(id1)
            idx_to_twin_set.setdefault(idx2, set()).add(id2)

            # combine sets
            if idx1 != idx2:
                idx_to_twin_set[idx1] |= idx_to_twin_set[idx2]

            for iid in idx_to_twin_set[idx1]:
                id_to_idx[iid] = idx1

            # delete idx2 if it is not ctr
            if idx1 != idx2:
                del idx_to_twin_set[idx2]

            ctr += 1

    return idx_to_twin_set, id_to_idx
'''

display(Markdown(f"```python\n{get_twin_sets_source}\n```"))

### Key Constants for Twin Detection

Bonsai v3 uses the `TWIN_THRESHOLD` constant to determine whether two individuals are twins. Let's examine the definition of this threshold:

In [None]:
# Display the TWIN_THRESHOLD constant
constants_excerpt = """
# From constants.py
AUTO_GENOME_LENGTH = 3545    # Autosomal genome length in centiMorgans
FULL_GENOME_LENGTH = 3725    # Full genome length including X-chromosome in centiMorgans
GENOME_LENGTH = AUTO_GENOME_LENGTH    # Default to autosomal length

# Twin detection threshold
TWIN_THRESHOLD = 0.95 * GENOME_LENGTH  # if two genomes share more than TWIN_THRESHOLD IBD, 
                                       # then we call them twins if their sexes and ages match
"""

display(Markdown(f"```python\n{constants_excerpt}\n```"))

# Calculate the actual threshold value
AUTO_GENOME_LENGTH = 3545
TWIN_THRESHOLD = 0.95 * AUTO_GENOME_LENGTH

print(f"TWIN_THRESHOLD value: {TWIN_THRESHOLD:.2f} cM ({0.95*100:.0f}% of the autosomal genome)")
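
Because `TWIN_THRESHOLD` is defined relative to `GENOME_LENGTH`, its value in cM depends on which genome length is in effect. The short sketch below recomputes it for both lengths quoted in the constants excerpt. `twin_threshold` is a hypothetical helper written for this lab, not part of Bonsai.

```python
AUTO_GENOME_LENGTH = 3545  # cM, autosomal length from the constants excerpt
FULL_GENOME_LENGTH = 3725  # cM, includes the X chromosome

def twin_threshold(genome_length, fraction=0.95):
    """Return the twin threshold in cM for a given genome length."""
    return fraction * genome_length

for label, length in [("autosomal", AUTO_GENOME_LENGTH), ("full", FULL_GENOME_LENGTH)]:
    print(f"{label:9s}: {twin_threshold(length):.2f} cM")
```

With the default autosomal length, a pair needs at least 3367.75 cM of both half-IBD and full-IBD sharing to pass the genetic check.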

### Exercise 1: Understanding Twin Detection Criteria

Let's analyze the `is_twin_pair` function to understand how Bonsai determines if two individuals are twins.

**Task:** Based on the implementation of `is_twin_pair`, complete the function below to check whether pairs of individuals qualify as twins according to Bonsai's criteria. Then analyze what factors influence the twin detection outcome.

**Hint:** Look carefully at the conditions in the `is_twin_pair` function and consider how each factor (IBD sharing, age, sex) contributes to the twin determination.

In [None]:
# Exercise 1 code template
def check_twin_status(total_half_ibd, total_full_ibd, age1, age2, sex1, sex2):
    """Check if a pair of individuals would be classified as twins by Bonsai.
    
    Args:
        total_half_ibd: Total half-IBD sharing in cM
        total_full_ibd: Total full-IBD sharing in cM
        age1: Age of person 1 (or None if unknown)
        age2: Age of person 2 (or None if unknown)
        sex1: Sex of person 1 ('M' or 'F')
        sex2: Sex of person 2 ('M' or 'F')
        
    Returns:
        is_twin: True if classified as twins, False otherwise
        reason: Explanation for the classification
    """
    # TODO: Implement the twin detection logic based on Bonsai's criteria
    # Use TWIN_THRESHOLD = 0.95 * 3545 = 3367.75 cM
    
    # Your code here
    TWIN_THRESHOLD = 0.95 * 3545
    
    # Check for None values in IBD sharing
    if total_half_ibd is None or total_full_ibd is None:
        return False, "Missing IBD data"
    
    # Check IBD thresholds
    if total_half_ibd < TWIN_THRESHOLD:
        return False, f"Half-IBD sharing ({total_half_ibd:.2f} cM) below threshold ({TWIN_THRESHOLD:.2f} cM)"
    
    if total_full_ibd < TWIN_THRESHOLD:
        return False, f"Full-IBD sharing ({total_full_ibd:.2f} cM) below threshold ({TWIN_THRESHOLD:.2f} cM)"
    
    # Check sex matching
    if sex1 != sex2:
        return False, f"Sex mismatch: {sex1} vs {sex2} (must match for twins)"
    
    # Check age matching (only if both ages are known)
    if age1 is not None and age2 is not None and age1 != age2:
        return False, f"Age mismatch: {age1} vs {age2} (must match for twins)"
    
    # If all criteria pass, classify as twins
    return True, "Meets all twin criteria: high IBD sharing, matching sex and age"

# Test cases for the function
test_cases = [
    # total_half_ibd, total_full_ibd, age1, age2, sex1, sex2, expected_result
    (3400, 3400, 30, 30, 'M', 'M', True),       # Identical twins (same age, sex, high IBD)
    (3400, 3400, 30, 31, 'M', 'M', False),      # Age mismatch
    (3400, 3400, 30, 30, 'M', 'F', False),      # Sex mismatch
    (3300, 3400, 30, 30, 'M', 'M', False),      # Half-IBD below threshold
    (3400, 3300, 30, 30, 'M', 'M', False),      # Full-IBD below threshold
    (3400, 3400, None, 30, 'M', 'M', True),     # Age1 unknown
    (3400, 3400, 30, None, 'M', 'M', True),     # Age2 unknown
    (3400, 3400, None, None, 'M', 'M', True),   # Both ages unknown
    (3400, None, 30, 30, 'M', 'M', False),      # Full-IBD unknown
    (None, 3400, 30, 30, 'M', 'M', False),      # Half-IBD unknown
    (1800, 1800, 30, 30, 'M', 'M', False),      # Low IBD (non-twin siblings)
    (3500, 3500, 30, 30, 'F', 'F', True)        # Female twins
]

# Test the function with our test cases
for i, (half_ibd, full_ibd, age1, age2, sex1, sex2, expected) in enumerate(test_cases):
    is_twin, reason = check_twin_status(half_ibd, full_ibd, age1, age2, sex1, sex2)
    
    result = "PASS" if is_twin == expected else "FAIL"
    print(f"Test {i+1}: {result}")
    print(f"  Input: half_ibd={half_ibd}, full_ibd={full_ibd}, age1={age1}, age2={age2}, sex1={sex1}, sex2={sex2}")
    print(f"  Expected: {'Twin' if expected else 'Not Twin'}")
    print(f"  Actual: {'Twin' if is_twin else 'Not Twin'} - {reason}")
    print()

### Analysis of Twin Detection Criteria

Based on our exploration of the `is_twin_pair` function and our implementation of the test cases, we can summarize Bonsai's twin detection criteria as follows:

1. **IBD Sharing Threshold**: The pair must share at least 95% of the autosomal genome (3367.75 cM, given a 3545 cM genome) in both total half-IBD and total full-IBD.

2. **Sex Concordance**: Both individuals must have the same recorded sex. This is consistent with identical twins, who are the same sex. Opposite-sex fraternal twins fail this check, although they would already be excluded by the IBD threshold.

3. **Age Matching**: If the ages of both individuals are known, they must be identical. This reflects the biological reality that twins are born on the same day.

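4. **Missing Data Handling**: 
   - If either IBD total is missing (`None`), the pair is not classified as twins.
   - If age data is missing for one or both individuals, the age criterion is skipped. Because the source uses a truthiness test (`age1 and age2`), a recorded age of 0 is also treated as missing. The exercise solution uses `is not None`, so it differs from the source only in that case.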

These strict criteria ensure that only individuals with very high genetic similarity and matching sex and age are classified as twins. However, there are some potential limitations to this approach:

- Identical twins with slightly different recorded ages (due to data errors or different registration dates) would be missed.
- Fraternal twins are not detected at all. They share about 50% of their DNA, like ordinary full siblings, so they fall far below the 95% threshold whether or not their sex and age match.
- As far as these criteria are concerned, fraternal twins are therefore treated like any other pair of full siblings, despite having the same parents and birth date.
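
These edge cases can be checked directly. The sketch below uses a condensed copy of the `is_twin_pair` logic shown earlier and runs it on a few made-up inputs; the sharing values are illustrative, not real data.

```python
TWIN_THRESHOLD = 0.95 * 3545  # cM, as defined in the constants excerpt

def is_twin_pair(total_half, total_full, age1, age2, sex1, sex2):
    """Condensed copy of the logic shown earlier, for experimentation."""
    if total_half is None or total_full is None:
        return False
    if total_half < TWIN_THRESHOLD or total_full < TWIN_THRESHOLD:
        return False
    if sex1 != sex2:
        return False
    # Truthiness check, as in the source: an age of 0 or None skips this test
    if age1 and age2 and age1 != age2:
        return False
    return True

# Illustrative inputs: (total_half, total_full, age1, age2, sex1, sex2)
cases = {
    "identical twins": (3500, 3500, 30, 30, "F", "F"),
    "identical twins, ages off by one": (3500, 3500, 30, 31, "F", "F"),
    "fraternal twins, same sex": (2650, 890, 30, 30, "F", "F"),
    "ages recorded as 0 and 5": (3500, 3500, 0, 5, "M", "M"),
}

results = {name: is_twin_pair(*args) for name, args in cases.items()}
for name, outcome in results.items():
    print(f"{name:34s} -> {outcome}")
```

Only the first and last cases pass. The last one passes because an age of 0 is falsy, so the age comparison never runs.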

In the next section, we'll explore how Bonsai groups twins into sets and integrates them into pedigree structures.

## Part 2: Building Twin Sets

### Theory and Background

When multiple twins are present in a dataset, Bonsai needs to organize them into **twin sets** - groups of individuals who are all twins with each other. This organization is important for several reasons:

1. **Transitive Relationships**: If person A is a twin of person B, and person B is a twin of person C, then A and C must also be twins. Twin sets enforce this transitive property.

2. **Consistent Pedigree Integration**: All members of a twin set should have identical relationships with other individuals in the pedigree. For example, if one twin has a parent identified, all twins in the set should have that same parent.

3. **Computational Efficiency**: By grouping twins together, algorithms can treat the entire set as a unit in many operations, reducing redundancy.

The process of building twin sets involves:

1. Identifying all pairwise twin relationships using the criteria we examined in Part 1
2. Grouping these pairs into coherent sets based on shared membership
3. Ensuring that all members of a set share the same demographic attributes (sex, age)
4. Maintaining a mapping between individual IDs and their twin set memberships

In genetic genealogy algorithms, twin sets are often used to simplify relationship inference and pedigree construction by reducing the number of entities that need to be positioned independently in the family tree.
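
Grouping pairwise twin calls into sets is a connected-components problem. As an alternative to merging Python sets by hand, the sketch below uses a union-find (disjoint-set) structure. This is a general technique and not the approach Bonsai takes; `group_with_union_find` is a hypothetical name.

```python
def group_with_union_find(pairs):
    """Group IDs into connected components using a union-find structure."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Path halving: point nodes closer to the root while walking up
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    # Collect members under their root to form the final groups
    groups = {}
    for member in parent:
        groups.setdefault(find(member), set()).add(member)
    return list(groups.values())

twin_groups = group_with_union_find([("A", "B"), ("C", "D"), ("B", "C"), ("E", "F")])
print(twin_groups)
```

Here the pair `("B", "C")` links two previously separate groups, so A, B, C, and D end up together while E and F form their own set.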

### Implementation in Bonsai v3

Let's examine the `get_twin_sets` function in detail to understand how Bonsai implements twin set construction:

```python
def get_twin_sets(
    ibd_stat_dict: dict[frozenset, dict[str, int]],
    age_dict: dict[int, float],
    sex_dict: dict[int, str],
):
    """
    Find all sets of twins.

    Args:
        ibd_stat_dict: A dictionary of IBD statistics.
        age_dict: A dictionary mapping ID to age
        sex_dict: A dictionary mapping ID to sex ('M' or 'F')

    Returns:
        idx_to_twin_set: A dictionary mapping an index to a set of node IDs
            that form a twin set.
        id_to_idx: A dict mapping each twin ID to its index.
    """
    idx_to_twin_set = {}  # Maps a twin set index to a set of individual IDs
    id_to_idx = {}        # Maps individual IDs to their twin set index
    ctr = 0               # Counter for creating new twin set indices
    
    # Iterate through all pairwise IBD statistics
    for k,v in ibd_stat_dict.items():
        id1, id2 = [*k]  # Extract the two individual IDs
        
        # Look up demographic information
        age1 = age_dict.get(id1)
        age2 = age_dict.get(id2)
        sex1 = sex_dict.get(id1)
        sex2 = sex_dict.get(id2)

        # Get IBD sharing statistics
        total_half = v.get("total_half")
        total_full = v.get("total_full")

        # Check if this is a twin pair
        is_twin = is_twin_pair(
            total_half = total_half,
            total_full = total_full,
            age1 = age1,
            age2 = age2,
            sex1 = sex1,
            sex2 = sex2,
        )

        if is_twin:
            # Get the twin set indices for each individual, or use counter if not already assigned
            idx1 = id_to_idx.get(id1, ctr)
            idx2 = id_to_idx.get(id2, ctr)

            # Add individuals to their twin sets
            idx_to_twin_set.setdefault(idx1, set()).add(id1)
            idx_to_twin_set.setdefault(idx2, set()).add(id2)

            # If they're in different twin sets, merge the sets
            if idx1 != idx2:
                idx_to_twin_set[idx1] |= idx_to_twin_set[idx2]

            # Update all individuals in the merged set to point to idx1
            for iid in idx_to_twin_set[idx1]:
                id_to_idx[iid] = idx1

            # Clean up - remove the redundant twin set
            if idx1 != idx2:
                del idx_to_twin_set[idx2]

            # Increment counter for potential new twin sets
            ctr += 1

    return idx_to_twin_set, id_to_idx
```

This function takes three inputs:

1. `ibd_stat_dict`: A dictionary containing IBD statistics for pairs of individuals
2. `age_dict`: A dictionary mapping individual IDs to their ages
3. `sex_dict`: A dictionary mapping individual IDs to their sexes

The function returns two outputs:

1. `idx_to_twin_set`: Maps twin set indices to sets of individual IDs
2. `id_to_idx`: Maps individual IDs to their twin set indices

Let's break down the algorithm:

1. **Initialization**: Create empty dictionaries for twin sets and ID mappings
2. **Iteration**: For each pair of individuals with IBD statistics...
   - Check if they form a twin pair based on genetic and demographic criteria
   - If they do, add them to appropriate twin sets
   - If they're already in different twin sets, merge those sets
   - Update all mappings to maintain consistency
3. **Return**: Provide the final twin sets and ID mappings

This approach efficiently handles the transitive property of twin relationships, ensuring that if A is a twin of B, and B is a twin of C, then A, B, and C will all be placed in the same twin set.
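
To see the two return values in practice, the sketch below runs a condensed copy of the logic shown above on a small made-up dataset. The IDs, ages, and sharing values are invented for illustration. Individuals 1, 2, and 3 are modeled as identical triplets and 4 as their parent.

```python
TWIN_THRESHOLD = 0.95 * 3545  # cM

def is_twin_pair(total_half, total_full, age1, age2, sex1, sex2):
    if total_half is None or total_full is None:
        return False
    if total_half < TWIN_THRESHOLD or total_full < TWIN_THRESHOLD:
        return False
    if sex1 != sex2:
        return False
    if age1 and age2 and age1 != age2:
        return False
    return True

def get_twin_sets(ibd_stat_dict, age_dict, sex_dict):
    idx_to_twin_set, id_to_idx, ctr = {}, {}, 0
    for pair, stats in ibd_stat_dict.items():
        id1, id2 = [*pair]
        if is_twin_pair(stats.get("total_half"), stats.get("total_full"),
                        age_dict.get(id1), age_dict.get(id2),
                        sex_dict.get(id1), sex_dict.get(id2)):
            idx1 = id_to_idx.get(id1, ctr)
            idx2 = id_to_idx.get(id2, ctr)
            idx_to_twin_set.setdefault(idx1, set()).add(id1)
            idx_to_twin_set.setdefault(idx2, set()).add(id2)
            if idx1 != idx2:
                idx_to_twin_set[idx1] |= idx_to_twin_set[idx2]
            for iid in idx_to_twin_set[idx1]:
                id_to_idx[iid] = idx1
            if idx1 != idx2:
                del idx_to_twin_set[idx2]
            ctr += 1
    return idx_to_twin_set, id_to_idx

# Keys are frozensets of two IDs; values hold the IBD totals in cM
ibd_stat_dict = {
    frozenset({1, 2}): {"total_half": 3500, "total_full": 3500},
    frozenset({2, 3}): {"total_half": 3500, "total_full": 3500},
    frozenset({1, 4}): {"total_half": 3500, "total_full": 0},
}
age_dict = {1: 30, 2: 30, 3: 30, 4: 58}
sex_dict = {1: "F", 2: "F", 3: "F", 4: "F"}

idx_to_twin_set, id_to_idx = get_twin_sets(ibd_stat_dict, age_dict, sex_dict)
print(idx_to_twin_set)
print(id_to_idx)
```

Although the pair (1, 3) never appears in the input, all three end up in one set because 2 links them. Individual 4 is excluded by the full-IBD check. The numeric index of the set depends on iteration order, so downstream code should not rely on a particular value.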

### Exercise 2: Implementing Twin Set Construction

Now let's implement a simplified version of the `get_twin_sets` function to practice working with twin sets.

**Task:** Complete the function below to build twin sets from a list of pairwise twin relationships. Your implementation should handle the transitive property of twin relationships.

**Hint:** Use sets to store twin groups, and make sure that when you find a new twin relationship, you check if either individual is already part of an existing twin set.

In [None]:
# Exercise 2 code template
def build_twin_sets(twin_pairs):
    """
    Build twin sets from a list of pairwise twin relationships.
    
    Args:
        twin_pairs: List of tuples (id1, id2) where id1 and id2 are twin IDs
        
    Returns:
        list_of_twin_sets: List of sets, where each set contains IDs of individuals 
                          who are all twins with each other
    """
    # TODO: Implement the twin set construction algorithm
    
    # Initialize an empty list to store the twin sets
    list_of_twin_sets = []
    
    # Process each twin pair
    for id1, id2 in twin_pairs:
        # Find existing sets containing either id
        set1 = None
        set2 = None
        
        for twin_set in list_of_twin_sets:
            if id1 in twin_set:
                set1 = twin_set
            if id2 in twin_set:
                set2 = twin_set
        
        # Case 1: Neither twin is in an existing set - create a new set
        if set1 is None and set2 is None:
            list_of_twin_sets.append({id1, id2})
            
        # Case 2: id1 is in a set, but id2 is not - add id2 to id1's set
        elif set1 is not None and set2 is None:
            set1.add(id2)
            
        # Case 3: id2 is in a set, but id1 is not - add id1 to id2's set
        elif set1 is None and set2 is not None:
            set2.add(id1)
            
        # Case 4: Both twins are in different sets - merge the sets
        elif set1 is not set2:  # Note: 'is not' checks if they're different objects
            # Merge set2 into set1
            set1.update(set2)
            # Remove set2 from the list
            list_of_twin_sets.remove(set2)
            
        # Case 5: Both twins are already in the same set - do nothing
        
    return list_of_twin_sets

# Test cases for the function
test_cases = [
    # Simple case: pairs form a single twin set
    [
        ('A', 'B'),
        ('B', 'C'),
        ('C', 'D')
    ],
    
    # Multiple disjoint twin sets
    [
        ('A', 'B'),
        ('C', 'D'),
        ('E', 'F'),
        ('G', 'H')
    ],
    
    # Complex case with multiple merges
    [
        ('A', 'B'),
        ('C', 'D'),
        ('E', 'F'),
        ('B', 'C'),
        ('F', 'G'),
        ('H', 'I'),
        ('G', 'I')
    ],
    
    # Redundant pairs (should not affect the result)
    [
        ('A', 'B'),
        ('B', 'C'),
        ('A', 'C'),
        ('A', 'B')  # Duplicate
    ]
]

# Test the function with our test cases
for i, twin_pairs in enumerate(test_cases):
    twin_sets = build_twin_sets(twin_pairs)
    
    print(f"Test Case {i+1}:")
    print(f"  Input: {twin_pairs}")
    print(f"  Output: {twin_sets}")
    
    # Additional validation
    # 1. Each individual should appear in exactly one twin set
    all_individuals = {individual for pair in twin_pairs for individual in pair}
    individuals_in_sets = set().union(*twin_sets)
    
    # If the sets are disjoint, their sizes sum to the size of their union
    sets_are_disjoint = sum(len(twin_set) for twin_set in twin_sets) == len(individuals_in_sets)
    
    print(f"  All individuals accounted for: {all_individuals == individuals_in_sets}")
    print(f"  Twin sets are disjoint: {sets_are_disjoint}")
    
    # 2. Each pair should be in the same set
    all_pairs_in_same_set = True
    for id1, id2 in twin_pairs:
        in_same_set = False
        for twin_set in twin_sets:
            if id1 in twin_set and id2 in twin_set:
                in_same_set = True
                break
        if not in_same_set:
            all_pairs_in_same_set = False
            break
    
    print(f"  All pairs in same set: {all_pairs_in_same_set}")
    print()
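
Building twin sets from pairs amounts to finding connected components of the "is a twin of" graph. As a point of comparison, here is a minimal sketch of the same grouping using a union-find (disjoint-set) structure. The function name `twin_sets_union_find` is hypothetical and is not part of Bonsai; it should produce the same groups as `build_twin_sets` without scanning the list of sets for every pair.

```python
def twin_sets_union_find(twin_pairs):
    """Group twin pairs into sets using union-find (illustrative sketch, not Bonsai code)."""
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for id1, id2 in twin_pairs:
        root1, root2 = find(id1), find(id2)
        if root1 != root2:
            parent[root2] = root1  # merge the two groups

    # Collect the members of each group under their root
    groups = {}
    for member in parent:
        groups.setdefault(find(member), set()).add(member)
    return list(groups.values())


pairs = [('A', 'B'), ('C', 'D'), ('E', 'F'), ('B', 'C'),
         ('F', 'G'), ('H', 'I'), ('G', 'I')]
union_find_sets = twin_sets_union_find(pairs)
print(sorted(sorted(s) for s in union_find_sets))
```

For the handful of twins found in a typical dataset the list-scanning version is perfectly adequate; union-find mainly matters if the same grouping logic is reused on much larger inputs.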

## Part 3: Twin Integration in Pedigree Construction

### Theory and Background

Once twin sets have been identified, they need to be properly integrated into pedigree structures. This integration presents several challenges:

1. **Relationship Consistency**: All members of a twin set must have identical relationships with other individuals in the pedigree. If twin A is a parent of person C, then twin B must also be a parent of person C.

2. **Disambiguation**: In some cases, the high genetic similarity between twins makes it difficult to determine unique relationships. For example, if A and B are twins, and C is genetically determined to be a child of one of them, it may be impossible to determine whether C is a child of A or B based on genetics alone.

3. **Efficiency in Representation**: Pedigree data structures should efficiently handle twins without duplicating relationship information.

In Bonsai v3, the twin handling in pedigree construction follows these general principles:

1. Twin sets are identified early in the pedigree construction process
2. Twins are treated as a unit for relationship inference
3. When relationships are assigned to one member of a twin set, they are automatically propagated to all members of the set
4. Visualization and reporting mechanisms indicate twin relationships

Let's examine how this works in practice for different relationship scenarios.
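
To make the "treated as a unit" and "propagated to all members" principles concrete, here is a minimal sketch that works on a plain list of `(parent_id, child_id)` tuples. The helpers `collapse_twins` and `expand_twins` are hypothetical names used only for illustration; Bonsai's internal pedigree structures are different. The sketch stores each relationship once per twin set and then expands it back to every member.

```python
def collapse_twins(relationships, twin_sets):
    """Replace every twin by one representative so each relationship is stored once."""
    # Assumption: the smallest ID in each set serves as its representative
    id_to_rep = {member: min(twin_set) for twin_set in twin_sets for member in twin_set}
    collapsed = {(id_to_rep.get(p, p), id_to_rep.get(c, c)) for p, c in relationships}
    return sorted(collapsed)


def expand_twins(collapsed, twin_sets):
    """Propagate each stored relationship to every member of the twin sets involved."""
    rep_to_members = {min(twin_set): sorted(twin_set) for twin_set in twin_sets}
    expanded = set()
    for parent, child in collapsed:
        for p in rep_to_members.get(parent, [parent]):
            for c in rep_to_members.get(child, [child]):
                expanded.add((p, c))
    return sorted(expanded)


toy_relationships = [('F1', 'C1'), ('F1', 'C2'), ('M1', 'C1'), ('M1', 'C2'), ('F1', 'C3')]
toy_twin_sets = [{'C1', 'C2'}]

collapsed = collapse_twins(toy_relationships, toy_twin_sets)
print(collapsed)
print(expand_twins(collapsed, toy_twin_sets))
```

Collapsing removes the duplicated edges for the twins, and expanding restores a consistent edge for every twin, which is the behavior the principles above describe.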

### Twin-Aware Pedigree Models

In pedigree construction, Bonsai v3 uses the twin set information when inferring relationships and building family trees. The following pseudo-code sketches how twin sets feed into relationship inference:

```python
# Pseudo-code for twin-aware relationship inference in Bonsai.
# `potential_relationships`, `infer_relationship`, and `assign_relationship`
# are placeholders for the surrounding pipeline and are not defined here.
def infer_relationships_with_twins(ibd_data, demographic_data):
    # Step 1: Identify twin sets
    twin_sets, id_to_twin_set = get_twin_sets(ibd_data, demographic_data["age"], demographic_data["sex"])
    
    # Step 2: Treat twin sets as single units for relationship inference
    for id1, id2 in potential_relationships:
        # Expand each ID to its whole twin set, or to itself if it has no twin
        group1 = twin_sets[id_to_twin_set[id1]] if id1 in id_to_twin_set else {id1}
        group2 = twin_sets[id_to_twin_set[id2]] if id2 in id_to_twin_set else {id2}
        
        # Skip pairs that belong to the same twin set
        if group1 is group2:
            continue
        
        # Infer the relationship once, using one representative per group
        relationship = infer_relationship(next(iter(group1)), next(iter(group2)), ibd_data)
        
        # Assign the same relationship to every combination of group members
        for twin1 in group1:
            for twin2 in group2:
                assign_relationship(twin1, twin2, relationship)
```

This twin-aware approach ensures consistency in relationship inference and pedigree construction.
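
The idea can also be tried out in a small runnable form. The sketch below is a toy stand-in, not Bonsai's implementation: `infer_with_twins` is a hypothetical helper, and the inference step is a caller-supplied function (here a lookup table) rather than a real IBD-based likelihood. It shows that inference runs once per pair of twin groups and that the result is copied to every member.

```python
def infer_with_twins(pairs, twin_sets, infer_relationship):
    """Infer each relationship once per pair of twin groups and copy it to all members."""
    id_to_group = {member: frozenset(s) for s in twin_sets for member in s}
    cache = {}      # (group1, group2) -> inferred relationship
    assigned = {}   # (id1, id2) -> relationship

    for id1, id2 in pairs:
        group1 = id_to_group.get(id1, frozenset([id1]))
        group2 = id_to_group.get(id2, frozenset([id2]))
        if group1 == group2:
            continue  # both IDs are in the same twin set; nothing to infer
        key = (group1, group2)
        if key not in cache:
            # Use the smallest ID of each group as its representative
            cache[key] = infer_relationship(min(group1), min(group2))
        for a in group1:
            for b in group2:
                assigned[(a, b)] = cache[key]
    return assigned


# Toy inference function: a lookup table that also records how often it is called
calls = []

def toy_infer(id1, id2):
    calls.append((id1, id2))
    return {('A', 'C'): 'parent-child'}.get((id1, id2), 'unknown')


assigned = infer_with_twins([('A', 'C'), ('B', 'C'), ('A', 'B')], [{'A', 'B'}], toy_infer)
print(assigned)
print(f"Inference calls: {len(calls)}")
```

Because `A` and `B` form one twin set, the second pair reuses the cached result, so both twins receive the same relationship to `C` from a single inference call.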

### Exercise 3: Visualizing Twin-Aware Pedigrees

In this exercise, we'll create a visualization of a pedigree that includes twins, to better understand how they're represented in family trees.

**Task:** Complete the function below to create a NetworkX graph representation of a pedigree with twins, then visualize it with appropriate highlighting for twin sets.

**Hint:** Use different colors or node styles to indicate twin relationships.

In [ ]:
# Exercise 3 code template
import networkx as nx
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

def visualize_pedigree_with_twins(relationships, twin_sets):
    """
    Create a visualization of a pedigree that includes twin relationships.
    
    Args:
        relationships: List of tuples (parent_id, child_id) representing parent-child relationships
        twin_sets: List of sets, where each set contains IDs of individuals who are twins
        
    Returns:
        None (displays the visualization)
    """
    # Create a directed graph for the pedigree
    G = nx.DiGraph()
    
    # Add all relationships to the graph
    for parent, child in relationships:
        G.add_edge(parent, child)
    
    # Ensure all individuals are included, even if they have no relationships
    all_individuals = set()
    for parent, child in relationships:
        all_individuals.add(parent)
        all_individuals.add(child)
    for twin_set in twin_sets:
        all_individuals.update(twin_set)
    
    for individual in all_individuals:
        G.add_node(individual)
    
    # Create a color map for twin sets
    twin_colors = ['lightgreen', 'lightblue', 'lightcoral', 'lightsalmon', 'lavender', 'khaki']
    color_map = {}
    
    # Assign colors to twins
    for i, twin_set in enumerate(twin_sets):
        color = twin_colors[i % len(twin_colors)]
        for twin in twin_set:
            color_map[twin] = color
    
    # Assign default color to non-twins
    for individual in all_individuals:
        if individual not in color_map:
            color_map[individual] = 'lightgrey'
    
    # Convert the color map to a list in the order of the nodes
    node_colors = [color_map[node] for node in G.nodes()]
    
    # Create the visualization
    plt.figure(figsize=(12, 8))
    
    # Spring layout with a fixed seed for reproducibility.
    # Note: this is a general-purpose layout; it does not place generations on separate rows.
    pos = nx.spring_layout(G, seed=42)
    
    # Draw the graph
    nx.draw_networkx_nodes(G, pos, node_size=500, node_color=node_colors)
    nx.draw_networkx_edges(G, pos, arrowstyle='->', arrowsize=15)
    nx.draw_networkx_labels(G, pos, font_size=10)
    
    # Add a legend for twin sets
    legend_patches = []
    for i, twin_set in enumerate(twin_sets):
        color = twin_colors[i % len(twin_colors)]
        # Convert IDs to strings so that integer IDs can also be sorted and joined
        label = f"Twin set {i+1}: {', '.join(sorted(map(str, twin_set)))}"
        legend_patches.append(mpatches.Patch(color=color, label=label))
    
    legend_patches.append(mpatches.Patch(color='lightgrey', label='Non-twins'))
    
    plt.legend(handles=legend_patches, loc='upper left', bbox_to_anchor=(1, 1))
    
    # Adjust the plot (set the title before tight_layout so it is included in the spacing)
    plt.axis('off')
    plt.title('Pedigree with Twin Relationships')
    plt.tight_layout()
    
    plt.show()

# Example data for testing
example_relationships = [
    # Parent-child relationships
    ('GF1', 'F1'),  # Grandfather -> Father
    ('GF1', 'F2'),  # Grandfather -> Uncle
    ('GM1', 'F1'),  # Grandmother -> Father
    ('GM1', 'F2'),  # Grandmother -> Uncle
    ('F1', 'C1'),   # Father -> Child 1
    ('F1', 'C2'),   # Father -> Child 2
    ('F1', 'C3'),   # Father -> Child 3
    ('M1', 'C1'),   # Mother -> Child 1
    ('M1', 'C2'),   # Mother -> Child 2
    ('M1', 'C3'),   # Mother -> Child 3
    ('F2', 'N1'),   # Uncle -> Cousin 1
    ('F2', 'N2'),   # Uncle -> Cousin 2
    ('M2', 'N1'),   # Aunt -> Cousin 1
    ('M2', 'N2'),   # Aunt -> Cousin 2
]

example_twin_sets = [
    {'C1', 'C2'},   # Child 1 and Child 2 are twins
    {'N1', 'N2'},   # Cousin 1 and Cousin 2 are twins
]

# Visualize the example pedigree
visualize_pedigree_with_twins(example_relationships, example_twin_sets)

# Create a more complex example with multiple generations and twin sets
complex_relationships = [
    # Generation 1 -> 2
    ('G1-1', 'G2-1'),
    ('G1-1', 'G2-2'),
    ('G1-2', 'G2-1'),
    ('G1-2', 'G2-2'),
    ('G1-3', 'G2-3'),
    ('G1-3', 'G2-4'),
    ('G1-4', 'G2-3'),
    ('G1-4', 'G2-4'),
    
    # Generation 2 -> 3
    ('G2-1', 'G3-1'),
    ('G2-1', 'G3-2'),
    ('G2-2', 'G3-3'),
    ('G2-2', 'G3-4'),
    ('G2-3', 'G3-5'),
    ('G2-3', 'G3-6'),
    ('G2-3', 'G3-7'),
    ('G2-4', 'G3-8'),
    
    # Generation 3 -> 4
    ('G3-1', 'G4-1'),
    ('G3-1', 'G4-2'),
    ('G3-3', 'G4-3'),
    ('G3-4', 'G4-4'),
    ('G3-5', 'G4-5'),
    ('G3-6', 'G4-6'),
    ('G3-8', 'G4-7'),
    ('G3-8', 'G4-8'),
]

# Every twin set below consists of siblings that share the same listed parent(s)
complex_twin_sets = [
    {'G2-1', 'G2-2'},  # Twins in generation 2
    {'G2-3', 'G2-4'},  # Twins in generation 2
    {'G3-1', 'G3-2'},  # Twins in generation 3
    {'G3-5', 'G3-6', 'G3-7'},  # Triplets in generation 3
    {'G4-1', 'G4-2'},  # Twins in generation 4
    {'G4-7', 'G4-8'},  # Twins in generation 4
]

# Visualize the complex pedigree
visualize_pedigree_with_twins(complex_relationships, complex_twin_sets)
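
The spring layout used above scatters nodes without regard to generation, which can make larger pedigrees hard to read. Here is a minimal sketch of a generation-aware alternative in pure Python. The name `generation_layout` is hypothetical (it is not a NetworkX or Bonsai function), and the sketch assumes the relationships contain no cycles. It returns a `{node: (x, y)}` dictionary of the kind the NetworkX drawing functions accept as `pos`.

```python
from collections import defaultdict


def generation_layout(relationships):
    """Place each individual on a row according to generation (illustrative sketch)."""
    parents = defaultdict(set)
    children = defaultdict(set)
    nodes = set()
    for parent, child in relationships:
        parents[child].add(parent)
        children[parent].add(child)
        nodes.update((parent, child))

    depth = {}

    def longest_depth(node):
        # Generation = length of the longest chain of ancestors above this node
        if node not in depth:
            depth[node] = 1 + max((longest_depth(p) for p in parents[node]), default=-1)
        return depth[node]

    for node in nodes:
        longest_depth(node)

    # Founders who married in would otherwise sit on the top row;
    # move them to the row directly above their children instead.
    for node in nodes:
        if not parents[node] and children[node]:
            depth[node] = min(depth[c] for c in children[node]) - 1

    rows = defaultdict(list)
    for node in sorted(nodes):
        rows[depth[node]].append(node)

    # Center each row horizontally; older generations get larger y values
    pos = {}
    for d, row in rows.items():
        for i, node in enumerate(row):
            pos[node] = (i - (len(row) - 1) / 2, -d)
    return pos


small_pedigree = [('GF1', 'F1'), ('GM1', 'F1'), ('F1', 'C1'),
                  ('M1', 'C1'), ('F1', 'C2'), ('M1', 'C2')]
layout = generation_layout(small_pedigree)
for node in sorted(layout, key=lambda n: -layout[n][1]):
    print(node, layout[node])
```

Passing the resulting dictionary in place of the `nx.spring_layout` result should draw grandparents, parents, and children on separate rows, with twins appearing side by side among their siblings.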

## Real-World Application: Twin Detection in Genetic Genealogy

In real-world genetic genealogy scenarios, accurate twin detection and handling is critical for several reasons:

1. **Accuracy in Relationship Inference**: Failing to recognize twins can lead to inconsistent or incorrect relationship predictions, especially when one twin has more complete data than another.

2. **Pedigree Integrity**: Without proper twin handling, pedigrees might contain contradictory relationships or implausible genetic patterns.

3. **Privacy Considerations**: Identical twins have nearly identical genetic profiles, which has implications for privacy and consent in genetic research and testing.

4. **Medical Applications**: In medical genetics, distinguishing between identical twins and other closely related individuals like parent-child pairs is essential for accurate health risk assessments.

Genetic testing companies and genealogy services that use technologies like Bonsai need to implement robust twin detection algorithms to address these challenges. Some practical strategies include:

- Combining genetic data with non-genetic information (age, sex, birth dates, etc.)
- Explicitly asking users to indicate known twin relationships
- Using advanced statistical methods to distinguish between relationship types with similar genetic sharing patterns
- Implementing special handling for twin sets in pedigree visualization and relationship reporting

In Bonsai v3, the twin detection and handling mechanisms we've explored provide a solid foundation for addressing these real-world needs.
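
To illustrate why the split between half-IBD and full-IBD matters when separating identical twins from other very close relatives, here is a rough sketch. It is not Bonsai's classifier: the function name `rough_close_relationship` and all cutoffs other than the 95% twin threshold are illustrative assumptions, the genome length of 3545 cM is the value used in this lab's answer key, and half-IBD is assumed to count every region where at least one haplotype is shared. The expected patterns it relies on are standard: identical twins share both haplotypes nearly everywhere, parent-child pairs share exactly one haplotype nearly everywhere, and full siblings share both haplotypes over roughly a quarter of the genome.

```python
def rough_close_relationship(total_half_ibd, total_full_ibd, genome_length=3545.0):
    """Coarse label for a very close pair based on IBD totals in cM (illustrative only)."""
    half_frac = total_half_ibd / genome_length
    full_frac = total_full_ibd / genome_length

    # Both haplotypes shared across (almost) the whole genome
    if half_frac >= 0.95 and full_frac >= 0.95:
        return "identical-twin-like"
    # One haplotype shared everywhere, but essentially no full-IBD
    if half_frac >= 0.95 and full_frac < 0.05:
        return "parent-child-like"
    # Illustrative band around the expected ~25% full-IBD for full siblings
    if half_frac >= 0.50 and 0.10 <= full_frac <= 0.40:
        return "full-sibling-like"
    return "other"


examples = {
    "twins": (3545.0, 3545.0),
    "parent and child": (3545.0, 0.0),
    "full siblings": (0.75 * 3545, 0.25 * 3545),
    "more distant": (900.0, 0.0),
}
for name, (half, full) in examples.items():
    print(f"{name}: {rough_close_relationship(half, full)}")
```

A production system would replace these hard cutoffs with likelihoods and would combine them with age and sex information, as described in the strategies above.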

## Self-Assessment Questions

Test your understanding with these questions:

1. What is the TWIN_THRESHOLD value in Bonsai v3, and what does it represent in the context of twin detection?

2. Why does Bonsai check both half-IBD and full-IBD sharing when determining if individuals are twins?

3. What challenge do fraternal twins of different sexes pose for Bonsai's twin detection algorithm?

4. How does the `get_twin_sets` function handle the transitive property of twin relationships?

5. What are two key benefits of organizing twins into sets during pedigree construction?

*Answers to self-assessment questions can be found at the end of the lab document.*

## Summary

In this lab, we explored how Bonsai v3 handles identical twins and other extremely close genetic relatives through its specialized `twins.py` module. Key takeaways include:

1. Twin detection in Bonsai relies on both genetic criteria (high IBD sharing above the TWIN_THRESHOLD of 95% of the genome) and non-genetic criteria (matching sex and age).

2. Bonsai organizes twins into coherent sets that maintain the transitive property of twin relationships, ensuring that if A and B are twins, and B and C are twins, then A and C are also treated as twins.

3. In pedigree construction, twin sets are treated as units with identical relationships to other individuals, which simplifies relationship inference and ensures consistency.

4. Visualizing pedigrees with twins requires special consideration to clearly indicate twin relationships and maintain the correct family structure.

5. Real-world applications of twin detection algorithms face challenges like distinguishing identical twins from parent-child pairs and handling fraternal twins of different sexes.

### Connections to Other Labs

The concepts covered in this lab connect to:
- **Lab 9: Pedigree Data Structures** - Twin sets are integrated into pedigree structures
- **Lab 12: Relationship Assessment** - Twin relationships affect how other relationships are inferred
- **Lab 16: Merging Pedigrees** - Twin detection is important when combining pedigree fragments

### Further Reading

To deepen your understanding of these topics, consider exploring:

- Visscher, P. M. (2006). "Variation of estimates of SNP and haplotype diversity and linkage disequilibrium in samples from the same population due to experimental and evolutionary sample size." *Human Heredity*, 61(4), 189-196.
- Koskenvuo, M., et al. (2012). "Comparison of Classic Twin Model Fitting Approaches: in Genetic and Epigenetic Studies." *Twin Research and Human Genetics*, 15(3), 415-423.
- Levine, A. P., & Pritchard, J. K. (2014). "Statistics for distinguishing closely related pairs of genomes." *Bioinformatics*, 30(16), 2272-2279.

---

## Answer Key (for instructors)

### Exercise 1
The solution is provided in the notebook. The function should implement the twin detection criteria as defined in Bonsai's `is_twin_pair` function:

```python
def check_twin_status(total_half_ibd, total_full_ibd, age1, age2, sex1, sex2):
    TWIN_THRESHOLD = 0.95 * 3545
    
    # Check for None values in IBD sharing
    if total_half_ibd is None or total_full_ibd is None:
        return False, "Missing IBD data"
    
    # Check IBD thresholds
    if total_half_ibd < TWIN_THRESHOLD:
        return False, f"Half-IBD sharing ({total_half_ibd:.2f} cM) below threshold ({TWIN_THRESHOLD:.2f} cM)"
    
    if total_full_ibd < TWIN_THRESHOLD:
        return False, f"Full-IBD sharing ({total_full_ibd:.2f} cM) below threshold ({TWIN_THRESHOLD:.2f} cM)"
    
    # Check sex matching
    if sex1 != sex2:
        return False, f"Sex mismatch: {sex1} vs {sex2} (must match for twins)"
    
    # Check age matching (only if both ages are known)
    if age1 is not None and age2 is not None and age1 != age2:
        return False, f"Age mismatch: {age1} vs {age2} (must match for twins)"
    
    # If all criteria pass, classify as twins
    return True, "Meets all twin criteria: high IBD sharing, matching sex and age"
```

### Exercise 2
The solution is provided in the notebook. The `build_twin_sets` function should handle all cases of adding twins to sets and merging sets when necessary:

```python
def build_twin_sets(twin_pairs):
    list_of_twin_sets = []
    
    for id1, id2 in twin_pairs:
        set1 = None
        set2 = None
        
        for twin_set in list_of_twin_sets:
            if id1 in twin_set:
                set1 = twin_set
            if id2 in twin_set:
                set2 = twin_set
        
        if set1 is None and set2 is None:
            list_of_twin_sets.append({id1, id2})
        elif set1 is not None and set2 is None:
            set1.add(id2)
        elif set1 is None and set2 is not None:
            set2.add(id1)
        elif set1 is not set2:
            set1.update(set2)
            list_of_twin_sets.remove(set2)
    
    return list_of_twin_sets
```

### Exercise 3
The solution is provided in the notebook. The `visualize_pedigree_with_twins` function should:
1. Create a directed graph representation of the pedigree
2. Assign colors to twin sets for visual distinction
3. Visualize the pedigree with appropriate styling and a legend

### Self-Assessment Answers

1. The TWIN_THRESHOLD value in Bonsai v3 is 0.95 * GENOME_LENGTH, which equals approximately 3367.75 cM. This represents the minimum amount of genetic sharing (in centiMorgans) required for two individuals to be classified as twins. It's set at 95% of the autosomal genome length.

2. Bonsai requires both half-IBD and full-IBD sharing to exceed the threshold because half-IBD alone cannot separate identical twins from parent-child pairs: a parent and child share one haplotype across essentially the whole genome but almost no full-IBD. Only identical twins share both haplotypes genome-wide, so the full-IBD check is what distinguishes them from other close relatives.

3. Fraternal twins of different sexes pose a challenge because Bonsai's twin detection algorithm requires matching sex for twin classification. Opposite-sex fraternal twins (which are biologically common) will therefore never be classified as twins in the system, despite being born at the same time to the same parents. Fraternal twins also share only about as much DNA as ordinary full siblings, so they fall below the IBD threshold as well and are handled as full siblings.

4. The `get_twin_sets` function handles the transitive property by merging twin sets. When a twin pair links individuals that belong to two different sets, the function combines those sets and updates all mapping information so that every member points to the merged set.

5. Two key benefits of organizing twins into sets during pedigree construction are:
   - Relationship consistency: All twins in a set have identical relationships with other individuals in the pedigree
   - Computational efficiency: The algorithm can treat the entire set as a unit, reducing redundancy and simplifying the inference process

In [ ]:
# Optional: Convert this notebook to PDF
# Uncomment and run this cell if you want to generate a PDF version

# !jupyter nbconvert --to pdf "Lab23_Handling_Twins.ipynb"