# Lab 11: Finding Connection Points Between Individuals

## Overview

In this lab, we'll explore how Bonsai v3 identifies potential connection points between individuals or pedigrees. Finding optimal ways to connect individuals is a core challenge in computational genetic genealogy, and understanding these algorithms is essential for pedigree reconstruction.

In [None]:
# Standard imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from IPython.display import display, HTML, Markdown
import inspect
import importlib

sys.path.append(os.path.dirname(os.getcwd()))

# Cross-compatibility setup
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite, save_results, save_plot

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [None]:
# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    bonsaitree_dir = os.path.join(utils_dir, 'bonsaitree')
    
    # Add to path if it exists and isn't already there
    if os.path.exists(bonsaitree_dir) and bonsaitree_dir not in sys.path:
        sys.path.append(bonsaitree_dir)
        print(f"Added {bonsaitree_dir} to sys.path")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Some Bonsai functionality may be limited.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring modules
def display_module_classes(module_name):
    """Display classes and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all classes
        classes = inspect.getmembers(module, inspect.isclass)
        
        # Filter classes defined in this module (not imported)
        classes = [(name, cls) for name, cls in classes if cls.__module__ == module_name]
        
        # Print info for each class
        for name, cls in classes:
            print(f"\n## {name}")
            
            # Get docstring
            doc = inspect.getdoc(cls)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
            
            # Get methods
            methods = inspect.getmembers(cls, inspect.isfunction)
            if methods:
                print("\nMethods:")
                for method_name, method in methods:
                    if not method_name.startswith('_'):  # Skip private methods
                        print(f"- {method_name}")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def display_module_functions(module_name):
    """Display functions and their docstrings from a module"""
    try:
        # Import the module
        module = importlib.import_module(module_name)
        
        # Find all functions
        functions = inspect.getmembers(module, inspect.isfunction)
        
        # Filter functions defined in this module (not imported)
        functions = [(name, func) for name, func in functions if func.__module__ == module_name]
        
        # Print info for each function
        for name, func in functions:
            if name.startswith('_'):  # Skip private functions
                continue
                
            print(f"\n## {name}")
            
            # Get signature
            sig = inspect.signature(func)
            print(f"Signature: {name}{sig}")
            
            # Get docstring
            doc = inspect.getdoc(func)
            if doc:
                print(f"Docstring: {doc}")
            else:
                print("No docstring available")
    except ImportError as e:
        print(f"Error importing module {module_name}: {e}")
    except Exception as e:
        print(f"Error processing module {module_name}: {e}")

def view_source(obj):
    """Display the source code of an object (function or class)"""
    try:
        source = inspect.getsource(obj)
        display(Markdown(f"```python\n{source}\n```"))
    except Exception as e:
        print(f"Error retrieving source: {e}")

## Check Bonsai Installation

Let's verify that the Bonsai v3 module is available for import:

In [None]:
try:
    from utils.bonsaitree.bonsaitree import v3
    print("✅ Successfully imported Bonsai v3 module")
except ImportError as e:
    print(f"❌ Failed to import Bonsai v3 module: {e}")
    print("This lab requires access to the Bonsai v3 codebase.")
    print("Make sure you've properly set up your environment with the Bonsai repository.")

## Introduction

Finding connection points between individuals or pedigrees is a fundamental operation in genetic genealogy. When we have evidence that two people are related (e.g., from shared DNA segments), we need to determine how they might be connected in a family tree.

In this lab, we'll explore:

1. How Bonsai v3 identifies potential connection points within pedigrees
2. Algorithms for finding optimal connection points based on genetic data
3. Methods for evaluating and ranking different connection hypotheses

## Part 1: Identifying Potential Connection Points

Let's begin by examining how Bonsai v3 identifies potential connection points within a pedigree. A connection point is a position in the pedigree where two individuals or subtrees can be connected.
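
To make this concrete, here is a minimal sketch of the data structure used throughout this lab, assuming the up-node dictionary convention of the example pedigrees (`{child: {parent: degree}}`, where the value `1` marks a direct parent). The helper `nodes_open_upward` is a hypothetical name introduced only for illustration. It lists individuals with fewer than two recorded parents, which are the positions where a new ancestor could be attached.

```python
# Toy pedigree in up-node form: each key is an individual and each value maps
# that individual's recorded parents to a degree (1 = direct parent).
toy_pedigree = {
    3: {1: 1, 2: 1},  # 3 is the child of 1 and 2
    4: {3: 1},        # 4 has only one recorded parent
    1: {},            # founder
    2: {},            # founder
}

def nodes_open_upward(up_node_dict):
    """Return IDs with fewer than two recorded parents (hypothetical helper)."""
    # Collect every ID that appears as a child or as a parent
    all_ids = set(up_node_dict)
    for parents in up_node_dict.values():
        all_ids.update(parents)
    # An individual can gain an ancestor only if a parent slot is still free
    return {i for i in all_ids if len(up_node_dict.get(i, {})) < 2}

print(sorted(nodes_open_upward(toy_pedigree)))
```

Individual 3 is excluded because both parent slots are filled; everyone else remains a candidate for an upward connection.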

In [None]:
# First, let's examine the actual Bonsai v3 implementations of connection point functions
if not is_jupyterlite():
    try:
        # Import the functions for examination
        from utils.bonsaitree.bonsaitree.v3.pedigrees import (
            get_possible_connection_point_set as actual_get_possible_connection_point_set,
            restrict_connection_point_set as actual_restrict_connection_point_set,
            get_likely_con_pt_set as actual_get_likely_con_pt_set,
            get_partner_id_set as actual_get_partner_id_set,
            reverse_node_dict as actual_reverse_node_dict
        )
        
        # Display the source code for these functions
        print("Source code for get_possible_connection_point_set in Bonsai v3:")
        view_source(actual_get_possible_connection_point_set)
        
        print("\nSource code for get_partner_id_set in Bonsai v3:")
        view_source(actual_get_partner_id_set)
        
        print("\nSource code for restrict_connection_point_set in Bonsai v3:")
        view_source(actual_restrict_connection_point_set)
        
        print("\nSource code for get_likely_con_pt_set in Bonsai v3:")
        view_source(actual_get_likely_con_pt_set)
        
        # Now import for use
        from utils.bonsaitree.bonsaitree.v3.pedigrees import (
            get_possible_connection_point_set,
            restrict_connection_point_set,
            get_likely_con_pt_set,
            get_partner_id_set,
            reverse_node_dict
        )
        
        print("\nSuccessfully imported connection point functions from Bonsai v3")
        
    except Exception as e:
        print(f"Error accessing Bonsai v3 implementations: {e}")
        print("Using simplified implementations instead")
        
        # Define simplified versions as fallback
        def reverse_node_dict(dct):
            """Reverse a node dict. If it's a down dict make it an up dict and vice versa."""
            rev_dct = {}
            for i, info in dct.items():
                for a, d in info.items():
                    if a not in rev_dct:
                        rev_dct[a] = {}
                    rev_dct[a][i] = d
            return rev_dct
        
        def get_partner_id_set(node, up_dct):
            """Find the set of partners of node in pedigree up_dct."""
            down_dct = reverse_node_dict(up_dct)
            child_id_set = {c for c, d in down_dct.get(node, {}).items() if d == 1}
            partner_id_set = set()
            for cid in child_id_set:
                pids = {p for p, d in up_dct.get(cid, {}).items() if d == 1}
                partner_id_set |= pids
            partner_id_set -= {node}
            return partner_id_set
        
        def get_possible_connection_point_set(ped):
            """Find all possible points through which a pedigree can be connected to another pedigree."""
            point_set = set()
            all_ids = set()
            for node, parents in ped.items():
                all_ids.add(node)
                all_ids.update(parents.keys())
                
            for a in all_ids:
                parent_to_deg = ped.get(a, {})
                if len(parent_to_deg) < 2:
                    point_set.add((a, None, 1))  # Can connect upward
                    
                partners = get_partner_id_set(a, ped)
                point_set.add((a, None, 0))  # Can connect downward
                for partner in partners:
                    if (partner, a, 0) not in point_set:
                        point_set.add((a, partner, 0))
                    point_set.add((a, partner, None))
                    
                point_set.add((a, None, None))  # Can replace node
                
            return point_set
        
        def restrict_connection_point_set(up_dct, con_pt_set, id_set):
            """Simplified version for fallback"""
            # In a real implementation, this would restrict the connection points
            # based on the subtree connecting id_set
            # For simplicity, we'll just return the original set
            return con_pt_set
        
        def get_likely_con_pt_set(up_dct, id_to_shared_ibd, rel_dict, con_pt_set, max_con_pts=5):
            """Simplified version for fallback"""
            # This would normally rank connection points based on IBD sharing
            # For simplicity, we'll just return an arbitrary subset of at most
            # max_con_pts points (sets are unordered, so there is no "first")
            return set(list(con_pt_set)[:max_con_pts])

# For JupyterLite compatibility
else:
    print("Running in JupyterLite environment. Using simplified implementations.")
    
    def reverse_node_dict(dct):
        """Reverse a node dict. If it's a down dict make it an up dict and vice versa."""
        rev_dct = {}
        for i, info in dct.items():
            for a, d in info.items():
                if a not in rev_dct:
                    rev_dct[a] = {}
                rev_dct[a][i] = d
        return rev_dct
    
    def get_partner_id_set(node, up_dct):
        """Find the set of partners of node in pedigree up_dct."""
        down_dct = reverse_node_dict(up_dct)
        child_id_set = {c for c, d in down_dct.get(node, {}).items() if d == 1}
        partner_id_set = set()
        for cid in child_id_set:
            pids = {p for p, d in up_dct.get(cid, {}).items() if d == 1}
            partner_id_set |= pids
        partner_id_set -= {node}
        return partner_id_set
    
    def get_possible_connection_point_set(ped):
        """Find all possible points through which a pedigree can be connected to another pedigree."""
        point_set = set()
        all_ids = set()
        for node, parents in ped.items():
            all_ids.add(node)
            all_ids.update(parents.keys())
            
        for a in all_ids:
            parent_to_deg = ped.get(a, {})
            if len(parent_to_deg) < 2:
                point_set.add((a, None, 1))  # Can connect upward
                
            partners = get_partner_id_set(a, ped)
            point_set.add((a, None, 0))  # Can connect downward
            for partner in partners:
                if (partner, a, 0) not in point_set:
                    point_set.add((a, partner, 0))
                point_set.add((a, partner, None))
                
            point_set.add((a, None, None))  # Can replace node
            
        return point_set
    
    def restrict_connection_point_set(up_dct, con_pt_set, id_set):
        """Simplified version for JupyterLite"""
        # In a real implementation, this would restrict the connection points
        # based on the subtree connecting id_set
        # For simplicity, we'll just return the original set
        return con_pt_set
    
    def get_likely_con_pt_set(up_dct, id_to_shared_ibd, rel_dict, con_pt_set, max_con_pts=5):
        """Simplified version for JupyterLite"""
        # This would normally rank connection points based on IBD sharing
        # For simplicity, we'll just return an arbitrary subset of at most
        # max_con_pts points (sets are unordered, so there is no "first")
        return set(list(con_pt_set)[:max_con_pts])

### 1.1 Understanding Connection Points

Let's examine the core function `get_possible_connection_point_set` to understand how Bonsai finds connection points:

In [None]:
# View the source code of get_possible_connection_point_set if not in JupyterLite
if not is_jupyterlite():
    print("Source code for get_possible_connection_point_set:")
    view_source(get_possible_connection_point_set)
else:
    print("Using simplified implementation for JupyterLite environment")

A connection point in Bonsai v3 is represented as a tuple of the form `(id1, id2, dir)`, where:

- `id1`: The primary individual through which the connection is made
- `id2`: An optional secondary individual (usually a partner of `id1`), or `None`
- `dir`: The direction of the connection
  - `0`: Connect downward from the node (add as child)
  - `1`: Connect upward from the node (add as parent)
  - `None`: Replace the node or connect laterally
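
The tuple format above can be decoded with a small helper. This is only an illustrative sketch: `describe_connection_point` is a hypothetical name, not part of Bonsai, and the wording of the returned sentences is an assumption based on the list above.

```python
def describe_connection_point(cp):
    """Translate an (id1, id2, dir) tuple into a readable phrase (illustrative)."""
    id1, id2, direction = cp
    # A connection point refers either to one individual or to a couple
    who = f"individual {id1}" if id2 is None else f"the couple {id1} and {id2}"
    if direction is None:
        return f"attach on {who} (replace or connect laterally)"
    if direction == 1:
        return f"attach above {who} (add a parent)"
    if direction == 0:
        return f"attach below {who} (add a child)"
    raise ValueError(f"unexpected direction: {direction!r}")

for cp in [(1, None, 1), (5, 6, 0), (7, None, None)]:
    print(cp, "->", describe_connection_point(cp))
```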

Let's create some example pedigrees and find their connection points:

In [None]:
# Define a function to visualize pedigrees
def visualize_pedigree(up_node_dict, title="Pedigree", highlight_nodes=None):
    """Visualize a pedigree from an up_node_dict using networkx."""
    # Create a directed graph (edges point from child to parent)
    G = nx.DiGraph()
    
    # Collect every ID that appears as a child or as a parent
    all_ids = set(up_node_dict.keys())
    for parents in up_node_dict.values():
        all_ids.update(parents.keys())
    
    # Add nodes explicitly so that isolated individuals are drawn too
    G.add_nodes_from(all_ids)
    
    # Add edges (from child to parent)
    for child, parents in up_node_dict.items():
        for parent in parents:
            G.add_edge(child, parent)
    
    # Create node labels
    node_labels = {node_id: str(node_id) for node_id in G.nodes()}
    
    # Build the color list in G.nodes() order, which is the order nx.draw uses:
    # red for highlighted nodes, blue for genotyped (positive IDs),
    # gray for ungenotyped (negative IDs)
    highlight_nodes = highlight_nodes or set()
    color_map = [
        'red' if node_id in highlight_nodes else 
        'lightblue' if node_id > 0 else 'lightgray' 
        for node_id in G.nodes()
    ]
    
    # Create plot
    plt.figure(figsize=(10, 6))
    plt.title(title)
    
    # Layout: spring layout with a fixed seed for reproducibility
    # (note that it does not guarantee parents appear above children)
    pos = nx.spring_layout(G, seed=42)
    
    # Draw nodes, labels, and edges in a single call
    nx.draw(G, pos, with_labels=True, labels=node_labels, node_color=color_map, 
            node_size=800, font_weight='bold', arrows=True)
    
    plt.tight_layout()
    plt.show()

In [None]:
# Create a simple example pedigree
simple_pedigree = {
    5: {1: 1, 2: 1},  # Individual 5 has parents 1 and 2
    6: {3: 1, 4: 1},  # Individual 6 has parents 3 and 4
    7: {5: 1, 6: 1},  # Individual 7 has parents 5 and 6
    1: {},            # Founder
    2: {},            # Founder
    3: {},            # Founder
    4: {}             # Founder
}

# Visualize the pedigree
visualize_pedigree(simple_pedigree, title="Simple Three-Generation Pedigree")

In [None]:
# Find all possible connection points in the simple pedigree
connection_points = get_possible_connection_point_set(simple_pedigree)

# Convert to a more readable format
connection_points_df = pd.DataFrame([
    {
        "Primary ID": cp[0],
        "Secondary ID": cp[1] if cp[1] is not None else "None",
        "Direction": {
            0: "Down (add child)",
            1: "Up (add parent)",
            None: "Replace/Lateral"
        }.get(cp[2], str(cp[2]))
    }
    for cp in connection_points
])

# Display connection points
print(f"Found {len(connection_points)} possible connection points in the simple pedigree:")
display(connection_points_df)

Let's analyze some specific types of connection points:

In [None]:
# Group connection points by type
up_connections = [cp for cp in connection_points if cp[2] == 1]
down_connections = [cp for cp in connection_points if cp[2] == 0]
lateral_connections = [cp for cp in connection_points if cp[2] is None]

print(f"Found {len(up_connections)} upward connections (adding parents):")
for cp in up_connections[:5]:  # Show just first 5 for brevity
    print(f"  - Connect upward from individual {cp[0]}")

print(f"\nFound {len(down_connections)} downward connections (adding children):")
for cp in down_connections[:5]:  # Show just first 5 for brevity
    if cp[1] is None:
        print(f"  - Connect downward from individual {cp[0]}")
    else:
        print(f"  - Connect downward from individual {cp[0]} and partner {cp[1]}")

print(f"\nFound {len(lateral_connections)} lateral connections (replacing/connecting laterally):")
for cp in lateral_connections[:5]:  # Show just first 5 for brevity
    if cp[1] is None:
        print(f"  - Replace individual {cp[0]}")
    else:
        print(f"  - Connect laterally to the pair {cp[0]} and {cp[1]}")

Let's visualize one of these connection points to understand what it means:

In [None]:
# Let's examine the upward connection from individual 1
# In this case, we could add a parent to individual 1
visualize_pedigree(simple_pedigree, title="Upward Connection from Individual 1", highlight_nodes={1})

# Create a modified pedigree showing this connection
import copy
modified_pedigree = copy.deepcopy(simple_pedigree)
modified_pedigree[1][-1] = 1  # Add parent -1 to individual 1
modified_pedigree[-1] = {}    # Add -1 as a founder

visualize_pedigree(modified_pedigree, title="After Adding Parent to Individual 1", highlight_nodes={1, -1})

In [None]:
# Now let's examine a downward connection from individuals 5 and 6
# In this case, we could add a child to the couple 5 and 6
visualize_pedigree(simple_pedigree, title="Downward Connection from Individuals 5 and 6", highlight_nodes={5, 6})

# Create a modified pedigree showing this connection
modified_pedigree = copy.deepcopy(simple_pedigree)
modified_pedigree[8] = {5: 1, 6: 1}  # Add child 8 to individuals 5 and 6

visualize_pedigree(modified_pedigree, title="After Adding Child to Individuals 5 and 6", highlight_nodes={5, 6, 8})

### 1.2 Finding Partners for Connection Points

When looking for connection points, Bonsai needs to identify partners of individuals, as many connections involve couples rather than single individuals. Let's examine how Bonsai identifies partners:

In [None]:
# View the source code of get_partner_id_set if not in JupyterLite
if not is_jupyterlite():
    print("Source code for get_partner_id_set:")
    view_source(get_partner_id_set)
else:
    print("Using simplified implementation for JupyterLite environment")

In [None]:
# Find partners for each individual in the simple pedigree
partner_info = []
for node_id in simple_pedigree.keys():
    partners = get_partner_id_set(node_id, simple_pedigree)
    partner_info.append({
        "Individual": node_id,
        "Partners": sorted(partners),
        "Number of Partners": len(partners)
    })

# Display partner information
partner_df = pd.DataFrame(partner_info)
display(partner_df)
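
The partner lookup used above rests on one idea: two individuals are treated as partners when they appear together as direct parents of the same child. The sketch below shows that idea in a single pass over the up-node dictionary. `partners_by_shared_children` is a hypothetical helper written for this lab, not the Bonsai implementation.

```python
from collections import defaultdict

def partners_by_shared_children(up_node_dict):
    """Map each parent to the set of co-parents of their children (illustrative)."""
    partners = defaultdict(set)
    for child, parents in up_node_dict.items():
        # Keep only direct parents (degree 1)
        direct = [p for p, deg in parents.items() if deg == 1]
        for p in direct:
            partners[p].update(q for q in direct if q != p)
    return dict(partners)

toy = {
    7: {5: 1, 6: 1},   # child of 5 and 6
    9: {5: 1, -1: 1},  # child of 5 and -1, so 5 has two partners
    5: {}, 6: {}, -1: {},
}
print(partners_by_shared_children(toy))
```

Individuals who never appear as a parent are absent from the result, so a caller would use `.get(node, set())` to look them up.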

Now, let's create a more complex pedigree with multiple partnerships to see how Bonsai handles them:

In [None]:
# Create a pedigree with multiple partnerships
complex_pedigree = {
    # Generation 3
    7: {5: 1, 6: 1},    # Child of 5 and 6
    8: {5: 1, 6: 1},    # Child of 5 and 6 (sibling of 7)
    9: {5: 1, -1: 1},   # Child of 5 and -1 (half-sibling of 7 and 8)
    10: {4: 1, -2: 1},  # Child of 4 and -2
    
    # Generation 2
    5: {1: 1, 2: 1},    # Child of 1 and 2
    6: {3: 1, 4: 1},    # Child of 3 and 4
    -1: {},             # Ungenotyped individual
    -2: {},             # Ungenotyped individual
    
    # Generation 1
    1: {},              # Founder
    2: {},              # Founder
    3: {},              # Founder
    4: {}               # Founder
}

# Visualize the complex pedigree
visualize_pedigree(complex_pedigree, title="Complex Pedigree with Multiple Partnerships")

In [None]:
# Find partners for each individual in the complex pedigree
partner_info = []
for node_id in complex_pedigree.keys():
    partners = get_partner_id_set(node_id, complex_pedigree)
    partner_info.append({
        "Individual": node_id,
        "Partners": sorted(partners),
        "Number of Partners": len(partners)
    })

# Display partner information, sorted by number of partners
partner_df = pd.DataFrame(partner_info).sort_values(by="Number of Partners", ascending=False)
display(partner_df)

# Highlight individuals with multiple partners
multiple_partner_ids = [row["Individual"] for _, row in partner_df.iterrows() if row["Number of Partners"] > 1]
if multiple_partner_ids:
    print(f"Individuals with multiple partners: {multiple_partner_ids}")
    visualize_pedigree(complex_pedigree, title="Individuals with Multiple Partners", highlight_nodes=set(multiple_partner_ids))

## Part 2: Restricting and Prioritizing Connection Points

When working with real genetic data, Bonsai needs to prioritize the most likely connection points based on identity-by-descent (IBD) sharing and other evidence. Let's explore how Bonsai restricts and prioritizes connection points.

### 2.1 Restricting Connection Points

The `restrict_connection_point_set` function focuses the search on connection points that could connect to a specific set of individuals (typically those sharing IBD with another pedigree):

In [None]:
# View the source code of restrict_connection_point_set if not in JupyterLite
if not is_jupyterlite():
    print("Source code for restrict_connection_point_set:")
    view_source(restrict_connection_point_set)
else:
    print("Using simplified implementation for JupyterLite environment")

In [None]:
# Get all possible connection points in the complex pedigree
all_connection_points = get_possible_connection_point_set(complex_pedigree)
print(f"Found {len(all_connection_points)} possible connection points in the complex pedigree")

# Let's say individuals 7 and 9 share IBD with another pedigree
# We want to restrict the connection points to those that could explain this sharing
ibd_sharing_individuals = {7, 9}

# Restrict the connection points to those relevant for individuals 7 and 9
restricted_connection_points = restrict_connection_point_set(
    up_dct=complex_pedigree,
    con_pt_set=all_connection_points,
    id_set=ibd_sharing_individuals
)

print(f"After restriction, we have {len(restricted_connection_points)} connection points")

# Convert to a DataFrame for easier analysis
restricted_cp_df = pd.DataFrame([
    {
        "Primary ID": cp[0],
        "Secondary ID": cp[1] if cp[1] is not None else "None",
        "Direction": {
            0: "Down (add child)",
            1: "Up (add parent)",
            None: "Replace/Lateral"
        }.get(cp[2], str(cp[2]))
    }
    for cp in restricted_connection_points
])

# Display restricted connection points
print("\nRestricted connection points:")
display(restricted_cp_df)

# Visualize the pedigree with the IBD-sharing individuals highlighted
visualize_pedigree(complex_pedigree, title="Individuals Sharing IBD", highlight_nodes=ibd_sharing_individuals)
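
To build intuition for what a restriction step might do, the sketch below keeps only connection points whose primary ID is one of the IBD-sharing individuals or one of their ancestors. This is a deliberately simplified assumption for illustration, not the logic of Bonsai's `restrict_connection_point_set`; both helper names are hypothetical.

```python
def ancestors_or_self(up_node_dict, start_ids):
    """Collect start_ids plus every ancestor reachable in the up-node dict."""
    seen = set()
    stack = list(start_ids)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        # Walk upward through the recorded parents of this node
        stack.extend(up_node_dict.get(node, {}))
    return seen

def restrict_to_ancestral_lines(up_node_dict, con_pt_set, id_set):
    """Keep connection points whose primary ID lies on an ancestral line of id_set."""
    keep = ancestors_or_self(up_node_dict, id_set)
    return {cp for cp in con_pt_set if cp[0] in keep}

toy = {
    7: {5: 1, 6: 1},
    9: {5: 1, -1: 1},
    10: {4: 1, -2: 1},
    5: {1: 1, 2: 1},
    6: {3: 1, 4: 1},
}
points = {(7, None, 0), (10, None, 0), (5, None, None), (4, None, 1)}
restricted = restrict_to_ancestral_lines(toy, points, {7, 9})
print(len(points), "->", len(restricted))
```

In this toy example the point on individual 10 is dropped because 10 is neither an IBD-sharing individual nor one of their ancestors.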

### 2.2 Finding the Most Likely Connection Points

Once we have a set of potential connection points, Bonsai needs to rank them based on how well they explain the observed IBD sharing. This is done using the `get_likely_con_pt_set` function:

In [None]:
# View the source code of get_likely_con_pt_set if not in JupyterLite
if not is_jupyterlite():
    print("Source code for get_likely_con_pt_set:")
    view_source(get_likely_con_pt_set)
else:
    print("Using simplified implementation for JupyterLite environment")

In [None]:
# First, let's examine the actual Bonsai v3 implementations for relationship calculation
if not is_jupyterlite():
    try:
        # Import the functions for examination
        from utils.bonsaitree.bonsaitree.v3.pedigrees import (
            get_rel_dict as actual_get_rel_dict,
            get_simple_rel_tuple as actual_get_simple_rel_tuple
        )
        
        # Display the source code for these functions
        print("Source code for get_simple_rel_tuple in Bonsai v3:")
        view_source(actual_get_simple_rel_tuple)
        
        print("\nSource code for get_rel_dict in Bonsai v3:")
        view_source(actual_get_rel_dict)
        
        # Now import for use
        from utils.bonsaitree.bonsaitree.v3.pedigrees import get_rel_dict, get_simple_rel_tuple
        
        print("\nSuccessfully imported relationship functions from Bonsai v3")
        
    except Exception as e:
        print(f"Error accessing Bonsai v3 implementations: {e}")
        print("Using simplified implementations instead")
        
        # Define simplified versions as fallback
        def get_simple_rel_tuple(up_node_dict, i, j):
            """Get relationship tuple (up, down, num_ancs) between individuals i and j."""
            if i == j:
                return (0, 0, 2)
            
            # Simple implementation for fallback - this would be more complex in reality
            if j in up_node_dict.get(i, {}):
                return (1, 0, 1)  # i is child of j
            elif i in up_node_dict.get(j, {}):
                return (0, 1, 1)  # i is parent of j
            
            # Check for siblings (simplified: cousins and more distant relatives are not detected)
            i_parents = set(up_node_dict.get(i, {}).keys())
            j_parents = set(up_node_dict.get(j, {}).keys())
            common_parents = i_parents.intersection(j_parents)
            
            if common_parents:
                if len(common_parents) == 2:
                    return (1, 1, 2)  # Full siblings
                else:
                    return (1, 1, 1)  # Half siblings
            
            # Default - no relationship found
            return None
        
        def get_rel_dict(up_dct):
            """Get dict mapping each ID pair to their relationship tuple."""
            all_ids = set()
            for node, parents in up_dct.items():
                all_ids.add(node)
                all_ids.update(parents.keys())
            
            rel_dict = {}
            for i in all_ids:
                rel_dict[i] = {}
                for j in all_ids:
                    rel = get_simple_rel_tuple(up_dct, i, j)
                    if rel is not None:
                        rel_dict[i][j] = rel
            
            return rel_dict
            
# For JupyterLite compatibility
else:
    print("Running in JupyterLite environment. Using simplified implementations.")
    
    def get_simple_rel_tuple(up_node_dict, i, j):
        """Get relationship tuple (up, down, num_ancs) between individuals i and j."""
        if i == j:
            return (0, 0, 2)
        
        # Simple implementation for JupyterLite - this would be more complex in reality
        if j in up_node_dict.get(i, {}):
            return (1, 0, 1)  # i is child of j
        elif i in up_node_dict.get(j, {}):
            return (0, 1, 1)  # i is parent of j
        
        # Check for siblings (simplified: cousins and more distant relatives are not detected)
        i_parents = set(up_node_dict.get(i, {}).keys())
        j_parents = set(up_node_dict.get(j, {}).keys())
        common_parents = i_parents.intersection(j_parents)
        
        if common_parents:
            if len(common_parents) == 2:
                return (1, 1, 2)  # Full siblings
            else:
                return (1, 1, 1)  # Half siblings
        
        # Default - no relationship found
        return None
    
    def get_rel_dict(up_dct):
        """Get dict mapping each ID pair to their relationship tuple."""
        all_ids = set()
        for node, parents in up_dct.items():
            all_ids.add(node)
            all_ids.update(parents.keys())
        
        rel_dict = {}
        for i in all_ids:
            rel_dict[i] = {}
            for j in all_ids:
                rel = get_simple_rel_tuple(up_dct, i, j)
                if rel is not None:
                    rel_dict[i][j] = rel
        
        return rel_dict

In [None]:
# To use get_likely_con_pt_set, we need:
# 1. A dictionary mapping IDs to the amount of IBD they share with another pedigree
# 2. A relationship dictionary mapping each pair of IDs to their relationship tuple

# Let's create a hypothetical IBD sharing scenario
# Individuals 7 and 9 share IBD with another pedigree, with individual 5 being 
# the common ancestor through whom the IBD is inherited
id_to_shared_ibd = {
    5: 100.0,  # Individual 5 would share the most IBD if tested
    7: 50.0,   # Child of 5, inherits about half the IBD
    9: 50.0,   # Another child of 5, also inherits about half
    8: 50.0,   # Sibling of 7, similar IBD sharing
    1: 25.0,   # Parent of 5, shares less IBD
    2: 25.0,   # Other parent of 5, similar sharing
    6: 0.0,    # Unrelated to the common ancestor, no IBD sharing
    10: 0.0,   # Unrelated to the common ancestor, no IBD sharing
    3: 0.0,    # Unrelated to the common ancestor, no IBD sharing
    4: 0.0,    # Unrelated to the common ancestor, no IBD sharing
    -1: 0.0,   # Ungenotyped, but partner of 5, no IBD sharing
    -2: 0.0    # Ungenotyped, unrelated, no IBD sharing
}

# Calculate relationship dictionary
rel_dict = get_rel_dict(complex_pedigree)

# Find the most likely connection points
likely_connection_points = get_likely_con_pt_set(
    up_dct=complex_pedigree,
    id_to_shared_ibd=id_to_shared_ibd,
    rel_dict=rel_dict,
    con_pt_set=restricted_connection_points,
    max_con_pts=5  # Return the top 5 most likely connection points
)

# Convert to a DataFrame for easier analysis
likely_cp_df = pd.DataFrame([
    {
        "Primary ID": cp[0],
        "Secondary ID": cp[1] if cp[1] is not None else "None",
        "Direction": {
            0: "Down (add child)",
            1: "Up (add parent)",
            None: "Replace/Lateral"
        }.get(cp[2], str(cp[2]))
    }
    for cp in likely_connection_points
])

# Display the most likely connection points
print(f"Found {len(likely_connection_points)} most likely connection points:")
display(likely_cp_df)

# Get the IDs of the most likely connection points for visualization
likely_cp_ids = {cp[0] for cp in likely_connection_points}
likely_cp_ids.update(cp[1] for cp in likely_connection_points if cp[1] is not None)

# Visualize the pedigree with the most likely connection points highlighted
visualize_pedigree(complex_pedigree, title="Most Likely Connection Points", highlight_nodes=likely_cp_ids)
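
As a rough illustration of ranking, the sketch below scores each connection point by the average IBD shared by the individuals it names and keeps the top few. This is a toy heuristic under stated assumptions, not Bonsai's scoring: `get_likely_con_pt_set` also takes a relationship dictionary, which this sketch ignores. The name `rank_by_shared_ibd` is hypothetical.

```python
def rank_by_shared_ibd(con_pt_set, id_to_shared_ibd, max_con_pts=5):
    """Return up to max_con_pts connection points, highest mean shared IBD first."""
    def score(cp):
        # id1 is always present; id2 may be None
        ids = [i for i in cp[:2] if i is not None]
        return sum(id_to_shared_ibd.get(i, 0.0) for i in ids) / len(ids)

    # repr(cp) breaks ties deterministically without comparing None to int
    ranked = sorted(con_pt_set, key=lambda cp: (-score(cp), repr(cp)))
    return ranked[:max_con_pts]

ibd = {5: 100.0, 7: 50.0, 6: 0.0}
points = {(5, None, 1), (5, 6, 0), (7, None, 0), (6, None, 1)}
top = rank_by_shared_ibd(points, ibd, max_con_pts=2)
print(top)
```

Returning an ordered list rather than a set preserves the ranking, which is lost when the result is converted back to a set.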

## Part 3: Implementing Connection Points

Now that we understand how to identify and prioritize connection points, let's explore how to actually implement these connections by modifying pedigrees.
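
Before the full case study, here is a minimal sketch of the simplest kind of join: two pedigrees with disjoint IDs are merged, and one individual from each receives the same new ungenotyped parent. The function name `join_through_new_ancestor` and the default `new_id=-1` are assumptions made for illustration; this is not how Bonsai combines pedigrees.

```python
import copy

def join_through_new_ancestor(ped1, ped2, id1, id2, new_id=-1):
    """Merge two up-node dicts and make new_id a parent of id1 and id2."""
    ids1 = set(ped1).union(*(set(p) for p in ped1.values()))
    ids2 = set(ped2).union(*(set(p) for p in ped2.values()))
    if ids1 & ids2:
        raise ValueError("pedigrees must use disjoint IDs")
    if new_id in ids1 | ids2:
        raise ValueError("new_id is already in use")

    # Work on copies so the input pedigrees are left untouched
    combined = copy.deepcopy(ped1)
    combined.update(copy.deepcopy(ped2))
    for child in (id1, id2):
        parents = combined.setdefault(child, {})
        if len(parents) >= 2:
            raise ValueError(f"individual {child} already has two parents")
        parents[new_id] = 1
    combined[new_id] = {}  # the new ancestor is a founder
    return combined

ped_a = {3: {1: 1, 2: 1}, 1: {}, 2: {}}
ped_b = {6: {4: 1, 5: 1}, 4: {}, 5: {}}
joined = join_through_new_ancestor(ped_a, ped_b, id1=1, id2=4)
print(joined[1], joined[4])
```

Here founders 1 and 4 become half-siblings through the shared parent `-1`. Real pedigrees often reuse IDs, which is why a general routine also needs an ID-remapping step.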

### 3.1 Case Study: Connecting Two Pedigrees

Let's create two separate pedigrees and demonstrate how to connect them using a connection point:

In [None]:
# Create two separate pedigrees
pedigree1 = {
    3: {1: 1, 2: 1},  # Individual 3 has parents 1 and 2
    1: {},            # Founder
    2: {}             # Founder
}

pedigree2 = {
    6: {4: 1, 5: 1},  # Individual 6 has parents 4 and 5
    4: {},            # Founder
    5: {}             # Founder
}

# Visualize both pedigrees
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
visualize_pedigree(pedigree1, title="Pedigree 1")

plt.subplot(1, 2, 2)
visualize_pedigree(pedigree2, title="Pedigree 2")

In [None]:
# Let's implement a function to connect pedigrees based on a connection point
def connect_pedigrees(pedigree1, pedigree2, connection_point):
    """Connect two pedigrees using the specified connection point.
    
    Args:
        pedigree1: First pedigree (up-node dictionary)
        pedigree2: Second pedigree (up-node dictionary)
        connection_point: Tuple (id1, id2, dir) specifying how to connect
            id1: ID in pedigree1
            id2: Optional partner ID in pedigree1 (can be None)
            dir: Direction of connection (0=down, 1=up, None=replace/lateral)
            
    Returns:
        combined_pedigree: The combined pedigree as an up-node dictionary
    """
    import copy
    combined_pedigree = copy.deepcopy(pedigree1)
    pedigree2_copy = copy.deepcopy(pedigree2)
    
    # Extract connection information
    id1, id2, direction = connection_point
    
    # Collect every ID in both pedigrees so new ungenotyped IDs cannot clash
    all_ids1 = set(pedigree1.keys()).union(*[set(parents.keys()) for parents in pedigree1.values()])
    all_ids2 = set(pedigree2.keys()).union(*[set(parents.keys()) for parents in pedigree2.values()])
    # Next free ungenotyped ID: strictly negative and below every existing ID
    min_id = min(min(all_ids1 | all_ids2) - 1, -1)
    
    # Adjust IDs in pedigree2 to avoid conflicts with pedigree1
    # (this simplified example treats a shared ID as two different people)
    id_map = {}
    next_pos_id = max(all_ids1 | all_ids2) + 1
    for old_id in sorted(all_ids1 & all_ids2):
        if old_id > 0:  # Genotyped: remap to a fresh positive ID
            id_map[old_id] = next_pos_id
            next_pos_id += 1
        else:  # Ungenotyped: remap to a fresh negative ID
            id_map[old_id] = min_id
            min_id -= 1
    
    # Apply the ID mapping to pedigree2
    if id_map:
        remapped_pedigree2 = {}
        for node, parents in pedigree2_copy.items():
            new_node = id_map.get(node, node)
            remapped_parents = {id_map.get(p, p): d for p, d in parents.items()}
            remapped_pedigree2[new_node] = remapped_parents
        pedigree2_copy = remapped_pedigree2
    
    # Connect based on direction
    if direction == 0:  # Connect downward (add as child)
        # Create a new individual as child of id1 (and id2 if provided)
        connector_id = min_id
        min_id -= 1
        
        # Add the connector as child of id1 (and id2 if provided)
        combined_pedigree[connector_id] = {id1: 1}
        if id2 is not None:
            combined_pedigree[connector_id][id2] = 1
        
        # Make the connector the parent of all founders in pedigree2
        # Get founders (nodes with no parents) in pedigree2
        founders2 = [node for node, parents in pedigree2_copy.items() if not parents]
        for founder in founders2:
            pedigree2_copy[founder][connector_id] = 1
    
    elif direction == 1:  # Connect upward (add as parent)
        # Get founders in pedigree2
        founders2 = [node for node, parents in pedigree2_copy.items() if not parents]
        
        if len(founders2) == 1:  # If pedigree2 has a single founder, connect directly
            combined_pedigree[id1][founders2[0]] = 1
        else:  # Otherwise, create a connector individual
            connector_id = min_id
            min_id -= 1
            
            # Add the connector as parent of id1
            combined_pedigree[id1][connector_id] = 1
            combined_pedigree[connector_id] = {}
            
            # Make the founders of pedigree2 parents of the connector
            for founder in founders2:
                combined_pedigree[connector_id][founder] = 1
    
    else:  # Replace/lateral connection
        # We'll implement this as replacing id1 with a founder from pedigree2
        founders2 = [node for node, parents in pedigree2_copy.items() if not parents]
        if founders2:  # If there are founders in pedigree2
            replaced_founder = founders2[0]
            
            # Replace all occurrences of id1 with replaced_founder
            for node, parents in combined_pedigree.items():
                if id1 in parents:
                    degree = parents.pop(id1)
                    parents[replaced_founder] = degree
            
            # Handle the connections of id1
            if id1 in combined_pedigree:
                # Transfer the parents of id1 to replaced_founder
                if replaced_founder not in combined_pedigree:
                    combined_pedigree[replaced_founder] = {}
                combined_pedigree[replaced_founder].update(combined_pedigree[id1])
                del combined_pedigree[id1]  # Remove id1 from the pedigree
    
    # Merge the modified pedigree2 into the combined pedigree
    for node, parents in pedigree2_copy.items():
        if node not in combined_pedigree:
            combined_pedigree[node] = parents
        else:
            combined_pedigree[node].update(parents)
    
    return combined_pedigree

In [None]:
# Now let's connect the pedigrees using different types of connection points

# 1. Connect downward from individual 3 in pedigree1
downward_connection = (3, None, 0)  # Connect downward from individual 3
combined_down = connect_pedigrees(pedigree1, pedigree2, downward_connection)
visualize_pedigree(combined_down, title="Connected Downward from Individual 3", highlight_nodes={3})

# 2. Connect upward from individual 1 in pedigree1
upward_connection = (1, None, 1)  # Connect upward from individual 1
combined_up = connect_pedigrees(pedigree1, pedigree2, upward_connection)
visualize_pedigree(combined_up, title="Connected Upward from Individual 1", highlight_nodes={1})

# 3. Replace individual 2 in pedigree1
replace_connection = (2, None, None)  # Replace individual 2
combined_replace = connect_pedigrees(pedigree1, pedigree2, replace_connection)
visualize_pedigree(combined_replace, title="Replaced Individual 2")
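
The simplified `connect_pedigrees` above does not check that its output is a plausible pedigree. For example, connecting upward from an individual who already has two parents attaches a third. A small sanity check can catch such cases before a combined pedigree is scored. The sketch below is an illustration only; `find_pedigree_problems` is a hypothetical helper, not a Bonsai function.

```python
def find_pedigree_problems(up_dct, max_parents=2):
    """Return a list of human-readable problems found in an up-node dict."""
    problems = []

    # Biological constraint: no one has more than two parents
    for child, parents in up_dct.items():
        if len(parents) > max_parents:
            problems.append(f"{child} has {len(parents)} parents")

    # Structural constraint: nobody may be their own ancestor
    def is_own_ancestor(start):
        stack = list(up_dct.get(start, {}))
        seen = set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(up_dct.get(node, {}))
        return False

    for node in up_dct:
        if is_own_ancestor(node):
            problems.append(f"{node} is its own ancestor")

    return problems


ok = {3: {1: 1, 2: 1}, 1: {}, 2: {}}
bad = {3: {1: 1, 2: 1, 4: 1}, 1: {3: 1}, 2: {}, 4: {}}
print(find_pedigree_problems(ok))
print(find_pedigree_problems(bad))
```

Running this on `combined_down`, `combined_up`, and `combined_replace` before evaluation is one way to rule out connections that are structurally impossible.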

### 3.2 Evaluating Connection Points

After implementing connection points, we need to evaluate how well they explain the observed genetic data. Let's create a function to evaluate a connected pedigree based on how well it explains IBD sharing:

In [None]:
def evaluate_connection(combined_pedigree, id_to_shared_ibd):
    """Evaluate a connected pedigree based on how well it explains IBD sharing.
    
    Args:
        combined_pedigree: Up-node dictionary of the combined pedigree
        id_to_shared_ibd: Dict mapping individual IDs to the amount of IBD they share
        
    Returns:
        score: A score indicating how well the pedigree explains the IBD sharing
              Higher scores are better
    """
    # Calculate relationships in the combined pedigree
    rel_dict = get_rel_dict(combined_pedigree)
    
    # Compute a score based on correlation between relationship degrees and IBD sharing
    # This is a simplified version of what Bonsai actually does
    individuals = sorted(id_to_shared_ibd.keys())
    
    # For each pair of individuals, calculate their expected IBD sharing based on relationship
    expected_ibd = {}
    for i in individuals:
        expected_ibd[i] = 0.0
        if i not in rel_dict:
            continue
            
        for j, rel_tuple in rel_dict[i].items():
            if j not in id_to_shared_ibd or rel_tuple is None:
                continue
                
            # Calculate expected sharing based on relationship degree
            up, down, num_ancs = rel_tuple
            degree = up + down
            if degree == 0:  # Self
                expected_coef = 1.0
            else:
                expected_coef = num_ancs * (0.5 ** degree)
                
            # Add to expected IBD
            expected_ibd[i] += expected_coef * id_to_shared_ibd[j]
    
    # Calculate correlation between expected and observed IBD
    observed = [id_to_shared_ibd[i] for i in individuals]
    expected = [expected_ibd[i] for i in individuals]
    
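    # Pearson correlation is undefined when either vector is constant;
    # return 0.0 in that case instead of propagating NaN
    if np.std(observed) == 0 or np.std(expected) == 0:
        return 0.0
    
    # Score is the Pearson correlation between observed and expected values
    return float(np.corrcoef(observed, expected)[0, 1])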

# Create a sample IBD sharing scenario for evaluation
# Let's say individual 4 in pedigree2 is the one sharing IBD with individuals in pedigree1
sample_ibd_sharing = {
    1: 25.0,  # Moderate IBD sharing
    2: 0.0,   # No IBD sharing
    3: 12.5,  # Some IBD sharing
    4: 100.0, # Highest IBD sharing
    5: 0.0,   # No IBD sharing
    6: 50.0   # Significant IBD sharing
}

In [None]:
# Evaluate each of our connected pedigrees
connection_scores = [
    ("Downward from 3", evaluate_connection(combined_down, sample_ibd_sharing)),
    ("Upward from 1", evaluate_connection(combined_up, sample_ibd_sharing)),
    ("Replace 2", evaluate_connection(combined_replace, sample_ibd_sharing))
]

# Display the scores
score_df = pd.DataFrame(connection_scores, columns=["Connection Type", "Score"])
score_df = score_df.sort_values(by="Score", ascending=False).reset_index(drop=True)
display(score_df)

# Visualize the scores
plt.figure(figsize=(10, 6))
sns.barplot(x="Connection Type", y="Score", data=score_df)
plt.title("Scores for Different Connection Types")
plt.ylabel("Correlation Score (higher is better)")
plt.ylim(-1, 1)  # Pearson correlation ranges from -1 to 1
plt.tight_layout()
plt.show()

# Visualize the best connection
best_connection = score_df.iloc[0]["Connection Type"]
best_pedigree = {
    "Downward from 3": combined_down,
    "Upward from 1": combined_up,
    "Replace 2": combined_replace
}[best_connection]

visualize_pedigree(best_pedigree, title=f"Best Connection: {best_connection}")

### 3.3 Real-World Application: Finding Connection Points in Complex Pedigrees

Let's simulate a more complex, multi-generational pedigree and demonstrate how the connection point algorithms can be applied in practice:

In [None]:
# Define a function to generate a random pedigree
def generate_random_pedigree(num_generations=3, max_children_per_couple=3, prob_multiple_partners=0.2):
    """Generate a random pedigree with the specified number of generations.
    
    Args:
        num_generations: Number of generations to create
        max_children_per_couple: Maximum number of children per couple
        prob_multiple_partners: Probability that a parent re-enters the pool to form another partnership
        
    Returns:
        pedigree: Up-node dictionary representation of the pedigree
    """
    import random
    
    pedigree = {}
    next_id = 1
    generation_ids = []
    
    # Create first generation (founders)
    num_founders = max(4, 2 ** (num_generations - 1))  # Ensure enough founders
    founders = list(range(next_id, next_id + num_founders))
    next_id += num_founders
    
    for founder_id in founders:
        pedigree[founder_id] = {}  # Founders have no parents
    
    generation_ids.append(founders)
    
    # Create subsequent generations
    for gen in range(1, num_generations):
        previous_gen = list(generation_ids[-1])  # Copy so pairing does not empty the stored generation
        current_gen = []
        
        # Shuffle the previous generation for random partnerships
        random.shuffle(previous_gen)
        
        # Create partnerships and children
        while len(previous_gen) >= 2:
            parent1 = previous_gen.pop(0)
            parent2 = previous_gen.pop(0)
            
            # Determine number of children for this couple
            num_children = random.randint(1, max_children_per_couple)
            
            # Create children
            for _ in range(num_children):
                child_id = next_id
                next_id += 1
                pedigree[child_id] = {parent1: 1, parent2: 1}
                current_gen.append(child_id)
            
            # Possibly add parent1 back for another partnership
            if random.random() < prob_multiple_partners:
                previous_gen.append(parent1)
        
        generation_ids.append(current_gen)
    
    return pedigree

# Generate a complex pedigree
complex_random_pedigree = generate_random_pedigree(num_generations=4, max_children_per_couple=2, prob_multiple_partners=0.3)

# Visualize the complex pedigree
visualize_pedigree(complex_random_pedigree, title="Complex Random Pedigree")

In [None]:
# Find all possible connection points in the complex pedigree
complex_connection_points = get_possible_connection_point_set(complex_random_pedigree)
print(f"Found {len(complex_connection_points)} possible connection points in the complex pedigree")

# Categorize connection points by type
up_connections = [cp for cp in complex_connection_points if cp[2] == 1]
down_connections = [cp for cp in complex_connection_points if cp[2] == 0]
lateral_connections = [cp for cp in complex_connection_points if cp[2] is None]

print("Connection points by type:")
print(f"  - Upward connections: {len(up_connections)}")
print(f"  - Downward connections: {len(down_connections)}")
print(f"  - Lateral/replace connections: {len(lateral_connections)}")

# Plot the distribution of connection point types
plt.figure(figsize=(10, 6))
connection_types = [
    ("Upward", len(up_connections)),
    ("Downward", len(down_connections)),
    ("Lateral/Replace", len(lateral_connections))
]
types_df = pd.DataFrame(connection_types, columns=["Connection Type", "Count"])
sns.barplot(x="Connection Type", y="Count", data=types_df)
plt.title("Distribution of Connection Point Types")
plt.tight_layout()
plt.show()

In [None]:
# Let's find and highlight individuals with multiple partners
partner_info = []
for node_id in complex_random_pedigree.keys():
    partners = get_partner_id_set(node_id, complex_random_pedigree)
    partner_info.append({
        "Individual": node_id,
        "Partners": sorted(partners),
        "Number of Partners": len(partners)
    })

# Display partner information, sorted by number of partners
partner_df = pd.DataFrame(partner_info).sort_values(by="Number of Partners", ascending=False)
display(partner_df.head(10))  # Show top 10 individuals with most partners

# Highlight individuals with multiple partners
multiple_partner_ids = partner_df.loc[partner_df["Number of Partners"] > 1, "Individual"].tolist()
if multiple_partner_ids:
    print(f"Found {len(multiple_partner_ids)} individuals with multiple partners: {multiple_partner_ids[:5]}...")
    visualize_pedigree(complex_random_pedigree, title="Individuals with Multiple Partners", highlight_nodes=set(multiple_partner_ids[:5]))
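
Partner relationships are not stored explicitly in an up-node dictionary; they are implied by shared children. The sketch below shows one way to derive them directly from the dictionary. `partners_from_up_dct` is a hypothetical helper written for illustration, and it may differ from how Bonsai's `get_partner_id_set` is implemented.

```python
def partners_from_up_dct(node_id, up_dct):
    """Return the set of individuals who share at least one child with node_id."""
    partners = set()
    for parents in up_dct.values():
        if node_id in parents:
            partners.update(p for p in parents if p != node_id)
    return partners


pedigree = {
    4: {1: 1, 2: 1},  # child of 1 and 2
    5: {1: 1, 3: 1},  # child of 1 and 3, so 1 has two partners
    1: {}, 2: {}, 3: {},
}
for person in (1, 2, 3):
    print(person, sorted(partners_from_up_dct(person, pedigree)))
```

Each partner of an individual gives rise to a distinct (id1, id2, dir) candidate, which is why individuals with several partners contribute more connection points.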

## Summary

In this lab, we've explored how Bonsai v3 identifies, evaluates, and implements connection points between individuals and pedigrees. Key takeaways include:

1. **Connection Point Definition**: A connection point in Bonsai is represented as a tuple (id1, id2, dir), where id1 is the primary individual, id2 is an optional partner, and dir indicates the direction of connection (up, down, or lateral/replace).

2. **Finding Connection Points**: The `get_possible_connection_point_set` function identifies all potential connection points in a pedigree, considering each individual and their possible partners.

3. **Restricting and Prioritizing**: When we have evidence of IBD sharing between pedigrees, Bonsai can restrict and prioritize connection points using the `restrict_connection_point_set` and `get_likely_con_pt_set` functions.

4. **Implementing Connections**: We demonstrated how to implement different types of connections (upward, downward, and lateral/replace) and evaluate how well they explain observed genetic data.

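5. **Evaluation Metrics**: The quality of a connection can be measured by how well it explains the observed IBD sharing patterns across individuals. In this lab we used a simplified proxy for that: the Pearson correlation between each individual's observed IBD and a relationship-weighted expectation.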
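
The relationship weight used inside `evaluate_connection` can be isolated as a small function. This is a sketch only: `expected_coefficient` is a hypothetical helper name, and it assumes the (up, down, num_ancs) tuple layout used by the lab's scoring code, where `up + down` is the number of meioses separating two individuals and `num_ancs` is the number of common ancestors.

```python
def expected_coefficient(up, down, num_ancs):
    """Expected sharing weight for a relationship tuple (up, down, num_ancs)."""
    degree = up + down
    if degree == 0:  # an individual compared with themselves
        return 1.0
    return num_ancs * 0.5 ** degree


examples = {
    "parent/child": (0, 1, 1),
    "full siblings": (1, 1, 2),
    "half siblings": (1, 1, 1),
    "first cousins": (2, 2, 2),
}
for name, rel in examples.items():
    print(f"{name}: {expected_coefficient(*rel)}")
```

The weight halves with every additional meiosis and doubles when a pair shares two common ancestors instead of one, so more distant connection points are expected to explain less IBD.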

Finding optimal connection points is a central challenge in computational genetic genealogy. The algorithms we've explored form the foundation of Bonsai's ability to reconstruct complex pedigrees from genetic data, allowing it to infer relationships and build family trees that explain observed patterns of DNA sharing.

In [None]:
# Convert this notebook to PDF using poetry
!poetry run jupyter nbconvert --to pdf Lab11_Finding_Connection_Points.ipynb

# Note: PDF conversion requires LaTeX to be installed on your system
# If you encounter errors, you may need to install it:
# On Ubuntu/Debian: sudo apt-get install texlive-xetex
# On macOS with Homebrew: brew install texlive