# Lab 7: Up-Node Dictionary and Advanced Pedigree Operations

## Overview

This lab explores the up-node dictionary, the data structure in Bonsai v3 on which all pedigree operations are built. Building on Lab 6, we look at how this structure is used throughout the Bonsai codebase, focusing on manipulation operations, navigation algorithms, and applications in complex pedigree reconstruction scenarios.

Key topics include:

1. Core up-node dictionary representation and its mathematical properties
2. Advanced operations like re-rooting, trimming, and combining pedigrees
3. Algorithms for finding connection points between pedigrees
4. Strategies for handling genotyped vs. inferred (latent) individuals
5. Optimizations for large-scale pedigree operations

By the end of this lab, you'll understand how the up-node dictionary supports navigating, re-rooting, trimming, and merging pedigrees in Bonsai.

In [None]:
# Standard imports
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import inspect
import importlib
import json
from IPython.display import display, HTML, Markdown
import warnings
warnings.filterwarnings('ignore')

sys.path.append(os.path.dirname(os.getcwd()))

# Cross-compatibility setup
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite, save_results, save_plot

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook")

In [None]:
# Setup Bonsai module paths
if not is_jupyterlite():
    # In local environment, add the utils directory to system path
    utils_dir = os.getenv('PROJECT_UTILS_DIR', os.path.join(os.path.dirname(DATA_DIR), 'utils'))
    
    # Add to path if it exists and isn't already there
    if os.path.exists(utils_dir) and utils_dir not in sys.path:
        sys.path.append(utils_dir)
        print(f"Added {utils_dir} to sys.path")
    
    try:
        # Import Bonsai v3 modules using the proper import path
        from utils.bonsaitree.bonsaitree.v3 import pedigrees, connections
        print("✅ Successfully imported Bonsai v3 modules")
    except ImportError as e:
        print(f"❌ Failed to import Bonsai v3 modules: {e}")
        print("This lab requires access to the Bonsai v3 codebase.")
else:
    # In JupyterLite, use a simplified approach
    print("⚠️ Running in JupyterLite: Using simplified implementations of Bonsai functions.")
    print("This notebook is primarily designed for local execution where the Bonsai codebase is available.")

In [None]:
# Helper functions for exploring Bonsai modules and displaying function source code
def view_function_source(module, function_name):
    """Display the source code of a function"""
    try:
        # Get the function
        func = getattr(module, function_name)
        
        # Get the source code
        source = inspect.getsource(func)
        
        # Print the source code
        display(Markdown(f"```python\n{source}\n```"))
    except AttributeError:
        print(f"Function {function_name} not found in the module")
    except Exception as e:
        print(f"Error processing function {function_name}: {e}")

# Simplified implementation of key functions for JupyterLite environment
def impl_reverse_node_dict(dct):
    """Simplified implementation of reverse_node_dict"""
    rev_dct = {}
    for i, info in dct.items():
        for a, d in info.items():
            if a not in rev_dct:
                rev_dct[a] = {}
            rev_dct[a][i] = d
    return rev_dct

def impl_get_rel_set(node_dict, i):
    """Simplified implementation of get_rel_set"""
    rel_set = {i}
    for parent in node_dict.get(i, {}):
        rel_set |= impl_get_rel_set(node_dict, parent)
    return rel_set

def impl_get_founder_set(up_dct):
    """Simplified implementation of get_founder_set"""
    return {node for node, parents in up_dct.items() if not parents}

def impl_get_simple_rel_tuple(up_node_dict, i, j):
    """Very simplified implementation of get_simple_rel_tuple"""
    if i == j:
        return (0, 0, 2)
    
    # Check if j is an ancestor of i
    i_ancestors = impl_get_rel_set(up_node_dict, i)
    if j in i_ancestors:
        # Simple parent-child relationship
        if j in up_node_dict.get(i, {}):
            return (1, 0, 1)
        # Grandparent or higher
        return (2, 0, 1)  # Simplified - real implementation would calculate precise degrees
    
    # Check if i is an ancestor of j
    j_ancestors = impl_get_rel_set(up_node_dict, j)
    if i in j_ancestors:
        # Simple parent-child relationship
        if i in up_node_dict.get(j, {}):
            return (0, 1, 1)
        # Grandparent or higher
        return (0, 2, 1)  # Simplified
    
    # Check for sibling relationships
    common_ancestors = i_ancestors.intersection(j_ancestors)
    if common_ancestors:
        # Full siblings (simplified check)
        i_parents = set(up_node_dict.get(i, {}).keys())
        j_parents = set(up_node_dict.get(j, {}).keys())
        if i_parents and j_parents and i_parents == j_parents and len(i_parents) == 2:
            return (1, 1, 2)
        # Half siblings or cousins (simplified)
        return (1, 1, 1) if i_parents.intersection(j_parents) else (2, 2, 1)
    
    return None  # No relationship found

## 1. Up-Node Dictionary: The Core Representation of Pedigree Structures

The up-node dictionary is the fundamental data structure for representing pedigrees in Bonsai v3. Let's examine its structure and properties in detail.

### Structure Definition

An up-node dictionary has the following structure:

```python
{
    individual_id: {parent_id1: degree1, parent_id2: degree2, ...},
    ...
}
```

Where:
- `individual_id` is the ID of an individual in the pedigree (positive for observed/genotyped individuals, negative for inferred/latent ancestors)
- Each individual maps to a dictionary of their parents
- `parent_id1`, `parent_id2` are the IDs of the individual's parents
- `degree1`, `degree2` represent the genetic distance (typically 1 for a direct parent-child relationship)
- An empty dictionary (`{}`) indicates an individual with no recorded parents (a founder)

Let's create a simple example pedigree to work with:

In [None]:
def create_example_pedigree():
    """Create a simple example pedigree to work with."""
    # The structure we're creating:
    #
    #    -1      -2     -3      -4
    #     |       |      |       |
    #     +-------+      +-------+
    #         |              |
    #      1001            1002
    #         \            /
    #          \          /
    #           \        /
    #            \      /
    #             \    /
    #              1003
    #             /    \
    #            /      \
    #          1004     1005
    
    up_node_dict = {
        # Founders (no parents)
        -1: {},
        -2: {},
        -3: {},
        -4: {},
        
        # Generation 1 (genotyped grandparents)
        1001: {-1: 1, -2: 1},  # 1001 has parents -1 and -2
        1002: {-3: 1, -4: 1},  # 1002 has parents -3 and -4
        
        # Generation 2 (genotyped parent)
        1003: {1001: 1, 1002: 1},  # 1003 has parents 1001 and 1002
        
        # Generation 3 (genotyped children)
        1004: {1003: 1},  # 1004 has parent 1003 (missing second parent)
        1005: {1003: 1}   # 1005 has parent 1003 (missing second parent)
    }
    
    return up_node_dict

# Create our example pedigree
example_pedigree = create_example_pedigree()

# Display the structure
print("Example up-node dictionary pedigree:")
for individual, parents in example_pedigree.items():
    if parents:
        parent_list = [f"{parent} (d={degree})" for parent, degree in parents.items()]
        print(f"Individual {individual} has parents: {', '.join(parent_list)}")
    else:
        genotype_status = "latent/inferred" if individual < 0 else "genotyped"
        print(f"Individual {individual} is a founder ({genotype_status})")
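
The structural rules listed above can be checked mechanically. The following is a minimal sketch, assuming the convention used in this lab that every individual (founders included) has its own key and that no one has more than two parents. `validate_up_node_dict` is a hypothetical helper written for illustration, not a Bonsai v3 function.

```python
def validate_up_node_dict(up_node_dict):
    """Return a list of human-readable problems found in an up-node dictionary.

    Hypothetical helper for illustration; not part of Bonsai v3.
    """
    problems = []
    for individual, parents in up_node_dict.items():
        if len(parents) > 2:
            problems.append(f"{individual} has more than two parents")
        for parent, degree in parents.items():
            if parent == individual:
                problems.append(f"{individual} is listed as its own parent")
            if parent not in up_node_dict:
                # Assumed convention: every parent also appears as a key
                problems.append(f"parent {parent} of {individual} has no entry")
            if degree < 1:
                problems.append(f"edge {individual} -> {parent} has degree {degree}")
    return problems


good = {-1: {}, -2: {}, 1001: {-1: 1, -2: 1}, 1002: {1001: 1}}
bad = {1001: {-1: 1, -2: 1, -3: 1}, -1: {}, -2: {}}

print(validate_up_node_dict(good))
print(validate_up_node_dict(bad))
```

Running a check like this before passing a hand-built dictionary to other functions can catch typos in IDs early.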

### Key Properties of the Up-Node Dictionary

The up-node dictionary has several important properties that make it ideal for pedigree representation and manipulation:

1. **Compact Representation**: Only stores actual relationships, making it memory-efficient for sparse pedigrees.
2. **Intuitive Direction**: Matches the natural direction of family relationships (individuals pointing to their parents).
3. **ID Scheme**: Positive IDs represent observed/genotyped individuals, while negative IDs represent inferred/latent ancestors.
4. **Flexible Structure**: Can represent any pedigree structure, including those with missing data.
5. **Bidirectional Navigation**: Can be converted to a down-node dictionary (children instead of parents) for descendant traversal.

Let's examine these properties with our example pedigree:

In [None]:
def visualize_pedigree(up_node_dict, title="Pedigree Structure"):
    """Create a visualization of a pedigree from an up-node dictionary."""
    G = nx.DiGraph()
    
    # Add nodes and edges
    for child, parents in up_node_dict.items():
        # Add node with different color based on ID sign
        if child > 0:  # Genotyped/observed
            G.add_node(child, color='lightblue')
        else:  # Latent/inferred
            G.add_node(child, color='lightgray')
            
        # Add parent-child relationships
        for parent in parents:
            if parent not in G.nodes():
                # Add parent node if not already in graph
                if parent > 0:  # Genotyped/observed
                    G.add_node(parent, color='lightblue')
                else:  # Latent/inferred
                    G.add_node(parent, color='lightgray')
                    
            # Add edge from parent to child
            G.add_edge(parent, child)
            
    # Prepare layout and colors
    try:
        pos = nx.nx_agraph.graphviz_layout(G, prog='dot')
    except Exception:
        # Graphviz support is optional; fall back to a reproducible spring layout
        pos = nx.spring_layout(G, seed=42)
        
    node_colors = [G.nodes[node]['color'] for node in G.nodes()]
    
    # Draw the graph
    plt.figure(figsize=(12, 8))
    nx.draw(G, pos, with_labels=True, node_color=node_colors, 
            node_size=700, font_size=10, arrows=True, arrowsize=15)
    
    # Add legend
    import matplotlib.patches as mpatches
    genotyped_patch = mpatches.Patch(color='lightblue', label='Genotyped')
    inferred_patch = mpatches.Patch(color='lightgray', label='Inferred')
    plt.legend(handles=[genotyped_patch, inferred_patch])
    
    plt.title(title)
    plt.axis('off')
    plt.tight_layout()
    plt.show()

# Visualize our example pedigree
visualize_pedigree(example_pedigree, "Example Pedigree Structure")

### Converting to a Down-Node Dictionary

One of the powerful features of this representation is the ability to convert between up-node and down-node dictionaries. A down-node dictionary reverses the relationships, mapping individuals to their children instead of their parents.

Let's examine the `reverse_node_dict` function from Bonsai v3 which performs this conversion:

In [None]:
# Display the reverse_node_dict function source code
if not is_jupyterlite():
    view_function_source(pedigrees, 'reverse_node_dict')
else:
    print("Simplified implementation of reverse_node_dict:")
    print(inspect.getsource(impl_reverse_node_dict))

In [None]:
# Convert our up-node dict to a down-node dict
if not is_jupyterlite():
    down_node_dict = pedigrees.reverse_node_dict(example_pedigree)
else:
    down_node_dict = impl_reverse_node_dict(example_pedigree)

# Display the down-node dict structure
print("Down-node dictionary (individuals mapped to their children):")
for parent, children in down_node_dict.items():
    if children:
        child_list = [f"{child} (d={degree})" for child, degree in children.items()]
        print(f"Individual {parent} has children: {', '.join(child_list)}")
    else:
        print(f"Individual {parent} has no children")
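
One detail of the conversion is worth seeing in isolation: reversing edge by edge only creates keys for individuals that have at least one outgoing edge in the result. The sketch below uses `reverse_edges`, a local reimplementation that mirrors this lab's simplified helper (an assumption, not the Bonsai function itself), to show that founders disappear as keys after a round trip and how to restore them.

```python
def reverse_edges(node_dict):
    """Flip every edge of a node dictionary (same logic as the lab's simplified helper)."""
    reversed_dict = {}
    for node, neighbors in node_dict.items():
        for neighbor, degree in neighbors.items():
            reversed_dict.setdefault(neighbor, {})[node] = degree
    return reversed_dict


up = {-1: {}, -2: {}, 1001: {-1: 1, -2: 1}, 1002: {1001: 1}}
down = reverse_edges(up)
round_trip = reverse_edges(down)

# Founders have no parents, so no edge points "up" from them and they
# are absent from the round-tripped up-node dictionary.
missing = set(up) - set(round_trip)
print(down)
print(missing)

# Re-adding the empty entries recovers the original dictionary.
restored = {node: round_trip.get(node, {}) for node in up}
print(restored == up)
```

For the same reason, individuals without children (here 1002) have no key in `down`, so lookups on a down-node dictionary are safest with `.get(node, {})`.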

## 2. Advanced Pedigree Navigation and Querying

Bonsai v3 provides numerous functions for navigating and querying pedigree structures. Let's explore some of the most important ones.

### Finding Related Individuals

The `get_rel_set` function returns a given individual together with everyone reachable from them by following the dictionary's edges. With an up-node dictionary this is the individual plus all of their ancestors; with a down-node dictionary it is the individual plus all of their descendants.

In [None]:
# Display the get_rel_set function
if not is_jupyterlite():
    view_function_source(pedigrees, 'get_rel_set')
else:
    print("Simplified implementation of get_rel_set:")
    print(inspect.getsource(impl_get_rel_set))

In [None]:
# Find all ancestors of individual 1003 (the parent generation)
if not is_jupyterlite():
    ancestors_1003 = pedigrees.get_rel_set(example_pedigree, 1003)
else:
    ancestors_1003 = impl_get_rel_set(example_pedigree, 1003)
    
print(f"Ancestors of individual 1003: {ancestors_1003}")

# Find all descendants of individual 1003 using the down-node dictionary
if not is_jupyterlite():
    descendants_1003 = pedigrees.get_rel_set(down_node_dict, 1003)
else:
    descendants_1003 = impl_get_rel_set(down_node_dict, 1003)
    
print(f"Descendants of individual 1003: {descendants_1003}")
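
The ancestor set says who is related, but not how far away each ancestor is. As a sketch of one way to recover that, the hypothetical helper `ancestor_depths` below walks the up-node dictionary breadth-first and records the minimum number of generations to each ancestor. It assumes every edge represents one generation (degree 1) and is not taken from the Bonsai codebase.

```python
from collections import deque


def ancestor_depths(up_node_dict, start):
    """Map `start` and each of its ancestors to the minimum generation distance.

    Uses breadth-first search, so it does not rely on recursion.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in up_node_dict.get(node, {}):
            if parent not in depths:  # first visit is the shortest path
                depths[parent] = depths[node] + 1
                queue.append(parent)
    return depths


up = {
    -1: {}, -2: {},
    1001: {-1: 1, -2: 1},
    1003: {1001: 1},
    1004: {1003: 1},
}
print(ancestor_depths(up, 1004))
```

Passing a down-node dictionary instead would give descendant depths in the same way.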

### Finding Founders

Founders are individuals in the pedigree that have no parents. The `get_founder_set` function identifies all such individuals.

In [None]:
# Display the get_founder_set function
if not is_jupyterlite():
    view_function_source(pedigrees, 'get_founder_set')
else:
    print("Simplified implementation of get_founder_set:")
    print(inspect.getsource(impl_get_founder_set))

In [None]:
# Find all founders in our pedigree
if not is_jupyterlite():
    founders = pedigrees.get_founder_set(example_pedigree)
else:
    founders = impl_get_founder_set(example_pedigree)

print(f"Founders in the pedigree: {founders}")

# Categorize founders by type (genotyped vs. inferred)
genotyped_founders = {f for f in founders if f > 0}
inferred_founders = {f for f in founders if f < 0}

print(f"Genotyped founders: {genotyped_founders}")
print(f"Inferred (latent) founders: {inferred_founders}")
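
The definition above relies on founders having an explicit empty entry. If a dictionary was assembled by hand, a parent may be referenced without having its own key. The sketch below, using the hypothetical helper `find_founders`, treats such individuals as founders too. This is an assumption about incomplete input, not documented Bonsai behavior.

```python
def find_founders(up_node_dict):
    """Founders are individuals with an empty parent dict, plus any parent
    that never appears as a key (assumed to have no recorded parents)."""
    explicit = {node for node, parents in up_node_dict.items() if not parents}
    referenced = {p for parents in up_node_dict.values() for p in parents}
    implicit = referenced - set(up_node_dict)
    return explicit | implicit


partial = {1001: {}, 1002: {1001: 1, -5: 1}}  # -5 has no entry of its own
print(find_founders(partial))
```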

### Determining Relationship Types

One of the most important features of Bonsai v3 is its ability to determine the precise relationship between any two individuals in the pedigree. This is accomplished with the `get_simple_rel_tuple` function, which returns a tuple of the form `(up, down, num_ancs)` where:

- `up`: Number of generations up from the first individual to their common ancestor with the second individual
- `down`: Number of generations down from the common ancestor to the second individual
- `num_ancs`: Number of common ancestors (1 for a half-relationship, 2 for a full relationship)

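This tuple representation is compact: it encodes any relationship that runs through a single common ancestor or a single ancestral couple. Relationships that run through several independent lineages (for example, double first cousins) cannot be captured by one tuple.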

In [None]:
# Display the get_simple_rel_tuple function
if not is_jupyterlite():
    view_function_source(pedigrees, 'get_simple_rel_tuple')
else:
    print("Simplified implementation of get_simple_rel_tuple:")
    print(inspect.getsource(impl_get_simple_rel_tuple))

In [None]:
def describe_relationship(rel_tuple):
    """Convert a relationship tuple to a human-readable description"""
    if rel_tuple is None:
        return "Unrelated"
    
    up, down, num_ancs = rel_tuple
    
    if up == 0 and down == 0 and num_ancs == 2:
        return "Self"
    elif up == 0 and down == 1 and num_ancs == 1:
        return "Parent"
    elif up == 1 and down == 0 and num_ancs == 1:
        return "Child"
    elif up == 1 and down == 1 and num_ancs == 2:
        return "Full Sibling"
    elif up == 1 and down == 1 and num_ancs == 1:
        return "Half Sibling"
    elif up == 0 and down == 2 and num_ancs == 1:
        return "Grandparent"
    elif up == 2 and down == 0 and num_ancs == 1:
        return "Grandchild"
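    elif up == 1 and down == 2:
        # Two shared ancestors (a couple) means a full aunt/uncle
        return "Aunt/Uncle" if num_ancs == 2 else "Half Aunt/Uncle"
    elif up == 2 and down == 1:
        return "Niece/Nephew" if num_ancs == 2 else "Half Niece/Nephew"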
    elif up == 2 and down == 2 and num_ancs == 2:
        return "Full First Cousin"
    elif up == 2 and down == 2 and num_ancs == 1:
        return "Half First Cousin"
    else:
        # For more distant or complex relationships
        return f"Complex Relationship (up={up}, down={down}, num_ancs={num_ancs})"

# Test various relationships in our example pedigree
relationships_to_test = [
    (1003, 1003),  # Self
    (1003, 1004),  # Parent-child
    (1004, 1003),  # Child-parent
    (1004, 1005),  # Half siblings (only one shared parent is recorded)
    (1001, 1003),  # Parent-child
    (1001, 1004),  # Grandparent-grandchild
    (1001, 1002),  # Partners with no recorded common ancestor
    (-1, 1003),    # Grandparent-grandchild
    (-1, -3)       # Unrelated inferred ancestors
]

print("Relationships in the example pedigree:")
for id1, id2 in relationships_to_test:
    if not is_jupyterlite():
        rel_tuple = pedigrees.get_simple_rel_tuple(example_pedigree, id1, id2)
    else:
        rel_tuple = impl_get_simple_rel_tuple(example_pedigree, id1, id2)
        
    relationship = describe_relationship(rel_tuple)
    print(f"{id1} → {id2}: {relationship} {rel_tuple}")
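
To make the tuple definition concrete, here is a minimal sketch that computes `(up, down, num_ancs)` from generation depths. The helpers `ancestor_depths` and `sketch_rel_tuple` are hypothetical and assume a pedigree without inbreeding, where each pair is connected through at most one ancestor or one couple. Bonsai's own `get_simple_rel_tuple` may handle more cases.

```python
from collections import deque


def ancestor_depths(up_node_dict, start):
    """Minimum generation distance from `start` to itself and each ancestor."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in up_node_dict.get(node, {}):
            if parent not in depths:
                depths[parent] = depths[node] + 1
                queue.append(parent)
    return depths


def sketch_rel_tuple(up_node_dict, i, j):
    """Hypothetical helper: (up, down, num_ancs), assuming no inbreeding."""
    if i == j:
        return (0, 0, 2)
    depth_i = ancestor_depths(up_node_dict, i)
    depth_j = ancestor_depths(up_node_dict, j)
    if j in depth_i:  # j is a direct ancestor of i
        return (depth_i[j], 0, 1)
    if i in depth_j:  # i is a direct ancestor of j
        return (0, depth_j[i], 1)
    common = set(depth_i) & set(depth_j)
    if not common:
        return None
    # The closest common ancestors minimise the total path length up + down.
    best = min(depth_i[a] + depth_j[a] for a in common)
    closest = [a for a in common if depth_i[a] + depth_j[a] == best]
    anc = closest[0]
    return (depth_i[anc], depth_j[anc], min(len(closest), 2))


up = {
    -1: {}, -2: {},
    1001: {-1: 1, -2: 1},  # child of -1 and -2
    1002: {-1: 1, -2: 1},  # full sibling of 1001
    1003: {1001: 1},       # child of 1001
    1004: {1002: 1},       # child of 1002
}
print(sketch_rel_tuple(up, 1001, 1002))  # full siblings
print(sketch_rel_tuple(up, 1001, 1004))  # aunt/uncle to niece/nephew
print(sketch_rel_tuple(up, 1003, 1004))  # full first cousins
print(sketch_rel_tuple(up, 1003, -1))    # grandchild to grandparent
```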

## 3. Advanced Pedigree Operations

Now let's explore some of the more advanced operations that can be performed with up-node dictionaries in Bonsai v3.

### Re-rooting a Pedigree

One of the most powerful operations is "re-rooting" a pedigree, which reorients the entire pedigree structure around a specific individual. This is accomplished with the `re_root_up_node_dict` function.

Re-rooting is useful for several reasons:
1. It provides a perspective centered on a specific individual
2. It simplifies the calculation of relationships between that individual and others
3. It helps identify key connection points for merging pedigrees

In [None]:
# Display the re_root_up_node_dict function
if not is_jupyterlite():
    view_function_source(pedigrees, 're_root_up_node_dict')
else:
    print("The re_root_up_node_dict function is a complex operation that creates a view of the pedigree centered on a specific node.")
    print("In JupyterLite, we'll use a simplified stand-in that extracts the node's ancestors and descendants without reorienting any edges.")

In [None]:
def simplified_re_root(up_dct, node):
    """A very simplified stand-in for re_root_up_node_dict for JupyterLite."""
    # Unlike the real implementation, this does not reorient any edges.
    # It only extracts the subgraph containing the node, its ancestors,
    # and its descendants.
    if node not in up_dct:
        return {}
    
    # Start with the node's ancestors
    ancestors = impl_get_rel_set(up_dct, node)
    subgraph = {}
    for anc in ancestors:
        if anc in up_dct:
            # Only include parents that are also in the ancestry
            subgraph[anc] = {p: d for p, d in up_dct[anc].items() if p in ancestors}
    
    # Also include descendants (very simplified approach)
    down_dct = impl_reverse_node_dict(up_dct)
    descendants = impl_get_rel_set(down_dct, node)
    for desc in descendants:
        if desc in up_dct:
            # Add connection, but only if parents are in our subset
            parent_dict = {p: d for p, d in up_dct[desc].items() if p in ancestors or p in descendants}
            if parent_dict:  # Only add if there are parents
                subgraph[desc] = parent_dict
    
    return subgraph

# Re-root the pedigree at individual 1004 (a child)
if not is_jupyterlite():
    rerooted_pedigree = pedigrees.re_root_up_node_dict(example_pedigree, 1004)
else:
    rerooted_pedigree = simplified_re_root(example_pedigree, 1004)

# Visualize the re-rooted pedigree
visualize_pedigree(rerooted_pedigree, "Pedigree Re-rooted at Individual 1004")
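
The idea of viewing a pedigree from one individual's perspective can be illustrated without the full re-rooting machinery. The sketch below is not Bonsai's `re_root_up_node_dict`. It uses a hypothetical helper, `distances_from`, that runs a breadth-first search over parent and child links alike and reports how many steps separate the focal individual from everyone connected to them.

```python
from collections import deque


def distances_from(up_node_dict, focal):
    """Hypothetical helper: number of parent/child steps from `focal` to every
    connected individual, ignoring edge direction."""
    # Build an undirected adjacency map from the child -> parent edges.
    neighbors = {}
    for child, parents in up_node_dict.items():
        neighbors.setdefault(child, set())
        for parent in parents:
            neighbors[child].add(parent)
            neighbors.setdefault(parent, set()).add(child)

    distances = {focal: 0}
    queue = deque([focal])
    while queue:
        node = queue.popleft()
        for other in neighbors.get(node, ()):
            if other not in distances:
                distances[other] = distances[node] + 1
                queue.append(other)
    return distances


up = {-1: {}, 1001: {-1: 1}, 1003: {1001: 1}, 1004: {1003: 1}, 1005: {1003: 1}}
print(distances_from(up, 1004))
```

Here 1005 sits two steps from 1004 (up to 1003, then down), which is the kind of path a re-rooted view makes explicit.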

### Trimming Pedigrees to Genotyped Individuals

The `trim_to_proximal` function truncates the pedigree at the most proximal genotyped nodes, creating a simplified version that can be more useful for visualization or further analysis.

In [None]:
# Display the trim_to_proximal function
if not is_jupyterlite():
    view_function_source(pedigrees, 'trim_to_proximal')
else:
    print("The trim_to_proximal function truncates the pedigree at the most proximal genotyped nodes.")
    print("In JupyterLite, we'll use a simplified approach.")

In [None]:
def simplified_trim_to_proximal(down_dct, root):
    """Simplified implementation of trim_to_proximal for JupyterLite."""
    if root not in down_dct:
        return {}

    result = {root: {}}

    for child, degree in down_dct[root].items():
        if child > 0:
            # Genotyped child: keep the edge and stop descending
            result[root][child] = degree
        else:
            # Ungenotyped child: keep it only if it leads to a genotyped descendant
            subtree = simplified_trim_to_proximal(down_dct, child)
            if subtree.get(child):
                result[root][child] = degree
                result.update(subtree)

    return result

# First create a down-node dictionary from our example pedigree
if not is_jupyterlite():
    down_dict = pedigrees.reverse_node_dict(example_pedigree)
    # Trim the pedigree starting from a latent ancestor
    trimmed_down_dict = pedigrees.trim_to_proximal(down_dict, -1)
else:
    down_dict = impl_reverse_node_dict(example_pedigree)
    # Use simplified implementation
    trimmed_down_dict = simplified_trim_to_proximal(down_dict, -1)

# Convert back to up-node dictionary for visualization
if not is_jupyterlite():
    trimmed_up_dict = pedigrees.reverse_node_dict(trimmed_down_dict)
else:
    trimmed_up_dict = impl_reverse_node_dict(trimmed_down_dict)

# Visualize the trimmed pedigree
visualize_pedigree(trimmed_up_dict, "Pedigree Trimmed to Proximal Genotyped Individuals")
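
A related question is simply which genotyped individuals are "most proximal" to a given node. As a minimal sketch, the hypothetical helper `proximal_genotyped` below walks a down-node dictionary and stops descending as soon as it reaches a positive ID. It assumes the sign convention used throughout this lab and is not part of Bonsai.

```python
def proximal_genotyped(down_node_dict, root):
    """Hypothetical helper: nearest genotyped (positive-ID) descendants of `root`."""
    found = set()
    stack = list(down_node_dict.get(root, {}))
    while stack:
        node = stack.pop()
        if node > 0:
            found.add(node)  # stop descending at a genotyped node
        else:
            stack.extend(down_node_dict.get(node, {}))
    return found


down = {
    -1: {-2: 1, 1001: 1},
    -2: {1002: 1, -3: 1},
    -3: {1003: 1},
    1001: {1004: 1},  # 1004 lies below a genotyped node, so it is not proximal
}
print(proximal_genotyped(down, -1))
```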

## 4. Finding Connection Points Between Pedigrees

One of the most important applications of the up-node dictionary in Bonsai v3 is identifying potential connection points between separate pedigrees. This is crucial for merging multiple small pedigrees into a larger, more complete family structure based on shared IBD segments.

Let's explore the key functions involved in this process.

In [None]:
# Display the get_possible_connection_point_set function
if not is_jupyterlite():
    view_function_source(pedigrees, 'get_possible_connection_point_set')
else:
    print("The get_possible_connection_point_set function finds all possible points through which a pedigree")
    print("can be connected to another pedigree. A connection point is a tuple (id, partner_id, direction).")

In [None]:
# Let's create another separate pedigree to demonstrate connection points
def create_second_pedigree():
    """Create a second pedigree for demonstrating connection finding."""
    # Structure:
    #    -10      -11
    #     |        |
    #     +--------+
    #         |
    #       2001
    #        /\
    #       /  \
    #     2002  2003
    
    return {
        -10: {},  # Inferred founder
        -11: {},  # Inferred founder
        2001: {-10: 1, -11: 1},  # Genotyped parent
        2002: {2001: 1},  # Genotyped child 1 (missing one parent)
        2003: {2001: 1}   # Genotyped child 2 (missing one parent)
    }

# Create the second pedigree
second_pedigree = create_second_pedigree()

# Visualize it
visualize_pedigree(second_pedigree, "Second Pedigree")

# Find possible connection points in each pedigree
def simplified_get_connection_points(up_dct):
    """Simplified implementation of get_possible_connection_point_set for JupyterLite."""
    points = set()
    
    for node in up_dct:
        # For every node that doesn't have two parents, it can be connected upward
        if node in up_dct and len(up_dct[node]) < 2:
            points.add((node, None, 1))  # Direction 1 = connect upward
        
        # Every node can be connected downward
        points.add((node, None, 0))  # Direction 0 = connect downward
        
        # Can also connect "on" the node (replacing it with another node)
        points.add((node, None, None))  # None = connect on the node
        
    return points

if not is_jupyterlite():
    points1 = pedigrees.get_possible_connection_point_set(example_pedigree)
    points2 = pedigrees.get_possible_connection_point_set(second_pedigree)
else:
    points1 = simplified_get_connection_points(example_pedigree)
    points2 = simplified_get_connection_points(second_pedigree)

# Display some of the connection points
print(f"Found {len(points1)} possible connection points in the first pedigree")
print(f"Found {len(points2)} possible connection points in the second pedigree")

# Display a few examples from each
print("\nExample connection points from first pedigree:")
for node_id, partner, direction in list(points1)[:5]:
    direction_str = "upward" if direction == 1 else "downward" if direction == 0 else "on"
    print(f"  ID: {node_id}, Partner: {partner}, Direction: {direction_str}")

print("\nExample connection points from second pedigree:")
for node_id, partner, direction in list(points2)[:5]:
    direction_str = "upward" if direction == 1 else "downward" if direction == 0 else "on"
    print(f"  ID: {node_id}, Partner: {partner}, Direction: {direction_str}")

### Combining Pedigrees Through Connection Points

Once potential connection points are identified, Bonsai uses the `connect_pedigrees_through_points` function from the `connections` module to actually join the pedigrees. Let's explore how this works:

In [None]:
# Display the connect_pedigrees_through_points function
if not is_jupyterlite():
    view_function_source(connections, 'connect_pedigrees_through_points')
else:
    print("The connect_pedigrees_through_points function joins two pedigrees through specific connection points.")
    print("It establishes relationships between individuals in different pedigrees based on the specified connection.")

In [None]:
def simplified_connect_pedigrees(id1, id2, pedigree1, pedigree2, relationship):
    """Very simplified implementation of pedigree connection for JupyterLite."""
    import copy
    
    # Create copies to avoid modifying originals
    pedigree1 = copy.deepcopy(pedigree1)
    pedigree2 = copy.deepcopy(pedigree2)
    
    # Shift negative IDs in pedigree2 to avoid collisions
    min_id1 = min([id for id in pedigree1 if id < 0], default=0) 
    offset = min_id1 - 1
    
    # Create mapping for pedigree2 IDs
    id_map = {}
    for node in pedigree2:
        if node < 0:  # Only shift negative IDs
            id_map[node] = node + offset
    
    # Apply the mapping to pedigree2
    shifted_pedigree2 = {}
    for node, parents in pedigree2.items():
        new_node = id_map.get(node, node)  # Use original ID if not in map
        shifted_pedigree2[new_node] = {}
        for parent, degree in parents.items():
            new_parent = id_map.get(parent, parent)
            shifted_pedigree2[new_node][new_parent] = degree
    
    # Update id2 if needed
    shifted_id2 = id_map.get(id2, id2)
    
    # Create the combined pedigree
    combined = {**pedigree1, **shifted_pedigree2}
    
    # Now establish the relationship between id1 and shifted_id2
    up, down, num_ancs = relationship
    
    # For this simplified version, we'll just handle a few basic cases:
    if up == 0 and down == 1:  # id1 is parent of id2
        if shifted_id2 not in combined:
            combined[shifted_id2] = {}
        combined[shifted_id2][id1] = 1
    elif up == 1 and down == 0:  # id1 is child of id2
        if id1 not in combined:
            combined[id1] = {}
        combined[id1][shifted_id2] = 1
    elif up == 1 and down == 1 and num_ancs >= 1:  # Siblings
        # Create a common parent
        common_parent = min(min(combined.keys()), -100) - 1
        combined[common_parent] = {}
        
        # Connect both to the common parent
        if id1 not in combined:
            combined[id1] = {}
        combined[id1][common_parent] = 1
        
        if shifted_id2 not in combined:
            combined[shifted_id2] = {}
        combined[shifted_id2][common_parent] = 1
        
        # For full siblings, add another common parent
        if num_ancs == 2:
            common_parent2 = common_parent - 1
            combined[common_parent2] = {}
            combined[id1][common_parent2] = 1
            combined[shifted_id2][common_parent2] = 1
    
    return combined

# Example: Connect the two pedigrees by making individual 1003 from the first pedigree
# a parent of individual 2003 from the second pedigree
if not is_jupyterlite():
    # With the Bonsai codebase available, call
    # connections.connect_pedigrees_through_points directly
    combined_pedigree = connections.connect_pedigrees_through_points(
        id1=1003,       # from first pedigree
        id2=2003,       # from second pedigree
        pid1=None,      # no partner
        pid2=None,      # no partner
        up_dct1=example_pedigree,
        up_dct2=second_pedigree,
        deg1=0,         # 0 steps up from id1
        deg2=1,         # 1 step down to id2
        num_ancs=1      # one common ancestor
    )[0]  # Take the first result
else:
    # Use our simplified implementation
    combined_pedigree = simplified_connect_pedigrees(
        id1=1003,       # from first pedigree
        id2=2003,       # from second pedigree
        pedigree1=example_pedigree,
        pedigree2=second_pedigree,
        relationship=(0, 1, 1)  # (up, down, num_ancs)
    )

# Visualize the combined pedigree
visualize_pedigree(combined_pedigree, "Combined Pedigree (1003 as parent of 2003)")
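
Before two pedigrees can be merged, their latent IDs must not collide, because both typically number inferred ancestors from -1 downward. The sketch below isolates that step. `relabel_latent_ids` is a hypothetical helper that assumes positive (genotyped) IDs are already globally unique. It shows one possible shifting scheme, not necessarily the one Bonsai uses.

```python
def relabel_latent_ids(up_dct1, up_dct2):
    """Hypothetical helper: return a copy of up_dct2 whose negative (latent) IDs
    are shifted below every negative ID used in up_dct1."""
    ids1 = set(up_dct1) | {p for parents in up_dct1.values() for p in parents}
    lowest = min((i for i in ids1 if i < 0), default=0)

    def shift(node):
        # Positive (genotyped) IDs are assumed globally unique and kept as-is.
        return node + lowest if node < 0 else node

    return {
        shift(node): {shift(parent): degree for parent, degree in parents.items()}
        for node, parents in up_dct2.items()
    }


ped1 = {-1: {}, -2: {}, 1001: {-1: 1, -2: 1}}
ped2 = {-1: {}, 2001: {-1: 1}}
print(relabel_latent_ids(ped1, ped2))
```

After relabeling, the two dictionaries can be combined with `{**ped1, **relabeled}` without one pedigree's latent founders overwriting the other's.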

## 5. Extending Pedigrees with Inferred Ancestors

One of the powerful features of Bonsai v3 is its ability to infer missing ancestors and extend the pedigree with these inferred individuals. This is particularly useful when dealing with incomplete data.

In [None]:
# Display the add_parent function
if not is_jupyterlite():
    view_function_source(pedigrees, 'add_parent')
else:
    print("The add_parent function adds an ungenotyped (latent) parent to a node in the pedigree.")

In [None]:
def simplified_add_parent(node, up_dct):
    """Simplified implementation of add_parent for JupyterLite."""
    import copy
    up_dct = copy.deepcopy(up_dct)
    
    # Check if node exists
    if node not in up_dct:
        return up_dct, None
    
    # Check if node already has two parents
    if len(up_dct[node]) >= 2:
        return up_dct, None
    
    # Choose a fresh negative ID below every existing latent ID
    min_id = min([i for i in up_dct if i < 0], default=0)
    new_parent_id = min(min_id, -1) - 1
    
    # Add the new parent
    up_dct[node][new_parent_id] = 1
    up_dct[new_parent_id] = {}
    
    return up_dct, new_parent_id

# Create a simple pedigree with missing parents
incomplete_pedigree = {
    1001: {},       # No parents
    1002: {1001: 1} # Only one parent
}

# Add a parent to individual 1001
if not is_jupyterlite():
    updated_pedigree, new_parent1 = pedigrees.add_parent(1001, incomplete_pedigree)
else:
    updated_pedigree, new_parent1 = simplified_add_parent(1001, incomplete_pedigree)

print(f"Added parent {new_parent1} to individual 1001")

# Add another parent to individual 1001
if not is_jupyterlite():
    updated_pedigree, new_parent2 = pedigrees.add_parent(1001, updated_pedigree, min_id=new_parent1)
else:
    updated_pedigree, new_parent2 = simplified_add_parent(1001, updated_pedigree)

print(f"Added second parent {new_parent2} to individual 1001")

# Add a parent to individual 1002 (who already has one parent)
if not is_jupyterlite():
    updated_pedigree, new_parent3 = pedigrees.add_parent(1002, updated_pedigree, min_id=new_parent2)
else:
    updated_pedigree, new_parent3 = simplified_add_parent(1002, updated_pedigree)

print(f"Added second parent {new_parent3} to individual 1002")

# Visualize the extended pedigree
visualize_pedigree(updated_pedigree, "Extended Pedigree with Inferred Ancestors")

## 6. Optimizing Pedigree Operations for Large-Scale Applications

In practice, Bonsai v3 must handle large pedigrees with hundreds or thousands of individuals. Several optimizations make this possible:

### 1. Efficient Genotyped ID Access

Quickly identifying which individuals in a pedigree are genotyped is crucial for many operations:

In [None]:
# Display the get_gt_id_set function
if not is_jupyterlite():
    view_function_source(pedigrees, 'get_gt_id_set')
else:
    print("The get_gt_id_set function returns the set of all genotyped (positive ID) individuals in the pedigree.")

In [None]:
def simplified_get_gt_id_set(up_dct):
    """Simplified implementation of get_gt_id_set for JupyterLite."""
    # Get all IDs in the pedigree
    all_ids = set(up_dct.keys())
    for parents in up_dct.values():
        all_ids.update(parents.keys())
    
    # Filter to only positive IDs (genotyped individuals)
    return {id for id in all_ids if id > 0}

# Get the genotyped IDs from our example pedigree
if not is_jupyterlite():
    genotyped_ids = pedigrees.get_gt_id_set(example_pedigree)
else:
    genotyped_ids = simplified_get_gt_id_set(example_pedigree)

print(f"Genotyped individuals in the example pedigree: {genotyped_ids}")

# Calculate the percentage of individuals that are genotyped
all_ids = set(example_pedigree.keys()).union(*[set(parents.keys()) for parents in example_pedigree.values()])
genotyped_percentage = (len(genotyped_ids) / len(all_ids)) * 100

print(f"Total individuals: {len(all_ids)}")
print(f"Genotyped individuals: {len(genotyped_ids)} ({genotyped_percentage:.1f}%)")
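
Because genotyped and latent individuals are distinguished by the sign of their ID, both sets can be collected in a single pass over the dictionary. The sketch below uses a hypothetical helper, `split_ids`, which is not part of Bonsai; it assumes positive IDs are genotyped and negative IDs are latent, as in the examples in this lab.

```python
def split_ids(up_dct):
    """Hypothetical helper: return (genotyped_ids, latent_ids) for an up-node dictionary."""
    genotyped, latent = set(), set()
    for node, parents in up_dct.items():
        # An individual may appear only as a parent, so check both levels
        for i in (node, *parents):
            (genotyped if i > 0 else latent).add(i)
    return genotyped, latent

# Individual 4 appears only as a parent of 3, never as a key
demo = {1: {-1: 1, -2: 1}, 2: {-1: 1, -2: 1}, -1: {}, -2: {}, 3: {2: 1, 4: 1}}
gt, lat = split_ids(demo)
print(sorted(gt), sorted(lat))
```

Computing both sets once and reusing them avoids rescanning the pedigree every time an operation needs to know who is genotyped.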

### 2. Working with Very Large Pedigrees

For large pedigrees, operations like finding all relationship paths can be computationally expensive. To experiment with this, we first build a synthetic multi-generation pedigree and inspect part of it:

In [None]:
# Create a larger pedigree with multiple generations
def create_large_pedigree(num_generations=4, family_size=2):
    """Create a larger example pedigree with multiple generations."""
    import random
    
    up_dict = {}
    next_id = 1000  # Start genotyped IDs at 1000
    next_latent_id = -1  # Start latent IDs at -1
    
    # First generation (founders)
    founders = []
    for _ in range(family_size * 2):  # Need pairs for first generation
        founder_id = next_id
        up_dict[founder_id] = {}  # No parents
        founders.append(founder_id)
        next_id += 1
    
    current_generation = []
    # Create the first children generation from founder pairs
    for i in range(0, len(founders), 2):
        if i + 1 < len(founders):  # Make sure we have a pair
            for _ in range(family_size):  # Each pair has family_size children
                child_id = next_id
                up_dict[child_id] = {founders[i]: 1, founders[i+1]: 1}
                current_generation.append(child_id)
                next_id += 1
    
    # Create remaining generations
    for _ in range(2, num_generations):  # Skip first two generations already created
        next_generation = []
        # Randomly pair individuals from current generation
        random.shuffle(current_generation)
        
        for i in range(0, len(current_generation), 2):
            if i + 1 < len(current_generation):  # Make sure we have a pair
                for _ in range(family_size):  # Each pair has family_size children
                    child_id = next_id
                    up_dict[child_id] = {current_generation[i]: 1, current_generation[i+1]: 1}
                    next_generation.append(child_id)
                    next_id += 1
            else:  # Odd individual out - create latent partner
                latent_id = next_latent_id
                next_latent_id -= 1
                up_dict[latent_id] = {}  # No parents for latent individual
                
                for _ in range(family_size):  # Create family_size children
                    child_id = next_id
                    up_dict[child_id] = {current_generation[i]: 1, latent_id: 1}
                    next_generation.append(child_id)
                    next_id += 1
        
        current_generation = next_generation
    
    return up_dict

# Create a larger pedigree
large_pedigree = create_large_pedigree(num_generations=4, family_size=2)

# Get some stats about the large pedigree
all_ids = set(large_pedigree.keys()).union(*[set(parents.keys()) for parents in large_pedigree.values()])
genotyped_ids = {id for id in all_ids if id > 0}
latent_ids = {id for id in all_ids if id < 0}

print("Large pedigree statistics:")
print(f"Total individuals: {len(all_ids)}")
print(f"Genotyped individuals: {len(genotyped_ids)}")
print(f"Latent individuals: {len(latent_ids)}")

# Visualize a connected subset of the pedigree. With the settings above the pedigree
# is small, but capping the subset keeps the plot readable for larger pedigrees.
subset_ids = {next(iter(genotyped_ids))}  # Start with one genotyped ID
for _ in range(5):  # Expand outward a few steps
    for node in list(subset_ids):
        # Add parents if they exist
        if node in large_pedigree:
            subset_ids.update(large_pedigree[node].keys())
        # Add children if they exist
        for child, parents in large_pedigree.items():
            if node in parents:
                subset_ids.add(child)
        if len(subset_ids) >= 20:  # Stop expanding once we have about 20 individuals
            break
    if len(subset_ids) >= 20:
        break

# Create a subgraph with just these IDs
subset_pedigree = {}
for node in subset_ids:
    if node in large_pedigree:
        subset_pedigree[node] = {p: d for p, d in large_pedigree[node].items() if p in subset_ids}

# Visualize the subset
visualize_pedigree(subset_pedigree, "Subset of Large Pedigree")
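
One reason path-based queries get expensive is that a shared ancestor can be reached along many different paths. A common way to keep a traversal linear in the size of the pedigree is to record each ancestor the first time it is reached. The following is a generic breadth-first sketch of that idea, not Bonsai's implementation; `ancestor_depths` is a hypothetical name.

```python
from collections import deque

def ancestor_depths(up_dct, node):
    """Hypothetical helper: map each ancestor of `node` to its minimum generation distance."""
    depths = {}
    queue = deque([(node, 0)])
    while queue:
        current, depth = queue.popleft()
        for parent in up_dct.get(current, {}):
            if parent not in depths:  # BFS order means the first visit is the shortest
                depths[parent] = depth + 1
                queue.append((parent, depth + 1))
    return depths

demo = {
    1: {}, 2: {},
    3: {1: 1, 2: 1}, 4: {1: 1, 2: 1},
    5: {3: 1, 4: 1},  # both parents share ancestors 1 and 2
}
print(ancestor_depths(demo, 5))
```

In the demo, individuals 1 and 2 are reachable through both 3 and 4, but each is expanded only once.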

### 3. Measuring Computational Efficiency

Let's measure the running time of some key operations on the larger pedigree:

In [None]:
import time

# Choose two individuals from the large pedigree to test relationship finding
genotyped_list = list(genotyped_ids)
individual1 = genotyped_list[0]
individual2 = genotyped_list[-1]

print(f"Testing relationship computation between individuals {individual1} and {individual2}")

# Time the get_simple_rel_tuple function
start_time = time.perf_counter()  # perf_counter is better suited to short intervals than time.time
if not is_jupyterlite():
    relationship = pedigrees.get_simple_rel_tuple(large_pedigree, individual1, individual2)
else:
    relationship = impl_get_simple_rel_tuple(large_pedigree, individual1, individual2)
end_time = time.perf_counter()

print(f"Relationship: {relationship}")
print(f"Time to compute relationship: {(end_time - start_time)*1000:.2f} ms")

# Time finding all ancestors of an individual
start_time = time.perf_counter()
if not is_jupyterlite():
    ancestors = pedigrees.get_rel_set(large_pedigree, individual1)
else:
    ancestors = impl_get_rel_set(large_pedigree, individual1)
end_time = time.perf_counter()

print(f"Number of ancestors found: {len(ancestors)}")
print(f"Time to find all ancestors: {(end_time - start_time)*1000:.2f} ms")
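
A single wall-clock measurement is noisy, especially for calls that finish in well under a millisecond. A minimal sketch of a more stable measurement, using only the standard library, repeats the call and reports the best run. `time_call` is a hypothetical helper and `count_ancestors` is a stand-in workload, not a Bonsai function.

```python
import time

def time_call(func, *args, repeats=5):
    """Hypothetical helper: return (result, best elapsed time in ms) over `repeats` runs."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # high-resolution clock intended for timing
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best * 1000

def count_ancestors(up_dct, node):
    """Stand-in workload: count distinct ancestors with an iterative traversal."""
    seen, stack = set(), [node]
    while stack:
        for parent in up_dct.get(stack.pop(), {}):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return len(seen)

# A simple chain: individual i has the single parent i + 1
chain = {i: {i + 1: 1} for i in range(1, 1000)}
chain[1000] = {}
n, ms = time_call(count_ancestors, chain, 1)
print(f"{n} ancestors, best of 5 runs: {ms:.3f} ms")
```

Taking the minimum over several runs filters out one-off delays from the operating system or the notebook kernel; to see how cost grows, the same helper can be applied to pedigrees of increasing size.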

## Summary and Conclusion

In this lab, we've explored the up-node dictionary data structure in depth, including:

1. The structure and properties of the up-node dictionary representation
2. Core operations for navigating and querying pedigrees
3. Advanced operations like re-rooting and trimming pedigrees
4. Finding connection points and combining separate pedigrees
5. Extending pedigrees with inferred ancestors
6. Optimizations for handling large-scale pedigree operations

The up-node dictionary is the central data structure in Bonsai v3, enabling efficient pedigree representation, manipulation, and analysis. Its design strikes a balance between simplicity, flexibility, and performance, making it suitable for both small, focused analyses and large-scale pedigree reconstruction projects.

Key advantages of this representation include:
- Intuitive mapping of individuals to their parents
- Clear distinction between observed (genotyped) and inferred individuals
- Memory efficiency for sparse pedigrees
- Support for bidirectional navigation (via conversion to down-node dictionary)
- Ability to represent complex, multi-generational family structures
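
The bidirectional navigation listed above relies on inverting the mapping. As a minimal sketch, assuming the up-node dictionary maps each individual to `{parent: 1}` entries as in this lab, a down-node dictionary can be derived as follows. `up_to_down` is a hypothetical name chosen for this example; Bonsai's own conversion routine may differ in name and detail.

```python
def up_to_down(up_dct):
    """Hypothetical helper: invert child -> parents into parent -> children."""
    down_dct = {node: {} for node in up_dct}  # keep individuals that have no children
    for child, parents in up_dct.items():
        for parent, degree in parents.items():
            # setdefault covers parents that never appear as a key in up_dct
            down_dct.setdefault(parent, {})[child] = degree
    return down_dct

up = {1001: {}, 1002: {1001: 1, -1: 1}, -1: {}}
print(up_to_down(up))
```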

In the next lab, we'll build on this foundation to explore how Bonsai uses these data structures to perform full pedigree reconstruction from IBD segment data.