# Lab 0: Defining Genetic Networks for Simulation

## Overview

In this lab, you will learn how to create pedigree definitions for simulating genetic data. These definitions are essential inputs for both Ped-sim (forward-time) and msprime (coalescent) simulations. We will:

1. Set up the computational environment
2. Create pedigree structures using R and Python
3. Convert pedigrees into standard formats for simulation
4. Visualize the pedigree structures
5. Export them for use in the simulation labs

This lab is a prerequisite for Lab 8 (Simulating Data with Ped-sim) and Lab 10 (Simulating Data with msprime).

In [None]:
# Standard library
import logging
import os
import subprocess
import sys
from collections import Counter
from pathlib import Path

# Third-party packages
import IPython
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import seaborn as sns
from dotenv import load_dotenv
from IPython.display import display, HTML
from tqdm import tqdm

In [None]:
# Environment setup for cross-compatibility
from scripts_support.lab_cross_compatibility import setup_environment, is_jupyterlite

# Set up environment-specific paths
DATA_DIR, RESULTS_DIR = setup_environment()

# Later cells refer to the results directory as `results_directory`,
# so alias it here to keep the names consistent across the notebook
results_directory = RESULTS_DIR

In [None]:
def configure_logging(log_filename, log_file_debug_level="INFO", console_debug_level="INFO"):
    """
    Configure logging for both file and console handlers.

    Args:
        log_filename (str): Path to the log file where logs will be written.
        log_file_debug_level (str): Logging level for the file handler.
        console_debug_level (str): Logging level for the console handler.
    """
    # Create a root logger
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)  # Capture all messages at the root level

    # Convert level names to numeric levels
    file_level = getattr(logging, log_file_debug_level.upper(), logging.INFO)
    console_level = getattr(logging, console_debug_level.upper(), logging.INFO)

    # File handler: Logs messages at file_level and above to the file
    file_handler = logging.FileHandler(log_filename)
    file_handler.setLevel(file_level)
    file_formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    file_handler.setFormatter(file_formatter)

    # Console handler: Logs messages at console_level and above to the console
    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setLevel(console_level)
    console_formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    console_handler.setFormatter(console_formatter)

    # Add handlers to the root logger
    logger.addHandler(file_handler)
    logger.addHandler(console_handler)
    
def clear_logger():
    """Remove all handlers from the root logger."""
    logger = logging.getLogger()
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)
        
# Ensure the results directory exists before the log file is created in it
os.makedirs(results_directory, exist_ok=True)

log_filename = os.path.join(results_directory, "lab0_pedigree.log")
print(f"The Lab 0 Pedigree log file is located at {log_filename}.")
    
clear_logger() # Clear the logger before reconfiguring it
configure_logging(log_filename, log_file_debug_level="INFO", console_debug_level="INFO")

## Check R and Required R Packages

Go to Lab0_Code_Environment and run the set of code cells for **Install R** and **Install liftover** if you haven't already done so.

Also, rerun `poetry install --no-root` to install any needed packages for this lab.

In [None]:
%load_ext rpy2.ipython

Now you can use cell magic for R. Test it with a simple calculation:

In [None]:
%%R

x <- c(1, 2, 3, 4, 5)
mean(x)

Don't worry about the warning message about libraries containing no packages.

## Install the `pedsuite` package

The pedsuite package provides powerful tools for creating and manipulating pedigree structures in R.

In [None]:
%%R

# Function to check and install remotes package
# Provides functions to install R packages from GitHub, GitLab, Bitbucket, and other non-CRAN sources.
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes", repos = "https://cloud.r-project.org/")
}

# Install pedsuite from GitHub (with dependencies)
if (!requireNamespace("pedsuite", quietly = TRUE)) {
  remotes::install_github("magnusdv/pedsuite", dependencies = TRUE)
}

# Load the package
library(pedsuite)
print("pedsuite loaded successfully!")

# Part 1: Creating Genetic Family Trees with R

The `pedsuite` package in R provides a rich set of functions for creating and manipulating pedigree structures. We'll explore various approaches to build family trees, starting with pre-defined structures and then customizing them.

## Method 1: Using Pre-defined Pedigree Structures

The `pedsuite` package includes several pre-defined pedigree structures that serve as excellent starting points. We'll begin with the `cousinPed` function that creates a cousin pedigree of a specified degree.

In [None]:
%%R

# Create a pedigree connecting two fourth cousins (cousin degree 4)
x = cousinPed(degree = 4)

# Plot the pedigree
plot(x)

**Understanding the Visualization:**

In the pedigree plot, each person has an assigned number which serves as their identifier. The visualization follows standard pedigree conventions:
- Squares represent males (sex = 1)
- Circles represent females (sex = 2)
- Lines connect parents to their children
- Horizontal lines connect couples (spouses/partners)

## Method 2: Expanding Pre-defined Pedigrees

Now, let's customize the pre-defined pedigree by adding more individuals. We'll use the `addChildren` function to add children to existing couples in the pedigree.

Parameters:
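- `father`: ID of the father (optional; if omitted, a new founder is created as the father)
- `mother`: ID of the mother (optional; if omitted, a new founder is created as the mother)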
- `nch`: Number of children to add
- `sex`: Sex of the child(ren): 1=male, 2=female
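
The effect of omitting one parent can be illustrated outside R. The following is a plain-Python sketch of the bookkeeping only, not pedsuite's implementation: the `add_children` function, the dictionary layout, and the order in which new IDs are handed out are all assumptions made for illustration.

```python
def add_children(ped, father=None, mother=None, nch=1, sex=1):
    """Toy stand-in for adding children to a pedigree.

    `ped` maps id -> (father_id, mother_id, sex); 0 means "no parent recorded".
    """
    ped = dict(ped)  # work on a copy so the input is left untouched
    next_id = max(ped) + 1

    # A parent that is not supplied is created as a new founder
    if father is None:
        father, next_id = next_id, next_id + 1
        ped[father] = (0, 0, 1)
    if mother is None:
        mother, next_id = next_id, next_id + 1
        ped[mother] = (0, 0, 2)

    # Append the requested number of children, all with the same sex
    for _ in range(nch):
        ped[next_id] = (father, mother, sex)
        next_id += 1
    return ped


# A trio: father 1, mother 2, daughter 3
trio = {1: (0, 0, 1), 2: (0, 0, 2), 3: (1, 2, 2)}

# Give individual 3 a son without naming the father
extended = add_children(trio, mother=3, nch=1, sex=1)
for person, record in extended.items():
    print(person, record)
```

In this sketch the unnamed father enters as a new founder (both parent fields are 0), which is why pedigrees grow by two individuals each time a child is added with only one parent specified.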

In [None]:
%%R

# Add children to the base pedigree
x = addChildren(x, father = 3, mother = 4, nch = 1, sex = 2)

x = addChildren(x, father = 5, mother = 6, nch = 1, sex = 2)
x = addChildren(x, father = 7, mother = 8, nch = 1, sex = 2)
x = addChildren(x, father = 9, mother = 10, nch = 1, sex = 2)
x = addChildren(x, father = 11, mother = 12, nch = 1, sex = 2)
x = addChildren(x, father = 13, mother = 14, nch = 1, sex = 2)
x = addChildren(x, father = 15, mother = 16, nch = 1, sex = 2)
x = addChildren(x, father = 17, mother = 18, nch = 1, sex = 2)

# Plot the updated pedigree
plot(x)

Notice how the pedigree has changed. We've added one female child to each of the existing couples in the original pedigree. Take a moment to understand these changes by comparing with the previous plot.

Let's continue expanding our pedigree by adding more generations:

In [None]:
%%R

# Add another generation of children (Note: when only the mother is specified, a new founder father is created for each child)
x = addChildren(x, mother = 21, nch = 1, sex = 1)
x = addChildren(x, mother = 22, nch = 1, sex = 2)
x = addChildren(x, mother = 23, nch = 1, sex = 1)
x = addChildren(x, mother = 24, nch = 1, sex = 2)

x = addChildren(x, mother = 25, nch = 1, sex = 1)
x = addChildren(x, mother = 26, nch = 1, sex = 2)

# Plot the updated pedigree
plot(x)

In [None]:
%%R

# Add children to the new individuals
x = addChildren(x, father = 30, nch = 1, sex = 2)
x = addChildren(x, father = 34, nch = 1, sex = 2)
x = addChildren(x, mother = 32, nch = 1, sex = 1)
x = addChildren(x, mother = 36, nch = 1, sex = 1)

# Plot the updated pedigree
plot(x)

In [None]:
%%R

# Final additions to the pedigree
x = addChildren(x, mother = 42, nch = 1, sex = 1)
x = addChildren(x, father = 46, nch = 1, sex = 2)

# Plot the final pedigree
plot(x)

## Method 3: Creating Complex Family Structures

Let's explore how to create more complex family structures with specific relationships. Here are some examples of relationships you might want to model:

1. Half-siblings: Two individuals sharing one parent but not the other
2. Complex consanguinity: Marriage between relatives (e.g., cousin marriages)
3. Multi-generational families with specific patterns of relationship
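
To make the half-sibling definition concrete, here is a small Python sketch that classifies sibling pairs from a parent table. The toy pedigree, the IDs, and the `sibling_type` helper are hypothetical and exist only for this illustration.

```python
from itertools import combinations

# Hypothetical toy pedigree: id -> (father_id, mother_id); 0 marks a founder
parents = {
    "dad": (0, 0),
    "mum1": (0, 0),
    "mum2": (0, 0),
    "a": ("dad", "mum1"),
    "b": ("dad", "mum1"),
    "c": ("dad", "mum2"),
}


def sibling_type(p, q):
    """Return 'full', 'half', or None depending on how many parents are shared."""
    fa_p, mo_p = parents[p]
    fa_q, mo_q = parents[q]
    shared = (fa_p != 0 and fa_p == fa_q) + (mo_p != 0 and mo_p == mo_q)
    return {2: "full", 1: "half"}.get(shared)


# Only non-founders can be siblings of one another
children = [i for i, (f, m) in parents.items() if f != 0 or m != 0]
pairs = {(p, q): sibling_type(p, q) for p, q in combinations(children, 2)}
print(pairs)
```

Here `a` and `b` share both parents, while `c` shares only the father with each of them.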

In [None]:
%%R

# Create a pedigree with a pair of paternal half-siblings
# Start with one nuclear family: father (1), mother (2), and one son (3)
half_sibs = nuclearPed(1, sex = 1)

# Add a daughter of the same father; because no mother is given,
# a new founder mother is created, making the two children half-siblings
half_sibs = addChildren(half_sibs, father = 1, nch = 1, sex = 2)

# Show the half-siblings family
plot(half_sibs, title = "Half-siblings example")

In [None]:
%%R

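# Create a pedigree with a first-cousin marriage
# cousinPed(1) has 8 members, so it cannot be relabelled with only 6 labels.
# With child = TRUE, one cousin is made female and a first child is added.
cousin_marriage = cousinPed(1, child = TRUE)

# Add a second child (a daughter) to the cousin couple
# (assumes the default labelling, where the cousins are individuals 7 and 8)
cousin_marriage = addChildren(cousin_marriage, father = 7, mother = 8, nch = 1, sex = 2)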

# Plot the cousin marriage pedigree
plot(cousin_marriage, title = "First-cousin marriage example")

In [None]:
%%R

# Create a pedigree with multiple spouses
# Start with a nuclear family
multi_spouse = nuclearPed(1)

# Add a second family with the same father
multi_spouse = addChildren(multi_spouse, father = 1, nch = 2, sex = c(1,2))

# Add a third family with the same father
multi_spouse = addChildren(multi_spouse, father = 1, nch = 1, sex = 1)

# Plot the multi-spouse pedigree
plot(multi_spouse, title = "Multiple spouse example")

## Exporting the Pedigree Structure

Now let's extract the pedigree structure to a format we can use in Python for our simulation tools.

In [None]:
%%R

# Print the pedigree to check the structure
print(x)

In the output, you should see 52 rows. Each row represents one person in the tree you built. `id` is the individual identifier, `fid` is the father identifier and `mid` is the mother identifier. `sex` is the chromosomal sex where `1` = male and `2` = female.
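
The same four columns are enough to run basic sanity checks on a pedigree. Below is a minimal pandas sketch on a hypothetical five-person pedigree, assuming founders are coded with `0` in both parent columns; none of these values come from the pedigree built above.

```python
import pandas as pd

# Hypothetical five-person pedigree in id/fid/mid/sex form
ped = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "fid": [0, 0, 1, 0, 4],
        "mid": [0, 0, 2, 0, 3],
        "sex": [1, 2, 2, 1, 1],
    }
)

# Founders have no recorded parents
founders = ped[(ped["fid"] == 0) & (ped["mid"] == 0)]["id"].tolist()

# Every non-zero parent ID should also appear in the id column
known = set(ped["id"])
dangling = [p for p in pd.concat([ped["fid"], ped["mid"]]) if p != 0 and p not in known]

# Fathers should be coded male (1) and mothers female (2)
sex_of = ped.set_index("id")["sex"]
bad_fathers = [f for f in ped["fid"] if f != 0 and sex_of.loc[f] != 1]
bad_mothers = [m for m in ped["mid"] if m != 0 and sex_of.loc[m] != 2]

print("Founders:", founders)
print("Dangling parent IDs:", dangling)
print("Sex conflicts:", bad_fathers + bad_mothers)
```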

In [None]:
%%R -o fam_df

# Extract the pedigree as a data frame with label-based columns
# (id, fid, mid, sex). This avoids the internal FIDX/MIDX fields, which
# hold row indices rather than ID labels; founders have fid and mid of 0.
ped_df <- as.data.frame(x)

# Create data frame
fam_df <- data.frame(
  individual_id = as.character(ped_df$id),
  father_id = as.character(ped_df$fid),
  mother_id = as.character(ped_df$mid),
  sex = as.character(ped_df$sex),
  stringsAsFactors = FALSE
)

## Checking the Pedigree DataFrame in Python

Now let's inspect the dataframe in Python to ensure it looks as expected.

In [None]:
fam_df = fam_df.copy()
fam_df.info()

In [None]:
# Inspect the first 10 rows
# Compare them to the pedsuite pedigree
fam_df.head(10)

# Part 2: Creating Pedigree Definition Files for Simulation

Both Ped-sim and msprime require special formatting for their pedigree input files. Let's create these files from our pedigree structure.

## Creating a FAM File for Genetic Simulation

The PLINK FAM format is a standard format for pedigree data in genetic analysis. It includes six columns:
1. Family ID (FID)
2. Individual ID (IID)
3. Father ID (if available, 0 otherwise)
4. Mother ID (if available, 0 otherwise)
5. Sex (1=male, 2=female)
6. Phenotype (often set to -9 for missing)
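
As a quick illustration of the six-column layout, the sketch below writes a hypothetical parent-offspring trio to an in-memory FAM-style text and reads it back. The family and individual IDs are made up for the example; only the column order follows the list above.

```python
import io

import pandas as pd

FAM_COLUMNS = ["FAM_ID", "INDIV_ID", "FATHER_ID", "MOTHER_ID", "SEX", "PHENO"]

# Hypothetical trio: two founders (parents coded 0) and their daughter
trio = pd.DataFrame(
    [
        ["FAM", "1", "0", "0", 1, -9],
        ["FAM", "2", "0", "0", 2, -9],
        ["FAM", "3", "1", "2", 2, -9],
    ],
    columns=FAM_COLUMNS,
)

# FAM files have no header row and no index column
buffer = io.StringIO()
trio.to_csv(buffer, sep="\t", index=False, header=False)
fam_text = buffer.getvalue()
print(fam_text)

# Read it back, keeping the ID columns as strings so "0" stays a missing-parent code
reloaded = pd.read_csv(
    io.StringIO(fam_text),
    sep=r"\s+",
    header=None,
    names=FAM_COLUMNS,
    dtype={"INDIV_ID": str, "FATHER_ID": str, "MOTHER_ID": str},
)
print(reloaded)
```

Reading the ID columns as strings is a deliberate choice here: numeric-looking IDs would otherwise be parsed as integers, and comparisons against the string `"0"` would silently stop matching.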

In [None]:
# Rename the columns to match the required headers:
# FAM_ID, INDIV_ID, FATHER_ID, MOTHER_ID, SEX, PHENO
fam_df.rename(columns={
    "individual_id": "INDIV_ID",
    "father_id": "FATHER_ID",
    "mother_id": "MOTHER_ID",
    "sex": "SEX"
}, inplace=True)

fam_df["FAM_ID"] = "FAM"
fam_df["PHENO"] = -9

# Reorder columns to match the standard FAM format
fam_df = fam_df[["FAM_ID", "INDIV_ID", "FATHER_ID", "MOTHER_ID", "SEX", "PHENO"]]

# Display the first few rows to verify
display(fam_df.head())

# Save the updated file without header and index, using tab separation
fam_output_path = os.path.join(results_directory, "pedigree.fam")
fam_df.to_csv(fam_output_path, sep="\t", index=False, header=False)

print(f"Saved FAM file to {fam_output_path}")

## Converting FAM to Ped-sim DEF File

Ped-sim requires a specific format for its pedigree definition files. Fortunately, it comes with a utility script to convert FAM files to DEF files.

In [None]:
!{utils_directory}/ped-sim/fam2def.py -i {results_directory}/pedigree.fam -o {results_directory}/pedigree.def

Let's examine the DEF file that was created:

In [None]:
!cat {results_directory}/pedigree.def | head -n 20

## Specifying the Number of Simulated Pedigrees

In the DEF file, the first line (the one beginning with `def`) includes the number of copies of the pedigree to simulate. Let's update it to a specified value.
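
The edit itself is a small string operation. Here is a minimal sketch on a single header line, assuming the header layout `def <name> <copies> <generations>`; the `set_copies` helper and the example line are hypothetical.

```python
def set_copies(def_line, copies):
    """Return a DEF header line with its copies field replaced.

    Assumes whitespace-separated fields: "def", name, copies, generations.
    """
    parts = def_line.split()
    if len(parts) < 4 or parts[0] != "def":
        raise ValueError(f"Not a recognizable def header: {def_line!r}")
    parts[2] = str(copies)  # third field = number of copies to simulate
    return " ".join(parts)


print(set_copies("def FAM 1 5", 25))
```

Validating the line before editing it means a file that starts with something unexpected raises an error instead of being left silently unchanged.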

In [None]:
# Get user input for the number of pedigrees
num_pedigrees = int(input("Enter the number of pedigrees to generate: "))

# Define input file path
input_def_file = os.path.join(results_directory, "pedigree.def")

# Read the file contents
with open(input_def_file, "r") as file:
    lines = file.readlines()

# Modify the first line, changing only the third field (the number of copies)
if lines[0].startswith("def"):
    parts = lines[0].split()  # Split the first line into whitespace-separated fields
    parts[2] = str(num_pedigrees)  # Fields are: "def", name, number of copies, ...
    lines[0] = " ".join(parts) + "\n"  # Reconstruct the modified line

# Write back to the file
with open(input_def_file, "w") as file:
    file.writelines(lines)

print(f"Updated {input_def_file} with num_pedigrees = {num_pedigrees}")

# Part 3: Creating Pedigrees Programmatically with Python

While the R-based approach using pedsuite is powerful for interactive pedigree creation, you may want to create pedigrees programmatically in Python, especially when building complex or random pedigree structures.

In [None]:
def create_pedigree_definition(output_path, num_founders=10, num_generations=5, max_children=3, cousin_marriage_prob=0.1, random_seed=None):
    """
    Create a complex pedigree definition file with multiple generations, including potential cousin marriages.
    
    Parameters:
    - output_path: Path to save the pedigree definition file
    - num_founders: Number of founder individuals (should be even for male/female pairs)
    - num_generations: Number of generations to create
    - max_children: Maximum number of children per couple
    - cousin_marriage_prob: Probability of cousin marriages in later generations
    - random_seed: Seed for random number generation
    
    Returns:
    - Path to the created file
    - DataFrame with the pedigree information
    """
    if random_seed is not None:
        np.random.seed(random_seed)
    
    # Ensure even number of founders for male/female pairs
    if num_founders % 2 != 0:
        num_founders += 1
        print(f"Adjusted number of founders to {num_founders} to ensure male/female pairs")
    
    # Initialize tracking structures
    individuals = []
    individual_ids = set()
    generation_info = {}
    available_males = []
    available_females = []
    
    # Create founders (Generation 1)
    for i in range(1, num_founders + 1):
        individual_id = f"F{i}"
        sex = "1" if i % 2 == 1 else "2"  # Alternating male/female
        individuals.append((individual_id, "0", "0", sex))
        individual_ids.add(individual_id)
        generation_info[individual_id] = 1
        
        if sex == "1":
            available_males.append(individual_id)
        else:
            available_females.append(individual_id)
    
    # Helper function to create a unique ID
    def create_unique_id(prefix="P"):
        for i in range(1, 10000):
            potential_id = f"{prefix}{i}"
            if potential_id not in individual_ids:
                individual_ids.add(potential_id)
                return potential_id
        raise ValueError("Could not generate a unique ID")
    
    # Generate subsequent generations
    for gen in range(2, num_generations + 1):
        # Lists to track individuals in this generation
        gen_males = []
        gen_females = []
        
        # 1. Create marriages between available individuals
        # For each male, find a female partner
        for male in available_males.copy():
            # If no more females available, break
            if not available_females:
                break
                
            # Get a female partner
            female = available_females.pop(0)
            available_males.remove(male)
            
            # Determine number of children for this couple
            num_children = np.random.randint(1, max_children + 1)
            
            # Create children
            for j in range(num_children):
                child_id = create_unique_id()
                # Randomly assign sex
                sex = str(np.random.choice([1, 2]))
                individuals.append((child_id, male, female, sex))
                generation_info[child_id] = gen
                
                if sex == "1":
                    gen_males.append(child_id)
                else:
                    gen_females.append(child_id)
        
        # 2. Potentially create cousin marriages in later generations
        if gen >= 3 and gen_males and gen_females and np.random.random() < cousin_marriage_prob:
            # Find potential cousin pairs
            cousins = []
            for male in gen_males:
                for female in gen_females:
                    # Simple proxy for cousins: same generation but different parents
                    # (this does not verify that the pair shares grandparents)
                    male_parents = next((p[1], p[2]) for p in individuals if p[0] == male)
                    female_parents = next((p[1], p[2]) for p in individuals if p[0] == female)
                    if male_parents != female_parents:
                        cousins.append((male, female))
            
            # If we found cousin pairs, randomly select one pair for marriage
            if cousins:
                male, female = cousins[np.random.randint(0, len(cousins))]
                
                # Remove from available mates for this generation
                if male in gen_males:
                    gen_males.remove(male)
                if female in gen_females:
                    gen_females.remove(female)
                    
                # Create children for the cousin marriage
                num_children = np.random.randint(1, max_children + 1)
                for j in range(num_children):
                    child_id = create_unique_id()
                    sex = str(np.random.choice([1, 2]))
                    individuals.append((child_id, male, female, sex))
                    generation_info[child_id] = gen + 1
        
        # Update available individuals for next generation
        available_males = gen_males.copy()
        available_females = gen_females.copy()
    
    # Create a DataFrame with the pedigree
    pedigree_df = pd.DataFrame(individuals, columns=["individual_id", "father_id", "mother_id", "sex"])
    
    # Add generation information
    pedigree_df["generation"] = pedigree_df["individual_id"].map(generation_info)
    
    # Write the pedigree definition to file
    with open(output_path, 'w') as f:
        f.write("# Pedigree definition: individual_id father_id mother_id sex (1=male, 2=female)\n")
        for individual in individuals:
            f.write(" ".join(individual) + "\n")
    
    print(f"Created pedigree definition with {len(individuals)} individuals across {num_generations} generations")
    print(f"Saved to: {output_path}")
    
    return output_path, pedigree_df

In [None]:
# Create a complex pedigree with Python
python_pedigree_path = os.path.join(results_directory, "python_pedigree.txt")
python_pedigree_file, python_ped_df = create_pedigree_definition(
    python_pedigree_path, 
    num_founders=12, 
    num_generations=4, 
    max_children=3, 
    cousin_marriage_prob=0.15,
    random_seed=42
)

In [None]:
# Display pedigree statistics
print("\nPedigree Statistics:")
print(f"Total individuals: {len(python_ped_df)}")
print(f"Individuals by generation:\n{python_ped_df['generation'].value_counts().sort_index()}")
print(f"Sex distribution:\n{python_ped_df['sex'].value_counts()}")

# See a sample of the pedigree
python_ped_df.head(10)

## Converting Python-generated Pedigree to FAM and DEF Files

Now we'll convert our Python-generated pedigree to the proper format for genetic simulations.

In [None]:
# Convert to FAM format
python_fam_df = python_ped_df.copy()
python_fam_df.rename(columns={
    "individual_id": "INDIV_ID",
    "father_id": "FATHER_ID",
    "mother_id": "MOTHER_ID",
    "sex": "SEX"
}, inplace=True)

python_fam_df["FAM_ID"] = "FAM"
python_fam_df["PHENO"] = -9

# Reorder columns and select only those needed for FAM format
python_fam_df = python_fam_df[["FAM_ID", "INDIV_ID", "FATHER_ID", "MOTHER_ID", "SEX", "PHENO"]]

# Save the FAM file
python_fam_path = os.path.join(results_directory, "python_pedigree.fam")
python_fam_df.to_csv(python_fam_path, sep="\t", index=False, header=False)

print(f"Saved FAM file to {python_fam_path}")

In [None]:
# Convert FAM to DEF for Ped-sim
!{utils_directory}/ped-sim/fam2def.py -i {results_directory}/python_pedigree.fam -o {results_directory}/python_pedigree.def

In [None]:
# Update the number of pedigrees in the DEF file
python_def_file = os.path.join(results_directory, "python_pedigree.def")

# Read the file contents
with open(python_def_file, "r") as file:
    lines = file.readlines()

# Modify the first line to set 10 pedigrees (the third field is the number of copies)
if lines[0].startswith("def"):
    parts = lines[0].split()
    parts[2] = "10"  # Set to 10 pedigrees
    lines[0] = " ".join(parts) + "\n"

# Write back to the file
with open(python_def_file, "w") as file:
    file.writelines(lines)

print(f"Updated {python_def_file} with num_pedigrees = 10")

# Part 4: Visualizing Pedigrees

Let's create functions to visualize pedigrees using NetworkX, which gives us better control over the visualization than the basic plots we've seen so far.
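
Before the full functions, the core idea can be sketched on a toy family: store parent-to-child links as directed edges, derive a generation for each person, and turn generations into layered coordinates. The IDs below are hypothetical, and the spacing rule is just one possible layout.

```python
import networkx as nx

# Hypothetical three-generation family; edges point from parent to child
G = nx.DiGraph()
G.add_edges_from(
    [
        ("gf", "mum"),
        ("gm", "mum"),
        ("mum", "kid"),
        ("dad", "kid"),
    ]
)

# A topological order guarantees parents are processed before their children
generation = {}
for node in nx.topological_sort(G):
    parents = list(G.predecessors(node))
    generation[node] = 1 + max(generation[p] for p in parents) if parents else 0

# Spread each generation evenly along x; founders sit at the top (y = 0)
pos = {}
for gen in sorted(set(generation.values())):
    members = sorted(n for n, g in generation.items() if g == gen)
    for k, node in enumerate(members):
        pos[node] = ((k + 1) / (len(members) + 1), -gen)

print(generation)
print(pos)
```

Note that `dad`, who marries into the family, is assigned generation 0 because he has no recorded parents, even though his partner is in generation 1. The same effect appears in the larger plots below.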

In [None]:
def create_pedigree_graph(fam_file_path):
    """
    Create a NetworkX directed graph from a FAM file.
    
    Parameters:
    - fam_file_path: Path to the FAM file
    
    Returns:
    - NetworkX DiGraph representing the pedigree
    """
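    # Load the FAM file, reading every column as text so that IDs compare
    # consistently with the "0" missing-parent code checked below
    fam_data = pd.read_csv(fam_file_path, sep=r"\s+", header=None, dtype=str)
    fam_data.columns = ["fam_id", "individual_id", "father_id", "mother_id", "sex", "phenotype"]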
    
    # Create a directed graph
    G = nx.DiGraph()
    
    # Add all individuals as nodes
    for _, row in fam_data.iterrows():
        G.add_node(
            row["individual_id"], 
            sex=int(row["sex"]),
            phenotype=row["phenotype"]
        )
        
    # Add edges for parent-child relationships
    for _, row in fam_data.iterrows():
        indiv_id = row["individual_id"]
        father_id = row["father_id"]
        mother_id = row["mother_id"]
        
        # Add edges from parents to child
        if father_id != "0" and father_id in G:
            G.add_edge(father_id, indiv_id, relation="father")
        if mother_id != "0" and mother_id in G:
            G.add_edge(mother_id, indiv_id, relation="mother")
    
    # Assign generations based on topological sort
    generations = {}
    for node in nx.topological_sort(G):
        # Get parents
        parents = list(G.predecessors(node))
        if not parents:
            generations[node] = 0  # Founder
        else:
            # One generation after parents
            generations[node] = max(generations.get(p, 0) for p in parents) + 1
    
    # Add generation attribute to nodes
    for node, gen in generations.items():
        G.nodes[node]['generation'] = gen
    
    return G

In [None]:
def visualize_pedigree(G, title="Pedigree Visualization", figsize=(12, 8), node_size=300, font_size=10):
    """
    Visualize a pedigree graph with generation-based layout.
    
    Parameters:
    - G: NetworkX DiGraph representing the pedigree
    - title: Title for the plot
    - figsize: Size of the figure
    - node_size: Size of the nodes
    - font_size: Size of the node labels
    """
    plt.figure(figsize=figsize)
    
    # Get generation information
    generations = nx.get_node_attributes(G, 'generation')
    if not generations:
        # If generations are not defined, use topological generations
        generations = {node: 0 for node in G.nodes()}
        for node in nx.topological_sort(G):
            parents = list(G.predecessors(node))
            if parents:
                generations[node] = max(generations[p] for p in parents) + 1
    
    # Position nodes by generation (horizontal layers)
    pos = {}
    gen_counts = {gen: 0 for gen in set(generations.values())}
    
    # First pass: count nodes per generation
    for node, gen in generations.items():
        gen_counts[gen] += 1
    
    # Second pass: position nodes
    gen_positions = {gen: 0 for gen in gen_counts}
    
    for node, gen in sorted(generations.items(), key=lambda x: (x[1], x[0])):
        # Horizontal position based on count within generation
        width = max(1, gen_counts[gen])
        x_pos = gen_positions[gen] / width
        gen_positions[gen] += 1
        
        # Vertical position based on generation
        y_pos = -gen  # Negative to have founders at the top
        
        pos[node] = (x_pos, y_pos)
    
    # Get node colors based on sex
    sexes = nx.get_node_attributes(G, 'sex')
    node_colors = []
    for node in G.nodes():
        sex = sexes.get(node, 0)
        if sex == 1:  # Male
            node_colors.append('lightblue')
        elif sex == 2:  # Female
            node_colors.append('lightpink')
        else:  # Unknown
            node_colors.append('lightgray')
    
    # Draw the pedigree
    nx.draw(
        G, 
        pos=pos, 
        with_labels=True, 
        node_color=node_colors,
        node_size=node_size,
        font_size=font_size,
        arrows=True
    )
    
    plt.title(title)
    plt.tight_layout()
    
    return pos

In [None]:
# Visualize the R-created pedigree
r_pedigree_path = os.path.join(results_directory, "pedigree.fam")
r_pedigree_graph = create_pedigree_graph(r_pedigree_path)
visualize_pedigree(r_pedigree_graph, title="R-Created Pedigree", figsize=(15, 10))

# Save the visualization
plt.savefig(os.path.join(results_directory, "r_pedigree_visualization.png"), dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Visualize the Python-created pedigree
python_pedigree_path = os.path.join(results_directory, "python_pedigree.fam")
python_pedigree_graph = create_pedigree_graph(python_pedigree_path)
visualize_pedigree(python_pedigree_graph, title="Python-Created Pedigree", figsize=(15, 10))

# Save the visualization
plt.savefig(os.path.join(results_directory, "python_pedigree_visualization.png"), dpi=300, bbox_inches='tight')
plt.show()

# Part 5: Analyzing Pedigree Relationships

Let's create functions to analyze relationships within the pedigree, which will be important for IBD analysis.
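
As a warm-up, the sketch below works through one pair by hand: two first cousins in a hypothetical pedigree. It finds their shared ancestors with `nx.ancestors` and counts the parent-child links on each side with `nx.shortest_path_length`. All IDs are made up for the example.

```python
import networkx as nx

# Hypothetical first-cousin pedigree; edges point from parent to child
G = nx.DiGraph(
    [
        ("gf", "p1"), ("gm", "p1"),  # grandparents -> first sibling
        ("gf", "p2"), ("gm", "p2"),  # grandparents -> second sibling
        ("s1", "c1"), ("p1", "c1"),  # first sibling and spouse -> cousin 1
        ("s2", "c2"), ("p2", "c2"),  # second sibling and spouse -> cousin 2
    ]
)

# Ancestors shared by both cousins
common = nx.ancestors(G, "c1") & nx.ancestors(G, "c2")

# Number of parent-child links from one shared ancestor down to each cousin
d1 = nx.shortest_path_length(G, "gf", "c1")
d2 = nx.shortest_path_length(G, "gf", "c2")
meiotic_distance = d1 + d2

print("Common ancestors:", sorted(common))
print("Meiotic distance:", meiotic_distance)
```

The cousins share both grandparents, and each sits two links below them, giving a meiotic distance of 4.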

In [None]:
def find_common_ancestors(G, person1, person2):
    """
    Find all common ancestors between two individuals in a family tree.
    
    Parameters:
    - G: NetworkX DiGraph representing the pedigree
    - person1: ID of first person
    - person2: ID of second person
    
    Returns:
    - Set of common ancestors
    """
    # Find all ancestors for each person (including the person themselves)
    ancestors1 = nx.ancestors(G, person1) | {person1}
    ancestors2 = nx.ancestors(G, person2) | {person2}
    
    # Find common ancestors
    common_ancestors = ancestors1.intersection(ancestors2)
    
    return common_ancestors

def find_most_recent_common_ancestor(G, person1, person2):
    """
    Find the most recent common ancestor (MRCA) between two individuals.
    
    Parameters:
    - G: NetworkX DiGraph representing the pedigree
    - person1: ID of first person
    - person2: ID of second person
    
    Returns:
    - MRCA: Most recent common ancestor
    - dist1: Number of generations from MRCA to person1
    - dist2: Number of generations from MRCA to person2
    """
    common_ancestors = find_common_ancestors(G, person1, person2)
    
    if not common_ancestors:
        return None, None, None
    
    # Measure distances as the number of parent-child links on the shortest
    # path from an ancestor down to a person. Generation labels are not used
    # here because founders who marry into later generations are all labelled
    # generation 0, which would inflate the distances.
    def depth(ancestor, person):
        return nx.shortest_path_length(G, ancestor, person)
    
    # The MRCA is the common ancestor closest to both individuals
    mrca = min(common_ancestors, key=lambda a: depth(a, person1) + depth(a, person2))
    
    # Calculate distances (number of generations)
    dist1 = depth(mrca, person1)
    dist2 = depth(mrca, person2)
    
    return mrca, dist1, dist2

def identify_relationship(dist1, dist2):
    """
    Identify the relationship type based on generational distances.
    
    Parameters:
    - dist1: Generations from MRCA to person1
    - dist2: Generations from MRCA to person2
    
    Returns:
    - Relationship name
    - Degree of relationship, counted here as the number of meioses
      separating the pair (dist1 + dist2)
    """
    if dist1 is None or dist2 is None:
        return "Unrelated", float('inf')
    
    # Self
    if dist1 == 0 and dist2 == 0:
        return "Self", 0
    
    # Direct line (ancestor-descendant)
    if dist1 == 0 or dist2 == 0:
        distance = max(dist1, dist2)
        if distance == 1:
            return "Parent-Child", 1
        elif distance == 2:
            return "Grandparent-Grandchild", 2
        else:
            great = "Great-" * (distance - 2)
            return f"{great}Grandparent-Grandchild", distance
    
    # Same generation (siblings, cousins)
    if dist1 == dist2:
        if dist1 == 1:
            return "Siblings", 2
        else:
            cousin_degree = dist1 - 1
            ordinal = ["First", "Second", "Third", "Fourth", "Fifth", "Sixth"][min(cousin_degree - 1, 5)]
            # First cousins are 2 + 2 = 4 meioses apart, second cousins 6, and so on
            return f"{ordinal} Cousins", dist1 + dist2
    
    # Different generations
    min_dist = min(dist1, dist2)
    max_dist = max(dist1, dist2)
    difference = max_dist - min_dist
    
    if min_dist == 1:
        if difference == 1:
            return "Uncle/Aunt-Nephew/Niece", 3
        else:
            great = "Great-" * (difference - 1)
            return f"{great}Uncle/Aunt-Nephew/Niece", 2 + difference
    else:
        cousin_degree = min_dist - 1
        ordinal = ["First", "Second", "Third", "Fourth", "Fifth", "Sixth"][min(cousin_degree - 1, 5)]
        # Each removal adds one meiosis to the cousin distance
        return f"{ordinal} Cousins {difference} times removed", dist1 + dist2

def analyze_pedigree_relationships(G, output_file=None):
    """
    Analyze all pairwise relationships in a pedigree.
    
    Parameters:
    - G: NetworkX DiGraph representing the pedigree
    - output_file: Optional path to save the relationship data
    
    Returns:
    - DataFrame with pairwise relationship information
    """
    relationships = []
    
    individuals = list(G.nodes())
    for i, person1 in enumerate(individuals):
        for person2 in individuals[i+1:]:  # Only unique pairs
            mrca, dist1, dist2 = find_most_recent_common_ancestor(G, person1, person2)
            relationship, degree = identify_relationship(dist1, dist2)
            
            relationships.append({
                'person1': person1,
                'person2': person2,
                'mrca': mrca,
                'dist1': dist1,
                'dist2': dist2,
                'relationship': relationship,
                'degree': degree,
                'meiotic_distance': dist1 + dist2 if dist1 is not None and dist2 is not None else None
            })
    
    # Create a DataFrame
    rel_df = pd.DataFrame(relationships)
    
    # Save to file if specified
    if output_file:
        rel_df.to_csv(output_file, index=False)
        print(f"Saved relationship data to {output_file}")
    
    return rel_df

## Analyzing Relationships in Our Pedigrees

Next, we classify every pair of individuals in our pedigrees so that we know exactly which genetic relationships the simulations will contain.
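To make the `dist1`, `dist2`, and `meiotic_distance` columns concrete, here is a minimal, self-contained sketch on a toy pedigree. It is not the notebook's `find_most_recent_common_ancestor`; the names `toy`, `generations_to_ancestors`, and `closest_common_ancestor_distances` are hypothetical, and the sketch assumes edges point from parent to child. It also skips direct-line pairs (where one person is an ancestor of the other).

```python
import networkx as nx

# Toy pedigree. Assumption for this sketch: edges point from parent to child.
toy = nx.DiGraph()
toy.add_edges_from([
    ("GP1", "P1"), ("GP2", "P1"),  # grandparents -> first parent
    ("GP1", "P2"), ("GP2", "P2"),  # grandparents -> second parent (full sibling of P1)
    ("P1", "C1"), ("P2", "C2"),    # C1 and C2 are first cousins
])

def generations_to_ancestors(G, person):
    """Map each ancestor of `person` to the number of generations separating them."""
    # Walking the reversed graph from `person` reaches exactly their ancestors.
    dists = nx.shortest_path_length(G.reverse(copy=False), source=person)
    dists.pop(person, None)  # drop the person themselves (distance 0)
    return dists

def closest_common_ancestor_distances(G, a, b):
    """Return (dist_a, dist_b) to the closest shared ancestor, or None if unrelated."""
    anc_a = generations_to_ancestors(G, a)
    anc_b = generations_to_ancestors(G, b)
    shared = set(anc_a) & set(anc_b)
    if not shared:
        return None
    # Pick the shared ancestor with the smallest total number of meioses.
    best = min(shared, key=lambda n: (anc_a[n] + anc_b[n], str(n)))
    return anc_a[best], anc_b[best]

for pair in [("P1", "P2"), ("P1", "C2"), ("C1", "C2")]:
    d1, d2 = closest_common_ancestor_distances(toy, *pair)
    print(pair, "dist1 =", d1, "dist2 =", d2, "meiotic distance =", d1 + d2)
```

In this toy example, equal distances of 1 correspond to siblings, distances of 1 and 2 to an uncle/aunt and nephew/niece, and equal distances of 2 to first cousins, which matches the branching in `identify_relationship`.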

In [None]:
# Analyze relationships in the Python-created pedigree
python_rels_output = os.path.join(results_directory, "python_pedigree_relationships.csv")
python_rels = analyze_pedigree_relationships(python_pedigree_graph, python_rels_output)

# Show relationship statistics
print("\nRelationship Distribution:")
rel_counts = python_rels['relationship'].value_counts().sort_index()
for rel, count in rel_counts.items():
    print(f"{rel}: {count}")

print("\nMeiotic Distance Distribution:")
md_counts = python_rels['meiotic_distance'].value_counts().sort_index()
for md, count in md_counts.items():
    print(f"Meiotic distance {md}: {count}")

In [None]:
# Visualize meiotic distance distribution
plt.figure(figsize=(10, 6))
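# Drop missing distances (pairs with no common ancestor) so the sort order is well defined
md_order = sorted(python_rels['meiotic_distance'].dropna().unique())
sns.countplot(data=python_rels, x='meiotic_distance', order=md_order)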
plt.title('Distribution of Meiotic Distances in Python-Created Pedigree')
plt.xlabel('Meiotic Distance')
plt.ylabel('Count')
plt.grid(alpha=0.3)
plt.savefig(os.path.join(results_directory, "python_pedigree_meiotic_distances.png"), dpi=300, bbox_inches='tight')
plt.show()

## Summary of Expected IBD Sharing by Relationship

Let's create a reference table of the expected percentage of the genome shared identical by descent (IBD) for different relationships. This will be useful for interpreting simulation results in the later labs.
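The values in such a table follow a simple halving rule: each meiosis halves the expected sharing, and relatives connected through two common ancestors (full siblings, full cousins) receive that contribution twice. The sketch below illustrates this under those assumptions; the function `expected_sharing_percent` and its `n_paths` parameter are hypothetical names introduced here, not part of the lab's code.

```python
def expected_sharing_percent(meiotic_distance, n_paths=1):
    """Expected percent of the genome shared IBD: n_paths * (1/2) ** meiotic_distance.

    n_paths is 1 for direct-line and half relationships (one connecting path)
    and 2 for full relationships that trace back to an ancestral couple.
    """
    return 100.0 * n_paths * 0.5 ** meiotic_distance

# (meiotic distance, number of connecting paths) for a few relationships
examples = {
    "Parent-Child": (1, 1),
    "Half-siblings": (2, 1),
    "Full siblings": (2, 2),
    "First cousins": (4, 2),
    "Second cousins": (6, 2),
}

for name, (distance, paths) in examples.items():
    print(f"{name}: {expected_sharing_percent(distance, paths):.3f}%")
```

This is why full siblings and half-siblings have the same meiotic distance (2) but different expected sharing (50% versus 25%).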

In [None]:
# Create a reference table of expected IBD sharing
relationship_sharing = [
    {"relationship": "Identical twins", "meiotic_distance": 0, "expected_sharing": 100.0},
    {"relationship": "Parent-Child", "meiotic_distance": 1, "expected_sharing": 50.0},
    {"relationship": "Full siblings", "meiotic_distance": 2, "expected_sharing": 50.0},
    {"relationship": "Grandparent-Grandchild", "meiotic_distance": 2, "expected_sharing": 25.0},
    {"relationship": "Half-siblings", "meiotic_distance": 2, "expected_sharing": 25.0},
    {"relationship": "Uncle/Aunt-Nephew/Niece", "meiotic_distance": 3, "expected_sharing": 25.0},
    {"relationship": "First cousins", "meiotic_distance": 4, "expected_sharing": 12.5},
    {"relationship": "First cousins once removed", "meiotic_distance": 5, "expected_sharing": 6.25},
    {"relationship": "Second cousins", "meiotic_distance": 6, "expected_sharing": 3.125},
    {"relationship": "Second cousins once removed", "meiotic_distance": 7, "expected_sharing": 1.563},
    {"relationship": "Third cousins", "meiotic_distance": 8, "expected_sharing": 0.781},
    {"relationship": "Third cousins once removed", "meiotic_distance": 9, "expected_sharing": 0.391},
    {"relationship": "Fourth cousins", "meiotic_distance": 10, "expected_sharing": 0.195},
]

# Convert to DataFrame
sharing_df = pd.DataFrame(relationship_sharing)

# Display the table
display(sharing_df)

# Save to CSV
sharing_path = os.path.join(results_directory, "expected_ibd_sharing.csv")
sharing_df.to_csv(sharing_path, index=False)
print(f"Saved expected sharing data to {sharing_path}")

In [None]:
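# Visualize expected sharing by meiotic distance
# Note: three relationships share meiotic distance 2 (50%, 25%, 25%), so seaborn's
# default mean estimator draws that bar at their average (about 33.3%)
plt.figure(figsize=(12, 6))
sns.barplot(data=sharing_df, x='meiotic_distance', y='expected_sharing')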
plt.title('Expected IBD Sharing by Meiotic Distance')
plt.xlabel('Meiotic Distance')
plt.ylabel('Expected Sharing (%)')
plt.grid(alpha=0.3)

# Add relationship labels
for i, row in sharing_df.iterrows():
    if i % 2 == 0:  # Skip every other label for clarity
        plt.text(row['meiotic_distance'], row['expected_sharing'] + 2, 
                row['relationship'], ha='center', va='bottom', 
                rotation=45, fontsize=8)

plt.tight_layout()
plt.savefig(os.path.join(results_directory, "expected_ibd_by_meiotic_distance.png"), dpi=300, bbox_inches='tight')
plt.show()

## Summary

In this lab, you've learned how to:

1. **Create pedigree structures** using both R and Python
2. **Convert these structures** to standard formats for genetic simulation
3. **Visualize pedigrees** using NetworkX
4. **Analyze relationships** within pedigrees
5. **Tabulate expected IBD sharing** for different relationships

The files you've created will be used in Lab 8 (Simulating Data with Ped-Sim) and Lab 10 (Simulating Data with msprime) to generate simulated genetic data with known relationship structures. This will allow you to evaluate IBD detection methods and understand the patterns of genetic sharing between different types of relatives.