# Lab 10: 3D Genome Structure & RNA Folding

This notebook contains solutions for all exercises covering:
- **Exercise A**: RNA Secondary Structure with Nussinov Algorithm
- **Exercise B**: Hi-C Contact Matrix Analysis
- **Exercise C**: TAD Detection, Loop Detection & Annotation
- **Exercise D**: GWAS SNPs to Genes Mapping using Hi-C
- **Bonus 1**: Tiny VAE for Hi-C Denoising
- **Bonus 2**: RNA Pseudoknot Detection
- **Bonus 3**: Microbiome Co-occurrence Network

---
## Exercise A: RNA Secondary Structure with Nussinov Algorithm

**Background:**
RNA molecules fold into complex secondary structures through base pairing. The Nussinov algorithm is a classic dynamic programming approach to predict the secondary structure by maximizing the number of base pairs.

**Goals:**
1. Implement the Nussinov algorithm for RNA folding
2. Perform backtracking to obtain dot-bracket notation
3. Visualize the secondary structure as an arc diagram
4. Generate 3D helix coordinates and visualize with py3Dmol

**Allowed base pairs:**
- Watson-Crick: A-U, U-A, G-C, C-G
- Wobble: G-U, U-G

In [None]:
# Required libraries for Exercise A
import numpy as np
import math
import matplotlib.pyplot as plt
import py3Dmol

### Step 1: Define an RNA Sequence

We'll use a 49-nucleotide RNA sequence as our example. RNA uses the bases A (Adenine), U (Uracil), G (Guanine), and C (Cytosine).

In [None]:
# Example RNA sequence (its length is printed below)
seq = "GCAUUGGCUAGCUAGGCUAUGCUAUGGCUAGCUAUGGCAUAGCUAUGCU"
print(f"RNA Sequence ({len(seq)} nt): {seq}")

### Step 2: Implement the Nussinov Algorithm

The Nussinov algorithm uses dynamic programming to find the maximum number of base pairs:

1. **Base pairing function**: Checks if two nucleotides can form a valid pair
2. **DP table filling**: For each subsequence, compute the maximum number of pairs considering:
   - Position `i` is unpaired
   - Position `j` is unpaired  
   - Positions `i` and `j` pair together
   - The sequence is split (bifurcation)
3. **Backtracking**: Recover the actual structure from the DP table
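
As a sanity check on the recurrence, here is a minimal top-down sketch that only counts pairs (no backtracking). The helper name `max_pairs` and the toy sequence are illustrative assumptions, not part of the lab solution; the pairing rules are the ones listed above.

```python
from functools import lru_cache

# Allowed pairs: Watson-Crick plus G-U wobble (same set as in this exercise)
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_pairs(seq):
    """Top-down Nussinov: maximum number of nested base pairs in seq."""
    @lru_cache(maxsize=None)
    def best(i, j):
        if i >= j:                                   # empty or single-base interval
            return 0
        score = max(best(i + 1, j), best(i, j - 1))  # i unpaired, or j unpaired
        if (seq[i], seq[j]) in PAIRS:                # i pairs with j
            score = max(score, best(i + 1, j - 1) + 1)
        for t in range(i + 1, j):                    # bifurcation at t
            score = max(score, best(i, t) + best(t + 1, j))
        return score
    return best(0, len(seq) - 1)

toy = "GGGAAAUCC"       # hypothetical toy sequence, not from the lab data
print(max_pairs(toy))   # 3
```

The toy sequence has a single U, so at most one A can pair and three pairs is the upper bound; one optimal nesting is `(((...)))`.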

In [None]:
def can_pair(a, b):
    """
    Check if two nucleotides can form a base pair.
    Allows Watson-Crick pairs (A-U, G-C) and wobble pairs (G-U).
    """
    valid_pairs = {
        ("A", "U"), ("U", "A"),  # Watson-Crick
        ("G", "C"), ("C", "G"),  # Watson-Crick
        ("G", "U"), ("U", "G")   # Wobble pairs
    }
    return (a, b) in valid_pairs


def nussinov(seq):
    """
    Nussinov dynamic programming algorithm for RNA secondary structure.
    
    The DP recurrence is:
        dp[i,j] = max of:
            - dp[i+1, j]           (i is unpaired)
            - dp[i, j-1]           (j is unpaired)
            - dp[i+1, j-1] + 1     (i,j pair, if valid)
            - dp[i,t] + dp[t+1,j]  (bifurcation at t)
    
    Returns:
        dp: N x N matrix where dp[i,j] = max pairs in subsequence seq[i:j+1]
    """
    n = len(seq)
    dp = np.zeros((n, n), dtype=int)

    # Fill DP table diagonally (by increasing subsequence length k)
    for k in range(1, n):  # k = j - i (subsequence length - 1)
        for i in range(n - k):
            j = i + k
            
            # Option 1: i is unpaired
            best = dp[i+1, j]
            
            # Option 2: j is unpaired
            best = max(best, dp[i, j-1])
            
            # Option 3: i and j form a base pair
            if can_pair(seq[i], seq[j]):
                best = max(best, dp[i+1, j-1] + 1)
            
            # Option 4: bifurcation - split the sequence
            for t in range(i+1, j):
                best = max(best, dp[i, t] + dp[t+1, j])
            
            dp[i, j] = best

    return dp


def backtrack(i, j, seq, dp, struct):
    """
    Backtrack through the DP table to recover the secondary structure.
    Modifies 'struct' list in place with '(' and ')' symbols.
    """
    if i >= j:
        return

    # Case 1: i is unpaired - move to subproblem [i+1, j]
    if dp[i, j] == dp[i+1, j]:
        backtrack(i+1, j, seq, dp, struct)
        return

    # Case 2: j is unpaired - move to subproblem [i, j-1]
    if dp[i, j] == dp[i, j-1]:
        backtrack(i, j-1, seq, dp, struct)
        return

    # Case 3: i and j are paired
    if can_pair(seq[i], seq[j]) and dp[i, j] == dp[i+1, j-1] + 1:
        struct[i] = "("
        struct[j] = ")"
        backtrack(i+1, j-1, seq, dp, struct)
        return

    # Case 4: bifurcation - find the split point
    for t in range(i+1, j):
        if dp[i, j] == dp[i, t] + dp[t+1, j]:
            backtrack(i, t, seq, dp, struct)
            backtrack(t+1, j, seq, dp, struct)
            return


def get_secondary_structure(seq):
    """
    Main function: predicts RNA secondary structure using Nussinov algorithm.
    
    Returns:
        Dot-bracket notation string where:
        - '.' = unpaired nucleotide
        - '(' = paired nucleotide (5' partner)
        - ')' = paired nucleotide (3' partner)
    """
    dp = nussinov(seq)
    struct = ["."] * len(seq)
    backtrack(0, len(seq)-1, seq, dp, struct)
    return "".join(struct)

### Step 3: Run the Algorithm and Get the Secondary Structure

Let's apply the Nussinov algorithm to our RNA sequence and obtain the predicted secondary structure in dot-bracket notation.
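
A dot-bracket string is only meaningful if every `(` is closed by a later `)`. The following is a small sketch of such a check; the function name `is_valid_dot_bracket` is an assumption introduced here for illustration.

```python
def is_valid_dot_bracket(structure):
    """Return True if the string is balanced dot-bracket notation."""
    depth = 0
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # a ')' with no matching '(' before it
                return False
        elif ch != ".":
            return False        # unexpected symbol
    return depth == 0           # every '(' must have been closed

print(is_valid_dot_bracket("((..))."))  # True
print(is_valid_dot_bracket("())("))     # False
```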

In [None]:
# Run the Nussinov algorithm on our sequence
structure = get_secondary_structure(seq)

# Display results
print("Sequence:     ", seq)
print("Dot-bracket:  ", structure)
print(f"\nSequence length: {len(seq)} nucleotides")
print(f"Number of base pairs: {structure.count('(')}")

# Extract base pair positions for visualization
def extract_pairs(structure):
    """Extract (i, j) pairs from dot-bracket notation."""
    pairs = []
    stack = []
    for i, char in enumerate(structure):
        if char == '(':
            stack.append(i)
        elif char == ')':
            if stack:
                j = stack.pop()
                pairs.append((j, i))
    return pairs

pairs = extract_pairs(structure)
print(f"\nBase pairs: {pairs[:10]}{'...' if len(pairs) > 10 else ''}")

### Step 4: Visualize Secondary Structure as Arc Diagram

An arc diagram shows the RNA sequence as a line with arcs connecting paired bases. This is a common way to visualize RNA secondary structure.

In [None]:
def plot_arc_diagram(seq, structure, pairs, figsize=(14, 5)):
    """
    Plot RNA secondary structure as an arc diagram.
    
    - Sequence displayed along the x-axis
    - Arcs connect paired bases above the sequence
    - Colors indicate base type
    """
    fig, ax = plt.subplots(figsize=figsize)
    
    n = len(seq)
    
    # Color scheme for nucleotides
    base_colors = {'A': '#FF6B6B', 'U': '#4ECDC4', 'G': '#45B7D1', 'C': '#96CEB4'}
    
    # Plot the sequence along the x-axis
    for i, base in enumerate(seq):
        color = base_colors.get(base, 'gray')
        ax.text(i, -0.5, base, ha='center', va='top', fontsize=8, 
                fontweight='bold', color=color)
        
        # Mark paired vs unpaired
        if structure[i] != '.':
            ax.plot(i, 0, 'o', color=color, markersize=6)
        else:
            ax.plot(i, 0, 'o', color='lightgray', markersize=4)
    
    # Draw arcs for base pairs
    from matplotlib.patches import Arc  # explicit import of the patch class
    for i, j in pairs:
        # Arc center and dimensions
        center = (i + j) / 2
        width = j - i
        height = width / 2  # Full ellipse height; the arc peaks at height / 2
        
        # Create arc (upper half of an ellipse)
        arc = Arc(
            (center, 0), width, height,
            angle=0, theta1=0, theta2=180,
            color='steelblue', linewidth=1.5, alpha=0.7
        )
        ax.add_patch(arc)
    
    # Formatting
    ax.set_xlim(-1, n)
    ax.set_ylim(-2, max(n/4, 10))
    ax.set_xlabel('Position', fontsize=10)
    ax.set_title(f'RNA Secondary Structure Arc Diagram ({len(pairs)} base pairs)', fontsize=12)
    ax.axhline(0, color='gray', linewidth=0.5, alpha=0.3)
    ax.set_yticks([])
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    
    plt.tight_layout()
    plt.show()

# Visualize the secondary structure
plot_arc_diagram(seq, structure, pairs)

### Step 5: Generate 3D Helix Coordinates

Base-paired (double-stranded) regions of RNA typically adopt an A-form helix. We'll create a simplified 3D model that places every nucleotide on a single helical path, where:
- Each nucleotide is represented by a single point (phosphate position)
- Paired nucleotides have a tighter radius (closer to the helix axis)
- Unpaired nucleotides have a larger radius (extend outward)
- The helix rises and twists along the z-axis
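
To make the geometry concrete, the sketch below works out what the default `rise` and `twist` values used in this step imply for one helical turn. The helper `helix_point` is a hypothetical name used only for illustration.

```python
import math

rise = 2.8     # Å per nucleotide (default used in this step)
twist = 32.7   # degrees per nucleotide (default used in this step)

nt_per_turn = 360 / twist        # nucleotides needed for one full turn
pitch = nt_per_turn * rise       # axial distance covered by one turn (Å)
print(f"{nt_per_turn:.1f} nt per turn, pitch = {pitch:.1f} Å")

def helix_point(i, r, rise=rise, twist=twist):
    """Position of nucleotide i on a helix of radius r around the z-axis."""
    theta = math.radians(i * twist)
    return (r * math.cos(theta), r * math.sin(theta), i * rise)

print(helix_point(0, 6))   # (6.0, 0.0, 0.0)
```

With these values one turn takes about 11 nucleotides and spans roughly 31 Å along the axis.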

In [None]:
def generate_A_form_helix(seq, structure, rise=2.8, twist=32.7):
    """
    Build a simplified A-form RNA helix model.
    
    Parameters:
        seq: RNA sequence string
        structure: Dot-bracket structure string
        rise: Distance between consecutive nucleotides along z-axis (Å)
        twist: Rotation angle between consecutive nucleotides (degrees)
    
    Returns:
        List of (x, y, z) coordinates for each nucleotide
    """
    angle = 0
    coords = []

    for i, nt in enumerate(seq):
        # Paired nucleotides are closer to helix axis (r=6)
        # Unpaired nucleotides extend outward (r=10)
        radius = 6 if structure[i] in '()' else 10
        
        # Calculate x, y from helical angle
        x = radius * math.cos(math.radians(angle))
        y = radius * math.sin(math.radians(angle))
        z = i * rise  # Linear rise along z-axis
        
        coords.append((x, y, z))
        angle += twist  # Rotate for next nucleotide

    return coords


def coords_to_pdb(seq, coords, pairs):
    """
    Convert coordinates to PDB format with connectivity.
    
    Includes:
    - ATOM records for each nucleotide (as phosphate P atoms)
    - CONECT records to draw the backbone chain
    - CONECT records to draw base pair connections
    """
    pdb_lines = []
    
    # Add ATOM records
    for i, (x, y, z) in enumerate(coords, start=1):
        # PDB ATOM format: columns must be precise
        line = f"ATOM  {i:5d}  P   {seq[i-1]:>3s} A{i:4d}    {x:8.3f}{y:8.3f}{z:8.3f}  1.00  0.00           P\n"
        pdb_lines.append(line)
    
    # Add backbone CONECT records (connect consecutive nucleotides)
    for i in range(1, len(coords)):
        pdb_lines.append(f"CONECT{i:5d}{i+1:5d}\n")
    
    # Add base pair CONECT records
    for i, j in pairs:
        pdb_lines.append(f"CONECT{i+1:5d}{j+1:5d}\n")
    
    pdb_lines.append("END\n")
    return "".join(pdb_lines)


# Generate 3D coordinates
coords = generate_A_form_helix(seq, structure)
print(f"Generated {len(coords)} 3D coordinates")
print(f"First 3 coordinates: {coords[:3]}")
print(f"Z-range: {coords[0][2]:.1f} to {coords[-1][2]:.1f} Å")

### Step 6: 3D Visualization with py3Dmol

We'll use py3Dmol to create an interactive 3D visualization of our RNA helix model. The backbone is shown as sticks connecting consecutive nucleotides, and base pairs are shown as connections between paired positions.
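
PDB is a fixed-width format, so each field is identified by its column position rather than by a separator. As a hedged illustration, the sketch below builds one ATOM record with the same layout as `coords_to_pdb` and reads the fields back by slicing; the coordinate values are made up.

```python
# One ATOM record built with the same fixed-width layout as coords_to_pdb
i, base, x, y, z = 1, "G", 6.0, 0.0, 0.0   # illustrative values
line = (f"ATOM  {i:5d}  P   {base:>3s} A{i:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}  1.00  0.00           P")

# PDB columns are 1-based in the format description,
# so the Python slices below are shifted by one
record = {
    "serial":  int(line[6:11]),
    "atom":    line[12:16].strip(),
    "residue": line[17:20].strip(),
    "chain":   line[21],
    "resseq":  int(line[22:26]),
    "xyz":     (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    "element": line[76:78].strip(),
}
print(record)
```

If a field drifts by even one column, most viewers will misread or silently drop the atom, which is why the format string spells out every space.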

In [None]:
# Generate PDB string with connectivity information
pdb_str = coords_to_pdb(seq, coords, pairs)

# Preview the PDB content (first 10 lines)
print("PDB content preview:")
for line in pdb_str.split('\n')[:10]:
    print(line)
print("...")

In [None]:
# Create interactive 3D visualization with py3Dmol
view = py3Dmol.view(width=600, height=500)
view.addModel(pdb_str, 'pdb')

# Style: spheres for atoms, sticks for connections
view.setStyle({'sphere': {'radius': 0.8, 'color': 'steelblue'}})
view.addStyle({'stick': {'radius': 0.2, 'color': 'gray'}})

# Set background and zoom
view.setBackgroundColor('white')
view.zoomTo()

# Display the interactive viewer in the notebook
view.show()

### Alternative: 3D Visualization with Matplotlib

If py3Dmol doesn't display properly, here's a matplotlib-based 3D plot as backup:

In [None]:
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the '3d' projection on older Matplotlib)

def plot_rna_3d_matplotlib(coords, pairs, seq, structure):
    """
    Plot RNA 3D structure using matplotlib.
    Shows backbone as a line and base pairs as connections.
    """
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')
    
    # Extract coordinates
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]
    zs = [c[2] for c in coords]
    
    # Color based on pairing status
    colors = ['steelblue' if s in '()' else 'lightcoral' for s in structure]
    
    # Plot backbone
    ax.plot(xs, ys, zs, 'gray', alpha=0.5, linewidth=1, label='Backbone')
    
    # Plot nucleotides
    ax.scatter(xs, ys, zs, c=colors, s=50, alpha=0.8)
    
    # Plot base pairs as lines
    for i, j in pairs:
        ax.plot([xs[i], xs[j]], [ys[i], ys[j]], [zs[i], zs[j]], 
                'green', alpha=0.6, linewidth=1.5)
    
    # Add labels for a few nucleotides
    for idx in [0, len(seq)//2, len(seq)-1]:
        ax.text(xs[idx], ys[idx], zs[idx], f'{seq[idx]}{idx+1}', fontsize=8)
    
    ax.set_xlabel('X (Å)')
    ax.set_ylabel('Y (Å)')
    ax.set_zlabel('Z (Å)')
    ax.set_title(f'RNA 3D Helix Model\n{len(pairs)} base pairs (green lines)')
    
    # Add legend
    from matplotlib.lines import Line2D
    legend_elements = [
        Line2D([0], [0], marker='o', color='w', markerfacecolor='steelblue', 
               markersize=8, label='Paired'),
        Line2D([0], [0], marker='o', color='w', markerfacecolor='lightcoral', 
               markersize=8, label='Unpaired'),
        Line2D([0], [0], color='green', linewidth=2, label='Base pair')
    ]
    ax.legend(handles=legend_elements, loc='upper left')
    
    plt.tight_layout()
    plt.show()

# Create matplotlib 3D visualization
plot_rna_3d_matplotlib(coords, pairs, seq, structure)

### Exercise A Summary

We implemented the **Nussinov algorithm** for RNA secondary structure prediction:

1. **Input**: A 49-nucleotide RNA sequence
2. **Algorithm**: Dynamic programming to maximize base pairs (O(n³) complexity)
3. **Output**: Dot-bracket notation showing paired positions
4. **Visualizations**: 
   - **Arc diagram** (2D): Shows base pair connections as arcs
   - **Helix model** (3D): Shows spatial arrangement with paired nucleotides closer to the helix axis

**Key concepts:**
- Watson-Crick pairs (A-U, G-C) and wobble pairs (G-U) are allowed
- The algorithm maximizes the number of pairs, not thermodynamic stability
- Real RNA structure prediction uses energy-based models (e.g., ViennaRNA)
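
Because the recurrence only counts pairs, it will happily pair two adjacent bases, which is sterically impossible in a real hairpin. One common refinement is to require a minimum number of unpaired bases between pairing partners. The sketch below adds a hypothetical `min_loop` parameter to a pair-counting variant; it is an illustration under that assumption, not the lab's solution and not an energy model.

```python
def nussinov_count(seq, min_loop=0):
    """Max number of nested pairs, allowing (i, j) only if j - i > min_loop."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]     # lower triangle stays 0 (empty intervals)
    for k in range(1, n):                # k = j - i
        for i in range(n - k):
            j = i + k
            best = max(dp[i + 1][j], dp[i][j - 1])          # i or j unpaired
            if (seq[i], seq[j]) in pairs and j - i > min_loop:
                best = max(best, dp[i + 1][j - 1] + 1)      # i pairs with j
            for t in range(i + 1, j):                       # bifurcation
                best = max(best, dp[i][t] + dp[t + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]

toy = "GCGCAAAA"   # hypothetical toy sequence
print(nussinov_count(toy, min_loop=0))   # 2 (two adjacent G-C pairs)
print(nussinov_count(toy, min_loop=3))   # 0 (no G and C are far enough apart)
```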

---
## Exercise B: Hi-C Contact Matrix Analysis

**Goals:**
1. Inspect a Hi-C Contact Matrix
2. Apply ICE or KR Normalization (Simplified)
3. Detect Loops with a Simple Algorithm
4. PCA for A/B Compartment Detection

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
def generate_toy_hic_matrix(N=100, decay=0.05, loops=None, noise_level=0.1, seed=None):
    """
    Generate a toy Hi-C contact matrix.
    
    Parameters:
        N (int): number of bins
        decay (float): exponential decay factor with genomic distance
        loops (list of tuple): list of (i,j) positions for artificial loops
        noise_level (float): standard deviation of Gaussian noise
        seed (int): random seed for reproducibility
    
    Returns:
        mat (np.ndarray): N x N Hi-C contact matrix
    """
    if seed is not None:
        np.random.seed(seed)
    
    # Distance-dependent decay
    mat = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            mat[i, j] = np.exp(-decay * abs(i-j))
    
    # Add loops
    if loops:
        for i, j in loops:
            mat[i, j] += 2
            mat[j, i] += 2  # symmetric
    
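    # Add TAD-like domains: five equal diagonal blocks with extra contacts
    for start in range(0, N, N//5):
        end = start + N//5
        mat[start:end, start:end] += 0.5
    
    # Add Gaussian noise, symmetrized so the contact matrix stays symmetric
    noise = np.random.normal(0, noise_level, size=(N, N))
    mat += (noise + noise.T) / 2
    mat[mat<0] = 0  # no negative contacts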
    
    return mat

In [None]:
def plot_hic_matrix(mat, title="Hi-C Contact Map", cmap='Reds'):
    """Plot heatmap of Hi-C matrix"""
    plt.figure(figsize=(8, 8))
    plt.imshow(mat, origin='lower', cmap=cmap)
    plt.colorbar(label="Contact strength")
    plt.title(title)
    plt.xlabel("Bin")
    plt.ylabel("Bin")
    plt.tight_layout()
    plt.show()

In [None]:
# Generate synthetic Hi-C matrix
toy_hic = generate_toy_hic_matrix(
    N=100,
    loops=[(20, 70), (40, 60), (10, 90)],
    noise_level=0.05,
    seed=42
)

print(f"Hi-C matrix shape: {toy_hic.shape}")
print(f"Min value: {toy_hic.min():.4f}, Max value: {toy_hic.max():.4f}")

plot_hic_matrix(toy_hic, title="Synthetic Toy Hi-C Contact Map")

### ICE Normalization (Simplified)
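
The idea behind iterative correction is to rescale rows and columns together until every bin has the same total contact count. The toy sketch below illustrates this on made-up data: a symmetric matrix is distorted by per-bin biases and then balanced. The variable names and the bias values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a symmetric "true" map distorted by per-bin biases
true_map = rng.random((6, 6)) + 1.0
true_map = (true_map + true_map.T) / 2
bin_bias = np.array([0.5, 1.0, 2.0, 1.5, 0.8, 1.2])
observed = true_map * np.outer(bin_bias, bin_bias)

balanced = observed.copy()
for _ in range(100):
    row_sums = balanced.sum(axis=1)
    correction = row_sums.mean() / row_sums        # > 1 for under-covered bins
    balanced *= np.outer(correction, correction)   # rescale rows and columns together

print(np.round(observed.sum(axis=1), 2))   # unequal row sums before balancing
print(np.round(balanced.sum(axis=1), 2))   # equal row sums after balancing
```

Applying the same factor to a row and its matching column keeps the matrix symmetric throughout.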

In [None]:
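def ice_normalization(mat, max_iter=100, tolerance=1e-5):
    """
    Simplified ICE (Iterative Correction and Eigenvector decomposition) normalization.
    
    Repeatedly rescales rows and columns until all row sums are equal.
    Assumes a symmetric, non-negative matrix.
    
    Returns:
        mat: balanced matrix
        bias: accumulated per-bin correction factors, so that
              balanced ≈ original * outer(bias, bias)
    """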
    mat = mat.copy().astype(float)
    n = mat.shape[0]
    
    # Replace zeros with small value to avoid division issues
    mat[mat == 0] = 1e-10
    
    bias = np.ones(n)
    
    for iteration in range(max_iter):
        # Compute row sums
        row_sums = mat.sum(axis=1)
        
        # Avoid division by zero
        row_sums[row_sums == 0] = 1
        
        # Update bias
        target_sum = np.mean(row_sums)
        correction = target_sum / row_sums
        
        # Apply correction
        mat = mat * np.outer(correction, correction)
        bias *= correction
        
        # Check convergence
        if np.max(np.abs(correction - 1)) < tolerance:
            print(f"ICE converged after {iteration + 1} iterations")
            break
    
    return mat, bias


# Apply ICE normalization
ice_normalized, bias = ice_normalization(toy_hic)
plot_hic_matrix(ice_normalized, title="ICE Normalized Hi-C")

### PCA for A/B Compartment Detection

**A/B Compartments** are large-scale chromatin organization patterns:
- **Compartment A**: Gene-rich, active, open chromatin
- **Compartment B**: Gene-poor, inactive, closed chromatin

In Hi-C data, compartments create a **checkerboard pattern** where same-type regions have higher contact frequencies. We detect them using PCA on the O/E (observed/expected) correlation matrix.
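
The O/E step can be sketched on its own: each entry is divided by the mean of its diagonal, that is, the average contact at the same genomic distance. The function name `observed_over_expected` and the toy decay rate are assumptions for this illustration.

```python
import numpy as np

def observed_over_expected(mat):
    """Divide each entry by the mean of its diagonal (same genomic distance)."""
    n = mat.shape[0]
    expected = np.zeros_like(mat, dtype=float)
    idx = np.arange(n)
    for d in range(n):
        mean_d = np.diag(mat, d).mean()            # average contact at distance d
        expected[idx[: n - d], idx[d:]] = mean_d   # upper diagonal
        expected[idx[d:], idx[: n - d]] = mean_d   # mirrored lower diagonal
    expected[expected == 0] = np.nan               # avoid division by zero
    return mat / expected

# Toy matrix with pure exponential distance decay (hypothetical parameters)
i, j = np.indices((8, 8))
toy = np.exp(-0.3 * np.abs(i - j))
oe = observed_over_expected(toy)
print(np.allclose(oe, 1.0))   # True: nothing is left once the decay is removed
```

On real or compartmentalized data the O/E matrix is not flat; what remains after removing the decay is the checkerboard signal that PCA picks up.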

First, let's generate a Hi-C matrix with clear compartment structure:

In [None]:
from sklearn.decomposition import PCA

def generate_compartment_hic(N=100, n_compartments=6, compartment_strength=0.5, 
                              decay=0.02, noise_level=0.02, seed=42):
    """
    Generate a Hi-C matrix with clear A/B compartment structure.
    
    Compartments create a checkerboard pattern where:
    - Same-type regions (A-A or B-B) have higher contact frequencies
    - Different-type regions (A-B) have lower contact frequencies
    """
    np.random.seed(seed)
    
    # Create alternating compartment labels: A=1, B=-1
    compartment_size = N // n_compartments
    compartment_labels = np.zeros(N)
    for i in range(N):
        region = i // compartment_size
        compartment_labels[i] = 1 if region % 2 == 0 else -1
    
    # Build the contact matrix
    mat = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            # Distance-dependent decay (polymer physics)
            distance_term = np.exp(-decay * abs(i - j))
            
            # Compartment term: same compartment = boost, different = reduce
            # This creates the checkerboard pattern
            same_compartment = compartment_labels[i] * compartment_labels[j]
            compartment_term = 1 + same_compartment * compartment_strength
            
            mat[i, j] = distance_term * compartment_term
    
    # Add noise and ensure symmetry
    noise = np.random.normal(0, noise_level, size=(N, N))
    noise = (noise + noise.T) / 2
    mat = mat + noise
    mat[mat < 0] = 0
    
    return mat, compartment_labels


# Generate Hi-C with compartment structure
compartment_hic, true_labels = generate_compartment_hic(
    N=100, 
    n_compartments=6,  # 16-bin blocks alternating A/B; leftover bins 96-99 start one more A block
    compartment_strength=0.6,
    decay=0.015,
    noise_level=0.02
)

print("Generated Hi-C matrix with alternating A/B compartment blocks")
print(f"True labels: {int(np.sum(true_labels > 0))} bins in A, {int(np.sum(true_labels < 0))} bins in B")

# Visualize - should show checkerboard pattern
plot_hic_matrix(compartment_hic, title="Hi-C Matrix with A/B Compartments (Checkerboard Pattern)")

In [None]:
def detect_compartments(mat):
    """
    Detect A/B compartments using PCA on O/E correlation matrix.
    
    Steps:
    1. Compute expected contacts at each genomic distance
    2. Calculate O/E (observed/expected) matrix - removes distance decay
    3. Compute correlation matrix of O/E rows
    4. PC1 of correlation matrix separates A and B compartments
    """
    n = mat.shape[0]
    
    # Compute expected contacts at each distance
    expected = np.zeros_like(mat)
    for i in range(n):
        for j in range(n):
            dist = abs(i - j)
            diag_vals = np.diag(mat, dist)
            expected[i, j] = np.mean(diag_vals) if len(diag_vals) > 0 else 1
    
    expected[expected == 0] = 1e-10
    
    # O/E matrix removes the distance decay effect
    # Use log2 for better distribution
    oe_mat = np.log2(mat / expected + 1e-10)
    oe_mat = np.nan_to_num(oe_mat)
    
    # Correlation matrix: bins in the same compartment will be correlated
    corr_mat = np.corrcoef(oe_mat)
    corr_mat = np.nan_to_num(corr_mat)
    
    # PCA: PC1 captures the compartment signal
    pca = PCA(n_components=2)
    pca.fit(corr_mat)
    pc1 = pca.components_[0]
    
    # Assign compartments based on PC1 sign
    compartments = np.where(pc1 > 0, 'A', 'B')
    
    return pc1, compartments, corr_mat, oe_mat


# Detect compartments using PCA
pc1, compartments, corr_mat, oe_mat = detect_compartments(compartment_hic)

print(f"Detected compartments: A={np.sum(compartments=='A')}, B={np.sum(compartments=='B')}")

In [None]:
# Comprehensive visualization of compartment detection
fig, axes = plt.subplots(2, 3, figsize=(15, 9))

# Row 1: Hi-C, O/E, and Correlation matrices
ax1 = axes[0, 0]
im1 = ax1.imshow(compartment_hic, cmap='Reds', origin='lower')
ax1.set_title('Original Hi-C Matrix')
ax1.set_xlabel('Bin')
ax1.set_ylabel('Bin')
plt.colorbar(im1, ax=ax1, shrink=0.8)

ax2 = axes[0, 1]
im2 = ax2.imshow(oe_mat, cmap='RdBu_r', origin='lower', vmin=-2, vmax=2)
ax2.set_title('O/E Matrix (log2)')
ax2.set_xlabel('Bin')
plt.colorbar(im2, ax=ax2, shrink=0.8)

ax3 = axes[0, 2]
im3 = ax3.imshow(corr_mat, cmap='RdBu_r', vmin=-1, vmax=1, origin='lower')
ax3.set_title('Correlation Matrix\n(Checkerboard = Compartments)')
ax3.set_xlabel('Bin')
plt.colorbar(im3, ax=ax3, shrink=0.8)

# Row 2: PC1 signal, compartment track, and comparison with ground truth
ax4 = axes[1, 0]
ax4.plot(pc1, 'k-', linewidth=1)
ax4.axhline(0, color='gray', linestyle='--', alpha=0.5)
ax4.fill_between(range(len(pc1)), pc1, where=pc1 > 0, alpha=0.5, color='red', label='A')
ax4.fill_between(range(len(pc1)), pc1, where=pc1 <= 0, alpha=0.5, color='blue', label='B')
ax4.set_xlabel('Bin')
ax4.set_ylabel('PC1 Value')
ax4.set_title('PC1 Signal (Compartment Eigenvector)')
ax4.legend(loc='upper right')

ax5 = axes[1, 1]
# Predicted compartments
pred_colors = [1 if c == 'A' else -1 for c in compartments]
ax5.imshow([pred_colors], cmap='RdBu_r', aspect='auto', vmin=-1, vmax=1)
ax5.set_title('Predicted Compartments')
ax5.set_xlabel('Bin')
ax5.set_yticks([0])
ax5.set_yticklabels(['Predicted'])

ax6 = axes[1, 2]
# Compare with ground truth
ax6.imshow([true_labels], cmap='RdBu_r', aspect='auto', vmin=-1, vmax=1)
ax6.set_title('Ground Truth Compartments')
ax6.set_xlabel('Bin')
ax6.set_yticks([0])
ax6.set_yticklabels(['True'])

plt.tight_layout()
plt.show()

# Calculate accuracy
# Note: PC1 sign might be flipped, so check both orientations
pred_binary = np.array([1 if c == 'A' else -1 for c in compartments])
accuracy1 = np.mean(pred_binary == true_labels)
accuracy2 = np.mean(pred_binary == -true_labels)  # Check if sign is flipped
accuracy = max(accuracy1, accuracy2)

print(f"\nCompartment detection accuracy: {accuracy:.1%}")
print(f"(Note: PC1 sign is arbitrary, so we check both orientations)")

---
## Exercise C: TAD Detection, Loop Detection & Annotation

**Goals:**
1. Build a Chromatin Contact Network
2. Detect TAD boundaries using insulation score
3. Detect loops with local maximum algorithm
4. Annotate loops as intra- or inter-TAD
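
For goal 2, one common formulation of the insulation score averages the contacts between the bins just upstream and just downstream of each position; a boundary shows up as a dip because few contacts cross it. The sketch below uses that formulation on a made-up two-domain matrix. The function name `cross_boundary_insulation` is hypothetical, and this is not necessarily the variant implemented in this notebook.

```python
import numpy as np

def cross_boundary_insulation(mat, window=3):
    """Mean contact between the `window` bins left of i and the `window` bins from i on."""
    n = mat.shape[0]
    score = np.full(n, np.nan)             # edges stay NaN (window does not fit)
    for i in range(window, n - window):
        score[i] = mat[i - window:i, i:i + window].mean()
    return score

# Hypothetical toy matrix: two 10-bin domains on a weak background
toy = np.full((20, 20), 0.1)
toy[:10, :10] += 1.0
toy[10:, 10:] += 1.0

score = cross_boundary_insulation(toy, window=3)
print(int(np.nanargmin(score)))   # 10
```

The minimum falls at bin 10, the first bin of the second domain, where the upstream and downstream windows lie entirely in different domains.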

In [None]:
from scipy.ndimage import maximum_filter

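def compute_insulation_score(mat, window=5):
    """
    Compute a simple insulation score for TAD boundary detection.
    
    For each bin i, sum the contacts in the square window centered on the
    diagonal at i. A window that straddles a TAD boundary contains fewer
    contacts, so boundaries appear as local minima of the score.
    """
    n = mat.shape[0]
    insulation = np.zeros(n)
    for i in range(n):
        start = max(0, i-window)
        end = min(n, i+window+1)
        submat = mat[start:end, start:end]
        insulation[i] = submat.sum()
    # No sign flip: lower values = fewer contacts across bin i = boundaries
    return insulation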


def detect_tad_boundaries(insulation):
    """Detect TAD boundaries as local minima in insulation score."""
    n = len(insulation)
    boundaries = [i for i in range(1, n-1)
                  if insulation[i] < insulation[i-1] and insulation[i] < insulation[i+1]]
    return boundaries


def get_tad_regions(boundaries, n_bins):
    """Convert TAD boundaries to a list of TAD regions (start, end)."""
    regions = []
    start = 0
    for b in boundaries:
        regions.append((start, b))
        start = b
    regions.append((start, n_bins))
    return regions

In [None]:
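def detect_loops(mat, window=3, threshold_quantile=0.995, min_distance=2):
    """
    Detect loops as local maxima above a quantile threshold.
    
    Only the upper triangle with j - i >= min_distance is searched, so the
    diagonal (self-contacts) is skipped and each symmetric loop is reported once.
    """
    threshold = np.quantile(mat, threshold_quantile)
    local_max = (mat == maximum_filter(mat, size=(window, window)))
    off_diagonal = np.triu(np.ones(mat.shape, dtype=bool), k=min_distance)
    candidate_positions = np.argwhere(local_max & (mat >= threshold) & off_diagonal)
    candidate_positions = sorted([(i, j, mat[i, j]) for i, j in candidate_positions], 
                                  key=lambda x: -x[2])
    return candidate_positions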


def annotate_loops_tad(loops, tad_regions):
    """Annotate each loop as intra-TAD or inter-TAD."""
    annotated = []
    for i, j, v in loops:
        region_i = next((idx for idx, (start, end) in enumerate(tad_regions) if start <= i < end), None)
        region_j = next((idx for idx, (start, end) in enumerate(tad_regions) if start <= j < end), None)
        if region_i == region_j:
            annotated.append((i, j, v, "intra-TAD"))
        else:
            annotated.append((i, j, v, "inter-TAD"))
    return annotated

In [None]:
def plot_tads_and_loops_annotated(mat, insulation, tad_boundaries, loops_annotated, top_n_loops=10):
    """Visualize Hi-C matrix with TAD boundaries and annotated loops."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    
    # Plot Hi-C with TADs and loops
    ax = axes[0]
    ax.imshow(mat, cmap='Reds', origin='lower')
    ax.set_title("Hi-C Contact Map with TADs and Loops")
    ax.set_xlabel("Bin")
    ax.set_ylabel("Bin")
    
    # Plot TAD boundaries
    for b in tad_boundaries:
        ax.axvline(b, color='blue', linestyle='--', alpha=0.7)
        ax.axhline(b, color='blue', linestyle='--', alpha=0.7)
    
    # Plot loops with colors
    color_map = {"intra-TAD": "green", "inter-TAD": "orange"}
    for i, j, _, loop_type in loops_annotated[:top_n_loops]:
        ax.plot(j, i, 'o', color=color_map[loop_type], markersize=8)
        ax.plot(i, j, 'o', color=color_map[loop_type], markersize=8)  # symmetric
    
    # Add legend
    from matplotlib.lines import Line2D
    legend_elements = [
        Line2D([0], [0], color='blue', linestyle='--', label='TAD boundary'),
        Line2D([0], [0], marker='o', color='w', markerfacecolor='green', markersize=10, label='Intra-TAD loop'),
        Line2D([0], [0], marker='o', color='w', markerfacecolor='orange', markersize=10, label='Inter-TAD loop')
    ]
    ax.legend(handles=legend_elements, loc='upper right')
    
    # Plot insulation score
    ax2 = axes[1]
    ax2.plot(insulation, 'b-')
    ax2.set_xlabel('Bin')
    ax2.set_ylabel('Insulation Score')
    ax2.set_title('Insulation Score (TAD boundaries at minima)')
    for b in tad_boundaries:
        ax2.axvline(b, color='red', linestyle='--', alpha=0.7)
    
    plt.tight_layout()
    plt.show()

In [None]:
def run_tad_loop_pipeline(hic_mat, tad_window=5, loop_window=3, loop_threshold=0.995, top_n_loops=10):
    """Full TAD and loop detection pipeline."""
    # TAD detection
    insulation = compute_insulation_score(hic_mat, window=tad_window)
    tad_boundaries = detect_tad_boundaries(insulation)
    tad_regions = get_tad_regions(tad_boundaries, hic_mat.shape[0])
    print(f"Detected {len(tad_boundaries)} TAD boundaries at bins: {tad_boundaries}")
    print(f"TAD regions: {tad_regions}")
    
    # Loop detection
    loops = detect_loops(hic_mat, window=loop_window, threshold_quantile=loop_threshold)
    annotated_loops = annotate_loops_tad(loops, tad_regions)
    
    # Print top loops
    print(f"\nTop {top_n_loops} loops (bin_i, bin_j, value, type):")
    for loop in annotated_loops[:top_n_loops]:
        print(f"  {loop}")
    
    # Count loop types
    intra = sum(1 for l in annotated_loops if l[3] == 'intra-TAD')
    inter = sum(1 for l in annotated_loops if l[3] == 'inter-TAD')
    print(f"\nLoop summary: {intra} intra-TAD, {inter} inter-TAD")
    
    # Plot
    plot_tads_and_loops_annotated(hic_mat, insulation, tad_boundaries, annotated_loops, top_n_loops)
    
    return insulation, tad_boundaries, annotated_loops


# Run the pipeline
insulation, tad_boundaries, annotated_loops = run_tad_loop_pipeline(
    toy_hic,
    tad_window=5,
    loop_window=3,
    loop_threshold=0.995,
    top_n_loops=10
)

---
## Exercise D: Map GWAS SNPs to Target Genes Using Hi-C

**Goals:**
1. Map GWAS SNPs and genes to Hi-C bins
2. Find genes in contact with each SNP using Hi-C data
3. Pathway enrichment on mapped genes
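
The mapping in goal 1 is plain integer division of a base-pair coordinate by the bin size. The sketch below shows the arithmetic for a 1 Mb toy chromosome split into 100 bins, as assumed in this exercise; the SNP and gene coordinates and the helper name `to_bin` are made up for illustration.

```python
chrom_size = 1_000_000            # toy chromosome length in bp
n_bins = 100
bin_size = chrom_size // n_bins   # 10,000 bp per bin

def to_bin(pos):
    """Integer-divide a bp coordinate by the bin size, clamped to the last bin."""
    return min(pos // bin_size, n_bins - 1)

snp_pos = 254_300                 # hypothetical SNP position
gene = (601_000, 642_500)         # hypothetical gene start/end

print(to_bin(snp_pos))                     # 25
print(to_bin(gene[0]), to_bin(gene[1]))    # 60 64
```

A gene usually spans several bins, so a SNP-to-gene contact check has to look at every bin between the gene's start bin and end bin.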

In [None]:
import requests

# Use the synthetic Hi-C matrix
hic_mat = toy_hic

# Assume bins are evenly spaced across a chromosome, e.g., 1 Mb total
chrom_size = 1_000_000
bin_size = chrom_size // hic_mat.shape[0]

print(f"Chromosome size: {chrom_size:,} bp")
print(f"Bin size: {bin_size:,} bp")
print(f"Number of bins: {hic_mat.shape[0]}")

In [None]:
def fetch_gwas_snps(trait="type 2 diabetes", n_snps=10, chrom_size=1_000_000):
    """
    Fetch top SNPs from GWAS Catalog REST API.
    Falls back to synthetic SNPs if API fails.
    """
    try:
        url = "https://www.ebi.ac.uk/gwas/rest/api/search"
        params = {"query": trait, "size": n_snps}
        r = requests.get(url, params=params, timeout=10)
        if r.status_code == 200:
            data = r.json()
            snps = []
            for hit in data.get("_embedded", {}).get("associations", []):
                if "variant" in hit and "position" in hit.get("variant", {}):
                    snps.append(hit["variant"]["position"])
            if len(snps) > 0:
                return np.array(snps[:n_snps])
    except Exception as e:
        print(f"API error: {e}")
    
    # Generate synthetic SNPs
    print("Using synthetic SNP positions.")
    np.random.seed(123)
    return np.random.randint(0, chrom_size, size=n_snps)


# Fetch SNPs
snp_positions = fetch_gwas_snps(n_snps=5, chrom_size=chrom_size)
print(f"SNP positions (bp): {snp_positions}")

In [None]:
def fetch_genes(chrom="1", start=0, end=1_000_000):
    """
    Fetch genes from ENSEMBL REST API for a chromosome region.
    Falls back to synthetic genes if API fails.
    """
    try:
        server = "https://rest.ensembl.org"
        ext = f"/overlap/region/human/{chrom}:{start}-{end}?feature=gene"
        headers = {"Content-Type": "application/json"}
        r = requests.get(server + ext, headers=headers, timeout=10)
        if r.ok:
            genes = r.json()
            simplified_genes = [{"id": g.get("external_name", g["id"]), 
                                  "start": g["start"], 
                                  "end": g["end"]} for g in genes[:20]]
            if len(simplified_genes) > 0:
                return simplified_genes
    except Exception as e:
        print(f"API error: {e}")
    
    # Generate synthetic genes spread evenly across the requested region
    print("Using synthetic genes.")
    n_genes = 15
    rng = np.random.default_rng(123)  # seeded so the fallback is reproducible
    gene_starts = np.linspace(start, end - 50000, n_genes, dtype=int)
    gene_ends = gene_starts + rng.integers(10000, 50000, size=n_genes)
    return [{"id": f"Gene{i}", "start": int(gene_starts[i]), "end": int(gene_ends[i])}
            for i in range(n_genes)]


# Fetch genes
genes = fetch_genes(end=chrom_size)
print(f"Number of genes: {len(genes)}")
print("\nFirst 5 genes:")
for g in genes[:5]:
    print(f"  {g['id']}: {g['start']:,} - {g['end']:,}")

In [None]:
def coord_to_bin(coord, bin_size, n_bins):
    """Convert a genomic coordinate (bp) to a Hi-C bin index, clamped to [0, n_bins - 1]."""
    return int(min(max(int(coord) // bin_size, 0), n_bins - 1))


# Map SNPs and genes to bins
n_bins = hic_mat.shape[0]
snp_bins = [coord_to_bin(pos, bin_size, n_bins) for pos in snp_positions]
gene_bins = {g["id"]: (coord_to_bin(g["start"], bin_size, n_bins), 
                        coord_to_bin(g["end"], bin_size, n_bins)) 
             for g in genes}

print("SNP bin assignments:")
for i, (pos, bin_idx) in enumerate(zip(snp_positions, snp_bins)):
    print(f"  SNP_{i}: position {pos:,} -> bin {bin_idx}")

print("\nGene bin assignments (first 5):")
for gene_id, (start_bin, end_bin) in list(gene_bins.items())[:5]:
    print(f"  {gene_id}: bins {start_bin}-{end_bin}")

In [None]:
def snp_to_gene_contacts(hic_mat, snp_bins, gene_bins, threshold_quantile=0.95):
    """
    For each SNP, list the genes with at least one bin whose Hi-C contact with
    the SNP's bin is at or above the given quantile of all matrix entries.
    """
    threshold = np.quantile(hic_mat, threshold_quantile)
    print(f"Contact threshold (q={threshold_quantile}): {threshold:.4f}")
    
    contacts = {}
    for snp_idx, snp_bin in enumerate(snp_bins):
        contacted_genes = []
        for gene_id, (start_bin, end_bin) in gene_bins.items():
            # Check if any bin in gene range has contact above threshold
            for gbin in range(start_bin, end_bin + 1):
                if hic_mat[snp_bin, gbin] >= threshold:
                    contacted_genes.append(gene_id)
                    break
        contacts[f"SNP_{snp_idx}"] = contacted_genes
    return contacts


# Find SNP-gene contacts
contacts = snp_to_gene_contacts(hic_mat, snp_bins, gene_bins)

print("\nSNP to gene contacts (via Hi-C):")
for snp, genes_hit in contacts.items():
    print(f"  {snp}: {len(genes_hit)} genes -> {genes_hit[:5]}{'...' if len(genes_hit) > 5 else ''}")

In [None]:
def plot_snp_gene_contacts(hic_mat, snp_bins, gene_bins, contacts, top_n_snps=5):
    """
    Visualize Hi-C matrix with SNP-gene connections.
    """
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(hic_mat, origin='lower', cmap='Reds')
    ax.set_title("Hi-C Contact Map with SNP-Gene Contacts")
    ax.set_xlabel("Bin")
    ax.set_ylabel("Bin")
    
    # Plot SNPs along diagonal
    for snp_idx, snp_bin in enumerate(snp_bins[:top_n_snps]):
        ax.plot(snp_bin, snp_bin, 'bs', markersize=10, 
                label="SNP" if snp_idx == 0 else "")
    
    # Plot gene midpoints along diagonal
    plotted_gene = False
    for gene_id, (start_bin, end_bin) in gene_bins.items():
        mid_bin = (start_bin + end_bin) // 2
        ax.plot(mid_bin, mid_bin, 'g^', markersize=6, 
                label="Gene" if not plotted_gene else "")
        plotted_gene = True
    
    # Plot lines connecting SNPs to contacted genes
    for snp_idx, snp_bin in enumerate(snp_bins[:top_n_snps]):
        snp_id = f"SNP_{snp_idx}"
        for gene_id in contacts[snp_id]:
            if gene_id in gene_bins:
                start_bin, end_bin = gene_bins[gene_id]
                gene_bin = (start_bin + end_bin) // 2
                ax.plot([snp_bin, gene_bin], [snp_bin, gene_bin], 'c-', alpha=0.5, linewidth=2)
    
    ax.legend(loc='upper right')
    plt.tight_layout()
    plt.show()


# Visualize
plot_snp_gene_contacts(hic_mat, snp_bins, gene_bins, contacts, top_n_snps=5)

### Pathway Enrichment on Mapped Genes
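
The cell below only collects the contacted genes. As a minimal sketch of what an over-representation test on such a list could look like, the example here computes a one-sided hypergeometric p-value using only the standard library. The helper `hypergeom_enrichment_p`, the pathway membership, and the gene names are all hypothetical and are not taken from any real annotation database; a real analysis would also need multiple-testing correction.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_background genes, of which n_pathway belong to the pathway."""
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical toy inputs (illustration only, not real annotations)
toy_background = {f"Gene{i}" for i in range(15)}
toy_pathway = {"Gene1", "Gene2", "Gene3", "Gene4"}
toy_contacted = {"Gene2", "Gene3", "Gene9"}

toy_overlap = toy_contacted & toy_pathway
toy_p = hypergeom_enrichment_p(
    len(toy_background), len(toy_pathway), len(toy_contacted), len(toy_overlap)
)
print(f"Overlap: {sorted(toy_overlap)}, p = {toy_p:.4f}")
```

With real data, `toy_contacted` would be replaced by the set of genes contacted by SNPs, and the background should be the set of genes that could have been contacted (for example, all genes in the Hi-C region), not the whole genome.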

In [None]:
# Collect all contacted genes
all_contacted_genes = set()
for genes_list in contacts.values():
    all_contacted_genes.update(genes_list)

print(f"Total unique genes contacted by SNPs: {len(all_contacted_genes)}")
print(f"Genes: {sorted(all_contacted_genes)}")

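# Note: this cell only collects the gene list; no enrichment test is run here.
# For real pathway enrichment, you would use tools like:
# - gseapy (GSEA in Python)
# - g:Profiler (Python client: gprofiler-official)
# - DAVID API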
print("\n(For real analysis, use pathway enrichment tools like gseapy or gprofiler)")

---
## Bonus 1: Tiny VAE for Hi-C Denoising

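This section trains a small variational autoencoder (VAE) in PyTorch on a synthetic Hi-C matrix, treating each row of the matrix as one training sample. Because the synthetic matrix is uniform random noise made symmetric, the "denoised" output is a smoothed reconstruction, not a recovery of real chromatin structure.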
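
The KL term in `vae_loss` later in this section uses the closed form for the divergence between a diagonal Gaussian and a standard normal. As a sanity check, the sketch below compares that closed form for a single latent dimension against brute-force numerical integration. The helper names `kl_closed_form` and `kl_numeric`, the integration range, and the step count are illustrative choices, not part of the notebook.

```python
import math


def kl_closed_form(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ) for one latent dimension."""
    return -0.5 * (1.0 + logvar - mu**2 - math.exp(logvar))


def kl_numeric(mu, logvar, lo=-12.0, hi=12.0, steps=100_000):
    """Same quantity via midpoint-rule integration of q(z) * log(q(z) / p(z))."""
    var = math.exp(logvar)
    dz = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        z = lo + (k + 0.5) * dz
        log_q = -0.5 * (math.log(2 * math.pi * var) + (z - mu) ** 2 / var)
        log_p = -0.5 * (math.log(2 * math.pi) + z * z)
        total += math.exp(log_q) * (log_q - log_p)
    return total * dz


mu_demo, logvar_demo = 0.5, -1.0
print(f"closed form: {kl_closed_form(mu_demo, logvar_demo):.6f}")
print(f"numeric:     {kl_numeric(mu_demo, logvar_demo):.6f}")
```

The two values should agree to several decimal places. The divergence is zero only when `mu = 0` and `logvar = 0`, which is why the KL term pulls the encoder's output toward a standard normal.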

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

# Generate synthetic Hi-C matrix for VAE
N_vae = 50
np.random.seed(42)
hic_mat_vae = np.random.rand(N_vae, N_vae)
hic_mat_vae = (hic_mat_vae + hic_mat_vae.T) / 2  # Make symmetric

hic_tensor = torch.tensor(hic_mat_vae, dtype=torch.float32)
print(f"Hi-C matrix shape for VAE: {hic_mat_vae.shape}")

In [None]:
class TinyVAE(nn.Module):
    """Simple Variational Autoencoder for Hi-C denoising."""
    
    def __init__(self, input_dim):
        super().__init__()
        # Encoder
        self.fc1 = nn.Linear(input_dim, 32)
        self.fc_mu = nn.Linear(32, 16)
        self.fc_logvar = nn.Linear(32, 16)
        # Decoder
        self.fc2 = nn.Linear(16, 32)
        self.fc3 = nn.Linear(32, input_dim)
    
    def encode(self, x):
        h1 = torch.relu(self.fc1(x))
        return self.fc_mu(h1), self.fc_logvar(h1)
    
    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std
    
    def decode(self, z):
        h2 = torch.relu(self.fc2(z))
        return torch.sigmoid(self.fc3(h2))
    
    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar


def vae_loss(recon_x, x, mu, logvar):
    """VAE loss = reconstruction loss (summed squared error) + KL divergence."""
    recon_loss = nn.functional.mse_loss(recon_x, x, reduction='sum')
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kld

In [None]:
# Train VAE
input_dim = N_vae
model = TinyVAE(input_dim)
optimizer = optim.Adam(model.parameters(), lr=1e-2)

x = hic_tensor  # shape [N, N]
losses = []

print("Training VAE...")
for epoch in range(200):
    optimizer.zero_grad()
    recon, mu, logvar = model(x)
    loss = vae_loss(recon, x, mu, logvar)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
    if epoch % 50 == 0:
        print(f"  Epoch {epoch}, Loss: {loss.item():.2f}")

print(f"Final loss: {losses[-1]:.2f}")

In [None]:
# Get denoised output by decoding the posterior mean (no sampling noise)
model.eval()
with torch.no_grad():
    mu, _ = model.encode(x)
    recon = model.decode(mu)
denoised_hic = recon.numpy()

# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].imshow(hic_mat_vae, cmap='Reds', origin='lower')
axes[0].set_title('Original Hi-C')

axes[1].imshow(denoised_hic, cmap='Reds', origin='lower')
axes[1].set_title('VAE Denoised Hi-C')

axes[2].plot(losses)
axes[2].set_xlabel('Epoch')
axes[2].set_ylabel('Loss')
axes[2].set_title('Training Loss')

plt.tight_layout()
plt.show()

print(f"Denoised Hi-C matrix shape: {denoised_hic.shape}")

---
## Bonus 2: RNA Pseudoknot Detection

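This section uses a greedy heuristic to pair bases and then flags crossing pairs, which are the signature of a pseudoknot. Two pairs (i, j) and (k, l) cross when i < k < j < l. The heuristic does not score or optimize the structure, so its output is illustrative only.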
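
To make the crossing condition concrete, here is a minimal sketch that classifies two base pairs as crossing or not. The helper `pairs_cross` and the example index pairs are hypothetical and chosen only for illustration.

```python
def pairs_cross(p, q):
    """Return True if base pairs p and q interleave (i < k < j < l)."""
    (i, j), (k, l) = sorted([p, q])  # order the two pairs by their 5' position
    return i < k < j < l


examples = {
    "nested":       ((0, 8), (2, 6)),  # second pair lies inside the first
    "side by side": ((0, 3), (4, 8)),  # second pair starts after the first ends
    "crossing":     ((0, 5), (3, 8)),  # pairs interleave: pseudoknot-like
}

for label, (p, q) in examples.items():
    print(f"{label:>12}: {p} vs {q} -> cross = {pairs_cross(p, q)}")
```

Nested and side-by-side pairs can be written in plain dot-bracket notation, while crossing pairs need a second bracket type, which is why standard Nussinov-style dynamic programming cannot produce them.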

In [None]:
def can_pair_rna(b1, b2):
    """Check if two bases can form a Watson-Crick or wobble pair."""
    pairs = [('A', 'U'), ('U', 'A'), ('G', 'C'), ('C', 'G'), ('G', 'U'), ('U', 'G')]
    return (b1, b2) in pairs


def find_base_pairs_simple(seq):
    """
    Greedy heuristic to find base pairs, including potential pseudoknots.

    Pairs (i, j) are scanned in order and the first valid partner is accepted
    for each unpaired base. Crossing pairs are not filtered out, so the result
    may contain pseudoknots.
    """
    n = len(seq)
    pairs = []
    paired = set()  # positions that already belong to a pair

    for i in range(n):
        for j in range(i + 4, n):  # at least 3 unpaired bases between i and j
            # Each position may pair at most once
            if i in paired or j in paired:
                continue
            if can_pair_rna(seq[i], seq[j]):
                pairs.append((i, j))
                paired.update((i, j))

    return pairs


def detect_pseudoknots(pairs):
    """Identify which pairs form pseudoknots (crossing pairs)."""
    pseudoknots = []
    for i, (p1i, p1j) in enumerate(pairs):
        for p2i, p2j in pairs[i+1:]:
            # Check if pairs cross
            if (p1i < p2i < p1j < p2j) or (p2i < p1i < p2j < p1j):
                pseudoknots.append(((p1i, p1j), (p2i, p2j)))
    return pseudoknots

In [None]:
# Example RNA sequences
rna_sequences = [
    "GGGAAAUCC",
    "GCGCAAAGCGC",
    "GGGUUUAAACCCAAAGGG"
]

for rna_seq in rna_sequences:
    print(f"\nSequence: {rna_seq}")
    pairs = find_base_pairs_simple(rna_seq)
    print(f"Predicted base pairs: {pairs}")
    
    pseudoknots = detect_pseudoknots(pairs)
    if pseudoknots:
        print(f"Pseudoknot pairs detected: {pseudoknots}")
    else:
        print("No pseudoknots detected")

In [None]:
from matplotlib.patches import Arc


def visualize_rna_pairs(seq, pairs):
    """Simple arc diagram visualization of RNA base pairs."""
    fig, ax = plt.subplots(figsize=(12, 4))
    
    # Plot sequence
    base_colors = {'A': 'red', 'U': 'blue', 'G': 'green', 'C': 'orange'}
    for i, base in enumerate(seq):
        ax.text(i, 0, base, ha='center', va='center', fontsize=12, fontweight='bold',
                color=base_colors.get(base, 'black'))
    
    # Collect every pair that takes part in at least one crossing
    pk_pairs = set()
    for pair_a, pair_b in detect_pseudoknots(pairs):
        pk_pairs.update((pair_a, pair_b))
    
    # Draw each pair as the upper half of an ellipse spanning i..j
    for i, j in pairs:
        mid = (i + j) / 2
        height = (j - i) / 4
        is_pk = (i, j) in pk_pairs
        arc = Arc((mid, 0), j - i, height * 2,
                  angle=0, theta1=0, theta2=180,
                  color='red' if is_pk else 'gray',
                  linestyle='--' if is_pk else '-', linewidth=2)
        ax.add_patch(arc)
    
    ax.set_xlim(-1, len(seq))
    ax.set_ylim(-0.5, max(3, len(seq)/4))
    ax.set_aspect('equal')
    ax.axis('off')
    ax.set_title(f'RNA: {seq}\n(Red dashed = pseudoknot pairs)')
    plt.tight_layout()
    plt.show()


# Visualize for a longer sequence
test_seq = "GGGUUUAAACCCAAAGGG"
test_pairs = find_base_pairs_simple(test_seq)
visualize_rna_pairs(test_seq, test_pairs)

---
## Bonus 3: Microbiome Co-occurrence Network

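Build a co-occurrence network from a synthetic abundance table by linking taxa whose Pearson correlation across samples is strong in absolute value. Because the table is random, the resulting edges are illustrative only.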
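
As a minimal sketch of the thresholding idea, the example below computes Pearson correlations by hand on a tiny made-up table and keeps pairs whose absolute correlation exceeds a cutoff. The taxa `A` to `D`, their abundance values, and the names `pearson`, `toy_table`, `toy_threshold`, and `toy_edges` are hypothetical; the notebook itself uses `np.corrcoef` and `networkx`.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Made-up abundances for 4 taxa across 5 samples (illustration only)
toy_table = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 11],  # rises with A
    "C": [5, 4, 3, 2, 1],   # falls as A rises
    "D": [3, 1, 2, 1, 3],   # unrelated to A
}

toy_threshold = 0.5
toy_edges = {}
for u, v in combinations(toy_table, 2):
    r = pearson(toy_table[u], toy_table[v])
    if abs(r) > toy_threshold:  # keep strong positive AND strong negative links
        toy_edges[(u, v)] = r

for (u, v), r in toy_edges.items():
    print(f"{u} -- {v}: r = {r:+.3f}")
```

Taxon `D` ends up isolated, while `A`, `B`, and `C` form a triangle that mixes positive and negative edges. Thresholding on the absolute value keeps both co-occurrence and mutual-exclusion signals.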

In [None]:
import networkx as nx

# Synthetic abundance table: 10 taxa x 20 samples
np.random.seed(42)
n_taxa = 10
n_samples = 20
abundance = np.random.rand(n_taxa, n_samples)
taxa = [f"Taxon{i}" for i in range(n_taxa)]

print(f"Abundance table shape: {abundance.shape}")
print(f"Taxa: {taxa}")

In [None]:
# Compute correlation matrix
corr = np.corrcoef(abundance)

# Visualize correlation matrix
plt.figure(figsize=(8, 6))
plt.imshow(corr, cmap='RdBu_r', vmin=-1, vmax=1)
plt.colorbar(label='Correlation')
plt.xticks(range(n_taxa), taxa, rotation=45, ha='right')
plt.yticks(range(n_taxa), taxa)
plt.title('Taxa Correlation Matrix')
plt.tight_layout()
plt.show()

In [None]:
def build_cooccurrence_network(corr, taxa, threshold=0.5):
    """
    Build co-occurrence network from correlation matrix.
    Only keep edges whose absolute correlation exceeds threshold, so both
    strong positive and strong negative correlations become edges.
    """
    G = nx.Graph()
    
    # Add nodes
    for taxon in taxa:
        G.add_node(taxon)
    
    # Add edges for strong correlations
    for i in range(len(taxa)):
        for j in range(i + 1, len(taxa)):
            if abs(corr[i, j]) > threshold:
                G.add_edge(taxa[i], taxa[j], 
                          weight=corr[i, j],
                          correlation=corr[i, j])
    
    return G


# Build network
threshold = 0.3  # Lower threshold to see more edges with random data
G = build_cooccurrence_network(corr, taxa, threshold=threshold)

print(f"Network has {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")
print("\nEdges (with correlation):")
for u, v, data in G.edges(data=True):
    print(f"  {u} -- {v}: r = {data['correlation']:.3f}")

In [None]:
# Plot network
plt.figure(figsize=(10, 8))

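# Ignore edge weights in the layout: negative correlations would otherwise
# be treated as negative spring strengths
pos = nx.spring_layout(G, seed=42, weight=None)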

# Edge colors based on correlation sign
edge_colors = ['green' if G[u][v]['weight'] > 0 else 'red' for u, v in G.edges()]
edge_widths = [abs(G[u][v]['weight']) * 3 for u, v in G.edges()]

# Draw network
nx.draw_networkx_nodes(G, pos, node_color='lightblue', node_size=700)
nx.draw_networkx_labels(G, pos, font_size=10)
nx.draw_networkx_edges(G, pos, edge_color=edge_colors, width=edge_widths, alpha=0.7)

plt.title(f"Microbiome Co-Occurrence Network (threshold={threshold})\nGreen=positive, Red=negative correlation")
plt.axis('off')
plt.tight_layout()
plt.show()

In [None]:
# Network analysis
print("Network Statistics:")
print(f"  Number of nodes: {G.number_of_nodes()}")
print(f"  Number of edges: {G.number_of_edges()}")

if G.number_of_edges() > 0:
    print(f"  Density: {nx.density(G):.3f}")
    print(f"  Average clustering coefficient: {nx.average_clustering(G):.3f}")
    
    # Degree centrality
    degree_cent = nx.degree_centrality(G)
    print("\nDegree centrality:")
    for taxon, cent in sorted(degree_cent.items(), key=lambda x: -x[1]):
        print(f"  {taxon}: {cent:.3f}")
    
    # Connected components
    components = list(nx.connected_components(G))
    print(f"\nNumber of connected components: {len(components)}")
    for i, comp in enumerate(components):
        print(f"  Component {i+1}: {comp}")
else:
    print("  No edges in network (try lowering threshold)")

---
## Summary

This notebook covered:

**Main Exercises:**
- **A**: Nussinov algorithm for RNA secondary structure prediction with 3D visualization
- **B**: Hi-C contact matrix analysis, ICE normalization, and A/B compartment detection via PCA
- **C**: TAD boundary detection using insulation scores, loop detection, and annotation
- **D**: Mapping GWAS SNPs to target genes using Hi-C contact data

**Bonus Exercises:**
- **Bonus 1**: Tiny VAE for Hi-C matrix denoising using PyTorch
- **Bonus 2**: RNA pseudoknot detection with a heuristic approach
- **Bonus 3**: Microbiome co-occurrence network from abundance data