# Ablation Study: Sparsification vs. Weighting

## Introduction
This notebook runs an ablation study that separates the effect of modifying graph structure (sparsification) from the effect of edge weighting. We compare **two sparsification approaches**:

1. **Threshold-based sparsification**: Remove edges below a score threshold (configurable retention ratio)
2. **Metric backbone sparsification**: Remove semi-metric edges, i.e. edges that violate the triangle inequality (retention ratio is determined by the graph, not configurable)
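
To make the second approach concrete, here is a minimal sketch of a metric backbone in plain Python. It assumes edge scores have already been converted to distances (smaller means closer); the conversion used by this project is not shown here. The helper names `shortest_path_lengths` and `metric_backbone` are illustrative and are not part of `src`. An edge is kept only when no indirect path is strictly shorter than the edge itself.

```python
import heapq


def shortest_path_lengths(adj, source):
    """Dijkstra from `source` over a dict-of-dicts adjacency with non-negative distances."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def metric_backbone(edges, tol=1e-12):
    """Keep edges whose direct distance equals the shortest-path distance (metric edges).

    `edges` is a list of (u, v, distance) tuples for an undirected graph.
    Illustrative helper, not the project's implementation.
    """
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    cache = {}
    kept = []
    for u, v, w in edges:
        if u not in cache:
            cache[u] = shortest_path_lengths(adj, u)
        # Semi-metric edge: some indirect path is strictly shorter than w.
        if w <= cache[u][v] + tol:
            kept.append((u, v, w))
    return kept


# Toy triangle: a-c (3.0) is longer than the detour a-b-c (2.0), so it is dropped.
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 3.0)]
backbone = metric_backbone(edges)
print(backbone)  # [('a', 'b', 1.0), ('b', 'c', 1.0)]
```

Because the result depends only on the distances, there is no retention parameter to tune: the fraction of edges kept falls out of the graph itself.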

**Objectives:**
1. Compare four distinct scenarios for each sparsification method:
    *   **A:** Baseline (Full Graph + Binary Weights)
    *   **B:** Structure Only (Sparse Graph + Binary Weights)
    *   **C:** Weighting Only (Full Graph + Weighted Edges)
    *   **D:** Combined (Sparse Graph + Weighted Edges)
2. Compare different edge metrics: Jaccard, Adamic-Adar, Approximate Effective Resistance
3. Compare different GNN architectures: GCN, GAT, GraphSAGE
4. Repeat each configuration over multiple seeds to estimate run-to-run variance
5. Measure: accuracy, training time, sparsification time, memory usage
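
As a reference for two of the edge metrics named above, the sketch below computes Jaccard and Adamic-Adar scores from neighbor sets in plain Python. It assumes an undirected graph without self-loops and neighborhoods that exclude the endpoints themselves; the function names are illustrative and may not match what `src` does internally. Approximate effective resistance is omitted because it requires solving linear systems on the graph Laplacian.

```python
import math


def build_adjacency(edges):
    """Map each node to the set of its neighbors (undirected graph)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def jaccard(adj, u, v):
    """|N(u) & N(v)| / |N(u) | N(v)|, or 0.0 when both neighborhoods are empty."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    """Sum of 1 / log(degree) over common neighbors.

    A common neighbor is adjacent to both u and v, so its degree is at
    least 2 and the logarithm is strictly positive.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])


# Toy graph: two triangles sharing the edge (1, 2).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
adj = build_adjacency(edges)
for u, v in edges:
    print(f"({u}, {v})  jaccard={jaccard(adj, u, v):.3f}  adamic_adar={adamic_adar(adj, u, v):.3f}")
```

In this toy graph the shared edge (1, 2) scores highest on both metrics because it has two common neighbors, while every other edge has one.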

In [None]:
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from pathlib import Path

sys.path.insert(0, str(Path.cwd().parent.parent))
from src import *

In [None]:
set_global_seed(42)

if torch.cuda.is_available():
    DEVICE = "cuda"
elif torch.backends.mps.is_available():
    DEVICE = "mps"
else:
    DEVICE = "cpu"

print(f"Using device: {DEVICE}")

## 1. Load Datasets

In [None]:
from src.data import DatasetLoader, SAFE_DATASETS

loader = DatasetLoader(root="../data")

# Use ALL safe datasets (excluding polblogs which has 0 node features)
DATASETS_NO_FEATURES = ["polblogs"]
DATASET_NAMES = [d for d in SAFE_DATASETS if d not in DATASETS_NO_FEATURES]

print(f"Running experiments on {len(DATASET_NAMES)} datasets:")
print(f"  {DATASET_NAMES}")

datasets = {}
stats_summary = {}

for name in DATASET_NAMES:
    print(f"Loading {name}...", end=" ")
    data, num_features, num_classes = loader.get_dataset(name, DEVICE)
    datasets[name] = {
        "data": data,
        "num_features": num_features,
        "num_classes": num_classes,
    }
    stats_summary[name] = {
        "Nodes": int(data.num_nodes),
        "Edges": int(data.edge_index.shape[1]),
        "Features": int(num_features),
        "Classes": int(num_classes)
    }
    print(f"({data.num_nodes:,} nodes, {data.edge_index.shape[1]:,} edges)")

print_text_table(stats_summary, title="Dataset Summary")

## 2. Single Ablation Study: Threshold vs Metric Backbone

First, let's run a single comparison between threshold-based (60% retention) and metric backbone sparsification using Jaccard similarity on the Cora dataset.
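
For intuition, threshold-based sparsification with a retention ratio can be sketched as "rank edges by score and keep the top fraction". The snippet below is a simplified stand-in with made-up scores, not the code path behind `use_metric_backbone=False`; in particular, `threshold_sparsify` is a hypothetical helper and its tie-breaking rule is an assumption.

```python
def threshold_sparsify(scored_edges, retention_ratio):
    """Keep the highest-scoring fraction of edges.

    `scored_edges` is a list of (u, v, score) tuples. Ties at the cut-off
    are broken by input order (an assumption made for this sketch).
    """
    if not 0.0 < retention_ratio <= 1.0:
        raise ValueError("retention_ratio must be in (0, 1]")
    n_keep = max(1, round(retention_ratio * len(scored_edges)))
    ranked = sorted(scored_edges, key=lambda e: e[2], reverse=True)
    return ranked[:n_keep]


# Made-up scores for a five-edge graph; 60% retention keeps three edges.
scored_edges = [(0, 1, 0.25), (0, 2, 0.25), (1, 2, 0.50), (1, 3, 0.10), (2, 3, 0.40)]
kept = threshold_sparsify(scored_edges, retention_ratio=0.6)
print(kept)  # [(1, 2, 0.5), (2, 3, 0.4), (0, 1, 0.25)]
```

Note that edges (0, 1) and (0, 2) share the same score, yet only one of them survives. Ties like this are one reason a realized retention can differ from the requested ratio, which is why the results record an `ActualRetention` column.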

In [None]:
# Run single comparison on Cora (smallest dataset for quick demo)
demo_dataset = "cora"
demo_data = datasets[demo_dataset]["data"]
demo_features = datasets[demo_dataset]["num_features"]
demo_classes = datasets[demo_dataset]["num_classes"]

study = AblationStudy(
    data=demo_data,
    num_features=demo_features,
    num_classes=demo_classes,
    device=DEVICE,
)

# Run threshold-based sparsification (60% retention)
print("=" * 60)
print(f"THRESHOLD-BASED SPARSIFICATION (60% retention) - {demo_dataset.upper()}")
print("=" * 60)
threshold_results_df = study.run_full_study(
    model_name="gcn",
    metric="jaccard",
    retention_ratio=0.6,
    hidden_channels=64,
    epochs=200,
    patience=20,
    use_metric_backbone=False,
)
threshold_results_df["SparsificationType"] = "Threshold (60%)"
threshold_results_df["Dataset"] = demo_dataset

# Run metric backbone sparsification
print("\n" + "=" * 60)
print(f"METRIC BACKBONE SPARSIFICATION - {demo_dataset.upper()}")
print("=" * 60)
backbone_results_df = study.run_full_study(
    model_name="gcn",
    metric="jaccard",
    retention_ratio=0.6,  # Ignored when use_metric_backbone=True
    hidden_channels=64,
    epochs=200,
    patience=20,
    use_metric_backbone=True,
)
backbone_results_df["SparsificationType"] = "Metric Backbone"
backbone_results_df["Dataset"] = demo_dataset

# Combine results
single_comparison_df = pd.concat([threshold_results_df, backbone_results_df], ignore_index=True)
print("\n" + "=" * 60)
print("COMPARISON: Threshold vs Metric Backbone")
print("=" * 60)
print(single_comparison_df.to_string(index=False))

### 2.1 Decomposition of Effects

The table above reports the raw accuracy for all four scenarios (A, B, C, D) under both sparsification methods.

We now decompose these results to separate the contribution of **Structure** (removing edges) from that of **Weighting** (edge importance). Each effect is a difference in accuracy, expressed in percentage points. The cell below computes them for the threshold-based run:

* **Structure Effect:** $Acc_B - Acc_A$ (Impact of sparsification)
* **Weighting Effect:** $Acc_C - Acc_A$ (Impact of re-weighting edges)
* **Combined Effect:** $Acc_D - Acc_A$ (Impact of doing both)
* **Interaction:** $Acc_D - Acc_B - Acc_C + Acc_A$ (Synergy between structure and weighting)
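
A small worked example of the four formulas, using invented accuracies rather than results from this notebook. The helper `decompose_effects` is illustrative only. It also checks the identity that makes the decomposition useful: the combined effect equals the structure effect plus the weighting effect plus the interaction.

```python
def decompose_effects(acc_a, acc_b, acc_c, acc_d):
    """2x2 factorial decomposition of accuracy for scenarios A, B, C, D."""
    return {
        "structure": acc_b - acc_a,
        "weighting": acc_c - acc_a,
        "combined": acc_d - acc_a,
        "interaction": acc_d - acc_b - acc_c + acc_a,
    }


# Hypothetical accuracies, chosen only to illustrate the arithmetic.
effects = decompose_effects(acc_a=0.80, acc_b=0.78, acc_c=0.82, acc_d=0.81)
for name, value in effects.items():
    print(f"{name:<12} {value:+.2%}")

# combined = structure + weighting + interaction (up to floating-point error)
parts = effects["structure"] + effects["weighting"] + effects["interaction"]
assert abs(effects["combined"] - parts) < 1e-12
```

With these numbers, sparsification alone costs two points and weighting alone gains two, so a purely additive model would predict no change for D. The observed one-point gain therefore shows up entirely as a positive interaction.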

In [None]:
# Results are already printed in the previous cell
# This cell shows the effect analysis for the single comparison

# Extract accuracies for effect calculation (using threshold results)
baseline = threshold_results_df[threshold_results_df["Scenario"] == "A: Full + Binary"]["Accuracy"].values[0]
sparse_binary = threshold_results_df[threshold_results_df["Scenario"] == "B: Sparse + Binary"]["Accuracy"].values[0]
full_weighted = threshold_results_df[threshold_results_df["Scenario"] == "C: Full + Weighted"]["Accuracy"].values[0]
sparse_weighted = threshold_results_df[threshold_results_df["Scenario"] == "D: Sparse + Weighted"]["Accuracy"].values[0]

print("=" * 60)
print("EFFECT ANALYSIS (Threshold-based, 60% retention)")
print("=" * 60)
print(f"Baseline (A):              {baseline:.2%}")
print(f"Structure effect (B - A):  {sparse_binary - baseline:+.2%}")
print(f"Weighting effect (C - A):  {full_weighted - baseline:+.2%}")
print(f"Combined effect (D - A):   {sparse_weighted - baseline:+.2%}")
print(f"Interaction (D-B-C+A):     {sparse_weighted - sparse_binary - full_weighted + baseline:+.2%}")

### 2.2 Effect Calculation

The next cell applies the same decomposition to both sparsification methods, so the two can be compared side by side:
*   **Structure Effect** = B - A (Impact of sparsification, threshold-based or metric backbone)
*   **Weighting Effect** = C - A (Impact of re-weighting edges)
*   **Combined Effect** = D - A (Impact of doing both)
*   **Interaction** = D - B - C + A (Synergy between structure and weighting)

**Interpretation of Interaction:**
- If **positive**: The two interventions are **synergistic** (combined > sum of parts)
- If **zero**: The effects are **purely additive** (combined = sum of parts)
- If **negative**: There is **interference** (combined < sum of parts)

In [None]:
# Compare effect analysis between threshold and metric backbone
print("=" * 70)
print("EFFECT COMPARISON: Threshold vs Metric Backbone")
print("=" * 70)

for name, df in [("Threshold (60%)", threshold_results_df), ("Metric Backbone", backbone_results_df)]:
    baseline = df[df["Scenario"] == "A: Full + Binary"]["Accuracy"].values[0]
    sparse_binary = df[df["Scenario"] == "B: Sparse + Binary"]["Accuracy"].values[0]
    full_weighted = df[df["Scenario"] == "C: Full + Weighted"]["Accuracy"].values[0]
    sparse_weighted = df[df["Scenario"] == "D: Sparse + Weighted"]["Accuracy"].values[0]
    retention = df["ActualRetention"].mean()
    
    print(f"\n{name} (Retention: {retention:.1%}):")
    print(f"  Baseline (A):              {baseline:.2%}")
    print(f"  Structure effect (B - A):  {sparse_binary - baseline:+.2%}")
    print(f"  Weighting effect (C - A):  {full_weighted - baseline:+.2%}")
    print(f"  Combined effect (D - A):   {sparse_weighted - baseline:+.2%}")
    print(f"  Interaction (D-B-C+A):     {sparse_weighted - sparse_binary - full_weighted + baseline:+.2%}")

### 2.3 Visualization

In [None]:
# Visualize comparison between threshold and metric backbone
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
colors = ["#2ecc71", "#3498db", "#e74c3c", "#9b59b6"]

for ax, (name, df) in zip(axes, [("Threshold (60%)", threshold_results_df), ("Metric Backbone", backbone_results_df)]):
    scenarios = df["Scenario"].tolist()
    accuracies = df["Accuracy"].tolist()
    retention = df["ActualRetention"].mean()
    baseline_acc = df[df["Scenario"] == "A: Full + Binary"]["Accuracy"].values[0]
    
    bars = ax.bar(range(4), accuracies, color=colors, edgecolor="black", linewidth=1.5)
    ax.axhline(baseline_acc, color="gray", linestyle="--", linewidth=1, label="Baseline")
    
    for bar, acc in zip(bars, accuracies):
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.005,
                f"{acc:.1%}", ha="center", va="bottom", fontsize=10, fontweight="bold")
    
    ax.set_xticks(range(4))
    ax.set_xticklabels(["A", "B", "C", "D"])
    ax.set_ylabel("Test Accuracy", fontsize=11)
    ax.set_title(f"{name}\n(Retention: {retention:.1%})", fontsize=12)
    ax.set_ylim(0.5, 0.95)
    ax.legend(loc="lower right")
    ax.grid(axis="y", alpha=0.3)

plt.suptitle("Ablation Study: Cora + GCN + Jaccard", fontsize=14)
plt.tight_layout()
plt.show()

## 3. Multi-Configuration Study

Run comprehensive ablation studies comparing:
- **Sparsification methods**: Threshold-based vs Metric Backbone
- **Edge metrics**: Jaccard, Adamic-Adar, Approximate Effective Resistance
- **Retention ratios**: 30%, 50%, 70%, 90% (threshold-based only)
- **Seeds**: 42, 123, 456 (to estimate run-to-run variance)
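
To get a feel for the size of this grid, the sketch below enumerates the configurations using the values set in the configuration cell. It assumes one result row per (metric, retention, seed, scenario) combination. That is a guess about how `run_multi_config_study` lays out its output, so treat the counts as an estimate per dataset and per model.

```python
from itertools import product

METRICS = ["jaccard", "approx_er", "adamic_adar"]
RETENTION_RATIOS = [0.3, 0.5, 0.7, 0.9]  # threshold-based only
SEEDS = [42, 123, 456]
SCENARIOS = [
    "A: Full + Binary",
    "B: Sparse + Binary",
    "C: Full + Weighted",
    "D: Sparse + Weighted",
]

# Assumption: one row per combination. The metric backbone has no retention axis.
threshold_grid = list(product(METRICS, RETENTION_RATIOS, SEEDS, SCENARIOS))
backbone_grid = list(product(METRICS, SEEDS, SCENARIOS))

print(f"Threshold rows per dataset: {len(threshold_grid)}")  # 144
print(f"Backbone rows per dataset:  {len(backbone_grid)}")   # 36
```

Under this assumption each dataset contributes 180 rows, most of them from the retention sweep.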

In [None]:
# Configuration for comprehensive experiments
METRICS = ["jaccard", "approx_er", "adamic_adar"]
RETENTION_RATIOS = [0.3, 0.5, 0.7, 0.9]  # For threshold-based
SEEDS = [42, 123, 456]
MODEL = "gcn"

all_results_list = []

# Run experiments on ALL datasets
for dataset_name in DATASET_NAMES:
    print(f"\n{'#' * 70}")
    print(f"# DATASET: {dataset_name.upper()}")
    print(f"{'#' * 70}")
    
    ds = datasets[dataset_name]
    study = AblationStudy(
        data=ds["data"],
        num_features=ds["num_features"],
        num_classes=ds["num_classes"],
        device=DEVICE,
    )
    
    # Run threshold-based experiments
    print(f"\n{'=' * 60}")
    print(f"THRESHOLD-BASED SPARSIFICATION - {dataset_name.upper()}")
    print(f"{'=' * 60}")
    threshold_df = study.run_multi_config_study(
        model_names=[MODEL],
        metrics=METRICS,
        retention_ratios=RETENTION_RATIOS,
        hidden_channels=64,
        epochs=200,
        patience=20,
        seeds=SEEDS,
        use_metric_backbone=False,
    )
    threshold_df["SparsificationType"] = "Threshold"
    threshold_df["Dataset"] = dataset_name
    all_results_list.append(threshold_df)
    
    # Run metric backbone experiments
    print(f"\n{'=' * 60}")
    print(f"METRIC BACKBONE SPARSIFICATION - {dataset_name.upper()}")
    print(f"{'=' * 60}")
    backbone_df = study.run_multi_config_study(
        model_names=[MODEL],
        metrics=METRICS,
        retention_ratios=[1.0],  # Placeholder, ignored when use_metric_backbone=True
        hidden_channels=64,
        epochs=200,
        patience=20,
        seeds=SEEDS,
        use_metric_backbone=True,
    )
    backbone_df["SparsificationType"] = "MetricBackbone"
    backbone_df["Dataset"] = dataset_name
    all_results_list.append(backbone_df)
    
    print(f"Completed {dataset_name}: {len(threshold_df) + len(backbone_df)} experiments")

# Combine all results
all_results_df = pd.concat(all_results_list, ignore_index=True)
print(f"\n{'=' * 70}")
print(f"TOTAL EXPERIMENTS COMPLETED: {len(all_results_df)}")
print(f"Datasets: {all_results_df['Dataset'].unique().tolist()}")
print(f"Columns: {all_results_df.columns.tolist()}")

In [None]:
all_results_df

### 3.1 Accuracy vs Retention Ratio (with Metric Backbone Reference)

Plot accuracy across retention ratios for threshold-based methods, with metric backbone as a horizontal reference line.
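
The error bars in this plot come from aggregating over seeds. The standard-library sketch below shows the same mean and standard deviation computation on invented accuracies. `statistics.stdev` is the sample standard deviation, which matches the default (`ddof=1`) of pandas `std`.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical per-seed accuracies for two retention ratios (not real results).
runs = [
    {"Retention": 0.5, "Seed": 42, "Accuracy": 0.80},
    {"Retention": 0.5, "Seed": 123, "Accuracy": 0.82},
    {"Retention": 0.5, "Seed": 456, "Accuracy": 0.81},
    {"Retention": 0.7, "Seed": 42, "Accuracy": 0.83},
    {"Retention": 0.7, "Seed": 123, "Accuracy": 0.83},
    {"Retention": 0.7, "Seed": 456, "Accuracy": 0.86},
]

# Group accuracies by retention ratio, then summarize across seeds.
grouped = defaultdict(list)
for run in runs:
    grouped[run["Retention"]].append(run["Accuracy"])

summary = {ratio: (mean(accs), stdev(accs)) for ratio, accs in sorted(grouped.items())}
for ratio, (acc_mean, acc_std) in summary.items():
    print(f"retention={ratio:.0%}  mean={acc_mean:.3f}  std={acc_std:.3f}")
```

With only three seeds per configuration the standard deviation is itself a rough estimate, so overlapping error bars should be read as "no clear difference" rather than as evidence of equality.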

In [None]:
# Aggregate results by computing mean and std across seeds
threshold_df = all_results_df[all_results_df["SparsificationType"] == "Threshold"]
backbone_df = all_results_df[all_results_df["SparsificationType"] == "MetricBackbone"]

# Group threshold results
threshold_agg = threshold_df.groupby(["Dataset", "Metric", "Retention", "Scenario"]).agg({
    "Accuracy": ["mean", "std"],
    "ActualRetention": "mean",
}).reset_index()
threshold_agg.columns = ["Dataset", "Metric", "Retention", "Scenario", "Accuracy_mean", "Accuracy_std", "ActualRetention"]

# Group backbone results
backbone_agg = backbone_df.groupby(["Dataset", "Metric", "Scenario"]).agg({
    "Accuracy": ["mean", "std"],
    "ActualRetention": "mean",
}).reset_index()
backbone_agg.columns = ["Dataset", "Metric", "Scenario", "Accuracy_mean", "Accuracy_std", "ActualRetention"]

# Plot: Accuracy vs Retention for Scenario D by Dataset
scenario_d = "D: Sparse + Weighted"
colors = {"jaccard": "#2ecc71", "approx_er": "#e74c3c", "adamic_adar": "#3498db"}
n_datasets = len(DATASET_NAMES)
n_metrics = len(METRICS)

# squeeze=False keeps `axes` two-dimensional even with a single dataset or a single metric
fig, axes = plt.subplots(n_datasets, n_metrics, figsize=(5 * n_metrics, 4 * n_datasets), squeeze=False)

for i, dataset_name in enumerate(DATASET_NAMES):
    for j, metric in enumerate(METRICS):
        ax = axes[i, j]
        
        # Threshold data
        mask = (threshold_agg["Dataset"] == dataset_name) & \
               (threshold_agg["Metric"] == metric) & \
               (threshold_agg["Scenario"] == scenario_d)
        metric_data = threshold_agg[mask]
        
        if len(metric_data) > 0:
            ax.errorbar(
                metric_data["Retention"], 
                metric_data["Accuracy_mean"],
                yerr=metric_data["Accuracy_std"],
                marker="o", 
                label="Threshold",
                color=colors.get(metric, "gray"),
                capsize=3,
            )
        
        # Metric backbone reference
        bb_mask = (backbone_agg["Dataset"] == dataset_name) & \
                  (backbone_agg["Metric"] == metric) & \
                  (backbone_agg["Scenario"] == scenario_d)
        backbone_data = backbone_agg[bb_mask]
        
        if len(backbone_data) > 0:
            bb_acc = backbone_data["Accuracy_mean"].values[0]
            bb_ret = backbone_data["ActualRetention"].values[0]
            bb_std = backbone_data["Accuracy_std"].values[0]
            ax.axhline(bb_acc, color="black", linestyle="--", linewidth=2, 
                      label=f"Backbone ({bb_ret:.0%})")
            ax.axhspan(bb_acc - bb_std, bb_acc + bb_std, alpha=0.2, color="gray")
        
        ax.set_xlabel("Retention Ratio")
        ax.set_ylabel("Accuracy")
        ax.set_title(f"{dataset_name.upper()} - {metric.replace('_', ' ').title()}")
        ax.legend(loc="lower right", fontsize=8)
        ax.grid(alpha=0.3)
        ax.set_ylim(0.3, 0.95)

fig.suptitle(f"Scenario D: Threshold vs Metric Backbone (All Datasets)\nModel: {MODEL}, Seeds: {SEEDS}", fontsize=14)
plt.tight_layout()
plt.show()

## 4. Cross-Model Comparison

Compare ablation results across GNN architectures (GCN, GraphSAGE, GAT) for both sparsification approaches. The threshold-based runs use a fixed retention ratio of 70%.

In [None]:
# Cross-model comparison on ALL datasets
MODELS = ["gcn", "sage", "gat"]
FIXED_RETENTION = 0.7

cross_model_list = []

for dataset_name in DATASET_NAMES:
    print(f"\n{'#' * 70}")
    print(f"# CROSS-MODEL: {dataset_name.upper()}")
    print(f"{'#' * 70}")
    
    ds = datasets[dataset_name]
    study = AblationStudy(
        data=ds["data"],
        num_features=ds["num_features"],
        num_classes=ds["num_classes"],
        device=DEVICE,
    )
    
    # Threshold-based for all models
    print(f"\nThreshold-based ({FIXED_RETENTION:.0%} retention)...")
    cross_threshold = study.run_multi_config_study(
        model_names=MODELS,
        metrics=METRICS,
        retention_ratios=[FIXED_RETENTION],
        hidden_channels=64,
        epochs=200,
        patience=20,
        seeds=SEEDS,
        use_metric_backbone=False,
    )
    cross_threshold["SparsificationType"] = "Threshold"
    cross_threshold["Dataset"] = dataset_name
    cross_model_list.append(cross_threshold)
    
    # Metric backbone for all models
    print("Metric backbone...")
    cross_backbone = study.run_multi_config_study(
        model_names=MODELS,
        metrics=METRICS,
        retention_ratios=[1.0],
        hidden_channels=64,
        epochs=200,
        patience=20,
        seeds=SEEDS,
        use_metric_backbone=True,
    )
    cross_backbone["SparsificationType"] = "MetricBackbone"
    cross_backbone["Dataset"] = dataset_name
    cross_model_list.append(cross_backbone)
    
    print(f"Completed {dataset_name}")

# Combine
cross_model_df = pd.concat(cross_model_list, ignore_index=True)
print(f"\nTotal cross-model experiments: {len(cross_model_df)}")

### 4.1 Comparative Analysis of Architectures

Compare how GCN, GraphSAGE, and GAT perform under threshold-based vs metric backbone sparsification.

In [None]:
# Aggregate cross-model results
cross_agg = cross_model_df.groupby(["Dataset", "Model", "Metric", "Scenario", "SparsificationType"]).agg({
    "Accuracy": ["mean", "std"],
    "TrainSec": "mean",
    "SparsifySec": "mean",
    "ActualRetention": "mean",
}).reset_index()
cross_agg.columns = ["Dataset", "Model", "Metric", "Scenario", "SparsificationType", 
                     "Accuracy_mean", "Accuracy_std", "TrainSec", "SparsifySec", "ActualRetention"]

# Print summary table for Scenario D by dataset
print("=" * 100)
print("CROSS-MODEL SUMMARY: Scenario D (Sparse + Weighted) - All Datasets")
print("=" * 100)

for dataset_name in DATASET_NAMES:
    print(f"\n{'='*50}")
    print(f"DATASET: {dataset_name.upper()}")
    print(f"{'='*50}")
    ds_data = cross_agg[(cross_agg["Dataset"] == dataset_name) & (cross_agg["Scenario"] == scenario_d)]
    if len(ds_data) > 0:
        pivot = ds_data.pivot_table(
            index=["Model", "Metric"], 
            columns="SparsificationType", 
            values=["Accuracy_mean", "ActualRetention"],
            aggfunc="first"
        )
        print(pivot.round(3).to_string())

In [None]:
# Cross-model comparison: compact visualization for all datasets
n_datasets = len(DATASET_NAMES)
n_cols = 4
n_rows = (n_datasets + n_cols - 1) // n_cols  # Ceiling division

fig, axes = plt.subplots(n_rows, n_cols, figsize=(4 * n_cols, 4 * n_rows))
axes = axes.flatten()

scenario_d_agg = cross_agg[cross_agg["Scenario"] == scenario_d]
bar_width = 0.35
x = np.arange(len(MODELS))

for idx, dataset_name in enumerate(DATASET_NAMES):
    ax = axes[idx]
    ds_data = scenario_d_agg[scenario_d_agg["Dataset"] == dataset_name]
    
    # Average across metrics for cleaner visualization
    threshold_by_model = ds_data[ds_data["SparsificationType"] == "Threshold"].groupby("Model").agg({
        "Accuracy_mean": "mean",
        "Accuracy_std": "mean",
    }).reindex(MODELS)
    
    backbone_by_model = ds_data[ds_data["SparsificationType"] == "MetricBackbone"].groupby("Model").agg({
        "Accuracy_mean": "mean",
        "Accuracy_std": "mean",
    }).reindex(MODELS)
    
    bars1 = ax.bar(x - bar_width/2, threshold_by_model["Accuracy_mean"], bar_width, 
                   yerr=threshold_by_model["Accuracy_std"],
                   label=f"Threshold ({FIXED_RETENTION:.0%})", color="#3498db", capsize=3, alpha=0.8)
    bars2 = ax.bar(x + bar_width/2, backbone_by_model["Accuracy_mean"], bar_width,
                   yerr=backbone_by_model["Accuracy_std"],
                   label="Metric Backbone", color="#2ecc71", capsize=3, alpha=0.8)
    
    ax.set_xlabel("Model")
    ax.set_ylabel("Accuracy")
    ax.set_title(f"{dataset_name.upper()}", fontsize=10)
    ax.set_xticks(x)
    ax.set_xticklabels([m.upper() for m in MODELS], fontsize=8)
    ax.legend(loc="lower right", fontsize=7)
    ax.grid(axis="y", alpha=0.3)
    ax.set_ylim(0.1, 0.95)

# Hide any unused subplots
for idx in range(n_datasets, len(axes)):
    axes[idx].set_visible(False)

fig.suptitle(f"Cross-Model Comparison: Threshold vs Metric Backbone\n(Scenario D, Averaged Across Metrics, Seeds: {SEEDS})", fontsize=14)
plt.tight_layout()
plt.show()

## 5. Save Results and Summary

Save all results for analysis in notebook 05.

In [None]:
# Combine all results from multi-config and cross-model studies
final_results_df = pd.concat([all_results_df, cross_model_df], ignore_index=True)

# Remove duplicates (same config may appear in both studies)
final_results_df = final_results_df.drop_duplicates(
    subset=["Dataset", "Model", "Metric", "Retention", "Scenario", "SparsificationType", "Seed"]
)

# Save to CSV
output_path = Path("../data/ablation_results_comprehensive.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)
final_results_df.to_csv(output_path, index=False)
print(f"Saved {len(final_results_df)} experiment results to {output_path}")

# Print summary statistics
print("\n" + "=" * 70)
print("EXPERIMENT SUMMARY")
print("=" * 70)
print(f"Total experiments: {len(final_results_df)}")
print(f"Datasets: {sorted(final_results_df['Dataset'].unique().tolist())}")
print(f"Sparsification types: {final_results_df['SparsificationType'].unique().tolist()}")
print(f"Models tested: {sorted(final_results_df['Model'].unique().tolist())}")
print(f"Metrics tested: {sorted(final_results_df['Metric'].unique().tolist())}")
print(f"Seeds used: {sorted(final_results_df['Seed'].unique().tolist())}")

# Summary table: mean, std, and best (max) accuracy per dataset and sparsification type
print("\n" + "=" * 70)
print("ACCURACY BY DATASET & SPARSIFICATION TYPE (Scenario D)")
print("=" * 70)
best_summary = final_results_df[final_results_df["Scenario"] == "D: Sparse + Weighted"].groupby(
    ["Dataset", "SparsificationType"]
).agg({
    "Accuracy": ["mean", "std", "max"],
    "ActualRetention": "mean",
    "TrainSec": "mean",
}).round(4)
print(best_summary.to_string())