# MUGA LAB Seed Sensitivity and Reproducibility Analysis

**Model Understanding and Generative Alignment Laboratory (MUGA LAB)**  
Department of Mathematics, Ateneo de Manila University  
BS Applied Mathematics (Data Science Track)



---

## Objective

This notebook evaluates **model performance stability across random seeds**  
to quantify the **reproducibility** of calibration and distillation experiments.

It integrates with:
- `seed_sensitivity_utils.py` (multi-seed evaluation)
- `mlp_tuner_tabular_mlflow.py` (Optuna/DEHB tuning)
- `calibration_metrics.py` (ECE, MCS, NLL computation)
- `reliability_diagram_utils.py` (calibration visualization)
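
For readers unfamiliar with ECE, the sketch below illustrates the commonly used definition: predictions are grouped into equal-width confidence bins, and ECE is the sample-weighted average gap between accuracy and mean confidence in each bin. The function name `expected_calibration_error` and the default of 10 bins are assumptions made for illustration; this is not the implementation in `calibration_metrics.py`, which may bin or weight differently.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins (hypothetical helper, not the lab API).

    probs  : list of per-class probability lists, one list per sample
    labels : list of integer class labels
    """
    n = len(labels)
    # One accumulator per bin: [sample count, sum of confidences, number correct]
    bin_stats = [[0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)          # confidence = probability of the predicted class
        pred = p.index(conf)   # predicted class = argmax
        # Bins are [k/n_bins, (k+1)/n_bins); a confidence of 1.0 goes in the last bin
        b = min(int(conf * n_bins), n_bins - 1)
        bin_stats[b][0] += 1
        bin_stats[b][1] += conf
        bin_stats[b][2] += int(pred == y)

    ece = 0.0
    for count, conf_sum, correct in bin_stats:
        if count:  # empty bins contribute nothing
            ece += (count / n) * abs(correct / count - conf_sum / count)
    return ece
```

A model that predicts with confidence 0.75 and is right three times out of four has an ECE of zero under this definition, while a model that is always fully confident but only half right has an ECE of 0.5.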

---

## Analysis Workflow

1. Import dependencies and connect to MLflow.
2. Load the best model or configuration from tuning results.
3. Run multi-seed evaluation using `evaluate_across_seeds()` or `evaluate_metrics_across_seeds()`.
4. Compute and visualize:
   - Mean ± standard deviation of metrics across seeds.
   - Calibration plots for best and worst seeds.
   - Stability ranking of configurations.
5. Export results as CSV and summary report.
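
Step 4 of the workflow can be sketched in a few lines. The example below summarizes one metric across seeds (mean, sample standard deviation, and the std / mean ratio) and ranks configurations by that ratio. The helper names `summarize_across_seeds` and `rank_by_stability`, the configuration labels, and the numbers are all hypothetical and purely illustrative; they are not part of `seed_sensitivity_utils.py` and are not experimental results.

```python
from statistics import mean, stdev


def summarize_across_seeds(per_seed_values):
    """Return mean, sample std, and coefficient of variation (std / |mean|)."""
    m = mean(per_seed_values)
    s = stdev(per_seed_values)  # sample std (n - 1), the pandas default
    cv = s / abs(m) if m != 0 else float("inf")
    return {"mean": m, "std": s, "cv": cv}


def rank_by_stability(metric_by_config):
    """Order configuration names from most to least stable (lowest CV first)."""
    summaries = {
        name: summarize_across_seeds(values)
        for name, values in metric_by_config.items()
    }
    return sorted(summaries, key=lambda name: summaries[name]["cv"])


# Placeholder ECE values for two hypothetical configurations, five seeds each.
toy_ece = {
    "config_a": [0.020, 0.022, 0.021, 0.019, 0.023],
    "config_b": [0.015, 0.030, 0.010, 0.035, 0.020],
}
print(rank_by_stability(toy_ece))
```

Ranking by std / mean rather than by raw standard deviation keeps metrics on different scales (for example NLL and ECE) comparable, at the cost of being unstable when the mean is close to zero.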

---



In [None]:
# mugalab/undergraduate_research/notebooks/seed_sensitivity_analysis.ipynb

# ============================================================
# 1. Setup
# ============================================================

import mlflow
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mugalab.undergraduate_research.utils.seed_sensitivity_utils import (
    evaluate_metrics_across_seeds,
    export_seed_results
)
from mugalab.calibration_metrics import compute_calibration_metrics
from mugalab.reliability_diagram_utils import plot_reliability_diagram

# Optional: Use inline plotting
%matplotlib inline

mlflow.set_tracking_uri("../../results/mlruns")
mlflow.set_experiment("Seed_Sensitivity_Analysis")

# Define seed list and experiment context
seeds = [0, 21, 42, 84, 126]
task = "classification"

# ============================================================
# 2. Load Trained Model and Data
# ============================================================

import torch
from mugalab.mlp_tuner_tabular_mlflow import load_model_and_data

# Example function to load dataset and model checkpoint
# Adjust paths for your project
model_uri = "runs:/<RUN_ID>/model"
model, X_val, y_val = load_model_and_data(model_uri)

print(f"Loaded model from {model_uri}")
print(f"Validation data: {X_val.shape}, Labels: {y_val.shape}")

# ============================================================
# 3. Define Evaluation Function
# ============================================================

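def evaluate_calibration(model, X_val, y_val, task="classification"):
    """
    Evaluate calibration metrics for a given model and validation set.

    Only the classification path is implemented; `task` is accepted for
    interface compatibility but is not used here.
    Returns a dictionary of metrics.
    """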
    model.eval()
    with torch.no_grad():
        logits = model(torch.tensor(X_val, dtype=torch.float32))
        probs = torch.softmax(logits, dim=1).numpy()
        preds = np.argmax(probs, axis=1)

    metrics = compute_calibration_metrics(probs, y_val)
    return {
        "ECE": metrics["ece"],
        "MCS": metrics["mcs"],
        "NLL": metrics["nll"],
        "Accuracy": np.mean(preds == y_val)
    }

# ============================================================
# 4. Run Multi-Seed Evaluation
# ============================================================

# evaluate_metrics_across_seeds is already imported in Section 1.

results_df = evaluate_metrics_across_seeds(
    func=evaluate_calibration,
    seeds=seeds,
    metric_keys=["ECE", "MCS", "NLL", "Accuracy"],
    mlflow_experiment="Seed_Sensitivity_Analysis",
    model=model,
    X_val=X_val,
    y_val=y_val,
    task=task
)

# A bare expression is only displayed when it ends a cell, so print explicitly.
print(results_df.head())

# ============================================================
# 5. Compute Summary Statistics
# ============================================================

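# describe() reports the sample standard deviation (ddof=1) across seeds.
summary = results_df.describe()[["ECE", "MCS", "NLL", "Accuracy"]].T
# Coefficient of variation: spread relative to the mean, comparable across metrics.
summary["std / mean"] = summary["std"] / summary["mean"]
summary = summary.round(4)
print(summary)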

# ============================================================
# 6. Visualization: Metric Variance Across Seeds
# ============================================================

plt.figure(figsize=(8, 5))
for metric in ["ECE", "MCS", "NLL", "Accuracy"]:
    plt.plot(results_df["seed"], results_df[metric], marker="o", label=metric)
plt.title("Metric Variability Across Seeds")
plt.xlabel("Random Seed")
plt.ylabel("Metric Value")
plt.legend()
plt.grid(alpha=0.3)
plt.show()

# ============================================================
# 7. Reliability Diagrams for Extremes
# ============================================================

best_seed = results_df.loc[results_df["ECE"].idxmin(), "seed"]
worst_seed = results_df.loc[results_df["ECE"].idxmax(), "seed"]

print(f"Best calibration (lowest ECE): Seed {best_seed}")
print(f"Worst calibration (highest ECE): Seed {worst_seed}")

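# The same checkpoint is reused for both plots. If a separate model was trained
# per seed, reload that seed's checkpoint before computing probabilities.
model.eval()
for seed in [best_seed, worst_seed]:
    np.random.seed(seed)
    torch.manual_seed(seed)
    with torch.no_grad():  # .numpy() raises on tensors that still track gradients
        logits = model(torch.tensor(X_val, dtype=torch.float32))
        probs = torch.softmax(logits, dim=1).numpy()
    plot_reliability_diagram(probs, y_val, title=f"Reliability Diagram (Seed {seed})")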

# ============================================================
# 8. Export Results
# ============================================================

output_path = "../../reports/summary/seed_sensitivity_results.csv"
export_seed_results(results_df, output_path)
print(f"Results exported to {output_path}")



---

## Discussion and Reporting

In the **Results and Discussion** section of the thesis:

- Report mean ± std for all calibration metrics.
- State whether the variation across seeds is statistically significant (for example, using confidence intervals over seeds).
- Include reliability diagrams for the best and worst seeds.
- Discuss what the observed stability implies for reproducibility.

**Example phrasing:**
> “Across five random seeds, Expected Calibration Error (ECE) ranged from 0.014 to 0.029,  
> indicating moderate sensitivity to initialization. The coefficient of variation (std / mean)  
> quantifies this spread relative to the mean and allows comparison with the other metrics.”
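
One way to support a statement about statistical significance is a confidence interval for the mean metric over seeds. The sketch below uses a percentile bootstrap built only on the standard library. The function name `bootstrap_mean_ci` and its defaults are assumptions made for illustration, not part of the lab utilities, and with only five seeds the resulting interval is coarse and should be read as a rough indication.

```python
import random
from statistics import mean


def bootstrap_mean_ci(values, n_resamples=10_000, alpha=0.05, rng_seed=0):
    """Percentile bootstrap confidence interval for the mean of per-seed values."""
    rng = random.Random(rng_seed)  # local generator keeps the interval reproducible
    n = len(values)
    # Resample the seeds with replacement and record the mean of each resample
    resampled_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

In the notebook this could be called as `bootstrap_mean_ci(results_df["ECE"].tolist())`. If the intervals of two configurations overlap heavily, the difference between them is hard to separate from seed noise.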


---

## Notebook Summary

| Section | Purpose |
|---------|----------|
| 1–2. Setup & Load | Connect to MLflow and load model/data. |
| 3. Evaluation Function | Compute ECE, MCS, NLL, and accuracy. |
| 4–5. Multi-seed Analysis | Aggregate results across seeds. |
| 6–7. Visualization | Plot metric variance and reliability diagrams. |
| 8. Export | Save CSV to `reports/summary/`. |
| Discussion | Interpret results for thesis inclusion. |
