# MUGA LAB Calibration Analysis Notebook
**Model Understanding and Generative Alignment Laboratory (MUGA LAB)**  
Department of Mathematics · Ateneo de Manila University  
BS Applied Mathematics (Data Science Track)

---

## Objective

This notebook evaluates **model calibration quality** for tabular neural networks  
trained via the MUGA LAB tuning pipeline.

We compute calibration metrics (ECE, MCS, NLL), plot reliability diagrams and  
confidence histograms, optionally compare the model before and after temperature  
scaling, and log the results to MLflow.
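
As a point of reference for the metrics named above, here is a minimal NumPy sketch of ECE (Expected Calibration Error, with equal-width confidence bins) and NLL (negative log-likelihood). It is not the implementation in `mugalab.calibration_metrics`, whose binning and conventions may differ. The function names `expected_calibration_error` and `negative_log_likelihood` are hypothetical, and the toy inputs are illustrative values rather than model output. MCS is omitted because its definition is not given in this notebook.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=15):
    """ECE with equal-width bins over the top-class confidence."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)           # top-class probability per sample
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index in 0..n_bins-1; bin b covers (edges[b], edges[b + 1]].
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap          # weight by the fraction of samples in the bin
    return float(ece)


def negative_log_likelihood(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    p_true = probs[np.arange(len(labels)), labels]
    return float(-np.log(np.clip(p_true, eps, 1.0)).mean())


# Toy inputs for illustration only.
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
toy_labels = [0, 1, 0, 1]
print("ECE:", expected_calibration_error(toy_probs, toy_labels, n_bins=5))
print("NLL:", negative_log_likelihood(toy_probs, toy_labels))
```

Lower values are better for both metrics. ECE depends on the number of bins, so the same `n_bins` should be used when comparing models.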

---

## Analysis Workflow

1. Load model predictions and labels.  
2. Compute calibration metrics (ECE, MCS, NLL).  
3. Plot reliability diagrams and confidence histograms.  
4. Compare calibrated vs. uncalibrated models.  
5. Export results to CSV and LaTeX summary tables.
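
Step 5 mentions LaTeX summary tables. One way to produce such a table is sketched below, using only the standard library so that it does not depend on a particular pandas version. The helper `comparison_to_latex` is hypothetical, and the rows are placeholder values, not measured results.

```python
def comparison_to_latex(rows, header=("Metric", "Uncalibrated", "Calibrated")):
    """Render (metric, uncalibrated, calibrated) rows as a LaTeX tabular."""
    lines = [
        r"\begin{tabular}{lrr}",
        r"\hline",
        " & ".join(header) + r" \\",
        r"\hline",
    ]
    for name, before, after in rows:
        # Four decimals, matching the rounding used elsewhere in the notebook.
        lines.append(f"{name} & {before:.4f} & {after:.4f} " + r"\\")
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines)


# Placeholder values for illustration only; not measured results.
example_rows = [("ECE", 0.5, 0.25), ("NLL", 1.0, 0.75)]
latex_table = comparison_to_latex(example_rows)
print(latex_table)
```

With the `comparison` DataFrame built in Section 6, the rows could be taken from `comparison[["Metric", "Uncalibrated", "Calibrated"]].itertuples(index=False)`.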

---



In [None]:

# ============================================================
# 1. Setup and Imports
# ============================================================

import os
import mlflow
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt

from mugalab.calibration_metrics import compute_calibration_metrics
from mugalab.reliability_diagram_utils import (
    plot_reliability_diagram,
    plot_confidence_histogram
)

mlflow.set_tracking_uri("../../results/mlruns")
mlflow.set_experiment("Calibration_Analysis")

%matplotlib inline
plt.style.use("seaborn-v0_8-muted")


# ============================================================
# 2. Load Model and Validation Data
# ============================================================

from mugalab.mlp_tuner_tabular_mlflow import load_model_and_data

model_uri = "runs:/<RUN_ID>/model"   # Replace with your MLflow run ID
model, X_val, y_val = load_model_and_data(model_uri)

print(f"Loaded model from {model_uri}")
print(f"Validation data: {X_val.shape}, Labels: {y_val.shape}")

# ============================================================
# 3. Generate Predictions and Probabilities
# ============================================================

model.eval()
with torch.no_grad():
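    # np.asarray also handles the case where X_val is a pandas DataFrame.
    X_tensor = torch.as_tensor(np.asarray(X_val), dtype=torch.float32)
    logits = model(X_tensor)                            # raw scores, shape (N, C)
    probs = torch.softmax(logits, dim=1).cpu().numpy()  # class probabilities
    preds = np.argmax(probs, axis=1)                    # predicted class indices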

print(f"Sample probabilities:\n{probs[:5]}")


In [None]:

# ============================================================
# 4. Compute Calibration Metrics
# ============================================================

metrics = compute_calibration_metrics(probs, y_val)

print("Calibration Metrics:")
for k, v in metrics.items():
    print(f"  {k}: {v:.4f}")

results_df = pd.DataFrame([metrics])
results_df




In [None]:
# ============================================================
# 5. Visualize Reliability Diagram
# ============================================================

plot_reliability_diagram(
    probs,
    y_val,
    n_bins=15,
    title="Reliability Diagram (Uncalibrated Model)"
)

In [None]:
# ============================================================
# 6. Optional: Compare Calibrated vs. Uncalibrated
# ============================================================

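from mugalab.undergraduate_research.experiments.temperature_scaling import TemperatureScaling

# Temperature scaling rescales logits (softmax(logits / T)), so it is fitted on the
# raw logits from Section 3, not on softmax probabilities. Assumption: fit/predict
# accept logits and predict returns probabilities. Use a separate split if available.
labels_tensor = torch.as_tensor(np.asarray(y_val), dtype=torch.long)

calibrator = TemperatureScaling()
calibrator.fit(logits, labels_tensor)

probs_calibrated = calibrator.predict(logits).detach().cpu().numpy()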

# Compute metrics after calibration
metrics_calibrated = compute_calibration_metrics(probs_calibrated, y_val)

comparison = pd.DataFrame({
    "Metric": ["ECE", "MCS", "NLL"],
    "Uncalibrated": [metrics["ece"], metrics["mcs"], metrics["nll"]],
    "Calibrated": [metrics_calibrated["ece"], metrics_calibrated["mcs"], metrics_calibrated["nll"]]
})
# Positive values mean the metric decreased after calibration.
comparison["Improvement"] = comparison["Uncalibrated"] - comparison["Calibrated"]
comparison.round(4)

In [None]:
# ============================================================
# 7. Reliability Diagram: Calibrated vs. Uncalibrated
# ============================================================

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

plot_reliability_diagram(
    probs,
    y_val,
    n_bins=15,
    title="Before Calibration",
    ax=ax[0]
)

plot_reliability_diagram(
    probs_calibrated,
    y_val,
    n_bins=15,
    title="After Calibration",
    ax=ax[1]
)

plt.tight_layout()
plt.show()


In [None]:
# ============================================================
# 8. Confidence Histogram
# ============================================================

plot_confidence_histogram(
    probs,
    y_val,
    title="Confidence Distribution (Before Calibration)"
)

plot_confidence_histogram(
    probs_calibrated,
    y_val,
    title="Confidence Distribution (After Calibration)"
)

In [None]:
# ============================================================
# 9. Log Metrics to MLflow
# ============================================================

with mlflow.start_run(run_name="Calibration_Analysis_Run"):
    for key, value in metrics.items():
        mlflow.log_metric(f"uncalibrated_{key}", value)
    for key, value in metrics_calibrated.items():
        mlflow.log_metric(f"calibrated_{key}", value)

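    # log_artifacts is the documented call for uploading a directory. Run the
    # export cell (Section 10) first so the comparison CSV is included.
    mlflow.log_artifacts("../../reports/summary/")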
    mlflow.set_tag("analysis", "calibration_comparison")

In [None]:
# ============================================================
# 10. Export Results for Reporting
# ============================================================

output_path = "../../reports/summary/calibration_comparison.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)  # create the folder if missing
comparison.to_csv(output_path, index=False)
print(f"Calibration comparison exported to: {output_path}")

## Discussion and Interpretation

In the **Results and Discussion** section of the thesis, include:

- Quantitative changes in calibration metrics (ECE, MCS, NLL).  
- Visual comparison between reliability diagrams.  
- Interpretation of whether temperature scaling brought predicted confidence closer to observed accuracy.  
- Discussion of residual miscalibration and potential future improvements.
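
When interpreting the effect of temperature scaling, it helps to recall what the method does: it divides the logits by a single scalar T > 0 before the softmax, with T chosen to minimize NLL on held-out data. The NumPy sketch below finds T by a simple grid search on synthetic logits. It illustrates the idea under those assumptions and is not the `TemperatureScaling` class used in this notebook, which may use a gradient-based optimizer. The function names, the grid range, and the synthetic data are all hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def nll(probs, labels, eps=1e-12):
    """Mean negative log-probability of the true class."""
    p_true = probs[np.arange(len(labels)), labels]
    return float(-np.log(np.clip(p_true, eps, 1.0)).mean())


def fit_temperature(logits, labels, grid=None):
    """Return the T on the grid that minimizes NLL of softmax(logits / T)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    if grid is None:
        grid = np.linspace(0.05, 10.0, 200)    # assumed search range
    losses = [nll(softmax(logits / t), labels) for t in grid]
    return float(grid[int(np.argmin(losses))])


# Synthetic, deliberately overconfident logits (not real model output).
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=500)
logits = rng.normal(size=(500, 3))
logits[np.arange(500), labels] += 1.0      # shift the true-class logit upward
logits *= 5.0                              # inflate confidence

T = fit_temperature(logits, labels)
nll_before = nll(softmax(logits), labels)
nll_after = nll(softmax(logits / T), labels)
print(f"T = {T:.2f}, NLL before = {nll_before:.3f}, after = {nll_after:.3f}")
```

A fitted T above 1 softens the predicted probabilities, which is the expected direction for an overconfident model. A fitted T below 1 would indicate underconfidence.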

**Example phrasing:**

> “Temperature scaling reduced the Expected Calibration Error from 0.024 to 0.011,  
> indicating that post-hoc calibration substantially improved model reliability  
> without affecting predictive accuracy.”
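
The accuracy claim in the example phrasing rests on a simple property: dividing logits by a positive temperature does not change their ordering, so the predicted class stays the same. A small check on synthetic logits (hypothetical values, not model output) might look like this:

```python
import numpy as np

rng = np.random.default_rng(42)
logits = rng.normal(size=(1000, 4))        # synthetic logits for illustration

preds_before = logits.argmax(axis=1)
for T in (0.5, 2.0, 7.5):
    preds_after = (logits / T).argmax(axis=1)
    # Positive scaling preserves the per-row ordering, hence the argmax.
    assert np.array_equal(preds_before, preds_after)

print("Predicted classes are identical for all tested temperatures.")
```

Other post-hoc calibration methods do not necessarily share this property, so accuracy should be re-checked if a different calibrator is used.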

## Notebook Summary

| Section | Description |
|---------|-------------|
| 1–3 | Setup, model load, and predictions |
| 4 | Metric computation |
| 5–7 | Visualization and comparison |
| 8 | Confidence distribution |
| 9–10 | Logging and export |
