# **Temperature Scaling**
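
Temperature scaling divides a model's logits by a scalar T before the softmax. For T > 1 the predicted distribution flattens, so the top-class confidence drops while the predicted class stays the same. The following is a minimal sketch of that behaviour in plain Python; the logit values and the helper name `softmax_with_temperature` are illustrative assumptions, not part of the notebook's code.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a single 3-class prediction.
logits = [4.0, 1.0, 0.5]
for T in (1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    predicted = probs.index(max(probs))
    print(f"T={T:.1f}: predicted class {predicted}, confidence {max(probs):.3f}")
```

Because dividing by a positive T preserves the ordering of the logits, accuracy is unchanged; only the confidence values move.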

In [None]:
!git clone https://github.com/bearpaw/pytorch-classification.git

Cloning into 'pytorch-classification'...
remote: Enumerating objects: 287, done.
remote: Total 287 (delta 0), reused 0 (delta 0), pack-reused 287 (from 1)
Receiving objects: 100% (287/287), 440.37 KiB | 483.00 KiB/s, done.
Resolving deltas: 100% (167/167), done.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!unzip /content/drive/MyDrive/Project.zip -d /content/

Archive:  /content/drive/MyDrive/Project.zip
   creating: /content/Project/
   creating: /content/Project/resnet164Cifar100/
  inflating: /content/Project/resnet164Cifar100/checkpoint.pth.tar  
  inflating: /content/Project/resnet164Cifar100/log.eps  
  inflating: /content/Project/resnet164Cifar100/log.txt  
  inflating: /content/Project/resnet164Cifar100/model_best.pth.tar  
   creating: /content/Project/densenet190Cifar100/
  inflating: /content/Project/densenet190Cifar100/checkpoint.pth.tar  
  inflating: /content/Project/densenet190Cifar100/log.txt  
  inflating: /content/Project/densenet190Cifar100/model_best.pth.tar  
   creating: /content/Project/WRNCifar100/
  inflating: /content/Project/WRNCifar100/checkpoint.pth.tar  
  inflating: /content/Project/WRNCifar100/log.eps  
  inflating: /content/Project/WRNCifar100/log.txt  
  inflating: /content/Project/WRNCifar100/model_best.pth.tar  
   creating: /content/Project/zip/
  inflating: /content/Project/zip/resnet-110.zip  
  inflati

# **CIFAR-100 ~ 4 Models**
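
Expected Calibration Error (ECE), the metric reported throughout this section, groups predictions into equal-width confidence bins and takes the sample-weighted average of |confidence - accuracy| across the bins. The following is a small worked sketch in plain Python, assuming the same (lower, upper] binning as the notebook's `calculate_ece`; the toy predictions and the name `expected_calibration_error` are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE in percent, using equal-width bins of the form (lower, upper]."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lower, upper = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lower < c <= upper]
        if not idx:
            continue  # empty bins contribute nothing
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += abs(conf - acc) * len(idx) / n
    return 100.0 * ece

# Hypothetical toy predictions: four at 0.95 confidence (3 correct)
# and four at 0.55 confidence (2 correct).
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 1, 0, 0]
ece_toy = expected_calibration_error(confidences, correct)
print(f"ECE = {ece_toy:.2f}%")
```

In this toy case the high-confidence bin has a gap of 0.20 and the other bin a gap of 0.05, each holding half of the samples, so the result is 0.5 × 0.20 + 0.5 × 0.05 = 0.125, i.e. 12.5%.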

In [None]:
%cd pytorch-classification
# -----------------------------
# ECE CALIBRATION ON CIFAR-100
# -----------------------------
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split
from collections import OrderedDict
import matplotlib.pyplot as plt
import os

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Temperature grid
T_values = np.linspace(1.0, 2.5, num=25)

transform_cifar = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

# --- Load and Split Data ---
print("\nLoading and splitting CIFAR-100 data...")
try:
    # 1. Load the *original* 10,000-image test set
    full_test_dataset = datasets.CIFAR100(root='./data', train=False, download=True, transform=transform_cifar)

    # 2. Split the 10,000 images into a 5,000-image validation set and a 5,000-image test set
    val_size = 5000
    test_size = 5000

    # Use a fixed generator for reproducible splits
    val_dataset, test_dataset = random_split(full_test_dataset, [val_size, test_size],
                                             generator=torch.Generator().manual_seed(42))

    # 3. Create DataLoaders
    val_loader = DataLoader(val_dataset, batch_size=100, shuffle=False, num_workers=2)
    test_loader = DataLoader(test_dataset, batch_size=100, shuffle=False, num_workers=2)

    print(f"Data successfully split from original test set:")
    print(f"  -> New Validation samples: {len(val_dataset)}")
    print(f"  -> New Test samples:       {len(test_dataset)}")

except Exception as e:
    print(f"❌ ERROR: Could not load CIFAR-100 data. {e}")
    exit()

# --- Helper Functions ---

def get_predictions(model, loader, device, temp=1.0):
    model.eval()
    all_conf, all_corr = [], []
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x) / temp
            probs = F.softmax(logits, dim=1)
            conf, pred = torch.max(probs, 1)
            all_conf.extend(conf.cpu().numpy())
            all_corr.extend((pred == y).cpu().numpy())
    # Note: Accuracy (corr) is not affected by temp, so acc_b and acc_a will be identical
    return np.array(all_conf), np.array(all_corr)

def calculate_ece(confidences, correct, n_bins=10):
    """Calculates the Expected Calibration Error (ECE)."""
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers, bin_uppers = bin_boundaries[:-1], bin_boundaries[1:]
    ece = 0.0
    for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
        in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
        prop_in_bin = np.mean(in_bin)
        if prop_in_bin > 0:
            accuracy_in_bin = np.mean(correct[in_bin])
            avg_confidence_in_bin = np.mean(confidences[in_bin])
            ece += np.abs(avg_confidence_in_bin - accuracy_in_bin) * prop_in_bin
    return ece * 100

# ---------------------------
# PLOTTING FUNCTIONS
# ---------------------------

def plot_reliability_diagram(confidences, correct, n_bins, model_name, suffix):
    """Plots a reliability diagram and saves it to a file."""
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers, bin_uppers = bin_boundaries[:-1], bin_boundaries[1:]

    bin_accs = np.zeros(n_bins)
    bin_confs = np.zeros(n_bins)
    bin_props = np.zeros(n_bins)

    for i, (bin_lower, bin_upper) in enumerate(zip(bin_lowers, bin_uppers)):
        in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
        bin_props[i] = np.mean(in_bin)
        if bin_props[i] > 0:
            bin_accs[i] = np.mean(correct[in_bin])
            bin_confs[i] = np.mean(confidences[in_bin])

    # Plot
    plt.figure(figsize=(8, 7))
    bar_width = 1.0 / n_bins
    bar_centers = bin_lowers + bar_width / 2
    non_empty_mask = bin_props > 0

    # ECE bar plot
    plt.bar(bar_centers[non_empty_mask], bin_accs[non_empty_mask],
            width=bar_width * 0.9, alpha=0.3, color='red',
            edgecolor='red', label='Accuracy')
    # Gaps (where conf != acc)
    plt.bar(bar_centers[non_empty_mask], (bin_confs - bin_accs)[non_empty_mask],
            bottom=bin_accs[non_empty_mask],
            width=bar_width * 0.9, alpha=0.5, color='blue',
            edgecolor='black', label='Confidence Gap')

    # Perfect calibration line
    plt.plot([0, 1], [0, 1], 'k--', label='Perfect Calibration')

    plt.xlabel('Confidence')
    plt.ylabel('Accuracy')
    plt.title(f'Reliability Diagram: {model_name} ({suffix})')
    plt.legend()
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.grid(True, linestyle='--', alpha=0.6)

    save_folder = "/content/cifar-100"
    os.makedirs(save_folder, exist_ok=True) # Creates the folder if it doesn't exist

    filename = f"{save_folder}/{model_name.replace(' ', '_')}_reliability_{suffix}.png"
    plt.savefig(filename)
    plt.close()
    print(f"✅ Saved reliability diagram: {filename}")

def plot_ece_vs_temp(T_values, ece_values, best_T, model_name):
    """Plots ECE vs. Temperature and saves it to a file."""
    plt.figure(figsize=(8, 6))
    plt.plot(T_values, ece_values, 'o-', label='ECE (%)', color='royalblue')
    plt.axvline(best_T, color='red', linestyle='--',
                label=f'Min ECE at T={best_T:.2f}')
    plt.xlabel('Temperature (T)')
    plt.ylabel('Expected Calibration Error (%)')
    plt.title(f'ECE vs Temperature (Validation Set): {model_name}')
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.legend()
    plt.tight_layout()

    save_folder = "/content/cifar-100"
    os.makedirs(save_folder, exist_ok=True) # Creates the folder if it doesn't exist

    filename = f"{save_folder}/{model_name.replace(' ', '_')}_ECE_vs_T.png"
    plt.savefig(filename)
    plt.close()
    print(f"✅ Saved ECE-vs-T curve: {filename}")


# ---------------------------
# CHECKPOINT LOADER
# ---------------------------
def load_checkpoint(model, path, device):
    print(f"Loading checkpoint: {path}")

    try:
        ckpt = torch.load(path, map_location=device, weights_only=False)
    except Exception as e:
        print(f"Error loading checkpoint with weights_only=False: {e}")
        try:
            print("Attempting fallback with weights_only=True...")
            ckpt = torch.load(path, map_location=device, weights_only=True)
        except Exception as e_true:
            print(f"Fallback also failed: {e_true}")
            raise e

    if 'state_dict' in ckpt:
        state_dict = ckpt['state_dict']
    else:
        print("Warning: 'state_dict' key not found. Assuming checkpoint is the state_dict itself.")
        state_dict = ckpt

    new_sd = OrderedDict()
    for k,v in state_dict.items():
        new_sd[k.replace('module.','')] = v
    model.load_state_dict(new_sd)
    return model

# ---------------------------
# ECE Calibration Function (returns a results dict per model)
# ---------------------------
def run_ece_calibration(model, model_name, n_bins=15):
    print("\n" + "="*60)
    print(f"🔍 ECE Calibration for: {model_name}")
    print("="*60)

    model.to(device).eval()

    # --- Test metrics BEFORE scaling ---
    conf_b, corr_b = get_predictions(model, test_loader, device, temp=1.0)
    ece_b = calculate_ece(conf_b, corr_b, n_bins=n_bins)
    acc_b = np.mean(corr_b) * 100
    avg_conf_b = np.mean(conf_b) * 100

    print(f"\nTest metrics BEFORE scaling (T=1.0) → Acc: {acc_b:.2f}%, Avg Conf: {avg_conf_b:.2f}%, ECE: {ece_b:.3f}%")
    plot_reliability_diagram(conf_b, corr_b, n_bins, model_name, "before")


    # --- Temperature search on validation set ---
    best_ece_val = float('inf')
    best_T = None
    print("\nValidation set temperature search:")

    ece_values_val = []

    for T in T_values:
        conf, corr = get_predictions(model, val_loader, device, temp=T)
        ece = calculate_ece(conf, corr, n_bins=n_bins)
        ece_values_val.append(ece)
        print(f"T={T:.2f} → ECE={ece:.3f}%")
        if ece < best_ece_val:
            best_ece_val = ece
            best_T = T

    print(f"\n🎯 Best T on Validation Set (Min ECE): {best_T:.2f} → ECE={best_ece_val:.3f}%")
    plot_ece_vs_temp(T_values, ece_values_val, best_T, model_name)


    # --- Test metrics AFTER scaling ---
    conf_a, corr_a = get_predictions(model, test_loader, device, temp=best_T)
    ece_a = calculate_ece(conf_a, corr_a, n_bins=n_bins)
    acc_a = np.mean(corr_a) * 100 # Note: acc_a will be identical to acc_b
    avg_conf_a = np.mean(conf_a) * 100

    print(f"\nTest metrics AFTER scaling (T={best_T:.2f}) → Acc: {acc_a:.2f}%, Avg Conf: {avg_conf_a:.2f}%, ECE: {ece_a:.3f}%\n")
    plot_reliability_diagram(conf_a, corr_a, n_bins, model_name, "after")

    # --- Return results for the final table ---
    return {
        "name": model_name,
        "acc": acc_a,
        "ece_before": ece_b,
        "ece_after": ece_a,
        "conf_before": avg_conf_b,
        "conf_after": avg_conf_a,
        "best_T": best_T
    }

# -----------------------------
# RUN ECE CALIBRATION FOR EACH MODEL
# -----------------------------

# List to store results from all models
all_results = []

try:
    # Model definitions from the cloned pytorch-classification repository
    from models.cifar import resnet
    from models.cifar.densenet import densenet, Bottleneck
    from models.cifar.wrn import WideResNet
    # -----------------------------------------
    print("Successfully imported models from 'models.cifar' directory.")

except ImportError:
    print("WARNING: Could not import model definitions. Using mocks.")
    class MockModel(nn.Module):
        def __init__(self): super().__init__(); self.fc = nn.Linear(3072, 100)
        def forward(self, x): return self.fc(x.view(x.size(0), -1))

    resnet = lambda depth, num_classes, block_name: MockModel()
    densenet = lambda depth, num_classes, growthRate, compressionRate, block: MockModel()
    WideResNet = lambda depth, num_classes, widen_factor, dropRate: MockModel()
    Bottleneck = None


try:
    model_resnet164 = resnet(depth=164, num_classes=100, block_name='Bottleneck')
    path = '/content/Project/resnet164Cifar100/checkpoint.pth.tar'
    model_resnet164 = load_checkpoint(model_resnet164, path, device)
    # Capture results for the final table
    results = run_ece_calibration(model_resnet164, "ResNet-164")
    all_results.append(results)
except Exception as e:
    print(f"❌ Skipping ResNet-164: {e}")

try:
    model_densenet = densenet(depth=190, num_classes=100, growthRate=40,
                              compressionRate=2, block=Bottleneck)
    path = '/content/Project/densenet190Cifar100/checkpoint.pth.tar'
    model_densenet = load_checkpoint(model_densenet, path, device)
    # Capture results for the final table
    results = run_ece_calibration(model_densenet, "DenseNet-190")
    all_results.append(results)
except Exception as e:
    print(f"❌ Skipping DenseNet-190: {e}")

try:
    model_hub = torch.hub.load("chenyaofo/pytorch-cifar-models",
                               "cifar100_resnet56", pretrained=True, trust_repo=True)
    # Capture results for the final table
    results = run_ece_calibration(model_hub, "ResNet-56 (Hub)")
    all_results.append(results)
except Exception as e:
    print(f"❌ Skipping ResNet-56 (Hub): {e}")

try:
    model_wrn = WideResNet(depth=28, num_classes=100, widen_factor=10, dropRate=0.3)
    path = '/content/Project/WRNCifar100/checkpoint.pth.tar'
    model_wrn = load_checkpoint(model_wrn, path, device)
    # Capture results for the final table
    results = run_ece_calibration(model_wrn, "WideResNet-28-10")
    all_results.append(results)
except Exception as e:
    print(f"❌ Skipping WRN-28-10: {e}")

print("\n" + "="*60)
print("✅ ECE Calibration run complete for all models.")
print("="*60)


# -----------------------------
# FINAL COMPARISON TABLE
# -----------------------------
print("\n" + "="*130)
print("📊 Final Calibration Comparison on CIFAR-100 Test Set")
print("="*130)

# Print Header
print(f"{'Model':<22} | {'Accuracy':>10} | {'ECE (Before)':>13} | {'ECE (After)':>12} | {'Avg Conf (Before)':>18} | {'Avg Conf (After)':>17} | {'Optimal T':>10}")
print("-" * 130)

# Print results
for r in all_results:
    print(f"{r['name']:<22} | {r['acc']:>9.2f}% | {r['ece_before']:>12.4f}% | {r['ece_after']:>11.4f}% | {r['conf_before']:>17.2f}% | {r['conf_after']:>16.2f}% | {r['best_T']:>10.4f}")

print("=" * 130)

/content/pytorch-classification
Using device: cuda

Loading and splitting CIFAR-100 data...


100%|██████████| 169M/169M [00:04<00:00, 34.7MB/s]


Data successfully split from original test set:
  -> New Validation samples: 5000
  -> New Test samples:       5000
Successfully imported models from 'models.cifar' directory.
Loading checkpoint: /content/Project/resnet164Cifar100/checkpoint.pth.tar

🔍 ECE Calibration for: ResNet-164

Test metrics BEFORE scaling (T=1.0) → Acc: 73.14%, Avg Conf: 88.21%, ECE: 15.075%
✅ Saved reliability diagram: /content/cifar-100/ResNet-164_reliability_before.png

Validation set temperature search:
T=1.00 → ECE=14.212%
T=1.06 → ECE=13.434%
T=1.12 → ECE=12.645%
T=1.19 → ECE=11.848%
T=1.25 → ECE=11.040%
T=1.31 → ECE=10.222%
T=1.38 → ECE=9.395%
T=1.44 → ECE=8.558%
T=1.50 → ECE=7.711%
T=1.56 → ECE=6.855%
T=1.62 → ECE=5.990%
T=1.69 → ECE=5.116%
T=1.75 → ECE=4.264%
T=1.81 → ECE=3.352%
T=1.88 → ECE=2.709%
T=1.94 → ECE=2.326%
T=2.00 → ECE=2.108%
T=2.06 → ECE=1.860%
T=2.12 → ECE=2.595%
T=2.19 → ECE=3.173%
T=2.25 → ECE=3.168%
T=2.31 → ECE=4.074%
T=2.38 → ECE=5.

Using cache found in /root/.cache/torch/hub/chenyaofo_pytorch-cifar-models_master



Test metrics BEFORE scaling (T=1.0) → Acc: 67.42%, Avg Conf: 83.29%, ECE: 15.867%
✅ Saved reliability diagram: /content/cifar-100/ResNet-56_(Hub)_reliability_before.png

Validation set temperature search:
T=1.00 → ECE=16.074%
T=1.06 → ECE=14.927%
T=1.12 → ECE=13.749%
T=1.19 → ECE=12.594%
T=1.25 → ECE=11.384%
T=1.31 → ECE=10.152%
T=1.38 → ECE=8.931%
T=1.44 → ECE=7.600%
T=1.50 → ECE=6.398%
T=1.56 → ECE=5.611%
T=1.62 → ECE=4.298%
T=1.69 → ECE=3.509%
T=1.75 → ECE=2.972%
T=1.81 → ECE=2.739%
T=1.88 → ECE=2.704%
T=1.94 → ECE=3.539%
T=2.00 → ECE=4.897%
T=2.06 → ECE=6.356%
T=2.12 → ECE=7.820%
T=2.19 → ECE=9.287%
T=2.25 → ECE=10.753%
T=2.31 → ECE=12.217%
T=2.38 → ECE=13.675%
T=2.44 → ECE=15.124%
T=2.50 → ECE=16.561%

🎯 Best T on Validation Set (Min ECE): 1.88 → ECE=2.704%
✅ Saved ECE-vs-T curve: /content/cifar-100/ResNet-56_(Hub)_ECE_vs_T.png

Test metrics AFTER scaling (T=1.88) → Acc: 67.42%, Avg Conf: 64.96%, ECE: 2.677%

✅ Save
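
The search above re-runs the model over the validation loader for every grid value and selects T by minimum ECE. The following sketches an alternative that this notebook does not use: collect the validation logits once, reuse them for every candidate T, and choose T by minimising the validation negative log-likelihood (NLL), a common criterion for temperature scaling. The names `mean_nll` and `fit_temperature`, the search bounds, and the toy logits are assumptions for illustration; golden-section search is used here on the assumption that the NLL has a single minimum in T on the chosen interval.

```python
import math

def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for row, y in zip(logit_rows, labels):
        scaled = [z / temperature for z in row]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)

def fit_temperature(logit_rows, labels, lo=0.5, hi=5.0, tol=1e-4):
    """Golden-section search for the T in [lo, hi] that minimises mean_nll."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc = mean_nll(logit_rows, labels, c)
    fd = mean_nll(logit_rows, labels, d)
    while b - a > tol:
        if fc < fd:   # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = mean_nll(logit_rows, labels, c)
        else:         # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = mean_nll(logit_rows, labels, d)
    return (a + b) / 2

# Hypothetical cached logits: four confident predictions, one of them wrong.
cached_logits = [[6.0, 0.0, 0.0], [0.0, 6.0, 0.0], [6.0, 0.0, 0.0], [0.0, 0.0, 6.0]]
labels = [0, 1, 1, 2]
fitted_T = fit_temperature(cached_logits, labels)
print(f"fitted T = {fitted_T:.3f}")
```

In a real run, the cached logits would come from a single pass of the model over `val_loader`, after which each candidate T costs only a softmax over stored values.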

# **CIFAR-10 ~ 2 Models**

In [None]:
# -----------------------------
# ECE CALIBRATION FOR CIFAR-10 MODELS
# -----------------------------
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split
from collections import OrderedDict
import matplotlib.pyplot as plt
import os

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Temperature grid
T_values = np.linspace(1.0, 5.0, num=10)  # matches the sweep in the recorded output (T=1.00 to 5.00)

# Normalization stats for CIFAR-10
transform_cifar = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# --- Load and Split Data (for CIFAR-10) ---
print("\nLoading and splitting CIFAR-10 data...")
try:
    # 1. Load the *original* 10,000-image CIFAR-10 test set
    full_test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform_cifar)

    # 2. Split the 10,000 images into a validation set and a test set
    val_size = 5000
    test_size = 5000
    val_dataset, test_dataset = random_split(full_test_dataset, [val_size, test_size],
                                             generator=torch.Generator().manual_seed(42))

    # 3. Create DataLoaders
    val_loader = DataLoader(val_dataset, batch_size=100, shuffle=False, num_workers=2)
    test_loader = DataLoader(test_dataset, batch_size=100, shuffle=False, num_workers=2)

    print(f"Data successfully split from original CIFAR-10 test set:")
    print(f"  -> New Validation samples: {len(val_dataset)}")
    print(f"  -> New Test samples:       {len(test_dataset)}")

except Exception as e:
    print(f"❌ ERROR: Could not load CIFAR-10 data. {e}")
    exit()

# --- Helper Functions ---

def get_predictions(model, loader, device, temp=1.0):
    model.eval()
    all_conf, all_corr = [], []
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x) / temp
            probs = F.softmax(logits, dim=1)
            conf, pred = torch.max(probs, 1)
            all_conf.extend(conf.cpu().numpy())
            all_corr.extend((pred == y).cpu().numpy())
    return np.array(all_conf), np.array(all_corr)

def calculate_ece(confidences, correct, n_bins=15):
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers, bin_uppers = bin_boundaries[:-1], bin_boundaries[1:]
    ece = 0.0
    for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
        in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
        prop_in_bin = np.mean(in_bin)
        if prop_in_bin > 0:
            accuracy_in_bin = np.mean(correct[in_bin])
            avg_confidence_in_bin = np.mean(confidences[in_bin])
            ece += np.abs(avg_confidence_in_bin - accuracy_in_bin) * prop_in_bin
    return ece * 100

# --- Plotting Functions (Updated with save folder) ---

def plot_reliability_diagram(confidences, correct, n_bins, model_name, suffix):
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers, bin_uppers = bin_boundaries[:-1], bin_boundaries[1:]
    bin_accs, bin_confs, bin_props = np.zeros(n_bins), np.zeros(n_bins), np.zeros(n_bins)

    for i, (bin_lower, bin_upper) in enumerate(zip(bin_lowers, bin_uppers)):
        in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
        bin_props[i] = np.mean(in_bin)
        if bin_props[i] > 0:
            bin_accs[i] = np.mean(correct[in_bin])
            bin_confs[i] = np.mean(confidences[in_bin])

    plt.figure(figsize=(8, 7))
    bar_width, bar_centers = 1.0 / n_bins, bin_lowers + 1.0 / (2 * n_bins)
    non_empty = bin_props > 0

    plt.bar(bar_centers[non_empty], bin_accs[non_empty], width=bar_width*0.9, alpha=0.3, color='red', edgecolor='red', label='Accuracy')
    plt.bar(bar_centers[non_empty], (bin_confs - bin_accs)[non_empty], bottom=bin_accs[non_empty], width=bar_width*0.9, alpha=0.5, color='blue', edgecolor='black', label='Confidence Gap')
    plt.plot([0, 1], [0, 1], 'k--', label='Perfect Calibration')

    plt.xlabel('Confidence'); plt.ylabel('Accuracy'); plt.title(f'Reliability Diagram: {model_name} ({suffix})')
    plt.legend(); plt.xlim(0, 1); plt.ylim(0, 1); plt.grid(True, linestyle='--', alpha=0.6)

    save_folder = "/content/cifar-10"
    os.makedirs(save_folder, exist_ok=True)
    filename = f"{save_folder}/{model_name.replace(' ', '_')}_C10_reliability_{suffix}.png"
    plt.savefig(filename)
    plt.close()
    print(f"✅ Saved reliability diagram: {filename}")

def plot_ece_vs_temp(T_values, ece_values, best_T, model_name):
    plt.figure(figsize=(8, 6))
    plt.plot(T_values, ece_values, 'o-', label='ECE (%)', color='royalblue')
    plt.axvline(best_T, color='red', linestyle='--', label=f'Min ECE at T={best_T:.2f}')
    plt.xlabel('Temperature (T)'); plt.ylabel('Expected Calibration Error (%)'); plt.title(f'ECE vs Temp (Validation): {model_name}')
    plt.grid(True, linestyle='--', alpha=0.6); plt.legend(); plt.tight_layout()

    save_folder = "/content/cifar-10"
    os.makedirs(save_folder, exist_ok=True)
    filename = f"{save_folder}/{model_name.replace(' ', '_')}_C10_ECE_vs_T.png"
    plt.savefig(filename)
    plt.close()
    print(f"✅ Saved ECE-vs-T curve: {filename}")


# --- Checkpoint Loader ---
def load_checkpoint(model, path, device):
    print(f"Loading checkpoint: {path}")
    ckpt = torch.load(path, map_location=device, weights_only=False)
    state_dict = ckpt.get('state_dict', ckpt)
    new_sd = OrderedDict((k.replace('module.', ''), v) for k, v in state_dict.items())
    model.load_state_dict(new_sd)
    return model

# --- Main Calibration Function ---
def run_ece_calibration(model, model_name, n_bins=15):
    print("\n" + "="*60); print(f"🔍 ECE Calibration for: {model_name} on CIFAR-10"); print("="*60)
    model.to(device).eval()

    # Before scaling
    conf_b, corr_b = get_predictions(model, test_loader, device, temp=1.0)
    ece_b = calculate_ece(conf_b, corr_b, n_bins=n_bins)
    acc_b = np.mean(corr_b) * 100
    avg_conf_b = np.mean(conf_b) * 100
    print(f"\nTest metrics BEFORE scaling → Acc: {acc_b:.2f}%, Avg Conf: {avg_conf_b:.2f}%, ECE: {ece_b:.3f}%")
    plot_reliability_diagram(conf_b, corr_b, n_bins, model_name, "before")

    # Temperature search
    best_ece_val, best_T, ece_values_val = float('inf'), None, []
    print("\nValidation set temperature search:")
    for T in T_values:
        conf, corr = get_predictions(model, val_loader, device, temp=T)
        ece = calculate_ece(conf, corr, n_bins=n_bins)
        ece_values_val.append(ece)
        print(f"T={T:.2f} → ECE={ece:.3f}%")
        if ece < best_ece_val:
            best_ece_val, best_T = ece, T

    print(f"\n🎯 Best T on Validation Set: {best_T:.2f} → ECE={best_ece_val:.3f}%")
    plot_ece_vs_temp(T_values, ece_values_val, best_T, model_name)

    # After scaling
    conf_a, corr_a = get_predictions(model, test_loader, device, temp=best_T)
    ece_a = calculate_ece(conf_a, corr_a, n_bins=n_bins)
    acc_a = np.mean(corr_a) * 100
    avg_conf_a = np.mean(conf_a) * 100
    print(f"\nTest metrics AFTER scaling (T={best_T:.2f}) → Acc: {acc_a:.2f}%, Avg Conf: {avg_conf_a:.2f}%, ECE: {ece_a:.3f}%\n")
    plot_reliability_diagram(conf_a, corr_a, n_bins, model_name, "after")

    return {"name": model_name, "acc": acc_a, "ece_before": ece_b, "ece_after": ece_a,
            "conf_before": avg_conf_b, "conf_after": avg_conf_a, "best_T": best_T}

# =============================================
# --- RUN CALIBRATION FOR CIFAR-10 MODELS ---
# =============================================
all_results = []

# --- 1. ResNet-164 ---
try:
    from models.cifar import resnet
    print("\n--- Running calibration for local ResNet-164 ---")
    model_resnet164 = resnet(depth=164, num_classes=10, block_name='Bottleneck')
    path = '/content/Project/resnet110cifar10/model_best.pth.tar'
    model_resnet164 = load_checkpoint(model_resnet164, path, device)
    results = run_ece_calibration(model_resnet164, "ResNet-164")
    all_results.append(results)
except Exception as e:
    print(f"❌ Skipping local ResNet-164: {e}")

# --- 2. ResNet-56 from torch.hub ---
try:
    print("\n--- Running calibration for ResNet-56 (torch.hub) ---")
    model_hub = torch.hub.load("chenyaofo/pytorch-cifar-models", "cifar10_resnet56", pretrained=True, trust_repo=True)
    results = run_ece_calibration(model_hub, "ResNet-56 (Hub)")
    all_results.append(results)
except Exception as e:
    print(f"❌ Skipping ResNet-56 (Hub): {e}")

# --- FINAL COMPARISON TABLE ---
if all_results:
    print("\n" + "="*130)
    print("📊 Final Calibration Comparison on CIFAR-10 Test Set")
    print("="*130)
    print(f"{'Model':<22} | {'Accuracy':>10} | {'ECE (Before)':>13} | {'ECE (After)':>12} | {'Avg Conf (Before)':>18} | {'Avg Conf (After)':>17} | {'Optimal T':>10}")
    print("-" * 130)
    for r in all_results:
        print(f"{r['name']:<22} | {r['acc']:>9.2f}% | {r['ece_before']:>12.4f}% | {r['ece_after']:>11.4f}% | {r['conf_before']:>17.2f}% | {r['conf_after']:>16.2f}% | {r['best_T']:>10.4f}")
    print("=" * 130)
else:
    print("\nNo models were successfully calibrated to display a final table.")

Using device: cuda

Loading and splitting CIFAR-10 data...


100%|██████████| 170M/170M [00:06<00:00, 27.8MB/s]


Data successfully split from original CIFAR-10 test set:
  -> New Validation samples: 5000
  -> New Test samples:       5000

--- Running calibration for local ResNet-164 ---
Loading checkpoint: /content/Project/resnet110cifar10/model_best.pth.tar

🔍 ECE Calibration for: ResNet-164 on CIFAR-10

Test metrics BEFORE scaling → Acc: 93.20%, Avg Conf: 96.69%, ECE: 3.749%
✅ Saved reliability diagram: /content/cifar-10/ResNet-164_C10_reliability_before.png

Validation set temperature search:
T=1.00 → ECE=3.426%
T=1.44 → ECE=1.577%
T=1.89 → ECE=0.864%
T=2.33 → ECE=3.306%
T=2.78 → ECE=6.308%
T=3.22 → ECE=9.693%
T=3.67 → ECE=13.359%
T=4.11 → ECE=17.121%
T=4.56 → ECE=20.984%
T=5.00 → ECE=24.811%

🎯 Best T on Validation Set: 1.89 → ECE=0.864%
✅ Saved ECE-vs-T curve: /content/cifar-10/ResNet-164_C10_ECE_vs_T.png

Test metrics AFTER scaling (T=1.89) → Acc: 93.20%, Avg Conf: 92.68%, ECE: 0.953%

✅ Saved reliability diagram: /content/cifar-10/ResNet-164_C10_reliab

Using cache found in /root/.cache/torch/hub/chenyaofo_pytorch-cifar-models_master


Downloading: "https://github.com/chenyaofo/pytorch-cifar-models/releases/download/resnet/cifar10_resnet56-187c023a.pt" to /root/.cache/torch/hub/checkpoints/cifar10_resnet56-187c023a.pt


100%|██████████| 3.39M/3.39M [00:00<00:00, 67.4MB/s]


🔍 ECE Calibration for: ResNet-56 (Hub) on CIFAR-10






Test metrics BEFORE scaling → Acc: 94.12%, Avg Conf: 98.07%, ECE: 3.964%
✅ Saved reliability diagram: /content/cifar-10/ResNet-56_(Hub)_C10_reliability_before.png

Validation set temperature search:
T=1.00 → ECE=3.652%
T=1.44 → ECE=2.216%
T=1.89 → ECE=1.506%
T=2.33 → ECE=4.945%
T=2.78 → ECE=10.909%
T=3.22 → ECE=17.710%
T=3.67 → ECE=24.637%
T=4.11 → ECE=31.160%
T=4.56 → ECE=37.032%
T=5.00 → ECE=42.168%

🎯 Best T on Validation Set: 1.89 → ECE=1.506%
✅ Saved ECE-vs-T curve: /content/cifar-10/ResNet-56_(Hub)_C10_ECE_vs_T.png

Test metrics AFTER scaling (T=1.89) → Acc: 94.12%, Avg Conf: 94.00%, ECE: 1.212%

✅ Saved reliability diagram: /content/cifar-10/ResNet-56_(Hub)_C10_reliability_after.png

📊 Final Calibration Comparison on CIFAR-10 Test Set
Model                  |   Accuracy |  ECE (Before) |  ECE (After) |  Avg Conf (Before) |  Avg Conf (After) |  Optimal T
----------------------------------------------------------------------------------------
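
Both `load_checkpoint` helpers above remove the `module.` prefix that `nn.DataParallel` adds to parameter names, using `str.replace`, which also rewrites any later occurrence of `module.` inside a key. The following is a hedged sketch of a stricter variant that strips only a leading prefix; the function name `strip_prefix` and the example keys are hypothetical, and plain integers stand in for tensors so the block runs without PyTorch.

```python
from collections import OrderedDict

def strip_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(prefix):
            key = key[len(prefix):]
        cleaned[key] = value
    return cleaned

# Hypothetical keys; the second contains "module." twice to show the difference.
wrapped = OrderedDict([
    ("module.conv1.weight", 1),
    ("module.layer1.module.bias", 2),
])
cleaned = strip_prefix(wrapped)
print(list(cleaned))
```

For checkpoints whose keys contain `module.` only as a leading prefix, both approaches give the same result.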

In [None]:
!zip -r /content/my_folder1.zip /content/cifar-10

  adding: content/cifar-10/ (stored 0%)
  adding: content/cifar-10/ResNet-56_(Hub)_C10_ECE_vs_T.png (deflated 12%)
  adding: content/cifar-10/ResNet-56_(Hub)_C10_reliability_before.png (deflated 12%)
  adding: content/cifar-10/ResNet-164_C10_reliability_before.png (deflated 11%)
  adding: content/cifar-10/ResNet-164_C10_ECE_vs_T.png (deflated 12%)
  adding: content/cifar-10/ResNet-164_C10_reliability_after.png (deflated 12%)
  adding: content/cifar-10/ResNet-56_(Hub)_C10_reliability_after.png (deflated 12%)


In [None]:
!zip -r /content/my_folder2.zip /content/cifar-100/

  adding: content/cifar-100/ (stored 0%)
  adding: content/cifar-100/WideResNet-28-10_ECE_vs_T.png (deflated 11%)
  adding: content/cifar-100/ResNet-56_(Hub)_reliability_after.png (deflated 11%)
  adding: content/cifar-100/DenseNet-190_reliability_after.png (deflated 12%)
  adding: content/cifar-100/ResNet-164_reliability_after.png (deflated 12%)
  adding: content/cifar-100/ResNet-56_(Hub)_ECE_vs_T.png (deflated 9%)
  adding: content/cifar-100/ResNet-164_reliability_before.png (deflated 12%)
  adding: content/cifar-100/WideResNet-28-10_reliability_after.png (deflated 11%)
  adding: content/cifar-100/ResNet-56_(Hub)_reliability_before.png (deflated 11%)
  adding: content/cifar-100/DenseNet-190_ECE_vs_T.png (deflated 11%)
  adding: content/cifar-100/WideResNet-28-10_reliability_before.png (deflated 11%)
  adding: content/cifar-100/ResNet-164_ECE_vs_T.png (deflated 10%)
  adding: content/cifar-100/DenseNet-190_reliability_before.png (deflated 12%)


In [6]:
%cd pytorch-classification/

/content/pytorch-classification


# **CARS-MOBILENETV2**

In [5]:
from google.colab import files
import os

print("Please upload the kaggle.json file you downloaded from your Kaggle account.")
# This will open a file selection dialog
files.upload()

# Now, we'll move the file to the correct location and set permissions
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

print("\nKaggle API token configured successfully!")

Please upload the kaggle.json file you downloaded from your Kaggle account.


Saving kaggle.json to kaggle.json

Kaggle API token configured successfully!


In [None]:
!kaggle datasets download -d jutrera/stanford-car-dataset-by-classes-folder

# Unzip the downloaded file. The '-q' flag makes it quiet (less output).
!unzip -q stanford-car-dataset-by-classes-folder.zip

print("\nDataset downloaded and unzipped successfully.")

Dataset URL: https://www.kaggle.com/datasets/jutrera/stanford-car-dataset-by-classes-folder
License(s): other
Downloading stanford-car-dataset-by-classes-folder.zip to /content/pytorch-classification
 99% 1.80G/1.83G [00:23<00:00, 204MB/s]
100% 1.83G/1.83G [00:23<00:00, 84.2MB/s]

Dataset downloaded and unzipped successfully.


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models, datasets, transforms
from torch.utils.data import DataLoader, random_split
import numpy as np
import matplotlib.pyplot as plt
import os
from collections import OrderedDict

# ==============================================================================
# 1. SETUP & DATA LOADING
# ==============================================================================
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"==> Using device: {device}")

# --- Data Transforms ---
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]),
    'test': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
}

# --- Load and Split Data ---
print("\nLoading and splitting data...")
try:
    # 1. Load the original test set from ImageFolder
    full_test_dataset = datasets.ImageFolder('/content/pytorch-classification/car_data/car_data/test', data_transforms['test'])

    # --- FIX: Get num_classes from the ImageFolder *before* splitting ---
    num_classes = len(full_test_dataset.classes)
    print(f"Found {num_classes} classes in the dataset.")

    # 2. Define split sizes
    val_size = 4000
    test_size = len(full_test_dataset) - val_size # Calculate remaining size

    if test_size <= 0:
        print(f"❌ ERROR: Validation size ({val_size}) is >= total test set ({len(full_test_dataset)}).")
        exit()

    # 3. Split the dataset
    val_dataset, test_dataset = random_split(full_test_dataset, [val_size, test_size],
                                             generator=torch.Generator().manual_seed(42))

    # 4. Create DataLoaders
    val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=2)
    test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, num_workers=2)

    print(f"Data successfully split:")
    print(f"  -> Validation samples: {len(val_dataset)}")
    print(f"  -> Test samples:       {len(test_dataset)}")

except FileNotFoundError as e:
    print(f"❌ ERROR: Data directory not found. Have you run the Kaggle download cell? {e}")
    # exit() # In a real script, you'd exit here
except Exception as e:
    print(f"❌ An error occurred during data loading: {e}")

# ==============================================================================
# 2. CALIBRATION & PLOTTING FUNCTIONS (same as the CIFAR section)
# ==============================================================================

# Temperature grid
T_values = np.linspace(1.0, 2.0, num=25)

def get_predictions(model, loader, device, temp=1.0):
    """Gathers predictions, confidences, and labels."""
    model.eval()
    all_confidences, all_correct = [], []
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)

            # Get logits from the base model
            logits = model(inputs)

            # Apply temperature
            scaled_logits = logits / temp
            probabilities = F.softmax(scaled_logits, dim=1)

            confidences, predicted = torch.max(probabilities, 1)
            correct = (predicted == labels).cpu().numpy()

            all_confidences.extend(confidences.cpu().numpy())
            all_correct.extend(correct)

    return np.array(all_confidences), np.array(all_correct)

def calculate_ece(confidences, correct, n_bins=15):
    """Calculates the Expected Calibration Error (ECE)."""
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers, bin_uppers = bin_boundaries[:-1], bin_boundaries[1:]
    ece = 0.0
    for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
        in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
        prop_in_bin = np.mean(in_bin)
        if prop_in_bin > 0:
            accuracy_in_bin = np.mean(correct[in_bin])
            avg_confidence_in_bin = np.mean(confidences[in_bin])
            ece += np.abs(avg_confidence_in_bin - accuracy_in_bin) * prop_in_bin
    return ece * 100

def plot_reliability_diagram(confidences, correct, n_bins, model_name, suffix):
    """Plots a reliability diagram and saves it to a file."""
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers, bin_uppers = bin_boundaries[:-1], bin_boundaries[1:]

    bin_accs = np.zeros(n_bins)
    bin_confs = np.zeros(n_bins)
    bin_props = np.zeros(n_bins)

    for i, (bin_lower, bin_upper) in enumerate(zip(bin_lowers, bin_uppers)):
        in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
        bin_props[i] = np.mean(in_bin)
        if bin_props[i] > 0:
            bin_accs[i] = np.mean(correct[in_bin])
            bin_confs[i] = np.mean(confidences[in_bin])

    # Plot
    plt.figure(figsize=(8, 7))
    bar_width = 1.0 / n_bins
    bar_centers = bin_lowers + bar_width / 2
    non_empty_mask = bin_props > 0

    # ECE bar plot
    plt.bar(bar_centers[non_empty_mask], bin_accs[non_empty_mask],
            width=bar_width * 0.9, alpha=0.3, color='red',
            edgecolor='red', label='Accuracy')
    # Gaps (where conf != acc)
    plt.bar(bar_centers[non_empty_mask], (bin_confs - bin_accs)[non_empty_mask],
            bottom=bin_accs[non_empty_mask],
            width=bar_width * 0.9, alpha=0.5, color='blue',
            edgecolor='black', label='Confidence Gap')

    # Perfect calibration line
    plt.plot([0, 1], [0, 1], 'k--', label='Perfect Calibration')

    plt.xlabel('Confidence')
    plt.ylabel('Accuracy')
    plt.title(f'Reliability Diagram: {model_name} ({suffix})')
    plt.legend()
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.grid(True, linestyle='--', alpha=0.6)

    save_folder = "/content/cars"
    os.makedirs(save_folder, exist_ok=True)
    filename = f"{save_folder}/{model_name.replace(' ', '_')}_reliability_{suffix}.png"
    plt.savefig(filename)
    plt.close()
    print(f"✅ Saved reliability diagram: {filename}")

def plot_ece_vs_temp(T_values, ece_values, best_T, model_name):
    """Plots ECE vs. Temperature and saves it to a file."""
    plt.figure(figsize=(8, 6))
    plt.plot(T_values, ece_values, 'o-', label='ECE (%)', color='royalblue')
    plt.axvline(best_T, color='red', linestyle='--',
                label=f'Min ECE at T={best_T:.2f}')
    plt.xlabel('Temperature (T)')
    plt.ylabel('Expected Calibration Error (%)')
    plt.title(f'ECE vs Temperature (Validation Set): {model_name}')
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.legend()
    plt.tight_layout()

    save_folder = "/content/cars"
    os.makedirs(save_folder, exist_ok=True)
    filename = f"{save_folder}/{model_name.replace(' ', '_')}_ECE_vs_T.png"
    plt.savefig(filename)
    plt.close()
    print(f"✅ Saved ECE-vs-T curve: {filename}")

def run_ece_calibration(model, model_name, n_bins=15):
    """Runs the full grid-search calibration pipeline."""
    print("\n" + "="*60)
    print(f"🔍 ECE Calibration for: {model_name}")
    print("="*60)

    model.to(device).eval()

    # --- Test metrics BEFORE scaling ---
    conf_b, corr_b = get_predictions(model, test_loader, device, temp=1.0)
    ece_b = calculate_ece(conf_b, corr_b, n_bins=n_bins)
    acc_b = np.mean(corr_b) * 100
    avg_conf_b = np.mean(conf_b) * 100

    print(f"\nTest metrics BEFORE scaling (T=1.0) → Acc: {acc_b:.2f}%, Avg Conf: {avg_conf_b:.2f}%, ECE: {ece_b:.3f}%")
    plot_reliability_diagram(conf_b, corr_b, n_bins, model_name, "before")

    # --- Temperature search on validation set ---
    best_ece_val = float('inf')
    best_T = None
    print("\nValidation set temperature search:")
    ece_values_val = []
    for T in T_values:
        conf, corr = get_predictions(model, val_loader, device, temp=T)
        ece = calculate_ece(conf, corr, n_bins=n_bins)
        ece_values_val.append(ece)
        print(f"T={T:.2f} → ECE={ece:.3f}%")
        if ece < best_ece_val:
            best_ece_val = ece
            best_T = T

    print(f"\n🎯 Best T on Validation Set (Min ECE): {best_T:.2f} → ECE={best_ece_val:.3f}%")
    plot_ece_vs_temp(T_values, ece_values_val, best_T, model_name)

    # --- Test metrics AFTER scaling ---
    conf_a, corr_a = get_predictions(model, test_loader, device, temp=best_T)
    ece_a = calculate_ece(conf_a, corr_a, n_bins=n_bins)
    acc_a = np.mean(corr_a) * 100
    avg_conf_a = np.mean(conf_a) * 100

    print(f"\nTest metrics AFTER scaling (T={best_T:.2f}) → Acc: {acc_a:.2f}%, Avg Conf: {avg_conf_a:.2f}%, ECE: {ece_a:.3f}%\n")
    plot_reliability_diagram(conf_a, corr_a, n_bins, model_name, "after")

    # --- Return results for final table ---
    return {
        "name": model_name,
        "acc": acc_a,
        "ece_before": ece_b,
        "ece_after": ece_a,
        "conf_before": avg_conf_b,
        "conf_after": avg_conf_a,
        "best_T": best_T
    }

# ==============================================================================
# 3. LOAD MODEL AND RUN CALIBRATION
# ==============================================================================
all_results = []
try:
    print("\nRe-creating and loading MobileNetV2 model...")
    base_model = models.mobilenet_v2(weights=None)  # weights=None: a local state_dict is loaded below

    # --- FIX: Check if num_classes was defined in the data section ---
    if 'num_classes' not in locals():
        print("❌ CRITICAL ERROR: num_classes was not defined. Cannot build model.")
        raise NameError("num_classes not defined")

    print(f"Building model for {num_classes} classes.")

    in_features = base_model.classifier[1].in_features
    base_model.classifier[1] = nn.Linear(in_features, num_classes)
    base_model.to(device)

    MODEL_PATH = '/content/Project/MobilenetV2_Cars/model_best.pth'
    base_model.load_state_dict(torch.load(MODEL_PATH, map_location=device))
    print("Model loaded successfully.")

    # --- Run the full calibration pipeline ---
    results = run_ece_calibration(base_model, "MobileNetV2")
    all_results.append(results)

except FileNotFoundError:
    print(f"❌ CRITICAL ERROR: Could not find model at '{MODEL_PATH}'")
except Exception as e:
    print(f"❌ An error occurred: {e}")

# ==============================================================================
# 4. FINAL REPORT 📊
# ==============================================================================
print("\n" + "="*130)
print("📊 Final Calibration Comparison on Stanford Cars Test Set")
print("="*130)

# Print header
print(f"{'Model':<22} | {'Accuracy':>10} | {'ECE (Before)':>13} | {'ECE (After)':>12} | {'Avg Conf (Before)':>18} | {'Avg Conf (After)':>17} | {'Optimal T':>10}")
print("-" * 130)

# Print results
if all_results:
    for r in all_results:
        print(f"{r['name']:<22} | {r['acc']:>9.2f}% | {r['ece_before']:>12.4f}% | {r['ece_after']:>11.4f}% | {r['conf_before']:>17.2f}% | {r['conf_after']:>16.2f}% | {r['best_T']:>10.4f}")
else:
    print("No results to display. Model loading or calibration may have failed.")
print("=" * 130)

==> Using device: cuda

Loading and splitting data...
Found 196 classes in the dataset.
Data successfully split:
  -> Validation samples: 4000
  -> Test samples:       4041

Re-creating and loading MobileNetV2 model...
Building model for 196 classes.
Model loaded successfully.

🔍 ECE Calibration for: MobileNetV2

Test metrics BEFORE scaling (T=1.0) → Acc: 44.35%, Avg Conf: 53.77%, ECE: 9.421%
✅ Saved reliability diagram: /content/cars/MobileNetV2_reliability_before.png

Validation set temperature search:
T=1.00 → ECE=9.759%
T=1.04 → ECE=8.005%
T=1.08 → ECE=6.334%
T=1.12 → ECE=4.670%
T=1.17 → ECE=3.102%
T=1.21 → ECE=2.386%
T=1.25 → ECE=2.494%
T=1.29 → ECE=2.498%
T=1.33 → ECE=3.828%
T=1.38 → ECE=5.068%
T=1.42 → ECE=6.419%
T=1.46 → ECE=7.592%
T=1.50 → ECE=8.840%
T=1.54 → ECE=10.017%
T=1.58 → ECE=11.120%
T=1.62 → ECE=12.290%
T=1.67 → ECE=13.319%
T=1.71 → ECE=14.378%
T=1.75 → ECE=15.419%
T=1.79 → ECE=16.422%
T=1.83 → ECE=17.398%
T=1.88 → EC

In [None]:
!zip -r /content/my_folder3.zip /content/cars/

  adding: content/cars/ (stored 0%)
  adding: content/cars/MobileNetV2_ECE_vs_T.png (deflated 9%)
  adding: content/cars/MobileNetV2_reliability_before.png (deflated 12%)
  adding: content/cars/MobileNetV2_reliability_after.png (deflated 12%)


# **Birds ~ InceptionV3**

In [6]:
print("Downloading birds400 dataset...")
!kaggle datasets download -d antoniozarauzmoreno/birds400

# Unzip the file (q = quiet)
print("Unzipping dataset...")
!unzip -q birds400.zip
print("Dataset downloaded and unzipped successfully.")

# List contents to confirm
!ls -l

Downloading birds400 dataset...
Dataset URL: https://www.kaggle.com/datasets/antoniozarauzmoreno/birds400
License(s): unknown
Downloading birds400.zip to /content
 98% 1.28G/1.30G [00:08<00:00, 293MB/s]
100% 1.30G/1.30G [00:08<00:00, 161MB/s]
Unzipping dataset...
Dataset downloaded and unzipped successfully.
total 1366148
drwxr-xr-x  5 root root       4096 Oct 24 15:04 birds400
-rw-r--r--  1 root root 1398909039 Apr 12  2022 birds400.zip
drwx------  5 root root       4096 Oct 24 14:53 drive
-rw-r--r--  1 root root         65 Oct 24 15:03 kaggle.json
drwxr-xr-x 11 root root       4096 Oct 24 14:44 Project
drwxr-xr-x  1 root root       4096 Oct 22 13:39 sample_data


In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models, datasets, transforms
from torch.utils.data import DataLoader, random_split
import numpy as np
import matplotlib.pyplot as plt
import os
from collections import OrderedDict

# ==============================================================================
# 1. SETUP & DATA LOADING
# ==============================================================================
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"==> Using device: {device}")

# --- Data Transforms ---
data_transforms = {
    'test': transforms.Compose([
        transforms.Resize(299),
        transforms.CenterCrop(299),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
}

# --- Load and Split Data ---
print("\nLoading and splitting data...")
BIRDS_TEST_PATH = '/content/birds400/test'

try:
    full_test_dataset = datasets.ImageFolder(BIRDS_TEST_PATH, data_transforms['test'])
    num_classes = len(full_test_dataset.classes)
    print(f"Found {num_classes} classes in the dataset.")

    val_size = 1000
    if len(full_test_dataset) <= val_size:
        print(f"❌ ERROR: Total test set ({len(full_test_dataset)}) is too small for validation size ({val_size}).")
        exit()
    test_size = len(full_test_dataset) - val_size

    val_dataset, test_dataset = random_split(full_test_dataset, [val_size, test_size],
                                             generator=torch.Generator().manual_seed(42))

    val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=2)
    test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, num_workers=2)

    print(f"Data successfully split:")
    print(f"  -> Validation samples: {len(val_dataset)}")
    print(f"  -> Test samples:       {len(test_dataset)}")

except FileNotFoundError:
    print(f"❌ ERROR: Data directory not found at: {BIRDS_TEST_PATH}")
    print("Please make sure you ran the data preparation cell first.")
except Exception as e:
    print(f"❌ An error occurred during data loading: {e}")

# ==============================================================================
# 2. CALIBRATION & PLOTTING FUNCTIONS
# ==============================================================================
T_values = np.linspace(0.5, 1.5, num=25)

def get_predictions(model, loader, device, temp=1.0):
    model.eval()
    all_confidences, all_correct = [], []
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            logits = model(inputs)
            if isinstance(logits, tuple): # Handle InceptionV3 tuple output
                logits = logits.logits
            scaled_logits = logits / temp
            probabilities = F.softmax(scaled_logits, dim=1)
            confidences, predicted = torch.max(probabilities, 1)
            correct = (predicted == labels).cpu().numpy()
            all_confidences.extend(confidences.cpu().numpy())
            all_correct.extend(correct)
    return np.array(all_confidences), np.array(all_correct)

def calculate_ece(confidences, correct, n_bins=15):
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers, bin_uppers = bin_boundaries[:-1], bin_boundaries[1:]
    ece = 0.0
    for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
        in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
        prop_in_bin = np.mean(in_bin)
        if prop_in_bin > 0:
            accuracy_in_bin = np.mean(correct[in_bin])
            avg_confidence_in_bin = np.mean(confidences[in_bin])
            ece += np.abs(avg_confidence_in_bin - accuracy_in_bin) * prop_in_bin
    return ece * 100

def plot_reliability_diagram(confidences, correct, n_bins, model_name, suffix):
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers, bin_uppers = bin_boundaries[:-1], bin_boundaries[1:]
    bin_accs, bin_confs, bin_props = np.zeros(n_bins), np.zeros(n_bins), np.zeros(n_bins)
    for i, (bin_lower, bin_upper) in enumerate(zip(bin_lowers, bin_uppers)):
        in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
        bin_props[i] = np.mean(in_bin)
        if bin_props[i] > 0:
            bin_accs[i] = np.mean(correct[in_bin])
            bin_confs[i] = np.mean(confidences[in_bin])
    plt.figure(figsize=(8, 7))
    bar_width = 1.0 / n_bins
    bar_centers = bin_lowers + bar_width / 2
    non_empty_mask = bin_props > 0
    plt.bar(bar_centers[non_empty_mask], bin_accs[non_empty_mask], width=bar_width*0.9, alpha=0.3, color='red', edgecolor='red', label='Accuracy')
    plt.bar(bar_centers[non_empty_mask], (bin_confs - bin_accs)[non_empty_mask], bottom=bin_accs[non_empty_mask], width=bar_width*0.9, alpha=0.5, color='blue', edgecolor='black', label='Confidence Gap')
    plt.plot([0, 1], [0, 1], 'k--', label='Perfect Calibration')
    plt.xlabel('Confidence'); plt.ylabel('Accuracy'); plt.title(f'Reliability Diagram: {model_name} ({suffix})')
    plt.legend(); plt.xlim(0, 1); plt.ylim(0, 1); plt.grid(True, linestyle='--', alpha=0.6)
    save_folder = "/content/birds"
    os.makedirs(save_folder, exist_ok=True)
    filename = f"{save_folder}/{model_name.replace(' ', '_')}_reliability_{suffix}.png"
    plt.savefig(filename); plt.close()
    print(f"✅ Saved reliability diagram: {filename}")

def plot_ece_vs_temp(T_values, ece_values, best_T, model_name):
    plt.figure(figsize=(8, 6))
    plt.plot(T_values, ece_values, 'o-', label='ECE (%)', color='royalblue')
    plt.axvline(best_T, color='red', linestyle='--', label=f'Min ECE at T={best_T:.2f}')
    plt.xlabel('Temperature (T)'); plt.ylabel('Expected Calibration Error (%)'); plt.title(f'ECE vs Temperature (Validation Set): {model_name}')
    plt.grid(True, linestyle='--', alpha=0.6); plt.legend(); plt.tight_layout()
    save_folder = "/content/birds"
    os.makedirs(save_folder, exist_ok=True)
    filename = f"{save_folder}/{model_name.replace(' ', '_')}_ECE_vs_T.png"
    plt.savefig(filename); plt.close()
    print(f"✅ Saved ECE-vs-T curve: {filename}")

def run_ece_calibration(model, model_name, n_bins=15):
    print("\n" + "="*60); print(f"🔍 ECE Calibration for: {model_name}"); print("="*60)
    model.to(device).eval()
    conf_b, corr_b = get_predictions(model, test_loader, device, temp=1.0)
    ece_b = calculate_ece(conf_b, corr_b, n_bins=n_bins)
    acc_b = np.mean(corr_b) * 100
    avg_conf_b = np.mean(conf_b) * 100
    print(f"\nTest metrics BEFORE scaling (T=1.0) → Acc: {acc_b:.2f}%, Avg Conf: {avg_conf_b:.2f}%, ECE: {ece_b:.3f}%")
    plot_reliability_diagram(conf_b, corr_b, n_bins, model_name, "before")
    best_ece_val, best_T, ece_values_val = float('inf'), None, []
    print("\nValidation set temperature search:")
    for T in T_values:
        conf, corr = get_predictions(model, val_loader, device, temp=T)
        ece = calculate_ece(conf, corr, n_bins=n_bins)
        ece_values_val.append(ece)
        print(f"T={T:.2f} → ECE={ece:.3f}%")
        if ece < best_ece_val:
            best_ece_val, best_T = ece, T
    print(f"\n🎯 Best T on Validation Set (Min ECE): {best_T:.2f} → ECE={best_ece_val:.3f}%")
    plot_ece_vs_temp(T_values, ece_values_val, best_T, model_name)
    conf_a, corr_a = get_predictions(model, test_loader, device, temp=best_T)
    ece_a = calculate_ece(conf_a, corr_a, n_bins=n_bins)
    acc_a = np.mean(corr_a) * 100
    avg_conf_a = np.mean(conf_a) * 100
    print(f"\nTest metrics AFTER scaling (T={best_T:.2f}) → Acc: {acc_a:.2f}%, Avg Conf: {avg_conf_a:.2f}%, ECE: {ece_a:.3f}%\n")
    plot_reliability_diagram(conf_a, corr_a, n_bins, model_name, "after")
    return {"name": model_name, "acc": acc_a, "ece_before": ece_b, "ece_after": ece_a,
            "conf_before": avg_conf_b, "conf_after": avg_conf_a, "best_T": best_T}

# ==============================================================================
# 3. LOAD MODEL AND RUN CALIBRATION
# ==============================================================================
all_results = []
try:
    if 'num_classes' not in locals():
        print("❌ CRITICAL ERROR: num_classes was not defined. Data loading likely failed.")
        raise NameError("num_classes not defined")

    model_name = "InceptionV3_Fold9"
    MODEL_PATH = "/content/Project/InceptionNetV3_Birds/inceptionv3_birds9.pth"

    print("\n" + "="*80); print(f"STARTING CALIBRATION FOR: {model_name} from {MODEL_PATH}"); print("="*80)

    # 1. Load the object from the .pth file
    import torchvision
    torch.serialization.add_safe_globals([torchvision.models.inception.Inception3])
    loaded_object = torch.load(MODEL_PATH, map_location=device, weights_only=False)

    # --- FIX: Check if loaded object is the model or a state dict ---
    if isinstance(loaded_object, nn.Module):
        print("Loaded object is a full model instance.")
        base_model = loaded_object
        # Ensure the final layer matches the dataset
        in_features = base_model.fc.in_features
        if base_model.fc.out_features != num_classes:
             print(f"Warning: Model's final layer ({base_model.fc.out_features} classes) doesn't match dataset ({num_classes} classes). Rebuilding final layer.")
             base_model.fc = nn.Linear(in_features, num_classes)
        base_model.to(device)
    elif isinstance(loaded_object, dict):
        print("Loaded object is a state dictionary or checkpoint.")
        # Create a new InceptionV3 instance first
        base_model = models.inception_v3(weights=None, aux_logits=False, init_weights=False)
        in_features = base_model.fc.in_features
        base_model.fc = nn.Linear(in_features, num_classes)
        base_model.to(device)

        # Extract state_dict if it's a checkpoint dict
        if 'state_dict' in loaded_object:
            state_dict = loaded_object['state_dict']
        else:
            state_dict = loaded_object # Assume it's just the state_dict

        # Handle potential 'module.' prefix from DataParallel
        new_sd = OrderedDict()
        for k, v in state_dict.items():
            new_sd[k.replace('module.', '')] = v

        base_model.load_state_dict(new_sd)
    else:
        raise TypeError(f"Loaded object is of unexpected type: {type(loaded_object)}")

    print("Model prepared successfully.")

    # 4. Run the full calibration pipeline
    results = run_ece_calibration(base_model, model_name)
    all_results.append(results)

except FileNotFoundError:
    print(f"❌ CRITICAL ERROR: Could not find weights at '{MODEL_PATH}'")
except Exception as e:
    print(f"❌ An error occurred during model loading or calibration: {e}")
    print("This could be a model/dataset class mismatch or a loading issue.")

# ==============================================================================
# 4. FINAL REPORT 📊
# ==============================================================================
print("\n" + "="*130)
print("📊 Final Calibration Comparison on Birds Dataset Test Set")
print("="*130)

# Print header
print(f"{'Model':<22} | {'Accuracy':>10} | {'ECE (Before)':>13} | {'ECE (After)':>12} | {'Avg Conf (Before)':>18} | {'Avg Conf (After)':>17} | {'Optimal T':>10}")
print("-" * 130)

# Print results
if all_results:
    for r in all_results:
        print(f"{r['name']:<22} | {r['acc']:>9.2f}% | {r['ece_before']:>12.4f}% | {r['ece_after']:>11.4f}% | {r['conf_before']:>17.2f}% | {r['conf_after']:>16.2f}% | {r['best_T']:>10.4f}")
else:
    print("No results to display. Model loading or calibration may have failed.")
print("=" * 130)

==> Using device: cuda

Loading and splitting data...
Found 400 classes in the dataset.
Data successfully split:
  -> Validation samples: 1000
  -> Test samples:       1000

STARTING CALIBRATION FOR: InceptionV3_Fold9 from /content/Project/InceptionNetV3_Birds/inceptionv3_birds9.pth
Loaded object is a full model instance.
Model prepared successfully.

🔍 ECE Calibration for: InceptionV3_Fold9

Test metrics BEFORE scaling (T=1.0) → Acc: 98.90%, Avg Conf: 98.65%, ECE: 0.503%
✅ Saved reliability diagram: /content/birds/InceptionV3_Fold9_reliability_before.png

Validation set temperature search:
T=0.50 → ECE=0.318%
T=0.54 → ECE=0.317%
T=0.58 → ECE=0.358%
T=0.62 → ECE=0.362%
T=0.67 → ECE=0.391%
T=0.71 → ECE=0.312%
T=0.75 → ECE=0.325%
T=0.79 → ECE=0.344%
T=0.83 → ECE=0.376%
T=0.88 → ECE=0.443%
T=0.92 → ECE=0.473%
T=0.96 → ECE=0.618%
T=1.00 → ECE=0.690%
T=1.04 → ECE=0.829%
T=1.08 → ECE=0.996%
T=1.12 → ECE=1.197%
T=1.17 → ECE=1.437%
T=1.21 → ECE=1.722%

In [6]:
!zip -r /content/birds_inceptionV3.zip /content/birds

  adding: content/birds/ (stored 0%)
  adding: content/birds/InceptionV3_Fold9_reliability_after.png (deflated 11%)
  adding: content/birds/InceptionV3_Fold9_reliability_before.png (deflated 11%)
  adding: content/birds/InceptionV3_Fold9_ECE_vs_T.png (deflated 11%)


## Why ECE Increased After Scaling

The ECE increased because the **optimal temperature (T)** found on the validation set was **less than 1** (T=0.71).

---

### Understanding Temperature Scaling

* **Standard Case (T > 1):** Most models are *over-confident*. Temperature scaling divides the logits by $T > 1$, which shrinks the differences between logits and softens (flattens) the softmax probabilities, making the model *less confident* and usually reducing ECE.
* **Our Case (T < 1):** Dividing logits by $T < 1$ actually **increases** the differences between logits. This "sharpens" the probabilities, making the model *more confident*.

This increase in confidence is visible in the results:
* **Avg Conf (Before):** 98.65%
* **Avg Conf (After):** 99.29% 📈
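
The effect of $T$ on confidence can be illustrated with a minimal sketch. The logits below are hypothetical values chosen for illustration and do not come from the InceptionV3 model; `softmax_with_temperature` is a helper name introduced here, not part of the notebook.

```python
import numpy as np

def softmax_with_temperature(logits, temp=1.0):
    """Softmax of logits / temp, computed in a numerically stable way."""
    scaled = np.asarray(logits, dtype=np.float64) / temp
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # avoid overflow in exp
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for one sample over four classes (illustration only).
example_logits = np.array([4.0, 2.0, 1.0, 0.5])

for temp in (0.71, 1.0, 1.5):
    probs = softmax_with_temperature(example_logits, temp)
    print(f"T={temp:.2f} -> top-1 confidence = {probs.max():.4f}, "
          f"predicted class = {probs.argmax()}")
```

The predicted class is the same at every temperature, because dividing by a positive constant preserves the ordering of the logits. Only the top-1 confidence moves: it rises for $T < 1$ and falls for $T > 1$. This is why accuracy is unchanged by scaling while average confidence shifts.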

---

### Why the Validation Set Suggested T < 1

Our original model was **already exceptionally well-calibrated**, perhaps even slightly *under-confident* on the validation set data.

* **Before Scaling (Validation, T=1.0):** ECE = 0.690%
* **Best T (Validation, T=0.71):** ECE decreased slightly to 0.312%

The grid search found that making the model slightly *more* confident (by using $T < 1$) achieved the absolute minimum ECE *specifically on that validation dataset*.

---

### Why ECE Increased on the Test Set

The minor improvement seen on the validation set by using $T=0.71$ **didn't generalize** perfectly to the separate test set.

* **Before Scaling (Test, T=1.0):** ECE was extremely low at **0.503%**. The confidence (98.65%) was very close to the accuracy (98.90%).
* **After Scaling (Test, T=0.71):** Applying $T=0.71$ made the model slightly *over-confident* on the test set (Avg Conf 99.29% > Accuracy 98.90%). This resulted in a small increase in the final test ECE to **0.849%**.
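
One likely contributor to this gap is sampling noise: with only 1000 samples and accuracy near 99%, the measured ECE fluctuates from split to split. The sketch below is a simulation under stated assumptions, not a re-analysis of the notebook's data. It draws synthetic confidences from a Beta distribution (an assumed shape concentrated near 1) and makes each prediction correct with probability equal to its confidence, so the simulated model is perfectly calibrated by construction.

```python
import numpy as np

def calculate_ece(confidences, correct, n_bins=15):
    """Expected Calibration Error in percent, using the notebook's binning."""
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for lower, upper in zip(bin_boundaries[:-1], bin_boundaries[1:]):
        in_bin = (confidences > lower) & (confidences <= upper)
        prop_in_bin = np.mean(in_bin)
        if prop_in_bin > 0:
            gap = np.abs(np.mean(confidences[in_bin]) - np.mean(correct[in_bin]))
            ece += gap * prop_in_bin
    return ece * 100

rng = np.random.default_rng(0)
n_samples, n_trials = 1000, 200  # 1000 matches the size of each birds split

ece_draws = []
for _ in range(n_trials):
    # Synthetic confidences concentrated near 1 (assumed Beta shape, illustration only).
    conf = rng.beta(50, 0.6, size=n_samples)
    # Perfect calibration by construction: P(correct) equals the stated confidence.
    correct = rng.random(n_samples) < conf
    ece_draws.append(calculate_ece(conf, correct))

ece_draws = np.array(ece_draws)
print(f"Measured ECE of a perfectly calibrated model ({n_trials} draws of {n_samples} samples):")
print(f"  mean = {ece_draws.mean():.3f}%, 5th to 95th percentile = "
      f"{np.percentile(ece_draws, 5):.3f}% to {np.percentile(ece_draws, 95):.3f}%")
```

Even a perfectly calibrated model shows a non-zero measured ECE on a finite sample, so the spread printed here gives a rough yardstick for how much of a sub-percent ECE difference between two 1000-image splits could be noise.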

---

### Conclusion ✅

Our original model was already remarkably well-calibrated (ECE ≈ 0.5%). Temperature scaling tried a minor adjustment based on the validation set (making the model slightly more confident), but this wasn't beneficial for the test set.

In this rare situation, the **unscaled model (T=1.0) actually had slightly better calibration** on the final test data. An ECE difference this small (0.5% vs 0.85%) is often considered negligible in practice. 👍

# **Extra Work: Evaluating the Effect of Label Smoothing**

In [7]:
# -----------------------------
# ECE CALIBRATION via LABEL SMOOTHING (LS) FINE-TUNING
# -----------------------------
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split
from collections import OrderedDict
import matplotlib.pyplot as plt
import os
import torch.optim as optim

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Data Transforms ---
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# --- Load and Split Data (NEW: Need Training Set) ---
print("\nLoading CIFAR-10 data for fine-tuning...")
try:
    # 1. Load the 50,000-image training set
    train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_train)
    train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=2)

    # 2. Load the 10,000-image test set for final evaluation
    test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform_test)
    test_loader = DataLoader(test_dataset, batch_size=100, shuffle=False, num_workers=2)

    print(f"Data loaded:")
    print(f"  -> Training samples: {len(train_dataset)}")
    print(f"  -> Test samples:     {len(test_dataset)}")

except Exception as e:
    print(f"❌ ERROR: Could not load CIFAR-10 data. {e}")
    exit()

# --- Helper Functions (Re-used) ---

def get_predictions(model, loader, device, temp=1.0):
    model.eval()
    all_conf, all_corr = [], []
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x) / temp
            probs = F.softmax(logits, dim=1)
            conf, pred = torch.max(probs, 1)
            all_conf.extend(conf.cpu().numpy())
            all_corr.extend((pred == y).cpu().numpy())
    return np.array(all_conf), np.array(all_corr)

def calculate_ece(confidences, correct, n_bins=15):
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers, bin_uppers = bin_boundaries[:-1], bin_boundaries[1:]
    ece = 0.0
    for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
        in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
        prop_in_bin = np.mean(in_bin)
        if prop_in_bin > 0:
            accuracy_in_bin = np.mean(correct[in_bin])
            avg_confidence_in_bin = np.mean(confidences[in_bin])
            ece += np.abs(avg_confidence_in_bin - accuracy_in_bin) * prop_in_bin
    return ece * 100

def plot_reliability_diagram(confidences, correct, n_bins, model_name, suffix):
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers, bin_uppers = bin_boundaries[:-1], bin_boundaries[1:]
    bin_accs, bin_confs, bin_props = np.zeros(n_bins), np.zeros(n_bins), np.zeros(n_bins)

    for i, (bin_lower, bin_upper) in enumerate(zip(bin_lowers, bin_uppers)):
        in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
        bin_props[i] = np.mean(in_bin)
        if bin_props[i] > 0:
            bin_accs[i] = np.mean(correct[in_bin])
            bin_confs[i] = np.mean(confidences[in_bin])

    plt.figure(figsize=(8, 7))
    bar_width, bar_centers = 1.0 / n_bins, bin_lowers + 1.0 / (2 * n_bins)
    non_empty = bin_props > 0
    plt.bar(bar_centers[non_empty], bin_accs[non_empty], width=bar_width*0.9, alpha=0.3, color='red', edgecolor='red', label='Accuracy')
    plt.bar(bar_centers[non_empty], (bin_confs - bin_accs)[non_empty], bottom=bin_accs[non_empty], width=bar_width*0.9, alpha=0.5, color='blue', edgecolor='black', label='Confidence Gap')
    plt.plot([0, 1], [0, 1], 'k--', label='Perfect Calibration')
    plt.xlabel('Confidence'); plt.ylabel('Accuracy'); plt.title(f'Reliability Diagram: {model_name} ({suffix})')
    plt.legend(); plt.xlim(0, 1); plt.ylim(0, 1); plt.grid(True, linestyle='--', alpha=0.6)

    # Save to a new folder
    save_folder = "/content/cifar-10-ls"
    os.makedirs(save_folder, exist_ok=True)
    filename = f"{save_folder}/{model_name.replace(' ', '_')}_C10_reliability_{suffix}.png"
    plt.savefig(filename)
    plt.close()
    print(f"✅ Saved reliability diagram: {filename}")

def load_checkpoint(model, path, device):
    print(f"Loading checkpoint: {path}")
    ckpt = torch.load(path, map_location=device, weights_only=False)
    state_dict = ckpt.get('state_dict', ckpt)
    new_sd = OrderedDict((k.replace('module.', ''), v) for k, v in state_dict.items())
    model.load_state_dict(new_sd)
    return model

# --- NEW: Fine-Tuning Function ---
def finetune_with_ls(model, loader, epochs=10, lr=1e-5, smoothing=0.1):
    print(f"\n--- Starting fine-tuning with Label Smoothing (α={smoothing}) for {epochs} epochs ---")
    model.to(device).train()

    # Define loss function with label smoothing
    criterion = nn.CrossEntropyLoss(label_smoothing=smoothing)

    # Use an optimizer with a very small learning rate for fine-tuning
    optimizer = optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        running_loss = 0.0
        for i, (inputs, labels) in enumerate(loader):
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs} | Loss: {running_loss / len(loader):.4f}")

    print("--- Fine-tuning complete ---")
    return model

# --- NEW: Evaluation Function for LS-Tuned Models ---
def evaluate_ls_model(model, model_name, n_bins=15):
    print(f"\n--- Evaluating LS fine-tuned model: {model_name} ---")
    model.to(device).eval()

    # Evaluate with T=1.0 (no post-hoc scaling)
    conf, corr = get_predictions(model, test_loader, device, temp=1.0)
    ece = calculate_ece(conf, corr, n_bins=n_bins)
    acc = np.mean(corr) * 100
    avg_conf = np.mean(conf) * 100

    print(f"Test metrics → Acc: {acc:.2f}%, Avg Conf: {avg_conf:.2f}%, ECE: {ece:.3f}%")

    # Plot reliability diagram, save with "ls" suffix
    plot_reliability_diagram(conf, corr, n_bins, model_name, "ls")

    return {"name": model_name, "acc": acc, "ece_ls": ece, "conf_ls": avg_conf}


# =======================================================
# --- RUN FINE-TUNING & EVALUATION FOR LS MODELS ---
# =======================================================
all_ls_results = []

# --- 1. ResNet-164 ---
try:
    from models.cifar import resnet
    print("\n" + "="*60)
    print("Running LS fine-tuning for local ResNet-164")
    print("="*60)
    model_resnet164 = resnet(depth=164, num_classes=10, block_name='Bottleneck')
    # Note: despite the folder name, this checkpoint is loaded into the depth-164 Bottleneck model
    path = '/content/Project/resnet110cifar10/model_best.pth.tar'
    model_resnet164 = load_checkpoint(model_resnet164, path, device)

    # Fine-tune the model
    model_resnet164_ls = finetune_with_ls(model_resnet164, train_loader, epochs=10)

    # Evaluate the fine-tuned model
    results = evaluate_ls_model(model_resnet164_ls, "ResNet-164 (LS Tuned)")
    all_ls_results.append(results)

except Exception as e:
    print(f"❌ Skipping local ResNet-164: {e}")

# --- 2. ResNet-56 from torch.hub ---
try:
    print("\n" + "="*60)
    print("Running LS fine-tuning for ResNet-56 (torch.hub)")
    print("="*60)
    # Load a fresh copy
    model_hub = torch.hub.load("chenyaofo/pytorch-cifar-models", "cifar10_resnet56", pretrained=True, trust_repo=True)

    # Fine-tune the model
    model_hub_ls = finetune_with_ls(model_hub, train_loader, epochs=10)

    # Evaluate the fine-tuned model
    results = evaluate_ls_model(model_hub_ls, "ResNet-56 (Hub, LS Tuned)")
    all_ls_results.append(results)

except Exception as e:
    print(f"❌ Skipping ResNet-56 (Hub): {e}")

# --- FINAL COMPARISON TABLE (for LS Models) ---
if all_ls_results:
    print("\n" + "="*80)
    print("📊 Final Results for Label Smoothing (LS) Fine-Tuning")
    print("="*80)
    print(f"{'Model':<28} | {'Accuracy':>10} | {'ECE (LS Tuned)':>15} | {'Avg Conf (LS)':>15}")
    print("-" * 80)
    for r in all_ls_results:
        print(f"{r['name']:<28} | {r['acc']:>9.2f}% | {r['ece_ls']:>14.4f}% | {r['conf_ls']:>14.2f}%")
    print("=" * 80)
else:
    print("\nNo models were successfully fine-tuned with Label Smoothing.")

Using device: cuda

Loading CIFAR-10 data for fine-tuning...


100%|██████████| 170M/170M [00:13<00:00, 12.3MB/s]


Data loaded:
  -> Training samples: 50000
  -> Test samples:     10000

Running LS fine-tuning for local ResNet-164
Loading checkpoint: /content/Project/resnet110cifar10/model_best.pth.tar

--- Starting fine-tuning with Label Smoothing (α=0.1) for 10 epochs ---
Epoch 1/10 | Loss: 1.2979
Epoch 2/10 | Loss: 1.0464
Epoch 3/10 | Loss: 0.9343
Epoch 4/10 | Loss: 0.8599
Epoch 5/10 | Loss: 0.8070
Epoch 6/10 | Loss: 0.7665
Epoch 7/10 | Loss: 0.7369
Epoch 8/10 | Loss: 0.7138
Epoch 9/10 | Loss: 0.6960
Epoch 10/10 | Loss: 0.6839
--- Fine-tuning complete ---

--- Evaluating LS fine-tuned model: ResNet-164 (LS Tuned) ---
Test metrics → Acc: 91.40%, Avg Conf: 85.75%, ECE: 5.649%
✅ Saved reliability diagram: /content/cifar-10-ls/ResNet-164_(LS_Tuned)_C10_reliability_ls.png

Running LS fine-tuning for ResNet-56 (torch.hub)

--- Starting fine-tuning with Label Smoothing (α=0.1) for 10 epochs ---


Using cache found in /root/.cache/torch/hub/chenyaofo_pytorch-cifar-models_master


Epoch 1/10 | Loss: 0.6700
Epoch 2/10 | Loss: 0.5498
Epoch 3/10 | Loss: 0.5356
Epoch 4/10 | Loss: 0.5279
Epoch 5/10 | Loss: 0.5243
Epoch 6/10 | Loss: 0.5216
Epoch 7/10 | Loss: 0.5194
Epoch 8/10 | Loss: 0.5182
Epoch 9/10 | Loss: 0.5167
Epoch 10/10 | Loss: 0.5157
--- Fine-tuning complete ---

--- Evaluating LS fine-tuned model: ResNet-56 (Hub, LS Tuned) ---
Test metrics → Acc: 94.29%, Avg Conf: 89.55%, ECE: 4.825%
✅ Saved reliability diagram: /content/cifar-10-ls/ResNet-56_(Hub,_LS_Tuned)_C10_reliability_ls.png

📊 Final Results for Label Smoothing (LS) Fine-Tuning
Model                        |   Accuracy |  ECE (LS Tuned) |   Avg Conf (LS)
--------------------------------------------------------------------------------
ResNet-164 (LS Tuned)        |     91.40% |         5.6494% |          85.75%
ResNet-56 (Hub, LS Tuned)    |     94.29% |         4.8248% |          89.55%


In [8]:
!zip -r /content/cifar-10-ls.zip /content/cifar-10-ls/

  adding: content/cifar-10-ls/ (stored 0%)
  adding: content/cifar-10-ls/ResNet-56_(Hub,_LS_Tuned)_C10_reliability_ls.png (deflated 12%)
  adding: content/cifar-10-ls/ResNet-164_(LS_Tuned)_C10_reliability_ls.png (deflated 11%)


This is a common result, and it reflects a well-known finding in calibration research.

Here's the direct answer: our fine-tuning process reduced the model's accuracy and, at the same time, worsened its calibration.

The original model combined high accuracy (93.20%) with a low ECE (3.75%). Our new "LS-Tuned" model is less accurate (91.40%) and has a higher ECE (5.65%).

Let's break down why this happened.

## 1. The "Smoking Gun": Accuracy Drop

Look at the accuracy for our ResNet-164:

* **Before (Original Model):** 93.20%
* **After (LS Fine-tune):** 91.40%

We lost **1.80 percentage points of accuracy**. This is a major red flag. 🚩

Our fine-tuning process (10 epochs with a new loss function) partially "broke" the model's highly optimized weights. ECE measures the gap between confidence and accuracy, so what matters is how the two moved relative to each other. After fine-tuning, the average confidence (85.75%) sits about 5.65 points below the accuracy (91.40%): the model is now under-confident, and its confidences no longer match its (new, lower) correctness.
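
To make the link between ECE, accuracy, and average confidence concrete, here is a minimal, dependency-free sketch. It mirrors the `(lower, upper]` binning of `calculate_ece` on synthetic, made-up predictions and checks that ECE can never be smaller than |accuracy − average confidence| when every sample falls into some bin. The function name `ece_percent` and the synthetic data are assumptions for illustration only.

```python
import random

def ece_percent(confidences, correct, n_bins=15):
    """Expected Calibration Error in percent, using (lower, upper] bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lower, upper = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lower < c <= upper]
        if not idx:
            continue  # empty bins contribute nothing
        acc_bin = sum(correct[i] for i in idx) / len(idx)
        conf_bin = sum(confidences[i] for i in idx) / len(idx)
        ece += abs(conf_bin - acc_bin) * len(idx) / n
    return ece * 100

# Synthetic, made-up predictions: an under-confident model whose reported
# confidence is 5 points below its true probability of being correct.
rng = random.Random(0)
confidences, correct = [], []
for _ in range(5000):
    p_true = rng.uniform(0.6, 1.0)
    confidences.append(p_true - 0.05)
    correct.append(1 if rng.random() < p_true else 0)

ece = ece_percent(confidences, correct)
accuracy = 100 * sum(correct) / len(correct)
avg_conf = 100 * sum(confidences) / len(confidences)
gap = abs(accuracy - avg_conf)
print(f"ECE: {ece:.2f}%  |acc - avg conf|: {gap:.2f}%")
```

The same relationship shows up in the results table above: for ResNet-164 the gap between accuracy and average confidence is about 5.65 points against an ECE of 5.649%, and for ResNet-56 it is about 4.74 points against an ECE of 4.825%.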

## 2. A "Cure" (TS) vs. A "Prevention" (LS)

We are comparing two fundamentally different methods:

* **Temperature Scaling (TS):** This is a *post-hoc* or *curative* method. It takes our finished, static, high-performing model (93.20% acc) and fits a single temperature so that its confidences better match its accuracy. It is **non-destructive**: dividing the logits by a positive temperature never changes which class has the highest score, so it *cannot* hurt the model's accuracy.

* **Label Smoothing (LS):** This is an *in-training* or *preventative* method. It's designed to be used *from scratch* (or for the entire training process) to *prevent* the model from becoming over-confident in the first place.

We used LS as a fine-tuning method, which is an ad-hoc, **destructive** process (i.e., it changes the model's weights). This is risky, and in our case, it damaged the model's performance.
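
The "non-destructive" property of Temperature Scaling can be checked directly. The sketch below uses made-up logits for a single sample and a small illustrative `softmax` helper (both assumptions, not code from the experiments above) to show that the predicted class stays the same for every positive temperature while the confidence changes.

```python
import math

def softmax(logits, temp=1.0):
    """Softmax of logits / temp, computed stably by subtracting the max."""
    scaled = [z / temp for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.5, 0.2, -1.0]  # made-up logits for one sample

for temp in (0.5, 1.0, 2.0):
    probs = softmax(logits, temp)
    pred = probs.index(max(probs))
    print(f"T={temp}: predicted class {pred}, confidence {max(probs):.3f}")
```

A temperature above 1 softens the distribution and lowers the confidence, while a temperature below 1 sharpens it. In both cases the ranking of the classes, and therefore the accuracy, is untouched.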

## 3. Why the Fine-Tuning Failed

The goal of our original model was to get the highest possible accuracy on the clean test set. It achieved 93.20%.

The goal of our 10-epoch fine-tuning was to minimize a *new* loss (Label-Smoothed Cross-Entropy) on the *augmented* training set (`RandomCrop`, `RandomHorizontalFlip`).

This new, short training process:

* **Hurt Generalization:** It made the model slightly better at the augmented training data but *worse* at the clean test data (hence the accuracy drop).

* **Didn't Run Long Enough:** 10 epochs isn't enough to properly re-settle the model. The model is now in a sub-optimal, "confused" state: less accurate than before, but (due to LS) also less confident. This new combination is *less* calibrated than the original.
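
To see why Label Smoothing pulls confidence down, it helps to look at the target it trains towards. With smoothing α and K classes, `nn.CrossEntropyLoss(label_smoothing=α)` mixes the one-hot target with a uniform distribution, so the true class gets a target probability of 1 − α + α/K. The sketch below is a plain-Python illustration with hypothetical helper names. It computes that target and the smallest achievable loss (the entropy of the smoothed target) for α = 0.1 and K = 10, the settings used above.

```python
import math

def smoothed_target(num_classes, true_class, smoothing):
    """Mix a one-hot target with a uniform distribution over all classes."""
    uniform = smoothing / num_classes
    target = [uniform] * num_classes
    target[true_class] += 1.0 - smoothing
    return target

def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

target = smoothed_target(num_classes=10, true_class=3, smoothing=0.1)

# The loss is minimized when the prediction equals the target, so the floor
# is the entropy of the smoothed target itself.
loss_floor = cross_entropy(target, target)
print(f"true-class target: {target[3]:.2f}, loss floor: {loss_floor:.4f}")
```

With these settings the true-class target is 0.91 and the loss floor is roughly 0.50. This is why the fine-tuning losses logged above level off near 0.5 instead of approaching zero, and why a model trained with this loss is pushed away from near-100% confidence.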

## The Key Takeaway 💡

This is actually a fantastic result! We've successfully demonstrated *why* Temperature Scaling is so effective and popular.

* Our "before" model was already great: high accuracy (93.20%) and decent ECE (3.75%).
* Our Temperature Scaling experiment took that great model and reduced its ECE to **0.95%** with **no change in accuracy** and no re-training.
* Our Label Smoothing experiment tried to "fix" the model by re-training and *made it worse* (91.40% acc, 5.65% ECE).

This shows that for a pre-trained model, a simple, non-destructive, post-hoc method like Temperature Scaling is almost always the superior, safer, and faster choice.

In [2]:
# -----------------------------
# TRAINING ResNet-56 w/ LABEL SMOOTHING (LS)
# -----------------------------
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split
from collections import OrderedDict
import matplotlib.pyplot as plt
import os
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Data Transforms ---
print("Loading CIFAR-10 data...")
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# --- Data Loaders ---
try:
    train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_train)
    train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=2)

    test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform_test)
    test_loader = DataLoader(test_dataset, batch_size=100, shuffle=False, num_workers=2)
    print(f"Data loaded: {len(train_dataset)} train, {len(test_dataset)} test samples.")
except Exception as e:
    print(f"❌ ERROR: Could not load CIFAR-10 data. {e}")
    raise  # re-raise so execution stops here with the original traceback

# --- Helper Functions (Evaluation) ---
def get_predictions(model, loader, device, temp=1.0):
    model.eval()
    all_conf, all_corr = [], []
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x) / temp
            probs = F.softmax(logits, dim=1)
            conf, pred = torch.max(probs, 1)
            all_conf.extend(conf.cpu().numpy())
            all_corr.extend((pred == y).cpu().numpy())
    return np.array(all_conf), np.array(all_corr)

def calculate_ece(confidences, correct, n_bins=15):
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers, bin_uppers = bin_boundaries[:-1], bin_boundaries[1:]
    ece = 0.0
    for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
        in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
        prop_in_bin = np.mean(in_bin)
        if prop_in_bin > 0:
            accuracy_in_bin = np.mean(correct[in_bin])
            avg_confidence_in_bin = np.mean(confidences[in_bin])
            ece += np.abs(avg_confidence_in_bin - accuracy_in_bin) * prop_in_bin
    return ece * 100

def plot_reliability_diagram(confidences, correct, n_bins, model_name, suffix):
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_lowers, bin_uppers = bin_boundaries[:-1], bin_boundaries[1:]
    bin_accs, bin_confs, bin_props = np.zeros(n_bins), np.zeros(n_bins), np.zeros(n_bins)
    for i, (bin_lower, bin_upper) in enumerate(zip(bin_lowers, bin_uppers)):
        in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
        bin_props[i] = np.mean(in_bin)
        if bin_props[i] > 0:
            bin_accs[i] = np.mean(correct[in_bin])
            bin_confs[i] = np.mean(confidences[in_bin])
    plt.figure(figsize=(8, 7))
    bar_width, bar_centers = 1.0 / n_bins, bin_lowers + 1.0 / (2 * n_bins)
    non_empty = bin_props > 0
    plt.bar(bar_centers[non_empty], bin_accs[non_empty], width=bar_width*0.9, alpha=0.3, color='red', edgecolor='red', label='Accuracy')
    plt.bar(bar_centers[non_empty], (bin_confs - bin_accs)[non_empty], bottom=bin_accs[non_empty], width=bar_width*0.9, alpha=0.5, color='blue', edgecolor='black', label='Confidence Gap')
    plt.plot([0, 1], [0, 1], 'k--', label='Perfect Calibration')
    plt.xlabel('Confidence'); plt.ylabel('Accuracy'); plt.title(f'Reliability Diagram: {model_name} ({suffix})')
    plt.legend(); plt.xlim(0, 1); plt.ylim(0, 1); plt.grid(True, linestyle='--', alpha=0.6)

    save_folder = "/content/cifar-10-ls-scratch" # New folder for this experiment
    os.makedirs(save_folder, exist_ok=True)
    filename = f"{save_folder}/{model_name.replace(' ', '_')}_C10_reliability_{suffix}.png"
    plt.savefig(filename)
    plt.close()
    print(f"✅ Saved reliability diagram: {filename}")

# --- NEW: Training Function ---
def train_with_ls(model, train_loader, test_loader, epochs=50, lr=0.1, smoothing=0.1):
    print(f"\n--- Starting training from scratch with Label Smoothing (α={smoothing}) ---")
    model.to(device)

    # Loss function with label smoothing
    criterion = nn.CrossEntropyLoss(label_smoothing=smoothing)

    # Optimizer (SGD with momentum is standard for ResNets on CIFAR)
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)

    # Learning rate scheduler
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

    best_acc = 0.0

    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        for i, (inputs, labels) in enumerate(train_loader):
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        # Validation accuracy check
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for inputs, labels in test_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        acc = 100 * correct / total
        if acc > best_acc:
            best_acc = acc
            # You could save the best model checkpoint here

        print(f"Epoch {epoch+1:02}/{epochs} | Loss: {running_loss / len(train_loader):.4f} | Test Acc: {acc:.2f}% (Best: {best_acc:.2f}%)")

        scheduler.step()

    print(f"--- Training complete. Best Test Accuracy: {best_acc:.2f}% ---")
    return model # Return the model from the final epoch

# --- NEW: Evaluation Function ---
def evaluate_model(model, model_name, n_bins=15):
    print(f"\n--- Evaluating model: {model_name} (at T=1.0) ---")
    model.to(device).eval()

    # Evaluate with T=1.0 (no post-hoc scaling)
    conf, corr = get_predictions(model, test_loader, device, temp=1.0)
    ece = calculate_ece(conf, corr, n_bins=n_bins)
    acc = np.mean(corr) * 100
    avg_conf = np.mean(conf) * 100

    print(f"Test metrics → Acc: {acc:.2f}%, Avg Conf: {avg_conf:.2f}%, ECE: {ece:.3f}%")
    plot_reliability_diagram(conf, corr, n_bins, model_name, "ls_scratch")

    return {"name": model_name, "acc": acc, "ece": ece, "conf": avg_conf}


# =======================================================
# --- RUN TRAINING & EVALUATION ---
# =======================================================
all_results = []
try:
    print("\n" + "="*60)
    print("Running training from scratch for ResNet-56 (w/ LS)")
    print("="*60)

    # 1. Create a new, from-scratch model
    model_resnet56_ls = torch.hub.load(
        "chenyaofo/pytorch-cifar-models",
        "cifar10_resnet56",
        pretrained=False,  # This is the key change
        trust_repo=True
    )

    # 2. Train it
    # Pass the new model to the training function
    trained_model = train_with_ls(model_resnet56_ls, train_loader, test_loader, epochs=50)

    # 3. Evaluate the trained model
    # Change the name for clear labeling in plots and tables
    results = evaluate_model(trained_model, "ResNet-56 (LS Scratch)")
    all_results.append(results)

except ImportError:
    print("❌ ERROR: Could not import the model code. Make sure the torch.hub repository is available.")
except Exception as e:
    print(f"❌ An error occurred during training or evaluation: {e}")

# --- FINAL TABLE ---
if all_results:
    print("\n" + "="*80)
    print("📊 Final Results for Training from Scratch with Label Smoothing")
    print("="*80)
    print(f"{'Model':<28} | {'Accuracy':>10} | {'ECE':>15} | {'Avg Conf':>15}")
    print("-" * 80)
    for r in all_results:
        print(f"{r['name']:<28} | {r['acc']:>9.2f}% | {r['ece']:>14.4f}% | {r['conf']:>14.2f}%")
    print("=" * 80)
else:
    print("\nNo models were successfully trained.")

Using device: cuda
Loading CIFAR-10 data...
Data loaded: 50000 train, 10000 test samples.

Running training from scratch for ResNet-56 (w/ LS)
Downloading: "https://github.com/chenyaofo/pytorch-cifar-models/zipball/master" to /root/.cache/torch/hub/master.zip

--- Starting training from scratch with Label Smoothing (α=0.1) ---
Epoch 01/50 | Loss: 1.9505 | Test Acc: 42.20% (Best: 42.20%)
Epoch 02/50 | Loss: 1.5240 | Test Acc: 54.10% (Best: 54.10%)
Epoch 03/50 | Loss: 1.2734 | Test Acc: 63.68% (Best: 63.68%)
Epoch 04/50 | Loss: 1.1478 | Test Acc: 68.81% (Best: 68.81%)
Epoch 05/50 | Loss: 1.0710 | Test Acc: 74.66% (Best: 74.66%)
Epoch 06/50 | Loss: 1.0288 | Test Acc: 70.51% (Best: 74.66%)
Epoch 07/50 | Loss: 0.9946 | Test Acc: 71.01% (Best: 74.66%)
Epoch 08/50 | Loss: 0.9750 | Test Acc: 74.94% (Best: 74.94%)
Epoch 09/50 | Loss: 0.9525 | Test Acc: 61.02% (Best: 74.94%)
Epoch 10/50 | Loss: 0.9353 | Test Acc: 76.77% (Best: 76.77%)
Epoch 11/50 | Loss: 0.9263 | Test Acc: 78.62% (Best: 78.62%)

## New Model Has Higher ECE and Lower Accuracy

The ECE from our "LS Scratch" training (6.05%) is significantly worse than our original, uncalibrated model's ECE (3.96%).

At the same time, **our new model (92.90% acc) is less accurate than our original model (94.12% acc).**

**ECE measures the gap between accuracy and confidence**, and here the two have drifted apart: the new model's average confidence (88.57%) sits well below its accuracy (92.90%), so it is under-confident.

---

## Why the New Training Run Performed Worse

This isn't a failure of Label Smoothing (LS) as a method. It's a failure of this specific training run to match the performance of the original, expertly-tuned model.

1.  **Original Model (94.12% Acc):** The `ResNet-56 (Hub)` model was likely trained for 200-300+ epochs with a carefully tuned learning rate schedule. It represents a "best-case" scenario for accuracy.
2.  **Our New LS Model (92.90% Acc):** Our script trained a new model from scratch for only **50 epochs**. This is likely not enough time to reach peak performance, especially when using a regularizer like Label Smoothing, which can require *more* training epochs, not fewer.

Our new model is 1.22 percentage points less accurate, and its average confidence (88.57%) does not match its new, lower accuracy, leading to a high ECE.

---

## What This Experiment Proves 💡

This result is valuable because it further supports the conclusion of our first experiment:

**For a high-performing, pre-trained model, Temperature Scaling is the clear winner.**

* **Temperature Scaling (TS):** We took our **best** model (94.12% acc) and reduced its ECE to 1.21% in just a few seconds. This was fast, safe, and left accuracy unchanged.
* **Label Smoothing (LS) Re-training:** We attempted to re-train a model to achieve the same goal. This was slow, difficult, and ultimately resulted in a *worse* model (92.90% acc) with *worse* calibration (ECE: 6.05%).
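
To illustrate why fitting a temperature takes only seconds, here is a minimal sketch of one possible fitting procedure: pick the temperature that minimizes the negative log-likelihood on held-out logits. The grid search, the helper names `nll` and `fit_temperature`, and the synthetic "too sharp" logits are all assumptions for illustration, not the fitting code used in the experiments above.

```python
import math
import random

def nll(logits_list, labels, temp):
    """Average negative log-likelihood of the labels under softmax(logits / temp)."""
    total = 0.0
    for logits, y in zip(logits_list, labels):
        scaled = [z / temp for z in logits]
        m = max(scaled)
        log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_sum - scaled[y]
    return total / len(labels)

def fit_temperature(logits_list, labels, grid=None):
    """Return the temperature on the grid with the lowest held-out NLL."""
    grid = grid or [0.5 + 0.05 * i for i in range(71)]  # 0.5 to 4.0
    return min(grid, key=lambda t: nll(logits_list, labels, t))

# Made-up held-out set: labels are sampled from softmax(base), but the
# "model" reports logits that are twice as sharp, i.e. over-confident.
rng = random.Random(0)
logits_list, labels = [], []
for _ in range(2000):
    base = [rng.gauss(0, 1) for _ in range(10)]
    m = max(base)
    weights = [math.exp(z - m) for z in base]
    labels.append(rng.choices(range(10), weights=weights)[0])
    logits_list.append([2.0 * z for z in base])

best_t = fit_temperature(logits_list, labels)
print(f"fitted temperature: {best_t:.2f}")
print(f"NLL at T=1: {nll(logits_list, labels, 1.0):.4f}, "
      f"at fitted T: {nll(logits_list, labels, best_t):.4f}")
```

Only one scalar is fitted and the network weights are never touched, so the whole procedure needs a single pass over cached logits per candidate temperature. In practice the temperature should be fitted on a held-out validation split rather than on the test set used to report ECE.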