This notebook performs some analysis on several CLIP models fine-tuned on TinyImageNet.

## Setup Environment

In [1]:
LOCAL = True

# if run locally:
if LOCAL:
    DATA_DIR = "../dataset"
    CODE_DIR = "../"
# on Colab
else:
    DATA_DIR = "/content"
    CODE_DIR = "./clip_TinyImageNet"

By default, we work in the same directory as this notebook. The cell below adds the code directory to `sys.path` so that its modules can be imported.

In [2]:
import os, sys
ROOT = os.path.abspath(CODE_DIR)
if ROOT not in sys.path:
    sys.path.insert(0, ROOT)

If you want to use Claude Code, uncomment the cell below.

In [3]:
# !npm install -g @anthropic-ai/claude-code

If you use Colab, mount Google Drive so that output results can be saved there.

In [4]:
if not LOCAL:
    from google.colab import drive
    drive.mount('/content/drive')
    storage_dir = "drive/MyDrive/Colab Notebooks/"

To fetch the code for fine-tuning CLIP on TinyImageNet, run:

In [5]:
if not LOCAL:
    !git clone https://github.com/nbzy1995/clip_TinyImageNet.git

Now we download the TinyImageNet dataset. The cell below creates a directory called "tiny-imagenet-200" containing the dataset.

In [6]:
if not LOCAL:
    !wget -q http://cs231n.stanford.edu/tiny-imagenet-200.zip
    !unzip -q tiny-imagenet-200.zip

We now copy the pre-computed index file that splits the train/ folder into 90% for training and 10% for validation. The val/ folder is used as the test set.

In [7]:
if not LOCAL:
    !cp $CODE_DIR/dataset/tiny_imagenet_train_val_indices.npy /content/tiny_imagenet_train_val_indices.npy
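A 90%/10% split of the train/ folder like the one described above could be generated along the following lines. This is only a sketch: the actual layout of `tiny_imagenet_train_val_indices.npy` is not shown in this notebook, so the function name `make_train_val_indices`, the seed, and the choice of a plain (non-stratified) random permutation are all assumptions made for illustration.

```python
import numpy as np

def make_train_val_indices(n_samples, val_fraction=0.1, seed=0):
    """Split range(n_samples) into disjoint, sorted train/val index arrays.

    Hypothetical helper: a seeded random permutation, not stratified by class.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_val = int(round(n_samples * val_fraction))
    val_idx = np.sort(perm[:n_val])
    train_idx = np.sort(perm[n_val:])
    return train_idx, val_idx

# TinyImageNet's train/ folder holds 200 classes x 500 images = 100,000 images.
train_idx, val_idx = make_train_val_indices(100_000)
print(len(train_idx), len(val_idx))
```

Fixing the seed keeps the split reproducible, which is the reason for shipping a pre-computed index file in the first place.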

Install the Python requirements for fine-tuning CLIP on TinyImageNet.

In [8]:
if not LOCAL:
    !pip install --quiet --upgrade pip
    !pip install -q -r clip_TinyImageNet/requirements.txt
    print("✅ Core packages installed!")

Device Info


In [9]:
import torch
import subprocess

print("🔍 System Information:")
print(f"Python version: {subprocess.check_output(['python', '--version']).decode().strip()}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU device: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"CUDA version: {torch.version.cuda}")
else:
    print("❌ No GPU available! Please enable GPU runtime in Colab.")
    print("Runtime > Change runtime type > Hardware accelerator > GPU")

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

🔍 System Information:
Python version: Python 3.11.5
PyTorch version: 2.8.0
CUDA available: False
❌ No GPU available! Please enable GPU runtime in Colab.
Runtime > Change runtime type > Hardware accelerator > GPU


## Load Models and Dataset
Load several models (different fine-tuning configurations) at various epochs.

In [10]:
import torch
import clip
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import math
from tqdm.notebook import tqdm
from utils import ModelWrapper, get_model_from_sd, eval_model_on_dataset

In [25]:
# Load all trained model_sds

Model_names = ['config1', 'config2', 'config3', 'config4', 'config5']
Model_sds = []
checkpoint_steps = range(0, 11, 1)

base_model, preprocess = clip.load('ViT-B/32', DEVICE, jit=False)
base_model = base_model.float() # Force the base model to stay in float32 to match saved weights

for name in Model_names:
    for checkpoint_step in checkpoint_steps:

        model_path = f'checkpoints/{name}_{checkpoint_step}.pt'
        drive_path = f'/content/drive/MyDrive/Colab Notebooks/{model_path}'

        # Try local first, then Drive backup
        sd = None
        if os.path.exists(model_path):
            print(f'Loading {model_path} (local)')
            sd = torch.load(model_path, map_location=DEVICE)
        elif os.path.exists(drive_path):
            print(f'Loading {drive_path} (from Drive)')
            sd = torch.load(drive_path, map_location=DEVICE)
        
        if sd is not None:
            Model_sds.append({
                'name': name,
                'epoch': checkpoint_step,
                'state_dict': sd
            })

print(f"✅ Loaded {len(Model_sds)} model_sds successfully!")

Loading checkpoints/config3_6.pt (local)
Loading checkpoints/config4_6.pt (local)
Loading checkpoints/config5_9.pt (local)
✅ Loaded 3 model_sds successfully!
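The loading loop silently skips any checkpoint that exists neither locally nor on Drive, which is why only 3 of the 5 x 11 = 55 requested (config, step) combinations appear above. A small helper such as the one below could list what is missing. The name `find_missing_checkpoints` is hypothetical. The path scheme `checkpoints/{name}_{step}.pt` and the Drive folder are taken from the cell above.

```python
import os

def find_missing_checkpoints(names, steps, roots):
    """Return the (name, step) pairs with no checkpoint file under any root."""
    missing = []
    for name in names:
        for step in steps:
            rel_path = os.path.join("checkpoints", f"{name}_{step}.pt")
            if not any(os.path.exists(os.path.join(root, rel_path)) for root in roots):
                missing.append((name, step))
    return missing

names = ['config1', 'config2', 'config3', 'config4', 'config5']
steps = range(0, 11)
# "." is the notebook directory; the Drive folder only exists on Colab.
roots = [".", "/content/drive/MyDrive/Colab Notebooks"]
missing = find_missing_checkpoints(names, steps, roots)
print(f"{len(names) * len(steps) - len(missing)} found, {len(missing)} missing")
```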


In [21]:
print("Creating datasets...")

from dataset.tiny_imagenet import TinyImageNet

# Use CLIP's expected preprocessing for proper model evaluation
import clip
clip_preprocess = clip.load('ViT-B/32', DEVICE, jit=False)[1]

data_tinyImageNet = TinyImageNet(
    train_preprocess=clip_preprocess,  # Use CLIP preprocessing for training
    eval_preprocess=clip_preprocess,   # Use CLIP preprocessing for evaluation
    location=DATA_DIR,
    batch_size=8,
    num_workers=2,
    distributed=False,
)

test_loader = data_tinyImageNet.test_loader

print("✅ Done")

Creating datasets...
✅ Done


## Parameter space distance - pairs of models

Pairs: any two loaded checkpoints, each identified by a config and an epoch.

Compute the L2 distance over all pairs of models.
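In symbols, write $\theta_i$ for the vector obtained by flattening and concatenating every tensor in the state dict of model $i$ (classification head included), and $\theta_i^{(k)}$ for its $k$-th parameter group. The quantity computed in the next cell is then the following (the notation is introduced here for illustration only):

```latex
d(\theta_i, \theta_j)
  = \lVert \theta_i - \theta_j \rVert_2
  = \sqrt{\sum_{k} \bigl\lVert \theta_i^{(k)} - \theta_j^{(k)} \bigr\rVert_F^2}
```

Because the squared distance is a sum over parameter groups, it can also be split into a backbone part and a classification-head part if that breakdown is of interest.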

In [26]:
n_models = len(Model_sds)
print(f"🔍 Analyzing {n_models} models")

# Get all parameter group names from first model
param_names = list(Model_sds[0]['state_dict'].keys())
print(f"Total parameter groups: {len(param_names)}")

# Separate classification head vs backbone parameters
head_params = [k for k in param_names if k.startswith('classification_head')]
backbone_params = [k for k in param_names if not k.startswith('classification_head')]

print(f"Classification head parameter groups: {len(head_params)}")
print(f"Backbone parameter groups: {len(backbone_params)}")

# Check if backbone parameters are identical across all models (stops at the first mismatch)
ref_sd = Model_sds[0]['state_dict']
backbone_identical = all(
    torch.allclose(ref_sd[param_name], m['state_dict'][param_name], atol=1e-6)
    for param_name in backbone_params
    for m in Model_sds[1:]
)

if backbone_identical:
    print("✅ All backbone parameters are identical across models!")
else:
    print("❌ Some backbone parameters differ between models.")

records = []

for i in range(n_models):
    for j in range(i + 1, n_models):
        model_i = Model_sds[i]
        model_j = Model_sds[j]

        # Flatten and concatenate all parameters in a fixed key order so entries line up
        params_i = torch.cat([model_i['state_dict'][k].flatten() for k in param_names])
        params_j = torch.cat([model_j['state_dict'][k].flatten() for k in param_names])

        # Compute L2 distance
        l2_dist = torch.norm(params_i - params_j).item()

        records.append({
            'model1_name': model_i['name'],
            'model1_epoch': model_i['epoch'],
            'model2_name': model_j['name'],
            'model2_epoch': model_j['epoch'],
            'l2_distance': l2_dist,
        })

# Create DataFrame
df_results = pd.DataFrame(records)
print("✅ DataFrame with L2 distances created.")
df_results.head()

🔍 Analyzing 3 models
Total parameter groups: 160
Classification head parameter groups: 2
Backbone parameter groups: 158
❌ Some backbone parameters differ between models.
✅ DataFrame with L2 distances created.


| | model1_name | model1_epoch | model2_name | model2_epoch | l2_distance |
|---|---|---|---|---|---|
| 0 | config3 | 6 | config4 | 6 | 6.266105 |
| 1 | config3 | 6 | config5 | 9 | 1.097487 |
| 2 | config4 | 6 | config5 | 9 | 6.281423 |


### Plot: distance(epoch)
L2 distance between any two different models at the same epoch during training.

In [None]:
# TODO:
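One possible way to prepare the data for this plot, assuming the `records` list of dictionaries built in the distance cell: keep only the pairs whose two models are at the same epoch and group them by config pair, giving one distance-versus-epoch curve per pair. The helper name `same_epoch_distances` is hypothetical. The example rows are the three pairs from the table above.

```python
from collections import defaultdict

def same_epoch_distances(records):
    """Group L2 distances of same-epoch pairs by (model1_name, model2_name)."""
    curves = defaultdict(list)
    for r in records:
        same_epoch = r['model1_epoch'] == r['model2_epoch']
        different_models = r['model1_name'] != r['model2_name']
        if same_epoch and different_models:
            pair = (r['model1_name'], r['model2_name'])
            curves[pair].append((r['model1_epoch'], r['l2_distance']))
    # Sort each curve by epoch so it can be drawn as a line
    return {pair: sorted(points) for pair, points in curves.items()}

records = [
    {'model1_name': 'config3', 'model1_epoch': 6,
     'model2_name': 'config4', 'model2_epoch': 6, 'l2_distance': 6.266105},
    {'model1_name': 'config3', 'model1_epoch': 6,
     'model2_name': 'config5', 'model2_epoch': 9, 'l2_distance': 1.097487},
    {'model1_name': 'config4', 'model1_epoch': 6,
     'model2_name': 'config5', 'model2_epoch': 9, 'l2_distance': 6.281423},
]
curves = same_epoch_distances(records)
print(curves)
```

With only three checkpoints loaded, a single same-epoch pair survives the filter, so the plot becomes informative only once more checkpoints per config are available. Each curve can then be drawn with something like `plt.plot(*zip(*points), label=str(pair))`.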

## Model soup performance (on test split) - pairs of models

In [16]:
def create_soup(state_dicts, weights=None):
    """Create a model soup as a weighted average of state dicts (uniform weights by default)."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    if len(weights) != len(state_dicts):
        raise ValueError("weights must have one entry per state dict")

    # Start with the first model scaled by its weight (the multiplication allocates new tensors)
    soup_state_dict = {k: v * weights[0] for k, v in state_dicts[0].items()}

    # Accumulate the remaining models
    for state_dict, weight in zip(state_dicts[1:], weights[1:]):
        for k, v in state_dict.items():
            soup_state_dict[k] += v * weight

    return soup_state_dict


Compute soup accuracy for each pair

In [None]:
soup_accuracies = []
# Create a dictionary for quick state_dict lookup
model_map = {(m['name'], m['epoch']): m['state_dict'] for m in Model_sds}

for _, row in tqdm(df_results.iterrows(), total=len(df_results), desc="Evaluating Soups"):
    sd1 = model_map[(row['model1_name'], row['model1_epoch'])]
    sd2 = model_map[(row['model2_name'], row['model2_epoch'])]
    
    # Create a soup model: theta = (theta_i + theta_j) / 2
    soup_sd = create_soup([sd1, sd2])
    
    # Evaluate the soup on test split
    soup_model = get_model_from_sd(soup_sd, base_model)
    acc = eval_model_on_dataset(soup_model, test_loader)
    soup_accuracies.append(acc)

df_results['soup_accuracy'] = soup_accuracies
print("✅ Soup accuracies calculated and added to DataFrame.")
df_results.head()



[0% 0/1250]	Acc: 62.50	Data (t) 3.840	Batch (t) 3.975
[2% 20/1250]	Acc: 80.36	Data (t) 0.000	Batch (t) 0.099
[3% 40/1250]	Acc: 78.35	Data (t) 0.000	Batch (t) 0.103
[5% 60/1250]	Acc: 80.53	Data (t) 0.000	Batch (t) 0.104
[6% 80/1250]	Acc: 81.33	Data (t) 0.000	Batch (t) 0.110
[8% 100/1250]	Acc: 81.06	Data (t) 0.000	Batch (t) 0.114
[10% 120/1250]	Acc: 80.99	Data (t) 0.000	Batch (t) 0.115
[11% 140/1250]	Acc: 81.12	Data (t) 0.000	Batch (t) 0.114
[13% 160/1250]	Acc: 81.83	Data (t) 0.000	Batch (t) 0.109
[14% 180/1250]	Acc: 81.91	Data (t) 0.000	Batch (t) 0.115
[16% 200/1250]	Acc: 82.21	Data (t) 0.001	Batch (t) 0.117
[18% 220/1250]	Acc: 82.30	Data (t) 0.000	Batch (t) 0.109
[19% 240/1250]	Acc: 82.16	Data (t) 0.001	Batch (t) 0.109
[21% 260/1250]	Acc: 82.52	Data (t) 0.001	Batch (t) 0.124
[22% 280/1250]	Acc: 83.10	Data (t) 0.000	Batch (t) 0.133
[24% 300/1250]	Acc: 83.35	Data (t) 0.000	Batch (t) 0.129
[26% 320/1250]	Acc: 83.49	Data (t) 0.001	Batch (t) 0.109
[27% 340/1250]	Acc: 83.28	Data (t) 0.000	Ba

KeyboardInterrupt: 

### Plot: soup accuracy(epoch)
Soup accuracy for any two models at the same epoch.

Soup improvement over the mean accuracy of the two individual models.

In [None]:
# TODO
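The "soup improvement" described above can be written as the soup accuracy minus the mean accuracy of its two ingredient models. A minimal sketch follows. The function name `soup_improvement` is hypothetical, and the individual model accuracies are not computed anywhere in this notebook yet, so they would first have to be obtained (for example with `eval_model_on_dataset` on each model).

```python
def soup_improvement(soup_acc, acc_1, acc_2):
    """Soup accuracy minus the mean accuracy of its two ingredient models."""
    return soup_acc - 0.5 * (acc_1 + acc_2)

# Illustrative numbers only (not measured results).
improvement = soup_improvement(80.0, 78.0, 79.0)
print(improvement)
```

A positive value means the averaged weights outperform the average of the two models. A negative value means that averaging hurt.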

## Relation between Soup performance and model L2 distance

### Plot: soup improvement vs L2 distance
Scatter plot of soup improvement against L2 distance, one point per pair of models.
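A sketch of how the scatter plot could be assembled, assuming each pair is available as a dictionary with an `l2_distance` entry and a `soup_improvement` entry. The latter is a hypothetical column that this notebook does not compute yet, and both helper names are made up for illustration.

```python
def scatter_points(rows, x_key="l2_distance", y_key="soup_improvement"):
    """Return (x, y) tuples for the rows that contain both values."""
    return [(r[x_key], r[y_key]) for r in rows
            if r.get(x_key) is not None and r.get(y_key) is not None]

def plot_scatter(points):
    """Draw the scatter plot; matplotlib is imported lazily so scatter_points has no dependencies."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    xs, ys = zip(*points)
    ax.scatter(xs, ys)
    ax.axhline(0.0, linestyle="--", linewidth=1)  # zero line: soup equals the mean of its ingredients
    ax.set_xlabel("L2 distance")
    ax.set_ylabel("Soup improvement")
    return ax

# Illustrative rows only (not measured results); the second row has no improvement value yet.
rows = [
    {"l2_distance": 1.0, "soup_improvement": 0.5},
    {"l2_distance": 2.0},
    {"l2_distance": 3.0, "soup_improvement": -0.25},
]
points = scatter_points(rows)
print(points)
```

In the notebook, the rows could come from `df_results.to_dict("records")` once the soup evaluation has finished for every pair. Missing values stored as NaN would need to be dropped first, for example with `DataFrame.dropna`.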