# PathMNIST Few‑Shot Benchmark

This notebook reproduces a simple few‑shot benchmark on the **PathMNIST** histology dataset
using image embeddings extracted from either:

* **DINOv2 ViT‑L/14**
* **OpenAI CLIP ViT‑B/16**

For each backbone we train a multinomial Logistic Regression on stratified fractions of the
labeled training data (1 % → 100 % for DINOv2; 1 %, 20 % and 50 % for CLIP) and report accuracy on the held‑out test set.


## Imports & Global Configuration

All necessary libraries, constants, and the global random‐seed helper.

In [None]:
from __future__ import annotations
import warnings, logging, argparse, math
from pathlib import Path
from typing import Dict, List, Tuple

import numpy as np
import torch, torch.nn as nn
import torchvision.transforms as T
from torch.utils.data import DataLoader, TensorDataset
from PIL import Image
from tqdm import tqdm

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler

from medmnist import PathMNIST
import clip  # OpenAI CLIP

warnings.filterwarnings('ignore', message='xFormers is available*')

DEVICE = 'cuda:1' if torch.cuda.is_available() else 'cpu'
BATCH_SIZE = 128
NUM_WORKERS = 4
SEED = 42

LABEL_PERCENTS = (0.01, 0.05, 0.10, 0.20, 0.50, 1.00) # Fractions of the labeled training set to use (0.01 = 1 %)

def set_seed(seed: int = SEED):
    torch.manual_seed(seed)
    np.random.seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed()


## Logger Utility

A tiny helper to get a **consistent log format** throughout the notebook.

In [3]:
def get_logger(name: str = 'PathMNIST'):
    logger = logging.getLogger(name)
    if not logger.handlers:
        logger.setLevel(logging.INFO)
        h = logging.StreamHandler()
        h.setFormatter(logging.Formatter('[%(levelname)s] %(message)s'))
        logger.addHandler(h)
    return logger

LOGGER = get_logger()


## Dataset Preparation

Functions that download **PathMNIST**, apply the image transform to every sample up front, and wrap the resulting tensors and labels in a `TensorDataset`.

In [4]:
def default_transform():
    return T.Compose([
        T.ToTensor(),
        T.Resize((224, 224)),
        T.Normalize([0.5]*3, [0.5]*3)
    ])

def load_pathmnist(split: str, transform: T.Compose | None = None) -> TensorDataset:
    data = PathMNIST(split=split, download=True)
    imgs, labels = data.imgs, data.labels.flatten()
    transform = transform or default_transform()
    tensors = torch.stack([transform(Image.fromarray(im)) for im in tqdm(imgs, desc=f'Transform {split}')])
    labels = torch.as_tensor(labels, dtype=torch.long)
    LOGGER.info('✓ %s split ready (%d samples)', split, len(tensors))
    return TensorDataset(tensors, labels)
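
Because `load_pathmnist` stacks every transformed image into one tensor, the whole split has to fit in RAM. The following is a rough estimate of that footprint, assuming float32 tensors of shape 3 × 224 × 224 (as produced by `default_transform`) and the sample counts reported in the run logs (89,996 train, 7,180 test). The helper name `stacked_bytes` is made up for this illustration.

```python
# Back-of-the-envelope memory estimate for the stacked float32 tensors.
def stacked_bytes(n_images: int, channels: int = 3, side: int = 224,
                  bytes_per_value: int = 4) -> int:
    """Bytes needed to hold n_images tensors of shape (channels, side, side)."""
    return n_images * channels * side * side * bytes_per_value

train_bytes = stacked_bytes(89_996)  # train split size from the logs
test_bytes = stacked_bytes(7_180)    # test split size from the logs

print(f"train: {train_bytes / 2**30:.1f} GiB")
print(f"test:  {test_bytes / 2**30:.1f} GiB")
```

On a machine with less memory than this estimate suggests, applying the transform lazily inside a custom `Dataset` would likely be the safer design.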


## Feature Extraction

Helper that feeds images through the chosen backbone and returns NumPy arrays of embeddings & labels.

In [5]:
def extract_features(model: nn.Module, dataset: TensorDataset,
                     batch_size: int = BATCH_SIZE) -> Tuple[np.ndarray, np.ndarray]:
    model.eval().to(DEVICE)
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=NUM_WORKERS)
    feats, labels = [], []
    with torch.no_grad():
        for xb, yb in tqdm(loader, desc='→ Extract'):
            xb = xb.to(DEVICE)
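            # autocast expects a bare device type ('cuda' or 'cpu'), not an indexed name such as 'cuda:1'
            with torch.autocast(device_type=torch.device(DEVICE).type):
                # Cast back to float32 so the NumPy features are not left in half precision
                feats.append(model(xb).float().cpu())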
            labels.append(yb)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()
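
Extraction over the full train split takes several minutes per backbone in the logged runs, so it can be worth caching the returned arrays on disk. The sketch below is one possible approach using NumPy's `.npz` format. The helper names `save_features` and `load_features` and the file name are hypothetical, and the arrays are toy stand-ins rather than real embeddings.

```python
import tempfile
from pathlib import Path

import numpy as np

def save_features(path: Path, feats: np.ndarray, labels: np.ndarray) -> None:
    """Write embeddings and labels to a single compressed .npz file."""
    np.savez_compressed(path, feats=feats, labels=labels)

def load_features(path: Path) -> tuple[np.ndarray, np.ndarray]:
    """Read back the arrays written by save_features."""
    with np.load(path) as data:
        return data["feats"], data["labels"]

# Toy stand-in for real embeddings: 8 samples, 4 dimensions.
rng = np.random.default_rng(42)
feats = rng.normal(size=(8, 4)).astype(np.float32)
labels = rng.integers(0, 9, size=8)

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "toy_features.npz"
    save_features(cache, feats, labels)
    feats_back, labels_back = load_features(cache)

# The round trip preserves values, dtype and shape.
print(np.array_equal(feats, feats_back), feats_back.dtype, feats_back.shape)
```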


## Few‑shot Evaluation with Logistic Regression

Train a multinomial Logistic Regression on different label budgets and report performance.

In [6]:
def benchmark_logreg(X_all: np.ndarray, y_all: np.ndarray,
                     X_test: np.ndarray, y_test: np.ndarray,
                     hp_space: Dict[float, Dict]) -> List[Tuple[float, float, float]]:
    results = []
    for pct in LABEL_PERCENTS:
        LOGGER.info('-- %.0f%% labels', pct*100)
        if pct < 1.0:
            sss = StratifiedShuffleSplit(1, train_size=pct, random_state=SEED)
            train_idx, _ = next(sss.split(X_all, y_all))
        else:
            train_idx = np.arange(len(X_all))
        X_tr, y_tr = X_all[train_idx], y_all[train_idx]

        scaler = StandardScaler()
        X_tr = scaler.fit_transform(X_tr.astype(np.float32))
        X_ts = scaler.transform(X_test.astype(np.float32))

        clf = LogisticRegression(max_iter=3000, tol=1e-2, **hp_space[pct])
        clf.fit(X_tr, y_tr.ravel())

        train_acc = accuracy_score(y_tr, clf.predict(X_tr))
        test_acc = accuracy_score(y_test, clf.predict(X_ts))
        LOGGER.info('train=%.4f | test=%.4f', train_acc, test_acc)
        results.append((pct, train_acc, test_acc))
    return results
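
To make the idea of a stratified label budget concrete, here is a small NumPy-only sketch that keeps roughly the same fraction of every class. It illustrates the concept only and is not how `StratifiedShuffleSplit` is implemented. The function name `stratified_subsample` and the toy labels are assumptions made for this example.

```python
import numpy as np

def stratified_subsample(y: np.ndarray, fraction: float, seed: int = 42) -> np.ndarray:
    """Return sorted indices holding roughly `fraction` of each class in y."""
    rng = np.random.default_rng(seed)
    picked = []
    for cls in np.unique(y):
        cls_idx = np.flatnonzero(y == cls)
        # Keep at least one sample so no class disappears at small budgets.
        n_keep = max(1, int(round(fraction * len(cls_idx))))
        picked.append(rng.choice(cls_idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(picked))

# Toy labels: three imbalanced classes (600 / 300 / 100 samples).
y_toy = np.repeat([0, 1, 2], [600, 300, 100])
idx = stratified_subsample(y_toy, fraction=0.10)

# Per-class counts in the 10 % subset keep the original 6:3:1 ratio.
print(np.bincount(y_toy[idx]))
```

Preserving class ratios matters most at the smallest budgets, where a purely random draw could leave a rare class with very few examples.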


## Hyper‑parameters

Fixed hyper‑parameters found by manual tuning for each label percentage.

In [7]:
BEST_PARAMS_DINO = {
    0.01: {'C':1.0,'penalty':'elasticnet','solver':'saga','l1_ratio':0.5},
    0.05: {'C':0.1,'penalty':'l2','solver':'saga'},
    0.10: {'C':0.1,'penalty':'l2','solver':'saga'},
    0.20: {'C':0.1,'penalty':'l2','solver':'saga'},
    0.50: {'C':0.1,'penalty':'l2','solver':'saga'},
    1.00: {'C':1.0,'penalty':'elasticnet','solver':'saga','l1_ratio':0.5},
}

BEST_PARAMS_CLIP = {
    0.01: {'C':0.1,'penalty':'l2','solver':'saga'},
    0.05: {'C':1.0,'penalty':'l2','solver':'saga'},
    0.10: {'C':1.0,'penalty':'elasticnet','solver':'saga','l1_ratio':0.5},
    0.20: {'C':1.0,'penalty':'l2','solver':'saga'},
    0.50: {'C':1.0,'penalty':'l2','solver':'saga'},
    1.00: {'C':1.0,'penalty':'l2','solver':'saga'},
}
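
`benchmark_logreg` looks up `hp_space[pct]` for every label fraction, so a missing key would only surface as a `KeyError` partway through a run. A quick check before launching the benchmark could look like the sketch below. The helper `missing_budgets` and the toy dictionaries are hypothetical and not part of the notebook.

```python
def missing_budgets(percents, hp_space):
    """Return the label fractions that have no hyper-parameter entry."""
    return [p for p in percents if p not in hp_space]

# Toy grid with one entry deliberately left out.
toy_percents = (0.01, 0.05, 0.10)
toy_params = {
    0.01: {"C": 1.0, "penalty": "l2", "solver": "saga"},
    0.10: {"C": 0.1, "penalty": "l2", "solver": "saga"},
}

print(missing_budgets(toy_percents, toy_params))  # [0.05]
```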


## Evaluate DINOv2 Backbone

Extract embeddings with **DINOv2 ViT‑L/14** and run the benchmark.

In [8]:
# ————————————— DINOv2 backbone —————————————
transform = default_transform()
train_ds = load_pathmnist('train', transform)
test_ds  = load_pathmnist('test' , transform)

dino_model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14', trust_repo=True)

LOGGER.info('Extracting DINOv2 features …')
X_tr_dino, y_tr_dino = extract_features(dino_model, train_ds)
X_te_dino, y_te_dino = extract_features(dino_model, test_ds)

LOGGER.info('Benchmarking Logistic Regression …')
results_dino = benchmark_logreg(X_tr_dino, y_tr_dino, X_te_dino, y_te_dino, BEST_PARAMS_DINO)


Transform train: 100%|██████████| 89996/89996 [02:07<00:00, 706.41it/s] 
[INFO] ✓ train split ready (89996 samples)
Transform test: 100%|██████████| 7180/7180 [00:09<00:00, 740.29it/s] 
[INFO] ✓ test split ready (7180 samples)
Using cache found in /home/infres/mmohamed-22/.cache/torch/hub/facebookresearch_dinov2_main
[INFO] Extracting DINOv2 features …
→ Extract: 100%|██████████| 704/704 [05:38<00:00,  2.08it/s]
→ Extract: 100%|██████████| 57/57 [00:40<00:00,  1.40it/s]
[INFO] Benchmarking Logistic Regression …
[INFO] -- 1% labels
[INFO] train=0.9944 | test=0.8623
[INFO] -- 5% labels
[INFO] train=0.9860 | test=0.8876
[INFO] -- 10% labels
[INFO] train=0.9821 | test=0.8891
[INFO] -- 20% labels
[INFO] train=0.9750 | test=0.8937
[INFO] -- 50% labels
[INFO] train=0.9680 | test=0.9015
[INFO] -- 100% labels
[INFO] train=0.9664 | test=0.9001


## Evaluate CLIP Backbone

Repeat the process with **OpenAI CLIP ViT‑B/16**, this time on a reduced set of three label budgets (1 %, 20 % and 50 %).

In [9]:
LABEL_PERCENTS = (0.01, 0.20, 0.50) # Percentages of labeled data to use

In [10]:
# ————————————— CLIP backbone —————————————
clip_model, preprocess_clip = clip.load('ViT-B/16', device=DEVICE)

train_ds_clip = load_pathmnist('train', preprocess_clip)
test_ds_clip  = load_pathmnist('test' , preprocess_clip)

class CLIPWrapper(nn.Module):
    def __init__(self, model): super().__init__(); self.m = model
    def forward(self, x): return self.m.encode_image(x)

clip_wrapper = CLIPWrapper(clip_model)

LOGGER.info('Extracting CLIP features …')
X_tr_clip, y_tr_clip = extract_features(clip_wrapper, train_ds_clip)
X_te_clip, y_te_clip = extract_features(clip_wrapper, test_ds_clip)

LOGGER.info('Benchmarking Logistic Regression …')
results_clip = benchmark_logreg(X_tr_clip, y_tr_clip, X_te_clip, y_te_clip, BEST_PARAMS_CLIP)


Transform train: 100%|██████████| 89996/89996 [04:16<00:00, 350.44it/s]
[INFO] ✓ train split ready (89996 samples)
Transform test: 100%|██████████| 7180/7180 [00:23<00:00, 301.72it/s]
[INFO] ✓ test split ready (7180 samples)
[INFO] Extracting CLIP features …
→ Extract: 100%|██████████| 704/704 [02:50<00:00,  4.13it/s]
→ Extract: 100%|██████████| 57/57 [00:33<00:00,  1.72it/s]
[INFO] Benchmarking Logistic Regression …
[INFO] -- 1% labels
[INFO] train=0.9722 | test=0.8606
[INFO] -- 20% labels
[INFO] train=0.9548 | test=0.8921
[INFO] -- 50% labels
[INFO] train=0.9502 | test=0.8923


## Compare Results

Combine the two result lists into a single DataFrame, indexed by label fraction so that each row compares the same budget. Budgets that were not run for CLIP appear as NaN.

In [11]:
import pandas as pd
df_dino = pd.DataFrame(results_dino, columns=['pct','train_acc','test_acc'])
df_clip = pd.DataFrame(results_clip, columns=['pct','train_acc','test_acc'])

# Index by label fraction so concat aligns rows on pct rather than on row position
display(pd.concat({'DINOv2': df_dino.set_index('pct'), 'CLIP': df_clip.set_index('pct')}, axis=1))


| pct | DINOv2 train_acc | DINOv2 test_acc | CLIP train_acc | CLIP test_acc |
|---|---|---|---|---|
| 0.01 | 0.994438 | 0.862256 | 0.972191 | 0.860585 |
| 0.05 | 0.985997 | 0.887604 | NaN | NaN |
| 0.10 | 0.982109 | 0.889136 | NaN | NaN |
| 0.20 | 0.974999 | 0.893733 | 0.954831 | 0.892061 |
| 0.50 | 0.967954 | 0.901532 | 0.950153 | 0.892340 |
| 1.00 | 0.966387 | 0.900139 | NaN | NaN |
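
Since CLIP was only run on three label budgets, a like-for-like comparison has to be restricted to the fractions both backbones share. The snippet below is a small sketch of that comparison using the test accuracies reported above. The dictionary names are made up for this example.

```python
# Test accuracies copied from the results above (label fraction -> accuracy).
dino_test = {0.01: 0.862256, 0.05: 0.887604, 0.10: 0.889136,
             0.20: 0.893733, 0.50: 0.901532, 1.00: 0.900139}
clip_test = {0.01: 0.860585, 0.20: 0.892061, 0.50: 0.892340}

# Only budgets evaluated for both backbones can be compared directly.
shared = sorted(set(dino_test) & set(clip_test))
gaps = {pct: dino_test[pct] - clip_test[pct] for pct in shared}

for pct, gap in gaps.items():
    print(f"{pct:>4.0%}  DINOv2 - CLIP = {gap:+.4f}")
```

On these single-seed runs the gap stays below one percentage point at every shared budget, so small differences should be read with caution.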
