# Pneumonia Detection: Anchor-Free vs. Anchor-Based Object Detection

**NAML Course Project — Politecnico di Milano**

This notebook runs the full training and evaluation pipeline on **Kaggle** with GPU acceleration.

Three models compared:
1. **FCOS** — anchor-free (paper's method)
2. **RetinaNet** — anchor-based, one-stage
3. **Faster R-CNN** — anchor-based, two-stage

### Advanced Training Techniques
- **COCO-pretrained models**: Full detection model weights (backbone + FPN + heads) for fast convergence
- **Backbone freezing**: Early ResNet layers frozen for first 3 epochs (prevents overfitting)
- **EMA (Exponential Moving Average)**: Smoother optimization for better generalization
- **Cosine annealing LR**: Better convergence than step decay
- **WeightedRandomSampler**: Oversamples positive patients (3x) for class imbalance
- **Medical augmentations**: CLAHE, rotation, contrast, noise, elastic/grid distortion (via albumentations)
- **TTA (Test-Time Augmentation)**: Horizontal flip at eval time (+1-3% AP)
- **Gaussian Soft-NMS**: Decays overlapping scores instead of hard removal
- **Optimal patient threshold**: ROC/Youden's J statistic for patient classification
- **Channels-last memory format**: 10-30% faster GPU convolutions
- **AMP + multi-GPU + torch.compile()**: Maximum hardware utilization
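
The "optimal patient threshold" item can be illustrated with a small, dependency-free sketch. It assumes each patient has been reduced to a single score (for example the highest box confidence, which is an assumption here and not something the notebook states) and picks the threshold that maximises Youden's J = TPR - FPR. The function name `youden_threshold` and the toy data are hypothetical; with scikit-learn installed, the same candidate thresholds could be taken from `sklearn.metrics.roc_curve`.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximising J = TPR - FPR.

    A patient is predicted positive when score >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("Youden's J needs at least one positive and one negative")

    best_t, best_j = None, -1.0
    # Every distinct score is a candidate threshold.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # strict '>' keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t, best_j


# Toy patient scores and labels (illustrative only, not RSNA results).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1, 1]
best_threshold, best_j = youden_threshold(toy_scores, toy_labels)
print(best_threshold, best_j)  # 0.4 0.75
```

On this toy set, a threshold of 0.4 catches all three positives at the cost of one false positive out of four negatives, hence J = 1.0 - 0.25.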

### Kaggle Setup
1. **Add the RSNA dataset**: Click *Add Data* → search `rsna-pneumonia-detection-challenge` (Competition tab) → Add
2. **Upload project code**: Upload your `Project_Pneumonia_Detection.zip` as a Kaggle Dataset, then add it here via *Add Data*
3. **Enable GPU**: Settings → Accelerator → **GPU T4 x2** (trains 2 models in parallel) or **GPU P100** (single GPU)
4. **Enable Internet**: Settings → Internet → **On** (needed for pip install)

## 1. Setup

In [None]:
# Verify GPU is available
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
NUM_GPUS = torch.cuda.device_count()
print(f"Number of GPUs:  {NUM_GPUS}")
for i in range(NUM_GPUS):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)} "
          f"({torch.cuda.get_device_properties(i).total_memory / 1e9:.1f} GB)")
if NUM_GPUS == 0:
    print("WARNING: No GPU detected! Go to Settings > Accelerator > GPU T4 x2")
elif NUM_GPUS >= 2:
    print("\n2 GPUs detected — will train models in parallel!")
else:
    print("\n1 GPU detected — will train models sequentially.")

### Copy Project Code

In [None]:
# Copy project source code from uploaded dataset to working directory
import os
import shutil
from pathlib import Path

WORKING_DIR = "/kaggle/working"
os.chdir(WORKING_DIR)

# Search for main.py under all known Kaggle input mount points
# Skip checkpoint datasets (they may contain stale code from a previous run)
input_dirs = [Path("/kaggle/input/datasets"), Path("/kaggle/input"), Path("/kaggle/input/competitions")]
project_src = None

for base in input_dirs:
    if not base.exists():
        continue
    for p in base.rglob("main.py"):
        # Verify it's our project (has src/ alongside it)
        if (p.parent / "src").is_dir():
            # Skip if this looks like a checkpoint dataset (has checkpoints/ folder)
            if (p.parent / "checkpoints").is_dir():
                print(f"  Skipping checkpoint dataset: {p.parent}")
                continue
            project_src = p.parent
            break
    if project_src:
        break

# Fallback: if no non-checkpoint source found, use any match
if project_src is None:
    for base in input_dirs:
        if not base.exists():
            continue
        for p in base.rglob("main.py"):
            if (p.parent / "src").is_dir():
                project_src = p.parent
                break
        if project_src:
            break

if project_src is None:
    print("ERROR: Could not find project files (main.py + src/) in /kaggle/input/")
    print("Make sure you uploaded the project ZIP as a Kaggle Dataset and added it to this notebook.")
else:
    print(f"Found project at: {project_src}")
    for item in ["main.py", "src", "requirements.txt", "regenerate_plots.py"]:
        src = project_src / item
        dst = Path(WORKING_DIR) / item
        if src.exists():
            # Always overwrite to ensure latest code
            if dst.exists():
                if dst.is_dir():
                    shutil.rmtree(str(dst))
                else:
                    dst.unlink()
            if src.is_dir():
                shutil.copytree(str(src), str(dst))
            else:
                shutil.copy2(str(src), str(dst))
            print(f"  Copied: {item}")
    print(f"Working directory: {os.getcwd()}")

In [None]:
# Install dependencies (albumentations and scikit-learn are usually pre-installed on Kaggle)
!pip install -q pydicom seaborn albumentations scikit-learn

In [None]:
# Verify project structure
required = ["main.py", "src/config.py", "src/engine.py", "src/evaluate.py",
            "src/dataset.py", "src/transforms.py", "src/visualize.py",
            "src/models/__init__.py", "src/models/fcos.py",
            "src/models/retinanet.py", "src/models/faster_rcnn.py"]
missing = [f for f in required if not os.path.exists(f)]
if missing:
    print(f"ERROR: Missing files: {missing}")
    print(f"Current directory: {os.getcwd()}")
    print(f"Contents: {os.listdir('.')}")
else:
    print("All project files found.")

## 2. Prepare Dataset

The RSNA dataset is already available at `/kaggle/input/rsna-pneumonia-detection-challenge/` — no download needed!

Since Kaggle input is read-only, we symlink the dataset files to a writable `data/` directory.

In [None]:
# Set up data directory with symlinks to the Kaggle input (read-only)
import os
from pathlib import Path

# Find RSNA dataset under any Kaggle mount point
RSNA_INPUT = None
for candidate in [
    "/kaggle/input/rsna-pneumonia-detection-challenge",
    "/kaggle/input/competitions/rsna-pneumonia-detection-challenge",
]:
    if os.path.exists(candidate):
        RSNA_INPUT = candidate
        break

if RSNA_INPUT is None:
    # Fallback: search for the labels CSV
    for p in Path("/kaggle/input").rglob("stage_2_train_labels.csv"):
        RSNA_INPUT = str(p.parent)
        break

if RSNA_INPUT is None:
    raise FileNotFoundError(
        "RSNA dataset not found! Add it via: Add Data → Competition tab → rsna-pneumonia-detection-challenge"
    )

print(f"RSNA dataset found at: {RSNA_INPUT}")

DATA_DIR = "/kaggle/working/data"
os.makedirs(DATA_DIR, exist_ok=True)

# Symlink the dataset files into our writable data directory
for item in os.listdir(RSNA_INPUT):
    src = os.path.join(RSNA_INPUT, item)
    dst = os.path.join(DATA_DIR, item)
    if not os.path.exists(dst):
        os.symlink(src, dst)
        print(f"  Linked: {item}")

# Auto-detect previous run output (saved as dataset) and restore PNG images
prev_png = None
for candidate in Path("/kaggle/input").rglob("stage_2_train_images_png"):
    if candidate.is_dir():
        prev_png = str(candidate)
        break

png_dst = os.path.join(DATA_DIR, "stage_2_train_images_png")
if prev_png and not os.path.exists(png_dst):
    os.symlink(prev_png, png_dst)
    n_pngs = len(os.listdir(png_dst))
    # Verify PNG count; if incomplete, remove the symlink so the preprocessing cell below reconverts
    if n_pngs < 25000:  # full dataset has ~26,684
        os.unlink(png_dst)
        print(f"  Previous PNGs incomplete ({n_pngs} files), will reconvert")
    else:
        print(f"  Linked {n_pngs} PNG images from previous run")

print(f"\nData directory: {DATA_DIR}")
print(f"Contents: {os.listdir(DATA_DIR)}")

# Show dataset stats
import pandas as pd
df = pd.read_csv(f"{DATA_DIR}/stage_2_train_labels.csv")
n_patients = df["patientId"].nunique()
n_positive = df[df["Target"] == 1]["patientId"].nunique()
print(f"\nTotal patients: {n_patients}")
print(f"Positive (pneumonia): {n_positive}")
print(f"Negative: {n_patients - n_positive}")

In [None]:
# Preprocess DICOM -> PNG (much faster data loading during training)
# This runs on Kaggle's local SSD, so it's ~10-15 min instead of ~160 min on Colab+Drive
import os
png_dir = f"{DATA_DIR}/stage_2_train_images_png"
if not os.path.exists(png_dir) or len(os.listdir(png_dir)) < 100:
    !PYTHONUNBUFFERED=1 python -m src.preprocess --data-dir {DATA_DIR} --compress 1
else:
    print(f"PNG images already exist ({len(os.listdir(png_dir))} files).")

## 3. Training Configuration

**Optimized for maximum performance within Kaggle's 12-hour limit.**

Advanced features enabled by default:
- **Backbone freezing** (3 epochs): Prevents catastrophic forgetting of COCO features
- **EMA** (decay=0.999): Exponential Moving Average for smoother generalization
- **Cosine annealing**: Better LR schedule than step decay
- **Weighted sampler**: 3x oversampling of pneumonia-positive patients
- **TTA + Soft-NMS**: Test-time augmentation and Gaussian score decay at eval
- **Medical augmentations**: CLAHE, rotation, contrast, noise, elastic/grid distortion

In [None]:
# ============================================================
# TRAINING SETTINGS — adjust these as needed
# ============================================================

EPOCHS = 10           # 10 is enough with COCO-pretrained weights
BATCH_SIZE = 64       # 64 targets an A100 80GB; on a Kaggle T4 (16 GB) start with 32, or 24 if OOM
MAX_SAMPLES = None    # None = full dataset; set to 500 for quick test
LEARNING_RATE = 1e-4  # Stable LR for Adam + detection models
IMAGE_SIZE = 512      # Input image size
VAL_FREQUENCY = 2     # Validate every N epochs (2 = half as many validation passes)
EARLY_STOPPING = 5    # Stop after N validations without improvement

# Advanced features (all enabled by default for max performance)
FREEZE_EPOCHS = 3     # Freeze backbone for N epochs (0=disabled)
SCHEDULER = "cosine"  # "cosine" or "step"
GRAD_ACCUM = 1        # Gradient accumulation steps (increase if batch < 8)

# ============================================================

import os, shutil
from pathlib import Path

n_cpus = os.cpu_count() or 4
NUM_WORKERS = min(8, n_cpus)  # 8 workers per GPU for A100; min(4, cpus) for T4

# Auto-detect previous run output saved as a dataset (for resume)
PREV_RUN = None
for candidate in Path("/kaggle/input").rglob("checkpoints"):
    if candidate.is_dir() and any(f.endswith(".pth") for f in os.listdir(candidate)):
        PREV_RUN = str(candidate.parent)
        break

RESUME = False
if PREV_RUN:
    print(f"Found previous run at: {PREV_RUN}")

    # Restore checkpoints and results
    for folder in ["checkpoints", "results"]:
        prev_folder = os.path.join(PREV_RUN, folder)
        if os.path.exists(prev_folder):
            dst = os.path.join("/kaggle/working", folder)
            os.makedirs(dst, exist_ok=True)
            copied = 0
            for f in os.listdir(prev_folder):
                src_f = os.path.join(prev_folder, f)
                dst_f = os.path.join(dst, f)
                if not os.path.exists(dst_f) and os.path.isfile(src_f):
                    shutil.copy2(src_f, dst_f)
                    copied += 1
            if copied:
                print(f"  Restored {copied} files to {folder}/")
    ckpt_dir = "/kaggle/working/checkpoints"
    if os.path.exists(ckpt_dir) and os.listdir(ckpt_dir):
        RESUME = True
        print(f"  Checkpoints: {sorted(os.listdir(ckpt_dir))}")

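# Remove ALL old checkpoints because the model architectures changed.
# NOTE: this also deletes the checkpoints restored above and disables resume.
# Set PURGE_OLD_CHECKPOINTS = False once saved checkpoints match the current code.
PURGE_OLD_CHECKPOINTS = True
ckpt_dir = "/kaggle/working/checkpoints"
if PURGE_OLD_CHECKPOINTS and os.path.exists(ckpt_dir):
    for f in os.listdir(ckpt_dir):
        if f.endswith(".pth"):
            os.remove(os.path.join(ckpt_dir, f))
            print(f"  Removed incompatible checkpoint: {f}")
    RESUME = False  # Force fresh training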

# Figure out which models still need training
MODELS_TO_TRAIN = []
for name in ["fcos", "retinanet", "faster_rcnn"]:
    final = f"/kaggle/working/checkpoints/{name}_final.pth"
    resume_ckpt = f"/kaggle/working/checkpoints/{name}_resume.pth"
    if os.path.exists(final):
        print(f"  {name}: already completed")
    elif os.path.exists(resume_ckpt):
        print(f"  {name}: will RESUME from checkpoint")
        MODELS_TO_TRAIN.append(name)
    else:
        print(f"  {name}: will train from scratch")
        MODELS_TO_TRAIN.append(name)

resume_flag = " --resume" if RESUME else ""

base_args = (
    f" --data-dir {DATA_DIR}"
    f" --epochs {EPOCHS} --batch-size {BATCH_SIZE}"
    f" --lr {LEARNING_RATE} --image-size {IMAGE_SIZE}"
    f" --val-frequency {VAL_FREQUENCY}"
    f" --early-stopping {EARLY_STOPPING}"
    f" --prefetch-factor 4"
    f" --freeze-epochs {FREEZE_EPOCHS}"
    f" --scheduler {SCHEDULER}"
    f" --grad-accum {GRAD_ACCUM}"
    f"{resume_flag}"
)
if MAX_SAMPLES is not None:
    base_args += f" --max-samples {MAX_SAMPLES}"

print(f"\nSettings: epochs={EPOCHS}, batch_size={BATCH_SIZE}, lr={LEARNING_RATE}")
print(f"Advanced: freeze={FREEZE_EPOCHS}ep, scheduler={SCHEDULER}, EMA=True, TTA=True, Soft-NMS=True")
print(f"Resume: {RESUME}, Models to train: {MODELS_TO_TRAIN or 'none (all done)'}")
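
To make the `SCHEDULER = "cosine"` and `LEARNING_RATE = 1e-4` settings concrete, here is a minimal sketch of a cosine schedule with linear warmup. The warmup length, the step counts, and the function name `lr_at_step` are assumptions for illustration; the notebook does not show how `main.py` implements its scheduler. In PyTorch, a function like this could be wrapped in `torch.optim.lr_scheduler.LambdaLR` after dividing by the base learning rate.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-4, warmup_steps=100, min_lr=0.0):
    """Learning rate at a given optimizer step.

    Linear warmup to base_lr over warmup_steps (assumed length),
    then cosine decay from base_lr down to min_lr at total_steps.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # clamp in case step exceeds total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points of a hypothetical 1000-step run.
for s in (0, 99, 100, 550, 1000):
    print(s, f"{lr_at_step(s, total_steps=1000):.2e}")
```

The rate climbs to `base_lr` during warmup, passes through half of `base_lr` midway through the decay, and reaches `min_lr` at the final step.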

## 4. Train Models

Each model uses:
- **COCO-pretrained backbone + FPN + heads** (only classification head replaced for 2 classes)
- **Backbone freezing** (first 3 epochs): Protects pretrained low-level features
- **EMA** (decay=0.999): Smoothed weights for better generalization
- **Cosine annealing LR** with warmup: Better convergence profile
- **WeightedRandomSampler**: 3x oversampling of positive patients
- **Medical augmentations** (CLAHE, rotation, contrast, noise, elastic distortion)
- **Gradient clipping** (max_norm=1.0): Prevents training instability
- **Channels-last memory format**: 10-30% faster GPU convolutions
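
The EMA item can be sketched in a few lines. This is a simplified, framework-free illustration that treats parameters as a dictionary of floats; the real training code presumably operates on model tensors, and the name `ema_update` is hypothetical. Only the decay of 0.999 is taken from the list above.

```python
def ema_update(ema_params, model_params, decay=0.999):
    """Update the averaged copy in place: ema = decay * ema + (1 - decay) * param."""
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params


# Toy run: the "model" weight jumps from 0.0 to 1.0 and stays there.
model = {"w": 0.0}
ema = dict(model)  # the EMA copy starts equal to the model weights
for _ in range(1000):
    model["w"] = 1.0  # stand-in for an optimizer step
    ema_update(ema, model)

print(f"model={model['w']:.3f} ema={ema['w']:.3f}")
```

After 1000 updates the averaged weight has covered only about 63% of the jump (1 - 0.999^1000), which shows that a decay of 0.999 averages over roughly the last 1000 steps and smooths out step-to-step noise.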

At evaluation time: **TTA** (horizontal flip) + **Gaussian Soft-NMS** for maximum AP.
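
The two evaluation-time steps can be sketched without any framework. The code below is an illustrative approximation, not the project's `src/evaluate.py`: `hflip_boxes` maps boxes predicted on a mirrored image back to the original frame, and `soft_nms_gaussian` decays the scores of overlapping boxes by `exp(-IoU^2 / sigma)` instead of deleting them. The value `sigma=0.5`, the score cutoff, and all function names are assumptions.

```python
import math


def hflip_boxes(boxes, image_width):
    """Map [x1, y1, x2, y2] boxes from a horizontally flipped image back to the original."""
    return [[image_width - x2, y1, image_width - x1, y2] for x1, y1, x2, y2 in boxes]


def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_gaussian(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: returns (box, score) pairs in the order they were selected."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        # Select the highest-scoring remaining box.
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best)
        kept.append((best_box, best_score))
        # Decay the others according to their overlap with the selected box.
        decayed = []
        for box, score in remaining:
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_threshold:
                decayed.append((box, score))
        remaining = decayed
    return kept


# Two heavily overlapping boxes and one separate box (toy values).
toy_boxes = [[0, 0, 10, 10], [1, 0, 11, 10], [50, 50, 60, 60]]
toy_scores = [0.9, 0.8, 0.7]
kept = soft_nms_gaussian(toy_boxes, toy_scores)
for box, score in kept:
    print(box, round(score, 3))

restored = hflip_boxes([[10, 20, 30, 40]], image_width=512)
```

In the toy case the second box overlaps the first with IoU 90/110, so its score drops from 0.8 to roughly 0.21 but it is not discarded, while the distant third box keeps its score of 0.7.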

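**If Kaggle kills the session**: Save the notebook output as a Kaggle Dataset, add it via *Add Data*, and re-run. The configuration cell in section 3 detects the checkpoints and sets `RESUME` on its own, so there is no need to set it by hand. Note that the same cell then deletes every `.pth` file and forces `RESUME = False`; disable that cleanup step first if you want to resume.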

In [None]:
import subprocess, time, os

t_start = time.time()

if not MODELS_TO_TRAIN:
    print("All models already trained! Skipping to evaluation.")
elif NUM_GPUS >= 2 and len(MODELS_TO_TRAIN) >= 2:
    # GPU pool: as soon as a GPU finishes, start the next model immediately
    # Split CPU cores evenly across GPUs, clamped to 4-8 DataLoader workers per GPU
    # (e.g. 4 cores and 2 GPUs give 4 workers each; data loading is I/O bound)
    workers_per_gpu = min(8, max(4, (os.cpu_count() or 8) // NUM_GPUS))
    pending = list(MODELS_TO_TRAIN)
    running = {}   # gpu_id -> (model_name, process, log_file_handle, log_path)

    print(f"{'=' * 70}")
    print(f"  GPU POOL: {NUM_GPUS} GPUs, {workers_per_gpu} workers/GPU")
    print(f"  Models queued: {[m.upper() for m in pending]}")
    print(f"{'=' * 70}")

    while pending or running:
        # Launch on free GPUs
        free_gpus = [g for g in range(NUM_GPUS) if g not in running]
        while pending and free_gpus:
            gpu_id = free_gpus.pop(0)
            model_name = pending.pop(0)
            log_path = f"train_{model_name}.log"
            log_fh = open(log_path, "w")
            cmd = (
                f"PYTHONUNBUFFERED=1 python main.py --mode train --model {model_name}"
                f" --device cuda:{gpu_id} --num-workers {workers_per_gpu}"
                f" {base_args}"
            )
            proc = subprocess.Popen(cmd, shell=True, stdout=log_fh, stderr=subprocess.STDOUT)
            running[gpu_id] = (model_name, proc, log_fh, log_path)
            print(f"  Started {model_name.upper()} on GPU {gpu_id}  (queue: {[m.upper() for m in pending]})")

        if not running:
            break

        # Wait and show progress
        time.sleep(60)
        print(f"\n--- Progress ({time.strftime('%H:%M:%S')}) ---")

        # Check for completed processes
        for gpu_id in list(running.keys()):
            model_name, proc, log_fh, log_path = running[gpu_id]
            rc = proc.poll()

            # Show latest log line
            try:
                with open(log_path) as f:
                    lines = f.readlines()
                for line in reversed(lines):
                    stripped = line.strip()
                    if stripped and not stripped.startswith("W0"):
                        print(f"  [{model_name}] GPU {gpu_id}: {stripped}")
                        break
            except FileNotFoundError:
                print(f"  [{model_name}] GPU {gpu_id}: (waiting...)")

            if rc is not None:
                # Process finished — free the GPU
                log_fh.close()
                del running[gpu_id]
                status = "DONE" if rc == 0 else f"FAILED (exit {rc})"
                print(f"\n{'=' * 70}")
                print(f"  {model_name.upper()} — {status} (GPU {gpu_id} now free)")
                print(f"{'=' * 70}")
                try:
                    with open(log_path) as f:
                        for line in f.readlines()[-15:]:
                            print(line, end="")
                except FileNotFoundError:
                    pass

    # Print final summary
    for model_name in MODELS_TO_TRAIN:
        log_path = f"train_{model_name}.log"
        if os.path.exists(log_path):
            with open(log_path) as f:
                lines = f.readlines()
            # Find the last epoch summary line
            for line in reversed(lines):
                if f"[{model_name}]" in line and "Epoch" in line:
                    print(f"  {line.strip()}")
                    break
else:
    # Sequential on single GPU
    for model_name in MODELS_TO_TRAIN:
        print(f"\n{'=' * 70}")
        print(f"  Training: {model_name.upper()}")
        print(f"{'=' * 70}")
        !PYTHONUNBUFFERED=1 python main.py --mode train --model {model_name} \
            --device cuda:0 --num-workers {NUM_WORKERS} {base_args}

elapsed = time.time() - t_start
hours = int(elapsed // 3600)
mins = int((elapsed % 3600) // 60)
print(f"\n{'=' * 70}")
print(f"  TRAINING COMPLETE  —  Total time: {hours}h {mins}m")
print(f"{'=' * 70}")

## 5. Evaluate & Generate Plots

In [None]:
# Evaluate all models and generate plots/visualizations
!python main.py --mode evaluate --model all --num-workers {NUM_WORKERS} {base_args}
!python main.py --mode compare --num-workers {NUM_WORKERS} {base_args}
!python main.py --mode visualize --num-workers {NUM_WORKERS} {base_args}

## 6. Results

In [None]:
# Load and display metrics
import json

with open("results/all_metrics.json") as f:
    metrics = json.load(f)

print("=" * 70)
print("  DETECTION PERFORMANCE (%)")
print("=" * 70)
print(f"{'Model':<16} {'AP@0.5':>8} {'AP@.5:.95':>10} {'AP_M':>8} {'AP_L':>8} {'AR@10':>8} {'AR_L':>8}")
print("-" * 70)
for name, m in metrics.items():
    print(f"{name:<16} {m['AP@0.5']*100:>8.1f} {m['AP@0.5:0.95']*100:>10.1f}"
          f" {m['AP_M']*100:>8.1f} {m['AP_L']*100:>8.1f}"
          f" {m['AR@10']*100:>8.1f} {m['AR_L']*100:>8.1f}")

print()
print("=" * 70)
print("  PATIENT-LEVEL CLASSIFICATION (%)")
print("=" * 70)
print(f"{'Model':<16} {'Accuracy':>10} {'Precision':>10} {'Recall':>10} {'F1':>10}")
print("-" * 70)
for name, m in metrics.items():
    print(f"{name:<16} {m['patient_accuracy']*100:>10.1f} {m['patient_precision']*100:>10.1f}"
          f" {m['patient_recall']*100:>10.1f} {m['patient_f1']*100:>10.1f}")

### Validation AP@0.5 Over Training

In [None]:
from IPython.display import Image, display
display(Image(filename="results/val_ap_over_epochs.png", width=800))

### AP & AR Comparison

In [None]:
display(Image(filename="results/ap_comparison.png", width=800))
display(Image(filename="results/ar_comparison.png", width=800))

### Precision-Recall Curve

In [None]:
display(Image(filename="results/pr_curve.png", width=600))

### AP vs IoU Threshold

In [None]:
import os
from IPython.display import Image, display
if os.path.exists("results/ap_vs_iou.png"):
    display(Image(filename="results/ap_vs_iou.png", width=800))
else:
    print("ap_vs_iou.png not generated (requires predictions — run 'compare' mode).")

### Patient-Level Classification

In [None]:
display(Image(filename="results/classification_metrics.png", width=800))

### Training Speed

In [None]:
display(Image(filename="results/epoch_times.png", width=600))

### Detection Samples

In [None]:
display(Image(filename="results/detection_samples.png", width=900))

## 7. Download Results

On Kaggle, results are automatically saved as notebook output. You can also download the ZIP.

In [None]:
# Package results for download
!zip -r /kaggle/working/pneumonia_results.zip results/ checkpoints/
print("Results packaged. Download from the 'Output' tab on the right.")
print("Extract into your local Project_Pneumonia_Detection/ folder.")