# Supervised Lesion Detector

This notebook trains and evaluates supervised lesion detectors on the DeepLesion dataset. The following detection architectures are considered, with a ResNet-50 backbone where the architecture supports one:
- YOLOv5 (stable, but YOLOv8 is newer),
- Faster R-CNN (torchvision.models.detection or Detectron2),
- DETR (Facebook DETR),
- Improved DETR (DINO-DETR or Deformable DETR, DINO has better performance but Deformable is faster),
- RetinaNet (FPN backbone + anchor-based).

## Assumptions:
- Use 2D slice inputs (optionally use the neighbouring ones too),
- Resize all images to 512x512,
- Use a COCO-style Dataset class,
- Use DeepLesion to train a general lesion localizer, and another dataset such as LiTS (Liver Tumor Segmentation) or CHAOS (CT liver dataset) for a more specialized localizer.
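
The optional neighbouring-slice input could be built by picking the slices adjacent to the key slice and stacking them as channels. Below is a minimal sketch of the index selection only, assuming slices are addressed by integer position within a volume. `neighbour_indices` and its `context` parameter are hypothetical names and are not used elsewhere in this notebook, which works on key slices only.

```python
from typing import List


def neighbour_indices(key_idx: int, n_slices: int, context: int = 1) -> List[int]:
    """Return indices of the key slice and `context` neighbours on each side.

    Indices that fall outside the volume are clamped to the nearest valid
    slice, so the result always has 2 * context + 1 entries.
    """
    if not 0 <= key_idx < n_slices:
        raise ValueError("key_idx must lie inside the volume")
    return [
        min(max(key_idx + offset, 0), n_slices - 1)
        for offset in range(-context, context + 1)
    ]


# Key slice in the middle of a 40-slice volume: both neighbours exist
print(neighbour_indices(10, 40))  # [9, 10, 11]
# Key slice at the start: the missing neighbour is replaced by the key slice
print(neighbour_indices(0, 40))   # [0, 0, 1]
```

Clamping at the volume boundary is only one possible choice; zero-padding the missing slice would be another.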

## 📚 Thesis Value Summary
### Contribution and Value:
- Comparison of CNN vs Transformer detectors on DeepLesion -> ✅ Fills a gap in literature
- Evaluation of improved DETRs (DINO/Deformable) -> ✅ Modern insight
- General vs specialized lesion detection -> ✅ Strong clinical relevance
- Analysis of training time, robustness, failure modes -> ✅ Engineering depth


# Google Colab only

### Download required packages

In [None]:
!pip install -r https://raw.githubusercontent.com/pmalesa/lesion_detector/main/notebooks/requirements.txt

### Mount DeepLesion images from Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content
!rm -rf data/deeplesion   # remove existing link if any
!mkdir -p data
!ln -s /content/drive/MyDrive/deeplesion/data/deeplesion data/deeplesion
!ls -l data

# Import all packages

In [2]:
# General packages
import os
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from numpy.typing import NDArray
from typing import Any
from PIL import Image
import shutil
import random
from pathlib import Path
import yaml
import cv2
from datetime import datetime

# Faster R-CNN packages
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2
from torchvision.datasets import ImageFolder
from torchvision import transforms
import torchvision.transforms as T
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection import FasterRCNN_ResNet50_FPN_V2_Weights
from torchmetrics.detection.mean_ap import MeanAveragePrecision
import torchvision.ops as ops
import copy

# Image preprocessing and utility methods

In [3]:
def load_metadata(path) -> pd.DataFrame:
    """
    Loads metadata from the given path and
    returns it as a pandas DataFrame.
    """
    
    return pd.read_csv(path)

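def normalize(img: NDArray[np.uint16], per_image_norm: bool):
    """
    Normalizes the input image to the [0.0, 1.0] range, either by the
    full uint16 range or by the image's own min/max values.
    """

    img = img.astype(np.float32)
    if not per_image_norm:
        return img / 65535.0
    # Avoid shadowing the built-in max() and min()
    img_max = np.max(img)
    img_min = np.min(img)
    if img_max == img_min:
        # Constant image: avoid division by zero
        return np.zeros_like(img)
    img = (img - img_min) / (img_max - img_min)
    return img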

def convert_to_hu(img: NDArray[np.uint16], norm: bool, hu_min=-1024, hu_max=3071):
    """
    Converts the pixel data of a uint16
    CT image to Hounsfield Units (HU).
    """

    hu_img = img.astype(np.int32) - 32768
    hu_img = np.clip(hu_img, hu_min, hu_max).astype(np.float32)
    if norm:
        hu_img = (hu_img - hu_min) / (hu_max - hu_min)
        hu_img = np.clip(hu_img, 0.0, 1.0)
    return hu_img

def load_image(path: str, hu_scale: bool = True, norm: bool = True, per_image_norm: bool = True):
    """
    Loads an image given its path
    and returns it as a numpy array.
    """
    
    img = Image.open(path)
    img_array = np.array(img)
    if hu_scale:
        hu_min = -160
        hu_max = 240
        return convert_to_hu(img_array, norm, hu_min, hu_max)
    elif norm:
        return normalize(img_array, per_image_norm)
    return img_array

def save_image(img_array: NDArray[Any], path: str):
    """
    Saves the input image to the specified path.
    """

    if img_array.dtype != np.uint16:
        img_array = img_array.astype(np.uint16)
    img = Image.fromarray(img_array)
    img.save(path)

def save_normalized_image_uint8(img_array: NDArray[Any], path: str):
    """
    Saves the normalized input image data (0.0 - 1.0) 
    to the specified path with uint8 precision (0 - 255).
    """

    img_array_scaled = (img_array * 255.0).clip(0, 255).astype(np.uint8)
    img = Image.fromarray(img_array_scaled)
    img.save(path)

def show_image(img: NDArray[np.float32], title="Example Image", cmap="gray"):
    """
    Shows the image given its data, title and colour map.
    """

    plt.figure(figsize=(5, 5))
    plt.imshow(img, cmap=cmap)
    if title is not None:
        plt.title(title)
    plt.axis("off")
    plt.show()
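
As a worked example of the HU windowing performed by `convert_to_hu` with the `[-160, 240]` window used in `load_image`, the sketch below applies the same arithmetic to single pixel values. `window_hu` is a hypothetical scalar helper written only for this illustration.

```python
def window_hu(raw_value: int, hu_min: float = -160, hu_max: float = 240) -> float:
    """Map one raw uint16 pixel value to [0, 1] within an HU window."""
    hu = raw_value - 32768              # same offset as in convert_to_hu
    hu = min(max(hu, hu_min), hu_max)   # clip to the window
    return (hu - hu_min) / (hu_max - hu_min)


print(window_hu(32768 + 40))    # 0.5  (40 HU lies in the middle of the window)
print(window_hu(32768 - 1000))  # 0.0  (clipped at the lower bound)
print(window_hu(32768 + 1000))  # 1.0  (clipped at the upper bound)
```

Everything below -160 HU and above 240 HU is saturated, so the full [0, 1] range is spent on the soft-tissue window.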


# Set paths to DeepLesion images and metadata

In [4]:
# Paths to unprocessed data
# [!] In Google Colab change to: "data/deeplesion/deeplesion_metadata.csv" and "data/deeplesion/key_slices/" respectively [!]
deeplesion_metadata_path = Path("../data/deeplesion_metadata.csv")
deeplesion_image_path = Path("../data/deeplesion/key_slices/")

# Paths to processed data
deeplesion_data_dir = Path("data/deeplesion/")
deeplesion_preprocessed_image_path = deeplesion_data_dir / "deeplesion_preprocessed_uint8/key_slices"
deeplesion_preprocessed_metadata_path = deeplesion_data_dir / "deeplesion_preprocessed_uint8/deeplesion_metadata_preprocessed.csv"

# Path to deeplesion metadata in COCO format
deeplesion_coco_json_path = deeplesion_data_dir / "deeplesion_coco.json"

# Preprocess the DeepLesion dataset

In [None]:
'''
Preprocess the DeepLesion dataset by:
    - Choosing only images that come from the original validation and test set divisions (only these have lesion type annotations),
    - Normalizing each image according to a fixed HU scale window [-160, 240] (can be also [-1024, 3071]),
    - Resizing each image to 512x512 if necessary (and adjusting the target bounding box coordinates),
    - Saving the new metadata file with resized images and adjusted target bounding boxes. 
'''

deeplesion_metadata = load_metadata(deeplesion_metadata_path)

# Clear the existing directory with preprocessed images
if deeplesion_preprocessed_image_path.exists() and deeplesion_preprocessed_image_path.is_dir():
    shutil.rmtree(deeplesion_preprocessed_image_path)
deeplesion_preprocessed_image_path.mkdir(parents=True, exist_ok=True)

for idx, row in deeplesion_metadata.iterrows():
    # Extract only images with annotated lesions (Val + Test)
    if row["Train_Val_Test"] == 1:
        continue

    file_name = row["File_name"]
    bbox_str = row["Bounding_boxes"]
    size_str = row["Image_size"]
    image_path = os.path.join(deeplesion_image_path, file_name)
    preprocessed_image_path = os.path.join(deeplesion_preprocessed_image_path, file_name)
    image_data = load_image(image_path)

    # Extract ground truth bounding box' coordinates
    bbox_coords = [float(val) for val in bbox_str.split(",")]
    x1, y1, x2, y2 = [round(c) for c in bbox_coords]

    # Extract sizes
    image_sizes = [int(val) for val in size_str.split(",")]
    width, height = [size for size in image_sizes]

    # Rescale to (512 x 512) if necessary
    if (width, height) != (512, 512):
        image_data = cv2.resize(
            image_data, (512, 512), interpolation=cv2.INTER_AREA
        )
        scale_x = 512 / width
        scale_y = 512 / height
        x1 = round(x1 * scale_x)
        y1 = round(y1 * scale_y)
        x2 = round(x2 * scale_x)
        y2 = round(y2 * scale_y)
        height = 512
        width = 512

        deeplesion_metadata.at[idx, "Bounding_boxes"] = f"{x1}, {y1}, {x2}, {y2}"
        deeplesion_metadata.at[idx, "Image_size"] = "512, 512"

    save_normalized_image_uint8(image_data, preprocessed_image_path)    

# Save preprocessed metadata csv file
deeplesion_metadata.to_csv(deeplesion_preprocessed_metadata_path, index=False)  # index=False avoids an extra unnamed index column

# Verify if all images have 512x512 size
deeplesion_metadata_preprocessed = load_metadata(deeplesion_preprocessed_metadata_path)
for idx, row in deeplesion_metadata_preprocessed.iterrows():
    if row["Train_Val_Test"] == 1:
        continue
    size_str = row["Image_size"]
    image_sizes = [int(val) for val in size_str.split(",")]
    width, height = [size for size in image_sizes]
    if (width, height) != (512, 512):
        raise ValueError("ERROR: Not all images have the required size!")
print("SUCCESS: All images were preprocessed correctly.")
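
The bounding-box adjustment that accompanies the resize can be isolated into a small function. This is a sketch of the same scaling arithmetic as in the loop above; `scale_bbox` is a hypothetical helper name, and the box values are made up for illustration.

```python
def scale_bbox(bbox, src_size, dst_size=(512, 512)):
    """Rescale an (x1, y1, x2, y2) box from src_size to dst_size.

    Sizes are given as (width, height). Coordinates are rounded to integers,
    mirroring the preprocessing loop.
    """
    x1, y1, x2, y2 = bbox
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    scale_x = dst_w / src_w
    scale_y = dst_h / src_h
    return (
        round(x1 * scale_x),
        round(y1 * scale_y),
        round(x2 * scale_x),
        round(y2 * scale_y),
    )


# A 1024x1024 slice downscaled to 512x512 halves every coordinate
print(scale_bbox((100, 50, 300, 150), (1024, 1024)))  # (50, 25, 150, 75)
```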
    

# Convert DeepLesion metadata to COCO format

In [None]:
deeplesion_metadata_preprocessed = load_metadata(deeplesion_preprocessed_metadata_path)

images = []
annotations = []
categories = [
    {"id": 1, "name": "bone"},
    {"id": 2, "name": "abdomen"},
    {"id": 3, "name": "mediastinum"},
    {"id": 4, "name": "liver"},
    {"id": 5, "name": "lung"},
    {"id": 6, "name": "kidney"},
    {"id": 7, "name": "soft tissue"},
    {"id": 8, "name": "pelvis"}
]

image_counter = 1
annotation_id = 1
image_id_map = {}

for idx, row in deeplesion_metadata_preprocessed.iterrows():
    # Extract only images with annotated lesions (Val + Test)
    if row["Train_Val_Test"] == 1:
        continue

    file_name = row["File_name"]
    lesion_type = row["Coarse_lesion_type"]
    bbox_str = row["Bounding_boxes"]
    size_str = row["Image_size"]

    # Extract ground truth bounding box' coordinates
    bbox_coords = [float(val) for val in bbox_str.split(",")]
    x1, y1, x2, y2 = [round(c) for c in bbox_coords]

    # Extract sizes
    image_sizes = [int(val) for val in size_str.split(",")]
    width, height = [size for size in image_sizes]

    # Register image
    if file_name not in image_id_map:
        image_id_map[file_name] = image_counter
        images.append({
            "id": image_counter,
            "file_name": file_name,
            "width": width,
            "height": height
        })
        image_counter += 1
    image_id = image_id_map[file_name]

    # Initialize ground truth bounding box' parameters
    bbox = [x1, y1, x2 - x1, y2 - y1]
    area = bbox[2] * bbox[3]

    annotations.append({
        "id": annotation_id,
        "image_id": image_id,
        "category_id": int(lesion_type),    # COCO annotations reference the class via "category_id"
        "bbox": bbox,
        "area": area,
        "iscrowd": 0    # Normal object (not a crowd of indistinct objects that can't be cleanly separated)
    })

    annotation_id += 1

# Save to JSON
deeplesion_coco_format = {
    "images": images,
    "annotations": annotations,
    "categories": categories
}

with open(deeplesion_coco_json_path, "w") as f:
    json.dump(deeplesion_coco_format, f, indent=2)
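
For reference, COCO stores boxes as `[x, y, width, height]` and links each annotation to its class through a `category_id` key. The sketch below builds one such annotation from corner coordinates; `make_coco_annotation` is a hypothetical helper and the values are made up.

```python
import json


def make_coco_annotation(ann_id, image_id, category_id, xyxy):
    """Build a COCO-style annotation dict from an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = xyxy
    width, height = x2 - x1, y2 - y1
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x1, y1, width, height],  # top-left corner plus size
        "area": width * height,
        "iscrowd": 0,
    }


ann = make_coco_annotation(1, 1, 4, (50, 25, 150, 75))
print(ann["bbox"], ann["area"])  # [50, 25, 100, 50] 5000
print(json.dumps(ann))           # plain Python ints serialize without issues
```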


# Convert DeepLesion dataset to YOLOv5 format

In [None]:
deeplesion_metadata_preprocessed = load_metadata(deeplesion_preprocessed_metadata_path)

# Source directories
image_dir = deeplesion_preprocessed_image_path
label_dir = deeplesion_data_dir / "labels_unsorted"

# Create .txt files
label_dir.mkdir(parents=True, exist_ok=True)

# Create target directory
target_dir = deeplesion_data_dir / "deeplesion_yolo"
if target_dir.exists() and target_dir.is_dir():
    shutil.rmtree(target_dir)

for idx, row in deeplesion_metadata_preprocessed.iterrows():
    # Extract only images with annotated lesions (Val + Test)
    if row["Train_Val_Test"] == 1:
        continue

    file_name = row["File_name"]
    image_path = os.path.join(str(image_dir), file_name)
    label_path = os.path.join(str(label_dir), file_name.replace(".png", ".txt"))

    if not os.path.exists(image_path):
        continue

    lesion_type = row["Coarse_lesion_type"] - 1 # YOLOv5 requires class IDs starting at 0
    bbox_str = row["Bounding_boxes"]
    size_str = row["Image_size"]

    # Extract ground truth bounding box' coordinates
    bbox_coords = [float(val) for val in bbox_str.split(",")]
    x1, y1, x2, y2 = [round(c) for c in bbox_coords]

    # Extract sizes
    image_sizes = [int(val) for val in size_str.split(",")]
    width, height = [size for size in image_sizes]

    bbox_width = x2 - x1
    bbox_height = y2 - y1
    x_center = x1 + bbox_width / 2
    y_center = y1 + bbox_height / 2

    # Normalize
    x_center /= width
    y_center /= height
    bbox_width /= width
    bbox_height /= height

    with open(label_path, 'a') as f:
        f.write(f"{lesion_type} {x_center:.6f} {y_center:.6f} {bbox_width:.6f} {bbox_height:.6f}\n")

# -----------------------------------------------------------------------------

splits = ["train", "val", "test"]
for split in splits:
    (target_dir / "images" / split).mkdir(parents=True, exist_ok=True)
    (target_dir / "labels" / split).mkdir(parents=True, exist_ok=True)

# Collect all images with annotated lesions
annotated_images = [img for img in image_dir.glob("*.png") if (label_dir / (img.stem + ".txt")).exists()]
random.seed(42) # TODO - Change this seed to create different train/val/test splits (42, 314, 666)
random.shuffle(annotated_images)

# Split into train, val and test sets
n_total = len(annotated_images)
n_train = int(0.7 * n_total)
n_val = int(0.15 * n_total)

train_images = annotated_images[:n_train]
val_images = annotated_images[n_train:n_train + n_val]
test_images = annotated_images[n_train + n_val:]

splits_map = {
    "train": train_images,
    "val": val_images,
    "test": test_images
}

# Copy image files
for split, images in splits_map.items():
    for image_path in images:
        label_path = label_dir / (image_path.stem + ".txt")
        shutil.copy(image_path, target_dir / "images" / split / image_path.name)
        shutil.copy(label_path, target_dir / "labels" / split / label_path.name)

print(f"Split done! Total = {n_total}")

# Generate deeplesion.yaml
dataset_root = os.path.abspath(str(target_dir))
deeplesion_yaml = {
    "path": dataset_root,
    "train": os.path.join(dataset_root, "images/train"),
    "val": os.path.join(dataset_root, "images/val"),
    "test": os.path.join(dataset_root, "images/test"),
    "nc": 8,
    "names": [
        "bone",
        "abdomen",
        "mediastinum",
        "liver",
        "lung",
        "kidney",
        "soft_tissue",
        "pelvis"
    ]
}

with open(target_dir / "deeplesion.yaml", "w") as f:
    yaml.dump(deeplesion_yaml, f)

# Remove directory with unsorted labels
if label_dir.exists() and label_dir.is_dir():
    shutil.rmtree(label_dir)
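
The YOLO label format stores each box as a normalized centre point and size. The following sketch shows the conversion used above together with its inverse, which can serve as a quick sanity check on generated labels. Both function names are hypothetical.

```python
def xyxy_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert pixel corner coordinates to normalized (xc, yc, w, h)."""
    box_w, box_h = x2 - x1, y2 - y1
    return (
        (x1 + box_w / 2) / img_w,
        (y1 + box_h / 2) / img_h,
        box_w / img_w,
        box_h / img_h,
    )


def yolo_to_xyxy(xc, yc, w, h, img_w, img_h):
    """Convert normalized (xc, yc, w, h) back to pixel corner coordinates."""
    box_w, box_h = w * img_w, h * img_h
    x1 = xc * img_w - box_w / 2
    y1 = yc * img_h - box_h / 2
    return (x1, y1, x1 + box_w, y1 + box_h)


yolo_box = xyxy_to_yolo(50, 25, 150, 75, 512, 512)
print(yolo_box)                            # (0.1953125, 0.09765625, 0.1953125, 0.09765625)
print(yolo_to_xyxy(*yolo_box, 512, 512))   # (50.0, 25.0, 150.0, 75.0)
```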


# Convert DeepLesion dataset to Faster R-CNN format

In [None]:
deeplesion_metadata_preprocessed = load_metadata(deeplesion_preprocessed_metadata_path)

# Source directories
image_dir = deeplesion_preprocessed_image_path
label_dir = deeplesion_data_dir / "labels_unsorted"

# Create .txt files
label_dir.mkdir(parents=True, exist_ok=True)

# Create target directory
target_dir = deeplesion_data_dir / "deeplesion_fasterrcnn_split_X"
if target_dir.exists() and target_dir.is_dir():
    shutil.rmtree(target_dir)

for idx, row in deeplesion_metadata_preprocessed.iterrows():
    # Extract only images with annotated lesions (Val + Test)
    if row["Train_Val_Test"] == 1:
        continue

    file_name = row["File_name"]
    image_path = os.path.join(str(image_dir), file_name)
    label_path = os.path.join(str(label_dir), file_name.replace(".png", ".txt"))

    if not os.path.exists(image_path):
        continue

    lesion_type = row["Coarse_lesion_type"] # Faster R-CNN uses the class 0 implicitly as the background class (no need for subtraction)
    bbox_str = row["Bounding_boxes"]
    size_str = row["Image_size"]

    # Extract ground truth bounding box' coordinates
    bbox_coords = [float(val) for val in bbox_str.split(",")]
    x1, y1, x2, y2 = [round(c) for c in bbox_coords]

    with open(label_path, 'a') as f:
        f.write(f"{lesion_type} {x1} {y1} {x2} {y2}\n")

# -----------------------------------------------------------------------------

splits = ["train", "val", "test"]
for split in splits:
    (target_dir / "images" / split).mkdir(parents=True, exist_ok=True)
    (target_dir / "labels" / split).mkdir(parents=True, exist_ok=True)

# Collect all images with annotated lesions
annotated_images = [img for img in image_dir.glob("*.png") if (label_dir / (img.stem + ".txt")).exists()]
random.seed(42) # TODO - Change this seed to create different train/val/test splits (42, 314, 666)
random.shuffle(annotated_images)

# Split into train, val and test sets
n_total = len(annotated_images)
n_train = int(0.7 * n_total)
n_val = int(0.15 * n_total)

train_images = annotated_images[:n_train]
val_images = annotated_images[n_train:n_train + n_val]
test_images = annotated_images[n_train + n_val:]

splits_map = {
    "train": train_images,
    "val": val_images,
    "test": test_images
}

# Copy image files
for split, images in splits_map.items():
    for image_path in images:
        label_path = label_dir / (image_path.stem + ".txt")
        shutil.copy(image_path, target_dir / "images" / split / image_path.name)
        shutil.copy(label_path, target_dir / "labels" / split / label_path.name)

print(f"Split done! Total = {n_total}")

# Remove directory with unsorted labels
if label_dir.exists() and label_dir.is_dir():
    shutil.rmtree(label_dir)
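
The 70/15/15 split logic is repeated for each target format, so it could be factored into one function. The sketch below is one possible shape for it, not the notebook's actual code: `split_items` is a hypothetical name, and it sorts the input and uses a local `random.Random(seed)` so that the result depends only on the seed and not on the file-system listing order.

```python
import random


def split_items(items, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle items reproducibly and split them into train/val/test lists."""
    shuffled = sorted(items)               # fixed starting order
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    n_train = int(train_frac * len(shuffled))
    n_val = int(val_frac * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


names = [f"img_{i:03d}.png" for i in range(100)]
splits = split_items(names, seed=42)
print({name: len(files) for name, files in splits.items()})
# {'train': 70, 'val': 15, 'test': 15}
```

Calling it with the seeds 42, 314 and 666 would produce the three different splits mentioned in the TODO comment.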


# YOLOv5

### Download pretrained YOLOv5 model

In [None]:
!git clone https://github.com/ultralytics/yolov5
%cd yolov5
!pip install -r requirements.txt
%cd ..

### Train the YOLOv5 model on the DeepLesion dataset

In [None]:
!python yolov5/train.py --img 512 --batch 8 --epochs 100 --data data/deeplesion/deeplesion_yolo/deeplesion.yaml --weights yolov5s.pt --name deeplesion_yolov5

### Evaluate the YOLOv5 model

In [None]:
!python yolov5/val.py --data data/deeplesion/deeplesion_yolo/deeplesion.yaml --weights yolov5/runs/train/deeplesion_yolov5/weights/best.pt --img 512 --task test

# Faster R-CNN

### Load Pre-trained Faster R-CNN Model

In [5]:
# Number of classes (no. dataset classes + 1 for background)
num_classes = 8 + 1
class_names = ["bone", "abdomen", "mediastinum", "liver", "lung", "kidney", "soft_tissue", "pelvis"]

# Set up the available device
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

def construct_fasterrcnn_model():
    # 1) Load COCO-pretrained Faster R-CNN
    weights = FasterRCNN_ResNet50_FPN_V2_Weights.COCO_V1
    model = fasterrcnn_resnet50_fpn_v2(weights=weights, min_size=512, max_size=512)

    # 2) Replace the detection head to match the DeepLesion's number of classes
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # 3) Patch the first conv layer to accept 1-channel input
    #    (model.backbone.body is the ResNet-50 backbone)
    old_conv = model.backbone.body.conv1 # shape: [out_c, 3, k, k] ( == [out_channels, in_channels, kernel_height, kernel_width])

    new_conv = nn.Conv2d(
        in_channels=1,
        out_channels=old_conv.out_channels,
        kernel_size=old_conv.kernel_size,
        stride=old_conv.stride,
        padding=old_conv.padding,
        bias=False,
    )

    # Initialize 1-channel conv layer using pretrained RGB weights
    with torch.no_grad():
        # Option A (kept for reference, not used): simple average over RGB
        # new_conv.weight[:] = old_conv.weight.mean(dim=1, keepdim=True)

        # Option B (used): luminance-weighted sum to mimic grayscale
        r = old_conv.weight[:, 0:1, :, :]
        g = old_conv.weight[:, 1:2, :, :]
        b = old_conv.weight[:, 2:3, :, :]
        new_conv.weight[:] = 0.2989 * r + 0.5870 * g + 0.1140 * b

    model.backbone.body.conv1 = new_conv

    # 4) Adjust the model's internal normalization to 1 channel
    # If the loader returns tensors in [0, 1], mean = 0.5 and std = 0.5 rescale them to [-1, 1].
    # [!] If the inputs are already normalized to [0, 1] and no extra normalization is wanted, use mean = [0.0] and std = [1.0].
    model.transform.image_mean = [0.5]
    model.transform.image_std = [0.5]
    # These lists usually hold 3 values, one per RGB channel.
    # With a single input channel, each list needs only one value.

    # Move model to GPU if available
    model.to(device)

    # Sanity check
    # print(f"First conv layer shape: {model.backbone.body.conv1.weight.shape}") # Should be [64, 1, 7, 7]

    return model
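
To compare the options for collapsing the pretrained RGB kernel into one channel, the toy calculation below looks at a single kernel position, where a convolution reduces to a dot product between channel weights and channel values. The weights are made up, and this is plain arithmetic rather than the model code above. It shows that summing the RGB weights reproduces the response the original kernel would give to a grayscale image replicated over three channels, while averaging gives one third of it and the luminance weighting gives a differently weighted value.

```python
def response(kernel, pixel):
    """Dot product of per-channel weights and per-channel pixel values."""
    return sum(w * p for w, p in zip(kernel, pixel))


rgb_kernel = [0.2, -0.5, 0.9]  # hypothetical R, G, B weights at one kernel position
gray = 0.6                     # one grayscale pixel value in [0, 1]

# Response of the original 3-channel kernel to the replicated gray pixel
reference = response(rgb_kernel, [gray] * 3)

sum_w = sum(rgb_kernel)
mean_w = sum_w / 3
lum_w = 0.2989 * rgb_kernel[0] + 0.5870 * rgb_kernel[1] + 0.1140 * rgb_kernel[2]

for name, w in [("sum", sum_w), ("mean", mean_w), ("luminance", lum_w)]:
    print(f"{name:>9}: {w * gray:+.4f}  (3-channel reference: {reference:+.4f})")
```

Which initialization works best after fine-tuning is an empirical question; the sketch only illustrates how the starting activations differ in scale.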

### Prepare DeepLesion dataset for Faster R-CNN model

In [6]:
# Custom Dataset class for DeepLesion dataset
class DeepLesionDataset(Dataset):
    def __init__(self, root, split):
        # Initialize dataset path, split and transformations
        self.root = root
        self.split = split
        self.transforms = T.Compose([
            T.ToTensor(), # Converts [0, 255] uint8 values to float [0.0, 1.0], and preserves 1 channel
        ])

        # Dataset logic (image paths, annotations, etc.)
        self.image_dir = os.path.join(root, "images", split)
        self.label_dir = os.path.join(root, "labels", split)
        self.image_names = sorted([img for img in os.listdir(self.image_dir) if img.endswith(".png") or img.endswith(".jpg")])

    def __getitem__(self, idx):
        image_name = self.image_names[idx]
        image_path = os.path.join(self.image_dir, image_name)
        label_path = os.path.join(self.label_dir, os.path.splitext(image_name)[0] + ".txt")

        image = Image.open(image_path).convert("L")

        # Load corresponding bounding boxes and labels
        boxes, labels = [], []
        if os.path.exists(label_path):
            with open(label_path) as f:
                for line in f:
                    if not line.strip():
                        continue  # Skip blank lines
                    cls, x_min, y_min, x_max, y_max = map(float, line.split())
                    boxes.append([x_min, y_min, x_max, y_max])
                    labels.append(int(cls))

        # Create a target dictionary (reshape keeps the [N, 4] box shape even when N == 0)
        target = {
            "boxes": torch.tensor(boxes, dtype=torch.float32).reshape(-1, 4),
            "labels": torch.tensor(labels, dtype=torch.int64)
        }
        
        # Apply transforms
        if self.transforms:
            image = self.transforms(image)
        
        return image, target
    
    def __len__(self):
        return len(self.image_names)
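
The label files read by `DeepLesionDataset` contain one `class x_min y_min x_max y_max` line per lesion. The parsing step can be tried in isolation with the sketch below. `parse_label_lines` is a hypothetical helper, and the sample lines are made up.

```python
def parse_label_lines(lines):
    """Parse 'cls x_min y_min x_max y_max' lines into boxes and labels.

    Blank or malformed lines (anything without exactly 5 fields) are skipped.
    """
    boxes, labels = [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 5:
            continue
        cls, x_min, y_min, x_max, y_max = map(float, parts)
        boxes.append([x_min, y_min, x_max, y_max])
        labels.append(int(cls))
    return boxes, labels


sample = ["4 50 25 150 75\n", "\n", "5 200 210 260 300\n"]
boxes, labels = parse_label_lines(sample)
print(boxes)   # [[50.0, 25.0, 150.0, 75.0], [200.0, 210.0, 260.0, 300.0]]
print(labels)  # [4, 5]
```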
    

### Prepare DataLoader objects

In [None]:
"""
- Shuffling is enabled for the training DataLoader, because SGD benefits from seeing data in a new random order every epoch.
  During the validation and testing phases this is not needed, since the order does not affect the metrics.

- num_workers is the number of background processes that load & transform batches in parallel. A good rule of thumb is 2-4 workers.

- pin_memory, or pinned (page-locked) host memory, speeds up host to GPU copies and lets us use asynchronous transfers.
  It should be set to True if we train on GPU. It usually gives a small but real throughput bump. It consumes a bit more system RAM
  and is useless on CPU-only runs.

- Detection models expect lists of images and lists of target dicts, because each image can have a different size and a different
  number of boxes. The default PyTorch collate tries to stack everything into tensors of the same shape, which breaks for 
  variable-length targets. The custom collate_fn function here unzips the list of pairs into a pair of lists so Faster R-CNN can consume them:
    images: List[Tensor[C,H,W]]
    targets: List[Dict{'boxes': Tensor[N,4], 'labels': Tensor[N]}]
  That is exactly what torchvision's detection references use.

"""

# =============================================================|
# Set up the path and batch size for Google Colab
# =============================================================|
deeplesion_fasterrcnn_path = "deeplesion_fasterrcnn_split_1"
train_batch_size = 1      # Set to 4
test_val_batch_size = 1   # Set to 2
# =============================================================|

train_ds = DeepLesionDataset(deeplesion_data_dir / deeplesion_fasterrcnn_path, "train")
val_ds = DeepLesionDataset(deeplesion_data_dir / deeplesion_fasterrcnn_path, "val")
test_ds = DeepLesionDataset(deeplesion_data_dir / deeplesion_fasterrcnn_path, "test")

def collate_fn(batch):
    # batch: [(img1, target1), (img2, target2), ...]
    # returns: ([img1, img2, ...], [target1, target2, ...])
    return tuple(zip(*batch))

train_loader = DataLoader(train_ds, batch_size=train_batch_size, shuffle=True, collate_fn=collate_fn, num_workers=2, pin_memory=True)
val_loader = DataLoader(val_ds, batch_size=test_val_batch_size, shuffle=False, collate_fn=collate_fn, num_workers=2, pin_memory=True)
test_loader = DataLoader(test_ds, batch_size=test_val_batch_size, shuffle=False, collate_fn=collate_fn, num_workers=2, pin_memory=True)
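
The effect of the custom `collate_fn` can be seen without any tensors. In the sketch below, strings stand in for image tensors and small dicts stand in for targets; the data is made up for illustration.

```python
def collate_fn(batch):
    """Turn a list of (image, target) pairs into (images, targets) tuples."""
    return tuple(zip(*batch))


# Stand-ins for (image tensor, target dict) pairs with differing box counts
batch = [
    ("img1", {"boxes": [[0, 0, 5, 5]]}),
    ("img2", {"boxes": []}),
]
images, targets = collate_fn(batch)
print(images)   # ('img1', 'img2')
print(targets)  # ({'boxes': [[0, 0, 5, 5]]}, {'boxes': []})
```

Nothing is stacked, so targets with different numbers of boxes can sit in the same batch.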

### Evaluation Functions 

In [None]:
@torch.no_grad()
def evaluate_detector(model, loader, device, num_classes=8, class_names=None, pr_conf_thr=0.25, pr_iou_thr=0.5):
    """
        Returns a dictionary:
        {
            "mAP50": float,
            "mAP50_95": float,
            "per_class": [{"idx": ..., "name": ..., "AP": ..., "AP50": ...}, ...],
            "precision_overall": float,
            "recall_overall": float,
            "precision_per_class": [...],
            "recall_per_class": [...],
            "pr_conf_thr": float,
            "pr_iou_thr": float,
        }
    """

    model.eval()
    metric = MeanAveragePrecision(iou_type="bbox", box_format="xyxy", class_metrics=True)

    # For precision and recall at fixed thresholds (IoU=0.5, conf=pr_conf_thr)
    TP = torch.zeros(num_classes, dtype=torch.long)
    FP = torch.zeros(num_classes, dtype=torch.long)
    FN = torch.zeros(num_classes, dtype=torch.long)

    for batch_idx, (images, targets) in enumerate(loader, start=1):
        images = [image.to(device) for image in images]
        outputs = model(images)

        # Move to CPU for metrics
        predictions, ground_truths = [], []
        for output, target in zip(outputs, targets):
            predictions.append({"boxes": output["boxes"].cpu(),
                                "scores": output["scores"].cpu(),
                                "labels": output["labels"].cpu()})
            ground_truths.append({"boxes": target["boxes"].cpu(),
                                  "labels": target["labels"].cpu()})
            
        # mAP update
        metric.update(predictions, ground_truths)

        # Precision and recall accumulation at fixed thresholds
        for output, target in zip(predictions, ground_truths):
            # Filter predictions by confidence
            keep = output["scores"] >= pr_conf_thr
            pred_boxes = output["boxes"][keep]
            pred_labels = output["labels"][keep]
            gt_boxes = target["boxes"]
            gt_labels = target["labels"]

            matched = torch.zeros(len(gt_boxes), dtype=torch.bool)
            if len(pred_boxes) and len(gt_boxes):
                ious = ops.box_iou(pred_boxes, gt_boxes)
                for pred_idx in range(len(pred_boxes)):
                    cls = int(pred_labels[pred_idx].item()) # classes are 1...K
                    # candidates: same class
                    same = (gt_labels == cls)
                    if same.any():
                        ious_c = ious[pred_idx, same]
                        if len(ious_c):
                            gt_idxs = torch.where(same)[0]
                            best_iou, best_loc = ious_c.max(0)
                            gt_idx = gt_idxs[best_loc]
                            if best_iou >= pr_iou_thr and not matched[gt_idx]:
                                TP[cls - 1] += 1
                                matched[gt_idx] = True
                            else:
                                FP[cls - 1] += 1
                        else:
                            FP[cls - 1] += 1
                    else:
                        FP[cls - 1] += 1
            
            # Any unmatched ground truths are FN
            for gt_idx, gt_label in enumerate(gt_labels):
                if not matched[gt_idx]:
                    FN[int(gt_label.item()) - 1] += 1

        # Progress (number of processed batches)
        print(f"Batches: [{batch_idx}/{len(loader)}]")

    # mAP metrics (computed once, after all batches have been accumulated)
    res = metric.compute()
    out = {
        "mAP50": float(res["map_50"]),
        "mAP50_95": float(res["map"]),
    }

    # Per-class AP (if available; a missing per-class AP50 key falls back to None)
    per_class = []
    map_per_class = res.get("map_per_class", None)
    map50_per_class = res.get("map_50_per_class", None)
    if map_per_class is not None:
        ap = map_per_class.tolist()
        ap50 = map50_per_class.tolist() if map50_per_class is not None else [None] * len(ap)
        for i in range(len(ap)):
            name = class_names[i] if class_names and i < len(class_names) else f"class_{i + 1}"
            per_class.append({"idx": i + 1, "name": name, "AP": ap[i], "AP50": ap50[i]})
    out["per_class"] = per_class

    # Precision/Recall at fixed thresholds
    precision_per_class = (TP.float() / (TP + FP).clamp(min=1)).tolist()
    recall_per_class = (TP.float() / (TP + FN).clamp(min=1)).tolist()
    overall_precision = float(TP.sum() / (TP.sum() + FP.sum()).clamp(min=1))
    overall_recall = float(TP.sum() / (TP.sum() + FN.sum()).clamp(min=1))

    out["precision_overall"] = overall_precision
    out["recall_overall"] = overall_recall
    out["precision_per_class"] = precision_per_class
    out["recall_per_class"] = recall_per_class
    out["pr_conf_thr"] = pr_conf_thr
    out["pr_iou_thr"] = pr_iou_thr

    return out

# =================================================================================================================================================
# =================================================================================================================================================

def _count_instances_per_class(dataset, num_classes):
    """
    Function that counts the ground truth instances per class by reading the label .txt files on disk.
    """

    counts = [0] * num_classes
    total = 0

    for img_name in dataset.image_names:
        label_path = os.path.join(dataset.label_dir, Path(img_name).stem + ".txt")
        if not os.path.exists(label_path):
            continue
        with open(label_path, "r") as file:
            for line in file:
                parts = line.strip().split()
                if len(parts) != 5: # Ill-written label text file
                    continue
                cls = int(float(parts[0]))
                # Faster R-CNN labels are 1...K (background is implicit), we map to 0...K-1 index
                if 1 <= cls <= num_classes:
                    counts[cls - 1] += 1
                    total += 1
    return counts, total

def _to_float(x, default=float("nan")):
    if x is None:
        return default
    if isinstance(x, torch.Tensor):
        if x.numel() == 0:
            return default
        x = x.detach().cpu().item() if x.ndim == 0 else x.detach().cpu().numpy()
    if isinstance(x, np.ndarray):
        return float(x.item()) if x.size == 1 else default
    try:
        return float(x)
    except Exception:
        return default
    
def _to_int(x, default=0):
    try:
        return int(x)
    except Exception:
        return default

def print_result_report(metrics, loader, class_names):
    """
    Function that prints pretty report with evaluation metrics.
    Uses dataset files to compute number of images and instances.
    """

    num_classes = len(class_names)
    images      = _to_int(len(loader.dataset))
    per_class   = metrics.get("per_class", [])
    p_overall   = _to_float(metrics["precision_overall"])
    r_overall   = _to_float(metrics["recall_overall"])
    map50       = _to_float(metrics["mAP50"])
    map50_95    = _to_float(metrics["mAP50_95"])

    # Count instances per class from labels
    counts, total_instances = _count_instances_per_class(loader.dataset, num_classes)

    # Build quick dicts for per-class AP50/AP
    ap50_by_name = {d['name']: d['AP50'] for d in per_class}
    ap_by_name = {d['name']: d['AP'] for d in per_class}

    # Header
    print(f"{'Class':>18} {'Images':>8} {'Instances':>10} {'P':>10} {'R':>10} {'mAP50':>10} {'mAP50_95':>10}")

    # Overall row ("all")
    print(f"{'all':>18} {images:8d} {_to_int(total_instances):10d} {p_overall:10.3f} {r_overall:10.3f} {map50:10.3f} {map50_95:10.3f}")

    # Per-class rows
    p_pc = metrics.get("precision_per_class", [])
    r_pc = metrics.get("recall_per_class", [])

    for i, name in enumerate(class_names):
        P_i = _to_float(p_pc[i] if i < len(p_pc) else float("nan"))
        R_i = _to_float(r_pc[i] if i < len(r_pc) else float("nan"))
        AP50_i = _to_float(ap50_by_name.get(name, float("nan")))
        AP_i = _to_float(ap_by_name.get(name, float("nan")))
        inst_i = _to_int(counts[i])
        print(f"{name:>18} {images:8d} {inst_i:10d} {P_i:10.3f} {R_i:10.3f} {AP50_i:10.3f} {AP_i:10.3f}")
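
The instance counting above expects label lines with five whitespace-separated fields, with the class id first. Below is a minimal sketch of that parsing rule applied to in-memory lines instead of files. The helper name `count_instances_from_lines` and the example lines are hypothetical and not part of the notebook. Unlike the original loop, the sketch also skips lines whose class field is not a finite number.

```python
from typing import Iterable, List, Tuple


def count_instances_from_lines(lines: Iterable[str], num_classes: int) -> Tuple[List[int], int]:
    """Count instances per class from lines shaped like '<cls> <v1> <v2> <v3> <v4>'."""
    counts = [0] * num_classes
    total = 0
    for line in lines:
        parts = line.strip().split()
        if len(parts) != 5:  # skip malformed lines
            continue
        try:
            cls = int(float(parts[0]))  # tolerate class ids written as "3.0"
        except (ValueError, OverflowError):
            continue  # class field is not a finite number
        if 1 <= cls <= num_classes:  # labels are 1...K, stored at index 0...K-1
            counts[cls - 1] += 1
            total += 1
    return counts, total


# Made-up label lines, for illustration only
example_lines = [
    "1 0.50 0.50 0.10 0.10",
    "3.0 0.25 0.40 0.05 0.08",
    "3 0.70 0.20 0.12 0.09",
    "broken line",          # wrong number of fields, skipped
    "9 0.10 0.10 0.10 0.10",  # class id outside 1...7, skipped
]
counts, total = count_instances_from_lines(example_lines, num_classes=7)
print(counts, total)
```

Only the first three lines are counted, so class 1 gets one instance and class 3 gets two.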


### Training Loop 
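
The training cell in this section decays the learning rate with StepLR, using `step_size=3` and `gamma=0.1` and stepping the scheduler once at the end of every epoch. The sketch below reproduces that schedule with plain arithmetic so the effective rate per epoch is easy to inspect. The helper `step_lr_schedule` is hypothetical; it mirrors the StepLR rule but does not call PyTorch.

```python
from typing import List


def step_lr_schedule(base_lr: float, step_size: int, gamma: float, num_epochs: int) -> List[float]:
    """Learning rate in effect during each epoch (1-based), with one scheduler step per epoch."""
    # Before epoch e starts, the scheduler has been stepped (e - 1) times,
    # so the rate has been multiplied by gamma exactly (e - 1) // step_size times.
    return [base_lr * gamma ** ((epoch - 1) // step_size) for epoch in range(1, num_epochs + 1)]


schedule = step_lr_schedule(base_lr=0.01, step_size=3, gamma=0.1, num_epochs=20)
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch:2d}: lr = {lr:.0e}")
```

With these settings epochs 1 to 3 run at 0.01, epochs 4 to 6 at 0.001, and by epoch 19 the rate has been multiplied by 0.1 six times. If training regularly reaches the later epochs, a larger `step_size` may be worth considering.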

In [None]:
"""

Notes on the training setup:

- Faster R-CNN is commonly trained with SGD (with momentum) or Adam.
- Hyperparameters:
    - momentum:
        Adds an exponentially decaying average of past gradients to the current step, which smooths the updates,
        reduces zig-zagging and speeds up convergence. Typically set to 0.9 and rarely needs tuning.
    - weight_decay (L2 regularization):
        Penalizes large weights to reduce overfitting (shrinks the parameters at each step).
        Typical values for detection with SGD: 5e-4 or 1e-4.
    - step_size (in StepLR):
        The LR scheduler applies a decay every step_size epochs.
    - gamma (in StepLR):
        Multiplicative LR factor at each decay: new_lr = old_lr * gamma. Commonly set to 0.1.

- Tune only the following hyperparameters (grid search scored on the validation set):
    - LR: [0.01, 0.005, 0.002]
    - weight_decay: [5e-4, 1e-4]
    - Epochs: do not search over it; set a generous cap (e.g. 20) and rely on early stopping.

- AMP (Automatic Mixed Precision, usually exposed as a use_amp flag) runs many ops in float16 instead of float32,
  which uses less GPU memory and is often faster. train_one_config below does not enable it and trains in full precision.

- autocast() is a context manager that automatically picks a safe dtype per op (numerically sensitive ops stay in float32,
  the others run in float16). This saves memory and computation.

- GradScaler multiplies the loss by a large scale factor before backprop to avoid float16 gradient underflow, then unscales
  the gradients before the optimizer step. This keeps the gradients stable in half precision.

- optimizer.zero_grad(set_to_none=True) sets param.grad to None for each parameter instead of filling it with zeros
  (no tensor is kept). On the next backward() PyTorch allocates a fresh grad tensor and writes into it. This reduces
  memory traffic by avoiding writing zeros over large grad buffers on every step, and can lower the memory footprint
  because unused grads are freed and reallocated only when needed.

"""

def train_one_config(
    train_loader, val_loader, device,
    learning_rate, weight_decay, momentum=0.9,
    max_epochs=20, patience=5, metric_key="mAP50_95",
    gamma=0.1, step_size=3
):
    # Construct the model
    model = construct_fasterrcnn_model()

    # Set up optimizer
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=learning_rate, momentum=momentum, weight_decay=weight_decay)

    # Learning rate scheduler
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=gamma)

    best_metric = -float("inf")
    best_epoch = -1
    best_state = None
    epochs_no_improve = 0
    history = []

    # Train model
    for epoch in range(1, max_epochs + 1):
        print(f"*** Epoch [{epoch}/{max_epochs}] started ***")
        model.train()
        running_loss = 0.0
        n_processed_images = 0

        # Training loop
        for batch_idx, (images, targets) in enumerate(train_loader, start=1):
            images = list(image.to(device) for image in images)
            targets = [{key: val.to(device) for key, val in target.items()} for target in targets]

            optimizer.zero_grad(set_to_none=True)
            loss_dict = model(images, targets)
            loss = sum(loss_dict.values())

            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            n_processed_images += len(images)

            if batch_idx % 10 == 0 or batch_idx == len(train_loader):
                avg_loss_so_far = running_loss / batch_idx
                print(f"Epoch: [{epoch}/{max_epochs}], "
                      f"Images: [{n_processed_images}/{len(train_loader.dataset)}], "
                      f"Loss: {avg_loss_so_far:.4f}")
            
        lr_scheduler.step()
        train_loss = running_loss / max(1, len(train_loader))
        print(f"*** Epoch [{epoch}/{max_epochs}] finished -> Loss: {train_loss:.4f} ***")

        # Validation
        print("*** Validation started ***")
        val_metrics = evaluate_detector(model, val_loader, device, num_classes, class_names)
        val_score = float(val_metrics[metric_key])

        history.append({"epoch": epoch, "train_loss": train_loss, **val_metrics})
        print(f"Loss={train_loss:.4f}, mAP50={val_metrics['mAP50']:.4f}, mAP50_95={val_metrics['mAP50_95']:.4f}")
        print("*** Validation finished ***")

        # Early stopping check
        if val_score > best_metric + 1e-6:
            best_metric = val_score
            best_epoch = epoch
            best_state = copy.deepcopy(model.state_dict())
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1
            if epochs_no_improve >= patience:
                print(f"*** Early stopping after {epoch}/{max_epochs} epochs (best at {best_epoch} with {metric_key}={best_metric:.4f}). ***")
                break

    return {
        "best_metric": best_metric,
        "best_epoch": best_epoch,
        "best_state": best_state,
        "history": history,
        "lr": learning_rate,
        "weight_decay": weight_decay
    }

# =================================================================================================================================================
# =================================================================================================================================================


# Set up hyperparameters
learning_rates = [0.01, 0.005, 0.002]
weight_decays = [1e-4, 5e-4]
max_epochs = 20
patience = 5
best_result = None

print("*** Training started ***")
for learning_rate in learning_rates:
    for weight_decay in weight_decays:
        result = train_one_config(
            train_loader=train_loader, val_loader=val_loader, device=device,
            learning_rate=learning_rate, weight_decay=weight_decay, max_epochs=max_epochs, patience=patience,
            metric_key="mAP50_95"
        )

        if best_result is None or result["best_metric"] > best_result["best_metric"]:
            best_result = result

print("*** Training complete ***")
print(f"Best config: lr={best_result['lr']} wd={best_result['weight_decay']} "
      f"epoch={best_result['best_epoch']} mAP50-95={best_result['best_metric']:.4f}")

# Save the best model checkpoint
best_checkpoint = {
    "state_dict": best_result["best_state"],
    "epoch": best_result["best_epoch"],
    "metric_key": "mAP50_95",
    "metric_value": best_result["best_metric"],
    "hp": {
        "lr": best_result["lr"],
        "weight_decay": best_result["weight_decay"],
        "momentum": 0.9,
        "step_size": 3,
        "gamma": 0.1,
        "max_epochs": max_epochs,
        "patience": patience,
    },
    # Describe how to reconstruct the model
    "model_spec": {
        "arch": "fasterrcnn_resnet50_fpn_v2",
        "min_size": 512,
        "max_size": 512,
        "in_channels": 1,
        "num_classes": num_classes,
        "image_mean": [0.5],
        "image_std": [0.5],
    },
    "class_names": class_names,
    "versions": {"torch": torch.__version__, "torchvision": torchvision.__version__},
}

os.makedirs("checkpoints", exist_ok=True)
ts = datetime.now().strftime("%Y%m%d-%H%M%S")
save_path = f"checkpoints/fasterrcnn_best_{ts}.pt"
torch.save(best_checkpoint, save_path)
print(f"Saved best checkpoint to {save_path}")

*** Training started ***
*** Epoch [1/20] started ***
*** Epoch [1/20] finished -> Loss: 0.0005 ***
*** Validation started ***
Images: [1/1443]
Loss=0.0005, mAP50=0.0000, mAP50_95=0.0000
*** Validation finished ***
*** Training complete ***
Best config: lr=0.01 wd=0.0001 epoch=1 mAP50-95=0.0000
Saved best checkpoint to checkpoints/fasterrcnn_best_20251101-154751.pt
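
The early stopping rule in `train_one_config` keeps the best validation score and stops once `patience` consecutive epochs fail to beat it by more than `1e-6`. Below is a minimal sketch of that rule in isolation, applied to a list of scores rather than a real training run. The helper `best_epoch_with_patience` and the score values are hypothetical.

```python
from typing import Iterable, Tuple


def best_epoch_with_patience(
    scores: Iterable[float], patience: int, min_delta: float = 1e-6
) -> Tuple[int, float, int]:
    """Return (best_epoch, best_score, last_epoch_run) for per-epoch validation scores."""
    best_score = -float("inf")
    best_epoch = -1
    epochs_no_improve = 0
    last_epoch = 0
    for epoch, score in enumerate(scores, start=1):
        last_epoch = epoch
        if score > best_score + min_delta:
            # New best: remember it and reset the patience counter
            best_score, best_epoch = score, epoch
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1
            if epochs_no_improve >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, best_score, last_epoch


# Made-up validation scores, for illustration only
scores = [0.10, 0.18, 0.21, 0.20, 0.21, 0.19, 0.20, 0.25]
print(best_epoch_with_patience(scores, patience=3))
print(best_epoch_with_patience(scores, patience=5))
```

With `patience=3` the run stops at epoch 6 and never reaches the higher score at epoch 8, while `patience=5` reaches it. A tie with the current best (epoch 5 here) counts as no improvement.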


### Final Evaluation

In [17]:
# Load best weights and evaluate on the test set
best_checkpoint = torch.load(save_path, map_location="cpu")
best_model = construct_fasterrcnn_model()
best_model.load_state_dict(best_checkpoint["state_dict"])

# Evaluate the best model
print("*** Evaluation started ***")
test_metrics = evaluate_detector(best_model, test_loader, device, num_classes, class_names)

print("*** Evaluation finished ***")
print(test_metrics)
print_result_report(test_metrics, test_loader, class_names)

*** Evaluation started ***
Images: [1/1445]
*** Evaluation finished ***
{'mAP50': 0.0, 'mAP50_95': 0.0, 'per_class': [{'idx': 1, 'name': 'bone', 'AP': -1.0, 'AP50': None}, {'idx': 2, 'name': 'abdomen', 'AP': 0.0, 'AP50': None}, {'idx': 3, 'name': 'mediastinum', 'AP': -1.0, 'AP50': None}, {'idx': 4, 'name': 'liver', 'AP': -1.0, 'AP50': None}, {'idx': 5, 'name': 'lung', 'AP': -1.0, 'AP50': None}, {'idx': 6, 'name': 'kidney', 'AP': -1.0, 'AP50': None}, {'idx': 7, 'name': 'soft_tissue', 'AP': -1.0, 'AP50': None}], 'precision_overall': 0.0, 'recall_overall': 0.0, 'precision_per_class': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 'recall_per_class': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 'pr_conf_thr': 0.25, 'pr_iou_thr': 0.5}
             Class   Images  Instances          P          R      mAP50   mAP50_95
               all     1445       1477      0.000      0.000      0.000      0.000
              bone     1445         35      0.000      0.000        nan     -1.000
         