YOLOv2 Output Cell Shape Explanation

The output cell shape of the YOLOv2 architecture refers to the structure of the final feature map produced by the network, which is used for object detection. To understand this, let’s break it down step-by-step:

1. **Grid Division**  
YOLOv2 divides the input image into a grid of size S x S. Each cell in this grid is responsible for predicting bounding boxes for objects whose centers fall within that cell. For instance, if the input image size is 416 x 416, a common grid size in YOLOv2 is 13 x 13, meaning each grid cell covers 32 x 32 pixels of the original image.
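
As a rough illustration of the grid assignment described above, the sketch below maps a box center in pixels to the cell responsible for it. The function name `cell_for_center` and its defaults (a 416 x 416 input and a stride of 32) are assumptions chosen to match the example in the text, not part of any library.

```python
def cell_for_center(x_center, y_center, image_size=416, stride=32):
    """Return (row, col) of the grid cell that contains a box center given in pixels."""
    grid_size = image_size // stride  # 416 // 32 = 13 cells per side
    # Clamp so a center lying exactly on the right/bottom border stays inside the grid.
    col = min(int(x_center // stride), grid_size - 1)
    row = min(int(y_center // stride), grid_size - 1)
    return row, col


print(cell_for_center(200.0, 100.0))  # (3, 6)
```

A center at (200, 100) falls in column 6 (pixels 192 to 223) and row 3 (pixels 96 to 127), so that cell is the one expected to predict the object.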


2. **Bounding Box Predictions**  
Each grid cell predicts a fixed number of bounding boxes (typically 5, one per anchor box). For each bounding box, it predicts:

- The coordinates (x, y, w, h): the box center relative to the cell, plus the box width and height, which YOLOv2 predicts as scale factors applied to the anchor box dimensions.
- A confidence score indicating how likely the box is to contain an object.
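
To make the per-box values concrete, here is a minimal sketch of how raw network outputs could be decoded into a box, following the parameterization from the YOLOv2 paper (a sigmoid on the center offsets and confidence, an exponential on the size terms). The names `decode_box`, `t_x`, `t_o`, and so on are illustrative assumptions; the anchor size used in the example is the first anchor printed by the loss further down in this notebook.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t_x, t_y, t_w, t_h, t_o, col, row, anchor_w, anchor_h):
    """Decode one raw prediction into a box in grid-cell units plus a confidence."""
    b_x = col + sigmoid(t_x)        # center x: offset inside the cell, kept in (0, 1)
    b_y = row + sigmoid(t_y)        # center y
    b_w = anchor_w * math.exp(t_w)  # width as a scale factor on the anchor width
    b_h = anchor_h * math.exp(t_h)  # height as a scale factor on the anchor height
    confidence = sigmoid(t_o)
    return b_x, b_y, b_w, b_h, confidence


# All-zero raw outputs give a box centered in the cell with exactly the anchor size.
print(decode_box(0.0, 0.0, 0.0, 0.0, 0.0, col=6, row=3, anchor_w=1.3221, anchor_h=1.7314))
# (6.5, 3.5, 1.3221, 1.7314, 0.5)
```

The result is in grid-cell units; multiplying by the stride (32 in the example above) would give pixel coordinates.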


3. **Class Predictions**  
For each bounding box, YOLOv2 also predicts the probabilities that the object belongs to one of the predefined classes. If there are C classes, then for each bounding box, there are C class scores.


4. **Output Tensor Shape**  
The final output of YOLOv2 has the shape S x S x (B x (5 + C)), where:

- S x S is the grid size (e.g., 13 x 13).
- B is the number of bounding boxes predicted per grid cell (typically 5).
- 5 + C is the number of values per bounding box: 4 for (x, y, w, h), 1 for the confidence score, and C class scores.

Example: for a 13 x 13 grid, 5 bounding boxes per cell, and 20 classes (as in the Pascal VOC dataset), the output shape is 13 x 13 x (5 x (5 + 20)) = 13 x 13 x 125.

Each cell in this final output holds the predictions for all B bounding boxes and their associated class probabilities.
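
The shape arithmetic can be checked with a few lines of Python. The helper `yolo_v2_output_shape` is a hypothetical name introduced here, and it returns the shape in the channels-first order that PyTorch models typically produce (batch, channels, height, width).

```python
def yolo_v2_output_shape(grid_size, num_anchors, num_classes, batch_size=1):
    """Return the channels-first output shape (batch, B * (5 + C), S, S)."""
    values_per_box = 5 + num_classes  # x, y, w, h, confidence + class scores
    channels = num_anchors * values_per_box
    return (batch_size, channels, grid_size, grid_size)


print(yolo_v2_output_shape(13, 5, 20))                # (1, 125, 13, 13)
print(yolo_v2_output_shape(13, 5, 3, batch_size=2))   # (2, 40, 13, 13)
```

The second call uses 3 classes and a batch of 2, which corresponds to the model tested later in this notebook.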

This grid of predictions is then post-processed: boxes with low confidence are discarded by thresholding, and non-maximum suppression (NMS) removes overlapping boxes that describe the same object.
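
A minimal sketch of this post-processing step is shown below, using plain Python lists and greedy, class-agnostic NMS. The function names and the threshold values (0.5 for confidence, 0.45 for IoU) are illustrative assumptions; real pipelines often run NMS per class and use a tensor library for speed.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, conf_threshold=0.5, iou_threshold=0.45):
    """Return indices of the boxes kept after thresholding and greedy NMS."""
    # 1. Drop low-confidence boxes, then visit the rest from highest to lowest score.
    candidates = [i for i, s in enumerate(scores) if s >= conf_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    for i in candidates:
        # 2. Keep a box only if it does not overlap too much with a box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260), (0, 0, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.2]
print(nms(boxes, scores))  # [0, 2]
```

Here box 1 is suppressed because it almost coincides with the higher-scoring box 0, and box 3 is dropped by the confidence threshold.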

In [2]:
from model import model_builder
import torch
import lightnet as ln
from torch.utils.data import Dataset


In [2]:
model = model_builder(num_classes=3)

  state = torch.load(weights_file, 'cpu')
Modules not matching, performing partial update


In [3]:
## Test model shape

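def test(model):
    """Print the output shape for a random batch of two 416 x 416 RGB images."""
    X = torch.randn((2, 3, 416, 416))
    print(model(X).shape)

def test_loss(model, loss_fn):
    """Run one forward pass on random data and print the resulting loss."""
    model.eval()
    X = torch.rand((1, 3, 416, 416))  # batch, channel, height, width
    y = torch.rand((1, 5, 5))         # batch, num_anno, 5 (random placeholder targets)
    print(loss_fn)
    with torch.inference_mode():
        y_pred = model(X)

    print(y.shape)
    print(y_pred.shape)
    loss = loss_fn(y_pred, y)
    print(loss)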



In [4]:
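# The model returns channels-first output: S x S x (B x (5 + C)) -> (BATCH_SIZE, B*(5+C), S, S).
# With B = 5 anchors and C = 3 classes, that is 5 * (5 + 3) = 40 channels on a 13 x 13 grid.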
test(model)

torch.Size([2, 40, 13, 13])
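
To see how the 40 channels break down, the sketch below lists which channel ranges would belong to each anchor. It assumes an anchor-major layout, meaning that each anchor owns a contiguous block of 5 + C channels ordered as x, y, w, h, confidence, class scores. That ordering is an assumption here and should be checked against the loss implementation in use; `channel_slices` is a hypothetical helper name.

```python
def channel_slices(num_anchors, num_classes):
    """Describe the channel ranges per anchor, assuming an anchor-major layout."""
    values_per_box = 5 + num_classes
    layout = []
    for anchor in range(num_anchors):
        start = anchor * values_per_box
        layout.append({
            "coords": (start, start + 4),                    # half-open range: x, y, w, h
            "confidence": start + 4,                         # single channel
            "classes": (start + 5, start + values_per_box),  # half-open range: C scores
        })
    return layout


layout = channel_slices(num_anchors=5, num_classes=3)
print(layout[0])  # {'coords': (0, 4), 'confidence': 4, 'classes': (5, 8)}
print(layout[4])  # {'coords': (32, 36), 'confidence': 36, 'classes': (37, 40)}
```

The last anchor's class range ends at channel 40, which matches the channel count printed above.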


In [5]:
## Test framework loss

loss_fn = ln.network.loss.RegionLoss(
    num_classes=model.num_classes,
    anchors=model.anchors,
    network_stride=model.stride
)


In [7]:
model = ln.models.YoloV2(1)

# Create accompanying loss (minimal required arguments for it to work with our defined Yolo network)
loss = ln.network.loss.RegionLoss(
    num_classes=model.num_classes,
    anchors=model.anchors,
    network_stride=model.stride
)
print(loss)

# Use loss
input_tensor = torch.rand(1, 3, 416, 416)   # batch, channel, height, width
target_tensor = torch.rand(1, 2, 5)         # batch, num_anno, 5 (see RegionLoss docs)

output_tensor = model(input_tensor)
loss_value = loss(output_tensor, target_tensor)
loss_value.backward()

# Print loss
print(loss.values)

RegionLoss(
  classes=1, network_stride=32, IoU threshold=0.6, seen=0
  coord_scale=1.0, object_scale=5.0, noobject_scale=1.0, class_scale=1.0
  anchors=[1.3221, 1.7314] [3.1927, 4.0094] [5.0559, 8.0989] [9.4711, 4.8405] [11.236, 10.007]
)
{'total': tensor(120.5731), 'conf': tensor(113.4837), 'coord': tensor(7.0893), 'class': tensor(0.)}


## Custom Dataset 

In [23]:
from pathlib import Path

root_dir = Path("data")
labels_dir = root_dir / "labels"
img_dir = root_dir / "data_object_image_2/training/image_2"

In [15]:
len(list((labels_dir).glob("*.txt")))

7481

In [6]:
labels_dir

PosixPath('data/labels')

In [46]:
from pathlib import Path
import torch
from torch.utils.data import Dataset
from PIL import Image
import torchvision.transforms as T  # available for building the optional `transform`

class YoloDarknetDataset(Dataset):
    def __init__(self, images_dir, labels_dir, classes, transform=None):
        """
        Args:
            images_dir (str or Path): Path to the directory containing images.
            labels_dir (str or Path): Path to the directory containing labels.
            classes (list): List of class names.
            transform (callable, optional): Transform to be applied on an image.
        """
        self.images_dir = Path(images_dir)
        self.labels_dir = Path(labels_dir)
        self.transform = transform
        self.classes = classes

        # Gather all image files in the directory
        # Match extensions case-insensitively so files such as 'IMG.PNG' are not skipped
        self.image_files = sorted(p for p in self.images_dir.glob('*') if p.suffix.lower() in {'.jpg', '.jpeg', '.png'})

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        # Load the image
        img_path = self.image_files[idx]
        img = Image.open(img_path).convert('RGB')

        # Apply transforms if specified
        if self.transform:
            img = self.transform(img)

        # Load the corresponding label file
        label_path = self.labels_dir / f"{img_path.stem}.txt"
        boxes, labels = self._load_labels(label_path)

        # Convert boxes and labels to tensors; reshape keeps a (0, 4) shape for images without labels
        boxes = torch.tensor(boxes, dtype=torch.float32).reshape(-1, 4)
        labels = torch.tensor(labels, dtype=torch.int64)

        return img, {'boxes': boxes, 'labels': labels}

    def _load_labels(self, label_path):
        """
        Load labels from a Darknet format .txt file without converting to pixel coordinates.
        
        Args:
            label_path (Path): Path to the .txt file.
        
        Returns:
            boxes (list of lists): Bounding boxes in normalized coordinates [x_center, y_center, width, height].
            labels (list): Class labels.
        """
        boxes = []
        labels = []

        if label_path.exists():
            with open(label_path, 'r') as f:
                for line in f:
                    parts = line.split()
                    if not parts:
                        continue  # skip blank lines instead of failing on unpacking
                    class_id, x_center, y_center, width, height = map(float, parts)
                    labels.append(int(class_id))
                    boxes.append([x_center, y_center, width, height])

        return boxes, labels


In [47]:
Kitti_dataset = YoloDarknetDataset(images_dir=img_dir, labels_dir=labels_dir, classes=["Cyclist", "Pedestrian", "car"])

In [48]:
sample = Kitti_dataset[8]

In [50]:
sample

(<PIL.Image.Image image mode=RGB size=1242x375>,
 {'boxes': tensor([[0.1620, 0.7552, 0.3239, 0.4843],
          [0.3862, 0.7346, 0.2332, 0.5149],
          [0.8769, 0.7619, 0.2445, 0.4710],
          [0.5308, 0.5831, 0.0993, 0.2266],
          [0.6173, 0.5030, 0.0411, 0.1056],
          [0.7411, 0.5580, 0.0579, 0.1650]]),
  'labels': tensor([2, 2, 2, 2, 2, 2])})
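
The boxes above are stored in normalized Darknet format, so they only become pixel coordinates once multiplied by the image size. The sketch below shows one way to do that conversion; `darknet_to_pixel_corners` is a hypothetical helper, and the 1242 x 375 size is taken from the sample printed above.

```python
def darknet_to_pixel_corners(box, image_width, image_height):
    """Convert a normalized [x_center, y_center, width, height] box to pixel (x1, y1, x2, y2)."""
    x_c, y_c, w, h = box
    x1 = (x_c - w / 2) * image_width
    y1 = (y_c - h / 2) * image_height
    x2 = (x_c + w / 2) * image_width
    y2 = (y_c + h / 2) * image_height
    return x1, y1, x2, y2


# A centered box covering half of each dimension of a 100 x 200 image.
print(darknet_to_pixel_corners([0.5, 0.5, 0.5, 0.5], 100, 200))  # (25.0, 50.0, 75.0, 150.0)

# First box of the sample above, on its 1242 x 375 image.
first_box = [0.1620, 0.7552, 0.3239, 0.4843]
print(darknet_to_pixel_corners(first_box, 1242, 375))
```

Because the coordinates are normalized, they stay valid when the image is resized directly to 416 x 416 without padding; only a resize that pads or crops would require adjusting them.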

In [43]:
len(Kitti_dataset)

7481