YOLOv2 Output Cell Shape Explanation

The output cell shape of the YOLOv2 architecture refers to the structure of the final feature map produced by the network, which is used for object detection. To understand this, let’s break it down step-by-step:

1. **Grid Division**  
YOLOv2 divides the input image into an S x S grid. Each cell is responsible for predicting bounding boxes for objects whose centers fall within that cell. The grid size follows from the network's total downsampling factor (stride) of 32: a 416 x 416 input yields a 13 x 13 grid, so each grid cell covers 32 x 32 pixels of the original image.
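
As a small illustration of the grid assignment, the sketch below maps a box center given in pixels to the grid cell that would be responsible for it. The function name and its arguments are hypothetical and only mirror the 416 x 416 input and stride of 32 used in this notebook.

```python
def grid_cell_for_point(cx, cy, image_size=416, stride=32):
    """Return (grid_size, col, row) for a box center given in pixels."""
    grid_size = image_size // stride  # 416 // 32 = 13
    # Clamp so that a center lying exactly on the right/bottom border
    # still maps to the last cell instead of falling outside the grid.
    col = min(int(cx // stride), grid_size - 1)
    row = min(int(cy // stride), grid_size - 1)
    return grid_size, col, row


print(grid_cell_for_point(200.0, 100.0))  # (13, 6, 3)
```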


2. **Bounding Box Predictions**  
Each grid cell predicts:

- A fixed number of bounding boxes (typically 5, one per anchor box).
- For each bounding box:
  - The coordinates (x, y, w, h): the box center (x, y) relative to the cell, and the width and height (w, h) expressed relative to the dimensions of the box's anchor (prior).
  - A confidence score that indicates the likelihood that the box contains an object.
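
To make the per-box values concrete, here is a minimal sketch of the decoding described in the YOLOv2 paper: the center offsets pass through a sigmoid and are added to the cell index, and the width and height scale an anchor prior exponentially. The function name is hypothetical, the anchors are assumed to be expressed in grid-cell units, and this is not taken from lightnet's implementation.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(tx, ty, tw, th, to, col, row, anchor_w, anchor_h, grid_size=13):
    """Decode one raw prediction into a normalized (cx, cy, w, h, conf) box.

    Assumption: anchor_w and anchor_h are given in grid-cell units.
    """
    cx = (col + sigmoid(tx)) / grid_size      # center x in [0, 1]
    cy = (row + sigmoid(ty)) / grid_size      # center y in [0, 1]
    w = anchor_w * math.exp(tw) / grid_size   # width as a fraction of the image
    h = anchor_h * math.exp(th) / grid_size   # height as a fraction of the image
    conf = sigmoid(to)                        # objectness score in (0, 1)
    return cx, cy, w, h, conf


# All-zero raw outputs place the box at the middle of cell (6, 3),
# with exactly the anchor's size and a confidence of 0.5.
print(decode_box(0, 0, 0, 0, 0, col=6, row=3, anchor_w=1.3221, anchor_h=1.7314))
```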


3. **Class Predictions**  
For each bounding box, YOLOv2 also predicts the probabilities that the object belongs to one of the predefined classes. If there are C classes, then for each bounding box, there are C class scores.


4. **Output Tensor Shape**  
The final output of YOLOv2 has the shape: S x S x (B x (5 + C))

Where:

- S x S is the grid size (e.g., 13 x 13).
- B is the number of bounding boxes predicted per grid cell (typically 5).
- 5 + C refers to the 5 values for each bounding box (4 for x, y, w, h and 1 for the confidence score) plus the C class scores.

Example:
For a 13 x 13 grid, 5 bounding boxes per cell, and 20 classes (as in the Pascal VOC dataset), the output shape is 13 x 13 x (5 x (5 + 20)) = 13 x 13 x 125.

Each cell in this final output represents predictions for multiple bounding boxes and the associated class probabilities.
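
The shape arithmetic can be checked with a few lines of Python. The helper name below is hypothetical; it only evaluates S x S x (B x (5 + C)) for a given input size and stride.

```python
def yolo_v2_output_shape(image_size=416, stride=32, num_anchors=5, num_classes=20):
    """Return (S, S, B * (5 + C)) for a square input image."""
    grid = image_size // stride
    depth = num_anchors * (5 + num_classes)
    return grid, grid, depth


print(yolo_v2_output_shape())               # (13, 13, 125)
print(yolo_v2_output_shape(num_classes=3))  # (13, 13, 40)
```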

This grid of predictions is then post-processed: boxes with a low confidence score are discarded with a threshold, and non-maximum suppression (NMS) removes boxes that overlap a higher-scoring box.
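
The post-processing step can be sketched in plain Python as follows. This is a simplified greedy, class-agnostic NMS; the threshold values are illustrative assumptions and boxes are assumed to be in (x1, y1, x2, y2) corner format. It is not lightnet's own post-processing code.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, conf_threshold=0.5, iou_threshold=0.45):
    """Return the indices of the kept boxes, highest score first."""
    # 1. Drop low-confidence boxes, then sort the rest by descending score.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    # 2. Keep a box only if it does not overlap an already kept box too much.
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
print(nms(boxes, scores))  # [0, 2]
```

In practice NMS is usually applied per class, so boxes of different classes do not suppress each other.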

In [1]:
from model import model_builder
import torch
import lightnet as ln
from torch.utils.data import Dataset



In [18]:
# # Loading darknet weights (download: http://pjreddie.com/media/files/darknet19_448.weights)
# model = ln.models.Darknet19(1000)
# model.load('weights/darknet19_448.weights')

# # Save as PyTorch weight file (Not strictly necessary, but it is faster than darknet weight files)
# model.save('weights/darknet19_448.pt')

# # Converting Darknet19 weights to Yolo (This is the same as the darknet19_448.conv.23.weights from darknet)
# model.save('weights/yolo-pretrained_darknet.pt', remap=ln.models.YoloV2.remap_darknet19)

# # Load yolo weights (Requires `strict=False`, because not all layers have weights in this file)
# detection_model = ln.models.YoloV2(20)
# detection_model.load('weights/yolo-pretrained_darknet.pt', strict=False)


In [2]:
model = model_builder(num_classes=3)

  state = torch.load(weights_file, 'cpu')
Modules not matching, performing partial update


In [3]:
## Test model shape

X = torch.randn((16, 3, 416, 416))
print(model(X).shape)



torch.Size([16, 40, 13, 13])
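
Note that the tensor is channels-first: [batch, B x (5 + C), S, S] rather than S x S x (B x (5 + C)). With 5 anchors and 3 classes, the channel count is 5 x (5 + 3) = 40. The sketch below checks this arithmetic and maps an (anchor, attribute) pair to a channel index. The helper names are hypothetical, and the anchor-major channel ordering is an assumption based on the usual Darknet layout, not something this notebook verifies.

```python
def infer_num_classes(num_channels, num_anchors=5):
    """Recover C from the channel count B * (5 + C)."""
    per_anchor, remainder = divmod(num_channels, num_anchors)
    if remainder or per_anchor < 5:
        raise ValueError("channel count is not of the form B * (5 + C)")
    return per_anchor - 5


def channel_index(anchor, attribute, num_classes):
    """Channel of one attribute, assuming an anchor-major layout.

    Assumed order per anchor: x, y, w, h, conf, class_0, ..., class_{C-1}.
    """
    return anchor * (5 + num_classes) + attribute


print(infer_num_classes(40))     # 3
print(channel_index(1, 4, 3))    # 12 (confidence of the second anchor)
```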


In [4]:
## Test framework loss

loss_fn = ln.network.loss.RegionLoss(
    num_classes=model.num_classes,
    anchors=model.anchors,
    network_stride=model.stride
)


In [44]:
# Create accompanying loss (minimal required arguments for it to work with our defined Yolo network)
loss = ln.network.loss.RegionLoss(
    num_classes=model.num_classes,
    anchors=model.anchors,
    network_stride=model.stride
)
print(loss)

# Use loss
input_tensor = torch.rand(1, 3, 416, 416)   # batch, channel, height, width
target_tensor = torch.rand(1, 22, 5)         # batch, num_anno, 5 (see RegionLoss docs)

output_tensor = model(input_tensor)
loss_value = loss(output_tensor, target_tensor)
#loss_value.backward()

# Print loss
print(loss.values)
print(loss.values["total"].item())

RegionLoss(
  classes=3, network_stride=32, IoU threshold=0.6, seen=0
  coord_scale=1.0, object_scale=5.0, noobject_scale=1.0, class_scale=1.0
  anchors=[1.3221, 1.7314] [3.1927, 4.0094] [5.0559, 8.0989] [9.4711, 4.8405] [11.236, 10.007]
)
{'total': tensor(252.8157), 'conf': tensor(158.1815), 'coord': tensor(64.0366), 'class': tensor(30.5976)}
252.81565856933594


In [7]:
loss.values["total"].item()

268.1571044921875

## Custom Dataset 

In [5]:
from pathlib import Path

root_dir = Path("data")
labels_dir = root_dir / "labels"
img_dir = root_dir / "data_object_image_2/training/image_2"

In [49]:
from dataset import YoloDarknetDataset

In [50]:
Kitti_dataset = YoloDarknetDataset(images_dir=img_dir, labels_dir=labels_dir, classes=["Cyclist", "Pedestrian", "car"])

In [8]:
sample = Kitti_dataset[8]

In [55]:
target_tensor = sample[1]["boxes"]
target_tensor = target_tensor.unsqueeze(dim=0)

In [61]:
target_tensor.shape

torch.Size([1, 22, 5])

In [60]:
output_tensor.shape

torch.Size([1, 40, 13, 13])

In [67]:
loss_value = loss(output_tensor, target_tensor)

#loss_value.backward()

# Print loss
print(loss.values)


{'total': tensor(223.0321), 'conf': tensor(126.0392), 'coord': tensor(77.2070), 'class': tensor(19.7859)}


In [2]:
import os
def max_labels_in_folder(folder_path):
    max_labels = 0
    max_labels_file = ""
    
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(folder_path, filename)
            with open(file_path, 'r') as file:
                label_count = sum(1 for line in file)  # Count lines in the file
                if label_count > max_labels:
                    max_labels = label_count
                    max_labels_file = filename  # Update filename with the maximum labels
    
    return max_labels_file, max_labels

# Set the path to your folder with the txt files
folder_path = '/home/gustavo/workstation/depth_estimation/codes/rgbd-yolov2/data/labels'
# max_labels_in_folder returns a (filename, label_count) tuple, so both values are printed
max_labels_info = max_labels_in_folder(folder_path)
print(f"The maximum number of labels in a file is: {max_labels_info}")


The maximum number of labels in a file is: ('004139.txt', 22)
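
The count of 22 matches the second dimension of the target tensor used with the loss ([1, 22, 5]), which suggests that every label list is padded to the largest count so that samples can be stacked into a batch. How `YoloDarknetDataset` does this is not shown in this notebook; the sketch below is an assumption with hypothetical names, and the padding row (class -1, zeros elsewhere) mirrors the filler rows visible in the dataset output.

```python
def pad_annotations(boxes, max_annos=22, pad_row=(-1.0, 0.0, 0.0, 0.0, 0.0)):
    """Pad a list of [class, x, y, w, h] rows to a fixed length."""
    if len(boxes) > max_annos:
        raise ValueError(f"{len(boxes)} annotations exceed the limit of {max_annos}")
    padding = [list(pad_row) for _ in range(max_annos - len(boxes))]
    return [list(b) for b in boxes] + padding


padded = pad_annotations([[2.0, 0.1620, 0.7552, 0.3239, 0.4843]])
print(len(padded))   # 22
print(padded[1])     # [-1.0, 0.0, 0.0, 0.0, 0.0]
```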


In [17]:
import torch
from torch.utils.data import DataLoader
from dataset import YoloDarknetDataset
from torchvision import transforms
import torch.optim as optim
from train import train_yolov2
import os
from model import model_builder
import lightnet as ln

IMG_DIR = "/home/gustavo/workstation/depth_estimation/codes/rgbd-yolov2/data/images_test/"
LABEL_DIR =  "/home/gustavo/workstation/depth_estimation/codes/rgbd-yolov2/data/labels"
BATCH_SIZE = 8
NUM_WORKERS = os.cpu_count()
LEARNING_RATE = 2e-5
NUM_EPOCHS = 5
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DEVICE = "cpu"  # Force CPU for this run; this overrides the auto-detected device above

print(f"Using Device {DEVICE}")

model = model_builder(num_classes=3)

loss_fn = ln.network.loss.RegionLoss(
    num_classes=model.num_classes,
    anchors=model.anchors,
    network_stride=model.stride
)

optimizer = optim.Adam(
    model.parameters(),
    lr=LEARNING_RATE,
)

train_transforms = transforms.Compose([
    transforms.Resize((416, 416)),
    transforms.ToTensor()
])

train_dataset = YoloDarknetDataset(
    images_dir=IMG_DIR,
    labels_dir=LABEL_DIR,
    classes=["Cyclist", "Pedestrian", "car"],
    transform=train_transforms,
)

train_dataloader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=NUM_WORKERS,
    pin_memory=True
)


train_yolov2(model=model, 
             train_dataloader=train_dataloader, 
             loss_fn=loss_fn, 
             optimizer=optimizer, 
             num_epochs=NUM_EPOCHS, 
             device=DEVICE)

Using Device cpu


  state = torch.load(weights_file, 'cpu')
Modules not matching, performing partial update


Epoch [1/5], Loss: 185.4051
Epoch [2/5], Loss: 215.8996
Epoch [3/5], Loss: 165.4923
Epoch [4/5], Loss: 141.7955
Epoch [5/5], Loss: 109.8457


[185.40510995047433,
 215.8995840890067,
 165.49234662737166,
 141.79553876604353,
 109.84571838378906]

In [6]:
train_dataset[8][1]["boxes"]

tensor([[ 2.0000,  0.1620,  0.7552,  0.3239,  0.4843],
        [ 2.0000,  0.3862,  0.7346,  0.2332,  0.5149],
        [ 2.0000,  0.8769,  0.7619,  0.2445,  0.4710],
        [ 2.0000,  0.5308,  0.5831,  0.0993,  0.2266],
        [ 2.0000,  0.6173,  0.5030,  0.0411,  0.1056],
        [ 2.0000,  0.7411,  0.5580,  0.0579,  0.1650],
        [-1.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-1.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-1.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-1.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-1.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-1.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-1.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-1.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-1.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-1.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-1.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-1.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [-

In [15]:
loss_fn = ln.network.loss.RegionLoss(
    num_classes=model.num_classes,
    anchors=model.anchors,
    network_stride=model.stride
).to(DEVICE)

input_tensor = torch.randn(1, 3, 416, 416).to(DEVICE)
output_tensor = model(input_tensor)

targets = train_dataset[8][1]["boxes"].unsqueeze(dim=0)
targets = targets.to(DEVICE)


print(loss_fn(output_tensor, targets))

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
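
The full traceback is not shown, so the cause cannot be pinned down from this output alone. Two common causes are that the model was never moved to `DEVICE` (the cell only moves the loss, the input, and the targets), or that a module keeps a tensor as a plain attribute, which `.to()` does not move because it only handles parameters and registered buffers. The helper below is a hypothetical, duck-typed diagnostic that lists the devices of everything involved; it is a sketch, not part of PyTorch or lightnet.

```python
def devices_of(obj):
    """Collect the device names of the tensors reachable from obj.

    Duck-typed: works for tensors (anything with a .device attribute) and
    for module-like objects that expose parameters() and buffers().
    """
    found = set()
    if hasattr(obj, "device"):
        found.add(str(obj.device))
    for getter in ("parameters", "buffers"):
        method = getattr(obj, getter, None)
        if callable(method):
            found.update(str(t.device) for t in method())
    # Plain tensor attributes are not moved by .to(), so inspect them as well.
    # Only the top-level object's attributes are scanned, not submodules.
    for value in getattr(obj, "__dict__", {}).values():
        if hasattr(value, "device"):
            found.add(str(value.device))
    return found


def report_devices(**named):
    """Map each name to the sorted list of devices found for that object."""
    return {name: sorted(devices_of(obj)) for name, obj in named.items()}
```

Calling `report_devices(model=model, loss_fn=loss_fn, output=output_tensor, targets=targets)` in the notebook should show which object still lives on a different device. If it is the model, `model = model.to(DEVICE)` before the forward pass is the usual fix.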

In [4]:
from torchinfo import summary
model = model_builder(num_classes=3)
# Print a summary using torchinfo (uncomment for actual output)
summary(model=model, 
        input_size=(1, 3, 416, 416), # make sure this is "input_size", not "input_shape"
        # col_names=["input_size"], # uncomment for smaller output
        col_names=["input_size", "output_size", "num_params", "trainable"],
        col_width=20,
        row_settings=["var_names"]
) 

  state = torch.load(weights_file, 'cpu')
Modules not matching, performing partial update


Layer (type (var_name))                       Input Shape          Output Shape         Param #              Trainable
YoloV2 (YoloV2)                               [1, 3, 416, 416]     [1, 40, 13, 13]      --                   True
├─FeatureExtractor (backbone)                 [1, 3, 416, 416]     [1, 1024, 13, 13]    --                   True
│    └─Sequential (module)                    [1, 3, 416, 416]     [1, 1024, 13, 13]    --                   True
│    │    └─Conv2dBatchAct (1_convbatch)      [1, 3, 416, 416]     [1, 32, 416, 416]    928                  True
│    │    └─MaxPool2d (2_max)                 [1, 32, 416, 416]    [1, 32, 208, 208]    --                   --
│    │    └─Conv2dBatchAct (3_convbatch)      [1, 32, 208, 208]    [1, 64, 208, 208]    18,560               True
│    │    └─MaxPool2d (4_max)                 [1, 64, 208, 208]    [1, 64, 104, 104]    --                   --
│    │    └─Conv2dBatchAct (5_convbatch)      [1, 64, 104, 104]    [1, 128, 104, 104]  