<a href="https://colab.research.google.com/github/iannstronaut/YOLOv3_From_Scratch/blob/main/Build_Yolo_From_Scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Model Implementation

In [37]:
import torch
import torch.nn as nn

In [54]:
model_config = [
    (32, 3, 1),
    (64, 3, 2),
    ["B", 1],
    (128, 3, 2),
    ["B", 2],
    (256, 3, 2),
    ["B", 8],
    (512, 3, 2),
    ["B", 8],
    (1024, 3, 2),
    ["B", 4],  # Darknet-53 backbone ends here
    (512, 1, 1),
    (1024, 3, 1),
    "S",
    (256, 1, 1),
    "U",
    (256, 1, 1),
    (512, 3, 1),
    "S",
    (128, 1, 1),
    "U",
    (128, 1, 1),
    (256, 3, 1),
    "S"
]

The model_config variable is a list that describes the YOLOv3 architecture. It starts with Darknet-53 as the backbone and continues with the head used for object detection.

Each element of the list is one of the following:

* A tuple (C, K, S) describing a convolution layer (Conv2D):

  * C: number of filters (output channels)

  * K: kernel size

  * S: stride

* A list ["B", N] describing a residual block repeated N times (part of Darknet-53).

* The string "S" marking a scale prediction branch, where detections are emitted at the current feature-map resolution.

* The string "U" marking an upsampling layer, whose output is fused with an earlier feature map.
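To make the notation concrete, here is a small sketch that walks a copy of the backbone portion of the configuration and tallies the convolution layers and the cumulative stride. The names backbone_config and summarize_backbone are illustrative and not part of the notebook, and the count assumes two convolutions per residual repeat, matching the ResidualBlock class in this notebook.

```python
# Illustrative copy of the Darknet-53 part of model_config (hypothetical name).
backbone_config = [
    (32, 3, 1),
    (64, 3, 2),
    ["B", 1],
    (128, 3, 2),
    ["B", 2],
    (256, 3, 2),
    ["B", 8],
    (512, 3, 2),
    ["B", 8],
    (1024, 3, 2),
    ["B", 4],
]


def summarize_backbone(config):
    """Return (number of conv layers, total downsampling factor)."""
    num_convs = 0
    total_stride = 1
    for entry in config:
        if isinstance(entry, tuple):
            _, _, stride = entry
            num_convs += 1
            total_stride *= stride
        elif isinstance(entry, list):
            # Assumes two convolutions (1x1 then 3x3) per residual repeat.
            num_convs += 2 * entry[1]
    return num_convs, total_stride


num_convs, total_stride = summarize_backbone(backbone_config)
print(num_convs, total_stride)  # 52 32
```

The total stride of 32 is why a 416×416 input reaches the end of the backbone as a 13×13 feature map.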

In [None]:
class CNNBlock(nn.Module):
  def __init__(self, in_channels, out_channels, bn_act=True, **kwargs):
    super().__init__()
    self.conv = nn.Conv2d(in_channels, out_channels, bias=not bn_act, **kwargs)
    self.bn = nn.BatchNorm2d(out_channels)
    self.leaky = nn.LeakyReLU(0.1)
    self.use_bn_act = bn_act

  def forward(self, x):
    if self.use_bn_act:
      return self.leaky(self.bn(self.conv(x)))
    else:
      return self.conv(x)

CNNBlock is a PyTorch module that serves as the basic building block of convolutional architectures such as YOLOv3. It combines three operations that are common in modern CNNs into a single unit: Conv2D (convolution), Batch Normalization, and Leaky ReLU.

The block takes in_channels and out_channels, which set the number of input and output channels of the convolution layer. The boolean parameter bn_act controls whether Batch Normalization and the Leaky ReLU activation are applied after the convolution. If bn_act is True, the convolution is followed by normalization and activation; if it is False, only the convolution layer is used (typically for the YOLO output layers).

The block operates as follows:

* If bn_act = True, the input passes through Conv2D → BatchNorm2D → LeakyReLU.

* If bn_act = False, the input passes through Conv2D only, with no activation or normalization.

In addition, the bias of the convolution layer is enabled only when bn_act = False. With BatchNorm the bias is redundant: normalization subtracts the per-channel mean, which cancels any constant offset, and BatchNorm has its own learnable shift. Disabling the bias therefore saves parameters at no cost.

This flexibility lets CNNBlock be used for the early layers of the network, for intermediate blocks, and for the final layers that need no additional activation.
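The point about the bias can be checked numerically. The sketch below is not part of the notebook: it builds two convolutions with identical weights, gives one of them a deliberately large bias, and passes both through the same BatchNorm layer in training mode. Layer sizes and the bias value are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two convolutions with identical weights; only one has a bias term.
conv_with_bias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
conv_no_bias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    conv_no_bias.weight.copy_(conv_with_bias.weight)
    # A large constant bias makes any leftover effect easy to spot.
    conv_with_bias.bias.fill_(5.0)

bn = nn.BatchNorm2d(8)
bn.train()  # training mode: normalize with the statistics of the current batch

x = torch.randn(4, 3, 16, 16)
out_with_bias = bn(conv_with_bias(x))
out_no_bias = bn(conv_no_bias(x))

# Only floating-point residue should remain between the two outputs.
max_diff = (out_with_bias - out_no_bias).abs().max().item()
print(max_diff)
```

Because the batch mean absorbs the constant offset, the two outputs agree up to rounding error.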

In [None]:
class ResidualBlock(nn.Module):
  def __init__(self, channels, use_residual=True, num_repeats=1):
    super().__init__()
    self.layers = nn.ModuleList()
    for _ in range(num_repeats):
      self.layers += [
          nn.Sequential(
            CNNBlock(channels, channels//2, kernel_size=1),
            CNNBlock(channels//2, channels, kernel_size=3, padding=1)
          )
      ]

    self.use_residual = use_residual
    self.num_repeats = num_repeats

  def forward(self, x):
    for layer in self.layers:
      x = layer(x) + x if self.use_residual else layer(x)
    return x

In [None]:
class ScalePrediction(nn.Module):
  def __init__(self, in_channels, num_classes):
    super().__init__()
    self.pred = nn.Sequential(
        CNNBlock(in_channels, 2*in_channels, kernel_size=3, padding=1),
        CNNBlock(2*in_channels, (num_classes + 5 ) * 3, bn_act=False, kernel_size=1)
    )
    self.num_classes = num_classes

  def forward(self, x):
    return (
        self.pred(x)
        .reshape(x.shape[0], 3, self.num_classes + 5, x.shape[2], x.shape[3])
        .permute(0, 1, 3, 4, 2)
    )
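The reshape and permute in ScalePrediction.forward can be checked on a small tensor. The sketch below uses a plain tensor as a stand-in for the convolution output; the batch size and grid size are chosen only for illustration.

```python
import torch

num_classes = 20
batch, grid = 2, 13
per_anchor = num_classes + 5  # objectness + 4 box values + class scores

# Stand-in for the raw head output: (N, 3 * (C + 5), S, S).
raw = torch.arange(batch * 3 * per_anchor * grid * grid, dtype=torch.float32)
raw = raw.reshape(batch, 3 * per_anchor, grid, grid)

# Same reshape and permute as ScalePrediction.forward.
out = raw.reshape(batch, 3, per_anchor, grid, grid).permute(0, 1, 3, 4, 2)
print(out.shape)  # torch.Size([2, 3, 13, 13, 25])

# Channel a * (C + 5) + k of the raw output becomes entry k of anchor a.
a, k, i, j = 2, 7, 4, 9
print(torch.equal(out[0, a, i, j, k], raw[0, a * per_anchor + k, i, j]))  # True
```

Placing the per-anchor values in the last dimension is what lets the loss index them with expressions such as prediction[..., 0:1].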

In [55]:
class YOLOv3(nn.Module):
  def __init__(self, in_channels=3, num_classes=20):
    super().__init__()
    self.num_classes = num_classes
    self.in_channels = in_channels
    self.layers = self._create_conv_layers()

  def forward(self, x):
    output = []
    route_connections = []

    for layer in self.layers:
      if isinstance(layer, ScalePrediction):
        output.append(layer(x))
        continue

      x = layer(x)
      print(x.shape)

      if isinstance(layer, ResidualBlock) and layer.num_repeats == 8:
        route_connections.append(x)

      elif isinstance(layer, nn.Upsample):
        x = torch.cat([x, route_connections[-1]], dim = 1)
        route_connections.pop()

    return output

  def _create_conv_layers(self):
    layers = nn.ModuleList()
    in_channels = self.in_channels

    for module in model_config:
      if isinstance(module, tuple):
        out_channels, kernel_size, stride = module
        layers.append(CNNBlock(
            in_channels,
            out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=1 if kernel_size == 3 else 0
          )
        )
        in_channels = out_channels

      elif isinstance(module, list):
        num_repeats = module[1]
        layers.append(ResidualBlock(in_channels, num_repeats=num_repeats))

      elif isinstance(module, str):
        if module == "S":
          layers += [
              ResidualBlock(in_channels, use_residual=False, num_repeats=1),
              CNNBlock(in_channels, in_channels//2, kernel_size=1),
              ScalePrediction(in_channels//2, num_classes=self.num_classes)
          ]
          in_channels = in_channels // 2

        elif module == "U":
          layers.append(nn.Upsample(scale_factor=2))
          in_channels = in_channels * 3

    return layers

In [56]:
if __name__ == "__main__":
  num_classes = 20
  IMAGE_SIZE = 416
  model = YOLOv3(num_classes=num_classes)
  x = torch.randn((2, 3, IMAGE_SIZE, IMAGE_SIZE))
  out = model(x)
  # Reuse the output computed above instead of running three extra forward passes
  assert out[0].shape == (2, 3, IMAGE_SIZE//32, IMAGE_SIZE//32, num_classes + 5)
  assert out[1].shape == (2, 3, IMAGE_SIZE//16, IMAGE_SIZE//16, num_classes + 5)
  assert out[2].shape == (2, 3, IMAGE_SIZE//8, IMAGE_SIZE//8, num_classes + 5)
  print("Success!")

torch.Size([2, 32, 416, 416])
torch.Size([2, 64, 208, 208])
torch.Size([2, 64, 208, 208])
torch.Size([2, 128, 104, 104])
torch.Size([2, 128, 104, 104])
torch.Size([2, 256, 52, 52])
torch.Size([2, 256, 52, 52])
torch.Size([2, 512, 26, 26])
torch.Size([2, 512, 26, 26])
torch.Size([2, 1024, 13, 13])
torch.Size([2, 1024, 13, 13])
torch.Size([2, 512, 13, 13])
torch.Size([2, 1024, 13, 13])
torch.Size([2, 1024, 13, 13])
torch.Size([2, 512, 13, 13])
torch.Size([2, 256, 13, 13])
torch.Size([2, 256, 26, 26])
torch.Size([2, 256, 26, 26])
torch.Size([2, 512, 26, 26])
torch.Size([2, 512, 26, 26])
torch.Size([2, 256, 26, 26])
torch.Size([2, 128, 26, 26])
torch.Size([2, 128, 52, 52])
torch.Size([2, 128, 52, 52])
torch.Size([2, 256, 52, 52])
torch.Size([2, 256, 52, 52])
torch.Size([2, 128, 52, 52])

## Dataset Class

In [58]:
!git clone https://github.com/aladdinpersson/Machine-Learning-Collection.git
%cd Machine-Learning-Collection/ML/Pytorch/object_detection/YOLOv3
!cp utils.py /content/
!cp config.py /content/
!cp dataset.py /content/
%cd /content/
!rm -rf Machine-Learning-Collection

Cloning into 'Machine-Learning-Collection'...
remote: Enumerating objects: 1360, done.
remote: Counting objects: 100% (335/335), done.
remote: Compressing objects: 100% (200/200), done.
remote: Total 1360 (delta 172), reused 135 (delta 135), pack-reused 1025 (from 2)
Receiving objects: 100% (1360/1360), 120.81 MiB | 30.61 MiB/s, done.
Resolving deltas: 100% (565/565), done.
/content/Machine-Learning-Collection/ML/Pytorch/object_detection/YOLOv3
/content


In [34]:
import config
import numpy as np
import os
import pandas as pd

from PIL import Image, ImageFile
from torch.utils.data import Dataset, DataLoader
from utils import (
    iou_width_height as iou,
    non_max_suppression as nms
)

In [33]:
ImageFile.LOAD_TRUNCATED_IMAGES = True

In [46]:
class YOLODataset(Dataset):
  def __init__(
      self,
      csv_file,
      img_dir, label_dir,
      anchors,
      image_size=416,
      S=[13, 26, 52],
      C=20,
      transform=None
  ):
    self.annotations = pd.read_csv(csv_file)
    self.img_dir = img_dir
    self.label_dir = label_dir
    self.transform = transform
    self.S = S
    self.anchors = torch.tensor(anchors[0] + anchors[1] + anchors[2])
    self.num_anchors = self.anchors.shape[0]
    self.num_anchors_per_scale = self.num_anchors // 3
    self.C = C
    self.ignore_iou_thresh = 0.5

  def __len__(self):
    return len(self.annotations)

  def __getitem__(self, index):
    label_path = os.path.join(self.label_dir, self.annotations.iloc[index, 1])
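    # Labels are stored as [class, x, y, w, h]; roll so the class comes last.
    # ndmin=2 keeps a 2-D array even when the file holds a single box.
    bboxes = np.roll(np.loadtxt(fname=label_path, delimiter=" ", ndmin=2), 4, axis=1).tolist()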
    img_path = os.path.join(self.img_dir, self.annotations.iloc[index, 0])
    image = np.array(Image.open(img_path).convert("RGB"))

    if self.transform:
      augmentations = self.transform(image=image, bboxes=bboxes)
      image = augmentations["image"]
      bboxes = augmentations["bboxes"]

    targets = [torch.zeros((self.num_anchors // 3, S, S, 6)) for S in self.S]

    for box in bboxes:
      # Rank all anchors by width/height IoU with this box (best first)
      iou_anchors = iou(torch.tensor(box[2:4]), self.anchors)
      anchor_indices = iou_anchors.argsort(descending=True, dim=0)
      x, y, width, height, class_label = box
      has_anchor = [False, False, False]  # at most one assigned anchor per scale

      for anchor_idx in anchor_indices:
        scale_idx = anchor_idx // self.num_anchors_per_scale
        anchor_on_scale = anchor_idx % self.num_anchors_per_scale
        S = self.S[scale_idx]
        i, j = int(S * y), int(S * x)  # row and column of the responsible cell
        anchor_taken = targets[scale_idx][anchor_on_scale, i, j, 0]

        if not anchor_taken and not has_anchor[scale_idx]:
          targets[scale_idx][anchor_on_scale, i, j, 0] = 1
          x_cell, y_cell = S * x - j, S * y - i  # offsets inside the cell, in [0, 1)
          width_cell, height_cell = (
              width * S,
              height * S
          )
          box_coordinates = torch.tensor(
              [x_cell, y_cell, width_cell, height_cell]
          )
          targets[scale_idx][anchor_on_scale, i, j, 1:5] = box_coordinates
          targets[scale_idx][anchor_on_scale, i, j, 5] = int(class_label)
          has_anchor[scale_idx] = True

        elif not anchor_taken and iou_anchors[anchor_idx] > self.ignore_iou_thresh:
          targets[scale_idx][anchor_on_scale, i, j, 0] = -1

    return image, tuple(targets)
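The target assignment in __getitem__ can be traced by hand for a single box. The following sketch re-implements the two pieces of arithmetic in plain Python. The helper wh_iou is a stand-in written for this example, on the assumption that iou_width_height compares boxes by width and height only, as if they shared a center. The helper to_cell_coords, the anchor sizes, and the box are likewise made up for illustration.

```python
def wh_iou(box_wh, anchor_wh):
    """IoU of two boxes given only (width, height), as if they shared a center."""
    inter = min(box_wh[0], anchor_wh[0]) * min(box_wh[1], anchor_wh[1])
    union = box_wh[0] * box_wh[1] + anchor_wh[0] * anchor_wh[1] - inter
    return inter / union


def to_cell_coords(x, y, width, height, S):
    """Map a normalized midpoint box to (row, col, x_cell, y_cell, w_cell, h_cell)."""
    i, j = int(S * y), int(S * x)          # row and column of the cell
    x_cell, y_cell = S * x - j, S * y - i  # offset inside the cell, in [0, 1)
    return i, j, x_cell, y_cell, width * S, height * S


# Hypothetical anchors (normalized width, height) and one ground-truth box.
anchors = [(0.9, 0.8), (0.3, 0.3), (0.05, 0.05)]
x, y, width, height = 0.5, 0.25, 0.25, 0.5

ious = [wh_iou((width, height), a) for a in anchors]
best = max(range(len(anchors)), key=lambda idx: ious[idx])
print(best)  # 1

cell = to_cell_coords(x, y, width, height, S=13)
print(cell)  # (3, 6, 0.5, 0.25, 3.25, 6.5)
```

On a 13×13 grid the box center falls in row 3, column 6, and the width and height are expressed in units of grid cells rather than fractions of the image.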

## Loss Implementation

In [36]:
from utils import intersection_over_union

In [44]:
class YOLOLoss(nn.Module):
  def __init__(self):
    super().__init__()
    self.mse = nn.MSELoss()
    self.bce = nn.BCEWithLogitsLoss()
    self.entropy = nn.CrossEntropyLoss()
    self.sigmoid = nn.Sigmoid()

    self.lambda_class = 1
    self.lambda_noobj = 10
    self.lambda_obj = 1
    self.lambda_box = 10

  def forward(self, prediction, target, anchors):
    obj = target[..., 0] == 1
    noobj = target[..., 0] == 0

    # No object loss
    no_object_loss = self.bce(
        (prediction[..., 0:1][noobj]), (target[..., 0:1][noobj])
    )

    # Object loss
    anchors = anchors.reshape(1, 3, 1, 1, 2)
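    # Decode boxes: sigmoid for the center offsets, exp(t) * anchor for width and height.
    # Concatenate along the last dimension so each box keeps its four coordinates.
    box_preds = torch.cat([self.sigmoid(prediction[..., 1:3]), torch.exp(prediction[..., 3:5]) * anchors], dim=-1)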
    ious = intersection_over_union(box_preds[obj], target[..., 1:5][obj]).detach()
    object_loss = self.bce((prediction[..., 0:1][obj]), (ious * target[..., 0:1][obj]))

    #Box Coordinate loss
    prediction[..., 1:3] = self.sigmoid(prediction[..., 1:3])
    target[..., 3:5] = torch.log(
        (1e-16 + target[..., 3:5] / anchors)
    )
    box_loss = self.mse(prediction[..., 1:5][obj], target[..., 1:5][obj])

    #Class loss
    class_loss = self.entropy(
        (prediction[..., 5:][obj]), (target[..., 5][obj].long())
    )

    return (
        self.lambda_box * box_loss
        + self.lambda_obj * object_loss
        + self.lambda_noobj * no_object_loss
        + self.lambda_class * class_loss
    )
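The loss relies on a fixed mapping between raw network outputs and box values: a sigmoid for the center offsets and an exponential scaled by the anchor for the size, with the logarithm as the inverse applied to the targets. Below is a minimal sketch of that mapping in plain Python. The helper names decode_box and encode_wh and the anchor values are illustrative and do not appear in the notebook.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t_x, t_y, t_w, t_h, anchor_w, anchor_h):
    """Raw outputs -> (x_cell, y_cell, width, height) in grid-cell units."""
    return (
        sigmoid(t_x),
        sigmoid(t_y),
        math.exp(t_w) * anchor_w,
        math.exp(t_h) * anchor_h,
    )


def encode_wh(width, height, anchor_w, anchor_h, eps=1e-16):
    """Inverse of the size decoding, as applied to the targets in the loss."""
    return math.log(eps + width / anchor_w), math.log(eps + height / anchor_h)


# Zero raw outputs decode to the cell center and exactly the anchor size.
centered = decode_box(0.0, 0.0, 0.0, 0.0, anchor_w=3.64, anchor_h=2.86)
print(centered)  # (0.5, 0.5, 3.64, 2.86)

# Encoding a size and decoding it again recovers the original values.
t_w, t_h = encode_wh(5.0, 2.0, anchor_w=3.64, anchor_h=2.86)
_, _, w, h = decode_box(0.0, 0.0, t_w, t_h, anchor_w=3.64, anchor_h=2.86)
print(abs(w - 5.0) < 1e-9, abs(h - 2.0) < 1e-9)  # True True
```

Regressing the logarithm of the size ratio keeps the width and height targets near zero when a box is close to its anchor, which is the case the anchor assignment tries to arrange.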

## Training

In [39]:
import torch.optim as optim
from tqdm import tqdm
from utils import (
    mean_average_precision,
    cells_to_bboxes,
    get_evaluation_bboxes,
    save_checkpoint,
    load_checkpoint,
    check_class_accuracy,
    get_loaders,
    plot_couple_examples
)

In [41]:
torch.backends.cudnn.benchmark = True

In [42]:
def train_fn(train_loader, model, optimizer, loss_fn, scaler, scaled_anchors):
  loop = tqdm(train_loader, leave=True)
  losses = []

  for batch_idx, (x, y) in enumerate(loop):
    x = x.to(config.DEVICE)
    y0, y1, y2 = (
        y[0].to(config.DEVICE),
        y[1].to(config.DEVICE),
        y[2].to(config.DEVICE)
    )

    with torch.cuda.amp.autocast():
      out = model(x)
      loss = (
          loss_fn(out[0], y0, scaled_anchors[0])
          + loss_fn(out[1], y1, scaled_anchors[1])
          + loss_fn(out[2], y2, scaled_anchors[2])
      )

    losses.append(loss.item())
    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

    mean_loss = sum(losses) / len(losses)
    loop.set_postfix(loss=mean_loss)

In [50]:
def main():
  model = YOLOv3(num_classes=config.NUM_CLASSES).to(config.DEVICE)
  optimizer = optim.Adam(
      model.parameters(), lr=config.LEARNING_RATE, weight_decay=config.WEIGHT_DECAY
  )
  loss_fn = YOLOLoss()
  scaler = torch.cuda.amp.GradScaler()

  train_loader, test_loader, train_eval_loader = get_loaders(
      train_csv_path=config.DATASET + "/100examples.csv",
      test_csv_path=config.DATASET + "/100examples.csv",
  )

  if config.LOAD_MODEL:
    load_checkpoint(
        config.CHECKPOINT_FILE, model, optimizer, config.LEARNING_RATE
    )

  scaled_anchors = (
      torch.tensor(config.ANCHORS)
      * torch.tensor(config.S).unsqueeze(1).unsqueeze(1).repeat(1, 3, 2)
  ).to(config.DEVICE)

  for epoch in range(config.NUM_EPOCHS):
    train_fn(train_loader, model, optimizer, loss_fn, scaler, scaled_anchors)

    if config.SAVE_MODEL:
      save_checkpoint(model, optimizer, filename="checkpoint.pth.tar")

    if epoch % 10 == 0 and epoch > 0:
      print("On Test loader:")
      check_class_accuracy(model, test_loader, threshold=config.CONF_THRESHOLD)

      pred_boxes, true_boxes = get_evaluation_bboxes(
          test_loader,
          model,
          iou_threshold=config.NMS_IOU_THRESH,
          anchors=config.ANCHORS,
          threshold=config.CONF_THRESHOLD,
      )

      mapval = mean_average_precision(
          pred_boxes,
          true_boxes,
          iou_threshold=config.MAP_IOU_THRESH,
          box_format="midpoint",
          num_classes=config.NUM_CLASSES,
      )
      print(f"MAP: {mapval.item()}")
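The scaled_anchors expression in main converts the anchors from image-relative units to grid-cell units, using one grid size per scale. Here is a small sketch of that broadcasting. The anchor values and grid sizes are placeholders chosen for illustration; the real ones come from config.ANCHORS and config.S.

```python
import torch

# Placeholder anchors: 3 scales x 3 anchors x (width, height), image-relative.
ANCHORS = [
    [(0.28, 0.22), (0.38, 0.48), (0.9, 0.78)],
    [(0.07, 0.15), (0.15, 0.11), (0.14, 0.29)],
    [(0.02, 0.03), (0.04, 0.07), (0.08, 0.06)],
]
S = [13, 26, 52]  # grid size for each scale

# (3,) -> (3, 1, 1) -> (3, 3, 2): each anchor is multiplied by its scale's grid size.
grid = torch.tensor(S).unsqueeze(1).unsqueeze(1).repeat(1, 3, 2)
scaled_anchors = torch.tensor(ANCHORS) * grid

print(scaled_anchors.shape)  # torch.Size([3, 3, 2])
print(scaled_anchors[0, 0])  # approximately (3.64, 2.86)
```

Each scaled_anchors[k] then has shape (3, 2), which is what the loss reshapes to (1, 3, 1, 1, 2) before broadcasting against the predictions.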

## Test on Pascal VOC

In [47]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("aladdinpersson/pascal-voc-dataset-used-in-yolov3-video")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/aladdinpersson/pascal-voc-dataset-used-in-yolov3-video?dataset_version_number=1...


100%|██████████| 4.31G/4.31G [01:12<00:00, 63.9MB/s]

Extracting files...

Path to dataset files: /root/.cache/kagglehub/datasets/aladdinpersson/pascal-voc-dataset-used-in-yolov3-video/versions/1


In [49]:
!cp -r /root/.cache/kagglehub/datasets/aladdinpersson/pascal-voc-dataset-used-in-yolov3-video/versions/1 /content/pascal