# Theme 5: Object Detection with YOLOv2 Tiny



# Introduction

**Object Detection** is a computer vision technique that identifies and locates objects within an image or video. Specifically, it draws bounding boxes around the detected objects, which show where the objects are in a scene or how they move through it.

**YOLO (You Only Look Once)** is a family of object detection models known for their speed and efficiency. Unlike two-stage detectors (like R-CNN) that first generate region proposals and then classify them, YOLO processes the image in a single pass, predicting bounding boxes and class probabilities directly from full images. This makes it extremely fast and suitable for real-time applications.

**YOLOv2 Tiny** is a lightweight version of YOLOv2, optimized for speed and for devices with limited computational power. It uses the **Darknet** architecture as its backbone feature extractor.

### A Note on Transposed Convolutions
While YOLOv2 relies heavily on downsampling (pooling/strided convolutions) to aggregate features, modern architectures (like GANs or Semantic Segmentation networks like U-Net) often need to **upsample** lower-resolution feature maps back to the original image size. **Transposed Convolutions** (sometimes incorrectly called de-convolutions) are a learnable way to perform this upsampling, essentially reversing the spatial effect of a standard convolution.
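
To make the upsampling effect concrete, here is a minimal NumPy sketch of a single-channel transposed convolution without padding. The helper name `conv_transpose2d` is hypothetical and written only for illustration; it is not a library function. Each input value scales a copy of the kernel and adds it into the output at a stride-spaced offset, so an input of side length $n$ produces an output of side length $(n - 1) \cdot stride + k$.

```python
import numpy as np

def conv_transpose2d(x, kernel, stride):
    """Single-channel transposed convolution without padding (illustrative helper)."""
    n_h, n_w = x.shape
    k_h, k_w = kernel.shape
    out = np.zeros(((n_h - 1) * stride + k_h, (n_w - 1) * stride + k_w))
    for i in range(n_h):
        for j in range(n_w):
            # Each input value "stamps" a scaled copy of the kernel onto the output.
            out[i * stride:i * stride + k_h, j * stride:j * stride + k_w] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# A 2x2 kernel of ones with stride 2 repeats every input value in a 2x2 block.
upsampled = conv_transpose2d(x, np.ones((2, 2)), stride=2)
print(upsampled)
```

With a kernel of ones the weights are fixed, so this behaves like nearest-neighbor upsampling. In a trained network the kernel values are learned instead.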



# Part 1: Theoretical Questions



In this assignment, you will get hands-on experience in object detection. The object detection pipeline we use is YOLO version 2 Tiny built on the DarkNet feature extractor backbone. The YOLO v2 pipeline is available within TensorFlow / Keras.



### Question 1



YOLOv2 Tiny is based on a neural network named "Darknet Reference". Navigate to https://pjreddie.com/darknet/imagenet/ and scroll down to see a comparison table "Pre-Trained Models." Compared to the popular ResNet 18 network, how many fewer computational operations (Ops column) does Darknet Reference need?

**Answer:**
Darknet Reference requires approximately **4.9x fewer** operations than ResNet-18.
*   ResNet-18 Ops: ~4.69 Billion
*   Darknet Reference Ops: ~0.96 Billion
*   Ratio: $4.69 / 0.96 \approx 4.885$
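
The arithmetic can be double-checked with a few lines of Python. The variable names are chosen for illustration; the two operation counts are the values quoted in the answer.

```python
resnet18_ops_bn = 4.69     # billions of operations, as quoted above
darknet_ref_ops_bn = 0.96  # billions of operations, as quoted above

ratio = resnet18_ops_bn / darknet_ref_ops_bn
share = darknet_ref_ops_bn / resnet18_ops_bn

print(f"ResNet-18 needs about {ratio:.2f}x the operations of Darknet Reference")
print(f"Darknet Reference needs about {share:.0%} of the ResNet-18 operations")
```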



### Question 2



Click on "cfg" on the table's row titled "Darknet Reference". What is the layer structure of Darknet Reference? (how many layers and which type?)

**Answer:**
The Darknet Reference cfg lists **16 layers** in total: 8 convolutional, 6 max-pooling, 1 global average-pooling, and 1 softmax layer.
The feature extractor follows a repeating pattern of convolution followed by max-pooling.
Structure: `conv-maxp-conv-maxp-conv-maxp-conv-maxp-conv-maxp-conv-maxp-conv-avgp-conv-smax`
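
A layer count like this can also be obtained programmatically. The sketch below assumes the usual Darknet cfg layout, in which every layer starts with a `[section]` header and `[net]` holds global settings rather than a layer. The function `count_layers` and the shortened `sample_cfg` text are made up for illustration and are not the real Darknet Reference file.

```python
from collections import Counter

def count_layers(cfg_text):
    """Count Darknet layer types by their [section] headers, ignoring [net]."""
    counts = Counter()
    for line in cfg_text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1]
            if name != "net":
                counts[name] += 1
    return counts

# Shortened, made-up cfg text for illustration only.
sample_cfg = """
[net]
height=256
width=256

[convolutional]
filters=16
size=3

[maxpool]
size=2
stride=2

[convolutional]
filters=32
size=3

[avgpool]

[softmax]
groups=1
"""

layer_counts = count_layers(sample_cfg)
print(dict(layer_counts), "total:", sum(layer_counts.values()))
```

To apply it to a downloaded cfg file, read the file contents into a string and pass that string to `count_layers`.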



Before proceeding to Question 3, we set up our environment by cloning the repository.


In [None]:
!git clone https://github.com/jboutell/keras-YOLOv3-model-set.git


### Question 3



The Darknet Reference network is used as the feature extractor part of YOLO v2 Tiny. Compare the YOLO v2 Tiny network structure to the Darknet Reference structure. What differences can you find?
(The YOLO v2 Tiny network can be found in `keras-YOLOv3-model-set/cfg/yolov2-tiny.cfg`)

**Answer:**
The **last 3 layers** differ significantly.
*   **Darknet Reference** ends with: `Global Average Pooling` -> `Convolution` (1000 for ImageNet) -> `Softmax` (Classification).
*   **YOLOv2 Tiny** ends with: `Convolution` -> `Convolution` -> `Region` (Detection).
Basically, the classification head is replaced by a detection head.



### Question 4



What is the purpose of the last layer of Darknet Reference? What is the purpose of the last layer of YOLO v2 Tiny?

**Answer:**
*   **Darknet Reference**: It is a classification architecture. Its last `SoftMax` layer classifies the extracted features into per-class probabilities (e.g., "This image contains a Cat").
*   **YOLO v2 Tiny**: It is an object detection architecture. Its last layer produces, for every anchor box in every grid cell, a **bounding box** [x, y, w, h], an **objectness score**, and **class probabilities**. It outputs a volume of predictions rather than a single vector.



### Question 5


Observe the model summary (layers) that is printed during model execution.


In [None]:

# We interpret the answers based on a standard YOLOv2 Tiny implementation (416x416 input).
# If we were to run the code in an environment with the library installed:
# import os
# os.chdir('keras-YOLOv3-model-set')
# # Code to load model and print summary would typically involve:
# # model = load_model(...)
# # model.summary()




**a) What is the three-dimensional output size of the last conv layer of YOLO v2 Tiny?**
**Answer:** `13 x 13 x 425`

**b) Looking at the three-dimensional output size, what is the size of the feature grid?**
**Answer:** `13 x 13`. This means the image is divided into a 13x13 grid, and each cell is responsible for detecting objects centered within it.
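
The grid size follows from the input resolution and the total downsampling of the network. The sketch below assumes the 416x416 input mentioned above and five stride-2 pooling layers, which gives the downsampling factor of 32 described in the YOLOv2 paper. Both values are assumptions for illustration, not numbers read from the model summary.

```python
input_size = 416        # assumed input resolution (pixels per side)
num_stride2_pools = 5   # assumption: five stride-2 max-pooling layers

downsample_factor = 2 ** num_stride2_pools
grid_size = input_size // downsample_factor

print(f"{input_size} / {downsample_factor} = {grid_size}")
print(f"Each grid cell covers {downsample_factor} x {downsample_factor} input pixels")
```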



### Question 6


Read the YOLO v2 paper 'YOLO9000: better, faster, stronger' and answer the following questions:



**a) How many bounding boxes at each (feature grid) cell does YOLO v2 use for object detection?**
**Answer:** The paper states there are **5 anchor boxes** (priors) per feature grid cell.

**b) The Yolov2-Tiny model in our github repository has been trained to recognize 80 object classes. Now that you know how many bounding boxes there are per (feature grid) cell, can you explain the size of the last dimension (425) of the model output?**
**Answer:**
Each grid cell outputs a vector for each of the 5 anchor boxes.
For *one* bounding box, we predict:
*   4 coordinates: $t_x, t_y, t_w, t_h$
*   1 objectness score: $P(Object)$
*   80 class probabilities: $P(Class_i | Object)$
Total per box = $4 + 1 + 80 = 85$.
With 5 boxes, the depth is $5 \times 85 = 425$.
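
The same bookkeeping can be expressed as a reshape. This is a minimal sketch on a dummy tensor. It assumes the channel layout is grouped by anchor box, with each group ordered as 4 coordinates, 1 objectness score, and 80 class scores. That ordering is an assumption here and should be checked against the decoding code.

```python
import numpy as np

num_anchors, num_classes = 5, 80
values_per_box = 4 + 1 + num_classes  # coordinates + objectness + classes

# Dummy stand-in for the raw network output of a single image.
raw = np.zeros((13, 13, num_anchors * values_per_box))

# Split the last dimension into (anchor box, per-box values).
per_box = raw.reshape(13, 13, num_anchors, values_per_box)
box_coords = per_box[..., 0:4]
objectness = per_box[..., 4:5]
class_scores = per_box[..., 5:]

print(raw.shape, per_box.shape)
print(box_coords.shape, objectness.shape, class_scores.shape)
```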

**c) Open the file configs/yolov2-tiny_anchors.txt. What is the meaning of the ten decimal values shown in this file?**
**Answer:**
The ten values represent the **width and height** of the 5 anchor boxes (priors), expressed relative to the feature grid (that is, in units of grid cells).
Pairs: $(w_1, h_1), (w_2, h_2), ..., (w_5, h_5)$.
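
A short sketch of how such a file can be turned into (width, height) pairs. The ten numbers in `anchors_text` are placeholder values used for illustration; substitute the actual contents of the anchors file. The conversion to pixels assumes a 416x416 input and a 13x13 grid.

```python
# Placeholder values for illustration; replace with the contents of the anchors file.
anchors_text = "0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828"

values = [float(v) for v in anchors_text.split(",")]
anchors = list(zip(values[0::2], values[1::2]))  # [(w1, h1), ..., (w5, h5)]

cell_px = 416 / 13  # assumed size of one grid cell in pixels
for w, h in anchors:
    print(f"{w:.2f} x {h:.2f} cells = {w * cell_px:.0f} x {h * cell_px:.0f} px")
```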



### Question 7



The Yolo v2 postprocessing is implemented in the file `yolo2/postprocessing_np.py`. Follow through the function and briefly explain what happens in the sub-parts of post-processing.

**Answer:**
*   `Yolo_decode`: Unpacks the raw tensor $(13, 13, 425)$ into separate components: box coordinates, objectness scores, and class scores. It applies the sigmoid activation to $x$, $y$, and the objectness, and exponentiates $w$ and $h$ before scaling them by the anchor dimensions.
*   `Yolo_correct_boxes`: Converts the bounding boxes from grid-relative coordinates to pixel coordinates in the actual input image.
*   `Yolo_handle_predictions`: Filters out boxes with low scores, then typically applies **Non-Maximum Suppression (NMS)** to remove overlapping duplicate boxes for the same object, keeping only the best one.
*   `Yolo_adjust_boxes`: Adjusts box aspect ratios or final corner coordinates $(ymin, xmin, ymax, xmax)$ for drawing.
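
The decoding and suppression steps described above can be sketched in NumPy as follows. This is not the repository's implementation: the function names `decode`, `iou`, and `nms` are hypothetical, the anchors are placeholders, and the sketch assumes anchors given in grid-cell units, a softmax over the class scores, and the per-box layout of 4 coordinates, 1 objectness score, then class scores. Mapping the boxes back to the original image size is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(raw, anchors, num_classes):
    """Turn a raw (S, S, A*(5+C)) tensor into image-relative boxes and per-class scores."""
    grid = raw.shape[0]
    pred = raw.reshape(grid, grid, len(anchors), 5 + num_classes)
    cols, rows = np.meshgrid(np.arange(grid), np.arange(grid))
    offsets = np.stack([cols, rows], axis=-1)[:, :, None, :]     # (S, S, 1, 2) as (x, y)
    xy = (sigmoid(pred[..., 0:2]) + offsets) / grid              # box centers in [0, 1]
    wh = np.exp(pred[..., 2:4]) * np.asarray(anchors) / grid     # box sizes relative to image
    objectness = sigmoid(pred[..., 4:5])
    logits = pred[..., 5:]
    exp_cls = np.exp(logits - logits.max(axis=-1, keepdims=True))
    class_probs = exp_cls / exp_cls.sum(axis=-1, keepdims=True)  # softmax over classes
    scores = (objectness * class_probs).reshape(-1, num_classes)
    boxes = np.concatenate([xy - wh / 2, xy + wh / 2], axis=-1).reshape(-1, 4)
    return boxes, scores  # boxes as (xmin, ymin, xmax, ymax)

def box_area(b):
    return (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])

def iou(box, others):
    """IoU between one box and an array of boxes, all as (xmin, ymin, xmax, ymax)."""
    inter_min = np.maximum(box[:2], others[:, :2])
    inter_max = np.minimum(box[2:], others[:, 2:])
    inter = np.prod(np.clip(inter_max - inter_min, 0, None), axis=1)
    return inter / (box_area(box) + box_area(others) - inter)

def nms(boxes, scores, iou_threshold=0.4):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

# Decode a random stand-in for the network output (placeholder anchors).
rng = np.random.default_rng(0)
anchors = [(1.0, 1.0), (2.0, 3.0), (4.0, 2.0), (6.0, 6.0), (9.0, 9.0)]
raw = rng.normal(size=(13, 13, 5 * 85))
boxes, scores = decode(raw, anchors, num_classes=80)
print(boxes.shape, scores.shape)

# NMS on three hand-made boxes: the second nearly duplicates the first.
demo_boxes = np.array([[0.10, 0.10, 0.50, 0.50],
                       [0.12, 0.11, 0.52, 0.49],
                       [0.60, 0.60, 0.90, 0.90]])
demo_scores = np.array([0.9, 0.8, 0.7])
kept = nms(demo_boxes, demo_scores)
print(kept)
```

In a full pipeline, low-scoring boxes would be discarded before NMS, and NMS would usually be run separately for each class.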



# Part 3: Visual Results and Discussion



To verify our model, we run it on a sample image (e.g., standard 'dog.jpg').



In [None]:

# Example code flow to run prediction (pseudocode/repo-specific)
# !python tools/image_demo.py --image images/dog.jpg --model configs/yolov2-tiny.h5
#
# import matplotlib.pyplot as plt
# img = plt.imread('output/dog_prediction.jpg')
# plt.imshow(img)
# plt.show()




**Observations:**
The model should correctly identify the dog, the bicycle, and potentially the truck in the background. The bounding boxes will display the class label and the confidence score.

**Why YOLOv2 Tiny?**
*   **Speed**: Darknet Reference needs far fewer operations than a ResNet-style backbone (about 4.9x fewer than ResNet-18), and the single-pass design avoids a separate region-proposal stage.
*   **Efficiency**: The "Tiny" variant reduces parameters further, making it viable for mobile or embedded devices.
*   **Trade-off**: It might be less accurate than full YOLOv2 or Faster R-CNN on small or crowded objects, but the speed gain is massive.

**Connection to Transposed Convolutions:**
While YOLOv2 uses `maxpool` to downsample to 13x13, semantic segmentation models (which label *every pixel*) need to go back up to the original size. They use **Transposed Convolutions** to learn how to "paint" the low-res features back onto the high-res canvas. This is crucial for understanding how we move from Detection (boxes on a grid) to Segmentation (pixel masks).



# Conclusion



We have explored the architecture of YOLOv2 Tiny, understood its output tensor shape $(13 \times 13 \times 425)$, and analyzed how raw predictions are post-processed into usable bounding boxes.



In [None]:
# !pip install imgaug
# !pip install "numpy<2.0"
!pip install tensorflow-model-optimization

In [None]:


# !git clone https://github.com/david8862/keras-YOLOv3-model-set.git/


!python ./keras-YOLOv3-model-set/yolo.py --model_type=tiny_yolo --darknet --weights_path=weights/yolov2-tiny.h5 --anchors_path=configs/yolov2-tiny_anchors.txt


2025-12-27 09:03:49.028299: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-12-27 09:03:51.325481: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
  if not hasattr(np, "object"):
Traceback (most recent call last):
  File "c:\Users\danie\Documents\Projects\ml4cv\keras-YOLOv3-model-set\yolo.py", line 17, in <module>
    from tensorflow_model_optimization.sparsity import keras as sparsity
ModuleNotFoundError: No module named 'tensorflow_model_optimization'


In [None]:

from PIL import Image
display(Image.open('output.png', mode='r'))

# First example: Nearest-neighbor upsampling mimic using Conv2DTranspose
from numpy import asarray
from keras.models import Sequential
from keras.layers import Conv2DTranspose

# define input data
X = asarray([[1, 2],
             [3, 4]])

# show input data for context
print(X)

# reshape input data into one sample with a channel
X = X.reshape((1, 2, 2, 1))

# define model
model = Sequential()
model.add(Conv2DTranspose(1, (2,2), strides=(2,2), padding='same', input_shape=(2, 2, 1)))
model.summary()

# define weights that mimic nearest neighbor upsampling
weights = [asarray([[[[1, 1],
                     [1, 1]]]]), asarray([0])]

weights[0] = weights[0].reshape(2,2,1,1)
model.set_weights(weights)

yhat = model.predict(X)

# reshape output to remove channel to make printing easier
yhat = yhat.reshape((4, 4))
print(yhat)

# Second example: Bilinear interpolation mimic using Conv2DTranspose
from numpy import asarray
from keras.models import Sequential
from keras.layers import Conv2DTranspose

# define input data
X = asarray([[1, 2],
             [3, 4]])

# show input data for context
print(X)

# reshape input data into one sample with a channel
X = X.reshape((1, 2, 2, 1))

# define model
model = Sequential()
model.add(Conv2DTranspose(1, (4,4), strides=(2,2), padding='same', input_shape=(2, 2, 1)))
model.summary()

# define weights that mimic bilinear interpolation
weights = [asarray([[[[0.0625, 0.1875, 0.1875, 0.0625],
                     [0.1875, 0.5625, 0.5625, 0.1875],
                     [0.1875, 0.5625, 0.5625, 0.1875],
                     [0.0625, 0.1875, 0.1875, 0.0625]]]]), asarray([0])]

weights[0] = weights[0].reshape(4,4,1,1)
model.set_weights(weights)

yhat = model.predict(X)

# reshape output to remove channel to make printing easier
yhat = yhat.reshape((4, 4))
print(yhat)

# Third example: Upscaling a low-resolution image (e.g., a dog) using bilinear-like transpose convolution
import cv2
from matplotlib import pyplot as plt
from numpy import asarray
from keras.models import Sequential
from keras.layers import Conv2DTranspose

# Define the model (same as second example)
model = Sequential()
model.add(Conv2DTranspose(1, (4,4), strides=(2,2), padding='same', input_shape=(32, 32, 1)))

# Bilinear-like weights
weights = [asarray([[[[0.0625, 0.1875, 0.1875, 0.0625],
                     [0.1875, 0.5625, 0.5625, 0.1875],
                     [0.1875, 0.5625, 0.5625, 0.1875],
                     [0.0625, 0.1875, 0.1875, 0.0625]]]]), asarray([0])]

weights[0] = weights[0].reshape(4,4,1,1)
model.set_weights(weights)

# Load and display original image
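img_bgr = cv2.imread('example.png', cv2.IMREAD_COLOR)  # Replace with the path to your 32x32 low-res image
if img_bgr is None:
    # cv2.imread returns None instead of raising when the file cannot be read
    raise FileNotFoundError("Could not read 'example.png'")
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)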
plt.imshow(img_rgb)
plt.axis('off')
plt.show()

# Split channels and reshape
img_r = img_bgr[:,:,2].reshape((1, 32, 32, 1))  # Note: BGR order, R is [:,:,2]
img_g = img_bgr[:,:,1].reshape((1, 32, 32, 1))
img_b = img_bgr[:,:,0].reshape((1, 32, 32, 1))

# Upscale each channel
up_r = model.predict(img_r).reshape((64, 64))
up_g = model.predict(img_g).reshape((64, 64))
up_b = model.predict(img_b).reshape((64, 64))

# Merge and display upscaled image
img_up = cv2.merge((up_b, up_g, up_r)).astype('uint8')  # Back to BGR for correct colors if needed
plt.imshow(cv2.cvtColor(img_up, cv2.COLOR_BGR2RGB))
plt.axis('off')
plt.show()

