# Object Detection with Pascal VOC and Pretrained SSD

By the end of this notebook, you will be able to:
- Explain the difference between image classification and object detection
- Interpret bounding box annotations in TFDS
- Visualize bounding boxes on images
- Run inference using a pretrained SSD MobileNet model
  - SSD = Single Shot Detector
  - MobileNet = Lightweight CNN backbone
- Understand the role of confidence scores and Non-Max Suppression (NMS)

In previous chapters, our model predicted: Input Image → One Label (e.g., Flower image → "daisy")

Object detection predicts: Input Image → Multiple (Bounding Box + Label + Confidence Score)

Each object in the image gets:
- A bounding box
- A class label
- A confidence score

Because the number of objects varies from image to image, the output has a variable length. This makes detection structurally more complex than classification.
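
The structural difference can be sketched with plain Python types. The names `Detection`, `classification_output`, and `detection_output` are hypothetical and only illustrate the shape of each output; the values are made up, not model results.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    # Normalized (ymin, xmin, ymax, xmax), the box order used in this notebook
    box: Tuple[float, float, float, float]
    label: str
    score: float


# Classification: one label for the whole image
classification_output: str = "daisy"

# Detection: a variable-length list, one entry per detected object
# (illustrative values only, not real model output)
detection_output: List[Detection] = [
    Detection(box=(0.10, 0.20, 0.80, 0.60), label="horse", score=0.92),
    Detection(box=(0.05, 0.30, 0.55, 0.50), label="person", score=0.87),
]

for det in detection_output:
    print(f"{det.label}: score={det.score:.2f}, box={det.box}")
```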


In [1]:
# Load Pascal VOC from TFDS
# We will use the Pascal VOC 2007 dataset 
# It contains 20 object classes, bounding box annotations, multiple objects per image

import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import os

tf.random.set_seed(42)

dataset, info = tfds.load(
    "voc/2007",
    split="train",
    with_info=True
)

num_classes = info.features["objects"]["label"].num_classes
class_names = info.features["objects"]["label"].names

print("Number of classes:", num_classes)
print("Class names:", class_names)


Number of classes: 20
Class names: ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']


In [4]:
# Inspect One Sample
# Bounding boxes are normalized (ymin, xmin, ymax, xmax), multiple objects per image

sample = next(iter(dataset))

image = sample["image"]
boxes = sample["objects"]["bbox"]
labels = sample["objects"]["label"]

print("Image shape:", image.shape)
print("Bounding boxes shape:", boxes.shape)
print("Labels:", [class_names[int(l)] for l in labels])


Image shape: (480, 389, 3)
Bounding boxes shape: (4, 4)
Labels: ['horse', 'person', 'horse', 'person']
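
Because the boxes are normalized to the range [0, 1], they have to be scaled by the image height and width before they can be used as pixel coordinates. A minimal sketch, using the image shape printed above and a made-up box (the helper name `to_pixel_box` is hypothetical):

```python
def to_pixel_box(box, height, width):
    """Convert a normalized (ymin, xmin, ymax, xmax) box to integer pixel coordinates."""
    ymin, xmin, ymax, xmax = box
    return (
        int(round(ymin * height)),
        int(round(xmin * width)),
        int(round(ymax * height)),
        int(round(xmax * width)),
    )


# Image shape from the sample above: height=480, width=389
# The box values are illustrative, not taken from the dataset
pixel_box = to_pixel_box((0.25, 0.0, 0.75, 1.0), height=480, width=389)
print(pixel_box)  # (120, 0, 360, 389)
```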


In [5]:
# Create Output Folder and Save Function

GROUND_TRUTH_DIR = "04a_ground_truth"
os.makedirs(GROUND_TRUTH_DIR, exist_ok=True)

print("Saving ground truth images to:")
print(os.path.abspath(GROUND_TRUTH_DIR))

# Save Several Examples

def save_image_with_boxes(image, boxes, index):
    """
    image: (H, W, 3)
    boxes: (num_boxes, 4) normalized ymin,xmin,ymax,xmax
    """

    # Convert to float32 [0,1]
    image_float = tf.cast(image, tf.float32) / 255.0
    image_batch = tf.expand_dims(image_float, axis=0)

    boxes_batch = tf.expand_dims(boxes, axis=0)

    boxed_image = tf.image.draw_bounding_boxes(
        image_batch,
        boxes_batch,
        colors=[[1.0, 0.0, 0.0]]  # red
    )

    boxed_image = tf.cast(boxed_image[0] * 255.0, tf.uint8)

    encoded = tf.io.encode_jpeg(boxed_image)

    filename = os.path.join(GROUND_TRUTH_DIR, f"ground_truth_{index}.jpg")
    tf.io.write_file(filename, encoded)

    print(f"Saved: {filename}")

for i, sample in enumerate(dataset.take(5)):
    image = sample["image"]
    boxes = sample["objects"]["bbox"]

    save_image_with_boxes(image, boxes, i)


Saving ground truth images to:
C:\Users\Jason Eckert\Documents\cv\04_detect_segment\04a_ground_truth
Saved: 04a_ground_truth\ground_truth_0.jpg
Saved: 04a_ground_truth\ground_truth_1.jpg
Saved: 04a_ground_truth\ground_truth_2.jpg
Saved: 04a_ground_truth\ground_truth_3.jpg
Saved: 04a_ground_truth\ground_truth_4.jpg


In [14]:
# Load Pretrained SSD MobileNet (Inference Only)
# NOTE:
# The warning about pkg_resources being deprecated comes from tensorflow_hub.
# It does NOT affect model loading or inference.
# It can safely be ignored.

import tensorflow_hub as hub

detector = hub.load("https://tfhub.dev/tensorflow/ssd_mobilenet_v2/2")

print(type(detector))


<class 'tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject'>


In [12]:
# Run Inference

def run_inference(image):
    # Resize to a fixed 320x320 input
    image_resized = tf.image.resize(image, (320, 320))

    # tf.image.resize returns float32, but the detector expects uint8, so cast back
    image_resized = tf.cast(image_resized, tf.uint8)

    # Add the batch dimension: (1, 320, 320, 3)
    image_resized = tf.expand_dims(image_resized, axis=0)

    results = detector(image_resized)
    return results


In [13]:
# Saving SSD Predictions

PREDICTION_DIR = "04a_ssd_predictions"
os.makedirs(PREDICTION_DIR, exist_ok=True)

print("Saving predictions to:")
print(os.path.abspath(PREDICTION_DIR))

def save_predictions(image, results, index, threshold=0.5):
    image_float = tf.cast(image, tf.float32) / 255.0
    image_batch = tf.expand_dims(image_float, axis=0)

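    # Boxes are normalized (ymin, xmin, ymax, xmax), so they can be drawn on the
    # original-size image even though inference ran on the resized copy
    boxes = results["detection_boxes"][0]
    scores = results["detection_scores"][0]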

    # Filter by confidence threshold
    mask = scores > threshold
    boxes = tf.boolean_mask(boxes, mask)

    if tf.shape(boxes)[0] == 0:
        print(f"No detections above threshold for image {index}")
        return

    boxes_batch = tf.expand_dims(boxes, axis=0)

    boxed_image = tf.image.draw_bounding_boxes(
        image_batch,
        boxes_batch,
        colors=[[0.0, 1.0, 0.0]]  # green
    )

    boxed_image = tf.cast(boxed_image[0] * 255.0, tf.uint8)

    encoded = tf.io.encode_jpeg(boxed_image)

    filename = os.path.join(PREDICTION_DIR, f"prediction_{index}.jpg")
    tf.io.write_file(filename, encoded)

    print(f"Saved: {filename}")

for i, sample in enumerate(dataset.take(5)):
    image = sample["image"]
    results = run_inference(image)

    save_predictions(image, results, i, threshold=0.5)


Saving predictions to:
C:\Users\Jason Eckert\Documents\cv\04_detect_segment\04a_ssd_predictions
Saved: 04a_ssd_predictions\prediction_0.jpg
Saved: 04a_ssd_predictions\prediction_1.jpg
Saved: 04a_ssd_predictions\prediction_2.jpg
Saved: 04a_ssd_predictions\prediction_3.jpg
Saved: 04a_ssd_predictions\prediction_4.jpg
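
The saved images show only boxes. To see which class and score belong to each box, the detector output can also be summarized as text. The sketch below is hedged: `summarize_detections` and `fake_results` are hypothetical names, the values are made up, and it assumes the output dictionary also has a `detection_classes` entry holding numeric class ids from the model's own label map (assumed here to be COCO rather than the 20 VOC classes).

```python
def summarize_detections(results, threshold=0.5):
    """Return (class_id, score, box) tuples for detections above the threshold.

    `results` is expected to map each key to a batch of size 1, mirroring
    how the cells above index results["detection_boxes"][0].
    """
    boxes = results["detection_boxes"][0]
    scores = results["detection_scores"][0]
    classes = results["detection_classes"][0]  # assumed key, numeric class ids

    kept = []
    for box, score, cls in zip(boxes, scores, classes):
        if float(score) > threshold:
            kept.append((int(cls), float(score), tuple(float(v) for v in box)))
    return kept


# Stand-in for detector output (made-up values, plain lists instead of tensors)
fake_results = {
    "detection_boxes": [[[0.1, 0.2, 0.8, 0.6], [0.0, 0.0, 0.3, 0.3]]],
    "detection_scores": [[0.91, 0.22]],
    "detection_classes": [[19.0, 1.0]],
}

for class_id, score, box in summarize_detections(fake_results):
    print(f"class_id={class_id} score={score:.2f} box={box}")
```

With real detector output, the same indexing should work on eager tensors; converting each entry with `.numpy()` first is an alternative.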


## Discussion
- Anchor boxes
- NMS (non-max suppression)
- Why detection is computationally heavier than classification
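
To give a sense of scale for the anchor-box and compute points: an SSD-style detector scores a fixed set of default (anchor) boxes at every cell of several feature maps. The sketch below counts them for the feature-map layout of the original SSD300 configuration, used here only as an illustration. The MobileNet variant loaded in this notebook uses a different layout, and `FEATURE_MAPS` and `count_anchors` are hypothetical names.

```python
# (grid_size, anchors_per_cell) for each prediction layer.
# Layout of the original SSD300 configuration, for illustration only;
# it is NOT the layout of the SSD MobileNet model used in this notebook.
FEATURE_MAPS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def count_anchors(feature_maps):
    """Total number of anchor boxes across all prediction layers."""
    return sum(grid * grid * per_cell for grid, per_cell in feature_maps)


total = count_anchors(FEATURE_MAPS)
print("Anchor boxes scored per image:", total)  # 8732
```

A classifier produces one score vector per image. A detector of this kind produces class scores and four box offsets for every anchor, then filters the results with a confidence threshold and NMS.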

## Key Detection Concepts
- Confidence Score = The model's estimated probability that the predicted box contains an object of the predicted class.
  - Lower threshold → more detections
  - Higher threshold → fewer but more confident detections
- Non-Max Suppression (NMS): among boxes that overlap heavily (high Intersection over Union, IoU), keeps the highest-scoring box and removes the rest.
- Without NMS, many boxes would cover the same object.
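
The NMS idea can be sketched in plain Python. This is a minimal greedy version on made-up boxes, not the implementation used inside the pretrained detector; `iou` and `greedy_nms` are hypothetical helper names, and the 0.5 IoU threshold is an assumption chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union for two (ymin, xmin, ymax, xmax) boxes."""
    ymin = max(box_a[0], box_b[0])
    xmin = max(box_a[1], box_b[1])
    ymax = min(box_a[2], box_b[2])
    xmax = min(box_a[3], box_b[3])

    inter = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept by greedy NMS, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Made-up boxes: the first two cover nearly the same region
boxes = [
    (0.10, 0.10, 0.50, 0.50),
    (0.12, 0.12, 0.52, 0.52),
    (0.60, 0.60, 0.90, 0.90),
]
scores = [0.90, 0.75, 0.80]

kept = greedy_nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed in favor of the higher-scoring one. In TensorFlow, `tf.image.non_max_suppression` provides a built-in version of this step.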

## Mini Exercise
- Change the threshold to 0.3 and re-run predictions.
- Increase the threshold to 0.8. What happens?
- Try .take(10) instead of .take(5).
- Compare ground truth vs predictions. Where does SSD struggle?
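
As a starting point for the threshold experiments, the sketch below counts how many detections survive at each threshold. The scores are made up and stand in for `results["detection_scores"][0]`; `count_above` is a hypothetical helper.

```python
def count_above(scores, threshold):
    """Number of detections whose score exceeds the threshold."""
    return sum(1 for s in scores if s > threshold)


# Made-up scores standing in for results["detection_scores"][0]
scores = [0.95, 0.81, 0.62, 0.47, 0.33, 0.12]

for threshold in (0.3, 0.5, 0.8):
    print(f"threshold={threshold}: {count_above(scores, threshold)} detections")
```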