## YOLO Object Detection on MP4 Videos

This Python code runs object detection on MP4 videos using a YOLO object detection model. The model predicts objects in each frame of the video and draws bounding boxes around them. The output is a new MP4 video with the bounding boxes embedded in the video.

This implementation 'slices' each full-resolution video frame into smaller, slightly overlapping square tiles (e.g., 640x640 pixels), runs prediction on each tile, and merges the detections back onto the full frame.
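To give a feel for how many tiles a frame produces, the sketch below computes a tile grid for a given frame size. It is an illustrative approximation, not the slicing library's own code: the function name `tile_boxes` and the 1920x1080 example frame are assumptions made for this example, and the 640-pixel tile size and 0.1 overlap mirror the user parameters defined later in this notebook.

```python
def tile_boxes(image_width, image_height, tile=640, overlap_ratio=0.1):
    """Return (x0, y0, x1, y1) boxes that cover the image with overlapping tiles.

    Hypothetical helper for illustration only.
    """
    step = tile - int(tile * overlap_ratio)  # distance between tile origins
    boxes = []
    y = 0
    while True:
        # Shift the last row up so the tile stays inside the image
        y0 = max(0, min(y, image_height - tile))
        y1 = min(y0 + tile, image_height)
        x = 0
        while True:
            # Shift the last column left so the tile stays inside the image
            x0 = max(0, min(x, image_width - tile))
            x1 = min(x0 + tile, image_width)
            boxes.append((x0, y0, x1, y1))
            if x + tile >= image_width:
                break
            x += step
        if y + tile >= image_height:
            break
        y += step
    return boxes


boxes = tile_boxes(1920, 1080)
print(len(boxes))  # 8 tiles: 4 columns x 2 rows
```

Each tile is a separate model call, so the tile count is a rough guide to how much slower sliced prediction is than a single full-frame prediction.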

User inputs include the input video path, the output path, the model weights to use, the classes to detect, the confidence threshold, and the slice size and overlap.

The model used is a YOLOv8 model that has been fine-tuned on the [WALDO dataset](https://huggingface.co/StephanST/WALDO30). The dataset itself is not public, but the weights of the fine-tuned model are available on Hugging Face. WALDO has been trained to identify 12 different object classes, indexed from 0: 0 = LightVehicle, 1 = Person, 2 = Building, 3 = UPole, 4 = Boat, 5 = Bike, 6 = Container, 7 = Truck, 8 = Gastank, 9 = Digger, 10 = SolarPanels, 11 = Bus.
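Because the classes to detect are passed as integer IDs, it can be less error-prone to look them up by name. The sketch below assumes the class order given in this notebook's class list; `WALDO_CLASSES` and `class_ids` are hypothetical names introduced for this example.

```python
# Class order as listed in this notebook (index = class ID)
WALDO_CLASSES = [
    "LightVehicle", "Person", "Building", "UPole", "Boat", "Bike",
    "Container", "Truck", "Gastank", "Digger", "SolarPanels", "Bus",
]


def class_ids(names, classes=WALDO_CLASSES):
    """Translate class names into their integer IDs (raises ValueError on unknown names)."""
    return [classes.index(name) for name in names]


TARGET_CLASSES = class_ids(["LightVehicle", "Person"])
print(TARGET_CLASSES)  # [0, 1]
```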

The WALDO fine-tuned weights can be downloaded from Hugging Face [here](https://huggingface.co/StephanST/WALDO30/resolve/main/WALDO30_yolov8m_640x640.pt?download=true).

After running prediction on your videos, you can choose to fine-tune the model on your own dataset to improve results. The code for that is also available in this notebook. 

In [1]:
import cv2
import numpy as np
import supervision as sv
from sahi.auto_model import AutoDetectionModel
from sahi.predict import get_sliced_prediction

In [None]:
##### User defined parameters #########
input_video_path = '/home/jgillan/Documents/yolo_drone/2_italians.mp4'
output_video_path = '/home/jgillan/Documents/yolo_drone/2_italians_predict10.mp4'
#model_path = '/home/jgillan/Documents/yolo_drone/WALDO30_yolov8m_640x640.pt'
model_path = '/home/jgillan/Documents/yolo_drone/runs/detect/train2/weights/best.pt'


TARGET_CLASSES = [0, 1]  # e.g., 0 = LightVehicle, 1 = Person
confidence_threshold = 0.5

# Tile size in pixels and fractional overlap between neighboring tiles
slice_height = 640
slice_width = 640
overlap_height_ratio = 0.1
overlap_width_ratio = 0.1

In [None]:
### Runs the prediction and outputs a new MP4 video

# Initialize the YOLOv8 model through SAHI
detection_model = AutoDetectionModel.from_pretrained(
    model_type='yolov8',  # newer SAHI releases may expect 'ultralytics' instead
    model_path=model_path,
    confidence_threshold=confidence_threshold,
    device='cuda'  # or 'cpu' if no GPU is available
)


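# Open input video and read its properties
cap = cv2.VideoCapture(input_video_path)
if not cap.isOpened():
    raise IOError(f"Could not open input video: {input_video_path}")
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)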
fourcc = cv2.VideoWriter_fourcc(*"mp4v")


# Set up output video writer
out = cv2.VideoWriter(output_video_path, fourcc, fps, (width, height))


# Create bounding box and label annotators
#box_annotator = sv.BoundingBoxAnnotator(thickness=1)
box_annotator = sv.BoxCornerAnnotator(thickness=2)
label_annotator = sv.LabelAnnotator(text_scale=0.5, text_thickness=2)



# Process each frame
frame_count = 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

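    # Perform sliced inference on the current frame using SAHI.
    # OpenCV reads frames as BGR, while SAHI treats NumPy images as RGB,
    # so convert the channel order before prediction.
    result = get_sliced_prediction(
        image=cv2.cvtColor(frame, cv2.COLOR_BGR2RGB),
        detection_model=detection_model,
        slice_height=slice_height,
        slice_width=slice_width,
        overlap_height_ratio=overlap_height_ratio,
        overlap_width_ratio=overlap_width_ratio
    )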

    # Keep only the predictions whose class ID is in TARGET_CLASSES
    object_predictions = [
        pred for pred in result.object_prediction_list if pred.category.id in TARGET_CLASSES
    ]

    # Initialize lists to hold the data
    xyxy = []
    confidences = []
    class_ids = []
    class_names = []

    # Loop over the object predictions and extract data
    for pred in object_predictions:
        bbox = pred.bbox.to_xyxy()  # Convert bbox to [x1, y1, x2, y2]
        xyxy.append(bbox)
        confidences.append(pred.score.value)
        class_ids.append(pred.category.id)
        class_names.append(pred.category.name)

    # Check if there are any detections
    if xyxy:
        # Convert lists to numpy arrays
        xyxy = np.array(xyxy, dtype=np.float32)
        confidences = np.array(confidences, dtype=np.float32)
        class_ids = np.array(class_ids, dtype=int)

        # Create sv.Detections object
        detections = sv.Detections(
            xyxy=xyxy,
            confidence=confidences,
            class_id=class_ids
        )

        # Prepare labels for label annotator
        labels = [
            f"{class_name} {confidence:.2f}"
            for class_name, confidence in zip(class_names, confidences)
        ]

        # Annotate frame with detection results
        annotated_frame = frame.copy()
        annotated_frame = box_annotator.annotate(scene=annotated_frame, detections=detections)
        annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
    else:
        # If no detections, use the original frame
        annotated_frame = frame.copy()

    # Write the annotated frame to the output video
    out.write(annotated_frame)

    frame_count += 1
    print(f"Processed frame {frame_count}", end='\r')

# Release resources
cap.release()
out.release()
print("\nInference complete. Video saved at", output_video_path)

## Training

The following code fine-tunes the YOLOv8 model on your own labeled dataset. The dataset should be in the YOLOv8 format, which includes a .yaml file with the class names and folders of images and matching label files. I recommend using Roboflow to label your images and export them in the YOLOv8 format. Because training starts from the WALDO weights, the class names and their order must be exactly the same as in the WALDO dataset:

['LightVehicle', 'Person', 'Building', 'UPole', 'Boat', 'Bike', 'Container', 'Truck', 'Gastank', 'Digger', 'SolarPanels', 'Bus']
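Since a mismatch in class order silently trains the wrong labels, it may be worth checking your dataset's class list before training. The sketch below is a minimal check under the assumption that you have already read the class names from your dataset's .yaml file into a Python list; `class_order_mismatches` is a hypothetical helper, not part of any library.

```python
WALDO_CLASSES = [
    "LightVehicle", "Person", "Building", "UPole", "Boat", "Bike",
    "Container", "Truck", "Gastank", "Digger", "SolarPanels", "Bus",
]


def class_order_mismatches(dataset_names, expected=WALDO_CLASSES):
    """Return (index, expected_name, found_name) for every position that differs."""
    mismatches = []
    for i in range(max(len(dataset_names), len(expected))):
        want = expected[i] if i < len(expected) else None
        got = dataset_names[i] if i < len(dataset_names) else None
        if want != got:
            mismatches.append((i, want, got))
    return mismatches


# Example: a dataset whose first two classes were exported in the wrong order
my_names = ["Person", "LightVehicle"] + WALDO_CLASSES[2:]
for index, want, got in class_order_mismatches(my_names):
    print(f"class {index}: expected {want!r}, found {got!r}")
```

An empty result means the names and order match, and the labels can be used as they are.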

In [None]:
import os
import shutil
import json
import random
from pathlib import Path
import cv2

from sahi.slicing import slice_coco
from ultralytics import YOLO

In [None]:
#### User Inputs #######
original_images_dir = Path("/home/jgillan/Documents/yolo_drone/drone_detect.v2i.yolov8/train/images")
original_labels_dir = Path("/home/jgillan/Documents/yolo_drone/drone_detect.v2i.yolov8/train/labels")

working_dir = Path("/home/jgillan/Documents/yolo_drone/drone_detect.v2i.yolov8/sliced")  
coco_json_path = working_dir / "coco.json"
slice_output_name = working_dir / "coco_sliced"

sliced_coco_json_path = f"{slice_output_name}_coco.json"
sliced_images_dir = working_dir / "images"
sliced_labels_dir = working_dir / "labels"

class_names = ['LightVehicle', 'Person', 'Building', 'UPole', 'Boat', 'Bike', 'Container', 'Truck', 'Gastank', 'Digger', 'SolarPanels', 'Bus']  # Edit for your classes
imgsz = 640
overlap = 0.1


In [None]:
##### Slice the training images into smaller tiles #######
# Writes the new training images and labels to working_dir.
# WARNING: working_dir is deleted and recreated on every run.

# ==== Clean output dirs ====
if working_dir.exists():
    shutil.rmtree(working_dir)
sliced_images_dir.mkdir(parents=True)
sliced_labels_dir.mkdir(parents=True)

# ==== CONVERT YOLO → COCO ====
def convert_yolo_to_coco(images_dir, labels_dir, class_name_list, output_json_path):
    images = []
    annotations = []
    ann_id = 1
    img_id = 1

    for image_file in sorted(Path(images_dir).glob("*")):
        if image_file.suffix.lower() not in ['.jpg', '.jpeg', '.png']:
            continue
        img = cv2.imread(str(image_file))
        if img is None:
            continue
        height, width = img.shape[:2]
        images.append({
            "id": img_id,
            "file_name": image_file.name,
            "width": width,
            "height": height
        })
        label_file = Path(labels_dir) / (image_file.stem + ".txt")
        if label_file.exists():
            with open(label_file) as f:
                for line in f:
                    parts = line.strip().split()
                    if len(parts) != 5:
                        continue
                    class_id, x_center, y_center, w_rel, h_rel = map(float, parts)
                    x = (x_center - w_rel / 2) * width
                    y = (y_center - h_rel / 2) * height
                    w = w_rel * width
                    h = h_rel * height
                    annotations.append({
                        "id": ann_id,
                        "image_id": img_id,
                        "category_id": int(class_id),
                        "bbox": [x, y, w, h],
                        "area": w * h,
                        "iscrowd": 0
                    })
                    ann_id += 1
        img_id += 1

    categories = [{"id": i, "name": name} for i, name in enumerate(class_name_list)]
    coco_dict = {"images": images, "annotations": annotations, "categories": categories}
    with open(output_json_path, "w") as f:
        json.dump(coco_dict, f, indent=2)

convert_yolo_to_coco(original_images_dir, original_labels_dir, class_names, coco_json_path)

# ==== SLICE COCO DATASET ====
slice_coco(
    coco_annotation_file_path=str(coco_json_path),
    image_dir=str(original_images_dir),
    output_coco_annotation_file_name=str(slice_output_name),
    ignore_negative_samples=True,
    output_dir=str(sliced_images_dir),
    slice_height=imgsz,
    slice_width=imgsz,
    overlap_height_ratio=overlap,
    overlap_width_ratio=overlap,
    verbose=True,
)

# ==== CONVERT SLICED COCO → YOLO ====
def convert_coco_to_yolo(coco_json_path, output_label_dir, class_names):
    with open(coco_json_path) as f:
        coco = json.load(f)

    output_label_dir = Path(output_label_dir)
    output_label_dir.mkdir(parents=True, exist_ok=True)

    # Build lookup dicts
    image_lookup = {img["id"]: img for img in coco["images"]}
    category_lookup = {cat["id"]: cat["name"] for cat in coco["categories"]}

    # Collect annotations per image
    annotations_per_image = {}
    for ann in coco["annotations"]:
        image_id = ann["image_id"]
        if image_id not in annotations_per_image:
            annotations_per_image[image_id] = []
        annotations_per_image[image_id].append(ann)

    for image_id, image_info in image_lookup.items():
        file_name = Path(image_info["file_name"])
        width = image_info["width"]
        height = image_info["height"]

        yolo_lines = []

        anns = annotations_per_image.get(image_id, [])
        for ann in anns:
            cat_id = ann["category_id"]
            x, y, w, h = ann["bbox"]
            # Convert to YOLO format
            x_center = (x + w / 2) / width
            y_center = (y + h / 2) / height
            w_norm = w / width
            h_norm = h / height
            class_id = class_names.index(category_lookup[cat_id])
            yolo_lines.append(f"{class_id} {x_center:.6f} {y_center:.6f} {w_norm:.6f} {h_norm:.6f}")

        if yolo_lines:
            label_path = output_label_dir / (file_name.stem + ".txt")
            with open(label_path, "w") as f:
                f.write("\n".join(yolo_lines))

# Use the function
convert_coco_to_yolo(
    coco_json_path=sliced_coco_json_path,
    output_label_dir=sliced_labels_dir,
    class_names=class_names
)



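# ==== WRITE data.yaml ====
# 'path' is the dataset root; 'train' and 'val' are resolved relative to it.
# Note: train and val point to the same tiles here, so the validation metrics
# are not an independent measure of accuracy. Use a separate split if you have one.
data_yaml_path = working_dir / "data.yaml"
with open(data_yaml_path, "w") as f:
    f.write(f"path: {working_dir}\n")
    f.write("train: images\n")
    f.write("val: images\n")
    f.write(f"nc: {len(class_names)}\n")
    f.write(f"names: {class_names}\n")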

In [None]:
## See which classes are in the WALDO fine-tuned model

model_waldo = YOLO("/home/jgillan/Documents/yolo_drone/WALDO30_yolov8m_640x640.pt")
print(list(model_waldo.names.values()))

In [None]:
### Training run ########

model = YOLO("/home/jgillan/Documents/yolo_drone/WALDO30_yolov8m_640x640.pt")  # Path to the WALDO pre-trained model
# device=[0, 1] trains on two GPUs; use device=0 for a single GPU or device='cpu'
results = model.train(data=str(data_yaml_path), epochs=10, imgsz=imgsz, batch=8, device=[0, 1])

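Once training has completed, the fine-tuned weights are saved under a 'runs/detect/' folder. In the run shown here, they were written to '/home/jgillan/Documents/yolo_drone/runs/detect/train2/weights/best.pt'. The run folder name ('train', 'train2', ...) changes with each training run, so check the end of the training log for the exact path.

You can plug this path into 'model_path' in the prediction code above to predict on the video.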
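If you train several times, the newest weights file can be located programmatically instead of editing the path by hand. This is a minimal sketch that assumes the 'runs/detect/train*/weights/best.pt' layout seen in the path above; `latest_best_weights` is a hypothetical helper name.

```python
from pathlib import Path


def latest_best_weights(runs_dir="runs/detect"):
    """Return the most recently modified best.pt under runs_dir, or None if there is none."""
    candidates = list(Path(runs_dir).glob("train*/weights/best.pt"))
    if not candidates:
        return None
    # Pick by modification time rather than folder name, since 'train10' sorts before 'train2'
    return max(candidates, key=lambda p: p.stat().st_mtime)


model_path = latest_best_weights("/home/jgillan/Documents/yolo_drone/runs/detect")
print(model_path)
```

The function returns None when no training run exists yet, so check for that before passing the result to the prediction code.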