## Computer vision: object detection  
### Introduction

You Only Look Once (YOLO) is a family of real-time object detection models that can quickly and accurately identify multiple objects within an image or video frame. Unlike traditional approaches that scan an image in multiple passes, YOLO reframes object detection as a single regression problem, meaning the model predicts both bounding boxes and class probabilities in one pass.

This makes YOLO exceptionally fast and well-suited for real-time applications such as surveillance, robotics, autonomous vehicles, and augmented reality. Applications include recognising people in a crowd, tracking moving vehicles, or spotting everyday objects in a scene. In this section, we'll explore how YOLO works, how it differs from other object detection methods, and how to use it in practice.

### Counting fish
<div style="width: 100%; height: 250px; overflow: hidden; margin: 0 auto; border-radius:3px;">
  <img src="https://upload.wikimedia.org/wikipedia/commons/b/b9/Poecilia_wingei_Campoma_male_021en_20130303.jpg"
       style="width: 100%; height: 100%; object-fit: cover; object-position: center;">
</div>

I breed tropical fish as a hobby (mainly Endlers). Every three months I have an increased population with many baby fish, meaning I need to rehome the adults. At this point it is time to send them to the local pet store.

I call the owner and ask, "How much space do you have to take the fish off my hands?" The usual reply is, "How many have you got?" It looks like I have hundreds, but counting moving objects is very difficult, so I can never give the store owner a good approximation.

In this practical, we'll show how an everyday problem can be shaped into a machine learning problem using a custom dataset. Our first step is to create our own image dataset and appropriate labels for object detection. We will then train a model to detect the same species of fish across different shapes, sizes, and patterns. Once the model has been trained, we will detect tropical fish (Endlers) in aquarium video footage.

Unlike image classification, which labels an entire image, object detection pinpoints *where* objects appear by drawing bounding boxes, and it can detect and label multiple objects in a single image. For this task we will use YOLO, which suits the problem for the following reasons:

- *Speed and Efficiency*: YOLO is designed for real-time performance. It predicts bounding boxes and class probabilities in a single network pass.
- *Unified Architecture*: The model combines object localisation and classification into a single process.
- *Great for counting*: If your goal is to detect and count multiple objects in each image, YOLO's one-pass detection is ideal.

### The YOLO architecture

The YOLO model is one of the most popular and efficient approaches for real-time object detection. Traditional systems will scan an image multiple times or divide it into smaller regions, whereas YOLO processes the entire image in a single pass through the network. This design allows it to detect objects quickly and accurately.

At a high level, the YOLO architecture is composed of three main components: the *backbone*, the *neck*, and the *head*. Each of these stages plays a specific role in the process of detecting and classifying objects within an image.

The *backbone* is the first part of the network and is responsible for extracting visual features from the raw input image. It typically consists of a deep convolutional neural network (CNN) that learns to detect edges, textures, shapes, and patterns that are useful for identifying objects. In many YOLO implementations, including YOLOv5 and YOLOv8, this backbone is built using a variant of CSPNet (Cross Stage Partial Network), which improves efficiency and accuracy by enabling a more streamlined flow of information through the network.

> *Cross Stage Partial Network (CSPNet)*
>
> When a neural network learns from an image, it passes the data through many layers to extract useful patterns. Normally, all the data goes through every single layer, which can be slow and use a lot of memory. 
>
> CSPNet solves this by splitting the data in half partway through. One half goes through a series of complex layers (to learn detailed features), while the other half skips ahead untouched. At the end, the two halves are joined back together. This means the model works faster because it processes less data in the heavy layers. It also uses less memory and is easier to train, and it still keeps important information by combining both paths.
> 
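
The split-and-rejoin idea can be illustrated with a toy sketch. This is not how CSPNet is implemented in any YOLO release: the "channels" here are plain numbers, and `heavy_layers` is a hypothetical stand-in for the expensive convolutional layers.

```python
def heavy_layers(channels):
    """Hypothetical stand-in for the expensive convolutional layers."""
    return [value * 2 + 1 for value in channels]


def csp_block(channels):
    """Split the channels, transform one half, and rejoin both halves."""
    half = len(channels) // 2
    shortcut, deep = channels[:half], channels[half:]
    # Only `deep` passes through the heavy layers; `shortcut` skips ahead untouched.
    return shortcut + heavy_layers(deep)


features = [1, 2, 3, 4]
combined = csp_block(features)
print(combined)
```

The heavy layers only ever see half of the input, which is where the savings in computation and memory come from.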

After the backbone has extracted features from the image, these are passed to the *neck*. The neck’s purpose is to combine and refine features from different layers of the backbone, allowing the model to detect objects at multiple scales. This is important because objects can appear at various sizes depending on their distance from the camera. YOLO often uses techniques like Feature Pyramid Networks (FPN) and Path Aggregation Networks (PAN) in the neck to ensure that both fine details and broader contextual features are retained and passed on to the next stage.

> *Feature Pyramid Networks (FPN) and Path Aggregation Networks (PAN)*
>
> FPN and PAN help the model "see" objects at small, medium, and large scales by blending fine details with overall structure. This makes object detection more accurate and reliable, especially in complex scenes like underwater footage.
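
To make the idea of multiple scales more concrete, the sketch below computes how large the feature-map grid is at each scale for a 640-pixel square input. The strides of 8, 16, and 32 are an assumption (they are commonly used by YOLO detectors), and `grid_sizes` is a name made up for this illustration.

```python
def grid_sizes(image_size, strides=(8, 16, 32)):
    """Return the feature-map side length for each stride (assumed strides)."""
    return {stride: image_size // stride for stride in strides}


sizes = grid_sizes(640)
total_cells = sum(side * side for side in sizes.values())

for stride, side in sizes.items():
    print(f"stride {stride:>2}: {side} x {side} grid")
print("total grid cells:", total_cells)
```

The fine grid (small stride) is better placed to pick up small fish, while the coarse grid (large stride) covers large ones.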

The final component of the YOLO architecture is the *head*, which takes the processed features and produces the actual predictions. For each region of the image (represented as a grid cell in the feature map), the head predicts one or more bounding boxes, the confidence score for each box, and the probabilities for each class (e.g., fish, crab, coral). These predictions are made in parallel for many parts of the image. A final post-processing step known as Non-Maximum Suppression (NMS) is then used to filter out duplicate detections and keep only the most confident ones.

> *Non-Maximum Suppression (NMS)*
>
> NMS is like a smart filter. It looks at all the boxes the model predicted and asks: which one is the best? It then keeps the box with the highest confidence (the one the model is most sure about) and removes the others that overlap it too much. This way, instead of showing five boxes around one fish, the model shows just one clean, accurate box, which is easier to understand and avoids clutter.
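
The filtering described above can be sketched in plain Python. This is a minimal greedy version for illustration, not the implementation used inside YOLO; the `(x1, y1, x2, y2)` box format, the 0.5 overlap threshold, and the example boxes are all assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy NMS, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes around one fish, plus one box around another fish.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

Lowering `iou_threshold` removes more overlapping boxes, which can merge fish that are swimming close together; raising it keeps more duplicates.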

Based on this, we can motivate the use of YOLO for our purposes for the following reasons:
- *Multiple objects*: a single tank has several fish. YOLO detects multiple items in one pass.
- *Real-time speed*: helps if we want to process many frames (from live video) in near real-time.
- *Classification of variations*: although this example uses one class ("Fish"), YOLO can also differentiate between multiple classes (like "Male" and "Female" fish) if needed and if we have annotated data for each class.

### The dataset

To train a YOLO object detection model, we first need to tell it what to look for, and where. This means supplying a large set of images where each object of interest (in this case, fish) is marked with a *bounding box*. A bounding box is just that: a rectangle we draw to enclose the object we wish to detect. Essentially, the model treats the contents of each box as the object, so we should draw the boxes tightly, enclosing only what is necessary.

For this project, I began by capturing a short video of three different fish tanks, each with its own background, mix of plants, lighting, and sizes of fish. These variations are important, as they help the model generalise to new environments during testing.

**Training image – frame 0**:

<img src="data/train/images/VID_20250209_150552_frame0.jpg" style="margin-left:12%; width: 75%; border-radius: 3px;"></img>

As you can see, real-world images are not perfect: the lighting conditions are not ideal, and the slow shutter speed means that many fish appear blurry due to sudden movements. We need to train our model on images like these, rather than on stock images, so that it works in a real-world context. Always base your computer vision projects on real-world data from the start.

The next step is *labelling*, where each fish in each frame is enclosed in a bounding box. You will likely need to do this manually using tools such as [LabelImg](https://github.com/tzutalin/labelImg) or cloud platforms like [Roboflow](https://roboflow.com/), which provide an easy interface for drawing boxes and assigning class labels:

<div style="display: flex; flex-wrap: wrap; gap: 10px;">
  <img src="img/roboflow_project_setup.png" style="width: 48%; border-radius: 3px;">
  <img src="img/roboflow_annotation.png" style="width: 48%; border-radius: 3px;">
<small>Credit: <a href="https://app.roboflow.com/">Roboflow</a>, a tool for annotating images and preparing labelled data.
</small>
</div>

These tools also export your annotations in the correct format. In the case of Roboflow, you can specify a model type, and output the labels specifically in YOLO's plain-text format, where each line represents a bounding box using the class ID and four normalised values:
`<class_id> <x_center> <y_center> <width> <height>`:

```
0 0.195101 0.285285 0.103041 0.081081 <- Fish 1
0 0.896959 0.585586 0.158784 0.111111 <- Fish 2
0 0.542230 0.304805 0.152027 0.090090 <- etc.
0 0.648649 0.869369 0.111486 0.138138
0 0.768581 0.536036 0.094595 0.114114
...
```
Each line holds the bounding box for one hand-labelled fish (the `<-` annotations are only there for explanation and are not part of the file). The coordinates are *normalised*, meaning they are scaled relative to the width and height of the image so that every value lies between 0 and 1. This allows the same labels to be used with images of different resolutions.
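
To see what the normalised values mean in practice, the sketch below converts one label line back into pixel corner coordinates. The helper name `yolo_to_pixels` and the 1280 x 720 frame size are assumptions made for illustration; substitute the real dimensions of your images.

```python
def yolo_to_pixels(line, img_width, img_height):
    """Convert one YOLO label line to (class_id, x1, y1, x2, y2) in pixels."""
    # Only the first five fields are used; anything after them is ignored.
    class_id, xc, yc, w, h = line.split()[:5]
    xc, w = float(xc) * img_width, float(w) * img_width
    yc, h = float(yc) * img_height, float(h) * img_height
    # YOLO stores the box centre, so step half a width/height out to each corner.
    return int(class_id), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2


# First label line from the example above; 1280 x 720 is an assumed frame size.
cls_id, x1, y1, x2, y2 = yolo_to_pixels(
    "0 0.195101 0.285285 0.103041 0.081081", 1280, 720
)
print(cls_id, round(x1), round(y1), round(x2), round(y2))
```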

#### Organising the dataset
Once labelled, the dataset is typically arranged into separate folders for images and labels, split into *training*, *validation*, and *test* sets. A typical YOLO project might look like this:

```
data/
  train/
    images/
      frame01.jpg
      frame02.jpg
      ...
    labels/
      frame01.txt
      frame02.txt
      ...
 
  valid/
    images/
      frameA.jpg
      frameB.jpg
      ...
    labels/
      frameA.txt
      frameB.txt
      ...
```

YOLO expects this structure when loading data for training. It allows the model to learn from one portion of the data (the training set) while we check its accuracy on another (the validation set), which makes overfitting easier to spot.
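
One simple way to decide which frames go into which folder is a seeded random split. The sketch below is only one possible approach: the function name `split_frames`, the 70/20/10 proportions, and the frame names are assumptions, and it only produces lists of filenames rather than moving any files.

```python
import random


def split_frames(filenames, valid_fraction=0.2, test_fraction=0.1, seed=0):
    """Shuffle filenames and split them into train, valid, and test lists."""
    shuffled = sorted(filenames)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_valid = int(len(shuffled) * valid_fraction)
    n_test = int(len(shuffled) * test_fraction)
    return {
        "valid": shuffled[:n_valid],
        "test": shuffled[n_valid:n_valid + n_test],
        "train": shuffled[n_valid + n_test:],
    }


frames = [f"frame{i:02d}.jpg" for i in range(1, 21)]
splits = split_frames(frames)
print({name: len(files) for name, files in splits.items()})
```

Because neighbouring video frames look very similar, a random split can place near-identical frames in both training and validation sets; splitting by video or by tank is a stricter alternative.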

Labelling a few frames manually is straightforward, but to train a good model, you’ll need hundreds, ideally thousands, of labelled examples. A good baseline is around *300 labelled fish*, spread across a variety of frames and conditions. This provides enough variety for the model to learn meaningful patterns while still being manageable with common labelling tools. If needed, you can augment your dataset with flipped, rotated, or colour-adjusted images to further boost generalisation.


### Loading the dataset
YOLO expects a `.yaml` file that tells it where to find the training, validation, and (optionally) test images, how many classes there are, and the names of those classes. Our `data.yaml` file is shown below: `train`, `val`, and `test` give the paths to the image folders, `nc` is the number of classes, and `names` lists the class labels (we have only one class, `Fish`):

```yaml
train: ./train/images/
val: ./valid/images/
test: ./test/images/

nc: 1
names: ['Fish']
```
### Install Python libraries

In [None]:
!python -m pip install ultralytics lap

In our code we define each of the paths pointing to our images and labelled data:

In [None]:
import os

# Define paths for dataset
output_base = "data/"
split_paths = {
    "train": {
        "images": os.path.join(output_base, "train/images"), 
        "labels": os.path.join(output_base, "train/labels")
    },
    "valid": {
        "images": os.path.join(output_base, "valid/images"), 
        "labels": os.path.join(output_base, "valid/labels")
    },
    "test": {
        "images": os.path.join(output_base, "test/images"), 
        "labels": os.path.join(output_base, "test/labels")
    }
}
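
Before training, it can be worth checking that every image has a matching label file, since an image without one is typically treated as containing no objects. The helper below is a hypothetical sketch (the name `find_unlabelled` and the `.jpg`/`.txt` extensions are assumptions), demonstrated on a temporary folder so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def find_unlabelled(images_dir, labels_dir):
    """Return names of .jpg images that have no matching .txt label file."""
    label_stems = {path.stem for path in Path(labels_dir).glob("*.txt")}
    return sorted(
        path.name
        for path in Path(images_dir).glob("*.jpg")
        if path.stem not in label_stems
    )


# Demonstration on a throwaway folder with two images and one label file.
with tempfile.TemporaryDirectory() as tmp:
    images_dir = Path(tmp, "images")
    labels_dir = Path(tmp, "labels")
    images_dir.mkdir()
    labels_dir.mkdir()
    for name in ("frame01.jpg", "frame02.jpg"):
        (images_dir / name).touch()
    (labels_dir / "frame01.txt").touch()
    missing = find_unlabelled(images_dir, labels_dir)

print(missing)
```

On the real dataset, the same function could be called with `split_paths["train"]["images"]` and `split_paths["train"]["labels"]`.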

### The model

We'll train our object detection model using the *Ultralytics YOLOv8* library in Python. Before training, we'll need to give the model a few parameters so it knows how to handle our data and where to save its progress:

- `data`: This is the path to our `.yaml` file that tells YOLO where our images and labels are stored, and what classes it should learn to detect, in our case `Fish`.

- `epochs`: This sets how many full passes the model should make through our training data. More epochs usually lead to better learning, though after a while, the improvements may level off. We have set this to 50 for reasonable results, but you can tweak this depending on your machine and time.

- `imgsz`: This is the size (in pixels) that all images are resized to before training. YOLO models expect square inputs, so for instance, `imgsz=640` means all images become 640 × 640.

- `project` and `name`: YOLO will create a folder named after our `project` and save all results (like model weights, logs, and training graphs) inside a subfolder with the specified `name` (you can have a look in this folder when training finishes).

- `batch`: This sets how many images are processed at once. Larger batch sizes can speed up training but require more memory, so you may need to experiment to find what your machine can handle comfortably.
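
The relationship between `epochs` and `batch` can be shown with a little arithmetic. The figure of 120 training images below is hypothetical, chosen only to make the numbers easy to follow.

```python
import math


def iterations_per_epoch(num_images, batch_size):
    """Number of batches needed to see every training image once."""
    return math.ceil(num_images / batch_size)


num_images = 120  # hypothetical number of training images
batch_size = 8
num_epochs = 50

per_epoch = iterations_per_epoch(num_images, batch_size)
total_updates = per_epoch * num_epochs
print(f"{per_epoch} batches per epoch, {total_updates} batches in total")
```

Halving the batch size doubles the number of batches per epoch, so each epoch takes more steps while using less memory per step.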

Another thing to note is that YOLO comes in a number of different sizes. Therefore, you can change the model you will be starting from by tweaking the filename of the model (see code below):

| *Model filename* | *Size* | *Description* |
|----------------|----------------|------------------|
| `yolov8n.pt`   | Nano         | The smallest and fastest model. Good for very limited hardware (e.g. Raspberry Pi), but less accurate. |
| `yolov8s.pt`   | Small        | Still lightweight, but with better accuracy. Suitable for basic tasks or real-time video. |
| `yolov8m.pt`   | Medium       | A good middle-ground. More accurate than small, but needs more computing power. |
| `yolov8l.pt`   | Large        | Higher accuracy, slower speed. Suitable for desktop GPUs and demanding tasks. |
| `yolov8x.pt`   | Extra large  | The most accurate, but also the slowest and most resource-hungry. For high-end systems. |

We will use the medium-sized model `yolov8m.pt` for this practical. YOLO will download this file when we perform training. You can choose a different model based on your hardware. Now that we are set up, let's run the training (this may take an hour or two):

In [None]:
import os
import shutil

from ultralytics import YOLO

project_name = 'FishDetector/'
train_run_name = 'fish_model'

# Check if an older folder for a previous training run exists, and delete the contents
if os.path.exists(project_name):
    print("Deleting previous training run:", project_name)
    shutil.rmtree(project_name)


# Create the YOLO model; choose a suitable variant such as 'yolov8n.pt', 'yolov8s.pt', or 'yolov8m.pt'.
model = YOLO('yolov8m.pt')

# Train the model
num_epochs = 50
results = model.train(
    data='data.yaml',      # path to our data configuration file
    epochs=num_epochs,     # number of training epochs
    imgsz=640,             # image size
    project=project_name,  # folder for training output
    name=train_run_name,   # unique name for this training run
    batch=8                # adjust the batch size based on your GPU memory
)


As the model trains, you'll see regular updates with numbers that show how it's doing. These include precision (how many detections are correct), recall (how many true objects it found), mAP (a summary of overall performance), and loss (how far off the model's guesses are). These help you tell whether the model is actually learning to spot the objects we want it to detect.
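
Precision and recall are easy to compute by hand once detections have been matched against the labelled boxes. The counts below are made up purely to illustrate the formulas; they are not results from this model.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from matched detection counts."""
    detected = true_positives + false_positives   # everything the model reported
    actual = true_positives + false_negatives     # everything that was really there
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Made-up example: 18 fish found correctly, 2 false alarms, 6 fish missed.
precision, recall = precision_recall(
    true_positives=18, false_positives=2, false_negatives=6
)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

For counting fish, low recall leads to undercounting and low precision leads to overcounting, so both matter.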


### Evaluating the model
YOLO performs validation during training (using the validation set specified in your YAML). After training completes, you can also evaluate the model from the command line, pointing `model` at the best weights achieved (check the file exists first; it is in the project folder created by YOLO, `FishDetector/fish_model/weights/`). By default `mode=val` uses the validation split; to evaluate on the test set instead, add `split=test` to the command:

In [None]:
!yolo task=detect mode=val model=FishDetector/fish_model/weights/best.pt data=data.yaml

This command evaluates your best-trained weights on the chosen split and prints the final metrics. When it completes, the resulting plots are saved in a new run folder (by default under `runs/detect/`), including a sample batch of validation images with their labels:
<img src="./runs/detect/val/val_batch0_labels.jpg"></img>


### Prediction
Now that our model has been trained to recognise fish, the next step is to put it to use by running predictions on new images, a process known as inference. The model analyses a new image and returns a set of detections, which are the objects it believes it has found in that image.

Each detection includes a few important pieces of information. First, there is a bounding box: a rectangle, produced by the model, that indicates where the object is in the image. Alongside this, the model provides a confidence score, which tells us how sure it is that the object is what it thinks it is. Finally, the model outputs a class label describing the type of object detected.

To achieve this in practice, we load the image and pass it to the model using `results = model(img_path)`, where `img_path` points to a specific image in our collection. This runs the model on the image and gives us a result object. Even though we're only using one image here, the result is returned in a list, so we access the first item with `results[0]`.

Inside this result object is a collection of all the bounding boxes for each detected fish. We can count how many there are by simply checking the length of this list with `len(detections)`. Each item in this list contains the prediction details for one detected object:

In [None]:
# Test image path – this is the image we will test for fish detection
img_path = './data/test/images/VID_20250209_150552_frame49.jpg'

# Run the model on the image; this performs inference and returns predictions
results = model(img_path)  # The model detects objects and returns a list of results

# Take the first result (since we're only passing one image)
result = results[0]

# Extract the list of detected bounding boxes from the result
detections = result.boxes

# Count how many objects (e.g. fish) have been detected in the image
count_fish = len(detections)
print(f"Number of fish detected: {count_fish}")

# Loop through each detection to print out details
for det in detections:
    # Get the predicted class ID (e.g. 0 for fish) and convert it from tensor to integer
    cls_id = int(det.cls[0].item())

    # Get the confidence score (between 0 and 1), also converting it from tensor to float
    conf = det.conf[0].item()

    # Look up the readable class name using the class ID
    label = model.names[cls_id]

    # Print a summary of what was detected and how confident the model is
    print(f"Detected {label} with confidence {conf:.2f}")
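
Since the goal is a count, low-confidence detections matter: each one adds a fish that may not be there. A minimal sketch of counting only the confident detections is shown below. The helper `count_confident`, the 0.5 threshold, and the confidence values are all made up for illustration; in the cell above, the values would come from `det.conf` for each detection.

```python
def count_confident(confidences, threshold=0.5):
    """Count detections whose confidence is at or above the threshold."""
    return sum(1 for conf in confidences if conf >= threshold)


# Made-up confidence scores standing in for one image's detections.
confidences = [0.92, 0.88, 0.41, 0.67, 0.12]

print("All detections:", len(confidences))
print("Confident detections:", count_confident(confidences))
```

A higher threshold gives a more conservative count, at the risk of missing blurry or partly hidden fish.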


### Visualisation
To turn these predictions into something more human-friendly, we can overlay the bounding boxes directly onto the original image. This involves drawing the bounding box returned for each detected fish, along with the class label and confidence score. For instance, the resulting label might say `Fish 0.92` to indicate a high-confidence prediction.

The `result` object returned by the model has a built-in `.plot()` method that does this drawing for us. For each detection it takes the predicted class ID (a number such as 0 or 1), converts it into a readable label like `Fish` using the model's internal dictionary of class names, and draws it with the confidence score (a number between 0 and 1). The annotated image comes back in BGR channel order, so we convert it to RGB before displaying it with matplotlib:

In [None]:
import cv2
import matplotlib.pyplot as plt

%matplotlib inline

# Our test image
img_path = './data/test/images/VID_20250209_150552_frame49.jpg'

# Run prediction on a single image
results = model(img_path)

# Take the first result
result = results[0]

# Use the built-in .plot() method to draw bounding boxes
img_with_boxes = result.plot()

# Display the result using matplotlib (convert from BGR to RGB)
plt.figure(figsize=(15, 6))

plt.imshow(cv2.cvtColor(img_with_boxes, cv2.COLOR_BGR2RGB))

plt.axis('off')
plt.title("Model predictions")

plt.show()

This kind of visual feedback is particularly important when working with image-based models, because it gives us an immediate sense of whether the model is working correctly. For example, if the bounding boxes are clearly around fish in the image, that's a good sign. But if boxes are placed in empty areas, or the model consistently misses fish, then we may need to revisit our training data or model settings.

Aside from images, a useful application is to be able to provide a live stream to the model, and show the bounding boxes in real-time (we'll use a pre-recorded video to simulate this):

In [None]:
import cv2

# Set input/output video paths
input_path = './data/video/fishtestvideo.mp4'              # Path to input video file
output_path = './data/video/annotated_fish_video.mkv'      # Path where output video will be saved

# Open the input video using OpenCV
cap = cv2.VideoCapture(input_path)

# Get video metadata: frames per second and frame dimensions
fps = cap.get(cv2.CAP_PROP_FPS)                            # Frame rate of input video
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))             # Frame width in pixels
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))           # Frame height in pixels

# Set up output video writer using a codec compatible with MKV
fourcc = cv2.VideoWriter_fourcc(*'XVID')                   # 'XVID' codec works well with .mkv containers
out = cv2.VideoWriter(output_path, fourcc, fps, (width, height))  # Create a video writer object

# Define the object classes we want to track (e.g. only 'fish')
target_classes = ['fish']  

frame_count = 0  # Initialise frame counter

# Read and process video frame by frame
while cap.isOpened():
    ret, frame = cap.read()  # Read a single frame
    if not ret:
        break                # Stop if no frame is returned (end of video)

    # Run YOLO model inference with tracking on the current frame
    results = model.track(frame, persist=True)  # 'persist=True' helps maintain tracking IDs over time

    # Extract prediction results for this frame
    result = results[0]  # We use the first result (only one frame at a time)

    # Make a copy of the frame to draw annotations on
    annotated_frame = frame.copy()

    # Loop over all detected bounding boxes in the frame
    for box in result.boxes:
        cls_id = int(box.cls[0])                # Get class ID (as integer)
        label = model.names[cls_id]             # Convert class ID to our human-readable label

        # Skip objects that are not in our target_classes list
        if label.lower() not in target_classes:
            continue

        conf = float(box.conf[0])               # Extract the confidence score for the detection
        track_id = int(box.id[0]) if box.id is not None else -1  # Get unique tracking ID (if available)

        # Get bounding box coordinates and convert to integers
        x1, y1, x2, y2 = map(int, box.xyxy[0])  # top-left and bottom-right corner coordinates

        # Draw a green rectangle around the detected object
        cv2.rectangle(annotated_frame, (x1, y1), (x2, y2), (0, 255, 0), 2)

        # Construct label text showing the class, confidence, and tracking ID
        label_text = f"{label} {conf:.2f}  ID:{track_id}"

        # Draw the label text just above the bounding box
        cv2.putText(annotated_frame, label_text, (x1, y1 - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

    # Write the annotated frame to the output video file
    out.write(annotated_frame)

    # Update frame counter and print progress
    frame_count += 1
    print(f'Processed frame {frame_count}', end='\r')  # Inline progress indicator

# Clean up: release input and output video handles
cap.release()
out.release()
cv2.destroyAllWindows()

print(f"\nVideo saved to: {output_path}")


### What have we learnt?
We provided an overview of the YOLO model architecture and saw how a single-pass design combines localisation and classification into one network pass, speeding up detection.
Instead of scanning multiple regions, YOLO divides the image into a grid and makes predictions for every cell directly, which makes it efficient enough for real-time object detection.

In the case of counting fish, this model is just a first step towards a more interesting application, for example rapid population counting, where we tally the fish in each tank and take an average over a window of time. We could monitor many different tanks to keep track of population changes.
A more extended analysis could use more classes (male, female, or different fish types), or track fish health through additional images of diseased fish (not that I have any).
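
Averaging over a window of time can be sketched with a rolling mean of the per-frame counts. The function name `rolling_average`, the window of three frames, and the counts themselves are made up for illustration; in practice each count would be the number of boxes detected in a frame.

```python
from collections import deque


def rolling_average(counts, window=5):
    """Average each count with up to `window - 1` counts before it."""
    recent = deque(maxlen=window)  # old counts drop out automatically
    averages = []
    for count in counts:
        recent.append(count)
        averages.append(sum(recent) / len(recent))
    return averages


# Made-up per-frame fish counts from consecutive video frames.
frame_counts = [10, 12, 11, 9, 13, 12]
smoothed = rolling_average(frame_counts, window=3)
print([round(value, 2) for value in smoothed])
```

Smoothing like this dampens the frame-to-frame jitter caused by fish briefly hiding behind plants or each other.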

YOLOv8 provides an updated interface and strong default performance, with models of varying size to suit a range of hardware, so it is a good place to start. Consider experimenting with some of the hyperparameters (image size, batch size, epochs) to find the best balance of speed and accuracy. You might also try your own object detection task, using some of the tools we have mentioned to create a custom labelled dataset.