<img src="https://drive.google.com/uc?export=view&id=1x-QAgitB-S5rxGGDqxsJ299ZQTfYtOhb" width=180, align="center"/>

Master's degree in Intelligent Systems

Subject: 11754 - Deep Learning

Year: 2023-2024

Professor: Miguel Ángel Calafat Torrens

This notebook has been correctly executed with the versions below on Google Colab:

- python=3.10.12
- ultralytics=8.0.201
- ffmpeg=1.4

Feel free to create a virtual environment with these versions to make your own executions locally.

# FINAL LAB

**Object Detection** is a computer vision task that involves identifying and locating objects of interest within an image or video frame. Unlike image classification, where the task is to assign a label to an entire image, object detection aims to identify multiple objects and their locations within the same image.

**How Object Detection Works:**

**Bounding Boxes:** Objects are typically represented by bounding boxes. Each bounding box is defined by coordinates (usually the top-left and bottom-right corners) that encapsulate the object.

**Class Labels:** Along with the bounding box, each detected object is associated with a class label (e.g., "person", "car", "bicycle").

There are various neural network architectures designed specifically for object detection. Some popular ones include:

 * R-CNN and its variants (Fast R-CNN, Faster R-CNN): These methods use region proposal networks to suggest potential object locations and then classify each region.

 * YOLO (You Only Look Once): This approach divides the image into a grid and predicts bounding boxes and class probabilities for each grid cell in a single pass.

 * SSD (Single Shot MultiBox Detector): Like YOLO, SSD detects objects in a single forward pass of the network but uses multiple feature maps at different scales to detect objects of various sizes.

In this final lab we will be using YOLO as an inference pipeline.

YOLO is a reference in computer vision, so you have probably heard of it.

Let's start with a bit of history.

## Introduction

Below is a brief summary of YOLO history. Perhaps you are interested in going deeper into some of the aspects mentioned below. Please, see [this link](https://deci.ai/blog/history-yolo-object-detection-models-from-yolov1-yolov8/) for more information or read [this paper](https://arxiv.org/pdf/2304.00501.pdf).

### YOLOv1 ([paper](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf))

Before the advent of YOLO, traditional object detection methods relied on a two-step process: first, they would generate potential bounding boxes in the image, and then they would classify each box. This approach was not only computationally intensive but also slower in real-time scenarios. The need to process multiple regions in an image and then classify each one made these methods less efficient and less suited for real-time applications.

YOLOv1 revolutionized object detection in June 2016 by introducing a single-stage process. Instead of treating the detection problem as a two-step process (region proposal followed by classification), YOLO proposed a unified model that looked at the image only once and predicted bounding boxes and class probabilities in one forward pass. This approach was not only faster but also more efficient, making real-time object detection feasible. The original YOLO model emphasized high-speed object detection, making it stand out from its predecessors.

<img src="https://drive.google.com/uc?export=view&id=1Z718kU6VqjFlPImbkiQoM3Z1GxGEd9_r" width=500, align="center"/>

Source: https://pjreddie.com/darknet/yolo/

### YOLOv2
YOLOv2 (July 2017) incorporated batch normalization and added a high-resolution classifier, resulting in better performance for higher-resolution inputs.

Furthermore, it incorporated anchor boxes. One of the main drawbacks of YOLOv1 was its poor localization performance, since bounding boxes were learned entirely from data. YOLOv2 introduced anchor boxes (priors) to aid localization. This change allowed a cell to predict different objects with different boxes, enhancing recall.

### YOLOv3
YOLOv3 (April 2018) enhanced accuracy and efficiency by introducing multi-scale detection, a new architecture, and changes in prediction strategy and loss function.

* Multi-scale detection: Detects objects of various sizes by predicting at three different scales, each with its own set of anchor boxes.
* Darknet-53 Architecture: It incorporates a new hybrid of Darknet-19 and ResNet architecture.
* Class Prediction: It uses logistic classifiers instead of Softmax for better multi-label classification.
* Loss Function: It modified the loss function to balance objectness score and classification loss.

### YOLOv4
YOLOv4 (April 2020) builds upon the foundation of YOLOv3 by introducing several advancements in object recognition, making it faster and more accurate.

Among other changes, it also incorporates new strategies to improve its performance without significantly increasing computational cost, such as Bag of Freebies (BoF) and Bag of Specials (BoS). BoF includes methods that don't add extra inference cost but improve training, while BoS includes methods that might add a little inference cost but provide a significant boost in performance.

### YOLOv5
YOLOv5 (June 2020) was released just a couple of months after its predecessor, and there was some [controversy about the name](https://news.ycombinator.com/item?id=23480884).

This YOLO version was released by a company called Ultralytics and it was available through a [GitHub repository](https://github.com/ultralytics/yolov5). This approach made it more accessible to the developer community, allowing for immediate implementation and feedback.
One key point is that it was developed using [PyTorch](https://pytorch.org/) instead of [DarkNet](https://pjreddie.com/darknet/).

### YOLOv6

YOLOv6 (September 2022) demonstrated stronger performance than its predecessor, YOLOv5, especially when benchmarked against the MS COCO dataset.

It introduced significant changes to its architecture:

* EfficientRep Backbone and Rep-PAN Neck: YOLOv6 redesigned the YOLO backbone and neck with hardware efficiency in mind. The model introduced the EfficientRep Backbone and a Rep-PAN Neck.
* Decoupled Head: Unlike previous YOLO models where classification and box regression heads shared the same features, YOLOv6 decoupled the head. This separation of features led to an empirical increase in model performance.
* Training Enhancements: The YOLOv6 repository incorporated improvements to the training pipeline, including anchor-free training, SimOTA label assignment, and SIoU box regression loss.


### YOLOv7

YOLOv7 (July 2022) introduced several key enhancements over its predecessors:

* Extended Efficient Layer Aggregation (E-ELAN): Optimized the efficiency of convolutional layers for faster inference.
* Model Scaling Techniques: Scaled the network depth and width together for optimal model architectures across different sizes.
* Re-parameterization Planning: Used gradient flow paths to determine which network modules should employ re-parameterization strategies.
* Auxiliary Head Coarse-to-Fine Approach: Added a mid-network auxiliary head with varying supervision levels to improve training efficiency.

These improvements aimed to boost YOLOv7's object detection accuracy while maintaining competitive inference speeds.

### YOLOv8

[Ultralytics](https://www.ultralytics.com/), the company behind YOLOv5, released YOLOv8 in January 2023. As of now, there isn't a detailed paper discussing the specifics of YOLOv8, but it's evident that with each iteration, the YOLO series has aimed to strike a balance between accuracy and speed, making it one of the most preferred object detection inference pipelines in the industry.

Furthermore, this inference pipeline is not only designed for object detection, but it also incorporates image classification, instance segmentation, pose estimation and object tracking.

YOLOv8's Unique Features:
* Anchor-Free Detection: Unlike previous versions that relied on anchor boxes, YOLOv8 is an anchor-free model. This means it predicts the center of an object directly, eliminating the complexities associated with anchor boxes.
* New Convolutions: The model introduced changes in its convolutional layers, including replacing the stem's first 6x6 convolution with a 3x3 and introducing a new module called C2f, which offers a more streamlined approach than the previous C3 module.
* Mosaic Augmentation: This augmentation technique stitches four images together, enhancing the model's ability to recognize objects in varying locations and conditions. However, it's turned off for the last ten training epochs to optimize performance.

Developer Experience:
* YOLOv8 introduces a user-friendly interface via a PIP package, making it easier for developers to integrate and use the model.
* The [YOLOv8 code repository](https://github.com/ultralytics/ultralytics) is designed for community use and iteration, with the expectation of continuous improvements and updates.

## Recommendations

This practice can be done just using the information from Ultralytics that you'll find [in its docs](https://docs.ultralytics.com/) and [in its repo](https://github.com/ultralytics/ultralytics). I highly recommend that you check it out and spend some time getting familiar with it.

## Set up

So now that you know what this is all about, let's start.

In [None]:
# Select your path as in the practices and execute it
MY_GDRIVE_PATH = '/content/gdrive/MyDrive/Colab Notebooks/2023-2024-Lab.DL/FinalProject'

# Connect to your drive
from google.colab import drive
drive.mount('/content/gdrive')
%cd {MY_GDRIVE_PATH}
%ls -l

# Here the path of the project folder (which is where this file is) is inserted
# into the python path.
from pathlib import Path
import sys

PROJECT_DIR = str(Path().resolve())
sys.path.append(PROJECT_DIR)

In [None]:
# Install YOLOv8
%pip install ultralytics==8.0.201

# Also install this library to record, convert and stream video
%pip install ffmpeg==1.4

from ultralytics import YOLO, checks
checks()

In [None]:
# Feel free to use more libraries
import os
import random
import numpy as np
import matplotlib.pyplot as plt
import project_helper as ph
from importlib import reload

In [None]:
# Use it if you modify project_helper.py
# reload(ph)

## Datasets and annotation type in Object Detection

In YOLO object detection, the annotation format typically used for training is called the **Darknet annotation format**.

The Darknet annotation format consists of **a text file for each image** in your training dataset. Each text file contains one line per object (bounding box) present in the corresponding image. Each line follows the format shown in the following figure.

<img src="https://drive.google.com/uc?export=view&id=1C4skRQnV2b9CA0ATj5KjNY-0jwCNnIE9" width=700, align="center"/>


`<object-class>`: This is an integer representing the class label or ID of the object detected in the image. Class IDs typically start from 0 and increase for each unique object class.

`<x>`: The x-coordinate of the center of the bounding box, normalized to the width of the image. This means that 0 represents the left edge of the image, and 1 represents the right edge.

`<y>`: The y-coordinate of the center of the bounding box, normalized to the height of the image. 0 represents the top edge of the image, and 1 represents the bottom edge.

`<width>`: The width of the bounding box, also normalized to the width of the image.

`<height>`: The height of the bounding box, normalized to the height of the image.

Here's an example of what an entry in a Darknet annotation file might look like (note that numbers are separated by single spaces):

```
0 0.31 0.22 0.08 0.77
```
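
To make the normalization concrete, here is a minimal sketch that turns one annotation line back into pixel corner coordinates. The function name `yolo_to_pixel_box` and the image size used in the call are assumptions made for illustration; they are not part of `project_helper` or of Ultralytics.

```python
def yolo_to_pixel_box(line, img_width, img_height):
    """Convert one Darknet line to (class_id, x1, y1, x2, y2) in pixels."""
    class_id, xc, yc, w, h = line.split()

    # Undo the normalization: x values scale with the image width,
    # y values with the image height
    xc, w = float(xc) * img_width, float(w) * img_width
    yc, h = float(yc) * img_height, float(h) * img_height

    # Move from (center, size) to (top-left, bottom-right) corners
    return int(class_id), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2


# Hypothetical annotation line for a 640x480 image
print(yolo_to_pixel_box("0 0.5 0.5 0.2 0.4", img_width=640, img_height=480))
```

Storing normalized values means the same annotation stays valid if the image is resized, which is why the pixel size has to be supplied when converting back.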

In [None]:
# Create a folder in which we'll leave the images
# used for inference

# Create inference folder
!mkdir inference

# Create inference/photos folder
!mkdir inference/photos

# Go to inference folder
%cd inference/photos

# Download an example photo in this folder
!gdown "1uzWLdk_13EdyOaVxlvIoo3AMSoLXbHJg" -O example01.jpg

# And go back to the project folder
%cd ../..

In [None]:
# Build the path to the example photo (a plain string, not a Path object)
path_object = os.path.join(PROJECT_DIR, "inference/photos/example01.jpg")

# Display the original photo
ph.show(path_object)

In [None]:
# Load a pretrained YOLOv8n detection model
model = YOLO('yolov8n.pt')

In [None]:
# Predict with the model on the image downloaded
results = model(source=path_object,
                show=False,
                conf=0.3,
                save=True)

In [None]:
# As informed in the output cell above, the results
# have been saved
print(f'Results saved to {results[0].save_dir}')

In [None]:
# Display the results
ph.show(results)

Explore the 'results' object. You'll find that it holds quite a lot of information, as you can see in [the documentation](https://docs.ultralytics.com/reference/engine/results/#ultralytics.engine.results.Boxes).

In [None]:
print(results[0].boxes)

Once explored, answer the following questions using code (and **extracting the info from 'results'**):

In [None]:
# How many bounding boxes have your model found?
# Of course you know the solution is 5, but I don't want you to just type 5,
# you have to extract the number of boxes from 'results'
val = len(results[0].boxes)
# Equivalent alternative using the shape of the boxes tensor
val = results[0].boxes.shape[0]
print(f'The number of bounding boxes found is: {val}')

In [None]:
# What's the width of the third bounding box in the range [0-1] (relative to
# the width of the image). Return the result as a single value, not a tensor or
# a numpy array.

b3 = 2  # The third bounding box index
val = results[0].boxes.xywhn[b3, 2].item()
print(f'The width of the third bounding box is: {val}')

In [None]:
# What are the coordinates (in pixels) of the center of the second bounding box
# found in the figure? Return the result as a tuple of values, e.g.
# "(928.5, 1486.2)", and not as a tensor or a numpy array.
b2 = 1  # The second bounding box index
val = tuple(results[0].boxes.xywh[b2, :2].tolist())
print(f'The coords of the center of the second bounding box are: {val}')

In [None]:
# What is the model's confidence in the result of the first bounding box?
# Return the result as a single value, not a tensor or a numpy array.
b1 = 0  # The first bounding box index
val = results[0].boxes.conf[b1].cpu().item()
print(f'The confidence of the first bounding box is: {val}')

In [None]:
# Ok, now extract the same data asked in the last 4 cells above
# but in this case get it from the raw bbox tensor 'data'

img_shape = results[0].boxes.orig_shape
dt = results[0].boxes.data.cpu()

# How many bounding boxes are there in the image?
val1 = dt.shape[0]
print(f'The number of bounding boxes found are: {val1}')

# What's the width of the third bounding box in the range [0-1]?
b3 = 2  # The third bounding box index
val2 = ((dt[b3, 2] - dt[b3, 0]) / img_shape[1]).item()
print(f'The width of the third bounding box is: {val2}')

# What coordinates (in pixels) has the center of the second bounding box found
# in the figure. Return the result as a tuple of values.
b2 = 1  # The second bounding box index
val3 = (((dt[b2, 2] - dt[b2, 0]) / 2 + dt[b2, 0]).item(),
        ((dt[b2, 3] - dt[b2, 1]) / 2 + dt[b2, 1]).item())
print(f'The coords of the center of the second bounding box are: {val3}')

# What is the model's confidence in the result of the first bounding box?
b1 = 0  # The first bounding box index
val4 = dt[b1, 4].item()
print(f'The confidence of the first bounding box is: {val4}')

You've already seen a use case of object detection with YOLO. In this case the model has detected persons, chairs and a potted plant. But what classes can it detect by default?

To see the full list of objects detectable by YOLOv8 by default, you just have to check the dataset on which it was trained: [the COCO dataset](https://docs.ultralytics.com/datasets/detect/coco/#dataset-yaml).

You have the coco.yaml file in the folder you have been given, so extract the conversions dict and list with the code below.

In [None]:
# Extract data from yaml file
id2class, class2id = ph.load_classes_from_yaml('coco.yaml')

In [None]:
# See the class names found in the image
[id2class[int(cls_id)] for cls_id in results[0].boxes.cls.cpu().tolist()]

Most of the time, public datasets come already annotated. Let's work through a real example of adapting annotations.

First of all, log in to your Kaggle account. Don't have one? Don't worry. Just go to [Kaggle](https://www.kaggle.com) and sign up. You can do it directly with your Google account.

Once in Kaggle click on _datasets_ on the left side bar. Now in the text box search for "_face mask detection_" and download the first dataset found.

<img src="https://drive.google.com/uc?export=view&id=19QYsPd3avFuwR_heB1p7iWGZiyAxRl89" width=1000, align="center"/>

Now unzip the original dataset with the following instruction and leave it in a folder called 'dataset' at the current path. Also, make sure you rename the "_annotations_" folder to "_annotations_XML_".

In [None]:
# Define the file paths
zip_file_path = 'archive.zip'
new_folder_path = 'dataset'

# Get annotations names
original_annotations_path = os.path.join(new_folder_path, 'annotations')
renamed_annotations_path = os.path.join(new_folder_path, 'annotations_XML')

# Create a new folder if it doesn't exist
if not os.path.exists(new_folder_path):
    os.makedirs(new_folder_path)

    # Unzip the file into the new folder using the !unzip shell command
    !unzip -q $zip_file_path -d $new_folder_path

    # Rename the 'annotations' folder to 'annotations_XML'
    os.rename(original_annotations_path, renamed_annotations_path)

As you can see, in your _dataset_ directory you have two folders, one for the images and another for annotations.

Let's have a look at some of those images. Note that we're trying to distinguish between three kinds of objects: a face with a mask, a face without a mask, and a face with a mask worn incorrectly.

In [None]:
# Get a list with image filenames
img_filenames = sorted(ph.list_files_in_folder(
    os.path.join(PROJECT_DIR, "dataset/images"),
    absolute=True))

In [None]:
# See some of those images
for im in (img_filenames[0], img_filenames[1], img_filenames[38]):
    print(im)
    ph.show(im)

Check the files in the annotations folder. You will see that all of them are XML files, which is why we've renamed the folder to "_annotations_XML_". This usually corresponds to the Pascal VOC annotation format.

In [None]:
# Get a list with annotation filenames
ann_filenames = sorted(ph.list_files_in_folder(
    os.path.join(PROJECT_DIR, "dataset/annotations_XML")))

In [None]:
# See an XML file with Pascal VOC annotation
# Note that the class names are inside the tags <name></name>
# inside each <object></object>
ann_id = 0
ann_path = os.path.join(PROJECT_DIR, "dataset/annotations_XML",
                       ann_filenames[ann_id])
print(ann_path)
with open(ann_path) as f:
    contents = f.read()
    print(contents)

As said before, the annotation files are in Pascal VOC format, so we are going to use the following cells to transform them to YOLO Darknet format.
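
As a rough illustration of what such a conversion involves (this is a sketch, not the code of the provided `voc_to_yolo` helper), the snippet below parses a Pascal VOC XML string and emits one Darknet line per object. The function name `voc_xml_to_yolo_lines`, the sample XML and the class names in the call are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def voc_xml_to_yolo_lines(xml_text, class_names):
    """Return the Darknet lines for every <object> in a Pascal VOC XML string."""
    root = ET.fromstring(xml_text)
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)

    lines = []
    for obj in root.iter("object"):
        # The class ID is the position of the name in the class list
        class_id = class_names.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (float(box.find(tag).text)
                                  for tag in ("xmin", "ymin", "xmax", "ymax"))

        # Pixel corners -> center and size, normalized by the image dimensions
        xc = (xmin + xmax) / 2 / img_w
        yc = (ymin + ymax) / 2 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    return lines


# Hypothetical, heavily simplified annotation used only for this example
sample_xml = """
<annotation>
  <size><width>400</width><height>200</height></size>
  <object>
    <name>with_mask</name>
    <bndbox><xmin>100</xmin><ymin>50</ymin><xmax>200</xmax><ymax>150</ymax></bndbox>
  </object>
</annotation>
"""
print(voc_xml_to_yolo_lines(sample_xml, ["with_mask", "without_mask"]))
```

Note that the class ID depends on the order of the class list, so the same list must be used for the conversion and for the dataset YAML file.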

In [None]:
# Create a new folder for the YOLO Darknet format annotations
new_annot_folder = os.path.join(PROJECT_DIR, 'dataset/annotations')
if not os.path.exists(new_annot_folder):
    os.makedirs(new_annot_folder)

In [None]:
# Check what class names there are in the annotations
class_names = sorted(ph.get_classes_from_voc("dataset/annotations_XML"),
                     reverse=True)
print(class_names)

Note: In order to correctly follow the next cell, I suggest you take a look at the provided function 'voc_to_yolo'.

In [None]:
# Define the lambda function to check if the folder is empty
is_folder_empty = lambda folder_path: not os.listdir(folder_path)

# If the folder is empty, do the transformations.
if is_folder_empty(new_annot_folder):
    # Now transform the files from Pascal VOC format to YOLO Darknet format
    ph.voc_to_yolo("dataset/annotations_XML", "dataset/annotations", class_names)

In [None]:
# Get a list with annotation filenames
ann_filenames = sorted(ph.list_files_in_folder(new_annot_folder))

In the next cell you'll check the YOLO Darknet annotation files. I suggest you try to identify the class number for each class name and verify that it is correct.

In [None]:
# See a txt file with YOLO Darknet annotation
ann_id = 0  # (0, 1, 38)
with open(os.path.join(PROJECT_DIR, "dataset/annotations",
                       ann_filenames[ann_id])) as f:
    contents = f.read()
    print(contents)

## Setting up the custom dataset structure

Now we're going to train a model on this custom dataset to be able to predict the objects in an image.

Before training with YOLOv8, the dataset needs to follow a specific structure.

```
dataset/
├── train/
│   ├── images/
│   │   ├── img1_train.png
│   │   └── img2_train.png
│   └── labels/
│       ├── img1_train.txt
│       └── img2_train.txt
├── val/
│   ├── images/
│   │   ├── img1_val.png
│   │   └── img2_val.png
│   └── labels/
│       ├── img1_val.txt
│       └── img2_val.txt
└── test/
    ├── images/
    │   ├── img1_test.png
    │   └── img2_test.png
    └── labels/
        ├── img1_test.txt
        └── img2_test.txt
```

So first of all we're going to arrange the dataset so it suits this structure.

In [None]:
# Get a list of all image files in the dataset folder
img_filenames = sorted(ph.list_files_in_folder(
    os.path.join(PROJECT_DIR, "dataset/images"),
    absolute=True))

# Get a list of all annotation files in the dataset folder
ann_filenames = sorted(ph.list_files_in_folder(
    os.path.join(PROJECT_DIR, "dataset/annotations"),
    absolute=True))

# Pair the elements of the two lists and shuffle
paired_list = list(zip(img_filenames, ann_filenames))
random.seed(42)
random.shuffle(paired_list)

# Unzip the pairs back into two lists
img_filenames, ann_filenames = zip(*paired_list)

# Select the percentage for testing and validation
test_percent = 18
val_percent = 20
test_size = int(len(img_filenames) * test_percent / 100)
val_size = int(len(img_filenames) * val_percent / 100)

# Select the train, validation and test lists
test_img_filenames = img_filenames[:test_size]
test_ann_filenames = ann_filenames[:test_size]
val_img_filenames = img_filenames[test_size:test_size+val_size]
val_ann_filenames = ann_filenames[test_size:test_size+val_size]
train_img_filenames = img_filenames[test_size+val_size:]
train_ann_filenames = ann_filenames[test_size+val_size:]

In [None]:
# Create a new folder called "data"
!mkdir data

# Create the subfolders 'train', 'val' and 'test'
!mkdir data/train
!mkdir data/val
!mkdir data/test

# Create the subfolders of images and labels
!mkdir data/train/images
!mkdir data/train/labels
!mkdir data/val/images
!mkdir data/val/labels
!mkdir data/test/images
!mkdir data/test/labels

In [None]:
# Copy files to the corresponding folders (it may take a while)
ph.copy_files(test_img_filenames, 'data/test/images')
ph.copy_files(test_ann_filenames, 'data/test/labels')
ph.copy_files(val_img_filenames, 'data/val/images')
ph.copy_files(val_ann_filenames, 'data/val/labels')
ph.copy_files(train_img_filenames, 'data/train/images')
ph.copy_files(train_ann_filenames, 'data/train/labels')

## Training on custom dataset

When training with YOLOv8 you'll need to provide a yaml file similar to the one seen before [here](https://docs.ultralytics.com/datasets/detect/coco/#dataset-yaml).

Create this file with all the necessary contents to do the training. You can skip the '_download_' section. The name of this file must be '**dataset_mask_1.yaml**'. Save the file at the project directory.

In [None]:
def create_dataset_txt(main_path, train, val, test, class_list):
    """
    Creates a text file describing a dataset with paths and class names.

    Parameters:
    main_path (str): Absolute path to the main dataset directory.
    train (str): Relative path from main_path to the training set directory.
    val (str): Relative path from main_path to the validation set directory.
    test (str): Relative path from main_path to the test set directory.
    class_list (list of str): List of class names.

    Returns:
    None
    """

    # Convert the class list to a string
    class_list_str = ', '.join(class_list)

    # Prepare the content to be written in the text file
    content = f"""
# Dataset path

# You can use absolute path (recommended)
path: {main_path}

# Train/val/test sets as 1) dir: path/to/imgs,
#                        2) file: path/to/imgs.txt, or
#                        3) list: [path/to/imgs1, path/to/imgs2, ..]

# Here references must be relative to path
train: {train}
val: {val}
test: {test}

# Class Names
names: [{class_list_str}]
    """

    # Write the content to a text file in the current directory
    with open('dataset_mask_1.yaml', 'w') as file:
        file.write(content)

create_dataset_txt(PROJECT_DIR,
                   'data/train/images',
                   'data/val/images',
                   'data/test/images',
                   class_names)

In [None]:
# So it's time to train. Training is as easy as shown in the following
# line. You can tweak the parameters shown in the documentation, of course.
model.train(data='dataset_mask_1.yaml', epochs=30)

In [None]:
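# Get the folder of the last training. The most recently modified folder is
# used: sorting names alphabetically would put 'train10' before 'train2'.
train_dirs = [os.path.join('runs/detect', name)
              for name in os.listdir('runs/detect')
              if name.startswith('train') and
              os.path.isdir(os.path.join('runs/detect', name))]
tr_folder = os.path.basename(max(train_dirs, key=os.path.getmtime))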

print(tr_folder)

In [None]:
# See the metrics. Explore and understand what you see.
%load_ext tensorboard
%tensorboard --logdir {'runs/detect/' + tr_folder}

## Object detection

Let's see how the trained model works.

In [None]:
# Get the best model
model = YOLO(f'runs/detect/{tr_folder}/weights/best.pt')

In [None]:
# Get some test images
path = 'data/test/images'
test_images = sorted(ph.list_files_in_folder(path, absolute=True))[2:11]

In [None]:
# Calculate inference
results = model(source=test_images,
                show=False,
                conf=0.3,
                save=True)

In [None]:
# Show some results
for k in range(len(results)):
    ph.show(results[k])

In [None]:
# The way to test the model is with the test dataset.

# Test the model with split='test'. Dataset and settings are remembered.
metrics = model.val(split='test')

print(metrics.box.map)    # map50-95
print(metrics.box.map50)  # map50
print(metrics.box.map75)  # map75
print(metrics.box.maps)   # a list contains map50-95 of each category
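
The mAP figures above rest on the Intersection over Union (IoU) between a predicted box and a ground-truth box: mAP50 accepts a detection when its IoU is at least 0.5, while mAP50-95 averages over IoU thresholds from 0.5 to 0.95. The following is a minimal sketch of the IoU computation for boxes in `(x1, y1, x2, y2)` format; the function name `iou_xyxy` is made up for this example and is not the routine Ultralytics uses internally.

```python
def iou_xyxy(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if there is any overlap)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])

    # Clamp at zero so that disjoint boxes give an empty intersection
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes that share a 1x1 corner
print(iou_xyxy((0, 0, 2, 2), (1, 1, 3, 3)))
```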

## K-fold cross validation

Once you've been able to train on your custom dataset, the time has come to go one step further.

You may have already realized that the strategy of configuring a folder structure with 'train', 'val' and 'test' is quite rigid in the sense that it is not agile to change files from one folder to another.

There must be a more agile way of managing the files that will be part of each set, and in fact there is. Find out!

In this section you have to set up a training based on 8-fold cross validation, do this training and report the average mAP50 metric. You should do this using the files intended for 'train' and 'val'.

Naturally, the management of the files that will make up each set has to be dynamic, and it is not acceptable for these files to change their location on disk.

Note: It's enough that you train for 5 epochs.
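
One possible way to keep the files in place is to let the dataset YAML point to text files that list image paths, an option mentioned in the comments of the YAML template above (`file: path/to/imgs.txt`). The sketch below writes such a file; the function name `write_kfold_yaml` is hypothetical, and it assumes that `train.txt`, `val.txt` and `test.txt` live directly inside the folder given as `path`.

```python
import os


def write_kfold_yaml(dataset_path, class_names,
                     file_name="dataset_mask_2.yaml"):
    """Write a dataset YAML whose splits are text files listing image paths."""
    names = ", ".join(class_names)
    content = (
        f"path: {dataset_path}\n"
        "train: train.txt\n"  # one image path per line
        "val: val.txt\n"
        "test: test.txt\n"
        f"names: [{names}]\n"
    )
    # Make sure the destination folder exists before writing
    folder = os.path.dirname(file_name)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(file_name, "w", encoding="utf-8") as file:
        file.write(content)
    return content
```

With this layout only the three text files need to be rewritten for each fold, for example after calling `write_kfold_yaml(os.path.join(PROJECT_DIR, 'data'), class_names)` once. As far as I know, Ultralytics locates each label by swapping the `images` folder for `labels` in the image path, so the images and labels can stay where they are.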

In [None]:
from sklearn.model_selection import KFold

In [None]:
# Load a pretrained YOLOv8n detection model
model = YOLO('yolov8n.pt')

In [None]:
def kfold_cv(length, k, test_percent):
    """
    Generates indices for test data and k-fold cross validation on a dataset.

    Args:
    length (int): Total number of images in the dataset.
    k (int): Number of folds for k-fold cross validation.
    test_percent (float): Percentage of the dataset to be used for testing.

    Returns:
    tuple:
        - test_ids (list): A list of indices for test data.
        - kfold_generator (generator): A generator that yields training and
            validation indices for each fold.
    """
    # Calculate the number of test images
    num_test_images = int(length * test_percent / 100)

    # Randomly select test indices
    all_indices = np.arange(length)
    np.random.shuffle(all_indices)
    test_ids = list(all_indices[:num_test_images])

    # Remaining indices for k-fold
    remaining_indices = all_indices[num_test_images:]

    # K-Fold Cross Validation
    kf = KFold(n_splits=k)
    kfold_generator = ((list(remaining_indices[train_index]),
                        list(remaining_indices[val_index])) for train_index,
                        val_index in kf.split(remaining_indices))

    return test_ids, kfold_generator

In [None]:
def select_strings(strings, indices, pre='', post=''):
    """
    Selects and returns strings from a list based on a list of indices.

    Args:
    strings (list of str): The list of strings.
    indices (list of int): The list of indices to select strings.
    pre (string): String to add before the indexed string.
    post (string): String to add after the indexed string.

    Returns:
    list of str: The list of selected strings.

    Example:
    select_strings(['John', 'Tom', 'Jack', 'Sam'], [0, 3], pre='Hello ')
    >>> ['Hello John', 'Hello Sam']
    """
    return [pre + strings[i] + post for i in indices if i < len(strings)]

In [None]:
def create_text_file(strings, folder_path=None, file_name="output.txt"):
    """
    Creates a text file with each string from the list on a separate line
    in UTF-8 format.

    Args:
    strings (list of str): The list of strings to be written to the file.
    folder_path (str, optional): The path to the folder where the file will
        be saved. If not provided, the file is saved in the current directory.
    file_name (str, optional): The name of the file to be created. Defaults
        to 'output.txt'.
    """
    # If a folder path is provided, use it; otherwise, use the current
    # directory
    if folder_path:
        full_path = os.path.join(folder_path, file_name)
    else:
        full_path = file_name

    # Writing the strings to the file in UTF-8 format
    with open(full_path, 'w', encoding='utf-8') as file:
        for string in strings:
            file.write(string + '\n')


In [None]:
# Before executing next cell make sure you have your file
# 'dataset_mask_2.yaml' properly configured. Specifically, it must
# point to the files `test.txt`, `val.txt` and `train.txt`.

# Get the dataset folder
dataset_path = os.path.join(PROJECT_DIR, 'data')

# List all image files
all_files = [f for f in ph.list_files_in_folder(dataset_path,
    absolute=True, recursivity=True) if str(f).endswith('.png')]

# Select parameters in k-fold cross validation
k = 8
test_percent = 18  # Percentage of data for testing
test_ids, kfold_gen = kfold_cv(len(all_files), k, test_percent)

# Get paths list for testing and create text file
test_paths = select_strings(all_files, test_ids)
create_text_file(test_paths, dataset_path, 'test.txt')

In [None]:
# Train the models
metric = []
for train_ids, val_ids in kfold_gen:
    # Get training and validation images paths
    train_paths = select_strings(all_files, train_ids)
    val_paths = select_strings(all_files, val_ids)

    # Create text files with the lists of images paths
    create_text_file(train_paths, dataset_path, 'train.txt')
    create_text_file(val_paths, dataset_path, 'val.txt')

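    # Train the model. Each fold starts again from the pretrained weights:
    # reusing the model trained on the previous fold would leak this fold's
    # validation images into training.
    model = YOLO('yolov8n.pt')
    model.train(data='dataset_mask_2.yaml', epochs=5, resume=False)
    metric.append(model.metrics.results_dict['metrics/mAP50(B)'])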

In [None]:
# See the reported metric of each model
print(f'Metric mAP50: {metric}')

# See the average:
print(f'Average of all mAP50: {np.mean(metric)}')

## Squat counter

Now that you have been able to practice with YOLOv8, we are going to face the last exercise. You must do this last exercise **locally on your laptop**. You don't have to worry about the GPU as no training is required.

The exercise consists of creating a **squat counter application**. This application must accept either a video file or the webcam stream as input. Only one person should appear in the image. The application must be able to distinguish whether the person is up (standing) or down (squatting), taking as reference the angle between the hip-knee and knee-ankle vectors of each leg. Additionally, the application must count how many squats are done in the video.
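
As an illustration of the angle criterion, the sketch below computes the angle at the knee from three 2D points. The function name `knee_angle` and the pixel coordinates are made up for the example; real coordinates would come from the pose keypoints.

```python
import numpy as np


def knee_angle(hip, knee, ankle):
    """Angle in degrees at the knee between knee->hip and knee->ankle."""
    hip, knee, ankle = (np.asarray(p, dtype=float) for p in (hip, knee, ankle))
    v1 = hip - knee
    v2 = ankle - knee
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    if norm == 0:
        # Two joints coincide, so the angle is undefined
        return float("nan")
    # Clip to [-1, 1] to protect arccos against rounding errors
    cosine = np.clip(np.dot(v1, v2) / norm, -1.0, 1.0)
    return float(np.degrees(np.arccos(cosine)))


# Hypothetical coordinates (x, y) in pixels
standing = knee_angle((100, 200), (100, 300), (100, 400))   # straight leg
squatting = knee_angle((200, 300), (100, 300), (100, 400))  # bent leg
print(f"Standing: {standing:.1f} degrees, squatting: {squatting:.1f} degrees")
```

A straight leg gives an angle close to 180 degrees, and the angle decreases as the knee bends.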

You are provided with a video in which a person appears doing squats and some pieces of code that could be useful. **The following chunks of code aren't mandatory; they are just suggestions.**

It's highly recommended to create a virtual environment with Python 3.10 and ultralytics.

In [None]:
# Fully functional local version of the squat counter
# Runs on a virtual environment with Python 3.10 and ultralytics 8.0.201

##################################################
# Libraries
##################################################

import math
from collections import deque
import cv2
import numpy as np
from PIL import Image, ImageDraw
from ultralytics import YOLO


##################################################
# Global variables and general set up
##################################################

# Body parts ordered as indicated in keypoints
idx2bparts = ["Nose", "Left Eye", "Right Eye", "Left Ear", "Right Ear",
    "Left Shoulder", "Right Shoulder", "Left Elbow", "Right Elbow",
    "Left Wrist", "Right Wrist", "Left Hip", "Right Hip", "Left Knee",
    "Right Knee", "Left Ankle", "Right Ankle"]

# Index of body parts
bparts2idx = {key: ix for ix, key in enumerate(idx2bparts)}

# State and squat count
STATE = 'UP'
COUNT = 0
state_stack = deque(maxlen=6)
CHECK = False  # Debug only: opens a gridded image for every frame
ONE_IMAGE = False

# Load the YOLOv8 pose model
model = YOLO('src/models/yolov8s-pose.pt')

# Select the input source: 0 for the webcam, or the path to a video file
source = 0
# source = "src/inference/videos/MySquats.mp4"


##################################################
# Helper functions
##################################################

def add_annotations(frame):
    """
    Add state (up/down) and squats count (number) to the image.

    Args:
        frame (numpy array): Current frame captured

    Returns:
        frame with added text
    """
    # Display state and count on the image
    state_text = f"State: {STATE}"
    count_text = f"Count: {COUNT}"

    # Define the position and font settings for the text
    text_position1 = (10, 30)
    text_position2 = (10, 60)
    font = cv2.FONT_HERSHEY_SIMPLEX
    font_scale = 0.7
    green = (0, 255, 0)
    red = (0, 0, 255)
    font_color = green if STATE == 'UP' else red
    font_thickness = 2

    frame_with_text = frame.copy()
    cv2.putText(frame_with_text, state_text, text_position1, font,
                font_scale, font_color, font_thickness)
    cv2.putText(frame_with_text, count_text, text_position2, font,
                font_scale, green, font_thickness)

    return frame_with_text


def legs_angles(left, right, verbose=False):
    """
    Calculate, for each leg, the angle at the knee between the knee-hip
    and knee-ankle vectors. The inputs are numpy arrays with shape 3x2
    (3 points x 2 coordinates) and the output is a numpy array of shape
    [2,] with each angle in degrees, in the range [0, 180].

    Args:
        left (numpy array): Coordinates of joints hip, knee and ankle of
            the left leg. The matrix has the following shape:
            [x hip  , y hip  ]
            [x knee , y knee ]
            [x ankle, y ankle]
        right (numpy array): Coordinates of joints hip, knee and ankle of
            the right leg. The matrix has the same shape as 'left'
        verbose (bool, optional): Print info. Defaults to False.

    Returns:
        A numpy array with shape [2,] with the angles of the two legs in
            degrees.
    """

    angles = []

    for v in [left, right]:
        # Define the coordinates of three points (x1, y1), (x2, y2), and (x3, y3)
        x1, y1 = v[0, 0], v[0, 1]
        x2, y2 = v[1, 0], v[1, 1]
        x3, y3 = v[2, 0], v[2, 1]

        # Calculate the vectors from p2 to p1 and from p2 to p3
        vector1 = (x1 - x2, y1 - y2)
        vector2 = (x3 - x2, y3 - y2)

        # Calculate the dot product of the vectors
        dot_product = vector1[0] * vector2[0] + vector1[1] * vector2[1]

        # Calculate the magnitudes of the vectors
        magnitude1 = math.sqrt(vector1[0]**2 + vector1[1]**2)
        magnitude2 = math.sqrt(vector2[0]**2 + vector2[1]**2)

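        # Calculate the cosine of the angle using the dot product. If two
        # joints coincide, a magnitude is zero and the angle is undefined.
        denominator = magnitude1 * magnitude2
        if denominator == 0:
            angles.append(float('nan'))
            continue
        cosine_theta = dot_product / denominator

        # Calculate the angle in radians, clamping against rounding errors
        theta_radians = math.acos(max(-1, min(cosine_theta, 1)))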

        # Convert the angle from radians to degrees
        theta_degrees = math.degrees(theta_radians)

        # Append the angles to the list
        angles.append(theta_degrees)

        if verbose:
            print((f"The angle in the knee (triangle knee-hip-ankle) is "
                   f"{theta_degrees:.2f} degrees."))

    return np.array(angles)


def get_legs_coords(kpts):
    """
    Get the keypoints of the result object and extract those of the
    hip, knee and ankle of the left and right legs. The outputs are numpy
    arrays with the x, y coordinates and the confidence value.

    Args:
        kpts (ultralytics keypoints): Keypoints object from the Result
            object in a pose estimation.

    Returns:
        left_leg_coords (numpy array): 3x3 numpy array with the coordinates
            (x, y, confidence) of the left hip, left knee and left ankle
            in the image
        right_leg_coords (numpy array): 3x3 numpy array with the coordinates
            (x, y, confidence) of the right hip, right knee and right ankle
            in the image
    """
    # Indices of left and right hip, knee and ankle
    left_leg = [11, 13, 15]
    right_leg = [12, 14, 16]

    # Left leg
    left_leg_coords = kpts.data[0, left_leg, :].cpu().numpy()

    # Right leg
    right_leg_coords = kpts.data[0, right_leg, :].cpu().numpy()

    return left_leg_coords, right_leg_coords


def extract(result):
    """
    Explore the Results object of Ultralytics for pose estimation

    This helper only illustrates how to explore some fields of the
    Results object. You won't really need this function to implement
    any functionality.

    Args:
        result (Ultralytics Results): Object extracted from a Results generator
            or a Results list.

    Returns:
        None. It prints out some info contained in the input object.
    """
    # Body parts ordered as indicated in keypoints
    idx2bparts = ["Nose", "Left Eye", "Right Eye", "Left Ear", "Right Ear",
        "Left Shoulder", "Right Shoulder", "Left Elbow", "Right Elbow",
        "Left Wrist", "Right Wrist", "Left Hip", "Right Hip", "Left Knee",
        "Right Knee", "Left Ankle", "Right Ankle"]

    # Index of body parts
    bparts2idx = {key: ix for ix, key in enumerate(idx2bparts)}

    # Process each detection in the Results object
    for ix, r in enumerate(result):
        # Start a fresh report so earlier detections are not printed again
        output_str = ''
        names = r.names

        # Boxes object for bbox outputs
        box = r.boxes
        output_str += "\n\nBOXES\n-----\n"
        output_str += f"Box {ix}\n"
        output_str += f"Name of object: {names[int(box.cls.item())]}\n"
        output_str += f"Normalized coordinates of the box (xyxy): {box.xyxyn}\n"
        output_str += f"Confidence of detection: {box.conf.item()}\n"

        kpts = r.keypoints  # Keypoints object for pose outputs
        output_str += "\n\nKEYPOINTS\n---------\n"
        output_str += "Coordinates normalized\n"
        for kp in kpts:
            output_str += f"Nose: {kp.xyn[0, bparts2idx['Nose']]}\n"
            output_str += f"Left Shoulder: {kp.xyn[0, bparts2idx['Left Shoulder']]}\n"
            output_str += f"Right Shoulder: {kp.xyn[0, bparts2idx['Right Shoulder']]}\n"
            output_str += f"Left Hip: {kp.xyn[0, bparts2idx['Left Hip']]}\n"
            output_str += f"Right Hip: {kp.xyn[0, bparts2idx['Right Hip']]}\n"
            output_str += f"Left Knee: {kp.xyn[0, bparts2idx['Left Knee']]}\n"
            output_str += f"Right Knee: {kp.xyn[0, bparts2idx['Right Knee']]}\n"
            output_str += f"Left Ankle: {kp.xyn[0, bparts2idx['Left Ankle']]}\n"
            output_str += f"Right Ankle: {kp.xyn[0, bparts2idx['Right Ankle']]}\n"

        print(output_str)

        # You could also explore masks and probs

        # Masks object for segmentation masks outputs
        # masks = result.masks

        # Probs object for classification outputs
        # probs = result.probs


def evaluate_position(result, limit_conf=0.3, verbose=False):
    """
    Evaluate position of the body in the image

    It updates the global variables STATE (UP or DOWN) and the number
    of squats done (COUNT)

    Args:
        result (Ultralytics Results): Results object from Ultralytics. It
            contains all the data of the pose estimation.
        limit_conf (float, optional): Minimum keypoint confidence. The
            pose is evaluated only if all six leg keypoints exceed this
            value; otherwise the detection is discarded. Defaults to 0.3.
        verbose (bool, optional): Print info. Defaults to False.
    """

    # Global variables
    global COUNT
    global STATE
    global state_stack

    # Loop through Ultralytics Results
    for r in result:

        # Get bounding boxes
        box = r.boxes
        if r.names[int(box.cls.item())] != 'person':
            print("First box is not a person")
            break

        # Get keypoints
        kpts = r.keypoints  # Keypoints object for pose outputs

        # Get coordinates of the joints of the left and right legs
        left_coords, right_coords = get_legs_coords(kpts)

        # Check for confidences
        if (left_coords[:, 2] > limit_conf).all() and (right_coords[:, 2] > limit_conf).all():

            # Calculate the minimum angle in both legs
            angles = legs_angles(left_coords[:, :2], right_coords[:, :2])

            # Legs bent or stretched
            if (angles < 120).all() and STATE == 'UP':
                STATE = 'DOWN'
            elif (angles > 150).all() and STATE == 'DOWN':
                STATE = 'UP'

            # Update stack of states and count
            state_stack.append(STATE)
            if len(state_stack) == 6:
                if state_stack == deque(
                    ['DOWN', 'DOWN', 'DOWN', 'UP', 'UP', 'UP']):
                    COUNT += 1

    # Show info if required
    if verbose:
        print(f"State: {STATE}")
        print(f"Count: {COUNT}")


def draw_grid_on_image(img, grid_size=(10, 10)):
    """
    Function to draw a grid on an image.
    """
    draw = ImageDraw.Draw(img)

    # Get image dimensions
    img_width, img_height = img.size

    # Calculate cell dimensions
    cell_width = img_width / grid_size[0]
    cell_height = img_height / grid_size[1]

    # Calculate vertical line positions
    vertical_lines = [(i * cell_width, 0, i * cell_width, img_height) for
                      i in range(grid_size[0] + 1)]

    # Calculate horizontal line positions
    horizontal_lines = [(0, i * cell_height, img_width, i * cell_height) for
                        i in range(grid_size[1] + 1)]

    # Draw all lines
    for line in vertical_lines + horizontal_lines:
        draw.line(line, fill="black")

    # Return the image with the grid
    return img


##################################################
# Main program
##################################################

# Select source
cap = cv2.VideoCapture(source)
stream = True  # If stream=True the output is a generator
               # otherwise it's a list

# Loop through the video frames
cont = 0
while cap.isOpened():
    # Read a frame from the video
    success, frame = cap.read()

    # If the frame is empty, break the loop
    if not success:
        break

    # Perform pose estimation on this single frame
    results = model(source=frame,
                    show=True,
                    conf=0.3,  # Confidence greater than
                    save=False,
                    stream=stream)  # Create a generator instead of a list

    # Extract data from results
    if not stream:  # If stream=False, results is a list
        r = results[0]
    else:
        r = next(results)

    if cont == 0:
        if ONE_IMAGE:
            cv2.destroyWindow('image0.jpg')
        else:
            cv2.setWindowTitle('image0.jpg', 'YoloV8 Results')

    # Convert to image
    if CHECK:
        im = draw_grid_on_image(Image.fromarray(r.plot()[..., ::-1]))
        im.show()

    # Evaluate position
    evaluate_position(r)

    frame_with_text = add_annotations(frame)

    # Display the annotated frame
    cv2.imshow('Squat Counter Window', frame_with_text)

    # Check for user input to break the loop (e.g., press 'q' to exit)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

    # Increment frame counter
    cont += 1

# Release the video capture object and close all windows
cap.release()
cv2.destroyAllWindows()
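
The counting logic can also be written as a pure function, which makes it easy to test without a camera or a model. This is only a sketch of one possible design: the name `count_squats` and the sample angles are hypothetical, the 120 and 150 degree thresholds are the ones used in the cell above, and, unlike that cell, it counts a squat directly on the DOWN to UP transition instead of using the 6-frame state window.

```python
def count_squats(angles_per_frame, down_below=120.0, up_above=150.0):
    """Count squats from a sequence of (left, right) knee angles in degrees.

    The gap between the two thresholds acts as hysteresis: angles between
    them never change the state, which avoids flickering near one limit.
    """
    state = "UP"
    count = 0
    for left, right in angles_per_frame:
        if state == "UP" and left < down_below and right < down_below:
            state = "DOWN"
        elif state == "DOWN" and left > up_above and right > up_above:
            state = "UP"
            count += 1  # One full DOWN -> UP cycle completed
    return count


# Hypothetical knee angles for eight consecutive frames
frames = [(170, 172), (130, 135), (100, 105), (95, 98),
          (140, 138), (165, 168), (110, 112), (160, 161)]
print(count_squats(frames))  # prints 2
```

Because it does not require several consecutive frames in each state, this version is more sensitive to a single noisy frame than the windowed approach above.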

## Assessment

Deliver this notebook **with the results of the executed cells**. You must also include **your squat counter app as a main.py file** and attach any other files needed to execute the code (helper functions).

I hope you've learned a lot and you've enjoyed this final lab.