# **Splitting the Datasets**

Here, the datasets that were cleaned and augmented in the Dataset_Combination file, and then manually combined, are split into training, validation, and test sets.

# **Step 1: Install Required Dependencies**

In this step, I install essential libraries that will be used for augmentation, image processing, and file handling. The following packages are installed:

*   albumentations, for image augmentation
*   opencv-python, for image processing tasks
*   pillow, for handling image file formats
*   numpy, for numerical operations
*   gdown, for downloading files from GDrive
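
Some of these packages are imported under a different name than the one passed to pip (opencv-python imports as `cv2`, pillow as `PIL`). The following is a minimal sketch, meant to be run after the install cell, for checking that each package can actually be imported. The helper name `missing_packages` and the `PACKAGES` mapping are illustrative and not part of the original notebook.

```python
import importlib.util

# pip package name -> import name (the two differ for some packages)
PACKAGES = {
    "albumentations": "albumentations",
    "opencv-python": "cv2",
    "pillow": "PIL",
    "numpy": "numpy",
    "gdown": "gdown",
}

def missing_packages(packages):
    """Return the pip names whose import name cannot be found."""
    return [pip_name for pip_name, module in packages.items()
            if importlib.util.find_spec(module) is None]

print("Missing packages:", missing_packages(PACKAGES))
```

An empty list means every package resolved; any name that is printed would need to be installed again before continuing.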

In [None]:
#Set up libraries and dependencies
!pip install albumentations==1.0.3 opencv-python==4.5.3.56 pillow numpy
!pip install --upgrade --no-cache-dir gdown

Collecting albumentations==1.0.3
  Downloading albumentations-1.0.3-py3-none-any.whl.metadata (31 kB)
Collecting opencv-python==4.5.3.56
  Downloading opencv-python-4.5.3.56.tar.gz (89.2 MB)
     89.2/89.2 MB 23.2 MB/s eta 0:00:00
  Installing build dependencies ... error
  error: subprocess-exited-with-error

  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.

# **Step 2: Set Up the Environment**

This section mounts Google Drive in the Colab environment to access the files and directories stored there. The dataset and related files are read from and written to Google Drive until they are downloaded manually.

In [None]:
#Set up environment
from google.colab import drive
import os

drive.mount('/content/drive')

Mounted at /content/drive
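
Once the drive is mounted, it can help to confirm that the dataset folders are where the later cells expect them. The following is a small sketch under that assumption; `missing_subdirs` is a hypothetical helper, and the path is the one used in the splitting step of this notebook.

```python
import os

def missing_subdirs(dataset_path, subdirs=("images", "labels")):
    """Return the expected subdirectories that do not exist under dataset_path."""
    return [name for name in subdirs
            if not os.path.isdir(os.path.join(dataset_path, name))]

# Path used by the splitting step later in this notebook
dataset_path = '/content/drive/MyDrive/final_unified_dataset'
missing = missing_subdirs(dataset_path)
if missing:
    print(f"Missing under {dataset_path}: {missing}")
else:
    print("Dataset folders found.")
```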


# **Step 3: Augmentation (Obsolete)**

The following code was originally used to apply further augmentation to the images in the dataset. However, I decided it was better to do this during the original handling of the separate datasets, to reduce the likelihood of corrupt files.

The code below has been left as reference.

In [1]:
#Augmentation: obsolete
import os
import cv2
import albumentations as A
from google.colab import drive
from PIL import Image
import numpy as np
from tqdm import tqdm

# Mount Google Drive
drive.mount('/content/drive')

# Paths to dataset
dataset_path = '/content/drive/MyDrive/create_data/complete_dataset'
images_path = os.path.join(dataset_path, 'images')
labels_path = os.path.join(dataset_path, 'labels')

# Output paths
augmented_images_path = os.path.join(dataset_path, 'augmented_images')
augmented_labels_path = os.path.join(dataset_path, 'augmented_labels')

# Create directories if they do not exist
os.makedirs(augmented_images_path, exist_ok=True)
os.makedirs(augmented_labels_path, exist_ok=True)

# Define augmentation pipeline
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.1, rotate_limit=30, p=0.5),
    A.GaussianBlur(p=0.2),
], bbox_params=A.BboxParams(format='yolo', label_fields=['class_labels']))

# Function to read YOLO labels
def read_yolo_labels(label_path):
    labels = []
    with open(label_path, 'r') as file:
        for line in file:
            if not line.strip():
                continue  # skip blank lines
            class_id, x_center, y_center, width, height = map(float, line.split())
            # Box coordinates come first; class_id is kept as a trailing extra field
            labels.append([x_center, y_center, width, height, class_id])
    return labels

# Function to write YOLO labels
def write_yolo_labels(label_path, labels):
    with open(label_path, 'w') as file:
        for label in labels:
            x_center, y_center, width, height, class_id = label
            file.write(f"{int(class_id)} {x_center} {y_center} {width} {height}\n")

# Augment images and labels
for image_filename in tqdm(os.listdir(images_path)):
    if image_filename.endswith('.jpg'):
        image_path = os.path.join(images_path, image_filename)
        label_path = os.path.join(labels_path, image_filename.replace('.jpg', '.txt'))

        # Skip images that have no matching label file
        if not os.path.exists(label_path):
            continue

        # Read image and labels
        image = np.array(Image.open(image_path).convert('RGB'))
        bboxes = read_yolo_labels(label_path)
        class_labels = [label[4] for label in bboxes]

        # Apply augmentation
        augmented = transform(image=image, bboxes=bboxes, class_labels=class_labels)
        augmented_image = augmented['image']
        augmented_bboxes = augmented['bboxes']

        # Save augmented image and labels
        augmented_image_path = os.path.join(augmented_images_path, image_filename)
        augmented_label_path = os.path.join(augmented_labels_path, image_filename.replace('.jpg', '.txt'))

        Image.fromarray(augmented_image).save(augmented_image_path)
        write_yolo_labels(augmented_label_path, augmented_bboxes)


ModuleNotFoundError: No module named 'google'
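
The `ModuleNotFoundError` above suggests that the cell was last run outside Colab, where the `google.colab` package is not available. One way to keep such a cell from failing is to guard the import, as in this sketch; `mount_drive_if_colab` is a hypothetical helper name, and the mount point is the one used earlier in this notebook.

```python
def mount_drive_if_colab(mount_point='/content/drive'):
    """Mount Google Drive when running in Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return False
    drive.mount(mount_point)
    return True

in_colab = mount_drive_if_colab()
print("Running in Colab:", in_colab)
```

Outside Colab the Drive paths will not exist, so the rest of the cell would still need local paths to run.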

# **Step 4: Split Dataset into Train, Test, and Validation Sets**

In this step, the cleaned, augmented, and unified dataset is split into training, validation, and test sets for use with the custom YOLOv5 model for the current project.

The dataset is first split 80/20 into training and test sets. Then 25% of the training portion is held out for validation, which gives an overall 60/20/20 split between training, validation, and test. This is done with the train_test_split function from the sklearn library, which shuffles the files randomly; a fixed random_state keeps the split reproducible.

The output paths are defined, the files are copied into their respective directories, and the result is later zipped in the main project notebook.

Note: These steps are performed in separate notebooks to reduce the clutter within the main project notebook.
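
To make the proportions concrete: holding out 20% for testing and then 25% of the remaining 80% for validation leaves 60% for training. The sketch below reproduces that two-stage logic using only the standard library. `two_stage_split` is a hypothetical stand-in for the two `train_test_split` calls, not the sklearn function itself, and it will not select the same files as sklearn does.

```python
import math
import random

def two_stage_split(items, test_frac=0.2, val_frac_of_train=0.25, seed=42):
    """Shuffle once, then carve off the test and validation portions."""
    items = sorted(items)               # fixed order so the seed is meaningful
    random.Random(seed).shuffle(items)
    n_test = math.ceil(len(items) * test_frac)
    test, rest = items[:n_test], items[n_test:]
    n_val = math.ceil(len(rest) * val_frac_of_train)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

# Illustrative file names, not taken from the real dataset
files = [f"img_{i:03d}.jpg" for i in range(100)]
train, val, test = two_stage_split(files)
print(len(train), len(val), len(test))  # 60 20 20
```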

In [None]:
# Splitting the dataset

import os
import shutil
import numpy as np
from sklearn.model_selection import train_test_split

os.chdir('/content/drive/MyDrive/final_unified_dataset')

# Paths to the augmented dataset
dataset_path = '/content/drive/MyDrive/final_unified_dataset'
images_path = os.path.join(dataset_path, 'images')
labels_path = os.path.join(dataset_path, 'labels')

# Output paths for split datasets
output_base_path = '/content/drive/MyDrive/split_dataset'
train_images_path = os.path.join(output_base_path, 'train', 'images')
train_labels_path = os.path.join(output_base_path, 'train', 'labels')
val_images_path = os.path.join(output_base_path, 'val', 'images')
val_labels_path = os.path.join(output_base_path, 'val', 'labels')
test_images_path = os.path.join(output_base_path, 'test', 'images')
test_labels_path = os.path.join(output_base_path, 'test', 'labels')

# Create directories if they do not exist
os.makedirs(train_images_path, exist_ok=True)
os.makedirs(train_labels_path, exist_ok=True)
os.makedirs(val_images_path, exist_ok=True)
os.makedirs(val_labels_path, exist_ok=True)
os.makedirs(test_images_path, exist_ok=True)
os.makedirs(test_labels_path, exist_ok=True)

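# List all image files (sorted, because os.listdir order is arbitrary and
# the seeded split is only reproducible for a fixed input order)
image_files = sorted(f for f in os.listdir(images_path) if f.endswith('.jpg'))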

# Split the dataset
train_files, test_files = train_test_split(image_files, test_size=0.2, random_state=42)
train_files, val_files = train_test_split(train_files, test_size=0.25, random_state=42)  # 0.25 x 0.8 = 0.2

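# Function to copy images and their label files to the destination
def copy_files(files, src_img_dir, src_lbl_dir, dst_img_dir, dst_lbl_dir):
    for file in files:
        shutil.copy(os.path.join(src_img_dir, file), os.path.join(dst_img_dir, file))
        # Swap only the extension; str.replace would also alter '.jpg' inside the name
        label_file = os.path.splitext(file)[0] + '.txt'
        src_label = os.path.join(src_lbl_dir, label_file)
        # Images without a label file are copied without one
        if os.path.exists(src_label):
            shutil.copy(src_label, os.path.join(dst_lbl_dir, label_file))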

# Copy the files to the respective directories
copy_files(train_files, images_path, labels_path, train_images_path, train_labels_path)
copy_files(val_files, images_path, labels_path, val_images_path, val_labels_path)
copy_files(test_files, images_path, labels_path, test_images_path, test_labels_path)
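
After copying, a quick sanity check can confirm that each split holds the expected number of images and that every image has a label file. This is a minimal sketch, assuming the `train`/`val`/`test` layout created above; `split_summary` is a hypothetical helper, not part of the original notebook.

```python
import os

def split_summary(base_path, splits=("train", "val", "test")):
    """Count images per split and list images with no matching label file."""
    summary = {}
    for split in splits:
        img_dir = os.path.join(base_path, split, "images")
        lbl_dir = os.path.join(base_path, split, "labels")
        images = sorted(f for f in os.listdir(img_dir) if f.endswith(".jpg"))
        labels = set(os.listdir(lbl_dir))
        unlabeled = [f for f in images
                     if os.path.splitext(f)[0] + ".txt" not in labels]
        summary[split] = {"images": len(images), "unlabeled": unlabeled}
    return summary

# Output location used by the splitting cell
output_base_path = '/content/drive/MyDrive/split_dataset'
if os.path.isdir(output_base_path):
    for split, info in split_summary(output_base_path).items():
        print(split, info["images"], "images,",
              len(info["unlabeled"]), "without labels")
```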
