First, you need to mount your Google Drive. This will prompt you to authorize this notebook to access your Drive files.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Once your Drive is mounted, you can list the contents of your Drive to find your data. For example, to list the contents of your 'My Drive' folder:

In [None]:
!ls /content/drive/MyDrive

'ambyolo (1).ipynb'  'Colab Notebooks'	 kosom_merged_data   Takeout


# Task
Inspect the `kosom_merged_data` folder to understand the arrangement of image and annotation files, and then confirm the exact format of the annotation files (e.g., Pascal VOC XML, COCO JSON, or YOLO .txt).

## Inspect Dataset Folder

### Subtask:
Inspect the contents of the `kosom_merged_data` folder to understand the arrangement of image and annotation files.


**Reasoning**:
To inspect the contents of the `kosom_merged_data` folder, I will use the `!ls` command to list its contents.



In [None]:
import os

folder_path = '/content/drive/MyDrive/kosom_merged_data'
print(f"Listing contents of: {folder_path}")
!ls -F '{folder_path}'

Listing contents of: /content/drive/MyDrive/kosom_merged_data
test/  train/  val/


**Reasoning**:
Now that we know the top-level structure includes 'test/', 'train/', and 'val/' folders, I will inspect one of these subfolders, specifically the 'train/' folder, to see how the images and annotation files are arranged within it. This will help confirm the dataset's organization.



In [None]:
import os

train_folder_path = os.path.join(folder_path, 'train')
print(f"Listing contents of: {train_folder_path}")
!ls -F '{train_folder_path}' | head -n 10 # List first 10 items to get a glimpse

Listing contents of: /content/drive/MyDrive/kosom_merged_data/train
images/
labels/


**Reasoning**:
Now that the 'train' folder is confirmed to contain 'images/' and 'labels/' subfolders, I will inspect the contents of the 'images/' folder within 'train/' to understand how image files are stored. This will help confirm the structure and file types of the image data.



In [None]:
import os

train_images_path = os.path.join(train_folder_path, 'images')
print(f"Listing contents of: {train_images_path}")
!ls -F '{train_images_path}' | head -n 10 # List first 10 items to get a glimpse of image files

Listing contents of: /content/drive/MyDrive/kosom_merged_data/train/images
100.jpg
101.jpg
102.jpg
103.jpg
104.jpg
105.jpg
106.jpg
107.jpg
108.jpg
109.jpg


**Reasoning**:
Now that the structure of the image files has been determined, I will inspect the contents of the 'labels/' folder within 'train/' to identify the format and type of the annotation files. This is crucial for understanding how the bounding box and class information is stored.



In [None]:
import os

train_labels_path = os.path.join(train_folder_path, 'labels')
print(f"Listing contents of: {train_labels_path}")
!ls -F '{train_labels_path}' | head -n 10 # List first 10 items to get a glimpse of label files

Listing contents of: /content/drive/MyDrive/kosom_merged_data/train/labels
100.txt
101.txt
102.txt
103.txt
104.txt
105.txt
106.txt
107.txt
108.txt
109.txt


**Reasoning**:
Based on the previous output showing `.txt` annotation files, I will now read the content of one of these files to confirm the exact annotation format, likely YOLO.



In [None]:
import os

# Build the full path to one label file.
# os.listdir() returns one filename per list element. Parsing `!ls` output instead
# can put several space-separated filenames into a single string, which yields
# an invalid file path.

all_items_in_labels = os.listdir(train_labels_path)

# Filter to get only files, and sort to ensure consistent selection of the 'first' file
label_files_list = sorted([item for item in all_items_in_labels if os.path.isfile(os.path.join(train_labels_path, item))])

if label_files_list:
    sample_label_filename = label_files_list[0]
    sample_label_file = os.path.join(train_labels_path, sample_label_filename)
    print(f"Reading content of sample label file: {sample_label_file}")
    with open(sample_label_file, 'r') as f:
        content = f.read()
    print(content)
else:
    print(f"No label files found in {train_labels_path}")

Reading content of sample label file: /content/drive/MyDrive/kosom_merged_data/train/labels/1.txt
0 0.177647 0.139151 0.317647 0.268868
0 0.516471 0.139151 0.308235 0.254717
0 0.837647 0.141509 0.305882 0.268868
0 0.176471 0.455189 0.310588 0.278302
0 0.524706 0.466981 0.310588 0.273585
0 0.842353 0.459906 0.296471 0.273585
0 0.272941 0.791274 0.409412 0.346698
0 0.727059 0.791274 0.423529 0.337264




## Confirm Annotation Format

### Subtask:
Confirm the exact format of the annotation files (e.g., Pascal VOC XML, COCO JSON, or YOLO .txt).


The content of the sample annotation file `/content/drive/MyDrive/kosom_merged_data/train/labels/1.txt` is:
```
0 0.177647 0.139151 0.317647 0.268868
0 0.516471 0.139151 0.308235 0.254717
0 0.837647 0.141509 0.305882 0.268868
0 0.176471 0.455189 0.310588 0.278302
0 0.524706 0.466981 0.310588 0.273585
0 0.842353 0.459906 0.296471 0.273585
0 0.272941 0.791274 0.409412 0.346698
0 0.727059 0.791274 0.423529 0.337264
```
Each line holds five space-separated fields, which matches the **YOLO .txt format**: `class_index normalized_center_x normalized_center_y normalized_width normalized_height`, with one line per object and coordinates normalized to the image dimensions. Since the dataset's annotations are already in the required YOLO format, **no conversion is necessary** for this task.
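
To make the field layout concrete, the sketch below parses one label line and converts the normalized box to pixel corner coordinates. The function name `yolo_line_to_pixels` and the 640x480 image size are assumptions chosen for illustration; the actual image dimensions are not stated in this notebook.

```python
def yolo_line_to_pixels(line, img_w, img_h):
    """Convert one YOLO label line to (class_index, x_min, y_min, x_max, y_max) in pixels."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"Expected 5 fields, got {len(parts)}: {line!r}")
    class_index = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:])
    # Center/size are fractions of the image; scale to pixels and move to corners
    x_min = (cx - w / 2) * img_w
    y_min = (cy - h / 2) * img_h
    x_max = (cx + w / 2) * img_w
    y_max = (cy + h / 2) * img_h
    return class_index, x_min, y_min, x_max, y_max


# First line of the sample label file, with a hypothetical 640x480 image
print(yolo_line_to_pixels("0 0.177647 0.139151 0.317647 0.268868", 640, 480))
```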

## Organize Dataset Directory Structure

### Subtask:
Create the required directory structure for YOLOv5s training, typically `/dataset/images/train`, `/dataset/images/val`, `/dataset/images/test`, and similar paths for labels. This ensures that YOLOv5s can easily find your data.


**Reasoning**:
I will create the main directory for the YOLOv5s dataset, which will serve as the base for the new structured data.



In [None]:
import os

yolov5_dataset_root = '/content/yolov5_dataset'
os.makedirs(yolov5_dataset_root, exist_ok=True)
print(f"Created YOLOv5 dataset root directory: {yolov5_dataset_root}")

Created YOLOv5 dataset root directory: /content/yolov5_dataset


**Reasoning**:
Now that the root directory is created, I will create the `images` and `labels` subdirectories, and within each of them, create `train`, `val`, and `test` subdirectories to establish the full YOLOv5s dataset structure.



In [None]:
import os

# Define the base directory for the YOLOv5 dataset
yolov5_dataset_root = '/content/yolov5_dataset'

# Create images and labels base directories
images_dir = os.path.join(yolov5_dataset_root, 'images')
labels_dir = os.path.join(yolov5_dataset_root, 'labels')
os.makedirs(images_dir, exist_ok=True)
os.makedirs(labels_dir, exist_ok=True)
print(f"Created: {images_dir}")
print(f"Created: {labels_dir}")

# Create train, val, test subdirectories within images
for subset in ['train', 'val', 'test']:
    os.makedirs(os.path.join(images_dir, subset), exist_ok=True)
    print(f"Created: {os.path.join(images_dir, subset)}")

# Create train, val, test subdirectories within labels
for subset in ['train', 'val', 'test']:
    os.makedirs(os.path.join(labels_dir, subset), exist_ok=True)
    print(f"Created: {os.path.join(labels_dir, subset)}")

Created: /content/yolov5_dataset/images
Created: /content/yolov5_dataset/labels
Created: /content/yolov5_dataset/images/train
Created: /content/yolov5_dataset/images/val
Created: /content/yolov5_dataset/images/test
Created: /content/yolov5_dataset/labels/train
Created: /content/yolov5_dataset/labels/val
Created: /content/yolov5_dataset/labels/test


**Reasoning**:
To avoid duplicating large files and efficiently manage disk space, I will create symbolic links from the original dataset's image and label folders (`kosom_merged_data`) to the corresponding newly created YOLOv5s dataset structure (`yolov5_dataset`). This will be done for `train`, `val`, and `test` subsets.



In [None]:
import os

# Define source and destination root directories
source_root = '/content/drive/MyDrive/kosom_merged_data'
dest_root = '/content/yolov5_dataset'

# Subsets to process
subsets = ['train', 'val', 'test']

# Create symbolic links for images and labels
for subset in subsets:
    # Images
    source_images_path = os.path.join(source_root, subset, 'images')
    dest_images_path = os.path.join(dest_root, 'images', subset)
    if os.path.exists(source_images_path):
        if not os.path.exists(dest_images_path) or os.path.islink(dest_images_path):
            # Remove an existing symlink first (os.remove cannot delete a real directory)
            if os.path.islink(dest_images_path): os.remove(dest_images_path)
            os.symlink(source_images_path, dest_images_path)
            print(f"Created symbolic link: {dest_images_path} -> {source_images_path}")
        else:
            print(f"Destination directory already exists (not a link), skipping: {dest_images_path}")
    else:
        print(f"Source images path not found, skipping: {source_images_path}")

    # Labels
    source_labels_path = os.path.join(source_root, subset, 'labels')
    dest_labels_path = os.path.join(dest_root, 'labels', subset)
    if os.path.exists(source_labels_path):
        if not os.path.exists(dest_labels_path) or os.path.islink(dest_labels_path):
            # Remove an existing symlink first (os.remove cannot delete a real directory)
            if os.path.islink(dest_labels_path): os.remove(dest_labels_path)
            os.symlink(source_labels_path, dest_labels_path)
            print(f"Created symbolic link: {dest_labels_path} -> {source_labels_path}")
        else:
            print(f"Destination directory already exists (not a link), skipping: {dest_labels_path}")
    else:
        print(f"Source labels path not found, skipping: {source_labels_path}")

Created symbolic link: /content/yolov5_dataset/images/train -> /content/drive/MyDrive/kosom_merged_data/train/images
Created symbolic link: /content/yolov5_dataset/labels/train -> /content/drive/MyDrive/kosom_merged_data/train/labels
Created symbolic link: /content/yolov5_dataset/images/val -> /content/drive/MyDrive/kosom_merged_data/val/images
Created symbolic link: /content/yolov5_dataset/labels/val -> /content/drive/MyDrive/kosom_merged_data/val/labels
Created symbolic link: /content/yolov5_dataset/images/test -> /content/drive/MyDrive/kosom_merged_data/test/images
Created symbolic link: /content/yolov5_dataset/labels/test -> /content/drive/MyDrive/kosom_merged_data/test/labels


**Reasoning**:
The previous attempt to create symbolic links failed because the target directories (e.g., `/content/yolov5_dataset/images/train`) were incorrectly created as actual directories in an earlier step, preventing the `os.symlink` function from executing. To fix this, I will first remove these incorrectly created directories using `shutil.rmtree()` and then re-execute the symbolic link creation to ensure the proper directory structure is established with symbolic links pointing to the original data.



In [None]:
import os
import shutil

# Define source and destination root directories
source_root = '/content/drive/MyDrive/kosom_merged_data'
dest_root = '/content/yolov5_dataset'

# Subsets to process
subsets = ['train', 'val', 'test']

# Remove any incorrectly created directories at the symbolic link target paths
# before attempting to create symbolic links
for subset in subsets:
    dest_images_path = os.path.join(dest_root, 'images', subset)
    dest_labels_path = os.path.join(dest_root, 'labels', subset)

    if os.path.isdir(dest_images_path) and not os.path.islink(dest_images_path):
        shutil.rmtree(dest_images_path)
        print(f"Removed existing directory: {dest_images_path}")
    if os.path.isdir(dest_labels_path) and not os.path.islink(dest_labels_path):
        shutil.rmtree(dest_labels_path)
        print(f"Removed existing directory: {dest_labels_path}")

# Now, create symbolic links for images and labels
for subset in subsets:
    # Images
    source_images_path = os.path.join(source_root, subset, 'images')
    dest_images_path = os.path.join(dest_root, 'images', subset)
    if os.path.exists(source_images_path):
        # Ensure the destination path does not exist before creating a symlink
        if os.path.exists(dest_images_path) or os.path.islink(dest_images_path):
            os.remove(dest_images_path) # Remove existing file or symlink
        os.symlink(source_images_path, dest_images_path)
        print(f"Created symbolic link: {dest_images_path} -> {source_images_path}")
    else:
        print(f"Source images path not found, skipping: {source_images_path}")

    # Labels
    source_labels_path = os.path.join(source_root, subset, 'labels')
    dest_labels_path = os.path.join(dest_root, 'labels', subset)
    if os.path.exists(source_labels_path):
        # Ensure the destination path does not exist before creating a symlink
        if os.path.exists(dest_labels_path) or os.path.islink(dest_labels_path):
            os.remove(dest_labels_path) # Remove existing file or symlink
        os.symlink(source_labels_path, dest_labels_path)
        print(f"Created symbolic link: {dest_labels_path} -> {source_labels_path}")
    else:
        print(f"Source labels path not found, skipping: {source_labels_path}")

Created symbolic link: /content/yolov5_dataset/images/train -> /content/drive/MyDrive/kosom_merged_data/train/images
Created symbolic link: /content/yolov5_dataset/labels/train -> /content/drive/MyDrive/kosom_merged_data/train/labels
Created symbolic link: /content/yolov5_dataset/images/val -> /content/drive/MyDrive/kosom_merged_data/val/images
Created symbolic link: /content/yolov5_dataset/labels/val -> /content/drive/MyDrive/kosom_merged_data/val/labels
Created symbolic link: /content/yolov5_dataset/images/test -> /content/drive/MyDrive/kosom_merged_data/test/images
Created symbolic link: /content/yolov5_dataset/labels/test -> /content/drive/MyDrive/kosom_merged_data/test/labels
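
The remove-then-link logic above can be condensed into one helper. This is a minimal sketch, not the notebook's own code: `replace_with_symlink` is a hypothetical name, and the demonstration runs in a throwaway temporary directory rather than on the real dataset paths. The `islink` check comes first because `os.path.isdir` follows links and `shutil.rmtree` refuses to operate on a symlink.

```python
import os
import shutil
import tempfile


def replace_with_symlink(source, dest):
    """Point dest at source, removing whatever currently occupies dest."""
    if os.path.islink(dest):
        os.unlink(dest)        # existing link, including a broken one
    elif os.path.isdir(dest):
        shutil.rmtree(dest)    # real directory: this deletes its contents
    elif os.path.exists(dest):
        os.remove(dest)        # regular file
    os.symlink(source, dest)


# Demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'src')
    dst = os.path.join(tmp, 'dst')
    os.makedirs(src)
    os.makedirs(dst)           # simulate the pre-created empty directory
    replace_with_symlink(src, dst)
    print(os.path.islink(dst), os.path.realpath(dst) == os.path.realpath(src))
```

Because the helper deletes a real directory at `dest`, it is only appropriate when that directory is known to be empty or disposable, as it was here.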


## Verify Train, Validation, and Test Splits

### Subtask:
Verify the contents of the existing train, validation, and test splits by listing a few files and counting them in the symbolically linked directories.


**Reasoning**:
To verify the contents and counts of the symbolically linked directories, I will iterate through the 'train', 'val', and 'test' subsets, listing the first few files and counting all files in both the image and label directories for each subset.



In [None]:
import os

yolov5_dataset_root = '/content/yolov5_dataset'
subsets = ['train', 'val', 'test']

for subset in subsets:
    print(f"\n--- Verifying {subset} subset ---")

    # Verify images directory
    images_path = os.path.join(yolov5_dataset_root, 'images', subset)
    if os.path.exists(images_path):
        print(f"Listing first 5 image files in {images_path}:")
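        # Caveat: with -F, ls does not follow a symlink named on the command line, so this
        # prints the link itself (e.g. ".../train@"). Pass the path with a trailing "/" to
        # list the linked directory's contents. The same applies to the labels listing below.
        !ls -F '{images_path}' | head -n 5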
        num_images = len(os.listdir(images_path))
        print(f"Total number of image files: {num_images}")
    else:
        print(f"Images directory not found: {images_path}")

    # Verify labels directory
    labels_path = os.path.join(yolov5_dataset_root, 'labels', subset)
    if os.path.exists(labels_path):
        print(f"Listing first 5 label files in {labels_path}:")
        !ls -F '{labels_path}' | head -n 5
        num_labels = len(os.listdir(labels_path))
        print(f"Total number of label files: {num_labels}")
    else:
        print(f"Labels directory not found: {labels_path}")


--- Verifying train subset ---
Listing first 5 image files in /content/yolov5_dataset/images/train:
/content/yolov5_dataset/images/train@
Total number of image files: 663
Listing first 5 label files in /content/yolov5_dataset/labels/train:
/content/yolov5_dataset/labels/train@
Total number of label files: 665

--- Verifying val subset ---
Listing first 5 image files in /content/yolov5_dataset/images/val:
/content/yolov5_dataset/images/val@
Total number of image files: 331
Listing first 5 label files in /content/yolov5_dataset/labels/val:
/content/yolov5_dataset/labels/val@
Total number of label files: 331

--- Verifying test subset ---
Listing first 5 image files in /content/yolov5_dataset/images/test:
/content/yolov5_dataset/images/test@
Total number of image files: 99
Listing first 5 label files in /content/yolov5_dataset/labels/test:
/content/yolov5_dataset/labels/test@
Total number of label files: 99
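
The listings above show only the link name with a trailing `@`, because `ls -F` does not follow a symlink given on the command line, while `os.listdir()` does follow it, which is why the counts are still meaningful. A minimal sketch for previewing the linked contents from Python follows; the helper name `preview_dir` is an assumption for illustration.

```python
import os


def preview_dir(path, n=5):
    """Return the first n entry names in path, sorted; a symlinked directory is followed."""
    return sorted(os.listdir(path))[:n]


for subset in ['train', 'val', 'test']:
    images_path = os.path.join('/content/yolov5_dataset', 'images', subset)
    if os.path.isdir(images_path):   # skip quietly if the dataset is not present
        print(subset, preview_dir(images_path))
```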


## Generate dataset.yaml Configuration File

### Subtask:
Create the `dataset.yaml` file required by YOLOv5s. This file will define the paths to your training and validation images/labels, the number of classes, and the class names.


**Reasoning**:
To create the `dataset.yaml` file, I need to define the paths, number of classes, and class names, and then write this information to a YAML file using the `yaml` library.



In [None]:
import yaml
import os

# 1. Define the full path for the dataset.yaml file
data_yaml_path = '/content/yolov5_dataset/dataset.yaml'

# 2. Define the relative paths for the training and validation images and labels
# These paths are relative to the 'yolov5_dataset' root, where dataset.yaml will reside.
train_images_path = './images/train'
val_images_path = './images/val'
test_images_path = './images/test'

# The labels are expected to be in corresponding 'labels' directories
# e.g., for train images in './images/train', labels are expected in './labels/train'

# 3. Set the number of classes (nc) to 1 as identified from inspection
nc = 1

# 4. Define the class name(s) (names)
names = ['object']  # Using 'object' as a placeholder for class index 0

# 5. Create a Python dictionary containing these entries
data_config = {
    'path': '/content/yolov5_dataset', # Absolute path to the dataset root
    'train': train_images_path, # Relative path to train images
    'val': val_images_path,   # Relative path to val images
    'test': test_images_path,   # Relative path to test images
    'nc': nc,
    'names': names
}

# 6. Write this dictionary to the dataset.yaml file
print(f"Creating dataset.yaml at: {data_yaml_path}")
with open(data_yaml_path, 'w') as file:
    yaml.dump(data_config, file, sort_keys=False)

print(f"Content of {data_yaml_path}:\n")
with open(data_yaml_path, 'r') as file:
    print(file.read())


Creating dataset.yaml at: /content/yolov5_dataset/dataset.yaml
Content of /content/yolov5_dataset/dataset.yaml:

path: /content/yolov5_dataset
train: ./images/train
val: ./images/val
test: ./images/test
nc: 1
names:
- object
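
The value `nc: 1` was inferred from a single sample label file. As a sanity check, the sketch below scans every `.txt` file in each labels directory and reports any class index that is not below `nc`. The helper name `collect_class_indices` is hypothetical, and the code assumes every non-blank line starts with an integer class index, as in the sample file.

```python
import os


def collect_class_indices(labels_dir):
    """Return the set of class indices found in all .txt files in labels_dir."""
    indices = set()
    for name in os.listdir(labels_dir):
        if not name.endswith('.txt'):
            continue
        with open(os.path.join(labels_dir, name)) as f:
            for line in f:
                fields = line.split()
                if fields:                      # skip blank lines
                    indices.add(int(fields[0]))
    return indices


nc = 1  # value written to dataset.yaml
for subset in ['train', 'val', 'test']:
    labels_dir = os.path.join('/content/yolov5_dataset', 'labels', subset)
    if os.path.isdir(labels_dir):               # skip quietly if not present
        found = collect_class_indices(labels_dir)
        out_of_range = sorted(i for i in found if i >= nc)
        print(subset, sorted(found), 'out of range:', out_of_range)
```

An empty `out of range` list for all three subsets would support the single-class setting.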



In [None]:
!git clone https://github.com/ultralytics/yolov5  # clone repo
%cd yolov5
!pip install -r requirements.txt

Cloning into 'yolov5'...
remote: Enumerating objects: 17778, done.
remote: Counting objects: 100% (152/152), done.
remote: Compressing objects: 100% (102/102), done.
remote: Total 17778 (delta 97), reused 50 (delta 50), pack-reused 17626 (from 4)
Receiving objects: 100% (17778/17778), 17.09 MiB | 17.82 MiB/s, done.
Resolving deltas: 100% (12074/12074), done.
/content/yolov5
Collecting thop>=0.1.1 (from -r requirements.txt (line 14))
  Downloading thop-0.1.1.post2209072238-py3-none-any.whl.metadata (2.7 kB)
Collecting ultralytics>=8.2.64 (from -r requirements.txt (line 18))
  Downloading ultralytics-8.3.250-py3-none-any.whl.metadata (37 kB)
Collecting ultralytics-thop>=2.0.18 (from ultralytics>=8.2.64->-r requirements.txt (line 18))
  Downloading ultralytics_thop-2.0.18-py3-none-any.whl.metadata (14 kB)
Downloading thop-0.1.1.post2209072238-py3-none-any.whl (15 kB)
Downloading ultralytics-8.3.250-py3-none-any.whl (1.2 MB)

In [None]:
# Make sure 'dataset.yaml' from the previous step points to your dataset!
# Set WANDB_MODE to 'disabled' to prevent wandb from prompting for input
%env WANDB_MODE=disabled
!python /content/yolov5/train.py --img 320 --batch 16 --epochs 150 --data /content/yolov5_dataset/dataset.yaml --weights yolov5s.pt --workers 2

## Final Task

### Subtask:
Confirm that the data is prepared correctly for YOLOv5s training and provide instructions on how to proceed with model training.


## Summary:

### Q&A
Yes. The directory structure and `dataset.yaml` are in place for YOLOv5s training. The next step is to proceed with model training using the prepared dataset and configuration.

### Data Analysis Key Findings
*   The raw data, located in `/content/drive/MyDrive/kosom_merged_data`, is structured into standard `test`, `train`, and `val` splits, each containing separate `images` (in JPG format) and `labels` (in TXT format) subfolders.
*   Annotation files are confirmed to be in the **YOLO .txt format**, with each line following the structure: `class_index normalized_center_x normalized_center_y normalized_width normalized_height`. This means no format conversion is required.
*   A new YOLOv5s-compatible dataset directory structure has been successfully established at `/content/yolov5_dataset`, utilizing symbolic links to the original image and label files to avoid data duplication.
*   The data splits contain the following file counts:
    *   **Train Set:** 663 image files and 665 label files.
    *   **Validation Set:** 331 image files and 331 label files.
    *   **Test Set:** 99 image files and 99 label files.
    The training labels directory holds 2 more entries than the training images directory. These counts come from `os.listdir()`, so they include every directory entry, not only image or label files.
*   A `dataset.yaml` configuration file was generated, defining the dataset paths, one class (`nc: 1`), and a placeholder class name (`names: ['object']`) for YOLOv5s training.

### Insights or Next Steps
*   The dataset is now completely set up and ready for YOLOv5s model training using the prepared `dataset.yaml` configuration.
*   It is recommended to investigate the minor discrepancy in the training set (663 images vs. 665 labels) to ensure a one-to-one correspondence between images and their annotation files, which is crucial for robust model training.
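
One way to investigate the 663 vs. 665 mismatch is to compare filename stems between the two directories. The sketch below is an assumption-laden example: `unmatched_stems` is a hypothetical helper, and the set of image extensions is a guess (only `.jpg` files were seen in the listing above).

```python
import os

IMAGE_EXTS = {'.jpg', '.jpeg', '.png'}  # assumed image extensions


def unmatched_stems(images_dir, labels_dir):
    """Return (images_without_labels, labels_without_images) as sorted lists of stems."""
    image_stems = {
        os.path.splitext(name)[0]
        for name in os.listdir(images_dir)
        if os.path.splitext(name)[1].lower() in IMAGE_EXTS
    }
    label_stems = {
        os.path.splitext(name)[0]
        for name in os.listdir(labels_dir)
        if name.endswith('.txt')
    }
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


root = '/content/yolov5_dataset'
images_dir = os.path.join(root, 'images', 'train')
labels_dir = os.path.join(root, 'labels', 'train')
if os.path.isdir(images_dir) and os.path.isdir(labels_dir):
    images_without_labels, labels_without_images = unmatched_stems(images_dir, labels_dir)
    print('Images without labels:', images_without_labels)
    print('Labels without images:', labels_without_images)
```

Extra entries in a labels folder are sometimes non-annotation files such as a `classes.txt` list, so the second list is worth reading before deleting anything.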
