### Data/Label Organizer

#### This script tidies data/label pairs and organizes the folder structure for YOLO model intake. It performs the following steps:
1) Import the folder where data/labels are stored
2) Match file name convention for utilization in model
3) Identify the label id and their respective names
4) Set up folder structure
5) Populate structure to begin inferencing on main script

#### We will go over each cell to explain the steps

In [1]:
### Let's import the modules needed for this task:

import os
import shutil
import pandas as pd
from pathlib import Path
import re
import random

#### 1) Import the folder where data/labels are stored

In [2]:
### Import image folder
### Forward slashes are used so the paths work on any OS and avoid backslash-escape problems in string literals

imgData = Path('./MODELDATA/Cleanse_v1-20251104T204656Z-1-001/Cleanse_v1')

### Import annotations for the cleansed images
### The cleansed images were extracted from VID02, VID04, VID05, VID06, and VID10,
### so let's retrieve only those folders to pair up.

annotVID02 = Path('./MODELDATA/yolo_annotations/yolo_annotations/VID02')
annotVID04 = Path('./MODELDATA/yolo_annotations/yolo_annotations/VID04')
annotVID05 = Path('./MODELDATA/yolo_annotations/yolo_annotations/VID05')
annotVID06 = Path('./MODELDATA/yolo_annotations/yolo_annotations/VID06')
annotVID10 = Path('./MODELDATA/yolo_annotations/yolo_annotations/VID10')
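
The five annotation paths differ only in the video ID, so they could also be built in a loop and checked for existence before any copying starts. A minimal sketch, assuming the same `MODELDATA/yolo_annotations/yolo_annotations` layout; the helper names `annotation_dirs` and `missing_dirs` are hypothetical and not part of the original script:

```python
from pathlib import Path

def annotation_dirs(root, video_ids):
    """Map each video ID (e.g. 'VID02') to its annotation folder under `root`."""
    root = Path(root)
    return {vid: root / vid for vid in video_ids}

def missing_dirs(dirs):
    """Return the IDs whose folder does not exist on disk."""
    return [vid for vid, path in dirs.items() if not path.is_dir()]

VIDEO_IDS = ["VID02", "VID04", "VID05", "VID06", "VID10"]
ann_dirs = annotation_dirs("./MODELDATA/yolo_annotations/yolo_annotations", VIDEO_IDS)
print("Missing annotation folders:", missing_dirs(ann_dirs))
```

Catching a missing folder here is cheaper than discovering it halfway through the backup step.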

#### 2) Match file name convention for utilization in model

In [3]:
### The annotation names must match the image naming convention so the two can be paired
### The images are named "VID02_000..." while the annotations are named "VID_2_000...",
### so we rename the annotation files recursively

### Let's copy the respective folders first to avoid editing the original files

ann_folders = [annotVID02, annotVID04, annotVID05, annotVID06, annotVID10]

backup_root = Path("./annot_backup")
backup_root.mkdir(exist_ok=True)

copied_folders = []

for folder in ann_folders:
    dest = backup_root / folder.name
    if dest.exists():
        shutil.rmtree(dest)  ### Remove the old backup in case of a re-run
    shutil.copytree(folder, dest)
    copied_folders.append(dest)
    print(f"Copied {folder} → {dest}")



Copied MODELDATA\yolo_annotations\yolo_annotations\VID02 → annot_backup\VID02
Copied MODELDATA\yolo_annotations\yolo_annotations\VID04 → annot_backup\VID04
Copied MODELDATA\yolo_annotations\yolo_annotations\VID05 → annot_backup\VID05
Copied MODELDATA\yolo_annotations\yolo_annotations\VID06 → annot_backup\VID06
Copied MODELDATA\yolo_annotations\yolo_annotations\VID10 → annot_backup\VID10


In [4]:
### Let's rename the files accordingly and recursively

pattern = re.compile(r"^VID_(\d+)_(.+)$", re.I)

for folder in copied_folders:
    for file in folder.rglob("*.txt"):
        m = pattern.match(file.stem)
        if not m:
            continue
        
        num = m.group(1)      ###number part (after VID_)
        rest = m.group(2)     ###rest of the filename

        ### Pad single digits with a leading zero; two-digit numbers stay as they are
        if len(num) == 1:
            num = f"0{num}"

        new_name = f"VID{num}_{rest}{file.suffix}"
        new_path = file.with_name(new_name)

        file.rename(new_path)
        print(f"Renamed: {file.name} → {new_name}")


Renamed: VID_2_000000.txt → VID02_000000.txt
Renamed: VID_2_000001.txt → VID02_000001.txt
Renamed: VID_2_000002.txt → VID02_000002.txt
Renamed: VID_2_000003.txt → VID02_000003.txt
Renamed: VID_2_000004.txt → VID02_000004.txt
Renamed: VID_2_000005.txt → VID02_000005.txt
Renamed: VID_2_000006.txt → VID02_000006.txt
Renamed: VID_2_000007.txt → VID02_000007.txt
Renamed: VID_2_000008.txt → VID02_000008.txt
Renamed: VID_2_000009.txt → VID02_000009.txt
Renamed: VID_2_000010.txt → VID02_000010.txt
Renamed: VID_2_000011.txt → VID02_000011.txt
Renamed: VID_2_000012.txt → VID02_000012.txt
Renamed: VID_2_000013.txt → VID02_000013.txt
Renamed: VID_2_000014.txt → VID02_000014.txt
Renamed: VID_2_000015.txt → VID02_000015.txt
Renamed: VID_2_000016.txt → VID02_000016.txt
Renamed: VID_2_000017.txt → VID02_000017.txt
Renamed: VID_2_000018.txt → VID02_000018.txt
Renamed: VID_2_000019.txt → VID02_000019.txt
Renamed: VID_2_000020.txt → VID02_000020.txt
Renamed: VID_2_000021.txt → VID02_000021.txt
Renamed: V
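
The renaming rule can be isolated as a pure function, which makes it easy to check on a few names before touching any files. A minimal sketch using the same regular expression as the cell above; the function name `normalize_stem` is hypothetical, and `str.zfill(2)` is swapped in for the explicit length check (it pads one-digit numbers and leaves longer ones unchanged):

```python
import re

VID_PATTERN = re.compile(r"^VID_(\d+)_(.+)$", re.I)

def normalize_stem(stem):
    """Turn 'VID_2_000000' into 'VID02_000000'; return None if the stem does not match."""
    m = VID_PATTERN.match(stem)
    if not m:
        return None
    num, rest = m.groups()
    return f"VID{num.zfill(2)}_{rest}"

for stem in ["VID_2_000000", "VID_10_000123", "VID02_000000"]:
    print(stem, "->", normalize_stem(stem))
```

A stem that is already in the target form does not match the pattern, so re-running the rename is harmless.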

#### 3) Identify the label id and their respective names

In [5]:
### Now we match the cleansed images with their annotations from the backup folders so they can be paired
### For YOLO classification, an image such as VID02_000042.png needs a matching VID02_000042.txt
### We do this by recursively scanning the folders and matching names with regular expressions

annotVID02bu = Path('./annot_backup/VID02')
annotVID04bu = Path('./annot_backup/VID04')
annotVID05bu = Path('./annot_backup/VID05')
annotVID06bu = Path('./annot_backup/VID06')
annotVID10bu = Path('./annot_backup/VID10')

ann_foldersbu = [annotVID02bu, annotVID04bu, annotVID05bu, annotVID06bu, annotVID10bu]

out_dir  = Path("./organizedAnnot")
out_dir.mkdir(exist_ok=True)

img_re = re.compile(r"^(.+?)\.png$", re.I)

image_count = 0
match_count = 0
unmatched_images = []

for img_path in imgData.rglob("*.png"):
    image_count += 1
    m = img_re.match(img_path.name)
    if not m:
        unmatched_images.append(str(img_path))
        continue

    base = m.group(1)  # captured text before .png
    found = False

    # search each annotation folder recursively; use regex to match ann stem
    ann_pattern = re.compile(rf"^{re.escape(base)}$", re.I)
    for ann_root in ann_foldersbu:
        for ann in ann_root.rglob("*.txt"):
            if ann_pattern.match(ann.stem):
                shutil.copy(ann, out_dir / ann.name)
                match_count += 1
                found = True
                break
        if found:
            break

    if not found:
        unmatched_images.append(str(img_path))

print("Total images found:", image_count)
print("Annotations matched and copied:", match_count)
print("Unmatched images:", len(unmatched_images))
# optional: print details
for p in unmatched_images:
    print("  -", p)




Total images found: 3779
Annotations matched and copied: 3779
Unmatched images: 0
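
The pairing loop rescans every annotation folder for each image. Since the match is an exact, case-insensitive comparison of file stems, the annotations could instead be indexed once in a dictionary. A minimal sketch under that assumption; `index_by_stem` and `pair_images` are hypothetical helpers, not part of the original script:

```python
from pathlib import Path

def index_by_stem(folders, suffix=".txt"):
    """Map lower-cased file stem -> path for every matching file under the folders."""
    index = {}
    for folder in folders:
        for path in Path(folder).rglob(f"*{suffix}"):
            # Keep the first hit per stem, mirroring the early `break` in the loop
            index.setdefault(path.stem.lower(), path)
    return index

def pair_images(image_paths, ann_index):
    """Split images into (image, annotation) pairs and a list of unmatched images."""
    pairs, unmatched = [], []
    for img in image_paths:
        img = Path(img)
        ann = ann_index.get(img.stem.lower())
        if ann is None:
            unmatched.append(img)
        else:
            pairs.append((img, ann))
    return pairs, unmatched

# Possible usage with the variables defined above:
# ann_index = index_by_stem(ann_foldersbu)
# pairs, unmatched = pair_images(imgData.rglob("*.png"), ann_index)
```

With a few thousand images the difference is mostly run time; the matches themselves should be the same as long as stems are unique across folders.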


#### 4) Set up folder structure

In [6]:
from pathlib import Path

# Paths to annotation folders
ann_foldersbu = [annotVID02bu, annotVID04bu, annotVID05bu, annotVID06bu, annotVID10bu]

label_ids = set()  # use a set to avoid duplicates

for folder in ann_foldersbu:
    for txt_file in folder.rglob("*.txt"):
        try:
            with open(txt_file, "r") as f:
                content = f.read().strip()
                if content:  # skip empty files
                    label_id = int(content)
                    label_ids.add(label_id)
        except Exception as e:
            print(f"Error reading {txt_file}: {e}")

# Convert to sorted list
unique_labels = sorted(label_ids)
print("Unique label IDs:", unique_labels)
print("Total unique labels:", len(unique_labels))

Unique label IDs: [-1, 3, 4, 7, 8, 10, 11, 12, 13, 15, 17, 18, 19, 20, 21, 22, 23, 26, 29, 44, 45, 46, 51, 52, 57, 58, 59, 60, 61, 66, 68, 69, 72, 78, 79, 82, 90, 92, 94, 95, 96, 97, 98, 99]
Total unique labels: 44
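
The set of unique IDs shows which labels occur but not how often. Counting files per label would also reveal how many annotations carry `-1`, whose meaning the source does not state. A minimal sketch, assuming each annotation file holds a single integer as in the cell above; `read_label` and `count_labels` are hypothetical helpers:

```python
from collections import Counter
from pathlib import Path

def read_label(txt_file):
    """Return the integer label in a one-value annotation file, or None if empty/invalid."""
    content = Path(txt_file).read_text().strip()
    try:
        return int(content)
    except ValueError:
        return None

def count_labels(folders):
    """Count how many annotation files carry each label ID."""
    counts = Counter()
    for folder in folders:
        for txt_file in Path(folder).rglob("*.txt"):
            label = read_label(txt_file)
            if label is not None:
                counts[label] += 1
    return counts

# Possible usage with the folders defined above:
# for label, n in sorted(count_labels(ann_foldersbu).items()):
#     print(label, n)
```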


#### 5) Populate structure to begin inferencing on main script

In [7]:
NAMES = [
  'bipolar,coagulate,abdominal-wall/cavity',
  'bipolar,coagulate,blood-vessel',
  'bipolar,coagulate,cystic-artery',
  'bipolar,coagulate,cystic-duct',
  'bipolar,coagulate,cystic-pedicle',
  'bipolar,coagulate,cystic-plate',
  'bipolar,coagulate,gallbladder',
  'bipolar,coagulate,liver',
  'bipolar,coagulate,omentum',
  'bipolar,coagulate,peritoneum',
  'bipolar,dissect,adhesion',
  'bipolar,dissect,cystic-artery',
  'bipolar,dissect,cystic-duct',
  'bipolar,dissect,cystic-plate',
  'bipolar,dissect,gallbladder',
  'bipolar,dissect,omentum',
  'bipolar,grasp,cystic-plate',
  'bipolar,grasp,liver',
  'bipolar,grasp,specimen-bag',
  'bipolar,null-verb,null-target',
  'bipolar,retract,blood-vessel',
  'bipolar,retract,cystic-pedicle',
  'bipolar,retract,gallbladder',
  'bipolar,retract,liver',
  'bipolar,retract,omentum',
  'clipper,clip,blood-vessel',
  'clipper,clip,cystic-artery',
  'clipper,clip,cystic-duct',
  'clipper,clip,cystic-pedicle',
  'clipper,clip,cystic-plate',
  'clipper,null-verb,null-target',
  'grasper,dissect,cystic-plate',
  'grasper,dissect,gallbladder',
  'grasper,dissect,omentum',
  'grasper,grasp,cystic-artery',
  'grasper,grasp,cystic-duct',
  'grasper,grasp,cystic-pedicle',
  'grasper,grasp,cystic-plate',
  'grasper,grasp,gallbladder',
  'grasper,grasp,gut',
  'grasper,grasp,liver',
  'grasper,grasp,omentum',
  'grasper,grasp,peritoneum',
  'grasper,grasp,specimen-bag',
  'grasper,null-verb,null-target',
  'grasper,pack,gallbladder',
  'grasper,retract,cystic-duct',
  'grasper,retract,cystic-pedicle',
  'grasper,retract,cystic-plate',
  'grasper,retract,gallbladder',
  'grasper,retract,gut',
  'grasper,retract,liver',
  'grasper,retract,omentum',
  'grasper,retract,peritoneum',
  'hook,coagulate,blood-vessel',
  'hook,coagulate,cystic-artery',
  'hook,coagulate,cystic-duct',
  'hook,coagulate,cystic-pedicle',
  'hook,coagulate,cystic-plate',
  'hook,coagulate,gallbladder',
  'hook,coagulate,liver',
  'hook,coagulate,omentum',
  'hook,cut,blood-vessel',
  'hook,cut,peritoneum',
  'hook,dissect,blood-vessel',
  'hook,dissect,cystic-artery',
  'hook,dissect,cystic-duct',
  'hook,dissect,cystic-plate',
  'hook,dissect,gallbladder',
  'hook,dissect,omentum',
  'hook,dissect,peritoneum',
  'hook,null-verb,null-target',
  'hook,retract,gallbladder',
  'hook,retract,liver',
  'irrigator,aspirate,fluid',
  'irrigator,dissect,cystic-duct',
  'irrigator,dissect,cystic-pedicle',
  'irrigator,dissect,cystic-plate',
  'irrigator,dissect,gallbladder',
  'irrigator,dissect,omentum',
  'irrigator,irrigate,abdominal-wall/cavity',
  'irrigator,irrigate,cystic-pedicle',
  'irrigator,irrigate,liver',
  'irrigator,null-verb,null-target',
  'irrigator,retract,gallbladder',
  'irrigator,retract,liver',
  'irrigator,retract,omentum',
  'scissors,coagulate,omentum',
  'scissors,cut,adhesion',
  'scissors,cut,blood-vessel',
  'scissors,cut,cystic-artery',
  'scissors,cut,cystic-duct',
  'scissors,cut,cystic-plate',
  'scissors,cut,liver',
  'scissors,cut,omentum',
  'scissors,cut,peritoneum',
  'scissors,dissect,cystic-plate',
  'scissors,dissect,gallbladder',
  'scissors,dissect,omentum',
  'scissors,null-verb,null-target',
]
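
Each entry in `NAMES` is an `instrument,verb,target` triplet, and a label ID is its position in the list. Because the label scan reported `-1` among the unique IDs, a plain `NAMES[idx]` lookup would resolve `-1` to the last entry instead of failing. A minimal sketch of a bounds-checked lookup; `class_name_for` and `parse_triplet` are hypothetical helpers, and the two-item list is a stand-in for the full 100-entry `NAMES`:

```python
def class_name_for(idx, names):
    """Return names[idx] for a valid non-negative index, else None."""
    if 0 <= idx < len(names):
        return names[idx]
    return None  # covers -1, which Python would otherwise resolve to the last entry

def parse_triplet(name):
    """Split 'instrument,verb,target' into its three parts."""
    instrument, verb, target = name.split(",")
    return instrument, verb, target

# Stand-in list for illustration only
demo_names = ["bipolar,coagulate,abdominal-wall/cavity", "bipolar,coagulate,blood-vessel"]
print(class_name_for(0, demo_names))
print(class_name_for(-1, demo_names))
print(parse_triplet(demo_names[1]))
```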



In [8]:
from pathlib import Path
import shutil
from collections import defaultdict
from math import ceil

# === PATHS ===
IMG_ROOT = imgData
ANNOT_ROOT = Path("./organizedAnnot/")
OUTPUT_ROOT = Path("./YOLO_dataset")

SPLIT_RATIOS = {"train": 0.7, "val": 0.15, "test": 0.15}

def sanitize(name: str) -> str:
    return name.replace("/", "_").replace(",", "_").replace(" ", "_")

# ========= STEP 1: COLLECT IMAGES PER CLASS =========
class_to_images = defaultdict(list)

for ann_file in ANNOT_ROOT.rglob("*.txt"):
    img_file = IMG_ROOT / (ann_file.stem + ".png")
    if not img_file.exists():
        continue
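    try:
        idx = int(ann_file.read_text().strip())
    except (OSError, ValueError):
        continue  # unreadable, empty, or non-integer annotation
    # Skip ids outside NAMES; a negative id such as -1 would otherwise
    # wrap around and silently pick an entry from the end of the list
    if not 0 <= idx < len(NAMES):
        continue
    class_name = NAMES[idx]
    class_to_images[class_name].append(img_file)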

# ========= STEP 2: DETERMINE BALANCED CLASS COUNT =========
# Maximum number of images any class has
max_images = max(len(imgs) for imgs in class_to_images.values())

# Number of images for each split PER CLASS
train_n = int(max_images * SPLIT_RATIOS["train"])
val_n   = int(max_images * SPLIT_RATIOS["val"])
test_n  = max_images - train_n - val_n  # remaining

print(f"Per-class counts → Train={train_n}, Val={val_n}, Test={test_n}")

# ========= STEP 3: CREATE OUTPUT STRUCTURE =========
for split in ["train", "val", "test"]:
    (OUTPUT_ROOT / split).mkdir(parents=True, exist_ok=True)

# ========= STEP 4: BALANCED SAMPLING & COPYING =========
def oversample(images, needed):
    """Return exactly `needed` images (repeat if necessary)."""
    if len(images) == 0:
        return []  # shouldn't happen
    out = []
    while len(out) < needed:
        out.extend(images)
    return out[:needed]

for class_name, img_list in class_to_images.items():

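    # Oversample as needed for each split.
    # NOTE: all three splits draw from the start of the same list, so val/test images
    # also appear in train, and repeated images are copied under the same file name,
    # which overwrites them instead of duplicating them (see the class distribution below).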
    train_imgs = oversample(img_list, train_n)
    val_imgs   = oversample(img_list, val_n)
    test_imgs  = oversample(img_list, test_n)

    # Create class subfolders
    for split in ["train", "val", "test"]:
        (OUTPUT_ROOT / split / sanitize(class_name)).mkdir(parents=True, exist_ok=True)

    # Copy files
    for img in train_imgs:
        shutil.copy(img, OUTPUT_ROOT / "train" / sanitize(class_name) / img.name)

    for img in val_imgs:
        shutil.copy(img, OUTPUT_ROOT / "val" / sanitize(class_name) / img.name)

    for img in test_imgs:
        shutil.copy(img, OUTPUT_ROOT / "test" / sanitize(class_name) / img.name)

print("Balanced dataset creation complete.")


Per-class counts → Train=1043, Val=223, Test=225
Balanced dataset creation complete.
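
In the cell above, every split takes its images from the start of the same per-class list, and repeated images are copied under their original file name, so splits share images and repeats overwrite each other. One possible alternative is to cut each class into disjoint parts first and give repeated copies distinct names. A minimal sketch, not the original method: the seeded shuffle, the `_dupN` suffix, and the helper names `split_disjoint` and `oversampled_targets` are all assumptions made for illustration.

```python
import random
from pathlib import Path

def split_disjoint(items, train=0.7, val=0.15, seed=0):
    """Shuffle and cut `items` into train/val/test lists that share no element."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return {
        "train": items[:n_train],
        "val": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],  # whatever remains
    }

def oversampled_targets(images, needed):
    """Return `needed` (source, target_name) pairs, cycling through `images`.

    Repeats get a `_dupN` suffix so each copy lands in its own file.
    """
    images = [Path(p) for p in images]
    if not images:
        return []
    out = []
    for i in range(needed):
        src = images[i % len(images)]
        rep = i // len(images)  # 0 on the first pass through the list
        name = src.name if rep == 0 else f"{src.stem}_dup{rep}{src.suffix}"
        out.append((src, name))
    return out

demo = [f"VID02_{i:06d}.png" for i in range(10)]
parts = split_disjoint(demo)
print({split: len(files) for split, files in parts.items()})
print(oversampled_targets(parts["val"], 3))
```

Classes with only one or two images cannot populate all three splits this way, so they would need separate handling, for example being kept in train only.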


In [11]:
from pathlib import Path
from collections import defaultdict

YOLO_ROOT = Path("./YOLO_dataset")

splits = ["train", "val", "test"]

counts = {split: defaultdict(int) for split in splits}

for split in splits:
    split_path = YOLO_ROOT / split
    if not split_path.exists():
        print(f"Missing split folder: {split}")
        continue

    # Each class is a subfolder
    for class_folder in split_path.iterdir():
        if class_folder.is_dir():
            class_name = class_folder.name
            num_images = len(list(class_folder.glob("*.png")))
            counts[split][class_name] = num_images

# ==== Print summary ====
print("=== CLASS DISTRIBUTION ===")
for split in splits:
    print(f"\n--- {split.upper()} ---")
    for class_name, num_images in sorted(counts[split].items()):
        print(f"{class_name:30} : {num_images}")



=== CLASS DISTRIBUTION ===

--- TRAIN ---
bipolar_coagulate_cystic-duct  : 10
bipolar_coagulate_cystic-pedicle : 47
bipolar_coagulate_liver        : 469
bipolar_dissect_adhesion       : 23
bipolar_dissect_cystic-duct    : 65
bipolar_dissect_cystic-plate   : 2
bipolar_grasp_liver            : 1043
bipolar_grasp_specimen-bag     : 5
bipolar_null-verb_null-target  : 426
bipolar_retract_blood-vessel   : 83
bipolar_retract_cystic-pedicle : 7
bipolar_retract_liver          : 14
clipper_clip_cystic-artery     : 6
clipper_clip_cystic-plate      : 42
grasper_null-verb_null-target  : 1
grasper_pack_gallbladder       : 2
grasper_retract_liver          : 2
grasper_retract_omentum        : 3
hook_coagulate_cystic-pedicle  : 33
hook_coagulate_cystic-plate    : 113
hook_coagulate_gallbladder     : 20
hook_coagulate_liver           : 561
hook_coagulate_omentum         : 6
hook_dissect_cystic-duct       : 14
hook_dissect_gallbladder       : 8
hook_dissect_omentum           : 16
hook_retract_gallbladder

In [12]:
from pathlib import Path

YOLO_ROOT = Path("./YOLO_dataset")
splits = ["train", "val", "test"]

for split in splits:
    split_path = YOLO_ROOT / split
    if not split_path.exists():
        print(f"{split} folder missing.")
        continue

    # Count only directories (each one is a class)
    class_folders = [p for p in split_path.iterdir() if p.is_dir()]
    print(f"{split.upper()} → {len(class_folders)} class folders")


TRAIN → 38 class folders
VAL → 38 class folders
TEST → 38 class folders
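
Folder counts alone do not show whether the same image sits in more than one split. A minimal sketch of such a check, assuming the `YOLO_dataset/<split>/<class>/<file>.png` layout created above; `split_overlap` is a hypothetical helper:

```python
from pathlib import Path

def split_overlap(root, split_a, split_b, pattern="*.png"):
    """Return the relative paths (class/file) present in both split folders."""
    def rel_files(split):
        base = Path(root) / split
        return {p.relative_to(base).as_posix() for p in base.rglob(pattern)}
    return sorted(rel_files(split_a) & rel_files(split_b))

# Possible usage with the dataset folder defined above:
# shared = split_overlap("./YOLO_dataset", "train", "val")
# print(len(shared), "images appear in both train and val")
```

An empty result for each pair of splits would indicate that no image file name is shared between them.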
