# YOLO Detection Label Generation for Polyp Dataset

This notebook prepares **YOLO-formatted detection labels** for a large and heterogeneous **polyp detection dataset** that combines single images and video sequences from multiple sources (including **PolypGen** and others).  
The goal is to **automatically discover** all image and bounding-box folders across the **train**, **val**, and **test** splits and **generate YOLO labels** in a modular, folder-aware way.


## Dataset Organization

Each split (**train**, **val**, **test**) follows a structured layout containing:

- **Single images**  
  Located under `images_single/`; the corresponding bounding-box annotation text files are stored under `bbox/`.

- **Sequence data**  
  Grouped under `seq/`.

- **Positive sequences (`seqX`)**  
  Contain both `images_seqXX/` and `bbox_seqXX/` folders.

- **Negative sequences (`seqX_neg`)**  
  Contain only image frames without bounding boxes (no detected polyps).
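
The layout above can be walked with a few lines of `pathlib`. The sketch below is illustrative only: `discover_split` is a hypothetical helper, not one of the functions in `src/label_generation/`, and it assumes that negative sequences are exactly the folders whose names end in `_neg`.

```python
from pathlib import Path


def discover_split(split_dir):
    """Group the folders of one split by role, following the layout above."""
    split_dir = Path(split_dir)
    found = {"single": None, "positive": [], "negative": []}

    # Single images and their annotation folder live side by side.
    images_single = split_dir / "images_single"
    bbox_single = split_dir / "bbox"
    if images_single.is_dir() and bbox_single.is_dir():
        found["single"] = (images_single, bbox_single)

    seq_root = split_dir / "seq"
    if not seq_root.is_dir():
        return found

    for seq_dir in sorted(seq_root.iterdir()):
        if not seq_dir.is_dir():
            continue
        if seq_dir.name.endswith("_neg"):
            # Negative sequence: frames only, no annotation folder.
            found["negative"].append(seq_dir)
            continue
        # Positive sequence: expect an images_seq* and a bbox_seq* folder.
        img_dirs = sorted(seq_dir.glob("images_seq*"))
        box_dirs = sorted(seq_dir.glob("bbox_seq*"))
        if img_dirs and box_dirs:
            found["positive"].append((img_dirs[0], box_dirs[0]))
    return found
```

Calling `discover_split(ROOT / "data/detection2/train")` would return the single-image folder pair plus one entry per sequence, which is enough to drive a per-folder label generation loop.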


## Purpose of This Notebook

This notebook:

1. **Generates** static augmentations for single images to increase data diversity.
2. **Scans** the split folders automatically using helper functions located in `src/label_generation/`.
3. **Generates YOLO detection labels** (`.txt` files) for each positive sequence or single-image folder by reading original bounding-box annotation text files.
4. **Handles negative samples** by generating **empty label files** (to maintain YOLO compatibility).
5. **Keeps YOLO labels separated per folder**, allowing flexible split creation later (e.g., combining selected sequences into new training or validation sets).
6. **Integrates other datasets** such as **Kvasir** and **Real-Colon images**, which are combined to generate final YOLO training splits.


This modular structure ensures reproducibility, easy dataset updates, and seamless expansion with new polyp data sources.

In [52]:
from pathlib import Path
# Change the following base path:
BASE_PATH = "../"
ROOT = Path(BASE_PATH)

# Generate YOLO Detection Labels From TXT Files

### Train Set Images_Single (Polypgen) Label Generation

In [25]:
SCRIPT = ROOT / "src/label_generation/bboxes_to_yolo_det_labels.py"
IMG_ROOT = ROOT / "data/detection2/train/images_single"
BBOX_ROOT = ROOT / "data/detection2/train/bbox"
OUT_PATH =  ROOT / "data/detection2/train/yolo_label_folders/images_single"

!python "{SCRIPT}" \
  --img_root "{IMG_ROOT}" \
  --bbox_root "{BBOX_ROOT}" \
  --out_labels "{OUT_PATH}" \
  --class_map "polyp:0" \
  --min_area_rel 0.001 --min_w_rel 0.005 --min_h_rel 0.005 \
  --min_w_px 4 --min_h_px 4

[skip] tiny/noisy box (C3_EndoCV2021_00474:6) -> (1248.0,672.0,1288.0,696.0)
[done] images=1105 | with_boxes=1013 | boxes_total=1131 | labels_dir=..\data\detection2\train\yolo_label_folders\images_single
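
To make the size thresholds concrete, here is a minimal sketch of how a pixel box might be turned into a YOLO label line and rejected when it is too small. This is an assumption about the behaviour, not the code in `bboxes_to_yolo_det_labels.py`: the function name `to_yolo_line` is hypothetical, the input is assumed to be `(x1, y1, x2, y2)` in pixels, and the way the thresholds combine (any failed check rejects the box) is a guess.

```python
def to_yolo_line(box_xyxy, img_w, img_h, class_id=0,
                 min_area_rel=0.001, min_w_rel=0.005, min_h_rel=0.005,
                 min_w_px=4, min_h_px=4):
    """Return a YOLO label line for one pixel box, or None if it is too small."""
    x1, y1, x2, y2 = box_xyxy
    # Order the corners and clip them to the image bounds.
    x1, x2 = max(0.0, min(x1, x2)), min(float(img_w), max(x1, x2))
    y1, y2 = max(0.0, min(y1, y2)), min(float(img_h), max(y1, y2))

    w_px, h_px = x2 - x1, y2 - y1
    if w_px < min_w_px or h_px < min_h_px:
        return None  # fails the absolute pixel-size check

    w_rel, h_rel = w_px / img_w, h_px / img_h
    if w_rel < min_w_rel or h_rel < min_h_rel or w_rel * h_rel < min_area_rel:
        return None  # fails a relative width, height, or area check

    # YOLO format: class, box centre, and box size, all normalised to [0, 1].
    cx, cy = (x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w_rel:.6f} {h_rel:.6f}"
```

The skipped box in the log above measures 40 x 24 px. The log does not show the image size, but if the frame were 1920 x 1080 (an assumed value), the relative area would be about 0.00046, below `--min_area_rel 0.001`, so this sketch would also return `None` for it.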


### Val Set Images_Single (Polypgen) Label Generation

In [26]:
SCRIPT = ROOT / "src/label_generation/bboxes_to_yolo_det_labels.py"
IMG_ROOT = ROOT / "data/detection2/val/images_single"
BBOX_ROOT = ROOT / "data/detection2/val/bbox"
OUT_PATH =  ROOT / "data/detection2/val/yolo_label_folders/images_single"

!python "{SCRIPT}" \
  --img_root "{IMG_ROOT}" \
  --bbox_root "{BBOX_ROOT}" \
  --out_labels "{OUT_PATH}" \
  --class_map "polyp:0" \
  --min_area_rel 0.001 --min_w_rel 0.005 --min_h_rel 0.005 \
  --min_w_px 4 --min_h_px 4

[done] images=139 | with_boxes=128 | boxes_total=147 | labels_dir=..\data\detection2\val\yolo_label_folders\images_single


### Train Sequences Generate YOLO Labels

In [28]:
SCRIPT = ROOT / "src/label_generation/build_seq_yolo_det_labels.py"
SEQ_ROOT = ROOT / "data/detection2/train/seq"
OUT_PATH =  ROOT / "data/detection2/train/yolo_label_folders"

!python "{SCRIPT}" \
  --root "{SEQ_ROOT}" \
  --out_labels "{OUT_PATH}" \
  --mirror True \
  --neg_fraction 1.0 \
  --seed 42 \
  --min_area_rel 0.001 --min_w_rel 0.005 --min_h_rel 0.005 \
  --verbose 1

[pos] seq10: images=25 | with_boxes=7 | boxes_total=7
[pos] seq11: images=148 | with_boxes=83 | boxes_total=83
[pos] seq12: images=24 | with_boxes=24 | boxes_total=24
[pos] seq13: images=24 | with_boxes=20 | boxes_total=20
[pos] seq14: images=33 | with_boxes=24 | boxes_total=24
[neg] seq16_neg: selected=141/141 (fraction=1.000)
[neg] seq1_neg: selected=315/315 (fraction=1.000)
[pos] seq2: images=63 | with_boxes=60 | boxes_total=60
[neg] seq22_neg: selected=82/82 (fraction=1.000)
[neg] seq2_neg: selected=302/302 (fraction=1.000)
[pos] seq3: images=15 | with_boxes=15 | boxes_total=16
[neg] seq3_neg: selected=40/40 (fraction=1.000)
[pos] seq4: images=48 | with_boxes=46 | boxes_total=46
[neg] seq4_neg: selected=72/72 (fraction=1.000)
[pos] seq5: images=220 | with_boxes=170 | boxes_total=170
[neg] seq5_neg: selected=61/61 (fraction=1.000)
[pos] seq6: images=91 | with_boxes=62 | boxes_total=62
[neg] seq6_neg: selected=90/90 (fraction=1.000)
[neg] seq7_neg: selected=207/207 (fraction=1.000)
[

### Val Sequences Generate YOLO Labels

In [45]:
SCRIPT = ROOT / "src/label_generation/build_seq_yolo_det_labels.py"
SEQ_ROOT = ROOT / "data/detection2/val/seq"
OUT_PATH =  ROOT / "data/detection2/val/yolo_label_folders"

!python "{SCRIPT}" \
  --root "{SEQ_ROOT}" \
  --out_labels "{OUT_PATH}" \
  --mirror True \
  --neg_fraction 1.0 \
  --seed 42 \
  --min_area_rel 0.001 --min_w_rel 0.005 --min_h_rel 0.005 \
  --verbose 1

[neg] seq15_neg: selected=278/278 (fraction=1.000)
[done] pos_images=0 | pos_with_boxes=0 | pos_boxes_total=0 | neg_empty_selected=278 | labels_root=..\data\detection2\val\yolo_label_folders
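
The `[neg]` lines report how many frames of each negative sequence were selected for empty label files. A rough sketch of that idea follows, assuming `--neg_fraction` picks a seeded random subset of frames; `write_empty_labels` is a hypothetical name, and the real script may select frames differently.

```python
import random
from pathlib import Path


def write_empty_labels(frame_paths, out_dir, neg_fraction=1.0, seed=42):
    """Write an empty .txt label for a seeded random subset of negative frames."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    frames = sorted(Path(p) for p in frame_paths)
    n_keep = round(len(frames) * neg_fraction)

    # A dedicated Random instance keeps the selection reproducible.
    rng = random.Random(seed)
    selected = sorted(rng.sample(frames, n_keep))

    for frame in selected:
        # An empty label file tells YOLO the image contains no objects.
        (out_dir / f"{frame.stem}.txt").write_text("")
    return selected
```

With `neg_fraction=1.0`, as used in the cells above, every frame is kept, which matches the `selected=278/278` line in the log.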


## Kvasir Data Generate YOLO Labels

In [29]:
SCRIPT = ROOT / "src/label_generation/bboxes_to_yolo_det_labels.py"
IMG_ROOT = ROOT / "data/kvasir-seg/images"
BBOX_ROOT = ROOT / "data/kvasir-seg/bbox_labels"
OUT_PATH =  ROOT / "data/detection2/val/yolo_label_folders/kvasir"

!python "{SCRIPT}" \
  --img_root "{IMG_ROOT}" \
  --bbox_root "{BBOX_ROOT}" \
  --out_labels "{OUT_PATH}" \
  --class_map "polyp:0" \
  --min_area_rel 0.001 --min_w_rel 0.005 --min_h_rel 0.005 \
  --min_w_px 4 --min_h_px 4

[done] images=1000 | with_boxes=1000 | boxes_total=1064 | labels_dir=..\data\detection2\val\yolo_label_folders\kvasir


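## Real Colon

YOLO labels for the Real-Colon images were already generated, so no conversion step is run here. They are read from `data/RealColon/labels` when the splits are composed.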

## Augmentation of Single Positive Images

In [32]:
SCRIPT = ROOT / "src/augmentation/augment_single_pos_det.py"
IMG_ROOT = ROOT / "data/detection2/train/images_single"
BBOX_ROOT = ROOT / "data/detection2/train/bbox"
OUT_IMAGES_PATH =  ROOT / "data/detection2/train/aug_pos_images_det"
OUT_LABELS_PATH =  ROOT / "data/detection2/train/yolo_label_folders/aug_pos_labels_det"

!python "{SCRIPT}" \
  --img_root   "{IMG_ROOT}" \
  --bbox_root  "{BBOX_ROOT}" \
  --out_images "{OUT_IMAGES_PATH}" \
  --out_labels "{OUT_LABELS_PATH}" \
  --copies_per_img 2 \
  --class_map "polyp:0" \
  --min_area_rel 0.001 --min_w_rel 0.004 --min_h_rel 0.004 \
  --min_w_px 4 --min_h_px 4 \
  --skip_if_no_boxes \
  --seed 0 \
  --verbose 1

[info] images found: 1105
[progress] 100/1105 | aug_written=152 | boxes_kept=174
[progress] 200/1105 | aug_written=306 | boxes_kept=334
[progress] 300/1105 | aug_written=478 | boxes_kept=510
[progress] 400/1105 | aug_written=668 | boxes_kept=719
[progress] 500/1105 | aug_written=868 | boxes_kept=938
[progress] 600/1105 | aug_written=1068 | boxes_kept=1158
[progress] 700/1105 | aug_written=1266 | boxes_kept=1411
[progress] 800/1105 | aug_written=1436 | boxes_kept=1595
[progress] 900/1105 | aug_written=1618 | boxes_kept=1795
[progress] 1000/1105 | aug_written=1816 | boxes_kept=2011
[progress] 1100/1105 | aug_written=2013 | boxes_kept=2246
[progress] 1105/1105 | aug_written=2023 | boxes_kept=2258
[done] images=1105 | aug_written=2023 | boxes_kept=2258 | out_images=..\data\detection2\train\aug_pos_images_det | out_labels=..\data\detection2\train\yolo_label_folders\aug_pos_labels_det
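
Any geometric augmentation has to transform the boxes together with the pixels. The notebook does not show which transforms `augment_single_pos_det.py` applies, so the example below is only an illustration of the principle: a horizontal flip applied to one YOLO label line (`hflip_yolo` is a hypothetical helper).

```python
def hflip_yolo(line):
    """Mirror one YOLO label line horizontally: only the x centre changes."""
    cls, cx, cy, w, h = line.split()
    # Normalised coordinates make the flip independent of image size.
    return f"{cls} {1.0 - float(cx):.6f} {float(cy):.6f} {float(w):.6f} {float(h):.6f}"
```

A box centred at `cx = 0.2` ends up centred at `cx = 0.8`, while its vertical position and size are unchanged.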


# Combine Images and Labels To Generate YOLO Splits

In [59]:
import json

TRAIN_SOURCES = {
    "train_single": {
        "images": str(ROOT / "data/detection2/train/images_single"),
        "labels": str(ROOT / "data/detection2/train/yolo_label_folders/images_single"),
        "recursive": False,
        "label_mode": "flat"
    },
    "train_seq_pos": {
        "images": str(ROOT / "data/detection2/train/seq"),
        "labels": str(ROOT / "data/detection2/train/yolo_label_folders"),
        "recursive": True,
        "glob": "seq*/images_seq*/*",
        "label_mode": "seq_pos"
    },
    "train_seq_neg": {
        "images": str(ROOT / "data/detection2/train/seq"),
        "labels": str(ROOT / "data/detection2/train/yolo_label_folders"),
        "recursive": True,
        "glob": "seq*_neg/*",
        "label_mode": "seq_neg"
    },
    "aug_pos_single": {
        "images": str(ROOT / "data/detection2/train/aug_pos_images_det"),
        "labels": str(ROOT / "data/detection2/train/yolo_label_folders/aug_pos_labels_det"),
        "recursive": False,
        "label_mode": "flat"
    },
    "aug_neg_seq": {
        "images": str(ROOT / "data/detection2/train/aug_neg_images"),
        "labels": str(ROOT / "data/detection2/train/yolo_label_folders/aug_neg_labels"),
        "recursive": False,
        "label_mode": "flat"
    },
    "real_colon": {
        "images": str(ROOT / "data/RealColon/images"),
        "labels": str(ROOT / "data/RealColon/labels"),
        "recursive": False,
        "label_mode": "flat"
    },
    "kvasir": {
        "images": str(ROOT / "data/kvasir-seg/images"),
        "labels": str(ROOT / "data/detection2/val/yolo_label_folders/kvasir"),
        "recursive": False,
        "label_mode": "flat"
    }
}

train_tmp_json = Path("sources_tmp.json")
train_tmp_json.write_text(json.dumps(TRAIN_SOURCES, indent=2))

1448
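
For sources with `"label_mode": "flat"`, a natural reading is that each image is paired with a label file of the same stem in the labels folder. The sketch below shows that pairing under this assumption; `pair_flat_source` and the `IMAGE_EXTS` set are hypothetical, and the `seq_pos` and `seq_neg` modes are not covered because their label layout is not shown in this notebook.

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}  # assumed image extensions


def pair_flat_source(entry):
    """Pair each image of a 'flat' source with <labels>/<image stem>.txt."""
    images_dir = Path(entry["images"])
    labels_dir = Path(entry["labels"])

    pairs, missing = [], []
    for img in sorted(images_dir.iterdir()):
        if img.suffix.lower() not in IMAGE_EXTS:
            continue
        label = labels_dir / f"{img.stem}.txt"
        # Keep complete pairs and report images without a label separately.
        if label.is_file():
            pairs.append((img, label))
        else:
            missing.append(img)
    return pairs, missing
```

For example, `pair_flat_source(TRAIN_SOURCES["train_single"])` would return the image and label pairs plus the list of unlabeled images, comparable to the `missing_labels` counter printed by the composer.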

In [44]:
# Choose which sources to include
INCLUDE = "train_single,train_seq_pos,train_seq_neg,aug_pos_single,aug_neg_seq,real_colon"
SCRIPT = ROOT / "src/label_generation/compose_from_dict.py"
# Run the composer
!python "{SCRIPT}" \
  --sources "{train_tmp_json}" \
  --include "{INCLUDE}" \
  --out "{ROOT / 'data/detection2/yolo_split3/train'}" \
  --copy_mode copy \
  --verbose 1

[info] OUT=..\data\detection2\yolo_split3\train | copy_mode=copy | dry_run=False | allow_missing=False
[src] train_single    imgs_dir=..\data\detection2\train\images_single labels_dir=..\data\detection2\train\yolo_label_folders\images_single mode=flat rec=False glob=None | imgs=1105
[keep] train_single: 1105 files (missing_labels=0)
[src] train_seq_pos   imgs_dir=..\data\detection2\train\seq labels_dir=..\data\detection2\train\yolo_label_folders mode=seq_pos rec=True glob=seq*/images_seq*/* | imgs=815
[keep] train_seq_pos: 815 files (missing_labels=0)
[src] train_seq_neg   imgs_dir=..\data\detection2\train\seq labels_dir=..\data\detection2\train\yolo_label_folders mode=seq_neg rec=True glob=seq*_neg/* | imgs=1720
[keep] train_seq_neg: 1720 files (missing_labels=0)
[src] aug_pos_single  imgs_dir=..\data\detection2\train\aug_pos_images_det labels_dir=..\data\detection2\train\yolo_label_folders\aug_pos_labels_det mode=flat rec=False glob=None | imgs=2023
[keep] aug_pos_single: 2023 files 
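
The `--include` value is a comma-separated list of source names. How `compose_from_dict.py` parses it is not shown here; a tolerant parser could look like the following sketch, where `parse_include` is a hypothetical helper that strips stray whitespace and rejects names that are missing from the sources dictionary.

```python
def parse_include(include, sources):
    """Split a comma-separated include string and validate each name."""
    # Strip whitespace so "a, b" and "a,b" behave the same; drop empty items.
    names = [name.strip() for name in include.split(",") if name.strip()]

    unknown = [name for name in names if name not in sources]
    if unknown:
        raise KeyError(f"unknown sources: {unknown}")
    return names
```

Validating the names up front turns a misspelled source into an immediate error instead of a silently smaller split.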

### Combine Validation Images and Labels

In [58]:
VAL_SOURCES = {
    "val_single": {
        "images": str(ROOT / "data/detection2/val/images_single"),
        "labels": str(ROOT / "data/detection2/val/yolo_label_folders/images_single"),
        "recursive": False,
        "label_mode": "flat"
    },
    "val_seq_pos": {
        "images": str(ROOT / "data/detection2/val/seq"),
        "labels": str(ROOT / "data/detection2/val/yolo_label_folders"),
        "recursive": True,
        "glob": "seq*/images_seq*/*",
        "label_mode": "seq_pos"
    },
    "val_seq_neg": {
        "images": str(ROOT / "data/detection2/val/seq"),
        "labels": str(ROOT / "data/detection2/val/yolo_label_folders"),
        "recursive": True,
        "glob": "seq*_neg/*",
        "label_mode": "seq_neg"
    },
    "kvasir": {
        "images": str(ROOT / "data/kvasir-seg/images"),
        "labels": str(ROOT / "data/detection2/val/yolo_label_folders/kvasir"),
        "recursive": False,
        "label_mode": "flat"
    }

}

val_tmp_json = Path("sources_tmp2.json")
val_tmp_json.write_text(json.dumps(VAL_SOURCES, indent=2))

827

In [49]:
# Choose which sources to include
INCLUDE_VAL = "val_single,val_seq_pos,val_seq_neg,kvasir"
SCRIPT = ROOT / "src/label_generation/compose_from_dict.py"
# Run the composer
!python "{SCRIPT}" \
  --sources "{val_tmp_json}" \
  --include "{INCLUDE_VAL}" \
  --out "{ROOT / 'data/detection2/yolo_split3/val'}" \
  --copy_mode copy \
  --verbose 1

[info] OUT=..\data\detection2\yolo_split3\val | copy_mode=copy | dry_run=False | allow_missing=False
[src] val_single      imgs_dir=..\data\detection2\val\images_single labels_dir=..\data\detection2\val\yolo_label_folders\images_single mode=flat rec=False glob=None | imgs=139
[keep] val_single: 139 files (missing_labels=0)
[src] val_seq_pos     imgs_dir=..\data\detection2\val\seq labels_dir=..\data\detection2\val\yolo_label_folders mode=seq_pos rec=True glob=seq*/images_seq*/* | imgs=0
[keep] val_seq_pos: 0 files (missing_labels=0)
[src] val_seq_neg     imgs_dir=..\data\detection2\val\seq labels_dir=..\data\detection2\val\yolo_label_folders mode=seq_neg rec=True glob=seq*_neg/* | imgs=278
[keep] val_seq_neg: 278 files (missing_labels=0)
[src] kvasir          imgs_dir=..\data\kvasir-seg\images labels_dir=..\data\detection2\val\yolo_label_folders\kvasir mode=flat rec=False glob=None | imgs=1000
[keep] kvasir: 1000 files (missing_labels=0)

[done] wrote 1417 image+label pairs into ..\data

In [50]:
from pathlib import Path
import yaml

# ===============================
# CONFIG
# ===============================
YAML_PATH = ROOT / "configs" / "data_yolo_split3.yaml"

data_yolo_split3 = {
    "path": str(ROOT / "data/detection2/yolo_split3"),
    "train": "train/images",
    "val": "val/images",
    "test": "test/images",  # optional
    "nc": 1,
    "names": {0: "polyp"},
    "description": (
        "Combined split including: "
        "train_single, train_seq_pos, train_seq_neg, "
        "aug_pos_single, aug_neg_seq, real_colon1, real_colon2 "
        "— generated via compose_from_dict.py → yolo_split3"
    ),
}

# ===============================
# WRITE YAML
# ===============================
YAML_PATH.parent.mkdir(parents=True, exist_ok=True)

with open(YAML_PATH, "w") as f:
    yaml.dump(data_yolo_split3, f, sort_keys=False)

print(f"[done] YOLO data config written to:\n{YAML_PATH.resolve()}")

[done] YOLO data config written to:
C:\Users\Betul\Desktop\Projects\Polyp\configs\data_yolo_split3.yaml
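
Before training, it can help to confirm that the folders named in the config exist. The sketch below works on the config dictionary itself; `check_data_config` is a hypothetical helper, not part of any YOLO tooling.

```python
from pathlib import Path


def check_data_config(cfg):
    """Report which split folders of a YOLO data config exist on disk."""
    base = Path(cfg["path"])
    report = {}
    for split in ("train", "val", "test"):
        rel = cfg.get(split)
        if rel is None:
            continue  # split not listed in the config
        report[split] = (base / rel).is_dir()
    # The class count should agree with the number of class names.
    report["nc_matches_names"] = len(cfg["names"]) == cfg["nc"]
    return report
```

Running `check_data_config(data_yolo_split3)` would report `test` as missing unless a `test/images` folder is created elsewhere, since the cells shown here only compose the `train` and `val` folders.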


# YOLO Split 4

In [53]:
# Choose which sources to include
INCLUDE = "train_single,train_seq_pos,train_seq_neg,aug_pos_single,real_colon"
SCRIPT = ROOT / "src/label_generation/compose_from_dict.py"
# Run the composer
!python "{SCRIPT}" \
  --sources "{train_tmp_json}" \
  --include "{INCLUDE}" \
  --out "{ROOT / 'data/detection2/yolo_split4/train'}" \
  --copy_mode copy \
  --verbose 1

[info] OUT=..\data\detection2\yolo_split4\train | copy_mode=copy | dry_run=False | allow_missing=False
[src] train_single    imgs_dir=..\data\detection2\train\images_single labels_dir=..\data\detection2\train\yolo_label_folders\images_single mode=flat rec=False glob=None | imgs=1105
[keep] train_single: 1105 files (missing_labels=0)
[src] train_seq_pos   imgs_dir=..\data\detection2\train\seq labels_dir=..\data\detection2\train\yolo_label_folders mode=seq_pos rec=True glob=seq*/images_seq*/* | imgs=815
[keep] train_seq_pos: 815 files (missing_labels=0)
[src] train_seq_neg   imgs_dir=..\data\detection2\train\seq labels_dir=..\data\detection2\train\yolo_label_folders mode=seq_neg rec=True glob=seq*_neg/* | imgs=1720
[keep] train_seq_neg: 1720 files (missing_labels=0)
[src] aug_pos_single  imgs_dir=..\data\detection2\train\aug_pos_images_det labels_dir=..\data\detection2\train\yolo_label_folders\aug_pos_labels_det mode=flat rec=False glob=None | imgs=2023
[keep] aug_pos_single: 2023 files 

In [55]:
# Choose which sources to include
INCLUDE_VAL = "val_single,val_seq_pos,val_seq_neg,kvasir"
SCRIPT = ROOT / "src/label_generation/compose_from_dict.py"
# Run the composer
!python "{SCRIPT}" \
  --sources "{val_tmp_json}" \
  --include "{INCLUDE_VAL}" \
  --out "{ROOT / 'data/detection2/yolo_split4/val'}" \
  --copy_mode copy \
  --verbose 1

[info] OUT=..\data\detection2\yolo_split4\val | copy_mode=copy | dry_run=False | allow_missing=False
[src] val_single      imgs_dir=..\data\detection2\val\images_single labels_dir=..\data\detection2\val\yolo_label_folders\images_single mode=flat rec=False glob=None | imgs=139
[keep] val_single: 139 files (missing_labels=0)
[src] val_seq_pos     imgs_dir=..\data\detection2\val\seq labels_dir=..\data\detection2\val\yolo_label_folders mode=seq_pos rec=True glob=seq*/images_seq*/* | imgs=0
[keep] val_seq_pos: 0 files (missing_labels=0)
[src] val_seq_neg     imgs_dir=..\data\detection2\val\seq labels_dir=..\data\detection2\val\yolo_label_folders mode=seq_neg rec=True glob=seq*_neg/* | imgs=278
[keep] val_seq_neg: 278 files (missing_labels=0)
[src] kvasir          imgs_dir=..\data\kvasir-seg\images labels_dir=..\data\detection2\val\yolo_label_folders\kvasir mode=flat rec=False glob=None | imgs=1000
[keep] kvasir: 1000 files (missing_labels=0)

[done] wrote 1417 image+label pairs into ..\data

## YOLO Split 5 - Kvasir in Training

In [60]:
# Choose which sources to include
INCLUDE = "train_single,train_seq_pos,train_seq_neg,aug_pos_single,real_colon,kvasir"
SCRIPT = ROOT / "src/label_generation/compose_from_dict.py"
# Run the composer
!python "{SCRIPT}" \
  --sources "{train_tmp_json}" \
  --include "{INCLUDE}" \
  --out "{ROOT / 'data/detection2/yolo_split5/train'}" \
  --copy_mode copy \
  --verbose 1

[info] OUT=..\data\detection2\yolo_split5\train | copy_mode=copy | dry_run=False | allow_missing=False
[src] train_single    imgs_dir=..\data\detection2\train\images_single labels_dir=..\data\detection2\train\yolo_label_folders\images_single mode=flat rec=False glob=None | imgs=1105
[keep] train_single: 1105 files (missing_labels=0)
[src] train_seq_pos   imgs_dir=..\data\detection2\train\seq labels_dir=..\data\detection2\train\yolo_label_folders mode=seq_pos rec=True glob=seq*/images_seq*/* | imgs=815
[keep] train_seq_pos: 815 files (missing_labels=0)
[src] train_seq_neg   imgs_dir=..\data\detection2\train\seq labels_dir=..\data\detection2\train\yolo_label_folders mode=seq_neg rec=True glob=seq*_neg/* | imgs=1720
[keep] train_seq_neg: 1720 files (missing_labels=0)
[src] aug_pos_single  imgs_dir=..\data\detection2\train\aug_pos_images_det labels_dir=..\data\detection2\train\yolo_label_folders\aug_pos_labels_det mode=flat rec=False glob=None | imgs=2023
[keep] aug_pos_single: 2023 files 

In [61]:
# Choose which sources to include
INCLUDE_VAL = "val_single,val_seq_pos,val_seq_neg"
SCRIPT = ROOT / "src/label_generation/compose_from_dict.py"
# Run the composer
!python "{SCRIPT}" \
  --sources "{val_tmp_json}" \
  --include "{INCLUDE_VAL}" \
  --out "{ROOT / 'data/detection2/yolo_split5/val'}" \
  --copy_mode copy \
  --verbose 1

[info] OUT=..\data\detection2\yolo_split5\val | copy_mode=copy | dry_run=False | allow_missing=False
[src] val_single      imgs_dir=..\data\detection2\val\images_single labels_dir=..\data\detection2\val\yolo_label_folders\images_single mode=flat rec=False glob=None | imgs=139
[keep] val_single: 139 files (missing_labels=0)
[src] val_seq_pos     imgs_dir=..\data\detection2\val\seq labels_dir=..\data\detection2\val\yolo_label_folders mode=seq_pos rec=True glob=seq*/images_seq*/* | imgs=0
[keep] val_seq_pos: 0 files (missing_labels=0)
[src] val_seq_neg     imgs_dir=..\data\detection2\val\seq labels_dir=..\data\detection2\val\yolo_label_folders mode=seq_neg rec=True glob=seq*_neg/* | imgs=278
[keep] val_seq_neg: 278 files (missing_labels=0)

[done] wrote 417 image+label pairs into ..\data\detection2\yolo_split5\val
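
The total of 417 pairs agrees with the per-source counts above (139 + 0 + 278). A composed split can also be checked directly on disk. The sketch below assumes the output folder holds sibling `images/` and `labels/` directories, which is the usual YOLO convention but is not confirmed by the log; `summarize_split` is a hypothetical helper.

```python
from pathlib import Path


def summarize_split(split_dir, image_exts=(".jpg", ".jpeg", ".png", ".bmp")):
    """Count images, missing labels, negatives, and boxes in a composed split."""
    split_dir = Path(split_dir)
    images = [p for p in (split_dir / "images").iterdir()
              if p.suffix.lower() in image_exts]

    summary = {"images": len(images), "missing_labels": 0,
               "negatives": 0, "boxes": 0}
    for img in images:
        label = split_dir / "labels" / f"{img.stem}.txt"
        if not label.is_file():
            summary["missing_labels"] += 1
            continue
        # Each non-blank line is one box; an empty file marks a negative image.
        lines = [ln for ln in label.read_text().splitlines() if ln.strip()]
        if lines:
            summary["boxes"] += len(lines)
        else:
            summary["negatives"] += 1
    return summary
```

For the validation split composed above, one would expect `images` to be 417 and `negatives` to be at least 278, given that all negative-sequence frames received empty labels.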


In [62]:
from pathlib import Path
import yaml

# ===============================
# CONFIG
# ===============================
YAML_PATH = ROOT / "configs" / "data_yolo_split5.yaml"

data_yolo_split5 = {
    "path": str(ROOT / "data/detection2/yolo_split5"),
    "train": "train/images",
    "val": "val/images",
    "test": "test/images",  # optional
    "nc": 1,
    "names": {0: "polyp"},
    "description": (
        "Combined split including: "
        "train_single, train_seq_pos, train_seq_neg, "
        "aug_pos_single, real_colon1, real_colon2,kvasir "
        "— generated via compose_from_dict.py → yolo_split5"
    ),
}

# ===============================
# WRITE YAML
# ===============================
YAML_PATH.parent.mkdir(parents=True, exist_ok=True)

with open(YAML_PATH, "w") as f:
    yaml.dump(data_yolo_split5, f, sort_keys=False)

print(f"[done] YOLO data config written to:\n{YAML_PATH.resolve()}")

[done] YOLO data config written to:
C:\Users\Betul\Desktop\Projects\Polyp\configs\data_yolo_split5.yaml
