# YOLO Dataset Preparation
**Project:** Automated Expense Extraction - Receipt Parsing Using YOLO and OCR

## Objectives
1. Convert SROIE format to YOLO-compatible format
2. Map SROIE 4-point polygon boxes (8 coordinates) to YOLO normalized bounding boxes
3. Match text labels with bounding boxes using fuzzy matching
4. Create train/val splits (SROIE `train` → YOLO `train`, SROIE `test` → YOLO `val`)
5. Generate YAML configuration for YOLO training

## SROIE vs YOLO Format

### SROIE Format (Input):
```
Structure:
  - box/*.txt: x1,y1,x2,y2,x3,y3,x4,y4,TEXT_CONTENT
  - entities/*.txt: {"company": "...", "date": "...", "total": "..."}
  - img/*.jpg: Original images
```

### YOLO Format (Output):
```
Structure:
  - images/train/*.jpg: Training images
  - images/val/*.jpg: Validation images
  - labels/train/*.txt: class_id x_center y_center width height (normalized 0-1)
  - labels/val/*.txt: class_id x_center y_center width height (normalized 0-1)
  - dataset.yaml: Dataset configuration
```
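
As a minimal sketch of the input side, the snippet below parses a single `box/*.txt` line into its 8 coordinates and its text. The sample line and the helper name `parse_box_line` are illustrative assumptions, not taken from the dataset or from the notebook cells.

```python
def parse_box_line(line):
    """Split one SROIE box line into 8 integer coordinates and its text."""
    parts = line.strip().split(",")
    if len(parts) < 9:
        return None  # malformed line: fewer than 8 coordinates plus text
    coords = [int(p) for p in parts[:8]]
    # The text itself may contain commas, so re-join everything after field 8
    text = ",".join(parts[8:]).strip()
    return coords, text


# Hypothetical sample line (made up for illustration)
sample = "100,50,300,50,300,100,100,100,TOTAL: 1,234.50"
coords, text = parse_box_line(sample)
print(coords)
print(text)
```

Re-joining the trailing fields matters because amounts and addresses on receipts often contain commas of their own.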

## Target Classes
- **Class 0:** Company/Vendor name (`company` in the SROIE entities, `vendor` in `dataset.yaml`)
- **Class 1:** Date
- **Class 2:** Total amount

## Setup & Imports

In [12]:
import json
import os
import shutil
from pathlib import Path
from tqdm.notebook import tqdm
from PIL import Image
from difflib import SequenceMatcher

In [13]:
# Check if we are running in Google Colab and set paths accordingly
if 'COLAB_GPU' in os.environ:
    # Mount Google Drive (for Colab)
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Configuration for Google Colab
    DATA_PATH = Path('/content/drive/MyDrive/data')
else:
    # Local configuration
    DATA_PATH = Path('../data')

# Define directories based on DATA_PATH
RAW_DIR = DATA_PATH / "raw/SROIE2019"
YOLO_DIR = DATA_PATH / "processed/yolo_dataset"

# Verify paths
for directory in [RAW_DIR, YOLO_DIR]:
    print(f"{'Raw' if directory == RAW_DIR else 'YOLO'} directory: {directory}")
    print(f"{'Exists' if directory.exists() else 'Does not exist'}")


Raw directory: ../data/raw/SROIE2019
Exists
YOLO directory: ../data/processed/yolo_dataset
Exists


In [14]:
# YOLO class mapping
CLASS_MAP = {
    "company": 0,
    "date": 1,
    "total": 2
}

print(f"\nClass mapping: {CLASS_MAP}")


Class mapping: {'company': 0, 'date': 1, 'total': 2}


## 1. Initialize Output Directory Structure

In [15]:
# Clean and create directory structure
if YOLO_DIR.exists():
    print(f"Removing existing directory: {YOLO_DIR}")
    shutil.rmtree(YOLO_DIR)

# Create YOLO directory structure
for split in ['train', 'val']:
    (YOLO_DIR / 'images' / split).mkdir(parents=True, exist_ok=True)
    (YOLO_DIR / 'labels' / split).mkdir(parents=True, exist_ok=True)

print("✅ Directory structure created:")
print(f"   {YOLO_DIR}/images/train/")
print(f"   {YOLO_DIR}/images/val/")
print(f"   {YOLO_DIR}/labels/train/")
print(f"   {YOLO_DIR}/labels/val/")

Removing existing directory: ../data/processed/yolo_dataset
✅ Directory structure created:
   ../data/processed/yolo_dataset/images/train/
   ../data/processed/yolo_dataset/images/val/
   ../data/processed/yolo_dataset/labels/train/
   ../data/processed/yolo_dataset/labels/val/


## 2. Coordinate Conversion Functions

In [16]:
def sroie_to_yolo(coords_list, img_w, img_h):
    """
    Convert a SROIE 4-point polygon to a YOLO normalized bounding box.

    SROIE Format: [x1, y1, x2, y2, x3, y3, x4, y4] (8 coordinates for 4 corner points)
    YOLO Format: class_id x_center y_center width height (all normalized 0-1)
    The class_id is not part of the returned string; the caller prepends it.

    Args:
        coords_list: List of 8 coordinates [x1,y1,x2,y2,x3,y3,x4,y4]
        img_w: Image width (pixels)
        img_h: Image height (pixels)

    Returns:
        str: YOLO format bbox "x_center y_center width height" (normalized)
    """
    # Extract x and y coordinates
    xs = coords_list[0::2]  # x1, x2, x3, x4
    ys = coords_list[1::2]  # y1, y2, y3, y4

    # Get bounding box from polygon
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    # Clip to image boundaries
    x_min = max(0, x_min)
    y_min = max(0, y_min)
    x_max = min(img_w, x_max)
    y_max = min(img_h, y_max)

    # Calculate width, height, and center
    bbox_w = x_max - x_min
    bbox_h = y_max - y_min
    center_x = x_min + (bbox_w / 2)
    center_y = y_min + (bbox_h / 2)

    # Normalize to 0-1 range
    norm_x = center_x / img_w
    norm_y = center_y / img_h
    norm_w = bbox_w / img_w
    norm_h = bbox_h / img_h

    return f"{norm_x:.6f} {norm_y:.6f} {norm_w:.6f} {norm_h:.6f}"

# # Test the conversion function
# print("Testing coordinate conversion:")
# test_coords = [100, 50, 300, 50, 300, 100, 100, 100]  # Rectangle
# test_result = sroie_to_yolo(test_coords, img_w=1000, img_h=1500)
# print(f"  Input (8-point): {test_coords}")
# print(f"  Output (YOLO): {test_result}")
# print(f"  Format: x_center y_center width height (normalized)")
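
To make the arithmetic concrete, here is a self-contained sketch that repeats the same steps on the rectangle from the commented-out test above (a 1000×1500 image). `polygon_to_yolo` is a hypothetical compact stand-in for `sroie_to_yolo`, and `yolo_to_pixels` is an assumed helper for the inverse mapping, which can be handy when drawing boxes to check labels visually.

```python
def polygon_to_yolo(coords, img_w, img_h):
    """Axis-aligned box around a 4-point polygon, normalized to 0-1."""
    xs, ys = coords[0::2], coords[1::2]
    # Clip the enclosing box to the image boundaries
    x_min, x_max = max(0, min(xs)), min(img_w, max(xs))
    y_min, y_max = max(0, min(ys)), min(img_h, max(ys))
    w, h = x_max - x_min, y_max - y_min
    return ((x_min + w / 2) / img_w, (y_min + h / 2) / img_h, w / img_w, h / img_h)


def yolo_to_pixels(box, img_w, img_h):
    """Inverse mapping: normalized (cx, cy, w, h) to pixel (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (
        round((cx - w / 2) * img_w),
        round((cy - h / 2) * img_h),
        round((cx + w / 2) * img_w),
        round((cy + h / 2) * img_h),
    )


rect = [100, 50, 300, 50, 300, 100, 100, 100]
box = polygon_to_yolo(rect, img_w=1000, img_h=1500)
yolo_line = " ".join(f"{v:.6f}" for v in box)
print(yolo_line)                         # 0.200000 0.050000 0.200000 0.033333
print(yolo_to_pixels(box, 1000, 1500))   # (100, 50, 300, 100)
```

The box spans x from 100 to 300 and y from 50 to 100, so the center is (200, 75) and the size is 200×50 pixels; dividing by the image width and height gives the four normalized values.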

## 3. Text Matching Functions

In [17]:
def fuzzy_match_text(text1, text2):
    """
    Compute fuzzy similarity between two strings.

    Uses SequenceMatcher on lowercased text, so case differences are
    ignored, while typos and extra spaces only lower the score slightly.
    Thresholding is left to the caller.

    Args:
        text1: First string
        text2: Second string

    Returns:
        float: Similarity ratio (0-1)
    """
    return SequenceMatcher(None, text1.lower(), text2.lower()).ratio()


def find_best_bbox_for_label(target_text, doc_lines, threshold=0.85):
    """
    Find the best matching bounding box for a given label text.

    Strategy:
    1. Compare target text with all line texts
    2. Use fuzzy matching to handle OCR variations
    3. Return bbox with highest similarity above threshold

    Args:
        target_text: Text to match (from entities file)
        doc_lines: List of dicts with 'coords' and 'text' keys
        threshold: Minimum similarity score

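    Returns:
        tuple: (best_coords, best_ratio), where best_coords is the list
            [x1,y1,...,x4,y4] of the best match, or None if no line scores
            above the threshold (best_ratio is then 0)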
    """
    best_ratio = 0
    best_coords = None

    for line_obj in doc_lines:
        ratio = fuzzy_match_text(line_obj['text'], target_text)

        if ratio > threshold and ratio > best_ratio:
            best_ratio = ratio
            best_coords = line_obj['coords']

    return best_coords, best_ratio


# # Test fuzzy matching
# print("Testing fuzzy matching:")
# test_pairs = [
#     ("ABC Company Ltd.", "ABC COMPANY LTD"),
#     ("01/12/2023", "01-12-2023"),
#     ("Total: $123.45", "TOTAL $123.45")
# ]

# for text1, text2 in test_pairs:
#     ratio = fuzzy_match_text(text1, text2)
#     print(f"  '{text1}' vs '{text2}': {ratio:.2f}")
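
One property of whole-line matching is worth seeing in numbers. `SequenceMatcher.ratio()` is `2*M/T`, where `M` is the number of matched characters and `T` is the combined length of both strings, so a short value embedded in a longer OCR line scores low even when it appears verbatim. The sketch below uses made-up strings, `line_ratio` is a hypothetical name, and the containment check is an assumed alternative, not something the notebook does.

```python
from difflib import SequenceMatcher


def line_ratio(line_text, target_text):
    """Whole-line similarity, case-insensitive (same idea as fuzzy_match_text)."""
    return SequenceMatcher(None, line_text.lower(), target_text.lower()).ratio()


# Hypothetical OCR line and entity value (made up for illustration)
ocr_line = "DATE: 25/12/2018 8:13:39 PM"
entity_date = "25/12/2018"

# 10 matched characters out of 27 + 10 in total: 2 * 10 / 37
ratio = line_ratio(ocr_line, entity_date)
print(f"whole-line ratio: {ratio:.2f}")  # 0.54, well below 0.85

# Assumed alternative: accept a line that contains the target verbatim
contains = entity_date.lower() in ocr_line.lower()
print(f"target contained in line: {contains}")
```

This may account for part of the lower date match rate in the conversion output further down, though that would need checking against the actual box files.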

## 4. Main Conversion Pipeline

In [18]:
def process_single_receipt(file_id, split, raw_split_dir, proc_img_dir, yolo_split):
    """
    Process a single receipt: load annotations, match labels, convert to YOLO.

    Args:
        file_id: Receipt ID (filename without extension)
        split: Original split ('train' or 'test'); currently unused
        raw_split_dir: Path to raw SROIE split directory
        proc_img_dir: Path to the image directory (here the raw 'img' folder)
        yolo_split: YOLO split name ('train' or 'val')

    Returns:
        dict: Processing status and statistics
    """
    # Define paths
    gt_file = raw_split_dir / "entities" / f"{file_id}.txt"
    box_file = raw_split_dir / "box" / f"{file_id}.txt"
    img_path = proc_img_dir / f"{file_id}.jpg"

    # Check if all files exist
    if not all([gt_file.exists(), box_file.exists(), img_path.exists()]):
        return {'success': False, 'reason': 'missing_files'}

    # 1. Load ground truth labels (entities)
    try:
        with open(gt_file, 'r', encoding='utf-8') as f:
            content = f.read()
            # Fix common JSON formatting issues in SROIE dataset
            content = content.replace(",\n}", "}").replace(",}", "}")
            gt_data = json.loads(content)
    except (json.JSONDecodeError, UnicodeDecodeError) as e:
        return {'success': False, 'reason': 'invalid_json', 'error': str(e)}

    # 2. Load bounding box coordinates
    doc_lines = []
    try:
        with open(box_file, 'r', encoding='utf-8', errors='ignore') as f:
            for line in f:
                parts = line.strip().split(',')
                if len(parts) >= 9:
                    # First 8 parts are coordinates
                    coords = [int(p) for p in parts[:8]]
                    # Remaining parts form the text (may contain commas)
                    text = ",".join(parts[8:]).strip()
                    doc_lines.append({'coords': coords, 'text': text})
    except Exception as e:
        return {'success': False, 'reason': 'invalid_box_file', 'error': str(e)}

    # 3. Get image dimensions
    try:
        with Image.open(img_path) as img:
            img_w, img_h = img.size
    except Exception as e:
        return {'success': False, 'reason': 'invalid_image', 'error': str(e)}

    # 4. Match labels to bounding boxes
    yolo_labels = []
    matched_fields = []

    for field_name, class_id in CLASS_MAP.items():
        target_text = gt_data.get(field_name)
        if not target_text:
            continue

        # Find best matching box (coords are None if no line scores above the 0.85 default)
        best_coords, match_ratio = find_best_bbox_for_label(target_text, doc_lines)

        if best_coords is not None:
            # Convert to YOLO format
            yolo_bbox = sroie_to_yolo(best_coords, img_w, img_h)
            yolo_labels.append(f"{class_id} {yolo_bbox}")
            matched_fields.append(field_name)

    # 5. Save if we found at least one label
    if yolo_labels:
        # Save labels
        label_path = YOLO_DIR / "labels" / yolo_split / f"{file_id}.txt"
        with open(label_path, 'w') as f:
            f.write("\n".join(yolo_labels))

        # Copy preprocessed image
        dest_img = YOLO_DIR / "images" / yolo_split / f"{file_id}.jpg"
        shutil.copy(img_path, dest_img)

        return {
            'success': True,
            'file_id': file_id,
            'num_labels': len(yolo_labels),
            'matched_fields': matched_fields
        }
    else:
        return {'success': False, 'reason': 'no_matches'}


def convert_sroie_to_yolo():
    """
    Main function to convert entire SROIE dataset to YOLO format.

    Mapping:
    - SROIE 'train' → YOLO 'train'
    - SROIE 'test' → YOLO 'val'
    """
    # Split mapping
    split_map = {
        'train': 'train',
        'test': 'val'
    }

    overall_stats = {}

    for sroie_split, yolo_split in split_map.items():
        print(f"\n{'='*60}")
        print(f"Processing: {sroie_split.upper()} → YOLO {yolo_split.upper()}")
        print(f"{'='*60}")

        # Define paths
        raw_split_dir = RAW_DIR / sroie_split
        proc_img_dir = raw_split_dir / "img"

        # Get all entity files
        entity_files = list((raw_split_dir / "entities").glob("*.txt"))
        print(f"Found {len(entity_files)} receipts")

        # Process each receipt
        results = []
        for gt_file in tqdm(entity_files, desc=f"Converting {sroie_split}"):
            file_id = gt_file.stem
            result = process_single_receipt(
                file_id, sroie_split, raw_split_dir, proc_img_dir, yolo_split
            )
            results.append(result)

        # Calculate statistics
        successful = [r for r in results if r['success']]
        failed = [r for r in results if not r['success']]

        # Count failure reasons
        failure_reasons = {}
        for r in failed:
            reason = r.get('reason', 'unknown')
            failure_reasons[reason] = failure_reasons.get(reason, 0) + 1

        # Count field matches
        field_counts = {field: 0 for field in CLASS_MAP.keys()}
        for r in successful:
            for field in r.get('matched_fields', []):
                field_counts[field] += 1

        overall_stats[yolo_split] = {
            'total': len(results),
            'successful': len(successful),
            'failed': len(failed),
            'field_counts': field_counts,
            'failure_reasons': failure_reasons
        }

        # Print statistics
        print(f"\nResults for {sroie_split}:")
        print(f"  Total receipts: {len(results)}")
        print(f"  Successfully converted: {len(successful)}")
        print(f"  Failed: {len(failed)}")

        if successful:
            print(f"\n  Field Detection Rates:")
            for field, count in field_counts.items():
                rate = (count / len(successful)) * 100
                print(f"    {field}: {count}/{len(successful)} ({rate:.1f}%)")

        if failure_reasons:
            print(f"\n  Failure Reasons:")
            for reason, count in failure_reasons.items():
                print(f"    {reason}: {count}")

    return overall_stats


# Execute conversion
print("Starting SROIE to YOLO conversion...\n")
conversion_stats = convert_sroie_to_yolo()

print(f"\n\n{'='*60}")
print("✅ CONVERSION COMPLETE!")
print(f"{'='*60}")

Starting SROIE to YOLO conversion...


Processing: TRAIN → YOLO TRAIN
Found 626 receipts




Results for train:
  Total receipts: 626
  Successfully converted: 619
  Failed: 7

  Field Detection Rates:
    company: 435/619 (70.3%)
    date: 142/619 (22.9%)
    total: 592/619 (95.6%)

  Failure Reasons:
    no_matches: 7

Processing: TEST → YOLO VAL
Found 347 receipts




Results for test:
  Total receipts: 347
  Successfully converted: 344
  Failed: 3

  Field Detection Rates:
    company: 235/344 (68.3%)
    date: 68/344 (19.8%)
    total: 338/344 (98.3%)

  Failure Reasons:
    no_matches: 3


✅ CONVERSION COMPLETE!
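
The entity loader above strips trailing commas with two literal `replace` calls, which misses variants such as a comma followed by spaces or `\r\n` before the closing brace. A regex-based sketch that covers those cases is shown below. `strip_trailing_commas` is a hypothetical helper and the sample content is made up; like the original approach, it would also alter a matching sequence inside a string value.

```python
import json
import re


def strip_trailing_commas(raw):
    """Remove a comma that directly precedes a closing brace or bracket."""
    return re.sub(r",\s*([}\]])", r"\1", raw)


# Hypothetical malformed entity file content (made up for illustration)
raw = '{\n  "company": "ABC SDN BHD",\n  "total": "9.00",  \r\n}'
data = json.loads(strip_trailing_commas(raw))
print(data)
```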


## 5. Generate YOLO Configuration File

In [19]:
# Create dataset.yaml
yaml_content = f"""# SROIE Receipt Dataset - YOLO Format
# Generated for receipt field detection (vendor, date, total)

path: {str(YOLO_DIR)}  # Dataset root directory
train: images/train  # Train images (relative to 'path')
val: images/val      # Validation images (relative to 'path')

# Number of classes
nc: 3

# Class names
names:
  0: vendor   # Company/Store name
  1: date     # Transaction date
  2: total    # Total amount

# Dataset statistics
# Train: {conversion_stats['train']['successful']} images
# Val: {conversion_stats['val']['successful']} images
"""

yaml_path = YOLO_DIR / "dataset.yaml"
with open(yaml_path, 'w') as f:
    f.write(yaml_content)

print(f"✅ YAML configuration saved: {yaml_path}")
print("\nContents:")
print(yaml_content)

✅ YAML configuration saved: ../data/processed/yolo_dataset/dataset.yaml

Contents:
# SROIE Receipt Dataset - YOLO Format
# Generated for receipt field detection (vendor, date, total)

path: ../data/processed/yolo_dataset  # Dataset root directory
train: images/train  # Train images (relative to 'path')
val: images/val      # Validation images (relative to 'path')

# Number of classes
nc: 3

# Class names
names:
  0: vendor   # Company/Store name
  1: date     # Transaction date
  2: total    # Total amount

# Dataset statistics
# Train: 619 images
# Val: 344 images



## 6. Verify Output Structure

In [20]:
def verify_dataset_structure():
    """Verify the generated YOLO dataset structure and contents."""

    print("\n" + "="*60)
    print("DATASET VERIFICATION")
    print("="*60)

    for split in ['train', 'val']:
        img_dir = YOLO_DIR / "images" / split
        label_dir = YOLO_DIR / "labels" / split

        num_images = len(list(img_dir.glob("*.jpg")))
        num_labels = len(list(label_dir.glob("*.txt")))

        print(f"\n{split.upper()} Split:")
        print(f"  Images: {num_images}")
        print(f"  Labels: {num_labels}")
        print(f"  Match: {'✅' if num_images == num_labels else '❌'}")

        # Sample a label file
        if num_labels > 0:
            sample_label = list(label_dir.glob("*.txt"))[0]
            with open(sample_label, 'r') as f:
                lines = f.readlines()
            print(f"\n  Sample label ({sample_label.name}):")
            # Invert CLASS_MAP once so class ids can be looked up by name
            id_to_name = {class_id: name for name, class_id in CLASS_MAP.items()}
            for line in lines:
                parts = line.strip().split()
                class_id = int(parts[0])
                class_name = id_to_name[class_id]
                print(f"    Class {class_id} ({class_name}): {' '.join(parts[1:])}")

    # Check YAML
    yaml_exists = (YOLO_DIR / "dataset.yaml").exists()
    print(f"\nYAML config: {'✅' if yaml_exists else '❌'}")

    print("\n" + "="*60)
    print("Final directory structure:")
    print("="*60)
    print(f"""
{YOLO_DIR}/
├── dataset.yaml
├── images/
│   ├── train/     ({len(list((YOLO_DIR/'images'/'train').glob('*.jpg')))} images)
│   └── val/       ({len(list((YOLO_DIR/'images'/'val').glob('*.jpg')))} images)
└── labels/
    ├── train/     ({len(list((YOLO_DIR/'labels'/'train').glob('*.txt')))} labels)
    └── val/       ({len(list((YOLO_DIR/'labels'/'val').glob('*.txt')))} labels)
    """)

verify_dataset_structure()


DATASET VERIFICATION

TRAIN Split:
  Images: 619
  Labels: 619
  Match: ✅

  Sample label (X51006555072.txt):
    Class 0 (company): 0.513514 0.195010 0.668919 0.025458
    Class 1 (date): 0.190034 0.448065 0.217905 0.022403
    Class 2 (total): 0.779561 0.601324 0.133446 0.025458

VAL Split:
  Images: 344
  Labels: 344
  Match: ✅

  Sample label (X51005675104.txt):
    Class 0 (company): 0.497315 0.124332 0.809882 0.026203
    Class 2 (total): 0.765306 0.506150 0.114930 0.020856

YAML config: ✅

Final directory structure:

../data/processed/yolo_dataset/
├── dataset.yaml
├── images/
│   ├── train/     (619 images)
│   └── val/       (344 images)
└── labels/
    ├── train/     (619 labels)
    └── val/       (344 labels)
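
The verification above compares file counts only. A sketch of a per-line check (five fields, known class id, coordinates within 0-1) is given below. `validate_label_file` and `NUM_CLASSES` are hypothetical names, and the demo writes a temporary file rather than reading the real dataset.

```python
import tempfile
from pathlib import Path

NUM_CLASSES = 3  # assumed to match nc in dataset.yaml


def validate_label_file(path, num_classes=NUM_CLASSES):
    """Return a list of problems found in one YOLO label file (empty if none)."""
    problems = []
    for n, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split()
        if len(parts) != 5:
            problems.append(f"line {n}: expected 5 fields, got {len(parts)}")
            continue
        try:
            class_id = int(parts[0])
            values = [float(p) for p in parts[1:]]
        except ValueError:
            problems.append(f"line {n}: non-numeric field")
            continue
        if not 0 <= class_id < num_classes:
            problems.append(f"line {n}: class id {class_id} out of range")
        if any(not 0.0 <= v <= 1.0 for v in values):
            problems.append(f"line {n}: coordinate outside 0-1")
    return problems


# Demo on a temporary file with one valid and one invalid line
with tempfile.TemporaryDirectory() as tmp:
    label = Path(tmp) / "demo.txt"
    label.write_text("0 0.5 0.2 0.6 0.03\n5 0.5 0.5 1.2 0.1\n")
    issues = validate_label_file(label)
    print(issues)
```

Running a check like this over `labels/train` and `labels/val` before training could surface malformed lines that a simple count would miss.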
    


## 7. Summary & Next Steps

### YOLO Dataset Created 

**What We Did:**
1. ✅ Converted SROIE 4-point polygons (8 coordinates) → YOLO normalized bboxes
2. ✅ Matched text labels with bounding boxes using fuzzy matching
3. ✅ Mapped SROIE train/test to YOLO train/val (619 train and 344 val images)
4. ✅ Generated YAML configuration

**Classes:**
- Class 0: Vendor/Company name
- Class 1: Date
- Class 2: Total amount


## Next Steps

**YOLO Training:**