
## Automated Expense Extraction - Receipt Parsing Using YOLO and OCR
### Baseline OCR Extraction (Tesseract)

### Objective
Establish a secondary **baseline** using **Tesseract OCR**, a widely used open-source OCR engine. Comparing EasyOCR (a deep-learning OCR library) with Tesseract (a long-established engine) lets us check that our baseline metrics hold across two independent OCR systems rather than reflecting the quirks of one.

### Methodology
1.  **Input:** "Adaptive Preprocessed" images from Module 01 (optimized for Tesseract's contrast requirements).
2.  **Engine:** Tesseract 4.x (LSTM-based) via `pytesseract`.
3.  **Process:**
    * Run Tesseract on the **full image** (blind extraction).
    * Parse the dictionary output of `pytesseract.image_to_data` to get words, bounding boxes, and confidence scores.
4.  **Output:** Save raw extraction data to JSON files (`data/processed/tesseract_ocr`).
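
As a rough sketch of the output format, the saved JSON can be described with the types below, which mirror the dictionary returned by `tesseract_ocr` later in this notebook. The `OcrWord` and `OcrResult` names and the sample values are illustrative assumptions, not part of the original code.

```python
import json
from typing import List, TypedDict


class OcrWord(TypedDict):
    bbox: List[List[int]]  # four [x, y] corners, clockwise from top-left
    text: str
    conf: float            # confidence rescaled to the 0-1 range


class OcrResult(TypedDict):
    full_text: str         # recognized words joined with newlines
    lines: List[str]       # recognized words in reading order
    raw: List[OcrWord]     # one entry per recognized word


# Hand-written sample for illustration, not real Tesseract output.
sample: OcrResult = {
    "full_text": "TOTAL\n12.50",
    "lines": ["TOTAL", "12.50"],
    "raw": [
        {"bbox": [[10, 20], [70, 20], [70, 40], [10, 40]], "text": "TOTAL", "conf": 0.96},
        {"bbox": [[80, 20], [130, 20], [130, 40], [80, 40]], "text": "12.50", "conf": 0.91},
    ],
}

# The structure survives a JSON round trip unchanged.
restored = json.loads(json.dumps(sample, ensure_ascii=False))
```

Keeping the same three keys for every OCR engine means downstream comparison code can read either set of files without branching.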

In [1]:
# !apt-get update
# !apt-get install -y tesseract-ocr
!pip install pytesseract pillow

Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.13


#### Setup & Imports

In [2]:
import pytesseract
from PIL import Image
from pathlib import Path
import json
import re
import os
from tqdm import tqdm

print("Tesseract version:", pytesseract.get_tesseract_version())

Tesseract version: 4.1.1


In [3]:
# Check if running in Google Colab
if 'COLAB_GPU' in os.environ:
    # Mount Google Drive (for Colab)
    from google.colab import drive
    drive.mount('/content/drive')

    # Set DATA_PATH for Google Drive
    DATA_PATH = Path('/content/drive/MyDrive/data')
else:
    # Set DATA_PATH for local environment
    DATA_PATH = Path('../data')

Mounted at /content/drive


In [8]:
BASE_DIR = DATA_PATH / "processed" / "SROIE2019"
OUT_DIR = DATA_PATH / "processed" / "tesseract_ocr"
OUT_DIR.mkdir(parents=True, exist_ok=True)

#### Tesseract OCR Function

In [9]:
def tesseract_ocr(image_path: str):
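    """
    Run Tesseract on one image and return results in an EasyOCR-compatible
    format. `image_to_data` reports one entry per word, so `lines` and `raw`
    hold word-level text, bounding boxes, and confidence scores.
    """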
    img = Image.open(image_path).convert("L")

    # Get detailed data including boxes and confidence
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

    lines = []
    raw = []

    for i, text in enumerate(data['text']):
        if not text.strip():
            continue

        # Extract bounding box
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        bbox = [[x, y], [x+w, y], [x+w, y+h], [x, y+h]]

        # Convert confidence from 0-100 to 0-1; Tesseract reports -1 for
        # entries without a word-level score, which we map to 0.0
        conf_raw = float(data['conf'][i])
        conf = conf_raw / 100.0 if conf_raw >= 0 else 0.0

        lines.append(text)
        raw.append({
            "bbox": bbox,
            "text": text,
            "conf": conf
        })

    return {
        "full_text": "\n".join(lines),
        "lines": lines,
        "raw": raw
    }
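
Because `image_to_data` returns one entry per word, the `lines` list above holds individual words. If true text lines are needed, one possible approach is to group words by the `block_num`, `par_num`, and `line_num` columns that the same dictionary provides. The helper name `group_words_into_lines` and the sample dictionary are hypothetical, and the sketch assumes single-page images so `page_num` is ignored.

```python
from collections import defaultdict


def group_words_into_lines(data: dict) -> list:
    """Join word-level entries that share a (block, paragraph, line) index."""
    grouped = defaultdict(list)
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # skip structural rows and empty tokens
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        grouped[key].append(text)
    # Sort by key so lines come back in block/paragraph/line order.
    return [" ".join(words) for _, words in sorted(grouped.items())]


# Hand-written sample in the shape of image_to_data's dict output.
sample = {
    "text": ["", "TOTAL", "12.50", "CASH", " "],
    "block_num": [1, 1, 1, 1, 1],
    "par_num": [0, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 2, 2],
}
lines = group_words_into_lines(sample)
print(lines)
```

Inside `tesseract_ocr`, this could be called on the `data` dictionary to fill an additional key without changing the existing word-level fields.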

In [10]:
splits = ["train", "test"]

for split in splits:
    img_dir = BASE_DIR / split
    out_dir = OUT_DIR / split
    out_dir.mkdir(parents=True, exist_ok=True)

    print(f"Running Tesseract on {split}...")

    # Using tqdm to track progress
    for img_path in tqdm(sorted(img_dir.glob("*")), desc=f"Processing images in {split}", unit="file"):
        out_file = out_dir / f"{img_path.stem}.json"

        if out_file.exists():
            continue  # skip already processed files

        try:
            result = tesseract_ocr(str(img_path))
        except Exception as e:
            print("Error:", img_path, e)
            continue

        with open(out_file, "w", encoding="utf-8") as f:
            json.dump(result, f, indent=2, ensure_ascii=False)

print("✔ DONE: Tesseract OCR complete!")

Running Tesseract on train...


Processing images in train: 100%|██████████| 626/626 [13:40<00:00,  1.31s/file]


Running Tesseract on test...


Processing images in test: 100%|██████████| 347/347 [09:34<00:00,  1.65s/file]

✔ DONE: Tesseract OCR complete!
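
A quick sanity check on the saved files might look like the sketch below, which counts files and words and averages the per-word confidence for one split. The function name `summarize_split` is hypothetical, and the demo runs on a temporary directory with one hand-written file rather than on the real outputs.

```python
import json
import tempfile
from pathlib import Path


def summarize_split(json_dir: Path) -> dict:
    """Count files and words and average the per-word confidence."""
    n_files, confs = 0, []
    for path in sorted(json_dir.glob("*.json")):
        with open(path, encoding="utf-8") as f:
            result = json.load(f)
        n_files += 1
        confs.extend(word["conf"] for word in result["raw"])
    mean_conf = sum(confs) / len(confs) if confs else 0.0
    return {"files": n_files, "words": len(confs), "mean_conf": mean_conf}


# Demo on a throwaway directory with one hand-written file.
with tempfile.TemporaryDirectory() as tmp:
    demo = {
        "full_text": "A\nB",
        "lines": ["A", "B"],
        "raw": [
            {"bbox": [], "text": "A", "conf": 0.5},
            {"bbox": [], "text": "B", "conf": 1.0},
        ],
    }
    (Path(tmp) / "demo.json").write_text(json.dumps(demo), encoding="utf-8")
    summary = summarize_split(Path(tmp))
```

On the real data this would be called as `summarize_split(OUT_DIR / "train")`. Words that Tesseract scored as -1 are stored as 0.0, so they pull the mean down.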



