# 03 – OCR using Microsoft TrOCR (Transformer OCR)

**Goal:**
- Run TrOCR on all preprocessed receipt images
- Save OCR output in JSON format
- Match the EasyOCR results format for comparison
- Prepare data for field extraction and accuracy evaluation

In [None]:
!pip install transformers pillow sentencepiece --quiet

## 1. Load TrOCR Model

In [None]:
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import torch
import json
from pathlib import Path
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Using device:", device)

# Load TrOCR base model
processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-printed')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-printed').to(device)

## 2. OCR Helper Functions

In [None]:
def trocr_ocr(image_path: str):
    """
    Runs TrOCR on a single image and returns:
      - full_text
      - lines (split)
      - raw (minimal JSON structure to match EasyOCR format)
    """
    image = Image.open(image_path).convert('RGB')
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
    output_ids = model.generate(pixel_values)

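    text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    # TrOCR checkpoints are trained on single text-line crops, so the decoded
    # string usually contains no newline and `lines` often holds a single entry.
    lines = text.split('\n')

    # TrOCR returns neither bounding boxes nor confidences, so `bbox` stays empty
    # and `conf` is a 1.0 placeholder (not a real score) that only keeps the
    # same keys as the EasyOCR output.
    raw = [{"bbox": [], "text": line, "conf": 1.0} for line in lines]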

    return {
        "full_text": text,
        "lines": lines,
        "raw": raw
    }
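
Because TrOCR checkpoints such as `microsoft/trocr-base-printed` are trained on single text-line crops, a full receipt passed to `trocr_ocr` will typically come back as one short string. One common workaround, which is not part of this notebook, is to cut the receipt into line crops first and recognize each crop separately. The sketch below covers only the segmentation step, using a horizontal projection profile. `find_line_bands` and its thresholds are hypothetical, and the per-row ink counts are assumed to come from a binarized image, for example `(np.asarray(image.convert('L')) < 128).sum(axis=1)`.

```python
from typing import List, Sequence, Tuple


def find_line_bands(row_ink: Sequence[int], min_ink: int = 1,
                    min_height: int = 2) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges that contain text.

    row_ink[i] is the number of dark pixels in image row i.
    A band is a run of consecutive rows with at least `min_ink` dark pixels;
    runs shorter than `min_height` rows are treated as noise and dropped.
    `bottom` is exclusive, so a band can be cropped with
    image.crop((0, top, image.width, bottom)).
    """
    bands = []
    start = None
    for y, ink in enumerate(row_ink):
        if ink >= min_ink and start is None:
            start = y                      # entering a text band
        elif ink < min_ink and start is not None:
            if y - start >= min_height:    # leaving a band: keep it if tall enough
                bands.append((start, y))
            start = None
    # Close a band that runs through the last row
    if start is not None and len(row_ink) - start >= min_height:
        bands.append((start, len(row_ink)))
    return bands


# Toy profile: two text lines separated by blank rows, plus one row of speckle
profile = [0, 5, 7, 6, 0, 0, 1, 0, 4, 4, 4]
print(find_line_bands(profile))  # [(1, 4), (8, 11)]
```

Each returned band could then be cropped from the receipt and passed through the processor and model exactly as `trocr_ocr` does for a whole image, with the decoded strings collected into `lines`.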

## 3. Define Input/Output Paths

**Make sure these paths match your project structure:**
- Preprocessed images live in `data/processed/SROIE2019/train` and `data/processed/SROIE2019/test`
- OCR output is saved to `data/processed/trocr_ocr/train` and `data/processed/trocr_ocr/test`

In [None]:
BASE_DIR = Path("../data/processed/SROIE2019")  # adjust if needed
OUT_DIR = Path("../data/processed/trocr_ocr")
OUT_DIR.mkdir(parents=True, exist_ok=True)

splits = ["train", "test"]
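
The processing loop in the next section globs each split folder with `*`, which also picks up any non-image files that happen to sit there (annotation text files, `.DS_Store`, and so on). A small filter is sketched below. `list_images` and the `IMAGE_SUFFIXES` set are hypothetical and should be adjusted to whatever formats the preprocessing step actually writes.

```python
from pathlib import Path
import tempfile

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}  # assumed formats


def list_images(directory):
    """Return image files in `directory`, sorted by path, ignoring other files."""
    directory = Path(directory)
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


# Demo on a throwaway folder so the sketch runs without the dataset
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.JPG", "a.png", "notes.txt"]:
        (Path(tmp) / name).touch()
    print([p.name for p in list_images(tmp)])  # ['a.png', 'b.JPG']
```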

## 4. Run TrOCR on All Images

In [None]:
for split in splits:
    print(f"Running TrOCR on {split} set...")
    img_dir = BASE_DIR / split
    out_dir = OUT_DIR / split
    out_dir.mkdir(parents=True, exist_ok=True)

    for img_path in sorted(img_dir.glob("*")):
        out_file = out_dir / f"{img_path.stem}.json"

        # skip already processed images
        if out_file.exists():
            continue

        try:
            result = trocr_ocr(str(img_path))
        except Exception as e:
            print("Error processing", img_path, e)
            continue

        with open(out_file, "w", encoding="utf-8") as f:
            json.dump(result, f, indent=2, ensure_ascii=False)

print("✔ TrOCR extraction complete!")
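
Because failed images are only printed and skipped, it can help to confirm afterwards that every JSON file is readable and carries the three keys written by `trocr_ocr`. The check below is a sketch: `find_bad_outputs` is a hypothetical helper, and the demo writes its own temporary files rather than reading `OUT_DIR`.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"full_text", "lines", "raw"}  # keys written by trocr_ocr


def find_bad_outputs(out_dir):
    """Return {file name: problem} for JSON files that are unreadable,
    lack a required key, or contain only whitespace in full_text."""
    problems = {}
    for path in sorted(Path(out_dir).glob("*.json")):
        try:
            with open(path, "r", encoding="utf-8") as f:
                data = json.load(f)
        except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
            problems[path.name] = f"unreadable: {exc.__class__.__name__}"
            continue
        if not isinstance(data, dict):
            problems[path.name] = "not a JSON object"
            continue
        missing = REQUIRED_KEYS - data.keys()
        if missing:
            problems[path.name] = "missing keys: " + ", ".join(sorted(missing))
        elif not str(data["full_text"]).strip():
            problems[path.name] = "empty full_text"
    return problems


# Demo with three hand-written files: one valid, one empty, one incomplete
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    good = {"full_text": "TOTAL 9.00", "lines": ["TOTAL 9.00"], "raw": []}
    (tmp / "ok.json").write_text(json.dumps(good), encoding="utf-8")
    blank = {"full_text": " ", "lines": [""], "raw": []}
    (tmp / "empty.json").write_text(json.dumps(blank), encoding="utf-8")
    (tmp / "partial.json").write_text(json.dumps({"full_text": "x"}), encoding="utf-8")
    print(find_bad_outputs(tmp))
```

In the notebook itself the call would be `find_bad_outputs(OUT_DIR / split)` for each split.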

## 5. Preview Example Output

In [None]:
sample_json = next((OUT_DIR / 'train').glob('*.json'))
print("Example file:", sample_json)

with open(sample_json, "r", encoding="utf-8") as f:  # same encoding the files were written with
    data = json.load(f)

data
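
For the later comparison, the EasyOCR results need the same three keys. EasyOCR's `Reader.readtext` returns a list of `(bbox, text, confidence)` tuples by default. How the earlier EasyOCR notebook stored them is not shown here, so the converter below is only a sketch, and `easyocr_to_record` is a hypothetical name.

```python
def easyocr_to_record(detections):
    """Convert EasyOCR-style (bbox, text, conf) tuples into the
    {"full_text", "lines", "raw"} layout used by trocr_ocr."""
    raw = []
    for bbox, text, conf in detections:
        raw.append({
            # Cast coordinates to plain ints (truncating any fraction) so that
            # json.dump accepts them; EasyOCR may return NumPy number types.
            "bbox": [[int(x), int(y)] for x, y in bbox],
            "text": text,
            "conf": float(conf),
        })
    lines = [entry["text"] for entry in raw]
    return {"full_text": "\n".join(lines), "lines": lines, "raw": raw}


# Hand-written detections standing in for reader.readtext(...) output
sample = [
    ([[10, 5], [120, 5], [120, 30], [10, 30]], "ACME STORE", 0.98),
    ([[10, 40], [90, 40], [90, 62], [10, 62]], "TOTAL 12.50", 0.91),
]
record = easyocr_to_record(sample)
print(record["lines"])  # ['ACME STORE', 'TOTAL 12.50']
```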

# ✔ Next Steps

You now have TrOCR output for all receipts!

### Proceed to:
1. **Run field extraction** on TrOCR JSON files (reuse your existing code)
2. Generate a CSV similar to `train_extracted.csv`
3. Evaluate accuracy:
   - Vendor
   - Date
   - Total
4. Compare EasyOCR vs TrOCR performance
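
The accuracy step in item 3 can be as simple as a normalized exact match per field. The sketch below assumes predictions and ground truth are available as equally ordered lists of dicts with `vendor`, `date`, and `total` keys. Those key names, the `normalize` rules, and the `field_accuracy` helper are illustrative and not taken from the existing extraction code.

```python
FIELDS = ("vendor", "date", "total")  # assumed field names


def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting
    differences do not count as errors; None becomes an empty string."""
    text = "" if value is None else str(value)
    return " ".join(text.lower().split())


def field_accuracy(predictions, ground_truth):
    """Return {field: fraction of receipts whose normalized prediction
    equals the normalized ground truth}."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        return {field: 0.0 for field in FIELDS}
    scores = {}
    for field in FIELDS:
        hits = sum(
            normalize(pred.get(field)) == normalize(truth.get(field))
            for pred, truth in zip(predictions, ground_truth)
        )
        scores[field] = hits / len(ground_truth)
    return scores


# Made-up records, for illustration only
truth = [
    {"vendor": "ACME STORE", "date": "01/02/2019", "total": "12.50"},
    {"vendor": "Book Shop", "date": "15/03/2019", "total": "7.00"},
]
trocr_pred = [
    {"vendor": "acme  store", "date": "01/02/2019", "total": "12.50"},
    {"vendor": "Book Shop", "date": "15/08/2019", "total": None},
]
print(field_accuracy(trocr_pred, truth))
# {'vendor': 1.0, 'date': 0.5, 'total': 0.5}
```

Running the same function on the EasyOCR predictions gives directly comparable numbers for item 4.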
