# Sprint 2 – End-to-End Kaggle Notebook (SR → OCR-CTC)

This notebook is **standalone** (no dependency on other notebooks). It runs the full pipeline:

1. **Super-Resolution (HR+)**: train a small SR model on synthetic pairs and export HR+ frames from LR frames.
2. **OCR (Multi-frame CRNN + CTC)**: train the recognizer.
3. **Brazil/Mercosur CTC decoding**: beam search + template constraints + confusion-map post-processing.

## Why the Brazil/Mercosur constraint layer?
Our discussion notes identify the main pain point as CTC decoding errors on ambiguous glyphs (e.g., **O/0**, **I/1**, **B/8**, **Z/2**) and merged characters. Rather than relying on the network alone, we add a lightweight, **rule-based** decoding layer that enforces valid plate templates (`L` = letter, `D` = digit):

- **Brazil old**: `LLLDDDD` (optionally `LLL-DDDD`)
- **Mercosur (Brazil)**: `LLLDLDD` (optionally `LLL-DLDD`)
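
A minimal sketch of how such a template layer could work, assuming a confusion map limited to the four glyph pairs listed above. The names `TEMPLATES`, `TO_DIGIT`, `TO_LETTER`, and `fit_template` are hypothetical and are not taken from the `sprint2` package:

```python
from typing import Optional

# Plate templates from the list above: L = letter slot, D = digit slot.
TEMPLATES = {"brazil_old": "LLLDDDD", "mercosur": "LLLDLDD"}

# Assumed confusion map: only the four ambiguous pairs named in this notebook.
TO_DIGIT = {"O": "0", "I": "1", "B": "8", "Z": "2"}
TO_LETTER = {digit: letter for letter, digit in TO_DIGIT.items()}

def fit_template(text: str, template: str) -> Optional[str]:
    """Coerce a decoded string onto a template, or return None if it cannot fit."""
    text = text.upper().replace("-", "")
    if len(text) != len(template):
        return None
    fitted = []
    for ch, slot in zip(text, template):
        if slot == "L":
            ch = TO_LETTER.get(ch, ch)  # e.g. '8' in a letter slot becomes 'B'
            if not ("A" <= ch <= "Z"):
                return None
        else:
            ch = TO_DIGIT.get(ch, ch)  # e.g. 'O' in a digit slot becomes '0'
            if ch not in "0123456789":
                return None
        fitted.append(ch)
    return "".join(fitted)

print(fit_template("A8C-I234", TEMPLATES["brazil_old"]))  # ABC1234
print(fit_template("ABC1023", TEMPLATES["mercosur"]))     # ABC1O23
```

In a beam-search setting, a function like this would be applied to each candidate, keeping the highest-scoring candidate that fits one of the templates.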

## Outputs (writable location)
All checkpoints and exports are written to **`/kaggle/working/sprint2_outputs/`** (persist via Kaggle *Save & Commit*).

In [None]:
# --- Setup imports and paths ---
import sys
from pathlib import Path

def add_to_syspath_front(p: Path) -> None:
    p = p.resolve()
    ps = str(p)
    if ps in sys.path:
        sys.path.remove(ps)
    sys.path.insert(0, ps)

# Kaggle: repo dataset name is expected to be 'icpr2026-repo'
repo_root = Path('/kaggle/input/icpr2026-repo')
if repo_root.exists():
    add_to_syspath_front(repo_root)
else:
    # fallback: find any dataset that contains a sprint2 folder
    base = Path('/kaggle/input')
    if base.exists():
        for d in base.iterdir():
            if (d / 'sprint2').exists():
                add_to_syspath_front(d)
                repo_root = d
                break

# Local/dev: if running from a folder that contains ./sprint2
cwd = Path.cwd()
if (cwd / 'sprint2').exists():
    add_to_syspath_front(cwd)

print('CWD:', cwd)
print('repo_root:', repo_root)
print('repo_root exists:', repo_root.exists())
print('repo_root in sys.path:', str(repo_root.resolve()) in [str(Path(x).resolve()) for x in sys.path if isinstance(x, str)])
print('sys.path[0:5]:', sys.path[:5])

# Verify the import resolves from the expected location
import sprint2
print('sprint2 imported from:', sprint2.__file__)

In [None]:
# --- (Optional) Install SR deps if needed (Kaggle + Internet ON only) ---
# This is ONLY required when you set SRConfig.use_pretrained=True and want Real-ESRGAN weights via BasicSR.
# If Kaggle internet is OFF, do NOT run this (it will fail). Instead, upload weights + dependencies or disable pretrained mode.

import importlib
import sys
import subprocess
import socket

def _internet_enabled(timeout_sec: float = 2.0) -> bool:
    # Kaggle doesn't expose a perfect flag; use a quick TCP probe.
    # If this returns False, pip install from PyPI will almost certainly fail.
    try:
        sock = socket.create_connection(('pypi.org', 443), timeout=timeout_sec)
        sock.close()
        return True
    except OSError:
        return False

def _module_exists(name: str) -> bool:
    try:
        importlib.import_module(name)
        return True
    except Exception:
        return False

need_basicsr = not _module_exists('basicsr')
need_realesrgan = not _module_exists('realesrgan')

print('basicsr installed:', not need_basicsr)
print('realesrgan installed:', not need_realesrgan)

internet_ok = _internet_enabled()
print('internet probe (pypi.org:443):', internet_ok)

if need_basicsr or need_realesrgan:
    if not internet_ok:
        print('Internet appears OFF/unavailable. Skipping pip install.')
        print('Options:')
        print('  1) Enable Internet in Kaggle notebook settings and re-run this cell')
        print('  2) Upload a Kaggle Dataset that contains the weights file and set sr_cfg.pretrained_path')
        print("  3) Set sr_cfg.use_pretrained = False to use the built-in SR model")
    else:
        pkgs = []
        if need_basicsr:
            pkgs.append('basicsr')
        if need_realesrgan:
            pkgs.append('realesrgan')
        cmd = [sys.executable, '-m', 'pip', 'install', '-q'] + pkgs
        print('Installing:', pkgs)
        try:
            subprocess.check_call(cmd)
        except subprocess.CalledProcessError as e:
            print('pip install failed:', repr(e))
            print('Tip: in Kaggle you may need to enable Internet, or pin versions compatible with the default torch.')
            raise
        print('Install done. Verifying imports...')
        print('basicsr import OK:', _module_exists('basicsr'))
        print('realesrgan import OK:', _module_exists('realesrgan'))
        print('Now re-run the SR cell.')
else:
    print('All optional SR deps already installed; nothing to do.')

In [None]:
# Import our local sprint2 modules
from sprint2.paths import get_paths
from sprint2.config import OCRConfig, SRConfig
from sprint2.sr_train_export import train_sr, export_hr_plus
from sprint2.ocr_train import train_ocr

paths = get_paths('sprint2_outputs')
paths

## 1) Configure dataset paths (Kaggle)
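`DATA_ROOT` must point to the competition dataset's train folder, which contains `Scenario-*/*/track_*`. The cell below tries to auto-detect it; set the `ICPR_DATA_ROOT` environment variable to override the detection.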
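
As a quick sanity check of that layout, the track folders can be enumerated with a single glob. This is a sketch only; `list_track_dirs` is a hypothetical helper and not part of `sprint2`:

```python
from pathlib import Path
from typing import List

def list_track_dirs(data_root: str) -> List[Path]:
    """Return all track folders matching Scenario-*/*/track_* under data_root."""
    root = Path(data_root)
    return sorted(p for p in root.glob("Scenario-*/*/track_*") if p.is_dir())
```

Calling `len(list_track_dirs(DATA_ROOT))` after the detection cell gives the number of tracks that will be visible to training.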

In [None]:
# --- Data root: supports uploaded zip OR Kaggle dataset folder ---
import os
import zipfile
from pathlib import Path

# If your dataset isn't showing up under /kaggle/input, it is NOT attached to this notebook session.
# Kaggle mounts only the datasets you add via the right panel: "Add Data".
# After adding a dataset, you often need "Restart Session" for it to appear under /kaggle/input.

PREFERRED_DATA_ROOT = Path('/kaggle/input/icpr-2026-train/train')
PREFERRED_DATASET_ROOT = Path('/kaggle/input/icpr-2026-train')

# Manual override (recommended when debugging):
# - Set in a notebook cell before this runs: os.environ['ICPR_DATA_ROOT'] = '/kaggle/input/<dataset>/train'
# - Or set in Kaggle Notebook "Add-ons" -> "Secrets" / environment variables (if available)
ENV_DATA_ROOT = os.environ.get('ICPR_DATA_ROOT', '').strip()
DATASET_SLUG = os.environ.get('ICPR_DATASET_SLUG', '').strip()  # e.g. 'icpr-2026-train'

# Option A: add your dataset under /kaggle/input/ (Kaggle Dataset or Competition data).
# Option B: upload a single .zip file to the notebook and unzip it into /kaggle/working.
ZIP_PATH = os.environ.get('ICPR_ZIP_PATH', '').strip()  # e.g. '/kaggle/input/my-upload/train.zip' or '/kaggle/working/train.zip'
UNZIP_DIR = Path(os.environ.get('ICPR_UNZIP_DIR', '/kaggle/working/icpr_data'))

def _looks_like_train_dir(p: Path) -> bool:
    if not p.exists() or not p.is_dir():
        return False
    scenarios = list(p.glob('Scenario-*'))
    if not scenarios:
        return False
    # fast structural check: at least one track_* somewhere under a scenario
    for s in scenarios[:3]:
        try:
            if any(s.rglob('track_*')):
                return True
        except Exception:
            # rglob can sometimes fail on very large trees; still accept scenario presence
            return True
    return False

def _candidate_train_roots() -> list[Path]:
    cands: list[Path] = []
    if ENV_DATA_ROOT:
        cands.append(Path(ENV_DATA_ROOT))
    if DATASET_SLUG:
        cands.append(Path('/kaggle/input') / DATASET_SLUG / 'train')
        cands.append(Path('/kaggle/input') / DATASET_SLUG / 'train' / 'train')
        cands.append(Path('/kaggle/input') / DATASET_SLUG)

    # Common layouts for the expected dataset name
    cands.extend([
        PREFERRED_DATA_ROOT,            # /kaggle/input/icpr-2026-train/train
        PREFERRED_DATA_ROOT / 'train',  # nested train/train layout
    ])

    # Also: any /kaggle/input/*/{train,Train}
    base = Path('/kaggle/input')
    if base.exists():
        for d in base.iterdir():
            if d.is_dir():
                cands.append(d / 'train')
                cands.append(d / 'Train')
                cands.append(d / 'train' / 'train')

    # De-dup while preserving order
    seen = set()
    out: list[Path] = []
    for c in cands:
        s = str(c)
        if s not in seen:
            out.append(c)
            seen.add(s)
    return out

def _auto_find_train_root() -> Path:
    for c in _candidate_train_roots():
        if _looks_like_train_dir(c):
            return c
    raise FileNotFoundError(
        "Could not auto-detect train root under /kaggle/input. "
        "Attach the dataset (Add Data) and restart session, or set ICPR_DATA_ROOT."
    )

def _maybe_unzip(zip_path: str, unzip_dir: Path) -> None:
    if not zip_path:
        return
    z = Path(zip_path)
    if not z.exists():
        raise FileNotFoundError(f'ZIP_PATH not found: {z}')
    unzip_dir.mkdir(parents=True, exist_ok=True)
    marker = unzip_dir / '.unzipped.ok'
    if marker.exists():
        return
    print(f'Unzipping {z} -> {unzip_dir} ...')
    with zipfile.ZipFile(z, 'r') as zf:
        zf.extractall(unzip_dir)
    marker.write_text('ok', encoding='utf-8')
    print('Unzip done.')

# If user provided a zip, unzip once into /kaggle/working and use it
if ZIP_PATH:
    _maybe_unzip(ZIP_PATH, UNZIP_DIR)
    for cand in [UNZIP_DIR / 'train', UNZIP_DIR]:
        if _looks_like_train_dir(cand):
            DATA_ROOT = str(cand)
            break
    else:
        found = None
        for t in UNZIP_DIR.rglob('train'):
            if _looks_like_train_dir(t):
                found = t
                break
        if found is None:
            raise FileNotFoundError(f'Unzipped but could not find a valid train folder under {UNZIP_DIR}')
        DATA_ROOT = str(found)
else:
    try:
        DATA_ROOT = str(_auto_find_train_root())
    except FileNotFoundError:
        # Print diagnostics to help you correct the path quickly
        print('--- Diagnostics ---')
        print('ENV ICPR_DATA_ROOT =', ENV_DATA_ROOT or '(not set)')
        print('ENV ICPR_DATASET_SLUG =', DATASET_SLUG or '(not set)')
        print('PREFERRED_DATASET_ROOT exists:', PREFERRED_DATASET_ROOT.exists())
        base = Path('/kaggle/input')
        if base.exists():
            mounted = [p.name for p in base.iterdir() if p.is_dir()]
            print('Datasets mounted in /kaggle/input:', mounted)
            if 'icpr-2026-train' not in mounted:
                print('NOTE: icpr-2026-train is NOT mounted. Use Kaggle right panel -> Add Data -> select your train dataset, then Restart Session.')
        raise

print('DATA_ROOT =', DATA_ROOT)
assert Path(DATA_ROOT).exists(), f'Not found: {DATA_ROOT}'
print('Scenario-A exists:', (Path(DATA_ROOT) / 'Scenario-A').exists())

# Outputs are written here (safe writable space)
OUT_ROOT = paths.out_root
OUT_ROOT

## 2) Super-Resolution (SR) → Export HR+ (distribution-consistent)
We train an SR model (**RRDBNetLite / ESRGAN-lite**) with a light **degradation-consistency** loss, so that *degrading the SR output back to LR resolution* reproduces the **real LR distribution** more closely.
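
One plausible way to write such an objective is shown below. This is a sketch under assumptions, not the exact loss in `sprint2`: the reconstruction term $\mathcal{L}_{\text{rec}}$ and the choice of an L1 norm are assumed. $G$ is the SR network, and $D_\sigma$ denotes a Gaussian blur with standard deviation $\sigma$ followed by downsampling by the SR scale. The weights $\lambda_{\text{cycle}}$ and $\sigma$ presumably correspond to `lambda_cycle` and `cycle_blur_sigma` in `SRConfig`.

$$
\mathcal{L} = \mathcal{L}_{\text{rec}}\big(G(x_{\text{LR}}),\, x_{\text{HR}}\big) + \lambda_{\text{cycle}} \,\big\lVert D_\sigma\big(G(x_{\text{LR}})\big) - x_{\text{LR}} \big\rVert_1
$$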

### Resolution logic (keep it consistent)
- **LR → HR** scale is inferred from real paired `lr-*` and `hr-*` frame sizes (usually `x2`).
- We export **HR+** at `x<scale>` relative to the chosen source frames.
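
The scale inference can be sketched as follows, assuming frame sizes are given as `(width, height)` tuples. `infer_scale` is a hypothetical helper for illustration and is not the `sprint2` implementation:

```python
from typing import Tuple

def infer_scale(lr_size: Tuple[int, int], hr_size: Tuple[int, int]) -> int:
    """Estimate the integer LR -> HR scale from one paired frame's sizes."""
    lr_w, lr_h = lr_size
    hr_w, hr_h = hr_size
    if lr_w <= 0 or lr_h <= 0:
        raise ValueError("LR frame size must be positive")
    # Average the width and height ratios, then snap to the nearest integer.
    ratio = 0.5 * (hr_w / lr_w + hr_h / lr_h)
    return max(1, round(ratio))

print(infer_scale((48, 16), (96, 32)))  # 2
```

Averaging both axes makes the estimate tolerant of crops that are off by a pixel or two.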

### Which source should you export from?
- `SR_SOURCE='lr'` (**recommended for distribution matching**): export HR+ from `lr-*` so `downsample(HR+) ≈ original LR`.
- `SR_SOURCE='hr'` (optional): export HR+ from `hr-*` for extra detail beyond HR; use when you explicitly want HR→HR+.

Exports go to: `/kaggle/working/sprint2_outputs/hr_plus/x<scale>/...`
You can lower runtime by setting `LIMIT_TRACKS` (export only N track folders).

In [None]:
# SR config tuned to be notebook-friendly
sr_cfg = SRConfig(
    scale=2,
    # If you enable pretrained, the code will try to build a checkpoint from external weights (Real-ESRGAN).
    # If deps/weights/internet are not available, it will fall back to training internal rrdb_lite.
    use_pretrained=True,
    pretrained_name='realesrgan_x2plus',
    pretrained_path='',  # set to local .pth under /kaggle/input/... if you uploaded weights
    pretrained_url='https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.1/RealESRGAN_x2plus.pth',   # optional; if empty, uses a known default for realesrgan_x2plus
    allow_download=True,  # set True only if Kaggle internet is enabled
    model='rrdb_lite',     # fallback internal model
    rrdb_blocks=6,
    epochs=3,
    batch_size=16,
    patch_size_hr=128,
    lambda_cycle=0.25,  # degradation-consistency (paired lr/hr only)
    cycle_blur_sigma=1.2,
)
sr_out_dir = OUT_ROOT / 'sr'
sr_ckpt = sr_out_dir / 'sr_best.pt'

# Build SR checkpoint (pretrained if possible; otherwise train fallback)
if not sr_ckpt.exists():
    try:
        sr_ckpt = train_sr(DATA_ROOT, sr_cfg, out_dir=sr_out_dir, max_steps_per_epoch=500)
    except Exception as e:
        print('Pretrained SR unavailable (falling back to internal training):', repr(e))
        sr_cfg.use_pretrained = False
        sr_ckpt = train_sr(DATA_ROOT, sr_cfg, out_dir=sr_out_dir, max_steps_per_epoch=500)

print('SR checkpoint:', sr_ckpt)

# Export SR images
# For LR distribution consistency, use SR_SOURCE='lr' (recommended).
# Use 'hr' only if you explicitly want HR -> HR+.
SR_SOURCE = 'lr'  # 'lr' | 'hr' | 'auto'

# Set LIMIT_TRACKS to e.g. 200 for quick experiments
LIMIT_TRACKS = None
hr_plus_root = export_hr_plus(
    DATA_ROOT,
    sr_ckpt,
    hr_plus_root=OUT_ROOT / 'hr_plus',
    scale=sr_cfg.scale,
    source=SR_SOURCE,
    limit_tracks=LIMIT_TRACKS,
)
print('HR+ root:', hr_plus_root)

## 3) OCR Training (Multi-frame CRNN + CTC)
We train a CRNN on 5-frame tracks. When `hr_plus_root` is provided, the dataloader prefers **HR frames** (if available) and swaps them for their exported **HR+** counterparts where those exist; otherwise it falls back to LR frames.

### CTC design notes for Brazil/Mercosur
- Labels are normalized to uppercase alphanumeric characters and canonicalized by stripping `-`.
- CTCLoss uses `blank=0` and `zero_infinity=True`.
- Decoding uses **beam search** + **plate template constraints** and a confusion map to handle ambiguous characters.
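
The label conventions above can be sketched in plain Python. The alphabet ordering and all names here (`ALPHABET`, `normalize_label`, `encode_label`, `greedy_ctc_decode`) are assumptions for illustration. The decoder shown is a greedy collapse, a simpler stand-in for the beam search used in the pipeline:

```python
import re
from typing import List

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumed character order
BLANK = 0  # CTC blank index, matching blank=0 above
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(ALPHABET)}  # real classes start at 1
ID_TO_CHAR = {i: c for c, i in CHAR_TO_ID.items()}

def normalize_label(raw: str) -> str:
    """Uppercase the label and drop everything except 0-9 and A-Z (including '-')."""
    return re.sub(r"[^0-9A-Z]", "", raw.upper())

def encode_label(raw: str) -> List[int]:
    """Map a raw label to CTC target indices."""
    return [CHAR_TO_ID[c] for c in normalize_label(raw)]

def greedy_ctc_decode(frame_ids: List[int]) -> str:
    """Collapse repeated indices, then remove blanks."""
    chars = []
    prev = BLANK
    for idx in frame_ids:
        if idx != BLANK and idx != prev:
            chars.append(ID_TO_CHAR[idx])
        prev = idx
    return "".join(chars)

print(normalize_label("abc-1d23"))                    # ABC1D23
print(greedy_ctc_decode([11, 11, 0, 11, 1, 1, 0]))    # AA0
```

The blank between the two runs of `11` is what allows a doubled character such as `AA` to survive the collapse.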

In [None]:
ocr_cfg = OCRConfig(
    data_root=DATA_ROOT,
    frames_per_sample=5,
    img_height=32,
    img_width=128,
    batch_size=64,
    epochs=10,
    num_workers=2,
    learning_rate=1e-3,
)

# Prefer 'mercosur' if most of your dataset is Mercosur format
PREFER_TEMPLATE = None  # or 'mercosur' or 'brazil_old'

ocr_out = train_ocr(
    ocr_cfg,
    out_dir=OUT_ROOT / 'ocr',
    hr_plus_root=str(hr_plus_root.parent),  # points to .../hr_plus
    hr_plus_scale=sr_cfg.scale,
    split_ratio=0.9,
    prefer_template=PREFER_TEMPLATE,
)
ocr_out

## 4) Artifacts & quick inspection
We save: SR checkpoint, OCR checkpoint, metrics CSV, and a small `val_samples.json` for error inspection.
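
For error inspection, one option is to tally character-level substitutions across the validation samples. The sketch below assumes each sample is a dict with ground-truth and prediction strings under the hypothetical keys `gt` and `pred`; the actual schema of `val_samples.json` is not shown here, so adjust the keys to match:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

def confusion_pairs(
    samples: Iterable[Dict[str, str]],
    gt_key: str = "gt",      # hypothetical field name
    pred_key: str = "pred",  # hypothetical field name
) -> Counter:
    """Count (true_char, predicted_char) substitutions in equal-length pairs."""
    counts: Counter = Counter()
    for sample in samples:
        gt, pred = sample[gt_key], sample[pred_key]
        if len(gt) != len(pred):
            continue  # insertions/deletions need an alignment; skipped here
        for g, p in zip(gt, pred):
            if g != p:
                counts[(g, p)] += 1
    return counts
```

Calling `confusion_pairs(samples).most_common(10)` would surface the most frequent substitutions, which can guide what goes into the confusion map.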

In [None]:
from pprint import pprint
import json

print('Output root:', OUT_ROOT)
print('Files:')
for p in sorted(OUT_ROOT.rglob('*'))[:50]:
    print(' -', p)

samples_path = OUT_ROOT / 'ocr' / 'val_samples.json'
if samples_path.exists():
    samples = json.loads(samples_path.read_text(encoding='utf-8'))
    pprint(samples[:10])

## 5) (Optional) Zip outputs for download
Kaggle lets you download the zip from the Output panel.

In [None]:
import shutil
zip_path = str(OUT_ROOT) + '.zip'
if Path(zip_path).exists():
    Path(zip_path).unlink()
shutil.make_archive(str(OUT_ROOT), 'zip', root_dir=str(OUT_ROOT))
print('Zipped to:', zip_path)