# üßæ Kraken OCR ‚Äî Easy Colab Trainer (Non-experts friendly)

This notebook helps you **train a Kraken recognition (rec) model** on Google Colab with minimal setup.

**Highlights**  
- **No admin needed** ‚Äî works fully in Colab.  
- **Fast re-open** ‚Äî uses a **pip cache in Drive**, so re-installs are much faster next time.  
- **Zero path headaches** ‚Äî auto-creates `train.txt` / `val.txt` if missing.  
- **Safe defaults** ‚Äî CPU by default (set GPU in *Runtime ‚Üí Change runtime type* if available).

> Colab resets the runtime on re-open. We keep a **pip cache in Drive** to make installs much faster.


In [None]:
# --- Step 0: Mount Google Drive (for datasets, models, and pip cache) ---
from google.colab import drive
drive.mount('/content/drive')
print("Drive mounted.")

In [None]:
# --- Step 1: Fast setup with Drive pip cache & Kraken install ---
import os, sys, subprocess, importlib

PIP_CACHE_DIR = "/content/drive/MyDrive/pip-cache"
os.makedirs(PIP_CACHE_DIR, exist_ok=True)

KRAKEN_VERSION = os.environ.get("KRAKEN_VERSION", "").strip() or ""   # pin if desired, e.g. "5.3.0"
EXTRA = "[train]"  # include training extras

def ensure(pkg_spec: str):
    try:
        mod_name = pkg_spec.split("[" ,1)[0].split("==",1)[0]
        importlib.import_module(mod_name)
        print(f"‚úì {pkg_spec} already available.")
    except Exception:
        print(f"Installing {pkg_spec} (cached in Drive) ...")
        cmd = [sys.executable, "-m", "pip", "install", "--upgrade", "--no-input",
               "--cache-dir", PIP_CACHE_DIR, pkg_spec]
        subprocess.check_call(cmd)

pkg = f"kraken{EXTRA}" if not KRAKEN_VERSION else f"kraken{EXTRA}=={KRAKEN_VERSION}"
ensure(pkg)

import kraken, subprocess
print("Kraken version:", getattr(kraken, "__version__", "unknown"))
!ketos --version || true


## Configure your project (edit the variables below)

- You can **upload a ZIP dataset** and/or a **base .mlmodel** below, or point to data already on Drive.  
- The notebook will create `train.txt` and `val.txt` automatically if they don't exist.
- Supports **image+`.gt.txt` pairs** and **ALTO/PAGE XML** (choose via `DATA_FORMAT`).

> If unsure, start with `DATA_FORMAT = "pairs"`.


In [None]:
# --- Step 2: Project paths & options (EDIT these if needed) ---
from pathlib import Path

# === Edit these: ===
DATA_DIR   = Path("/content/drive/MyDrive/kraken_data/0093")   # or leave blank and upload a ZIP below
DATA_FORMAT = "pairs"   # "pairs" (image + .gt.txt), or "alto", or "page"

RUN_ID     = "0093"     # short id for your dataset/run
OUT_DIR    = Path(f"/content/drive/MyDrive/kraken_models/{RUN_ID}/rec")
OUT_DIR.mkdir(parents=True, exist_ok=True)

VAL_RATIO = 0.1
RANDOM_SEED = 42

print("Models will be saved in:", OUT_DIR)

## Upload data and (optionally) the first base model

- If this is your **first attempt** (e.g., `attempt_01`) and you don't have a model in Drive yet, upload a **base model** (`.mlmodel`).
- You can also upload your **dataset** as a **ZIP** (we'll extract it to Drive). If your data is **already on Drive**, you can **skip** the ZIP upload and edit `DATA_DIR` above.


In [None]:
# --- Step 2.1: (Optional) Upload ZIP dataset and/or a base .mlmodel ---
from google.colab import files
from pathlib import Path
import zipfile, io

print("If your data is already on Drive, you can skip uploading a ZIP here.")
uploaded = files.upload()  # choose zero or more files

def is_zip(name): return name.lower().endswith(".zip")
def is_model(name): return name.lower().endswith(".mlmodel")

for fname, bytes_obj in uploaded.items():
    if is_zip(fname):
        print(f"Extracting ZIP to {DATA_DIR} ...")
        DATA_DIR.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(io.BytesIO(bytes_obj), 'r') as z:
            z.extractall(DATA_DIR)
        print("Done extracting.")
    elif is_model(fname):
        OUT_DIR.mkdir(parents=True, exist_ok=True)
        target = OUT_DIR / "attempt_01.mlmodel"
        print(f"Saving uploaded base model to {target} ...")
        with open(target, "wb") as f:
            f.write(bytes_obj)
        print("Saved base model.")
    else:
        print(f"Skipped {fname} (not .zip/.mlmodel)")

print("Upload step complete. If you added data, run 'Build train/val lists'.")

## Attempts: name, chaining, and where they save

- Set `ATTEMPT_NUM` manually (e.g., `1`, `2`, `3`, ...).  
- The model will be saved to Drive as `attempt_XX.mlmodel` (e.g., `attempt_02.mlmodel`).  
- **Chaining:** If `ATTEMPT_NUM > 1` and `BASE_MODEL_MODE = "auto"`, we'll load `attempt_{ATTEMPT_NUM-1}.mlmodel` as the base.  
- If you want to load a **different** base model, set `BASE_MODEL_MODE = "manual"` and give the path in `BASE_MODEL_MANUAL`.


In [None]:
# --- Step 2.2: Attempt settings & chaining ---
from pathlib import Path

ATTEMPT_NUM = 1                 # e.g. 1, 2, 3 ... you control this
BASE_MODEL_MODE = "auto"        # "auto" ‚Üí previous attempt; "manual" ‚Üí use BASE_MODEL_MANUAL
BASE_MODEL_MANUAL = ""          # set a path if using "manual"

ATTEMPT_NAME = f"attempt_{ATTEMPT_NUM:02d}.mlmodel"
BEST_MODEL_OUT = OUT_DIR / ATTEMPT_NAME

if BASE_MODEL_MODE == "auto":
    if ATTEMPT_NUM <= 1:
        base_model_path = OUT_DIR / "attempt_01.mlmodel"
    else:
        base_model_path = OUT_DIR / f"attempt_{ATTEMPT_NUM-1:02d}.mlmodel"
elif BASE_MODEL_MODE == "manual":
    base_model_path = Path(BASE_MODEL_MANUAL) if BASE_MODEL_MANUAL else Path("")
else:
    raise ValueError("BASE_MODEL_MODE must be 'auto' or 'manual'")

print("Attempt:", ATTEMPT_NAME)
print("Base model (resolved):", str(base_model_path) if base_model_path else "(none)")
print("Models directory:", OUT_DIR)

## Build `train.txt` / `val.txt` automatically

Scans your `DATA_DIR`, collects samples, and writes:
- `/content/lists/train.txt`
- `/content/lists/val.txt`


In [None]:
# --- Step 3: Auto-create train/val lists from your data ---
import os, random, glob
from pathlib import Path

LISTS_DIR = Path("/content/lists")
LISTS_DIR.mkdir(parents=True, exist_ok=True)
train_list_path = LISTS_DIR / "train.txt"
val_list_path   = LISTS_DIR / "val.txt"

def collect_pairs(root: Path):
    exts = {".png",".jpg",".jpeg",".tif",".tiff",".bmp"}
    images = []
    for ext in exts:
        images.extend(root.rglob(f"*{ext}"))
    samples = []
    for img in images:
        base = img.with_suffix("")
        gt = img.parent / (base.name + ".gt.txt")
        if gt.exists():
            samples.append(str(gt))  # kraken expects GT file paths
    return sorted(samples)

def collect_xml(root: Path, kind: str):
    return sorted([str(p) for p in root.rglob("*.xml")])

if DATA_FORMAT == "pairs":
    all_samples = collect_pairs(Path(DATA_DIR))
elif DATA_FORMAT in ("alto", "page"):
    all_samples = collect_xml(Path(DATA_DIR), DATA_FORMAT)
else:
    raise ValueError("DATA_FORMAT must be one of: 'pairs', 'alto', 'page'")

if not all_samples:
    raise SystemExit(f"No training samples found under {DATA_DIR}.")

random.seed(RANDOM_SEED)
random.shuffle(all_samples)
n = len(all_samples)
n_val = max(1, int(n * VAL_RATIO))
val = all_samples[:n_val]
train = all_samples[n_val:]

with open(train_list_path, "w") as f:
    f.write("\n".join(train) + "\n")
with open(val_list_path, "w") as f:
    f.write("\n".join(val) + "\n")

print(f"Wrote {len(train)} train and {len(val)} val samples.")
print("Train list:", train_list_path)
print("Val   list:", val_list_path)

In [None]:
# --- Step 4: Sanity-check the lists ---
from pathlib import Path
print(Path('/content/lists/train.txt').read_text().splitlines()[:5])
print(Path('/content/lists/val.txt').read_text().splitlines()[:5])

## Train with checkpoints and pick the **best** model automatically

- Saves a checkpoint **every epoch** (`--savefreq 1`).  
- If you **stop** training to tweak LR, we still evaluate all checkpoints and pick the **best**.  
- The best model is saved to `OUT_DIR/attempt_XX.mlmodel` and logged to `attempts.csv`.


In [None]:
# --- Step 5A: Train (pairs/alto/page auto) with checkpoints and best picking ---
import os, re, csv, shutil, subprocess, datetime
from pathlib import Path

train_list_path = "/content/lists/train.txt"
val_list_path   = "/content/lists/val.txt"

assert Path(train_list_path).exists(), "train.txt missing. Build lists first."
assert Path(val_list_path).exists(), "val.txt missing. Build lists first."

CKPT_DIR = OUT_DIR / f"_ckpts_{ATTEMPT_NAME.replace('.mlmodel','')}"
CKPT_DIR.mkdir(parents=True, exist_ok=True)

base_opt = ""
if 'base_model_path' in globals() and base_model_path and Path(base_model_path).exists():
    base_opt = f'--load "{base_model_path}"'
    print("Using base model:", base_model_path)
else:
    print("No base model found. Training from scratch.")

# Hyperparams (edit)
EPOCHS = 20
LR = 0.0003
BATCH_SIZE = 16

fmt_opt = "" if DATA_FORMAT == "pairs" else f"-f {DATA_FORMAT}"

train_cmd = f'''ketos train {fmt_opt} {base_opt} --savefreq 1 --epochs {EPOCHS} -lr {LR} -b {BATCH_SIZE} -o "{CKPT_DIR}/epoch_model.mlmodel" $(cat "{train_list_path}") --validation $(cat "{val_list_path}")'''
print("Training command:\\n", train_cmd)

ret = subprocess.call(train_cmd, shell=True, executable="/bin/bash")
if ret != 0:
    print("Training exited non-zero (you may have interrupted it). Proceeding to pick the best checkpoint.")

def parse_accuracy(output: str):
    acc = None
    cer = None
    m = re.search(r'accuracy[:\\s]+([0-9.]+)%', output, re.I)
    if m: acc = float(m.group(1))
    m2 = re.search(r'CER[:\\s]+([0-9.]+)', output, re.I)
    if m2: cer = float(m2.group(1))
    return acc, cer

def score_model(model_path: Path):
    test_cmd = f'''ketos test -m "{model_path}" $(cat "{val_list_path}")'''
    proc = subprocess.run(test_cmd, shell=True, executable="/bin/bash",
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    out = proc.stdout
    acc, cer = parse_accuracy(out)
    score = acc if acc is not None else (100.0 - cer*100.0 if cer is not None else float("-inf"))
    return score, acc, cer

ckpts = sorted(Path(CKPT_DIR).glob("*.mlmodel"))
if not ckpts:
    raise SystemExit(f"No checkpoints found in {CKPT_DIR}. Nothing to pick.")

best = None
for m in ckpts:
    score, acc, cer = score_model(m)
    print(f"{m.name}: score={score:.4f}  acc={acc}  cer={cer}")
    if best is None or score > best["score"]:
        best = {"path": m, "score": score, "acc": acc, "cer": cer}

shutil.copy2(best["path"], BEST_MODEL_OUT)
print("Saved best model as:", BEST_MODEL_OUT)

log_path = OUT_DIR / "attempts.csv"
is_new = not log_path.exists()
with open(log_path, "a", newline="") as f:
    w = csv.writer(f)
    if is_new:
        w.writerow(["timestamp","attempt","model_path","base_model","score","accuracy","cer"])
    w.writerow([datetime.datetime.now().isoformat(), ATTEMPT_NAME, str(BEST_MODEL_OUT),
                str(base_model_path) if 'base_model_path' in globals() and base_model_path else "", best["score"], best["acc"], best["cer"]])

print("Logged to:", log_path)

## Evaluate best model on validation set

In [None]:
# --- Step 6: Evaluate the saved best attempt model ---
from pathlib import Path
import subprocess, sys

model_path = str(BEST_MODEL_OUT)
assert Path(model_path).exists(), f"Model not found: {model_path}"
val_list_path = "/content/lists/val.txt"
cmd = f'''ketos test -m "{model_path}" $(cat "{val_list_path}")'''
print("Running:\\n", cmd)
ret = subprocess.call(cmd, shell=True, executable="/bin/bash")
if ret != 0:
    sys.exit("Evaluation failed.")

## Troubleshooting

- **Missing `train.txt`** ‚Üí Build lists again.  
- **Slow install** ‚Üí The Drive **pip cache** speeds it up next time.  
- **First attempt** ‚Üí Upload or point to a base model for `attempt_01`.  
- **Next attempts** ‚Üí Set `ATTEMPT_NUM` and keep `BASE_MODEL_MODE="auto"` to chain from the last attempt.



# üìí Kraken Training ‚Äî **ALTO-only** (Colab, auto-flag detection, GPU/CPU)

This notebook:
- Assumes **ALTO** ground truth (eScriptorium exports)
- Upload ZIPs ‚Üí auto-extract
- Rebuilds `train.txt` / `val.txt`
- **Auto-detects** Kraken CLI flags (older vs newer): `--lr` vs `--lrate`, `--validation` vs `--evaluation-files`
- Uses Python `subprocess.run` (prints logs live)
- Auto-selects GPU if available

> In Colab: **Runtime ‚Üí Change runtime type ‚Üí GPU** first.


In [None]:

# 0) GPU check
!nvidia-smi || true


## 1) Install Kraken + deps (Py 3.12 / Colab-safe)

In [4]:

import sys, subprocess
def pip_install(*pkgs):
    print("pip install", " ".join(pkgs))
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *pkgs])
# Colab-friendly pin (CLI may still be older depending on environment; we'll detect flags later)
pip_install("kraken==5.3.0", "torch>=2.1,<3", "cairocffi", "opencv-python", "lxml", "h5py")


pip install kraken==5.3.0 torch>=2.1,<3 cairocffi opencv-python lxml h5py


In [5]:

from importlib.metadata import version, PackageNotFoundError
import shutil, torch, sys, subprocess

def pkg_ver(name):
    try: return version(name)
    except PackageNotFoundError: return "not installed"

print("python:", sys.version.split()[0])
print("kraken:", pkg_ver("kraken"))
print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("ketos path:", shutil.which("ketos"))

# Show a snippet of help to confirm CLI is reachable
subprocess.run(["ketos", "train", "--help"], check=False)


python: 3.12.12
kraken: 5.3.0
torch: 2.8.0+cu126
cuda available: False
ketos path: /usr/local/bin/ketos


CompletedProcess(args=['ketos', 'train', '--help'], returncode=0)

## 2) Config ‚Äî ALTO fixed, validation split

In [6]:

# ======= CONFIG (ALTO) =======
FORMAT = "alto"     # hard-coded
VAL_FRACTION = 0.10 # validation split
# =============================


## 3) Upload ALTO ZIP(s) ‚Äî auto-extract under /content/data_alto

In [7]:

import os, io, zipfile, re
from google.colab import files

LOCAL_BASE = "/content/data_alto"
os.makedirs(LOCAL_BASE, exist_ok=True)

print("Upload ALTO export ZIP(s). They will be extracted into", LOCAL_BASE)
uploaded = files.upload()

EXTRACTED_ROOTS = []
for name, data in uploaded.items():
    path = os.path.join(LOCAL_BASE, name)
    with open(path, "wb") as f: f.write(data)
    if name.lower().endswith(".zip"):
        base = re.sub(r"\s*\(\d+\)\s*$", "", os.path.splitext(name)[0])
        target = os.path.join(LOCAL_BASE, base)
        os.makedirs(target, exist_ok=True)
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            zf.extractall(target)
        os.remove(path)
        EXTRACTED_ROOTS.append(target)
    else:
        EXTRACTED_ROOTS.append(LOCAL_BASE)

print("Extracted roots:")
for r in EXTRACTED_ROOTS: print(" -", r if os.path.exists(r) else r+" (missing)")


Upload ALTO export ZIP(s). They will be extracted into /content/data_alto


Saving export_doc5946_0093_alto_202510261625.zip to export_doc5946_0093_alto_202510261625.zip
Extracted roots:
 - /content/data_alto/export_doc5946_0093_alto_202510261625


## 4) Build ALTO XML list + write train/val files

In [8]:
# === Build train/val lists from all uploaded ALTO data ===
# This keeps 1 random page for validation and uses the rest for training.

import os, glob, random, xml.etree.ElementTree as ET, pathlib

SEARCH_ROOT = "/content/data_alto"   # folder where you extracted your ZIPs
IMG_EXTS = (".jpg",".jpeg",".png",".tif",".tiff",".bmp",".jp2")

def is_alto_xml(path):
    """Return True if file looks like a proper ALTO XML."""
    base = os.path.basename(path)
    if base.lower() == "mets.xml":
        return False
    try:
        r = ET.parse(path).getroot()
        return isinstance(r.tag, str) and "alto" in r.tag.lower()
    except Exception:
        return False

def has_img(xml_path):
    """Check if the XML has a matching image beside it."""
    stem = os.path.splitext(os.path.basename(xml_path))[0]
    d = os.path.dirname(xml_path)
    return any(os.path.exists(os.path.join(d, stem + ext)) for ext in IMG_EXTS)

# collect ALTO XMLs
xmls = sorted(
    p for p in glob.glob(os.path.join(SEARCH_ROOT, "**", "*.xml"), recursive=True)
    if is_alto_xml(p) and has_img(p)
)
print("Found usable ALTO pages:", len(xmls))
assert len(xmls) >= 2, "Need at least 2 pages (1 for validation)."

# choose 1 random page for validation
random.seed(42)
random.shuffle(xmls)
val_xmls   = [xmls[0]]
train_xmls = xmls[1:]

# write list files
pathlib.Path("/content/lists").mkdir(parents=True, exist_ok=True)
open("/content/lists/train.txt","w").write("\n".join(train_xmls))
open("/content/lists/val.txt","w").write("\n".join(val_xmls))

print(f"‚úÖ Train pages: {len(train_xmls)} | Val pages: {len(val_xmls)}")
print("Validation page:", val_xmls[0])

Found usable ALTO pages: 7
‚úÖ Train pages: 6 | Val pages: 1
Validation page: /content/data_alto/export_doc5946_0093_alto_202510261625/0021_mirrored.xml


## 5) Hyperparameters (+ auto device) ‚Äî edit learning rates here

In [62]:

# ======= EDIT ME (Hyperparameters) =======
import torch, shutil

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
print("Using device:", DEVICE)
KETOS = shutil.which("ketos") or "ketos"
print("ketos path:", KETOS)

# Recognition
REC_EPOCHS = 30
REC_BATCH  = 8
REC_OPTIM  = "Adam"    # or "SGD"
REC_LR     = 0.0001    # learning rate (decimal allowed)
REC_WD     = 1e-5

# Segmentation
SEG_EPOCHS = 20
SEG_BATCH  = 2
SEG_OPTIM  = "Adam"
SEG_LR     = 0.0005    # learning rate
# =========================================


Using device: cpu
ketos path: /usr/local/bin/ketos


## 6) Helpers ‚Äî detect Kraken flags and run training with logs

In [9]:

import subprocess, shlex

def detect_flags():
    # Detect whether CLI supports --lr or --lrate; --validation or --evaluation-files
    out = subprocess.run([KETOS, "train", "--help"], text=True, capture_output=True).stdout
    lr_flag   = "--lr" if "--lr" in out else "--lrate"
    val_flag  = "--validation" if "--validation" in out else "--evaluation-files"
    print(f"[detected] lr flag: {lr_flag}, validation flag: {val_flag}")
    return lr_flag, val_flag

LR_FLAG, VAL_FLAG = detect_flags()

def run_recognition(out_dir, epochs, batch, optim, lr, weight_decay, device, train_list, val_list):
    args = [
        KETOS, "train",
        "-f", FORMAT, f"@{train_list}",
        "-o", out_dir,
        "--device", device,
        "--epochs", str(epochs),
        "--batch-size", str(batch),
        "--optimizer", optim, LR_FLAG, str(lr), "--weight-decay", str(weight_decay),
        VAL_FLAG, f"@{val_list}",
    ]
    print(">>>", " ".join(shlex.quote(a) for a in args))
    # stream logs live
    return subprocess.run(args).returncode

def run_segmentation(out_dir, epochs, batch, optim, lr, device, train_list, val_list):
    args = [
        KETOS, "segtrain",
        "-f", FORMAT, f"@{train_list}",
        "-o", out_dir,
        "--device", device,
        "--epochs", str(epochs),
        "--batch-size", str(batch),
        "--optimizer", optim, LR_FLAG, str(lr),
        VAL_FLAG, f"@{val_list}",
    ]
    print(">>>", " ".join(shlex.quote(a) for a in args))
    return subprocess.run(args).returncode


[detected] lr flag: --lr, validation flag: --evaluation-files


## 7) üîé Smoke test ‚Äî recognition (1 epoch, tiny batch)

In [10]:
rc = run_recognition(
    out_dir="/content/models/rec_smoke",
    epochs=1, batch=2, optim=REC_OPTIM, lr=REC_LR, weight_decay=REC_WD,
    device="cpu",
    train_list="/content/lists/train.txt",
    val_list="/content/lists/val.txt"
)
print("Return code:", rc)

>>> ketos train -f alto @/content/lists/train.txt -o /content/models/rec_smoke --device cpu --epochs 1 --batch-size 2 --optimizer Adam --lr 3e-05 --weight-decay 1e-05 --evaluation-files @/content/lists/val.txt
Return code: 2


## 8) ‚ñ∂Ô∏è Recognition ‚Äî full training

In [1]:
# === Recognition Training (Google Drive, per-MS folders, keep BEST only) ===
# Saves only the BEST model (highest validation accuracy) for each attempt:
# /content/drive/MyDrive/kraken_models/<MANUSCRIPT_ID>/rec/attempt_XX.mlmodel
#
# Example target folder for manuscript 0093 (recognition):
# /content/drive/MyDrive/kraken_models/0093/rec/

import os, re, glob, shutil, shlex, subprocess, torch, multiprocessing
from google.colab import drive, files

# ------------------ MOUNT GOOGLE DRIVE ------------------
drive.mount('/content/drive')

#  CHANGE THIS PER MANUSCRIPT
MANUSCRIPT_ID = "0093"   # e.g. "0093", "0089", ...
MODEL_SUBDIR  = f"/content/drive/MyDrive/kraken_models/{MANUSCRIPT_ID}/rec"
os.makedirs(MODEL_SUBDIR, exist_ok=True)
print(f"Models will be saved in: {MODEL_SUBDIR}")
# ---------------------------------------------------------

# ---------------- CONFIG (EDIT PER ATTEMPT) ----------------
ATTEMPT_NUM = 9           # set attempt number manually (1, 2, 3, ...)
BASE_MODEL  = "/content/drive/MyDrive/kraken_models/0093/rec/attempt_01.mlmodel"        # path to base .mlmodel, or None to upload
# e.g. BASE_MODEL = "/content/drive/MyDrive/kraken_models/0093/rec/attempt_06.mlmodel""
# or   BASE_MODEL = None

REC_EPOCHS = 10
REC_BATCH  = 4
REC_OPTIM  = "Adam"
REC_LR     = 0.00003
REC_WD     = 1e-5
# -----------------------------------------------------------

# Resolve base model
if not BASE_MODEL:
    print("No BASE_MODEL defined ‚Äî please upload a starting .mlmodel file.")
    uploaded = files.upload()
    BASE_MODEL = "/content/" + list(uploaded.keys())[0]
    print(f"Uploaded base model: {BASE_MODEL}")
else:
    assert os.path.exists(BASE_MODEL), f"Base model not found: {BASE_MODEL}"
    print(f"Using base model: {BASE_MODEL}")

# Output names
OUTPUT_PREFIX = os.path.join(MODEL_SUBDIR, f"attempt_{ATTEMPT_NUM:02d}")
OUTPUT_FINAL  = OUTPUT_PREFIX + ".mlmodel"
print(f"Final best model will be saved as: {OUTPUT_FINAL}")

# Environment / threads / device
num_cores = multiprocessing.cpu_count()
threads = max(1, num_cores - 1)
os.environ.update({
    "OMP_NUM_THREADS": str(threads),
    "MKL_NUM_THREADS": str(threads),
    "OPENBLAS_NUM_THREADS": str(threads),
    "NUMEXPR_NUM_THREADS": str(threads),
})

import shutil as _shutil
KETOS  = _shutil.which("ketos") or "ketos"
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f"Device: {DEVICE} | ketos: {KETOS} | CPU threads: {threads}")

# Kraken 3.x flags
LR_FLAG  = "--lrate"
VAL_FLAG = "--evaluation-files"

# Train/val lists
train_list_path = "/content/lists/train.txt"
val_list_path   = "/content/lists/val.txt"
with open(train_list_path) as f:
    train_paths = [p.strip() for p in f if p.strip()]
with open(val_list_path) as f:
    val_paths   = [p.strip() for p in f if p.strip()]
assert train_paths and val_paths, "Missing or empty train/val lists."
print(f"Train pages: {len(train_paths)} | Val pages: {len(val_paths)}")

# Build command
cmd = [
    KETOS, "train",
    "-f", "alto",
    "--load", BASE_MODEL,
    *train_paths,
    "-o", OUTPUT_PREFIX,              # kraken will write PREFIX_best.mlmodel + epoch files
    "--device", DEVICE,
    "--epochs", str(REC_EPOCHS),
    "--batch-size", str(REC_BATCH),
    "--optimizer", REC_OPTIM,
    LR_FLAG, str(REC_LR),
    "--weight-decay", str(REC_WD),
    VAL_FLAG, *val_paths,
]
print("\n>>> Running:\n", " ".join(shlex.quote(a) for a in cmd), "\n")

# Keep-BEST helper:
# Prefer <prefix>_best.mlmodel (Kraken's best validation checkpoint).
# If none exists (e.g., interrupted before first val), fallback to latest epoch file.
def keep_best_only():
    best = glob.glob(OUTPUT_PREFIX + "_best.mlmodel")
    epocheds = sorted(glob.glob(OUTPUT_PREFIX + "_*.mlmodel"))
    candidate = best[0] if best else (epocheds[-1] if epocheds else None)
    if candidate:
        # copy best to the canonical attempt file
        shutil.copy2(candidate, OUTPUT_FINAL)
        print(f"‚úÖ Saved BEST model as: {OUTPUT_FINAL}")
        # remove other checkpoints
        for f in epocheds:
            if f != candidate and os.path.exists(f):
                os.remove(f)
        # also remove the *_best file if its name differs from OUTPUT_FINAL
        if candidate != OUTPUT_FINAL and os.path.exists(candidate):
            os.remove(candidate)
        print("üßπ Cleaned intermediate checkpoints; kept only the best.")
    else:
        print("‚ö†Ô∏è No checkpoint found (interrupted before first validation/save).")

# Run and always keep only the best
proc = None
try:
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for line in proc.stdout:
        print(line, end="")
    proc.wait()
    print("\nReturn code:", proc.returncode)
finally:
    keep_best_only()

if proc and proc.returncode != 0:
    raise RuntimeError("Training did not complete cleanly; check logs above.")

print("\n‚úÖ Done. Current files in the manuscript's rec folder:")
!ls -lh "$MODEL_SUBDIR"

Mounted at /content/drive
Models will be saved in: /content/drive/MyDrive/kraken_models/0093/rec
Using base model: /content/drive/MyDrive/kraken_models/0093/rec/attempt_01.mlmodel
Final best model will be saved as: /content/drive/MyDrive/kraken_models/0093/rec/attempt_09.mlmodel
Device: cpu | ketos: ketos | CPU threads: 1


FileNotFoundError: [Errno 2] No such file or directory: '/content/lists/train.txt'

## 9) ‚ñ∂Ô∏è Segmentation ‚Äî full training

In [None]:

rc = run_segmentation(
    out_dir="/content/models/seg",
    epochs=SEG_EPOCHS, batch=SEG_BATCH, optim=SEG_OPTIM, lr=SEG_LR,
    device=DEVICE,
    train_list="/content/lists/train.txt",
    val_list="/content/lists/val.txt"
)
print("Return code:", rc)
assert rc == 0, "Segmentation training failed."


## 10) Evaluate recognition (CER/WER)

In [None]:

import subprocess, shlex
args = ["ketos", "test", "-f", FORMAT, "/content/models/rec_best.mlmodel", "@/content/lists/val.txt"]
print(">>>", " ".join(shlex.quote(a) for a in args))
rc = subprocess.run(args).returncode
print("Return code:", rc)
assert rc == 0, "Evaluation failed."


## 11) Package models for download

In [None]:

!mkdir -p /content/models && cd /content/models && ls -lh && zip -r ../trained_models.zip . && cd /content


## 12) Optional ‚Äî Upload to msia.escriptorium.fr via API

In [None]:

# UI upload (My Models ‚Üí Upload) is simplest.
MSIA_URL   = "https://msia.escriptorium.fr"
API_TOKEN  = "PASTE_YOUR_TOKEN_HERE"   # keep secret or leave blank
MODEL_PATH = "/content/models/rec_best.mlmodel"   # or seg_best.mlmodel
MODEL_NAME = "rec_best"

if API_TOKEN and API_TOKEN != "PASTE_YOUR_TOKEN_HERE":
    import requests
    headers = {"Authorization": f"Token {API_TOKEN}"}
    files = {"file": (MODEL_NAME + ".mlmodel", open(MODEL_PATH, "rb"), "application/octet-stream")}
    data  = {"name": MODEL_NAME}
    resp = requests.post(f"{MSIA_URL}/api/models/", headers=headers, files=files, data=data)
    print("Status:", resp.status_code)
    try:
        print(resp.json())
    except Exception:
        print(resp.text[:800])
else:
    print("Skipping API upload. Use the UI or paste a valid API token.")
