# MobileTeXOCR V2: Train HME Recognition on Google Colab

This notebook trains the **V2** Handwritten Mathematical Expression (HME) recognition model.

## V2 Features:
- **Proper autoregressive generation** (fixed SOS/EOS handling)
- **Differential Attention** (ICLR 2025) - noise cancellation in attention
- **Mixture of Experts FFN** - sparse computation
- **Optional PaTH Attention** - Householder transform position encoding
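
As a rough illustration of the Differential Attention idea listed above, the NumPy sketch below subtracts a second softmax attention map from the first, so that attention mass both maps assign to irrelevant positions cancels out. It is not the repository's implementation: the function names are made up for this example, it is single-head, and the fixed scalar `lam` stands in for the learned weighting a real model would use.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Single-head sketch: (softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))) V."""
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))  # first attention map
    a2 = softmax(q2 @ k2.T / np.sqrt(d))  # second map, treated as a noise estimate
    return (a1 - lam * a2) @ v

# Toy shapes chosen only for illustration
rng = np.random.default_rng(0)
seq_len, d_head, d_value = 4, 8, 16
q1, k1, q2, k2 = (rng.standard_normal((seq_len, d_head)) for _ in range(4))
v = rng.standard_normal((seq_len, d_value))

out = differential_attention(q1, k1, q2, k2, v)
print(out.shape)
```

With `lam=0` the function reduces to ordinary scaled dot-product attention, which is a convenient sanity check.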

**Before running:**
1. Go to Runtime -> Change runtime type -> Select **T4 GPU** (or better)
2. Run all cells in order

**Model Variants:**

| Variant | Attention + FFN | Model Size | Notes |
|---------|-----------------|------------|-------|
| ultralight_v2 | Differential Attention + MoE | ~7 MB | Best for mobile |
| ultralight_v2_path | PaTH Attention + MoE | ~7.5 MB | Alternative attention |

In [None]:
# Check GPU availability
!nvidia-smi

## 1. Install PaddlePaddle

In [None]:
# Install PaddlePaddle GPU version
%pip install -q paddlepaddle-gpu==2.6.1 -i https://pypi.tuna.tsinghua.edu.cn/simple

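# Verify installation
import paddle
print(f"PaddlePaddle version: {paddle.__version__}")
# is_compiled_with_cuda() only reports whether this build supports CUDA;
# the device count below shows whether a GPU is actually visible to the runtime
print(f"Compiled with CUDA: {paddle.device.is_compiled_with_cuda()}")
print(f"GPU count: {paddle.device.cuda.device_count()}")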

## 2. Clone Repository & Install Dependencies

In [None]:
# Clone from GitHub
!git clone https://github.com/markm39/MobileTeXOCR.git
%cd MobileTeXOCR
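
Re-running the clone cell in the same session fails because the target directory already exists. One possible guard is sketched below. `ensure_repo` is a hypothetical helper, not part of the repository, and its `runner` parameter exists only so the function can be exercised without network access.

```python
import os
import subprocess

REPO_URL = "https://github.com/markm39/MobileTeXOCR.git"

def ensure_repo(repo_url, dest, runner=subprocess.run):
    """Clone repo_url into dest unless dest already contains a git checkout."""
    if os.path.isdir(os.path.join(dest, ".git")):
        return "exists"
    # check=True makes a failed clone raise CalledProcessError instead of passing silently
    runner(["git", "clone", repo_url, dest], check=True)
    return "cloned"

# Usage in Colab, from the directory where the clone should live (typically /content):
# ensure_repo(REPO_URL, "MobileTeXOCR")
```

The check is relative to the current working directory, so call it before `%cd MobileTeXOCR`, not after.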

In [None]:
# Install dependencies
%pip install -q -r requirements.txt
%pip install -q visualdl shapely pyclipper lmdb

## 3. Download CROHME Dataset

In [None]:
!python tools/download_hme_datasets.py --dataset crohme --data_dir ./train_data

## 4. Select Model Variant & Train

In [None]:
# Choose your model variant
# Options: "ultralight_v2" (recommended), "ultralight_v2_path"
MODEL_VARIANT = "ultralight_v2"

config_map = {
    # V2 models (recommended - proper autoregressive generation)
    "ultralight_v2": "configs/rec/hme_latex_ocr_ultralight_v2.yml",
    "ultralight_v2_path": "configs/rec/hme_latex_ocr_ultralight_v2_path.yml",
    # Legacy V1 models (broken inference - for reference only)
    "ultralight": "configs/rec/hme_latex_ocr_ultralight.yml"
}

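CONFIG_PATH = config_map[MODEL_VARIANT]
# Assumed to match Global.save_model_dir in the selected config; later cells read checkpoints from here.
# No trailing slash, so paths such as {OUTPUT_DIR}/best_accuracy expand cleanly.
OUTPUT_DIR = f"./output/rec/hme_{MODEL_VARIANT}"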

print(f"Model variant: {MODEL_VARIANT}")
print(f"Config: {CONFIG_PATH}")
print(f"Output: {OUTPUT_DIR}")
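
A mistyped variant name in the cell above surfaces as a bare `KeyError`, and a wrong working directory only shows up later when training cannot find the config. The sketch below is one way to fail earlier with a clearer message. `resolve_config` and `example_map` are hypothetical names introduced for this example.

```python
import os

def resolve_config(variant, config_map, must_exist=True):
    """Return the config path for variant, with descriptive errors for common mistakes."""
    if variant not in config_map:
        options = ", ".join(sorted(config_map))
        raise ValueError(f"Unknown MODEL_VARIANT {variant!r}; choose one of: {options}")
    path = config_map[variant]
    if must_exist and not os.path.isfile(path):
        raise FileNotFoundError(
            f"Config not found: {path} (is the working directory the repo root?)"
        )
    return path

# Stand-in for the notebook's config_map; must_exist=False skips the file check
# so the example also runs outside the repository
example_map = {
    "ultralight_v2": "configs/rec/hme_latex_ocr_ultralight_v2.yml",
    "ultralight_v2_path": "configs/rec/hme_latex_ocr_ultralight_v2_path.yml",
}
print(resolve_config("ultralight_v2", example_map, must_exist=False))
```

In the notebook this could replace the direct lookup, for example `CONFIG_PATH = resolve_config(MODEL_VARIANT, config_map)`.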

In [None]:
# Make sure we have the latest code
!git pull origin main

In [None]:
# Start training!
!python tools/train.py -c {CONFIG_PATH}

## 5. Debug & Verify Training (Optional)

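Run the two cells below once training has saved at least one checkpoint. The first writes a script that loads `latest.pdparams`, runs a single evaluation batch through the model, and compares the predicted tokens with the targets; the second executes it with the variant selected above.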

In [None]:
%%writefile debug_preds_v2.py
import paddle
import numpy as np
import sys
import logging
import os
sys.path.insert(0, '.')

from ppocr.modeling.architectures import build_model
from ppocr.data import build_dataloader
from tools.program import load_config

logger = logging.getLogger('ppocr')
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
logger.addHandler(handler)

config_path = os.environ.get('CONFIG_PATH', 'configs/rec/hme_latex_ocr_ultralight_v2.yml')
output_dir = os.environ.get('OUTPUT_DIR', './output/rec/hme_ultralight_v2/')

print(f"Loading config: {config_path}")
config = load_config(config_path)
model = build_model(config['Architecture'])

ckpt_path = os.path.join(output_dir, 'latest.pdparams')
print(f"Loading checkpoint: {ckpt_path}")
state = paddle.load(ckpt_path)
model.set_state_dict(state)
model.eval()

valid_dataloader = build_dataloader(config, 'Eval', None, logger)

batch = next(iter(valid_dataloader))
batch = [paddle.to_tensor(b) if isinstance(b, np.ndarray) else b for b in batch]

images, image_masks, decoder_inputs, decoder_targets, label_masks = batch

print(f"\nBatch shapes:")
print(f"  images: {images.shape}")
print(f"  decoder_inputs: {decoder_inputs.shape}")
print(f"  decoder_targets: {decoder_targets.shape}")

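# Forward pass without gradients; the output is assumed to be per-step logits
# of shape [batch, seq_len, vocab_size]
with paddle.no_grad():
    logits = model(images)

# Compare the first 20 greedy predictions of the first sample with its targets
pred_tokens = logits.argmax(axis=-1).numpy()[0][:20]
target_tokens = decoder_targets.numpy()[0][:20]

print(f"\nPredicted: {pred_tokens.tolist()}")
print(f"Targets:   {target_tokens.tolist()}")
print(f"Unique predictions: {len(set(pred_tokens.tolist()))}")

# Heuristic: two or fewer distinct tokens among the first 10 suggests repetition collapse
if len(set(pred_tokens[:10].tolist())) <= 2:
    print("\n[WARNING] Possible repetition collapse detected!")
else:
    print("\n[OK] Model producing diverse tokens")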

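# Rebuild the vocabulary: ids 0 and 1 are reserved for <eos> and <sos>,
# followed by the entries of the dictionary file in order
dict_path = config['Global']['character_dict_path']
vocab = ['<eos>', '<sos>']
with open(dict_path, 'r', encoding='utf-8') as f:
    vocab.extend([line.strip() for line in f])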

print(f"\nDecoded prediction:")
decoded = []
for t in pred_tokens:
    if t == 0:
        break
    if t == 1:
        continue
    if t < len(vocab):
        decoded.append(vocab[t])
print(' '.join(decoded))

In [None]:
import os
os.environ['CONFIG_PATH'] = CONFIG_PATH
os.environ['OUTPUT_DIR'] = OUTPUT_DIR
!python debug_preds_v2.py
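
The two checks at the end of the debug script can be isolated as small functions, which makes them easy to try on toy inputs. The sketch below mirrors the script's logic (collapse heuristic over the first 10 tokens, decoding that stops at `<eos>` and skips `<sos>`). The function names and the toy vocabulary are made up for illustration.

```python
def looks_collapsed(token_ids, window=10, max_unique=2):
    """Return True if the first `window` tokens contain at most `max_unique` distinct ids."""
    head = [int(t) for t in token_ids[:window]]
    return len(set(head)) <= max_unique

def decode_tokens(token_ids, vocab, eos_id=0, sos_id=1):
    """Map token ids to vocabulary entries, stopping at <eos> and skipping <sos>."""
    out = []
    for t in token_ids:
        t = int(t)
        if t == eos_id:
            break
        if t == sos_id:
            continue
        if 0 <= t < len(vocab):  # ignore ids outside the vocabulary
            out.append(vocab[t])
    return " ".join(out)

# Toy vocabulary with the same id layout as the script: 0 = <eos>, 1 = <sos>
toy_vocab = ["<eos>", "<sos>", "x", "^", "2", "+", "y"]

print(decode_tokens([1, 2, 3, 4, 5, 6, 0, 2, 2], toy_vocab))  # x ^ 2 + y
print(looks_collapsed([5] * 10))                              # True
print(looks_collapsed([1, 2, 3, 4, 5, 6, 0]))                 # False
```

Tokens after the first `<eos>` are ignored, so trailing padding or repeated output does not leak into the decoded string.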

## 6. Test on Sample Images

In [None]:
# List some test images
!ls ./train_data/CROHME/evaluation/images/ | head -10

In [None]:
# Display a test image
from IPython.display import Image, display
test_image = './train_data/CROHME/evaluation/images/18_em_10.jpg'
display(Image(test_image))

In [None]:
# Run inference on test image
!python tools/test_hme_model_v2.py \
    --image ./train_data/CROHME/evaluation/images/18_em_10.jpg \
    --checkpoint {OUTPUT_DIR}/best_accuracy \
    --config {CONFIG_PATH}

## 7. Export & Download Trained Model

In [None]:
# Export to inference model
INFERENCE_DIR = f"./inference/hme_{MODEL_VARIANT}/"

!python tools/export_model.py -c {CONFIG_PATH} \
    -o Global.export_with_pir=False \
       Global.pretrained_model={OUTPUT_DIR}/best_accuracy \
       Global.save_inference_dir={INFERENCE_DIR}

In [None]:
# Check model size
print("Model size:")
!du -sh {INFERENCE_DIR}
!du -sh {INFERENCE_DIR}/*
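
To compare the exported size against the approximate figures in the variant table programmatically, the sketch below sums file sizes under a directory. `dir_size_mb` is a hypothetical helper. It counts file bytes, whereas `du` reports disk blocks, so the two numbers can differ slightly.

```python
import os
import tempfile

def dir_size_mb(path):
    """Total size of all files under path, in MiB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)

# Demonstration on a temporary directory holding a single 2 MiB file
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "weights.bin"), "wb") as f:
        f.write(b"\0" * (2 * 1024 * 1024))
    demo_mb = dir_size_mb(tmp)

print(f"{demo_mb:.1f} MB")  # 2.0 MB
```

In the notebook, `dir_size_mb(INFERENCE_DIR)` would give the figure to compare with the table.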

In [None]:
# Zip and download trained model
!zip -r hme_model_v2.zip {OUTPUT_DIR} {INFERENCE_DIR}

from google.colab import files
files.download('hme_model_v2.zip')

## 8. Resume Training (if interrupted)

In [None]:
# Resume from last checkpoint
!python tools/train.py -c {CONFIG_PATH} \
    -o Global.checkpoints={OUTPUT_DIR}/latest
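
Resuming only works if the checkpoint files still exist, and Colab's local disk is wiped when the runtime is recycled. The sketch below shows one way to copy the output directory to persistent storage and to check whether a `latest` checkpoint is present before resuming. Both helper names and the Drive path in the usage comment are assumptions for illustration.

```python
import os
import shutil

def backup_checkpoints(src_dir, dst_dir):
    """Copy src_dir into dst_dir, overwriting files already there; return dst_dir's entries."""
    if not os.path.isdir(src_dir):
        raise FileNotFoundError(f"No output directory at {src_dir}")
    # dirs_exist_ok (Python 3.8+) allows repeated backups into the same destination
    shutil.copytree(src_dir, dst_dir, dirs_exist_ok=True)
    return sorted(os.listdir(dst_dir))

def can_resume(output_dir, prefix="latest"):
    """True if the parameter file the resume command relies on is present."""
    return os.path.isfile(os.path.join(output_dir, prefix + ".pdparams"))

# Usage in Colab (the Drive folder name is a placeholder):
# from google.colab import drive
# drive.mount('/content/drive')
# backup_checkpoints(OUTPUT_DIR, '/content/drive/MyDrive/hme_checkpoints')
# print(can_resume(OUTPUT_DIR))
```

After a runtime reset, calling the same helper with the arguments swapped restores the checkpoints into `OUTPUT_DIR` before the resume cell is run.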