# Train a LaTeX OCR model
In this brief notebook I show how you can fine-tune or train an OCR model.

I've opted to mix handwritten data into the regular PDF LaTeX images. To do that, I started from the released pretrained model and continued training on the slightly larger corpus.

In [8]:
pip install pix2tex[train] -qq

Note: you may need to restart the kernel to use updated packages.


In [29]:
pip install entmax streamlit PyQt6 python-Levenshtein torchtext imagesize tqdm munch torch opencv_python_headless requests einops x_transformers transformers tokenizers numpy Pillow PyYAML pandas timm albumentations pyreadline3 pygments screeninfo pyside6 python-multipart uvicorn[standard] -qq

Note: you may need to restart the kernel to use updated packages.


In [1]:
import os

# Create the directory if it doesn't exist
os.makedirs('LaTeX-OCR', exist_ok=True)

# Change the current working directory
os.chdir('LaTeX-OCR')

In [28]:
pip install opencv-python-headless




In [4]:
# Check which GPU we have
!gpustat

NguyenHuyHoang              2024-05-11 14:20:36  551.78

[0] NVIDIA GeForce GTX 1650 | 50°C,  23 % |   404 /  4096 MB | NGUYENHUYHOANG\huyho(?M) NGUYENHUYHOANG\huyho(?M) ?(?M) NGUYENHUYHOANG\huyho(?M) NGUYENHUYHOANG\huyho(?M) NGUYENHUYHOANG\huyho(?M) NGUYENHUYHOANG\huyho(?M) NGUYENHUYHOANG\huyho(?M) NGUYENHUYHOANG\huyho(?M) NGUYENHUYHOANG\huyho(?M) NGUYENHUYHOANG\huyho(?M) NGUYENHUYHOANG\huyho(?M)



In [1]:

import os
import gdown
import zipfile
import random
import shutil
os.chdir('LaTeX-OCR')
# Create the directories if they don't exist
os.makedirs('dataset/data', exist_ok=True)
os.makedirs('image', exist_ok=True)

# Download the files from Google Drive
gdown.download('https://drive.google.com/uc?id=13vjxGYrFCuYnwgDIUqkxsNGKk__D_sOM', 'dataset/data/crohme.zip', quiet=False)
gdown.download('https://drive.google.com/uc?id=176PKaCUDWmTJdQwc-OfkO0y8t4gLsIvQ', 'dataset/data/pdf.zip', quiet=False)
gdown.download('https://drive.google.com/uc?id=1QUjX6PFWPa-HBWdcY-7bA5TRVUnbyS1D', 'dataset/data/pdfmath.txt', quiet=False)

# Extract the zip files
with zipfile.ZipFile('dataset/data/crohme.zip', 'r') as zip_ref:
    zip_ref.extractall('dataset/data')

with zipfile.ZipFile('dataset/data/pdf.zip', 'r') as zip_ref:
    zip_ref.extractall('dataset/data')

# Split the handwritten data into a validation set and a training set
os.makedirs('dataset/valimages', exist_ok=True)

image_files = os.listdir('dataset/data/images')
val_files = random.sample(image_files, 1000)

for file in val_files:
    shutil.move(os.path.join('dataset/data/images', file), 'dataset/valimages')

Downloading...
From (original): https://drive.google.com/uc?id=13vjxGYrFCuYnwgDIUqkxsNGKk__D_sOM
From (redirected): https://drive.google.com/uc?id=13vjxGYrFCuYnwgDIUqkxsNGKk__D_sOM&confirm=t&uuid=daca999c-b81a-444d-9b60-5afce51cbe44
To: c:\Users\huyho\Desktop\LaTeX-OCR\notebooks\LaTeX-OCR\dataset\data\crohme.zip
100%|██████████| 59.8M/59.8M [00:18<00:00, 3.22MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=176PKaCUDWmTJdQwc-OfkO0y8t4gLsIvQ
From (redirected): https://drive.google.com/uc?id=176PKaCUDWmTJdQwc-OfkO0y8t4gLsIvQ&confirm=t&uuid=0b74cda9-6dca-4648-8a5e-25e7d5e5723d
To: c:\Users\huyho\Desktop\LaTeX-OCR\notebooks\LaTeX-OCR\dataset\data\pdf.zip
100%|██████████| 284M/284M [01:22<00:00, 3.45MB/s] 
Downloading...
From: https://drive.google.com/uc?id=1QUjX6PFWPa-HBWdcY-7bA5TRVUnbyS1D
To: c:\Users\huyho\Desktop\LaTeX-OCR\notebooks\LaTeX-OCR\dataset\data\pdfmath.txt
100%|██████████| 36.6M/36.6M [00:10<00:00, 3.34MB/s]
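
The split in the cell above draws 1000 random files on every run, so running it a second time moves a further 1000 images out of the training folder. A minimal sketch of a reproducible alternative follows. The helper name `split_validation`, the seed value, and the example file names are hypothetical and not part of the original notebook; the sketch only picks the file names and leaves the `shutil.move` step to the caller.

```python
import random


def split_validation(files, n_val=1000, seed=42):
    """Return (train, val) lists of file names; val holds at most n_val names."""
    rng = random.Random(seed)         # local RNG, so the global random state is untouched
    ordered = sorted(files)           # os.listdir gives no ordering guarantee
    n_val = min(n_val, len(ordered))  # random.sample raises ValueError if the sample exceeds the population
    val = set(rng.sample(ordered, n_val))
    train = [name for name in ordered if name not in val]
    return train, sorted(val)


# Hypothetical file names standing in for os.listdir('dataset/data/images')
names = [f"{i:07d}.png" for i in range(5000)]
train, val = split_validation(names)
print(len(train), len(val))
```

With a fixed seed and sorted input, the same files are selected on every run, which makes validation scores comparable across training sessions.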


Now we generate the datasets. We can string multiple datasets together to get one large lookup table. The only things saved in these pkl files are the image sizes, the image locations, and the ground-truth LaTeX code. That way we can serve batches of images with the same dimensions.
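
To illustrate why storing the image sizes helps, here is a minimal sketch of a size-keyed lookup table. It is not the pix2tex implementation; the function names `bucket_by_size` and `batches` and the sample entries are hypothetical. Each sample is grouped under its `(width, height)` key, so every batch drawn from one bucket contains images of identical dimensions.

```python
from collections import defaultdict


def bucket_by_size(samples):
    """Group (path, (width, height), latex) records by their image size."""
    buckets = defaultdict(list)
    for path, size, latex in samples:
        buckets[size].append((path, latex))
    return dict(buckets)


def batches(buckets, batchsize):
    """Yield (size, items) pairs; all items in a batch share the same size."""
    for size, items in buckets.items():
        for start in range(0, len(items), batchsize):
            yield size, items[start:start + batchsize]


# Hypothetical records: image location, image size, ground-truth LaTeX
samples = [
    ("a.png", (128, 32), r"x^2"),
    ("b.png", (128, 32), r"\frac{a}{b}"),
    ("c.png", (256, 64), r"\int f"),
    ("d.png", (128, 32), r"y"),
]
lookup = bucket_by_size(samples)
all_batches = list(batches(lookup, batchsize=2))
for size, items in all_batches:
    print(size, [path for path, _ in items])
```

Because images in a batch already share one shape, they can be stacked into a single tensor without padding or resizing.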

In [None]:
!python -m pix2tex.dataset.dataset -i dataset/data/images dataset/data/train -e dataset/data/CROHME_math.txt dataset/data/pdfmath.txt -o dataset/data/train.pkl

In [None]:
!python -m pix2tex.dataset.dataset -i dataset/data/valimages dataset/data/val -e dataset/data/CROHME_math.txt dataset/data/pdfmath.txt -o dataset/data/val.pkl

In [3]:
# If using wandb
%pip install wandb
# You can skip the login if you don't want to use W&B or don't have an account.
#!wandb login

In [5]:
# download the weights we want to fine tune
!curl -L -o weights.pth https://github.com/lukas-blecher/LaTeX-OCR/releases/download/v0.0.1/weights.pth

In [4]:
# Generate a Colab-specific config (set 'debug' to true if wandb is not used)
!echo {backbone_layers: [2, 3, 7], betas: [0.9, 0.999], batchsize: 10, bos_token: 1, channels: 1, data: dataset/data/train.pkl, debug: true, decoder_args: {'attn_on_attn': true, 'cross_attend': true, 'ff_glu': true, 'rel_pos_bias': false, 'use_scalenorm': false}, dim: 256, encoder_depth: 4, eos_token: 2, epochs: 50, gamma: 0.9995, heads: 8, id: null, load_chkpt: 'weights.pth', lr: 0.001, lr_step: 30, max_height: 192, max_seq_len: 512, max_width: 672, min_height: 32, min_width: 32, model_path: checkpoints, name: mixed, num_layers: 4, num_tokens: 8000, optimizer: Adam, output_path: outputs, pad: false, pad_token: 0, patch_size: 16, sample_freq: 2000, save_freq: 1, scheduler: StepLR, seed: 42, temperature: 0.2, test_samples: 5, testbatchsize: 20, tokenizer: dataset/tokenizer.json, valbatches: 100, valdata: dataset/data/val.pkl} > colab.yaml

In [1]:
!python -m pix2tex.train --config colab.yaml