## Fine tuning

Based on: https://towardsdatascience.com/ocr-free-document-understanding-with-donut-1acfbdf099be/

Significant donut issues I encountered:
- https://github.com/clovaai/donut/issues/282
- https://github.com/clovaai/donut/issues/303

In [1]:
!git clone https://github.com/zzzDavid/ICDAR-2019-SROIE.git sroie

Cloning into 'sroie'...
remote: Enumerating objects: 2386, done.[K
remote: Counting objects: 100% (20/20), done.[K
remote: Compressing objects: 100% (2/2), done.[K
remote: Total 2386 (delta 18), reused 18 (delta 18), pack-reused 2366 (from 1)[K
Receiving objects: 100% (2386/2386), 278.64 MiB | 42.10 MiB/s, done.
Resolving deltas: 100% (239/239), done.
Updating files: 100% (1980/1980), done.


In [2]:
!cat sroie/data/key/021.json

{
    "company": "TEO HENG STATIONERY & BOOKS",
    "date": "18/01/2018",
    "address": "NO. 53, JALAN BESAR, 45600 BATANG BERJUNTAI SELANGOR DARUL EHSAN",
    "total": "4.90"
}

In [3]:
!ls sroie/data/key/ | wc -l

626


In [4]:
!rm -rf sroie-donut/

In [5]:
import os
import json
import shutil
from tqdm.notebook import tqdm

train_size = 500
val_size = 100
test_size = 26

beg = 0

for split_name, split_size in [('train', train_size), ('validation', val_size), ('test', test_size)]:
    lines = []
    images = []

    os.makedirs(f'./sroie-donut/{split_name}', exist_ok=True)
    for ann in tqdm(os.listdir("./sroie/data/key")[beg:beg+split_size]):
        if ann != ".ipynb_checkpoints":
            with open("./sroie/data/key/" + ann) as f:
                data = json.load(f)
            images.append(ann[:-4] + "jpg")
            line = {"gt_parse": data}
            lines.append(line)
    with open(f"./sroie-donut/{split_name}/metadata.jsonl", 'w') as f:
        for i, gt_parse in enumerate(lines):
            line = {"file_name": images[i], "ground_truth": json.dumps(gt_parse)}
            f.write(json.dumps(line) + "\n")
            shutil.copyfile("./sroie/data/img/" + images[i], f"./sroie-donut/{split_name}/" + images[i])
    beg = beg + split_size

  0%|          | 0/500 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/26 [00:00<?, ?it/s]

In [6]:
!git clone https://github.com/clovaai/donut.git

Cloning into 'donut'...
remote: Enumerating objects: 289, done.[K
remote: Total 289 (delta 0), reused 0 (delta 0), pack-reused 289 (from 1)[K
Receiving objects: 100% (289/289), 62.76 MiB | 38.83 MiB/s, done.
Resolving deltas: 100% (135/135), done.


The DONUT was released in October 6th 2022, the idea is to fix the versions of the below packages in order to have successful training.

In [7]:
# https://colab.research.google.com/drive/1c_RGCgQeLHVXlF44LyOFjfUW32CmG6BP#scrollTo=LAZ11nESX6qt
!pip install -q condacolab
import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/download/24.11.2-1_colab/Miniforge3-colab-24.11.2-1_colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:06
🔁 Restarting kernel...


In [13]:
!wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O /usr/local/bin/yq && chmod +x /usr/local/bin/yq

--2025-08-03 11:40:28--  https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/mikefarah/yq/releases/download/v4.47.1/yq_linux_amd64 [following]
--2025-08-03 11:40:28--  https://github.com/mikefarah/yq/releases/download/v4.47.1/yq_linux_amd64
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://release-assets.githubusercontent.com/github-production-release-asset/43225113/ab7f5105-5465-49ba-a24e-624c8435dd9b?sp=r&sv=2018-11-09&sr=b&spr=https&se=2025-08-03T12%3A31%3A20Z&rscd=attachment%3B+filename%3Dyq_linux_amd64&rsct=application%2Foctet-stream&skoid=96c2d410-5711-43a1-aedd-ab1947aa7ab0&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skt=2025-08-03T11%3A31%3A09Z&ske=2025-08-03T12%3A31%3A20Z&sks=b&skv=2018-11-09&sig=af

In [14]:
!cp donut/config/train_cord.yaml donut/config/train_sroie.yaml

In [15]:
!yq eval '.result_path = "./result"' --inplace donut/config/train_sroie.yaml
!yq eval '.dataset_name_or_paths = ["../sroie-donut"]' --inplace donut/config/train_sroie.yaml
!yq eval '.train_batch_sizes[0] = 1' --inplace donut/config/train_sroie.yaml
!yq eval '.val_batch_sizes[0] = 1' --inplace donut/config/train_sroie.yaml
!yq eval '.check_val_every_n_epoch = 10' --inplace donut/config/train_sroie.yaml
!cat donut/config/train_sroie.yaml | yq

[36mresume_from_checkpoint_path[0m: null [90m# only used for resume_from_checkpoint option in PL[0m
[90m[0m[36mresult_path[0m:[32m "./result"[0m[36m[0m
[36mpretrained_model_name_or_path[0m:[32m "naver-clova-ix/donut-base"[0m[90m # loading a pre-trained model (from moldehub or path)[0m
[90m[0m[36mdataset_name_or_paths[0m: [[32m../sroie-donut[0m][90m # loading datasets (from moldehub or path)[0m
[90m[0m[36msort_json_key[0m:[95m False [0m[90m# cord dataset is preprocessed, and publicly available at https://huggingface.co/datasets/naver-clova-ix/cord-v2[0m
[90m[0m[36mtrain_batch_sizes[0m: [[95m1[0m][36m[0m
[36mval_batch_sizes[0m: [[95m1[0m][36m[0m
[36minput_size[0m: [[95m1280[0m,[95m 960[0m][90m # when the input resolution differs from the pre-training setting, some weights will be newly initialized (but the model training would be okay)[0m
[90m[0m[36mmax_length[0m:[95m 768[0m
[95m[0m[36malign_long_axis[0m:[95m False[0m
[

In [24]:
!conda create -n donut python=3.9 > /dev/null
!conda run -n donut python --version



    current version: 24.11.2
    latest version: 25.5.1

Please update conda by running

    $ conda update -n base -c conda-forge conda


Python 3.9.23



In [25]:
%%writefile donut-requirements.txt
transformers==4.22.2 # pip install transformers==4.22.2
pytorch-lightning==1.8.5 # https://pypi.org/project/pytorch-lightning/1.6.4/
timm==0.5.4 # https://pypi.org/project/timm/0.5.4/
sentence-transformers==2.2.1 # https://pypi.org/project/sentence-transformers/2.2.1/
sconf==0.2.5 # https://pypi.org/project/sconf/0.2.5/
zss==1.2.0 # https://pypi.org/project/zss/1.2.0/
nltk==3.7 # https://pypi.org/project/nltk/3.7/
datasets==2.4.0
pillow==9.2.0
pyarrow==9.0.0
numpy==1.23.5
fsspec==2022.8.2

Overwriting donut-requirements.txt


In [26]:
!conda run -n donut pip install -r donut-requirements.txt

Collecting transformers==4.22.2 (from -r donut-requirements.txt (line 1))
  Using cached transformers-4.22.2-py3-none-any.whl.metadata (84 kB)
Collecting pytorch-lightning==1.8.5 (from -r donut-requirements.txt (line 2))
  Using cached pytorch_lightning-1.8.5-py3-none-any.whl.metadata (25 kB)
Collecting timm==0.5.4 (from -r donut-requirements.txt (line 3))
  Using cached timm-0.5.4-py3-none-any.whl.metadata (36 kB)
Collecting sentence-transformers==2.2.1 (from -r donut-requirements.txt (line 4))
  Using cached sentence_transformers-2.2.1-py3-none-any.whl
Collecting sconf==0.2.5 (from -r donut-requirements.txt (line 5))
  Using cached sconf-0.2.5-py3-none-any.whl.metadata (3.9 kB)
Collecting zss==1.2.0 (from -r donut-requirements.txt (line 6))
  Using cached zss-1.2.0-py3-none-any.whl
Collecting nltk==3.7 (from -r donut-requirements.txt (line 7))
  Using cached nltk-3.7-py3-none-any.whl.metadata (2.8 kB)
Collecting datasets==2.4.0 (from -r donut-requirements.txt (line 8))
  Using cached

In [27]:
!cd donut && conda run -n donut python train.py --config config/train_sroie.yaml

Moving 0 files to the new cache system
resume_from_checkpoint_path: None
result_path: ./result
pretrained_model_name_or_path: naver-clova-ix/donut-base
dataset_name_or_paths: 
  - ../sroie-donut
sort_json_key: False
train_batch_sizes: 
  - 1
val_batch_sizes: 
  - 1
input_size: 
  - 1280
  - 960
max_length: 768
align_long_axis: False
num_nodes: 1
seed: 2022
lr: 3e-05
warmup_steps: 300
num_training_samples_per_epoch: 800
max_epochs: 30
max_steps: -1
num_workers: 8
val_check_interval: 1.0
check_val_every_n_epoch: 10
gradient_clip_val: 1.0
verbose: True
exp_name: train_sroie
exp_version: 20250803_114543
Config is saved at result/train_sroie/20250803_114543/config.yaml
Downloading and preparing dataset imagefolder/sroie-donut to /root/.cache/huggingface/datasets/imagefolder/sroie-donut-b698272269d8cb80/0.0.0/0fc50c79b681877cc46b23245a6ef5333d036f48db40d53765a68034bc48faff...
Dataset imagefolder downloaded and prepared to /root/.cache/huggingface/datasets/imagefolder/sroie-donut-b698272269d8

In [29]:
!cd donut && conda run -n donut python test.py --dataset_name_or_path ../sroie-donut --pretrained_model_name_or_path './result/train_sroie/20250803_114543' --save_path ./result/output.json

Total number of samples: 26, Tree Edit Distance (TED) based accuracy score: 0.9459366631881628, F1 accuracy score: 0.8309178743961353

  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

Resolving data files:   0%|          | 0/501 [00:00<?, ?it/s]
Resolving data files: 100%|██████████| 501/501 [00:00<00:00, 155586.13it/s]

Resolving data files:   0%|          | 0/27 [00:00<?, ?it/s]
Resolving data files: 100%|██████████| 27/27 [00:00<00:00, 24089.81it/s]

Resolving data files:   0%|          | 0/101 [00:00<?, ?it/s]
Resolving data files: 100%|██████████| 101/101 [00:00<00:00, 49891.03it/s]
Using custom data configuration sroie-donut-b698272269d8cb80
Reusing dataset imagefolder (/root/.cache/huggingface/datasets/imagefolder/sroie-donut-b698272269d8cb80/0.0.0/0fc50c79b681877cc46b23245a6ef5333d036f48db40d53765a68034bc48faff)

  0%|          | 0/26 [00:00<?, ?it/s]
  4%|▍         | 1/26 [00:00<00:20,  1.24it/s]
  8%|▊         | 2/26 [00:01<00:13,  1.77it/s]
 12%|█▏    