Vision-encoder-decoder models (TrOCR, Donut, Nougat) fail on both the image-to-text and document-question-answering tasks. Large TrOCR models hit the ONNX protobuf 2GB size limit. Donut and Nougat models time out during compilation. The document-question-answering task is not supported by TasksManager for this model type.
Eval Results (2026-03-11)
| Status | Model | Task | Error |
| --- | --- | --- | --- |
| FAIL | microsoft/trocr-large-printed | image-to-text | Error parsing message with type 'onnx.ModelProto' |
| FAIL | microsoft/trocr-large-handwritten | image-to-text | same as above |
| FAIL | naver-clova-ix/donut-base | image-to-text | TIMEOUT (1200s) |
| FAIL | facebook/nougat-base | image-to-text | TIMEOUT (1200s) |
| FAIL | breezedeus/pix2text-mfr | image-to-text | does not appear to have pytorch_model.bin / model.safetensors |
| FAIL | naver-clova-ix/donut-base-finetuned-docvqa | document-question-answering | Task 'document-question-answering' not supported by TasksManager |
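The error strings in the results fall into a few recurring categories, which makes triage easier when the eval is re-run. The following is a minimal sketch of such a grouping; `classify_failure` and the category names are hypothetical and are not part of wmk or Optimum.

```python
# Hypothetical triage helper; the category names are illustrative only.
def classify_failure(error: str) -> str:
    """Map an error string from the eval results to a coarse category."""
    text = error.lower()
    if "onnx.modelproto" in text:
        # The issue attributes this parse error to the 2GB protobuf limit.
        return "onnx-parse-error"
    if text.startswith("timeout"):
        return "timeout"
    if "not supported by tasksmanager" in text:
        return "unsupported-task"
    if "pytorch_model.bin" in text or "model.safetensors" in text:
        return "missing-pytorch-weights"
    return "unknown"


if __name__ == "__main__":
    # Error strings copied from the eval results.
    results = {
        "microsoft/trocr-large-printed": "Error parsing message with type 'onnx.ModelProto'",
        "naver-clova-ix/donut-base": "TIMEOUT (1200s)",
        "breezedeus/pix2text-mfr": "does not appear to have pytorch_model.bin / model.safetensors",
        "naver-clova-ix/donut-base-finetuned-docvqa": "Task 'document-question-answering' not supported by TasksManager",
    }
    for model, error in results.items():
        print(f"{model}: {classify_failure(error)}")
```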
Donut/Nougat (TIMEOUT): These are large seq2seq models with complex decoders. Export or compilation exceeds the 1200s timeout. They may need the EncoderDecoderCache fix (see #426, "Sam2 add defaut task config.") plus size handling.
document-question-answering: TasksManager does not map document-question-answering to any ONNX config for the vision-encoder-decoder model type. This task needs to be registered as equivalent to image-to-text for Donut-style models.
pix2text-mfr: ONNX-only model (no PyTorch weights), so it is out of scope.
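The 2GB protobuf limit cited for the large TrOCR models can be sanity-checked with simple arithmetic before export. The sketch below estimates serialized weight size from a parameter count. The parameter counts are approximate figures as reported for TrOCR base and large and should be treated as assumptions, and `needs_external_data` is a hypothetical helper, not an Optimum or ONNX API.

```python
# Protobuf messages cannot exceed 2 GiB, so a single-file ONNX model
# whose weights alone reach that size cannot be serialized or parsed.
PROTOBUF_LIMIT_BYTES = 2 * 1024**3


def estimated_weight_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    """Rough weight size, assuming every parameter is stored as float32."""
    return num_params * bytes_per_param


def needs_external_data(num_params: int, bytes_per_param: int = 4) -> bool:
    """True when the weights alone would reach the protobuf limit."""
    return estimated_weight_bytes(num_params, bytes_per_param) >= PROTOBUF_LIMIT_BYTES


if __name__ == "__main__":
    # Approximate parameter counts (assumptions, not measured here).
    approx_params = {"trocr-base": 334_000_000, "trocr-large": 558_000_000}
    for name, count in approx_params.items():
        size_gb = estimated_weight_bytes(count) / 1e9
        print(f"{name}: ~{size_gb:.2f} GB, external data needed: {needs_external_data(count)}")
```

Under these assumptions the base model stays below the limit while the large model exceeds it, which is consistent with the base-size models passing and the large ones failing. When saving with the onnx package directly, `onnx.save_model(..., save_as_external_data=True)` is the usual way to move tensors out of the protobuf file.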
Current State
No vision_encoder_decoder.py in modelkit/models/hf/
document-question-answering task not registered in Optimum's TasksManager for this model type
microsoft/trocr-base-printed and trocr-base-handwritten PASS — confirming base-size works
Desired State
All 8 models (excluding pix2text-mfr, which is ONNX-only) pass wmk perf.
Acceptance Criteria
microsoft/trocr-large-printed and trocr-large-handwritten pass wmk perf
naver-clova-ix/donut-base and facebook/nougat-base pass wmk perf
document-question-answering: Register document-question-answering as a task alias for vision-encoder-decoder models and route it internally to the image-to-text config.
Donut TIMEOUT: Donut has a large autoregressive decoder. It may be hitting the EncoderDecoderCache issue (see #426, "Sam2 add defaut task config."), which causes an infinite loop or a very slow export. Fix the cache issue first.
trocr-base models already pass; use them as a reference for the correct export path.
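The task alias described in these notes could look roughly like the following. This is a minimal sketch of the routing idea only: `TASK_ALIASES` and `resolve_task` are hypothetical names, and the actual registration would go through whatever hook modelkit and Optimum's TasksManager expose.

```python
# Hypothetical alias table keyed by (model_type, requested_task).
# Only the pairing named in the notes is listed.
TASK_ALIASES = {
    ("vision-encoder-decoder", "document-question-answering"): "image-to-text",
}


def resolve_task(model_type: str, task: str) -> str:
    """Return the task whose ONNX config should be used for export.

    Tasks without an alias for the given model type pass through unchanged.
    """
    return TASK_ALIASES.get((model_type, task), task)


if __name__ == "__main__":
    print(resolve_task("vision-encoder-decoder", "document-question-answering"))
    print(resolve_task("vision-encoder-decoder", "image-to-text"))
```

Keying the table on the model type as well as the task keeps the alias from leaking to other architectures that may handle document-question-answering with their own configs.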
Related Files
modelkit/models/hf/blip.py — pattern for vision-language ONNX config
Eval Results (2026-03-11): 9/9 models fail, 0 pass.
Root Cause
trocr-large (ONNX size): trocr-large models generate ONNX files larger than 2GB. Same fix as xlm-roberta (see #429, "bug: --device npu resolves to QNN on AMD machines (should use VitisAI)"): external data format.
Acceptance Criteria
naver-clova-ix/donut-base-finetuned-docvqa, jinhybr/OCR-DocVQA-Donut, and Xenova/donut-base-finetuned-docvqa pass wmk perf
uv run pytest tests/ passes (CLAUDE.md Cardinal Rule #3)
Technical Notes
trocr-large: use_external_data_format=True (coordinate with #429)
Related Files
modelkit/export/io.py: register_onnx_overwrite()
eval_results/2026-03-11/models/microsoft__trocr-large-printed__image-to-text/result.json
eval_results/2026-03-11/models/naver-clova-ix__donut-base__image-to-text/result.json