vision-encoder-decoder / image-to-text + document-question-answering: all models pass wmk perf #133

@DingmaomaoBJTU

Summary

Vision-encoder-decoder models (TrOCR, Donut, Nougat) fail across image-to-text and document-question-answering. Large TrOCR models hit the ONNX protobuf 2GB size limit. Donut/Nougat models time out during compilation. Document-question-answering is not supported by TasksManager for this model type.

Eval Results (2026-03-11)

| Status | Model | Task | Error |
|---|---|---|---|
| FAIL | microsoft/trocr-large-printed | image-to-text | Error parsing message with type 'onnx.ModelProto' |
| FAIL | microsoft/trocr-large-handwritten | image-to-text | same |
| FAIL | naver-clova-ix/donut-base | image-to-text | TIMEOUT (1200s) |
| FAIL | facebook/nougat-base | image-to-text | TIMEOUT (1200s) |
| FAIL | breezedeus/pix2text-mfr | image-to-text | does not appear to have pytorch_model.bin / model.safetensors |
| FAIL | naver-clova-ix/donut-base-finetuned-docvqa | document-question-answering | Task 'document-question-answering' not supported by TasksManager |
| FAIL | jinhybr/OCR-DocVQA-Donut | document-question-answering | same |
| FAIL | Xenova/donut-base-finetuned-docvqa | document-question-answering | same |
| FAIL | fxmarty/tiny-doc-qa-vision-encoder-decoder | document-question-answering | same |

9/9 models fail — 0 pass.

Root Cause

  1. TrOCR large (ONNX size): trocr-large models export to ONNX files larger than 2GB, which exceeds the protobuf size limit. Same fix as xlm-roberta (#429): use the external data format.
  2. Donut/Nougat (TIMEOUT): These are large seq2seq models with complex decoders. Export or compilation exceeds the 1200s timeout. May need the EncoderDecoderCache fix (#426) plus size handling.
  3. document-question-answering: TasksManager does not map document-question-answering to any ONNX config for vision-encoder-decoder model type. This task needs to be registered as equivalent to image-to-text for donut-style models.
  4. pix2text-mfr: ONNX-only model (no PyTorch weights) — out of scope.
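
As a rough illustration of root cause 1, the sketch below estimates whether a model's weights would push a single-file ONNX export past the protobuf limit, so that external data can be enabled based on size rather than on model name. The function names, the 4-bytes-per-parameter default, and the headroom factor are assumptions made for this example; none of them come from modelkit or Optimum, and the parameter counts in the usage lines are illustrative, not measured.

```python
# A single protobuf message cannot exceed 2 GiB, so an ONNX ModelProto
# with all weights embedded must stay below this many bytes.
PROTOBUF_LIMIT_BYTES = 2**31 - 1


def estimated_weight_bytes(param_count: int, bytes_per_param: int = 4) -> int:
    """Estimate serialized weight size (default assumes float32 weights)."""
    return param_count * bytes_per_param


def needs_external_data(
    param_count: int,
    bytes_per_param: int = 4,
    headroom: float = 0.9,
) -> bool:
    """Return True when weights should be stored outside the .onnx file.

    `headroom` (hypothetical knob) leaves room for graph structure and
    metadata, which also count toward the protobuf message size.
    """
    budget = PROTOBUF_LIMIT_BYTES * headroom
    return estimated_weight_bytes(param_count, bytes_per_param) > budget


if __name__ == "__main__":
    # Illustrative parameter counts only.
    print(needs_external_data(300_000_000))  # well under the limit in float32
    print(needs_external_data(600_000_000))  # over the limit in float32
```

Deciding from the estimated size keeps the fix generic: any model that crosses the threshold takes the external-data path, whatever its name.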

Current State

  • No vision_encoder_decoder.py in modelkit/models/hf/
  • document-question-answering task not registered in Optimum's TasksManager for this model type
  • microsoft/trocr-base-printed and trocr-base-handwritten PASS — confirming base-size works

Desired State

All 8 models (excluding pix2text-mfr which is ONNX-only) pass wmk perf.

Acceptance Criteria

  • microsoft/trocr-large-printed and trocr-large-handwritten pass wmk perf
  • naver-clova-ix/donut-base and facebook/nougat-base pass wmk perf
  • naver-clova-ix/donut-base-finetuned-docvqa, jinhybr/OCR-DocVQA-Donut, Xenova/donut-base-finetuned-docvqa pass wmk perf
  • Fix is universal, with no hardcoded model names (CLAUDE.md Cardinal Rule #1)
  • uv run pytest tests/ passes (CLAUDE.md Cardinal Rule #3)

Technical Notes

  • ONNX size for TrOCR-large: Enable use_external_data_format=True (coordinate with #429)
  • document-question-answering: Register document-question-answering as a task alias for vision-encoder-decoder models — internally route to image-to-text config
  • Donut TIMEOUT: Donut has a large autoregressive decoder. It may be hitting the EncoderDecoderCache issue (#426), which causes an infinite loop or a very slow export. Fix the cache issue first.
  • trocr-base models already pass — use them as a reference for correct export path
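
The task-alias note above could be sketched roughly as follows. This is a minimal, standalone illustration, not Optimum's TasksManager API: `TASK_ALIASES` and `resolve_export_task` are hypothetical names, and how the resolved task is then handed to the exporter is left open. The lookup is keyed on model type and task, so no model names are hardcoded.

```python
# Hypothetical alias table: (model_type, requested_task) -> task whose
# ONNX config should be used for export.
TASK_ALIASES = {
    ("vision-encoder-decoder", "document-question-answering"): "image-to-text",
}


def resolve_export_task(model_type: str, task: str) -> str:
    """Map a requested task to the task used for export.

    Falls back to the requested task when no alias is registered, so
    model types without an alias keep their existing behavior.
    """
    # Accept both "vision_encoder_decoder" and "vision-encoder-decoder".
    normalized_type = model_type.replace("_", "-")
    return TASK_ALIASES.get((normalized_type, task), task)


if __name__ == "__main__":
    print(resolve_export_task("vision_encoder_decoder", "document-question-answering"))
    print(resolve_export_task("bert", "document-question-answering"))
```

Because unknown pairs fall through unchanged, adding the alias should not affect model types that already handle document-question-answering through their own configs.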

Related Files

  • modelkit/models/hf/blip.py — pattern for vision-language ONNX config
  • modelkit/export/io.py: register_onnx_overwrite()
  • eval_results/2026-03-11/models/microsoft__trocr-large-printed__image-to-text/result.json
  • eval_results/2026-03-11/models/naver-clova-ix__donut-base__image-to-text/result.json
