vision-encoder-decoder / image-to-text + document-question-answering: all models pass wmk perf #133

## Summary

Vision-encoder-decoder models (TrOCR, Donut, Nougat) fail across image-to-text and document-question-answering. Large TrOCR models hit the 2 GB protobuf size limit for ONNX files. Donut and Nougat models time out (1200s) during export or compilation. `document-question-answering` is not supported by TasksManager for this model type.

## Eval Results (2026-03-11)

| Status | Model | Task | Error |
|--------|-------|------|-------|
| FAIL | microsoft/trocr-large-printed | image-to-text | `Error parsing message with type 'onnx.ModelProto'` |
| FAIL | microsoft/trocr-large-handwritten | image-to-text | same |
| FAIL | naver-clova-ix/donut-base | image-to-text | TIMEOUT (1200s) |
| FAIL | facebook/nougat-base | image-to-text | TIMEOUT (1200s) |
| FAIL | breezedeus/pix2text-mfr | image-to-text | `does not appear to have pytorch_model.bin / model.safetensors` |
| FAIL | naver-clova-ix/donut-base-finetuned-docvqa | document-question-answering | `Task 'document-question-answering' not supported by TasksManager` |
| FAIL | jinhybr/OCR-DocVQA-Donut | document-question-answering | same |
| FAIL | Xenova/donut-base-finetuned-docvqa | document-question-answering | same |
| FAIL | fxmarty/tiny-doc-qa-vision-encoder-decoder | document-question-answering | same |

**9/9 models fail — 0 pass.**

## Root Cause

1. **TrOCR large (ONNX size)**: The `trocr-large` models export to an ONNX file larger than 2 GB, the maximum size of a single protobuf message, so the `onnx.ModelProto` fails to parse. Same fix as xlm-roberta (#429): external data format.
2. **Donut/Nougat (TIMEOUT)**: These are large seq2seq models with complex decoders. Export or compilation exceeds the 1200s timeout. They may need the EncoderDecoderCache fix (#426) plus size handling.
3. **document-question-answering**: `TasksManager` does not map `document-question-answering` to any ONNX config for the `vision-encoder-decoder` model type. This task needs to be registered as equivalent to `image-to-text` for Donut-style models.
4. **pix2text-mfr**: ONNX-only model (no PyTorch weights) — out of scope.
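
The size failure in item 1 can be anticipated before export. The following is a minimal sketch, assuming the weights dominate the file size; `needs_external_data` is a hypothetical helper, and the parameter counts in the usage lines are illustrative, not measured values for any model in the table.

```python
# Largest message protobuf can serialize or parse: 2 GiB minus one byte.
PROTOBUF_MAX_BYTES = 2**31 - 1


def estimated_weight_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    """Lower bound on the ONNX file size: raw weight storage only (fp32 by default)."""
    return num_params * bytes_per_param


def needs_external_data(num_params: int, bytes_per_param: int = 4) -> bool:
    """Hypothetical helper: True when the weights alone exceed the protobuf limit."""
    return estimated_weight_bytes(num_params, bytes_per_param) > PROTOBUF_MAX_BYTES


# Illustrative parameter counts, not measurements of any listed model.
print(needs_external_data(300_000_000))  # 1.2e9 bytes of fp32 weights
print(needs_external_data(600_000_000))  # 2.4e9 bytes of fp32 weights
```

This is only a rough check: graph structure adds overhead on top of the weights, and a seq2seq export may split the model across several files.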

## Current State

- No `vision_encoder_decoder.py` in `modelkit/models/hf/`
- `document-question-answering` task not registered in Optimum's TasksManager for this model type
- `microsoft/trocr-base-printed` and `trocr-base-handwritten` PASS — confirming base-size works

## Desired State

All 8 models (excluding pix2text-mfr which is ONNX-only) pass `wmk perf`.

## Acceptance Criteria

- [x] `microsoft/trocr-large-printed` and `trocr-large-handwritten` pass `wmk perf`
- [ ] `naver-clova-ix/donut-base` and `facebook/nougat-base` pass `wmk perf`
- [ ] `naver-clova-ix/donut-base-finetuned-docvqa`, `jinhybr/OCR-DocVQA-Donut`, `Xenova/donut-base-finetuned-docvqa`, and `fxmarty/tiny-doc-qa-vision-encoder-decoder` pass `wmk perf`
- [ ] Fix is universal — no hardcoded model names (CLAUDE.md Cardinal Rule #1)
- [ ] `uv run pytest tests/` passes (CLAUDE.md Cardinal Rule #3)

## Technical Notes

- **ONNX size for TrOCR-large**: Enable `use_external_data_format=True` (coordinate with #429)
- **document-question-answering**: Register `document-question-answering` as a task alias for `vision-encoder-decoder` models — internally route to `image-to-text` config
- **Donut TIMEOUT**: Donut has a large autoregressive decoder. It may be hitting the EncoderDecoderCache issue (#426), which causes an infinite loop or a very slow export. Fix the cache issue first.
- `trocr-base` models already pass, so use them as a reference for the correct export path
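
The task alias for `document-question-answering` can be sketched independently of Optimum. This is a minimal sketch under stated assumptions, not the `TasksManager` API: `TASK_ALIASES` and `resolve_export_task` are hypothetical names. The lookup is keyed on model type rather than model name, in line with the no-hardcoded-names criterion.

```python
# Hypothetical alias table: (model_type, requested_task) -> task used for export.
TASK_ALIASES: dict[tuple[str, str], str] = {
    ("vision-encoder-decoder", "document-question-answering"): "image-to-text",
}


def resolve_export_task(model_type: str, task: str) -> str:
    """Return the export task for a model type, falling back to the requested task."""
    return TASK_ALIASES.get((model_type, task), task)


# The alias applies only to the registered model type; other pairs pass through.
print(resolve_export_task("vision-encoder-decoder", "document-question-answering"))
print(resolve_export_task("vision-encoder-decoder", "image-to-text"))
```

Keeping the alias in one table means that supporting another model type later is a one-line change, with no per-model branching.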

## Related Files

- `modelkit/models/hf/blip.py` — pattern for vision-language ONNX config
- `modelkit/export/io.py` — `register_onnx_overwrite()`
- `eval_results/2026-03-11/models/microsoft__trocr-large-printed__image-to-text/result.json`
- `eval_results/2026-03-11/models/naver-clova-ix__donut-base__image-to-text/result.json`

## References

- Related: #429 (ONNX size — same external data format fix)
- Related: #426 (T5 EncoderDecoderCache — likely same issue for Donut)
