feat: Tool return convention for model-visible image content #22591

rndmcnlly · 2026-03-11T19:04:06Z

rndmcnlly
Mar 11, 2026

OWUI has a clean convention for tools that produce human-facing rich output: return an HTMLResponse with Content-Disposition: inline, the user sees an iframe, the model gets a status string (established in #21294). This works well.

The missing counterpart is model-facing visual output: a tool produces an image and the model needs to see it on its next turn in order to continue reasoning.

This comes up for any tool that renders a chart, screenshots a page, generates an image for iterative refinement, or captures visual output from a code interpreter. Today there is no blessed return type for this. The closest path — returning an MCP-style {"type": "image", ...} item — routes through file storage and associates the result with user messages rather than delivering it as a content part in the model's next context. It wasn't designed for this case.

The underlying message formats (OpenAI image_url, Anthropic image blocks, Ollama images) all support inline image data, and the conversion utilities between them already exist in the codebase. What's missing is a first-class return convention that plugin authors can reach for.

Desired Solution

A return type or convention — analogous to HTMLResponse for human-facing output — that tells OWUI: "deliver this image to the model as a visual content part in its next turn."

The exact shape (a new response class, a structured dict, something else) is an open design question. The important properties:

Explicit — plugin author opts in; existing tools unaffected
Symmetric — mirrors the HTMLResponse idiom plugin authors already know
Model-agnostic — works across OpenAI, Anthropic, and Ollama endpoints

Alternatives Considered

A plugin author can write a Filter that intercepts the outgoing message payload and injects image_url content parts after the tool result. This works but means every vision-related plugin reinvents the same transformation. Solving it centrally in process_tool_result (or equivalent) gives plugin developers a stable convention to target.

feat: implement MCP Annotations for MCP results to bypass LLM and context #22467 proposes the exact dual of this: MCP Annotations to mark tool output as human-only (bypassing the model). Together these two requests describe the full routing matrix: {human, model} × {text, rich} — and suggest a single, unified routing mechanism may be the right design target.
Image transcription for models that do not support images #22590 approaches the same underlying gap from the other direction: using a vision model as a pre-processing step to get image content into a non-vision model's context. Both workarounds point at the same missing primitive.

Additional Context

This is a plugin developer ergonomics issue. The user-facing chat display is a separate concern — the ask is purely that a tool can place an image into the model's context window with a convention as natural as HTMLResponse is for the human-facing case.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Tool return convention for model-visible image content #22591

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Uh oh!

feat: Tool return convention for model-visible image content #22591

Uh oh!

Uh oh!

rndmcnlly Mar 11, 2026

Desired Solution

Alternatives Considered

Related

Additional Context

Replies: 0 comments

rndmcnlly
Mar 11, 2026