You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
OWUI has a clean convention for tools that produce human-facing rich output: return an HTMLResponse with Content-Disposition: inline, the user sees an iframe, the model gets a status string (established in #21294). This works well.
The missing counterpart is model-facing visual output: a tool produces an image and the model needs to see it on its next turn in order to continue reasoning.
This comes up for any tool that renders a chart, screenshots a page, generates an image for iterative refinement, or captures visual output from a code interpreter. Today there is no blessed return type for this. The closest path — returning an MCP-style {"type": "image", ...} item — routes through file storage and associates the result with user messages rather than delivering it as a content part in the model's next context. It wasn't designed for this case.
The underlying message formats (OpenAI image_url, Anthropic image blocks, Ollama images) all support inline image data, and the conversion utilities between them already exist in the codebase. What's missing is a first-class return convention that plugin authors can reach for.
Desired Solution
A return type or convention — analogous to HTMLResponse for human-facing output — that tells OWUI: "deliver this image to the model as a visual content part in its next turn."
The exact shape (a new response class, a structured dict, something else) is an open design question. The important properties:
Symmetric — mirrors the HTMLResponse idiom plugin authors already know
Model-agnostic — works across OpenAI, Anthropic, and Ollama endpoints
Alternatives Considered
A plugin author can write a Filter that intercepts the outgoing message payload and injects image_url content parts after the tool result. This works but means every vision-related plugin reinvents the same transformation. Solving it centrally in process_tool_result (or equivalent) gives plugin developers a stable convention to target.
Related
feat: implement MCP Annotations for MCP results to bypass LLM and context #22467 proposes the exact dual of this: MCP Annotations to mark tool output as human-only (bypassing the model). Together these two requests describe the full routing matrix: {human, model} × {text, rich} — and suggest a single, unified routing mechanism may be the right design target.
Image transcription for models that do not support images #22590 approaches the same underlying gap from the other direction: using a vision model as a pre-processing step to get image content into a non-vision model's context. Both workarounds point at the same missing primitive.
Additional Context
This is a plugin developer ergonomics issue. The user-facing chat display is a separate concern — the ask is purely that a tool can place an image into the model's context window with a convention as natural as HTMLResponse is for the human-facing case.
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
OWUI has a clean convention for tools that produce human-facing rich output: return an
HTMLResponsewithContent-Disposition: inline, the user sees an iframe, the model gets a status string (established in #21294). This works well.The missing counterpart is model-facing visual output: a tool produces an image and the model needs to see it on its next turn in order to continue reasoning.
This comes up for any tool that renders a chart, screenshots a page, generates an image for iterative refinement, or captures visual output from a code interpreter. Today there is no blessed return type for this. The closest path — returning an MCP-style
{"type": "image", ...}item — routes through file storage and associates the result with user messages rather than delivering it as a content part in the model's next context. It wasn't designed for this case.The underlying message formats (OpenAI
image_url, Anthropicimageblocks, Ollamaimages) all support inline image data, and the conversion utilities between them already exist in the codebase. What's missing is a first-class return convention that plugin authors can reach for.Desired Solution
A return type or convention — analogous to
HTMLResponsefor human-facing output — that tells OWUI: "deliver this image to the model as a visual content part in its next turn."The exact shape (a new response class, a structured dict, something else) is an open design question. The important properties:
HTMLResponseidiom plugin authors already knowAlternatives Considered
A plugin author can write a Filter that intercepts the outgoing message payload and injects
image_urlcontent parts after the tool result. This works but means every vision-related plugin reinvents the same transformation. Solving it centrally inprocess_tool_result(or equivalent) gives plugin developers a stable convention to target.Related
{human, model} × {text, rich}— and suggest a single, unified routing mechanism may be the right design target.Additional Context
This is a plugin developer ergonomics issue. The user-facing chat display is a separate concern — the ask is purely that a tool can place an image into the model's context window with a convention as natural as
HTMLResponseis for the human-facing case.Beta Was this translation helpful? Give feedback.
All reactions