Skip to content

feat: support ImageContent in tool return value #4893

@Ao-Last

Description

@Ao-Last

Feature Type

I cannot use LiveKit without it

Feature Description

Currently, @function_tool functions only support primitive return types (str, int, float, bool, None). Returning ImageContent fails validation silently fnc_call_out is set to None and the LLM receives no tool response.

This makes it impossible to implement tools that return visual content to the LLM (e.g. a read_whiteboard() or take_screenshot() tool).

Several providers already support multimodal content in tool results at the API level (Anthropic tool_result, Gemini FunctionResponse), so this could be wired up for supported backends. OpenAI's agents-python SDK also supported this feature openai/openai-agents-python#1898

If this is an existing feature, please let me know. If this is on the road map or the community would like to support this feaure, I could take it or help with it.

Workarounds / Alternatives

Manually inject ImageContent into chat_ctx inside the tool body and return a plain string description:

@function_tool
async def read_screenshot(self, context: RunContext) -> str:
    b64 = base64.b64encode(get_screenshot()).decode()
    context.session.chat_ctx.add_message(
        role="user",
        content=["Current screenshot:", ImageContent(image=f"data:image/png;base64,{b64}")],
    )
    return "screenshot added to context."

This workaround makes the content visible to the LLM but it should originally being passed as tool result/function response instead of a user message.

Additional Context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions