Description
Feature Type
I cannot use LiveKit without it
Feature Description
Currently, @function_tool functions only support primitive return types (str, int, float, bool, None). Returning ImageContent fails validation silently: fnc_call_out is set to None and the LLM receives no tool response.
This makes it impossible to implement tools that return visual content to the LLM (e.g. a read_whiteboard() or take_screenshot() tool).
Several providers already support multimodal content in tool results at the API level (Anthropic tool_result, Gemini FunctionResponse), so this could be wired up for supported backends. OpenAI's agents-python SDK also supports this feature (openai/openai-agents-python#1898).
If this is an existing feature, please let me know. If this is on the roadmap, or the community would like to support this feature, I could take it on or help with it.
Workarounds / Alternatives
Manually inject ImageContent into chat_ctx inside the tool body and return a plain string description:
@function_tool
async def read_screenshot(self, context: RunContext) -> str:
    # Encode the raw screenshot bytes as a base64 data URL.
    b64 = base64.b64encode(get_screenshot()).decode()
    # Inject the image into the chat context as a user message.
    context.session.chat_ctx.add_message(
        role="user",
        content=["Current screenshot:", ImageContent(image=f"data:image/png;base64,{b64}")],
    )
    return "screenshot added to context."

This workaround makes the content visible to the LLM, but the image should really be passed as a tool result / function response rather than as a user message.
Additional Context
No response