Description
When an Agent in the OpenAI Agents SDK is configured with both a list of tools and a structured `output_type` (a Pydantic model), the LLM frequently chooses to generate the structured output directly. This happens even when the agent's instructions explicitly state that it MUST use the provided tools to obtain the information needed for that output. Debug logs (at `logging.DEBUG` level) confirm this by showing `tool_calls: null` in the LLM's response, indicating that no tool was actually invoked or executed.
This behavior suggests that the LLM prioritizes satisfying the `output_type` schema directly over adhering to instructions that mandate tool invocation, especially when it perceives that it can internally generate or infer the required information (as with simple tasks like translation). The result is unreliable agent workflows in which the accuracy guarantees and side effects of real tool execution are silently bypassed.
To reproduce this issue, save the following Python code as `main.py` and run it:
```python
import asyncio
import logging
import os
from datetime import datetime

from dotenv import load_dotenv
from openai import AsyncOpenAI
from pydantic import BaseModel, Field

from agents import Agent, OpenAIChatCompletionsModel, RunConfig, Runner

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)


class SimpleLogger:
    """Tiny helper for timestamped progress messages."""

    @staticmethod
    def log(step: str, details: str = ""):
        timestamp = datetime.now().strftime("%H:%M:%S")
        print(f"\n[{timestamp}] {step} {details}")


class SimpleTranslationOutput(BaseModel):
    translated_text: str = Field(description="The translated text.")


load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")

# Gemini exposed through its OpenAI-compatible endpoint.
client = AsyncOpenAI(
    api_key=GEMINI_API_KEY,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)
model = OpenAIChatCompletionsModel(model="gemini-2.0-flash", openai_client=client)
config = RunConfig(model=model, tracing_disabled=True)

spanish_agent = Agent(
    name="SpanishTranslatorAgent",
    instructions="You translate the user's message EXACTLY into Spanish. Provide only the translated text.",
    model=model,
)
french_agent = Agent(
    name="FrenchTranslatorAgent",
    instructions="You translate the user's message EXACTLY into French. Provide only the translated text.",
    model=model,
)

# Orchestrator: has both tools AND a structured output_type -- the combination
# that triggers the issue.
orchestrator_agent = Agent(
    name="OrchestratorAgent",
    instructions=(
        "You are a translation agent. You use the tools given to you to translate. "
        "If asked for multiple translations, you call the relevant tools."
    ),
    output_type=SimpleTranslationOutput,
    model=model,
    tools=[
        spanish_agent.as_tool(
            tool_name="translate_to_spanish",
            tool_description="Translate the user's exact message to Spanish. Input: The message to translate.",
        ),
        french_agent.as_tool(
            tool_name="translate_to_french",
            tool_description="Translate the user's exact message to French. Input: The message to translate.",
        ),
    ],
)


async def main():
    SimpleLogger.log("SETUP", "Starting Orchestrator Agent demonstration with detailed logging.")
    user_message = "Hello, how are you? Please translate this to Spanish."
    SimpleLogger.log("--- RUNNING ORCHESTRATOR ---", f"Input: '{user_message}'")
    try:
        # max_turns is an argument to Runner.run, not RunConfig.
        result = await Runner.run(
            orchestrator_agent,
            input=user_message,
            max_turns=5,
            run_config=config,
        )
        SimpleLogger.log("--- FINAL OUTPUT ---", "Orchestrator Agent's final response:")
        if result.final_output:
            print(f"Translated Text: {result.final_output.translated_text}")
        else:
            print("No structured output received from Orchestrator Agent.")
        SimpleLogger.log("DEBUG INFO", f"Raw output items from Orchestrator: {result.new_items}")
        SimpleLogger.log("--- DEMONSTRATION COMPLETE ---")
    except Exception as e:
        SimpleLogger.log("ERROR", f"An error occurred during agent run: {e}")
        logging.exception("Detailed error traceback:")


if __name__ == "__main__":
    asyncio.run(main())
```
Expected Behavior
- The `orchestrator_agent` identifies the need for Spanish translation.
- It generates a `tool_call` to `translate_to_spanish` with the user's message as an argument.
- The `translate_to_spanish` tool (which is another agent) executes and returns the translated text.
- The `orchestrator_agent` then uses this tool's output to populate its `SimpleTranslationOutput` Pydantic model and returns it.
- The debug logs (`LLM resp:`) should clearly show a `tool_calls` entry for `translate_to_spanish`, roughly like the sketch below.
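For contrast with the actual log further down, a successful run should produce an `LLM resp:` entry shaped roughly like this (an illustrative sketch only: the call `id` is invented, and I am assuming `as_tool` exposes a single `input` argument):

```
{
  "content": null,
  "refusal": null,
  "role": "assistant",
  "tool_calls": [
    {
      "id": "call_xyz",
      "type": "function",
      "function": {
        "name": "translate_to_spanish",
        "arguments": "{\"input\": \"Hello, how are you?\"}"
      }
    }
  ]
}
```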
Actual Behavior
Upon running the provided code, the following relevant log output is observed:
```
# ... (initial debug logs, prompt construction, HTTP request details) ...
2025-07-13 01:24:35,743 - openai.agents - DEBUG - LLM resp:
{
  "content": "{\n\"translated_text\": \"Hola, ¿cómo estás?\"\n}",
  "refusal": null,
  "role": "assistant",
  "annotations": null,
  "audio": null,
  "function_call": null,
  "tool_calls": null   <--- CRITICAL: 'tool_calls' is NULL, indicating no tool invocation.
}
# ... (subsequent debug logs) ...
[01:24:35] --- FINAL OUTPUT --- Orchestrator Agent's final response:
Translated Text: Hola, ¿cómo estás?
# ... (remaining logs) ...
```
Despite the agent being provided with translation tools and explicitly instructed to use them, the LLM generates `translated_text` directly. The `tool_calls: null` in the `LLM resp:` entry confirms that the `translate_to_spanish` tool was never invoked: the LLM performed the translation itself and populated the `SimpleTranslationOutput` schema directly, bypassing the instructed tool usage.
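This is also easy to check programmatically. A minimal sketch, assuming the run-item classes are exported from the top-level `agents` package (they may live under `agents.items` depending on the SDK version):

```python
from agents import ToolCallItem  # assumed export; may be agents.items.ToolCallItem

# After `result = await Runner.run(...)` in main(), inspect the run items to
# see whether any tool was actually invoked during the run.
tool_call_items = [item for item in result.new_items if isinstance(item, ToolCallItem)]
if not tool_call_items:
    print("WARNING: structured output was produced without any tool invocation.")
```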
This behavior was observed both with the simple (1-property) `SimpleTranslationOutput` and with a more complex (3-property) `TranslationOutput` Pydantic model in previous tests.
Potential Root Cause / Suggestion
It appears the LLM prioritizes directly fulfilling the `output_type` schema over strictly adhering to instructions that mandate tool invocation, especially when its inherent capabilities allow it to infer or generate the required data (such as simple translations) without external tools. The LLM might perceive direct generation as more "efficient."
It would be beneficial for the OpenAI Agents SDK to either:
- **Clarify this behavior in documentation:** provide explicit guidance on how `output_type` interacts with tool-calling behavior, and on when LLMs might bypass tools if they can directly satisfy the output format.
- **Provide mechanisms to enforce tool calls:** offer stronger configuration options (e.g., within `Agent` settings or `RunConfig`) to force an LLM to make a tool call when an `output_type` is present, especially if the `output_type` contains fields that are logically dependent on tool output. This could be a `tool_choice` equivalent for agents, or a strict mode that raises an error if tool calls are instructed but not made. (A possible partial mitigation with today's API is sketched after this list.)
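The closest enforcement knob I have found in the current SDK is `ModelSettings.tool_choice`. A sketch of a possible mitigation, assuming `tool_choice="required"` is forwarded to the underlying Chat Completions call and honored by the provider (untested against Gemini's OpenAI-compatibility endpoint):

```python
from agents import Agent, ModelSettings

# Same orchestrator as in the reproduction above, but with tool_choice forced
# to "required" so the model must emit a tool call instead of answering
# directly. If I read the SDK defaults correctly, Agent.reset_tool_choice
# flips it back to "auto" after the first tool call, avoiding an infinite loop.
orchestrator_agent = Agent(
    name="OrchestratorAgent",
    instructions=(
        "You are a translation agent. You use the tools given to you to translate. "
        "If asked for multiple translations, you call the relevant tools."
    ),
    output_type=SimpleTranslationOutput,
    model=model,
    model_settings=ModelSettings(tool_choice="required"),
    tools=[
        spanish_agent.as_tool(
            tool_name="translate_to_spanish",
            tool_description="Translate the user's exact message to Spanish.",
        ),
        french_agent.as_tool(
            tool_name="translate_to_french",
            tool_description="Translate the user's exact message to French.",
        ),
    ],
)
```

Even so, this only nudges the model toward calling a tool; it does not raise an error when the model ignores the instruction, so the documentation and strict-mode suggestions above still stand.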
Environment
- OpenAI Agents SDK Version: `0.1.0`
- Python Version: `3.12`
- LLM Model Used: `gemini-2.0-flash`
- Operating System: Windows 11