
LLM Bypasses Tool Calls When output_type is Defined, Even With Explicit Tooling Instructions #1090

@AbdulSamad94

Description

When an Agent in the OpenAI Agents SDK is configured with both a list of tools and a structured output_type (a Pydantic model), the LLM frequently generates the structured output directly instead of calling the tools. This occurs even when the agent's instructions explicitly state that it MUST use the provided tools to obtain the information needed for that output. Debug logs (at the logging.DEBUG level) confirm this: the LLM's response shows tool_calls: null, indicating that the tool was never actually invoked or executed.

This behavior suggests that the LLM prioritizes satisfying the output_type schema directly over adhering to instructions that mandate tool invocation, especially when it can generate or infer the required information internally (as with simple tasks like translation). The result is unreliable agent workflows in which the accuracy and side effects of real tool execution are bypassed.

To reproduce this issue, save the following Python code as main.py and run it:

import asyncio
import logging
import os
from datetime import datetime
from dotenv import load_dotenv
from openai import AsyncOpenAI
from pydantic import BaseModel, Field
from agents import Agent, OpenAIChatCompletionsModel, Runner, RunConfig

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

class SimpleLogger:
    @staticmethod
    def log(step: str, details: str = ""):
        timestamp = datetime.now().strftime("%H:%M:%S")
        print(f"\n[{timestamp}] {step} {details}")


class SimpleTranslationOutput(BaseModel):
    translated_text: str = Field(description="The translated text.")

load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") 

# Gemini, accessed through its OpenAI-compatible endpoint.
client = AsyncOpenAI(
    api_key=GEMINI_API_KEY,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

model = OpenAIChatCompletionsModel(model="gemini-2.0-flash", openai_client=client)
# max_turns is an argument of Runner.run, not a RunConfig field, and
# model_provider expects a ModelProvider (setting model directly suffices).
config = RunConfig(model=model, tracing_disabled=True)

spanish_agent = Agent(
    name="SpanishTranslatorAgent",
    instructions="You translate the user's message EXACTLY into Spanish. Provide only the translated text.",
    model=model,
)

french_agent = Agent(
    name="FrenchTranslatorAgent",
    instructions="You translate the user's message EXACTLY into French. Provide only the translated text.",
    model=model,
)

orchestrator_agent = Agent(
    name="OrchestratorAgent",
    instructions=(
        "You are a translation agent. You use the tools given to you to translate. "
        "If asked for multiple translations, you call the relevant tools."
    ),
    output_type=SimpleTranslationOutput,
    model=model,
    tools=[
        spanish_agent.as_tool(
            tool_name="translate_to_spanish",
            tool_description="Translate the user's exact message to Spanish. Input: The message to translate."
        ),
        french_agent.as_tool(
            tool_name="translate_to_french",
            tool_description="Translate the user's exact message to French. Input: The message to translate.",
        ),
    ],
)

async def main():
    SimpleLogger.log("SETUP", "Starting Orchestrator Agent demonstration with detailed logging.")

    user_message = "Hello, how are you? Please translate this to Spanish."

    SimpleLogger.log("--- RUNNING ORCHESTRATOR ---", f"Input: '{user_message}'")
    try:
        result = await Runner.run(
            orchestrator_agent,
            input=user_message,
            run_config=config,
            max_turns=5,  # passed to Runner.run directly, not via RunConfig
        )

        SimpleLogger.log("--- FINAL OUTPUT ---", "Orchestrator Agent's final response:")
        if result.final_output:
            print(f"Translated Text: {result.final_output.translated_text}")
        else:
            print("No structured output received from Orchestrator Agent.")

        SimpleLogger.log("DEBUG INFO", f"Raw output items from Orchestrator: {result.new_items}")
        SimpleLogger.log("--- DEMONSTRATION COMPLETE ---")

    except Exception as e:
        SimpleLogger.log("ERROR", f"An error occurred during agent run: {e}")
        logging.exception("Detailed error traceback:")


if __name__ == "__main__":
    asyncio.run(main())

Expected Behavior

  1. The orchestrator_agent identifies the need for Spanish translation.
  2. It generates a tool_call to translate_to_spanish with the user's message as an argument.
  3. The translate_to_spanish tool (which is another agent) executes and returns the translated text.
  4. The orchestrator_agent then uses this tool's output to populate its SimpleTranslationOutput Pydantic model and returns it.
  5. The debug logs (LLM resp:) should clearly show a tool_calls entry for translate_to_spanish (a programmatic check for this is sketched after this list).
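
Beyond reading the debug logs, the run items can be inspected to confirm whether any tool was actually invoked. A minimal sketch, assuming ToolCallItem is exported from agents.items as in recent SDK versions:

from agents.items import ToolCallItem

# After Runner.run(...) completes, count the tool-call items produced
# during the run; an empty list means no tool was invoked.
tool_call_items = [item for item in result.new_items if isinstance(item, ToolCallItem)]
if tool_call_items:
    print(f"Tools were invoked {len(tool_call_items)} time(s), as instructed.")
else:
    print("Bug reproduced: final output was generated without any tool call.")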

Actual Behavior

Upon running the provided code, the following relevant log output is observed:

# ... (initial debug logs, prompt construction, HTTP request details) ...

2025-07-13 01:24:35,743 - openai.agents - DEBUG - LLM resp:
{
  "content": "{\n\"translated_text\": \"Hola, ¿cómo estás?\"\n}",
  "refusal": null,
  "role": "assistant",
  "annotations": null,
  "audio": null,
  "function_call": null,
  "tool_calls": null  <--- CRITICAL: 'tool_calls' is NULL, indicating no tool invocation.
}

# ... (subsequent debug logs) ...

[01:24:35] --- FINAL OUTPUT --- Orchestrator Agent's final response:
Translated Text: Hola, ¿cómo estás?

# ... (remaining logs) ...

Despite the agent being provided with translation tools and explicitly instructed to use them, the LLM generates the translated_text directly. The tool_calls: null in the LLM resp: confirms that the translate_to_spanish tool was not actually invoked. The LLM performed the translation itself and directly populated the SimpleTranslationOutput schema, bypassing the instructed tool usage.

This behavior was observed both with a simple (1-property) SimpleTranslationOutput and a more complex (3-property) TranslationOutput Pydantic model in previous tests.

Potential Root Cause / Suggestion

It appears the LLM prioritizes directly fulfilling the output_type schema over strictly adhering to instructions that mandate tool invocation, especially when the LLM's inherent capabilities allow it to infer or generate the required data (like simple translations) without external tools. The LLM might perceive direct generation as more "efficient."

It would be beneficial for the OpenAI Agents SDK to either:

  1. Clarify this behavior in documentation: Provide explicit guidance on how output_type interacts with tool-calling behavior, and when LLMs might bypass tools if they can directly satisfy the output format.
  2. Provide mechanisms to enforce tool calls: Offer stronger configuration options (e.g., within Agent settings or RunConfig) to force the LLM to make a tool call when an output_type is present, especially if the output_type contains fields that logically depend on tool output. This could be a tool_choice equivalent for agents, or a strict mode that raises an error if tool calls are instructed but not made (a sketch of a partial workaround using the existing ModelSettings.tool_choice follows below).
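
For reference, ModelSettings already exposes a tool_choice field. The sketch below (reusing spanish_agent, model, and SimpleTranslationOutput from the repro above) forces a tool call on the first model turn; whether Gemini's OpenAI-compatible endpoint honors tool_choice="required" together with a structured response format is an assumption that needs verification:

from agents import Agent, ModelSettings

forced_orchestrator = Agent(
    name="OrchestratorAgent",
    instructions="You are a translation agent. You use the tools given to you to translate.",
    output_type=SimpleTranslationOutput,
    model=model,
    tools=[
        spanish_agent.as_tool(
            tool_name="translate_to_spanish",
            tool_description="Translate the user's exact message to Spanish.",
        ),
    ],
    # Require a tool call instead of letting the model answer directly; the
    # SDK resets tool_choice after a tool call (Agent.reset_tool_choice
    # defaults to True), which avoids infinite tool loops.
    model_settings=ModelSettings(tool_choice="required"),
)

Even then, the final structured output is still generated by the model after the tool returns, so this is a mitigation rather than the strict enforcement requested above.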

Environment

  • OpenAI Agents SDK Version: 0.1.0
  • Python Version: 3.12
  • LLM Model Used: gemini-2.0-flash
  • Operating System: Windows 11
