
LLM Bypasses Tool Calls When output_type is Defined, Even With Explicit Tooling Instructions #1090

@AbdulSamad94

Description

When an Agent in the OpenAI Agents SDK is configured with both a list of tools and a structured output_type (a Pydantic model), the LLM frequently generates the structured output directly instead of calling the tools. This occurs even when the agent's instructions explicitly state that it MUST use the provided tools to obtain the information needed for that output. Debug logs (at the logging.DEBUG level) confirm this: the LLM's response shows tool_calls: null, indicating that the tool was never actually invoked or executed.

This behavior suggests that the LLM prioritizes satisfying the output_type schema directly over adhering to instructions that mandate tool invocation, especially when it can generate or infer the required information internally (as with simple tasks like translation). The result is unreliable agent workflows in which the accuracy and side effects of real tool execution are bypassed.

To reproduce this issue, save the following Python code as main.py and run it:

import asyncio
import logging
import os
from datetime import datetime
from dotenv import load_dotenv
from openai import AsyncOpenAI
from pydantic import BaseModel, Field
from agents import Agent, OpenAIChatCompletionsModel, Runner, RunConfig

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

class SimpleLogger:
    @staticmethod
    def log(step: str, details: str = ""):
        timestamp = datetime.now().strftime("%H:%M:%S")
        print(f"\n[{timestamp}] {step} {details}")


class SimpleTranslationOutput(BaseModel):
    translated_text: str = Field(description="The translated text.")

load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") 

# Gemini, accessed through its OpenAI-compatible endpoint.
client = AsyncOpenAI(
    api_key=GEMINI_API_KEY,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

model = OpenAIChatCompletionsModel(model="gemini-2.0-flash", openai_client=client)
# max_turns is an argument of Runner.run, not a RunConfig field, and
# model_provider expects a ModelProvider (setting model directly suffices).
config = RunConfig(model=model, tracing_disabled=True)

spanish_agent = Agent(
    name="SpanishTranslatorAgent",
    instructions="You translate the user's message EXACTLY into Spanish. Provide only the translated text.",
    model=model,
)

french_agent = Agent(
    name="FrenchTranslatorAgent",
    instructions="You translate the user's message EXACTLY into French. Provide only the translated text.",
    model=model,
)

orchestrator_agent = Agent(
    name="OrchestratorAgent",
    instructions=(
        "You are a translation agent. You use the tools given to you to translate. "
        "If asked for multiple translations, you call the relevant tools."
    ),
    output_type=SimpleTranslationOutput,
    model=model,
    tools=[
        spanish_agent.as_tool(
            tool_name="translate_to_spanish",
            tool_description="Translate the user's exact message to Spanish. Input: The message to translate."
        ),
        french_agent.as_tool(
            tool_name="translate_to_french",
            tool_description="Translate the user's exact message to French. Input: The message to translate.",
        ),
    ],
)

async def main():
    SimpleLogger.log("SETUP", "Starting Orchestrator Agent demonstration with detailed logging.")

    user_message = "Hello, how are you? Please translate this to Spanish."

    SimpleLogger.log("--- RUNNING ORCHESTRATOR ---", f"Input: '{user_message}'")
    try:
        result = await Runner.run(
            orchestrator_agent,
            input=user_message,
            run_config=config,
            max_turns=5,  # passed to Runner.run directly, not via RunConfig
        )

        SimpleLogger.log("--- FINAL OUTPUT ---", "Orchestrator Agent's final response:")
        if result.final_output:
            print(f"Translated Text: {result.final_output.translated_text}")
        else:
            print("No structured output received from Orchestrator Agent.")

        SimpleLogger.log("DEBUG INFO", f"Raw output items from Orchestrator: {result.new_items}")
        SimpleLogger.log("--- DEMONSTRATION COMPLETE ---")

    except Exception as e:
        SimpleLogger.log("ERROR", f"An error occurred during agent run: {e}")
        logging.exception("Detailed error traceback:")


if __name__ == "__main__":
    asyncio.run(main())

Expected Behavior

  1. The orchestrator_agent identifies the need for Spanish translation.
  2. It generates a tool_call to translate_to_spanish with the user's message as an argument.
  3. The translate_to_spanish tool (which is another agent) executes and returns the translated text.
  4. The orchestrator_agent then uses this tool's output to populate its SimpleTranslationOutput Pydantic model and returns it.
  5. The debug logs (LLM resp:) should clearly show a tool_calls entry for translate_to_spanish (a programmatic check for this is sketched after this list).
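
Beyond reading the debug logs, the run items can be inspected to confirm whether any tool was actually invoked. A minimal sketch, assuming ToolCallItem is exported from agents.items as in recent SDK versions:

from agents.items import ToolCallItem

# After Runner.run(...) completes, count the tool-call items produced
# during the run; an empty list means no tool was invoked.
tool_call_items = [item for item in result.new_items if isinstance(item, ToolCallItem)]
if tool_call_items:
    print(f"Tools were invoked {len(tool_call_items)} time(s), as instructed.")
else:
    print("Bug reproduced: final output was generated without any tool call.")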

Actual Behavior

Upon running the provided code, the following relevant log output is observed:

# ... (initial debug logs, prompt construction, HTTP request details) ...

2025-07-13 01:24:35,743 - openai.agents - DEBUG - LLM resp:
{
  "content": "{\n\"translated_text\": \"Hola, ¿cómo estás?\"\n}",
  "refusal": null,
  "role": "assistant",
  "annotations": null,
  "audio": null,
  "function_call": null,
  "tool_calls": null  <--- CRITICAL: 'tool_calls' is NULL, indicating no tool invocation.
}

# ... (subsequent debug logs) ...

[01:24:35] --- FINAL OUTPUT --- Orchestrator Agent's final response:
Translated Text: Hola, ¿cómo estás?

# ... (remaining logs) ...

Despite the agent being provided with translation tools and explicitly instructed to use them, the LLM generates the translated_text directly. The tool_calls: null in the LLM resp: confirms that the translate_to_spanish tool was not actually invoked. The LLM performed the translation itself and directly populated the SimpleTranslationOutput schema, bypassing the instructed tool usage.

This behavior was observed both with a simple (1-property) SimpleTranslationOutput and a more complex (3-property) TranslationOutput Pydantic model in previous tests.

Potential Root Cause / Suggestion

It appears the LLM prioritizes directly fulfilling the output_type schema over strictly adhering to instructions that mandate tool invocation, especially when the LLM's inherent capabilities allow it to infer or generate the required data (like simple translations) without external tools. The LLM might perceive direct generation as more "efficient."

It would be beneficial for the OpenAI Agents SDK to either:

  1. Clarify this behavior in documentation: Provide explicit guidance on how output_type interacts with tool-calling behavior, and when LLMs might bypass tools if they can directly satisfy the output format.
  2. Provide mechanisms to enforce tool calls: Offer stronger configuration options (e.g., within Agent settings or RunConfig) to force the LLM to make a tool call when an output_type is present, especially if the output_type contains fields that logically depend on tool output. This could be a tool_choice equivalent for agents, or a strict mode that raises an error if tool calls are instructed but not made (a sketch of a partial workaround using the existing ModelSettings.tool_choice follows below).
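
For reference, ModelSettings already exposes a tool_choice field. The sketch below (reusing spanish_agent, model, and SimpleTranslationOutput from the repro above) forces a tool call on the first model turn; whether Gemini's OpenAI-compatible endpoint honors tool_choice="required" together with a structured response format is an assumption that needs verification:

from agents import Agent, ModelSettings

forced_orchestrator = Agent(
    name="OrchestratorAgent",
    instructions="You are a translation agent. You use the tools given to you to translate.",
    output_type=SimpleTranslationOutput,
    model=model,
    tools=[
        spanish_agent.as_tool(
            tool_name="translate_to_spanish",
            tool_description="Translate the user's exact message to Spanish.",
        ),
    ],
    # Require a tool call instead of letting the model answer directly; the
    # SDK resets tool_choice after a tool call (Agent.reset_tool_choice
    # defaults to True), which avoids infinite tool loops.
    model_settings=ModelSettings(tool_choice="required"),
)

Even then, the final structured output is still generated by the model after the tool returns, so this is a mitigation rather than the strict enforcement requested above.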

Environment

  • OpenAI Agents SDK Version: 0.1.0
  • Python Version: 3.12
  • LLM Model Used: gemini-2.0-flash
  • Operating System: Windows 11
