feat: switch the default model to a newer mini model (affecting only when a model is unset) #3147

Merged

seratch merged 1 commit into main from feat/gpt-5-4-mini-default-model-settings on May 6, 2026

Conversation

@seratch (Member) commented on May 6, 2026

This pull request updates the SDK default model from gpt-4.1 to gpt-5.4-mini for agents that do not specify a model explicitly. Although gpt-5.5 is the latest model, gpt-5.4-mini is a pragmatic default for users getting started because it keeps latency closer to gpt-4.1 while moving the default onto the GPT-5 family. This default is not meant to be permanent; we may update it again as newer models offer a better balance of intelligence, latency, and cost.
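
As a hedged sketch of the user-visible effect (helper names taken from this branch's agents.models module; treat it as illustrative, not a spec): an Agent constructed without an explicit model now falls back to the new default.

from agents import Agent
from agents.models import get_default_model

# No model argument: this agent uses the SDK-wide default at run time.
agent = Agent(name="Assistant", instructions="Answer briefly.")
print(get_default_model())  # expected to print "gpt-5.4-mini" on this branch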

Detailed analysis report

gpt-5.4-mini default-model validation report

Status: completed on the current checkout.

Investigation target

Validate gpt-5.4-mini as the Agents SDK default model replacement for gpt-4.1.

The probe compares the current baseline default behavior against the candidate with the model
settings the SDK applies to gpt-5.4-mini: reasoning.effort="none" and verbosity="low".
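
For reference, a minimal sketch of pinning that candidate configuration explicitly, mirroring the _model_cases helper in the probe script below (the instructions string here is a placeholder):

from openai.types.shared import Reasoning
from agents import Agent, ModelSettings

# Candidate under test: GPT-5-family mini model with the latency-oriented
# settings the SDK would apply by default (effort "none", low verbosity).
candidate = Agent(
    name="Candidate probe",
    instructions="Answer with exactly READY and no other text.",
    model="gpt-5.4-mini",
    model_settings=ModelSettings(reasoning=Reasoning(effort="none"), verbosity="low"),
)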

Validation matrix

| case_id | scenario | mode | question | setup | state_setup | variable_under_test | held_constant | comparison_basis | observation_summary | result_flag | status | evidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| L1 | local SDK default mapping | single-shot | Does the SDK default resolve to gpt-5.4-mini with GPT-5 default settings? | Inspect agents.models default helpers in the current checkout. | Local only | default model helper output | current checkout | expected SDK default contract | Default resolves to gpt-5.4-mini with reasoning.effort="none" and verbosity="low". | expected | pass | artifacts/summary.json |
| S1 | single-turn text response | warm-up + repeat-10 | Can the candidate preserve a constrained simple-response pattern? | Agent must answer with exactly READY. | Fresh agent state per run | model name and default model settings | prompt and output constraint | gpt-4.1 in the same probe | Both models returned the exact constrained response 10/10. | expected | pass | artifacts/results.json |
| T1 | function tool call | warm-up + repeat-10 | Can the candidate preserve a required function-tool workflow? | Agent must call lookup_order_status and report the returned status. | Fresh agent state per run | model name and default model settings | tool schema, prompt, and output constraint | gpt-4.1 in the same probe | Both models called the required function once and returned the expected status 10/10. | expected | pass | artifacts/results.json |
| H1 | handoff to specialist agent | warm-up + repeat-10 | Can the candidate preserve a simple handoff workflow? | Frontline agent must hand off Spanish requests to a Spanish specialist. | Fresh agent state per run | model name and default model settings | handoff setup, prompt, and output constraint | gpt-4.1 in the same probe | Both models completed the handoff and returned the specialist response 10/10. | expected | pass | artifacts/results.json |
| A1 | tool approval interruption and resume | warm-up + repeat-10 | Can the candidate preserve HITL interruption and approval resume? | Agent must request an approval-gated refund tool, then resume after approval. | Fresh agent state per run | model name and default model settings | tool schema, approval flow, prompt, and output constraint | gpt-4.1 in the same probe | Both models interrupted for approval, resumed, and returned the approved tool result 10/10. | expected | pass | artifacts/results.json |

Parity controls

- Held constant: prompts, output constraints, tool definitions, handoff setup, approval flow, SDK checkout, Python environment, and Responses path.
- Variable under test: model name and matching default model settings.
- Baseline: gpt-4.1 without GPT-5 default model settings.
- Candidate: gpt-5.4-mini with reasoning.effort="none" and verbosity="low".
- Scope: pattern parity across representative text, tool, handoff, and HITL workflows. This does not claim broad quality equivalence outside the covered patterns.

Docs preflight

The OpenAI developer docs for GPT-5 reasoning models say that reasoning.effort tunes the latency/intelligence tradeoff, and that none is reserved for cases where low latency matters more than intelligence. The probe therefore treats gpt-5.4-mini with reasoning.effort="none" as a latency-oriented default candidate that still needs validation on representative agent workflows.

Probe command

uv run python validation/gpt_5_4_mini_default/probe.py --output-dir validation/gpt_5_4_mini_default/artifacts --warmup-runs 1 --measured-runs 10

Findings

No candidate-specific regression was observed in the covered patterns. gpt-5.4-mini with
reasoning.effort="none" passed 10/10 measured runs for the text, function-tool, handoff, and HITL
approval-resume cases. The local SDK default helper also resolved to gpt-5.4-mini with
reasoning.effort="none" and verbosity="low".

The comparison supports pattern parity for these representative workflows, not a broad quality
equivalence claim. The covered workflows intentionally focus on low-latency agent mechanics:
single-turn constrained output, required function tool use, a simple handoff, and approval
interruption/resume.

Median total latency was comparable or better for the candidate in three of four live cases:

| case_id | scenario | gpt-4.1 median | gpt-5.4-mini median | candidate delta | pass rate |
| --- | --- | --- | --- | --- | --- |
| S1 | single-turn text response | 0.948s | 0.955s | +0.7% | 10/10 vs 10/10 |
| T1 | function tool call | 2.500s | 2.184s | -12.7% | 10/10 vs 10/10 |
| H1 | handoff to specialist agent | 2.465s | 1.838s | -25.4% | 10/10 vs 10/10 |
| A1 | tool approval interruption and resume | 2.570s | 2.637s | +2.6% | 10/10 vs 10/10 |

Tail latency varied by case. The largest candidate max was 5.662s in the approval-resume case; the
largest baseline max was 5.592s in the handoff case.
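
The median columns above can be recomputed from the emitted summary.json; a small sketch, using the cases and median_total_latency_s keys that the probe's _summarize helper writes (see the script below):

import json
from pathlib import Path

summary = json.loads(Path("validation/gpt_5_4_mini_default/artifacts/summary.json").read_text())
for case_id in ("S1", "T1", "H1", "A1"):
    base = summary["cases"][f"{case_id}:baseline_gpt_4_1"]["median_total_latency_s"]
    cand = summary["cases"][f"{case_id}:candidate_gpt_5_4_mini_none"]["median_total_latency_s"]
    # Delta is reported relative to the gpt-4.1 baseline median.
    print(f"{case_id}: {base:.3f}s -> {cand:.3f}s ({(cand - base) / base:+.1%})")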

Artifact status

The probe was run from the repository root on commit
ce462354fd3bbb841bb808dd63c8b94a4026a680 with Python 3.12.9 and openai 2.26.0. It used the
approved OPENAI_API_KEY environment variable and did not print the secret value.

Raw runtime artifacts were generated under validation/gpt_5_4_mini_default/artifacts/:

- metadata.json
- results.json
- summary.json

Probe script
# mypy: ignore-errors
from __future__ import annotations

import argparse
import asyncio
import json
import os
import platform
import statistics
import subprocess
import sys
import time
from collections import Counter, defaultdict
from dataclasses import dataclass
from importlib import metadata
from pathlib import Path
from typing import Any

from openai.types.shared import Reasoning

from agents import Agent, ModelSettings, Runner, function_tool, handoff
from agents.models import get_default_model, get_default_model_settings, is_gpt_5_default

APPROVED_ENV_VARS = ["OPENAI_API_KEY"]


@dataclass(frozen=True)
class ModelCase:
    label: str
    model: str
    model_settings: ModelSettings


@dataclass(frozen=True)
class ProbeCase:
    case_id: str
    scenario: str
    mode: str
    question: str
    setup: str


@dataclass
class CaseResult:
    case_id: str
    model_label: str
    model: str
    scenario: str
    mode: str
    measured: bool
    total_latency_s: float | None
    observation_summary: str
    result_flag: str
    status: str
    output_preview: str
    error: str | None
    metrics: dict[str, Any]

    def as_dict(self) -> dict[str, Any]:
        return {
            "case_id": self.case_id,
            "model_label": self.model_label,
            "model": self.model,
            "scenario": self.scenario,
            "mode": self.mode,
            "measured": self.measured,
            "total_latency_s": self.total_latency_s,
            "observation_summary": self.observation_summary,
            "result_flag": self.result_flag,
            "status": self.status,
            "output_preview": self.output_preview,
            "error": self.error,
            "metrics": self.metrics,
        }


PROBE_CASES = [
    ProbeCase(
        case_id="L1",
        scenario="local SDK default mapping",
        mode="single-shot",
        question="Does the SDK default resolve to gpt-5.4-mini with GPT-5 default settings?",
        setup="No live API call; inspect agents.models default helpers in the current checkout.",
    ),
    ProbeCase(
        case_id="S1",
        scenario="single-turn text response",
        mode="warm-up + repeat-N",
        question="Can the candidate preserve a constrained simple-response pattern?",
        setup="Agent must answer with exactly READY.",
    ),
    ProbeCase(
        case_id="T1",
        scenario="function tool call",
        mode="warm-up + repeat-N",
        question="Can the candidate preserve a required function-tool workflow?",
        setup="Agent must call lookup_order_status and report the returned status.",
    ),
    ProbeCase(
        case_id="H1",
        scenario="handoff to specialist agent",
        mode="warm-up + repeat-N",
        question="Can the candidate preserve a simple handoff workflow?",
        setup="Frontline agent must hand off Spanish requests to a Spanish specialist.",
    ),
    ProbeCase(
        case_id="A1",
        scenario="tool approval interruption and resume",
        mode="warm-up + repeat-N",
        question="Can the candidate preserve HITL interruption and approval resume?",
        setup="Agent must request an approval-gated refund tool, then resume after approval.",
    ),
]


def _git_value(*args: str) -> str:
    result = subprocess.run(["git", *args], check=False, capture_output=True, text=True)
    if result.returncode != 0:
        return "unknown"
    return result.stdout.strip() or "unknown"


def _package_version(name: str) -> str | None:
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def _runtime_context(output_dir: Path) -> dict[str, Any]:
    return {
        "approved_env_vars": {
            name: ("set" if os.getenv(name) else "unset") for name in APPROVED_ENV_VARS
        },
        "cwd": os.getcwd(),
        "git_branch": _git_value("rev-parse", "--abbrev-ref", "HEAD"),
        "git_commit": _git_value("rev-parse", "HEAD"),
        "output_dir": str(output_dir),
        "package_versions": {
            name: version
            for name in ("openai", "openai-agents", "agents")
            if (version := _package_version(name)) is not None
        },
        "platform": platform.platform(),
        "python_executable": sys.executable,
        "python_version": sys.version.split()[0],
        "script_path": str(Path(__file__).resolve()),
    }


def _write_json(path: Path, payload: Any) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n")


def _preview(value: object, *, limit: int = 600) -> str:
    text = str(value).replace("\n", " ").strip()
    return text[:limit]


async def _run_timed(coro: Any) -> tuple[float, Any]:
    started = time.perf_counter()
    result = await coro
    return time.perf_counter() - started, result


def _model_cases() -> list[ModelCase]:
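    # Baseline runs gpt-4.1 with empty ModelSettings; the candidate applies the
    # GPT-5 default settings under test (reasoning effort "none", low verbosity).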
    return [
        ModelCase(label="baseline_gpt_4_1", model="gpt-4.1", model_settings=ModelSettings()),
        ModelCase(
            label="candidate_gpt_5_4_mini_none",
            model="gpt-5.4-mini",
            model_settings=ModelSettings(reasoning=Reasoning(effort="none"), verbosity="low"),
        ),
    ]


def _result_for_exception(
    case: ProbeCase,
    model_case: ModelCase,
    *,
    measured: bool,
    latency_s: float | None,
    error: BaseException,
) -> CaseResult:
    return CaseResult(
        case_id=case.case_id,
        model_label=model_case.label,
        model=model_case.model,
        scenario=case.scenario,
        mode=case.mode,
        measured=measured,
        total_latency_s=latency_s,
        observation_summary=f"{type(error).__name__}: {_preview(error)}",
        result_flag="negative",
        status="fail",
        output_preview="",
        error=f"{type(error).__name__}: {error}",
        metrics={},
    )


async def _case_simple_text(model_case: ModelCase, *, measured: bool) -> CaseResult:
    case = PROBE_CASES[1]
    agent = Agent(
        name="Default candidate text probe",
        instructions="Answer with exactly READY and no other text.",
        model=model_case.model,
        model_settings=model_case.model_settings,
    )
    try:
        latency_s, result = await _run_timed(Runner.run(agent, "Return the required response."))
        output = str(result.final_output).strip()
        ok = output == "READY"
        return CaseResult(
            case_id=case.case_id,
            model_label=model_case.label,
            model=model_case.model,
            scenario=case.scenario,
            mode=case.mode,
            measured=measured,
            total_latency_s=latency_s,
            observation_summary="Returned the exact constrained response."
            if ok
            else "Output drifted.",
            result_flag="expected" if ok else "negative",
            status="pass" if ok else "fail",
            output_preview=_preview(output),
            error=None,
            metrics={},
        )
    except Exception as exc:
        return _result_for_exception(case, model_case, measured=measured, latency_s=None, error=exc)


async def _case_tool_call(model_case: ModelCase, *, measured: bool) -> CaseResult:
    case = PROBE_CASES[2]
    calls: list[str] = []

    @function_tool
    def lookup_order_status(order_id: str) -> str:
        """Look up a deterministic test order status."""
        calls.append(order_id)
        return "shipped"

    agent = Agent(
        name="Default candidate tool probe",
        instructions=(
            "Always call lookup_order_status with order_id ord_123 before answering. "
            "After the tool result, answer exactly ORDER_STATUS:<status>."
        ),
        tools=[lookup_order_status],
        model=model_case.model,
        model_settings=model_case.model_settings,
    )
    try:
        latency_s, result = await _run_timed(Runner.run(agent, "What is the status of ord_123?"))
        output = str(result.final_output).strip()
        ok = calls == ["ord_123"] and output == "ORDER_STATUS:shipped"
        return CaseResult(
            case_id=case.case_id,
            model_label=model_case.label,
            model=model_case.model,
            scenario=case.scenario,
            mode=case.mode,
            measured=measured,
            total_latency_s=latency_s,
            observation_summary=(
                "Called the required function once and returned the expected status."
                if ok
                else "Tool workflow did not match the expected shape."
            ),
            result_flag="expected" if ok else "negative",
            status="pass" if ok else "fail",
            output_preview=_preview(output),
            error=None,
            metrics={"tool_calls": calls},
        )
    except Exception as exc:
        return _result_for_exception(case, model_case, measured=measured, latency_s=None, error=exc)


async def _case_handoff(model_case: ModelCase, *, measured: bool) -> CaseResult:
    case = PROBE_CASES[3]
    spanish_agent = Agent(
        name="Spanish_specialist",
        instructions="Speak only Spanish. Answer with exactly ESPECIALISTA_LISTO.",
        handoff_description="Handles Spanish-language requests.",
        model=model_case.model,
        model_settings=model_case.model_settings,
    )
    frontline_agent = Agent(
        name="Frontline",
        instructions=(
            "If the user asks in Spanish or asks for Spanish, immediately hand off to the "
            "Spanish_specialist. Do not answer directly in that case."
        ),
        handoffs=[handoff(spanish_agent)],
        model=model_case.model,
        model_settings=model_case.model_settings,
    )
    try:
        latency_s, result = await _run_timed(
            Runner.run(frontline_agent, "Por favor responde en espanol.")
        )
        output = str(result.final_output).strip()
        ok = output.rstrip(".") == "ESPECIALISTA_LISTO"
        return CaseResult(
            case_id=case.case_id,
            model_label=model_case.label,
            model=model_case.model,
            scenario=case.scenario,
            mode=case.mode,
            measured=measured,
            total_latency_s=latency_s,
            observation_summary=(
                "Completed the handoff and returned the specialist response."
                if ok
                else "Handoff response did not match the expected specialist output."
            ),
            result_flag="expected" if ok else "negative",
            status="pass" if ok else "fail",
            output_preview=_preview(output),
            error=None,
            metrics={"last_agent": result.last_agent.name},
        )
    except Exception as exc:
        return _result_for_exception(case, model_case, measured=measured, latency_s=None, error=exc)


async def _case_approval(model_case: ModelCase, *, measured: bool) -> CaseResult:
    case = PROBE_CASES[4]
    calls: list[str] = []

    @function_tool(needs_approval=True)
    def approve_refund(order_id: str) -> str:
        """Refund a deterministic test order after approval."""
        calls.append(order_id)
        return f"approved_refund:{order_id}"

    agent = Agent(
        name="Default candidate approval probe",
        instructions=(
            "You must call approve_refund with order_id ord_123. After the tool is approved "
            "and returns, answer exactly REFUND_RESULT:<tool result>."
        ),
        tools=[approve_refund],
        model=model_case.model,
        model_settings=model_case.model_settings,
    )
    try:
        started = time.perf_counter()
        result = await Runner.run(agent, "Refund order ord_123.")
        interruptions_before_approval = len(result.interruptions)
        if not result.interruptions:
            latency_s = time.perf_counter() - started
            return CaseResult(
                case_id=case.case_id,
                model_label=model_case.label,
                model=model_case.model,
                scenario=case.scenario,
                mode=case.mode,
                measured=measured,
                total_latency_s=latency_s,
                observation_summary="The run completed without the required approval interruption.",
                result_flag="negative",
                status="fail",
                output_preview=_preview(result.final_output),
                error=None,
                metrics={"interruptions_before_approval": interruptions_before_approval},
            )

        state = result.to_state()
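        # Approve every pending interruption, then resume from the saved state;
        # total latency deliberately spans both legs of the HITL flow.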
        for interruption in result.interruptions:
            state.approve(interruption)
        resumed = await Runner.run(agent, state)
        latency_s = time.perf_counter() - started
        output = str(resumed.final_output).strip()
        ok = calls == ["ord_123"] and output == "REFUND_RESULT:approved_refund:ord_123"
        return CaseResult(
            case_id=case.case_id,
            model_label=model_case.label,
            model=model_case.model,
            scenario=case.scenario,
            mode=case.mode,
            measured=measured,
            total_latency_s=latency_s,
            observation_summary=(
                "Interrupted for approval, resumed, and returned the approved tool result."
                if ok
                else "Approval resume did not match the expected tool result."
            ),
            result_flag="expected" if ok else "negative",
            status="pass" if ok else "fail",
            output_preview=_preview(output),
            error=None,
            metrics={
                "interruptions_before_approval": interruptions_before_approval,
                "tool_calls": calls,
            },
        )
    except Exception as exc:
        return _result_for_exception(case, model_case, measured=measured, latency_s=None, error=exc)


def _case_local_default() -> CaseResult:
    case = PROBE_CASES[0]
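    # No live API call: this case checks the SDK's default-model helpers directly.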
    default_settings = get_default_model_settings()
    reasoning = getattr(default_settings.reasoning, "effort", None)
    ok = (
        get_default_model() == "gpt-5.4-mini"
        and is_gpt_5_default() is True
        and reasoning == "none"
        and default_settings.verbosity == "low"
    )
    return CaseResult(
        case_id=case.case_id,
        model_label="sdk_default",
        model=get_default_model(),
        scenario=case.scenario,
        mode=case.mode,
        measured=True,
        total_latency_s=None,
        observation_summary=(
            "Default model resolves to gpt-5.4-mini with reasoning.effort none and verbosity low."
            if ok
            else "Default helper output did not match the expected gpt-5.4-mini settings."
        ),
        result_flag="expected" if ok else "negative",
        status="pass" if ok else "fail",
        output_preview="",
        error=None,
        metrics={
            "default_model": get_default_model(),
            "is_gpt_5_default": is_gpt_5_default(),
            "reasoning_effort": reasoning,
            "verbosity": default_settings.verbosity,
        },
    )


async def _run_live_case(
    case_id: str,
    model_case: ModelCase,
    *,
    measured: bool,
) -> CaseResult:
    if case_id == "S1":
        return await _case_simple_text(model_case, measured=measured)
    if case_id == "T1":
        return await _case_tool_call(model_case, measured=measured)
    if case_id == "H1":
        return await _case_handoff(model_case, measured=measured)
    if case_id == "A1":
        return await _case_approval(model_case, measured=measured)
    raise ValueError(f"Unknown live case: {case_id}")


def _summarize(results: list[CaseResult]) -> dict[str, Any]:
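    # Group results per (case_id, model_label) so baseline and candidate
    # statistics can be compared case by case in summary.json.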
    grouped: defaultdict[tuple[str, str], list[CaseResult]] = defaultdict(list)
    for result in results:
        grouped[(result.case_id, result.model_label)].append(result)

    cases: dict[str, Any] = {}
    for (case_id, model_label), items in sorted(grouped.items()):
        measured = [item for item in items if item.measured]
        latencies = [item.total_latency_s for item in measured if item.total_latency_s is not None]
        cases[f"{case_id}:{model_label}"] = {
            "case_id": case_id,
            "model_label": model_label,
            "model": items[-1].model,
            "scenario": items[-1].scenario,
            "runs": len(measured),
            "warmups": len(items) - len(measured),
            "status_counts": dict(Counter(item.status for item in measured or items)),
            "result_flags": dict(Counter(item.result_flag for item in measured or items)),
            "median_total_latency_s": statistics.median(latencies) if latencies else None,
            "min_total_latency_s": min(latencies) if latencies else None,
            "max_total_latency_s": max(latencies) if latencies else None,
            "latest_observation": items[-1].observation_summary,
        }

    return {
        "cases": cases,
        "result_flags": dict(Counter(result.result_flag for result in results)),
        "status_counts": dict(Counter(result.status for result in results if result.measured)),
    }


async def _run(args: argparse.Namespace) -> int:
    output_dir = Path(args.output_dir).resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    results: list[CaseResult] = [_case_local_default()]
    if not args.skip_live:
        if not os.getenv("OPENAI_API_KEY"):
            raise RuntimeError("OPENAI_API_KEY must be set for live probe cases.")

        live_case_ids = [case_id.strip() for case_id in args.cases.split(",") if case_id.strip()]
        model_cases = _model_cases()
        for case_id in live_case_ids:
            for model_case in model_cases:
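                # Warm-up runs exercise the same case but are excluded from the
                # latency statistics and from pass/fail gating (measured=False).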
                for _ in range(args.warmup_runs):
                    results.append(await _run_live_case(case_id, model_case, measured=False))
                for _ in range(args.measured_runs):
                    results.append(await _run_live_case(case_id, model_case, measured=True))

    metadata_payload = {
        "context": _runtime_context(output_dir),
        "probe_cases": [case.__dict__ for case in PROBE_CASES],
        "execution": {
            "cases": args.cases,
            "measured_runs": args.measured_runs,
            "skip_live": args.skip_live,
            "warmup_runs": args.warmup_runs,
        },
        "comparison_parity": {
            "held_constant": [
                "same prompts and output constraints for each scenario",
                "fresh agent state per measured run",
                "same SDK checkout, Python environment, and Responses path",
            ],
            "variable_under_test": "model name and matching default model settings",
            "baseline": "gpt-4.1 without GPT-5 default model settings",
            "candidate": "gpt-5.4-mini with reasoning.effort none and verbosity low",
        },
    }
    summary = _summarize(results)
    _write_json(output_dir / "metadata.json", metadata)
    _write_json(output_dir / "results.json", [result.as_dict() for result in results])
    _write_json(output_dir / "summary.json", summary)
    print(json.dumps({"metadata": metadata, "summary": summary}, indent=2, sort_keys=True))
    return 0 if all(result.status == "pass" for result in results if result.measured) else 1


def _parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Probe gpt-5.4-mini as the Agents SDK default model candidate."
    )
    parser.add_argument(
        "--output-dir",
        default="validation/gpt_5_4_mini_default/artifacts",
        help="Directory for metadata.json, results.json, and summary.json.",
    )
    parser.add_argument(
        "--cases",
        default="S1,T1,H1,A1",
        help="Comma-separated live case IDs to run.",
    )
    parser.add_argument("--warmup-runs", type=int, default=1)
    parser.add_argument("--measured-runs", type=int, default=3)
    parser.add_argument("--skip-live", action="store_true")
    return parser.parse_args()


def main() -> int:
    return asyncio.run(_run(_parse_args()))


if __name__ == "__main__":
    raise SystemExit(main())

see also: openai/openai-agents-js#1248

@github-actions bot added the enhancement (New feature or request) and feature:core labels on May 6, 2026
@seratch force-pushed the feat/gpt-5-4-mini-default-model-settings branch from 9b07c0b to 70ec1ea on May 6, 2026 09:54
@seratch mentioned this pull request on May 6, 2026
@seratch added this to the 0.16.x milestone on May 6, 2026
@seratch merged commit fc2d208 into main on May 6, 2026 (10 checks passed)
@seratch deleted the feat/gpt-5-4-mini-default-model-settings branch on May 6, 2026 12:14
@github-actions bot mentioned this pull request on May 6, 2026
seratch added a commit that referenced this pull request on May 7, 2026
@sibblegp commented on May 7, 2026

I've tested both, and 4.1 works better: it has a better conversational tone and more accurate tool calling.

This is a downgrade.


Labels

enhancement (New feature or request), feature:core

Projects

None yet

Development


2 participants