feat: switch the default model to a newer mini model (affecting only when a model is unset) #3147

Merged

seratch merged 1 commit into main from feat/gpt-5-4-mini-default-model-settings on May 6, 2026

Conversation

@seratch (Member) commented on May 6, 2026

This pull request updates the SDK default model from gpt-4.1 to gpt-5.4-mini for agents that do not specify a model explicitly. Although gpt-5.5 is the latest model, gpt-5.4-mini is a pragmatic default for users getting started because it keeps latency closer to gpt-4.1 while moving the default onto the GPT-5 family. This default is not meant to be permanent; we may update it again as newer models offer a better balance of intelligence, latency, and cost.
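
As a hedged sketch of the user-visible effect (helper names taken from this branch's agents.models module; treat it as illustrative, not a spec): an Agent constructed without an explicit model now falls back to the new default.

from agents import Agent
from agents.models import get_default_model

# No model argument: this agent uses the SDK-wide default at run time.
agent = Agent(name="Assistant", instructions="Answer briefly.")
print(get_default_model())  # expected to print "gpt-5.4-mini" on this branch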

Detailed analysis report

gpt-5.4-mini default-model validation report

Status: completed on the current checkout.

Investigation target

Validate gpt-5.4-mini as the Agents SDK default model replacement for gpt-4.1.

The probe compares the current baseline default behavior against the candidate with the model
settings the SDK applies to gpt-5.4-mini: reasoning.effort="none" and verbosity="low".
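
For reference, a minimal sketch of pinning that candidate configuration explicitly, mirroring the _model_cases helper in the probe script below (the instructions string here is a placeholder):

from openai.types.shared import Reasoning
from agents import Agent, ModelSettings

# Candidate under test: GPT-5-family mini model with the latency-oriented
# settings the SDK would apply by default (effort "none", low verbosity).
candidate = Agent(
    name="Candidate probe",
    instructions="Answer with exactly READY and no other text.",
    model="gpt-5.4-mini",
    model_settings=ModelSettings(reasoning=Reasoning(effort="none"), verbosity="low"),
)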

Validation matrix

| case_id | scenario | mode | question | setup | state_setup | variable_under_test | held_constant | comparison_basis | observation_summary | result_flag | status | evidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| L1 | local SDK default mapping | single-shot | Does the SDK default resolve to gpt-5.4-mini with GPT-5 default settings? | Inspect agents.models default helpers in the current checkout. | Local only | default model helper output | current checkout | expected SDK default contract | Default resolves to gpt-5.4-mini with reasoning.effort="none" and verbosity="low". | expected | pass | artifacts/summary.json |
| S1 | single-turn text response | warm-up + repeat-10 | Can the candidate preserve a constrained simple-response pattern? | Agent must answer with exactly READY. | Fresh agent state per run | model name and default model settings | prompt and output constraint | gpt-4.1 in the same probe | Both models returned the exact constrained response 10/10. | expected | pass | artifacts/results.json |
| T1 | function tool call | warm-up + repeat-10 | Can the candidate preserve a required function-tool workflow? | Agent must call lookup_order_status and report the returned status. | Fresh agent state per run | model name and default model settings | tool schema, prompt, and output constraint | gpt-4.1 in the same probe | Both models called the required function once and returned the expected status 10/10. | expected | pass | artifacts/results.json |
| H1 | handoff to specialist agent | warm-up + repeat-10 | Can the candidate preserve a simple handoff workflow? | Frontline agent must hand off Spanish requests to a Spanish specialist. | Fresh agent state per run | model name and default model settings | handoff setup, prompt, and output constraint | gpt-4.1 in the same probe | Both models completed the handoff and returned the specialist response 10/10. | expected | pass | artifacts/results.json |
| A1 | tool approval interruption and resume | warm-up + repeat-10 | Can the candidate preserve HITL interruption and approval resume? | Agent must request an approval-gated refund tool, then resume after approval. | Fresh agent state per run | model name and default model settings | tool schema, approval flow, prompt, and output constraint | gpt-4.1 in the same probe | Both models interrupted for approval, resumed, and returned the approved tool result 10/10. | expected | pass | artifacts/results.json |

Parity controls

- Held constant: prompts, output constraints, tool definitions, handoff setup, approval flow, SDK checkout, Python environment, and Responses path.
- Variable under test: model name and matching default model settings.
- Baseline: gpt-4.1 without GPT-5 default model settings.
- Candidate: gpt-5.4-mini with reasoning.effort="none" and verbosity="low".
- Scope: pattern parity across representative text, tool, handoff, and HITL workflows. This does not claim broad quality equivalence outside the covered patterns.

Docs preflight

The OpenAI developer docs for GPT-5 reasoning models say that reasoning.effort tunes the latency/intelligence tradeoff, and that none is reserved for cases where low latency matters more than intelligence. The probe therefore treats gpt-5.4-mini with reasoning.effort="none" as a latency-oriented default candidate that still needs validation on representative agent workflows.

Probe command

uv run python validation/gpt_5_4_mini_default/probe.py --output-dir validation/gpt_5_4_mini_default/artifacts --warmup-runs 1 --measured-runs 10

Findings

No candidate-specific regression was observed in the covered patterns. gpt-5.4-mini with
reasoning.effort="none" passed 10/10 measured runs for the text, function-tool, handoff, and HITL
approval-resume cases. The local SDK default helper also resolved to gpt-5.4-mini with
reasoning.effort="none" and verbosity="low".

The comparison supports pattern parity for these representative workflows, not a broad quality
equivalence claim. The covered workflows intentionally focus on low-latency agent mechanics:
single-turn constrained output, required function tool use, a simple handoff, and approval
interruption/resume.

Median total latency was comparable or better for the candidate in three of four live cases:

| case_id | scenario | gpt-4.1 median | gpt-5.4-mini median | candidate delta | pass rate |
| --- | --- | --- | --- | --- | --- |
| S1 | single-turn text response | 0.948s | 0.955s | +0.7% | 10/10 vs 10/10 |
| T1 | function tool call | 2.500s | 2.184s | -12.7% | 10/10 vs 10/10 |
| H1 | handoff to specialist agent | 2.465s | 1.838s | -25.4% | 10/10 vs 10/10 |
| A1 | tool approval interruption and resume | 2.570s | 2.637s | +2.6% | 10/10 vs 10/10 |

Tail latency varied by case. The largest candidate max was 5.662s in the approval-resume case; the
largest baseline max was 5.592s in the handoff case.
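
The median columns above can be recomputed from the emitted summary.json; a small sketch, using the cases and median_total_latency_s keys that the probe's _summarize helper writes (see the script below):

import json
from pathlib import Path

summary = json.loads(Path("validation/gpt_5_4_mini_default/artifacts/summary.json").read_text())
for case_id in ("S1", "T1", "H1", "A1"):
    base = summary["cases"][f"{case_id}:baseline_gpt_4_1"]["median_total_latency_s"]
    cand = summary["cases"][f"{case_id}:candidate_gpt_5_4_mini_none"]["median_total_latency_s"]
    # Delta is reported relative to the gpt-4.1 baseline median.
    print(f"{case_id}: {base:.3f}s -> {cand:.3f}s ({(cand - base) / base:+.1%})")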

Artifact status

The probe was run from the repository root on commit
ce462354fd3bbb841bb808dd63c8b94a4026a680 with Python 3.12.9 and openai 2.26.0. It used the
approved OPENAI_API_KEY environment variable and did not print the secret value.

Raw runtime artifacts were generated under validation/gpt_5_4_mini_default/artifacts/:

- metadata.json
- results.json
- summary.json

Probe script
# mypy: ignore-errors
from __future__ import annotations

import argparse
import asyncio
import json
import os
import platform
import statistics
import subprocess
import sys
import time
from collections import Counter, defaultdict
from dataclasses import dataclass
from importlib import metadata
from pathlib import Path
from typing import Any

from openai.types.shared import Reasoning

from agents import Agent, ModelSettings, Runner, function_tool, handoff
from agents.models import get_default_model, get_default_model_settings, is_gpt_5_default

APPROVED_ENV_VARS = ["OPENAI_API_KEY"]


@dataclass(frozen=True)
class ModelCase:
    label: str
    model: str
    model_settings: ModelSettings


@dataclass(frozen=True)
class ProbeCase:
    case_id: str
    scenario: str
    mode: str
    question: str
    setup: str


@dataclass
class CaseResult:
    case_id: str
    model_label: str
    model: str
    scenario: str
    mode: str
    measured: bool
    total_latency_s: float | None
    observation_summary: str
    result_flag: str
    status: str
    output_preview: str
    error: str | None
    metrics: dict[str, Any]

    def as_dict(self) -> dict[str, Any]:
        return {
            "case_id": self.case_id,
            "model_label": self.model_label,
            "model": self.model,
            "scenario": self.scenario,
            "mode": self.mode,
            "measured": self.measured,
            "total_latency_s": self.total_latency_s,
            "observation_summary": self.observation_summary,
            "result_flag": self.result_flag,
            "status": self.status,
            "output_preview": self.output_preview,
            "error": self.error,
            "metrics": self.metrics,
        }


PROBE_CASES = [
    ProbeCase(
        case_id="L1",
        scenario="local SDK default mapping",
        mode="single-shot",
        question="Does the SDK default resolve to gpt-5.4-mini with GPT-5 default settings?",
        setup="No live API call; inspect agents.models default helpers in the current checkout.",
    ),
    ProbeCase(
        case_id="S1",
        scenario="single-turn text response",
        mode="warm-up + repeat-N",
        question="Can the candidate preserve a constrained simple-response pattern?",
        setup="Agent must answer with exactly READY.",
    ),
    ProbeCase(
        case_id="T1",
        scenario="function tool call",
        mode="warm-up + repeat-N",
        question="Can the candidate preserve a required function-tool workflow?",
        setup="Agent must call lookup_order_status and report the returned status.",
    ),
    ProbeCase(
        case_id="H1",
        scenario="handoff to specialist agent",
        mode="warm-up + repeat-N",
        question="Can the candidate preserve a simple handoff workflow?",
        setup="Frontline agent must hand off Spanish requests to a Spanish specialist.",
    ),
    ProbeCase(
        case_id="A1",
        scenario="tool approval interruption and resume",
        mode="warm-up + repeat-N",
        question="Can the candidate preserve HITL interruption and approval resume?",
        setup="Agent must request an approval-gated refund tool, then resume after approval.",
    ),
]


def _git_value(*args: str) -> str:
    result = subprocess.run(["git", *args], check=False, capture_output=True, text=True)
    if result.returncode != 0:
        return "unknown"
    return result.stdout.strip() or "unknown"


def _package_version(name: str) -> str | None:
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def _runtime_context(output_dir: Path) -> dict[str, Any]:
    return {
        "approved_env_vars": {
            name: ("set" if os.getenv(name) else "unset") for name in APPROVED_ENV_VARS
        },
        "cwd": os.getcwd(),
        "git_branch": _git_value("rev-parse", "--abbrev-ref", "HEAD"),
        "git_commit": _git_value("rev-parse", "HEAD"),
        "output_dir": str(output_dir),
        "package_versions": {
            name: version
            for name in ("openai", "openai-agents", "agents")
            if (version := _package_version(name)) is not None
        },
        "platform": platform.platform(),
        "python_executable": sys.executable,
        "python_version": sys.version.split()[0],
        "script_path": str(Path(__file__).resolve()),
    }


def _write_json(path: Path, payload: Any) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n")


def _preview(value: object, *, limit: int = 600) -> str:
    text = str(value).replace("\n", " ").strip()
    return text[:limit]


async def _run_timed(coro: Any) -> tuple[float, Any]:
    started = time.perf_counter()
    result = await coro
    return time.perf_counter() - started, result


def _model_cases() -> list[ModelCase]:
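    # Baseline runs gpt-4.1 with empty ModelSettings; the candidate applies the
    # GPT-5 default settings under test (reasoning effort "none", low verbosity).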
    return [
        ModelCase(label="baseline_gpt_4_1", model="gpt-4.1", model_settings=ModelSettings()),
        ModelCase(
            label="candidate_gpt_5_4_mini_none",
            model="gpt-5.4-mini",
            model_settings=ModelSettings(reasoning=Reasoning(effort="none"), verbosity="low"),
        ),
    ]


def _result_for_exception(
    case: ProbeCase,
    model_case: ModelCase,
    *,
    measured: bool,
    latency_s: float | None,
    error: BaseException,
) -> CaseResult:
    return CaseResult(
        case_id=case.case_id,
        model_label=model_case.label,
        model=model_case.model,
        scenario=case.scenario,
        mode=case.mode,
        measured=measured,
        total_latency_s=latency_s,
        observation_summary=f"{type(error).__name__}: {_preview(error)}",
        result_flag="negative",
        status="fail",
        output_preview="",
        error=f"{type(error).__name__}: {error}",
        metrics={},
    )


async def _case_simple_text(model_case: ModelCase, *, measured: bool) -> CaseResult:
    case = PROBE_CASES[1]
    agent = Agent(
        name="Default candidate text probe",
        instructions="Answer with exactly READY and no other text.",
        model=model_case.model,
        model_settings=model_case.model_settings,
    )
    try:
        latency_s, result = await _run_timed(Runner.run(agent, "Return the required response."))
        output = str(result.final_output).strip()
        ok = output == "READY"
        return CaseResult(
            case_id=case.case_id,
            model_label=model_case.label,
            model=model_case.model,
            scenario=case.scenario,
            mode=case.mode,
            measured=measured,
            total_latency_s=latency_s,
            observation_summary="Returned the exact constrained response."
            if ok
            else "Output drifted.",
            result_flag="expected" if ok else "negative",
            status="pass" if ok else "fail",
            output_preview=_preview(output),
            error=None,
            metrics={},
        )
    except Exception as exc:
        return _result_for_exception(case, model_case, measured=measured, latency_s=None, error=exc)


async def _case_tool_call(model_case: ModelCase, *, measured: bool) -> CaseResult:
    case = PROBE_CASES[2]
    calls: list[str] = []

    @function_tool
    def lookup_order_status(order_id: str) -> str:
        """Look up a deterministic test order status."""
        calls.append(order_id)
        return "shipped"

    agent = Agent(
        name="Default candidate tool probe",
        instructions=(
            "Always call lookup_order_status with order_id ord_123 before answering. "
            "After the tool result, answer exactly ORDER_STATUS:<status>."
        ),
        tools=[lookup_order_status],
        model=model_case.model,
        model_settings=model_case.model_settings,
    )
    try:
        latency_s, result = await _run_timed(Runner.run(agent, "What is the status of ord_123?"))
        output = str(result.final_output).strip()
        ok = calls == ["ord_123"] and output == "ORDER_STATUS:shipped"
        return CaseResult(
            case_id=case.case_id,
            model_label=model_case.label,
            model=model_case.model,
            scenario=case.scenario,
            mode=case.mode,
            measured=measured,
            total_latency_s=latency_s,
            observation_summary=(
                "Called the required function once and returned the expected status."
                if ok
                else "Tool workflow did not match the expected shape."
            ),
            result_flag="expected" if ok else "negative",
            status="pass" if ok else "fail",
            output_preview=_preview(output),
            error=None,
            metrics={"tool_calls": calls},
        )
    except Exception as exc:
        return _result_for_exception(case, model_case, measured=measured, latency_s=None, error=exc)


async def _case_handoff(model_case: ModelCase, *, measured: bool) -> CaseResult:
    case = PROBE_CASES[3]
    spanish_agent = Agent(
        name="Spanish_specialist",
        instructions="Speak only Spanish. Answer with exactly ESPECIALISTA_LISTO.",
        handoff_description="Handles Spanish-language requests.",
        model=model_case.model,
        model_settings=model_case.model_settings,
    )
    frontline_agent = Agent(
        name="Frontline",
        instructions=(
            "If the user asks in Spanish or asks for Spanish, immediately hand off to the "
            "Spanish_specialist. Do not answer directly in that case."
        ),
        handoffs=[handoff(spanish_agent)],
        model=model_case.model,
        model_settings=model_case.model_settings,
    )
    try:
        latency_s, result = await _run_timed(
            Runner.run(frontline_agent, "Por favor responde en espanol.")
        )
        output = str(result.final_output).strip()
        ok = output.rstrip(".") == "ESPECIALISTA_LISTO"
        return CaseResult(
            case_id=case.case_id,
            model_label=model_case.label,
            model=model_case.model,
            scenario=case.scenario,
            mode=case.mode,
            measured=measured,
            total_latency_s=latency_s,
            observation_summary=(
                "Completed the handoff and returned the specialist response."
                if ok
                else "Handoff response did not match the expected specialist output."
            ),
            result_flag="expected" if ok else "negative",
            status="pass" if ok else "fail",
            output_preview=_preview(output),
            error=None,
            metrics={"last_agent": result.last_agent.name},
        )
    except Exception as exc:
        return _result_for_exception(case, model_case, measured=measured, latency_s=None, error=exc)


async def _case_approval(model_case: ModelCase, *, measured: bool) -> CaseResult:
    case = PROBE_CASES[4]
    calls: list[str] = []

    @function_tool(needs_approval=True)
    def approve_refund(order_id: str) -> str:
        """Refund a deterministic test order after approval."""
        calls.append(order_id)
        return f"approved_refund:{order_id}"

    agent = Agent(
        name="Default candidate approval probe",
        instructions=(
            "You must call approve_refund with order_id ord_123. After the tool is approved "
            "and returns, answer exactly REFUND_RESULT:<tool result>."
        ),
        tools=[approve_refund],
        model=model_case.model,
        model_settings=model_case.model_settings,
    )
    try:
        started = time.perf_counter()
        result = await Runner.run(agent, "Refund order ord_123.")
        interruptions_before_approval = len(result.interruptions)
        if not result.interruptions:
            latency_s = time.perf_counter() - started
            return CaseResult(
                case_id=case.case_id,
                model_label=model_case.label,
                model=model_case.model,
                scenario=case.scenario,
                mode=case.mode,
                measured=measured,
                total_latency_s=latency_s,
                observation_summary="The run completed without the required approval interruption.",
                result_flag="negative",
                status="fail",
                output_preview=_preview(result.final_output),
                error=None,
                metrics={"interruptions_before_approval": interruptions_before_approval},
            )

        state = result.to_state()
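        # Approve every pending interruption, then resume from the saved state;
        # total latency deliberately spans both legs of the HITL flow.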
        for interruption in result.interruptions:
            state.approve(interruption)
        resumed = await Runner.run(agent, state)
        latency_s = time.perf_counter() - started
        output = str(resumed.final_output).strip()
        ok = calls == ["ord_123"] and output == "REFUND_RESULT:approved_refund:ord_123"
        return CaseResult(
            case_id=case.case_id,
            model_label=model_case.label,
            model=model_case.model,
            scenario=case.scenario,
            mode=case.mode,
            measured=measured,
            total_latency_s=latency_s,
            observation_summary=(
                "Interrupted for approval, resumed, and returned the approved tool result."
                if ok
                else "Approval resume did not match the expected tool result."
            ),
            result_flag="expected" if ok else "negative",
            status="pass" if ok else "fail",
            output_preview=_preview(output),
            error=None,
            metrics={
                "interruptions_before_approval": interruptions_before_approval,
                "tool_calls": calls,
            },
        )
    except Exception as exc:
        return _result_for_exception(case, model_case, measured=measured, latency_s=None, error=exc)


def _case_local_default() -> CaseResult:
    case = PROBE_CASES[0]
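    # No live API call: this case checks the SDK's default-model helpers directly.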
    default_settings = get_default_model_settings()
    reasoning = getattr(default_settings.reasoning, "effort", None)
    ok = (
        get_default_model() == "gpt-5.4-mini"
        and is_gpt_5_default() is True
        and reasoning == "none"
        and default_settings.verbosity == "low"
    )
    return CaseResult(
        case_id=case.case_id,
        model_label="sdk_default",
        model=get_default_model(),
        scenario=case.scenario,
        mode=case.mode,
        measured=True,
        total_latency_s=None,
        observation_summary=(
            "Default model resolves to gpt-5.4-mini with reasoning.effort none and verbosity low."
            if ok
            else "Default helper output did not match the expected gpt-5.4-mini settings."
        ),
        result_flag="expected" if ok else "negative",
        status="pass" if ok else "fail",
        output_preview="",
        error=None,
        metrics={
            "default_model": get_default_model(),
            "is_gpt_5_default": is_gpt_5_default(),
            "reasoning_effort": reasoning,
            "verbosity": default_settings.verbosity,
        },
    )


async def _run_live_case(
    case_id: str,
    model_case: ModelCase,
    *,
    measured: bool,
) -> CaseResult:
    if case_id == "S1":
        return await _case_simple_text(model_case, measured=measured)
    if case_id == "T1":
        return await _case_tool_call(model_case, measured=measured)
    if case_id == "H1":
        return await _case_handoff(model_case, measured=measured)
    if case_id == "A1":
        return await _case_approval(model_case, measured=measured)
    raise ValueError(f"Unknown live case: {case_id}")


def _summarize(results: list[CaseResult]) -> dict[str, Any]:
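    # Group results per (case_id, model_label) so baseline and candidate
    # statistics can be compared case by case in summary.json.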
    grouped: defaultdict[tuple[str, str], list[CaseResult]] = defaultdict(list)
    for result in results:
        grouped[(result.case_id, result.model_label)].append(result)

    cases: dict[str, Any] = {}
    for (case_id, model_label), items in sorted(grouped.items()):
        measured = [item for item in items if item.measured]
        latencies = [item.total_latency_s for item in measured if item.total_latency_s is not None]
        cases[f"{case_id}:{model_label}"] = {
            "case_id": case_id,
            "model_label": model_label,
            "model": items[-1].model,
            "scenario": items[-1].scenario,
            "runs": len(measured),
            "warmups": len(items) - len(measured),
            "status_counts": dict(Counter(item.status for item in measured or items)),
            "result_flags": dict(Counter(item.result_flag for item in measured or items)),
            "median_total_latency_s": statistics.median(latencies) if latencies else None,
            "min_total_latency_s": min(latencies) if latencies else None,
            "max_total_latency_s": max(latencies) if latencies else None,
            "latest_observation": items[-1].observation_summary,
        }

    return {
        "cases": cases,
        "result_flags": dict(Counter(result.result_flag for result in results)),
        "status_counts": dict(Counter(result.status for result in results if result.measured)),
    }


async def _run(args: argparse.Namespace) -> int:
    output_dir = Path(args.output_dir).resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    results: list[CaseResult] = [_case_local_default()]
    if not args.skip_live:
        if not os.getenv("OPENAI_API_KEY"):
            raise RuntimeError("OPENAI_API_KEY must be set for live probe cases.")

        live_case_ids = [case_id.strip() for case_id in args.cases.split(",") if case_id.strip()]
        model_cases = _model_cases()
        for case_id in live_case_ids:
            for model_case in model_cases:
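                # Warm-up runs exercise the same case but are excluded from the
                # latency statistics and from pass/fail gating (measured=False).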
                for _ in range(args.warmup_runs):
                    results.append(await _run_live_case(case_id, model_case, measured=False))
                for _ in range(args.measured_runs):
                    results.append(await _run_live_case(case_id, model_case, measured=True))

    metadata_payload = {
        "context": _runtime_context(output_dir),
        "probe_cases": [case.__dict__ for case in PROBE_CASES],
        "execution": {
            "cases": args.cases,
            "measured_runs": args.measured_runs,
            "skip_live": args.skip_live,
            "warmup_runs": args.warmup_runs,
        },
        "comparison_parity": {
            "held_constant": [
                "same prompts and output constraints for each scenario",
                "fresh agent state per measured run",
                "same SDK checkout, Python environment, and Responses path",
            ],
            "variable_under_test": "model name and matching default model settings",
            "baseline": "gpt-4.1 without GPT-5 default model settings",
            "candidate": "gpt-5.4-mini with reasoning.effort none and verbosity low",
        },
    }
    summary = _summarize(results)
    _write_json(output_dir / "metadata.json", metadata)
    _write_json(output_dir / "results.json", [result.as_dict() for result in results])
    _write_json(output_dir / "summary.json", summary)
    print(json.dumps({"metadata": metadata, "summary": summary}, indent=2, sort_keys=True))
    return 0 if all(result.status == "pass" for result in results if result.measured) else 1


def _parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Probe gpt-5.4-mini as the Agents SDK default model candidate."
    )
    parser.add_argument(
        "--output-dir",
        default="validation/gpt_5_4_mini_default/artifacts",
        help="Directory for metadata.json, results.json, and summary.json.",
    )
    parser.add_argument(
        "--cases",
        default="S1,T1,H1,A1",
        help="Comma-separated live case IDs to run.",
    )
    parser.add_argument("--warmup-runs", type=int, default=1)
    parser.add_argument("--measured-runs", type=int, default=3)
    parser.add_argument("--skip-live", action="store_true")
    return parser.parse_args()


def main() -> int:
    return asyncio.run(_run(_parse_args()))


if __name__ == "__main__":
    raise SystemExit(main())

see also: openai/openai-agents-js#1248

@github-actions bot added the enhancement (New feature or request) and feature:core labels on May 6, 2026
@seratch force-pushed the feat/gpt-5-4-mini-default-model-settings branch from 9b07c0b to 70ec1ea on May 6, 2026 09:54
@seratch mentioned this pull request on May 6, 2026
@seratch added this to the 0.16.x milestone on May 6, 2026
@seratch merged commit fc2d208 into main on May 6, 2026 (10 checks passed)
@seratch deleted the feat/gpt-5-4-mini-default-model-settings branch on May 6, 2026 12:14
@github-actions bot mentioned this pull request on May 6, 2026
seratch added a commit that referenced this pull request on May 7, 2026
@sibblegp commented on May 7, 2026

I've tested both, and 4.1 works better: it has a better conversational tone and more accurate tool calling.

This is a downgrade.


Labels

enhancement (New feature or request), feature:core

Projects

None yet

Development


2 participants