benchmarks: add first-wave GPT-5.4 vs Opus 4.6 parity harness #64441
steipete merged 12 commits into openclaw:main from
Conversation
Greptile Summary
This PR adds the first-wave GPT-5.4 / Opus 4.6 agentic parity harness: a new
Confidence Score: 4/5
Not safe to merge as-is. One P1 finding — QA_AGENTIC_PARITY_SCENARIO_IDS is imported from the wrong module in scenario-catalog.test.ts — will fail both TypeScript compilation and this test at runtime. Everything else (core logic, CLI wiring, other tests) looks correct.
extensions/qa-lab/src/scenario-catalog.test.ts (wrong import module)
Pull request overview
Adds the initial QA-lab “agentic parity” scenario-pack scaffold to benchmark GPT-5.4 vs Opus 4.6 under a shared suite runner, with CLI wiring and basic assertions to ensure the parity pack scenarios exist in the scenario catalog.
Changes:
- Introduce agentic-parity.ts defining the initial parity pack (“agentic”) and its scenario IDs, plus resolution/validation logic.
- Wire --parity-pack through the QA-lab CLI into runQaSuiteCommand, expanding scenario selection accordingly.
- Add tests/assertions to ensure parity pack scenarios exist and that the CLI expands the pack as expected.
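The --parity-pack expansion can be sketched roughly as follows. This is an illustration only: the pack-resolution helper and its signature are assumptions, not the exact code in agentic-parity.ts / cli.runtime.ts, though the five scenario IDs come from the PR summary.

```typescript
// Hypothetical sketch of --parity-pack expansion; function names are illustrative.
const QA_AGENTIC_PARITY_SCENARIO_IDS = [
  "approval-turn-tool-followthrough",
  "model-switch-tool-continuity",
  "source-docs-discovery-report",
  "image-understanding-attachment",
  "compaction-retry-mutating-tool",
] as const;

const PARITY_PACKS: Record<string, readonly string[]> = {
  agentic: QA_AGENTIC_PARITY_SCENARIO_IDS,
};

// Merge the pack's scenarios with any explicit --scenario-ids values,
// de-duplicated, and reject unknown pack names up front.
function expandScenarioIds(packId: string | undefined, explicit: string[]): string[] {
  if (!packId) return explicit;
  const pack = PARITY_PACKS[packId];
  if (!pack) throw new Error(`unknown parity pack: ${packId}`);
  return [...new Set([...pack, ...explicit])];
}

// A duplicate explicit ID collapses into the pack's five scenarios.
console.log(expandScenarioIds("agentic", ["model-switch-tool-continuity"]).length); // 5
```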
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| extensions/qa-lab/src/agentic-parity.ts | Defines parity pack ID list and resolves/validates scenario IDs. |
| extensions/qa-lab/src/cli.runtime.ts | Expands scenario IDs based on --parity-pack before dispatching suite runs. |
| extensions/qa-lab/src/cli.ts | Adds --parity-pack option and passes it through to runtime. |
| extensions/qa-lab/src/cli.runtime.test.ts | Verifies parity pack expansion is applied to suite scenario IDs. |
| extensions/qa-lab/src/scenario-catalog.test.ts | Asserts the parity scenario IDs exist in the bootstrap scenario catalog. |
Comments suppressed due to low confidence (1)
extensions/qa-lab/src/scenario-catalog.test.ts:9
QA_AGENTIC_PARITY_SCENARIO_IDS is imported from ./scenario-catalog.js, but that module doesn’t export it (it’s defined in agentic-parity.ts). This will fail typecheck and at runtime for this test; import it from ./agentic-parity.js, or re-export it from scenario-catalog.ts and keep the import consistent.
import { describe, expect, it } from "vitest";
import {
QA_AGENTIC_PARITY_SCENARIO_IDS,
listQaScenarioMarkdownPaths,
readQaBootstrapScenarioCatalog,
readQaScenarioById,
readQaScenarioExecutionConfig,
readQaScenarioPack,
} from "./scenario-catalog.js";
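A minimal fix for the flagged import, assuming QA_AGENTIC_PARITY_SCENARIO_IDS stays defined in agentic-parity.ts rather than being re-exported:

```typescript
import { describe, expect, it } from "vitest";
// Import the parity IDs from the module that actually exports them.
import { QA_AGENTIC_PARITY_SCENARIO_IDS } from "./agentic-parity.js";
import {
  listQaScenarioMarkdownPaths,
  readQaBootstrapScenarioCatalog,
  readQaScenarioById,
  readQaScenarioExecutionConfig,
  readQaScenarioPack,
} from "./scenario-catalog.js";
```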
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eda0186044
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Addressed the first review-fix pass on the parity harness branch and added the explainability docs that were still missing. This push:
New docs in this branch:
Local validation:
The linked-worktree focused vitest run still hits the existing
1326696 to a8c81be (Compare)
Rebased this branch onto current
What changed in the latest push:
Local validation:
This still does not simulate auth/proxy/DNS inside QA-lab; that remains intentionally gated by PR B’s deterministic runtime-truthfulness suites.
💡 Codex Review
Reviewed commit: a8c81bef4f
Latest proof-layer update is on head
What changed:
Local validation on this head:
Note on commit plumbing: the local pre-commit hook hit an unrelated TypeScript failure in
💡 Codex Review
Reviewed commit: 4765930c6c
Current head is now
This update finishes the last proof-layer documentation gaps without widening runtime scope:
Local validation on this head:
The remaining red GitHub signal again appears to be shared/base CI noise rather than PR-owned logic: the same
PR D remains intentionally proof-only. Once the shared CI noise is out of the way and PRs A-C are stable, this should be the release-evidence slice rather than a source of extra runtime churn.
💡 Codex Review
Reviewed commit: 48273f91a8
Addressed the remaining proof-layer review items on head
This update:
Focused validation passed locally:
Note: this commit was created with
All currently open review threads on this PR have been resolved against the latest head.
Current head
Local validation on this head:
This PR still intentionally contains no runtime behavior changes; it only extends the proof layer and maintainer-facing evidence.
64d8d8d to 6d8e0dc (Compare)
Closes the last local-runnability gap on criterion 5 of the GPT-5.4 parity completion gate in openclaw#64227 ("the parity gate shows GPT-5.4 matches or beats Opus 4.6 on the agreed metrics").

Background: the parity gate needs two comparable scenario runs (one against openai/gpt-5.4, one against anthropic/claude-opus-4-6) so the aggregate metrics and verdict in PR D (openclaw#64441) can be computed. Today the qa-lab mock server only implements /v1/responses, so the baseline run against Claude Opus 4.6 requires a real Anthropic API key. That makes the gate impossible to prove end-to-end from a local worktree and means the CI story is always "two real providers + quota + keys".

This PR adds a /v1/messages Anthropic-compatible route to the existing mock OpenAI server. The route is a thin adapter that:
- Parses Anthropic Messages API request shapes (system as a string or [{type: "text", text}], messages with string or block content, and text, tool_result, tool_use, and image blocks)
- Translates them into the ResponsesInputItem[] shape the existing shared scenario dispatcher (buildResponsesPayload) already understands
- Calls the shared dispatcher so both the OpenAI and Anthropic lanes run through the exact same scenario prompt-matching logic (same subagent fanout state machine, same extractRememberedFact helper, same /debug/requests telemetry)
- Converts the resulting OpenAI-format events back into an Anthropic message response with text and tool_use content blocks and a correct stop_reason (tool_use vs end_turn)

Non-streaming only: the QA suite runner falls back to non-streaming mock mode, so real Anthropic SSE isn't necessary for the parity baseline. Also adds claude-opus-4-6 and claude-sonnet-4-6 to /v1/models so baseline model-list probes from the suite runner resolve without extra config.

Tests added:
- advertises the Anthropic claude-opus-4-6 baseline model on /v1/models
- dispatches an Anthropic /v1/messages read tool call for source-discovery prompts (tool_use stop_reason, correct input path, /debug/requests records plannedToolName=read)
- dispatches Anthropic /v1/messages tool_result follow-ups through the shared scenario logic (subagent-handoff two-stage flow: tool_use → tool_result → "Delegated task / Evidence" prose summary)

Local validation:
- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts (18/18 pass)
- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (47/47 pass)

Refs openclaw#64227. Unblocks openclaw#64441 (parity harness) and the forthcoming qa parity run wrapper by giving the baseline lane a local-only mock path.
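The request-shape translation described in that commit message can be sketched roughly like this. The type names follow the description (ResponsesInputItem, block shapes), but the actual adapter in the mock server may differ; tool_use and image blocks are elided here for brevity.

```typescript
// Hypothetical shapes, simplified from the commit description above.
type AnthropicBlock =
  | { type: "text"; text: string }
  | { type: "tool_result"; tool_use_id: string; content: string };
type AnthropicMessage = { role: "user" | "assistant"; content: string | AnthropicBlock[] };

type ResponsesInputItem =
  | { type: "message"; role: string; content: string }
  | { type: "function_call_output"; call_id: string; output: string };

// Flatten Anthropic messages (string or block content) into the
// ResponsesInputItem[] shape the shared scenario dispatcher expects.
function toResponsesInput(
  system: string | undefined,
  messages: AnthropicMessage[],
): ResponsesInputItem[] {
  const items: ResponsesInputItem[] = [];
  if (system) items.push({ type: "message", role: "system", content: system });
  for (const msg of messages) {
    if (typeof msg.content === "string") {
      items.push({ type: "message", role: msg.role, content: msg.content });
      continue;
    }
    for (const block of msg.content) {
      if (block.type === "text") {
        items.push({ type: "message", role: msg.role, content: block.text });
      } else {
        // tool_result maps onto the dispatcher's function-call-output item.
        items.push({ type: "function_call_output", call_id: block.tool_use_id, output: block.content });
      }
    }
  }
  return items;
}

console.log(toResponsesInput("be terse", [{ role: "user", content: "hi" }]).length); // 2
```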
Closes the "summary cannot be label-verified" half of criterion 5 on the GPT-5.4 parity completion gate in openclaw#64227.

Background: the parity gate in openclaw#64441 compares two qa-suite-summary.json files and trusts whatever candidateLabel / baselineLabel the caller passes. Today the summary JSON only contains { scenarios, counts }, so nothing in the summary records which provider/model the run actually used. If a maintainer swaps candidate and baseline summary paths in a parity-report call, the verdict is silently mislabeled and nobody can retroactively verify which run produced which summary.

Changes:
- Add a run block to qa-suite-summary.json with startedAt, finishedAt, providerMode, primaryModel (plus provider and model splits), alternateModel (plus provider and model splits), fastMode, concurrency, and scenarioIds (when explicitly filtered).
- Extract a pure buildQaSuiteSummaryJson(params) helper so the summary JSON shape is unit-testable and the parity gate (and any future parity wrapper) can import the exact same type rather than reverse-engineering the JSON shape at runtime.
- Thread scenarioIds from runQaSuite into writeQaSuiteArtifacts so --scenario-ids flags are recorded in the summary.

Unit tests added (src/suite.summary-json.test.ts, 5 cases):
- records provider/model/mode so parity gates can verify labels
- includes scenarioIds in run metadata when provided
- records an Anthropic baseline lane cleanly for parity runs
- leaves split fields null when a model ref is malformed
- keeps scenarios and counts alongside the run metadata

This is additive: existing callers of qa-suite-summary.json continue to see the same { scenarios, counts } shape, just with an extra run field. No existing consumers of the JSON need to change.

The follow-up "qa parity run" CLI wrapper (run the parity pack twice against candidate + baseline, emit two labeled summaries in one command) stacks cleanly on top of this change and will land as a separate PR once openclaw#64441 and openclaw#64662 merge, so the wrapper can call runQaParityReportCommand directly.

Local validation:
- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (5/5 pass)
- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (34/34 pass)

Refs openclaw#64227. Unblocks the final parity run for openclaw#64441 / openclaw#64662 by making summaries self-describing.
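A minimal sketch of the self-describing summary shape described above. Field names follow the commit message (run, providerMode, primaryModel, scenarioIds, null splits on malformed refs), but the real buildQaSuiteSummaryJson carries more fields and may differ in detail.

```typescript
// Hypothetical, simplified from the commit description above.
type ModelSplit = { provider: string | null; model: string | null };

// Split "openai/gpt-5.4" into provider/model; leave both null when malformed.
function splitModelRef(ref: string): ModelSplit {
  const slash = ref.indexOf("/");
  if (slash <= 0 || slash === ref.length - 1) return { provider: null, model: null };
  return { provider: ref.slice(0, slash), model: ref.slice(slash + 1) };
}

function buildQaSuiteSummaryJson(params: {
  scenarios: unknown[];
  counts: Record<string, number>;
  providerMode: string;
  primaryModel: string;
  scenarioIds?: string[];
}) {
  return {
    // Existing consumers still see { scenarios, counts } unchanged.
    scenarios: params.scenarios,
    counts: params.counts,
    // New additive block so summaries record which run produced them.
    run: {
      providerMode: params.providerMode,
      primaryModel: params.primaryModel,
      primaryModelSplit: splitModelRef(params.primaryModel),
      scenarioIds: params.scenarioIds ?? null,
    },
  };
}

const summary = buildQaSuiteSummaryJson({
  scenarios: [],
  counts: { pass: 5, fail: 0 },
  providerMode: "mock",
  primaryModel: "openai/gpt-5.4",
});
console.log(summary.run.primaryModelSplit.provider); // openai
```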
Loop-6 status (pass 1): no code fixes needed on this PR. Prior bot findings were all addressed in earlier pushes. Re-requesting a fresh review pass from @copilot. Paired with the loop-6 pushes on #64679, #64681, and #64685, and the re-opened parity-summary PR (#64789, replacing closed #64689), this PR is ready for merge once main is green.
3ae2760 to 643ce6b (Compare)
- scope computeQaAgenticParityMetrics to QA_AGENTIC_PARITY_SCENARIO_TITLES in buildQaAgenticParityComparison so extra non-parity lanes in a full qa-suite-summary.json cannot influence completion / unintended-stop / valid-tool / fake-success rates
- filter coverageMismatch by !parityTitleSet.has(name) so each required parity scenario fails the gate exactly once (from requiredScenarioCoverage) instead of also being double-reported as a coverage mismatch
- drop the bare /\berror\b/i rule from SUSPICIOUS_PASS_PATTERNS (it was false-flagging legitimate passes that narrate "Error budget: 0" or "no errors found") and replace it with targeted /error occurred/i and /an error was/i phrases that indicate a real mid-turn error
- add regressions: error-budget / no-errors-observed passes yield fakeSuccessCount === 0, genuine error-occurred narration still flags, each missing required scenario fires exactly one failure line, and non-parity lanes do not perturb scoped metrics
- isolate the baseline suspicious-pass test by padding it to the full first-wave scenario set so it asserts the isolated fake-success path via toEqual([...]) rather than toContain
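The narrowed fake-success heuristic from the commit above can be illustrated like this. The two regexes come from the commit message; the helper name and the idea that the real SUSPICIOUS_PASS_PATTERNS list contains only these entries are assumptions for the sketch.

```typescript
// After this change: targeted phrases instead of a bare /\berror\b/i,
// so passes that narrate "Error budget: 0" or "no errors found"
// are no longer false-flagged as fake successes.
const SUSPICIOUS_PASS_PATTERNS: RegExp[] = [
  /error occurred/i,
  /an error was/i,
];

// Hypothetical helper name; returns true when a passing scenario's
// narration suggests a real mid-turn error was papered over.
function isSuspiciousPass(narration: string): boolean {
  return SUSPICIOUS_PASS_PATTERNS.some((pattern) => pattern.test(narration));
}

console.log(isSuspiciousPass("Error budget: 0, no errors found")); // false
console.log(isSuspiciousPass("An error occurred mid-turn but I continued")); // true
```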
643ce6b to 6f55170 (Compare)
Thanks @100yenadmin, landed via rebase. Local gate before final push:
Final PR head before merge:
Summary
This is the benchmark / release-gate slice of the GPT-5.4 / Codex parity program tracked in #64227 and scoped by #64233.
It adds the first-wave QA-lab parity scenario pack, the parity comparison report layer, and the first machine-readable gate verdict so GPT-5.4 and Opus 4.6 can be compared through shared agentic scenarios instead of anecdotes.
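At a high level, the machine-readable gate verdict reduces to comparing candidate and baseline aggregate rates. A hedged sketch under assumed names (the metric fields and threshold logic are illustrative, not the exact agentic-parity-report.ts API):

```typescript
// Hypothetical per-lane aggregates; field names are illustrative.
type ParityMetrics = {
  completionRate: number; // higher is better
  fakeSuccessRate: number; // lower is better
};

// Candidate (GPT-5.4) must match or beat baseline (Opus 4.6) on each metric
// for the gate to emit "pass" into qa-agentic-parity-summary.json.
function gateVerdict(candidate: ParityMetrics, baseline: ParityMetrics): "pass" | "fail" {
  const matchesOrBeats =
    candidate.completionRate >= baseline.completionRate &&
    candidate.fakeSuccessRate <= baseline.fakeSuccessRate;
  return matchesOrBeats ? "pass" : "fail";
}

console.log(
  gateVerdict(
    { completionRate: 1.0, fakeSuccessRate: 0 },
    { completionRate: 0.8, fakeSuccessRate: 0 },
  ),
); // pass
```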
Scope
What changed
- agentic-parity.ts in QA-lab as the first-wave parity scenario-pack entrypoint
- --parity-pack wiring through cli.runtime.ts and cli.ts
- agentic-parity-report.ts as the comparison layer for two suite summaries
- qa-agentic-parity-report.md and qa-agentic-parity-summary.json artifacts with a pass/fail gate verdict

First-wave parity scenarios:
- approval-turn-tool-followthrough
- model-switch-tool-continuity
- source-docs-discovery-report
- image-understanding-attachment
- compaction-retry-mutating-tool

Validation
pnpm build
CI=1 pnpm exec vitest run extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/mock-openai-server.test.ts extensions/qa-lab/src/agentic-parity-report.test.ts extensions/qa-lab/src/scenario-catalog.test.ts

Non-goals