Skip to content

Capture tool calls in live OpenAI Responses projector#83

Merged
bgmcmullen merged 2 commits into
masterfrom
log-codex-tool-calls
Jun 4, 2026
Merged

Capture tool calls in live OpenAI Responses projector#83
bgmcmullen merged 2 commits into
masterfrom
log-codex-tool-calls

Conversation

@bgmcmullen
Copy link
Copy Markdown
Contributor

We previously were not logging Codex tool calls, they would appear if backfilled but would not be captured live.

This PR adds the functionality to exchange-projector.js to capture the tool calls.

@bgmcmullen bgmcmullen requested a review from philcunliffe June 4, 2026 17:20
@philcunliffe
Copy link
Copy Markdown
Contributor

Dual-agent review — request_changes

  • Verdict: request_changes
  • Risk class: medium
  • Auto-merge advisory: 👎 thumbs down — verdict is request_changes; needs human-gated follow-up

Advisory only: no merge was attempted.

Reviewer reconciliation: The request_changes verdict is driven by one Codex major (Behavioral Correctness) plus medium blast radius. Independent verification found Codex's finding to be a false positive under the current recorder invariant (is_sse ⟹ response_body=null, recorder.js:190,260), which makes the body+stream mixed state it requires unreachable. It remains valid as a latent-fragility flag. The remaining Claude findings are edge-case test-coverage gaps, not defects. Practical recommendation: safe to land; optionally add the four edge-case tests and a guard/comment documenting the recorder invariant.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source Finding (severity, evidence) Intersects
Codex Behavioral Correctness — stream-only tool calls dropped when body has text, major, exchange-projector.js:263-267,386 Targets (openAiResponsesMessages) + Risks bullet 1. Verified false positive: recorder invariant (isSse ⟹ response_body=null, recorder.js:190,260) makes the mixed body+stream state unreachable. Valid as a latent-fragility flag, not a present bug.
Claude Tool-call drop paths untested, minor, exchange-projector.js:771,781 Risks bullet 2
Claude toolOutputText object branches untested, minor, exchange-projector.js:808 Risks bullet 2
Claude Malformed-JSON args untested, minor, exchange-projector.js:791 Risks bullet 2
Claude Stream call-id dedup untested, minor, exchange-projector.js:427 Risks bullet 2
Codex review

Fix Validations

Live OpenAI Responses tool calls were missing

  • Status: incomplete
  • Evidence: hypaware-core/plugins-workspace/codex/src/exchange-projector.js:266, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:267, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:386
  • Assessment: The new stream parser can capture response.output_item.done tool calls, but the top-level dispatcher only uses it when the parsed response_body yields zero assistant messages. Any capture that has both a final body and SSE events can still drop stream-only tool calls.

Findings

1) Behavioral Correctness

  • Severity: major
  • Confidence: high
  • Evidence: hypaware-core/plugins-workspace/codex/src/exchange-projector.js:263, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:266, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:267, hypaware-core/plugins-workspace/codex/src/exchange-projector.js:386
  • Why it matters: A live SSE exchange with response_body containing text plus stream_events containing response.output_item.done will project the text from the body and skip the stream parser, so the tool call is still not logged.
  • Suggested fix: When streamEvents.length > 0, run the stream projection and merge/dedupe it with body-projected messages, or pass the parsed top-level body into the stream merge path so stream-only tool calls are appended even when body text exists.

No Finding

  1. Contract & Interface Fidelity
  2. Change Impact / Blast Radius
  3. Concurrency, Ordering & State Safety
  4. Error Handling & Resilience
  5. Security Surface
  6. Resource Lifecycle & Cleanup
  7. Release Safety
  8. Test Evidence Quality
  9. Architectural Consistency
  10. Debuggability & Operability

Evidence Bundle

  • Changed hot paths: openAiResponsesMessages; responsesInputMessages; responsesAssistantMessagesFromBody; responsesAssistantMessagesFromStream; toolUseBlockFromPayload; toolResultBlockFromPayload
  • Impacted callers: hypaware-core/plugins-workspace/codex/src/exchange-projector.js:184
  • Impacted tests: test/plugins/codex-exchange-projector.test.js:131, test/plugins/codex-exchange-projector.test.js:197, test/plugins/codex-exchange-projector.test.js:275, test/plugins/codex-exchange-projector.test.js:311, test/plugins/codex-exchange-projector.test.js:351
  • Unresolved uncertainty: I did not verify the gateway capture invariant for SSE response_body; the new tests only cover SSE with response_body: '', so the mixed body-plus-stream case remains unproven.
Claude review

Claude review

Five parallel review subagents (guidance compliance, shallow bug scan, historical
context, contract & callers, comments & tests) examined the PR at head b482bfc.
Four of the five found no defects: the change is style-compliant (no semicolons,
JSDoc-only types, @import at top, no @typedef), introduces no logic bugs,
re-introduces no historically-fixed bug, keeps the only export
(createCodexExchangeProjector) and its private-helper renames internally
consistent, and the new tool_use/tool_result block shapes are already
supported by the downstream gateway part-expander (message_projector.js:534-542,
columns tool_args/tool_result_for exist) and the kernel types
(AiGatewayProjectedMessage.content: string | JsonObject[]). The live helpers
mirror codex/src/backfill.js byte-for-byte, so turn-1 response rows hash-equal
turn-2 input-replay rows. Full suite: 759 tests pass.

The surviving findings are all test-coverage gaps on branches this PR introduced.
The implementation in each case was independently verified correct; these raise
regression-detection risk rather than describing a present defect, hence minor.

Tool-call drop paths (undefined returns) are untested

  • Severity: minor
  • Confidence: 90
  • Evidence: hypaware-core/plugins-workspace/codex/src/exchange-projector.js:771, :781
  • Why it matters: A function_call missing name/call_id (or function_call_output missing call_id) is silently skipped via if (block) …, and the load-bearing call_id ?? id fallback is never exercised — every test supplies call_id, so a regression that started dropping valid calls would pass.
  • Suggested fix: Add cases for (a) a function_call with only id (no call_id) → still captured keyed on id; (b) a function_call missing name → dropped, not emitted broken.

toolOutputText object/wrapper/stringify branches are untested

  • Severity: minor
  • Confidence: 88
  • Evidence: hypaware-core/plugins-workspace/codex/src/exchange-projector.js:808
  • Why it matters: Every function_call_output test uses a plain string output; the structured-output path ({ output | content | text } unwrap and JSON.stringify fallback) — the exact case the JSDoc says it handles, and a common real Codex shape — has zero coverage.
  • Suggested fix: Add a function_call_output with output: { output: 'x' } (expect content: 'x') and one with output: { foo: 1 } (expect content: '{"foo":1}').

Malformed-JSON tool arguments are untested

  • Severity: minor
  • Confidence: 85
  • Evidence: hypaware-core/plugins-workspace/codex/src/exchange-projector.js:791
  • Why it matters: The raw-string fallback in normalizeToolInput (parsed === value ? value) is what protects against truncated/garbled streamed args, but no test feeds a JSON-looking-but-invalid arguments string (e.g. '{"cmd":').
  • Suggested fix: One case with arguments: '{"cmd":' → expect input === '{"cmd":' (preserved verbatim, not dropped, no throw).

Stream call-id dedup branch is untested

  • Severity: minor
  • Confidence: 80
  • Evidence: hypaware-core/plugins-workspace/codex/src/exchange-projector.js:427
  • Why it matters: The new logic avoids double-emitting a tool call present in both response.output_item.done and the response.completed body, but the "prefers completed body" test has no output_item.done, so toolUsesByCallId is empty and the dedup loop never runs — a regression that emitted the tool twice would pass.
  • Suggested fix: One SSE case where output_item.done carries call_id: 'c1' AND the completed body's output[] also contains function_call call_id: 'c1' → assert the tool appears exactly once.

Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr83/dual-review/pr-83

Copy link
Copy Markdown
Contributor

@philcunliffe philcunliffe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not concerned with codex's finding

@bgmcmullen bgmcmullen merged commit 027e77c into master Jun 4, 2026
6 checks passed
@bgmcmullen bgmcmullen deleted the log-codex-tool-calls branch June 4, 2026 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants