fix(gateway): bound traced channel startup handoff#82592

Merged
steipete merged 4 commits into main from fix/channel-startup-handoff-liveness-82398
May 16, 2026

@steipete steipete commented May 16, 2026

Summary

  • Bound traced channel account startup to a launch/handoff span instead of measuring the whole long-lived channel task.
  • Yield one macrotask before traced channel account startup so Gateway startup can finish attaching clients before noisy channel work begins.
  • Keep GitHub Copilot device-login fetches on the guarded SSRF path with a GitHub-only hostname policy, and keep mocked guarded fetch tests hermetic.
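
As a rough illustration of the GitHub-only hostname policy mentioned in the last bullet, the sketch below checks a URL against an exact-match allowlist before any fetch is attempted. The names `ALLOWED_GITHUB_HOSTS` and `isAllowedGitHubUrl` and the specific hosts listed are assumptions for illustration; the PR's actual guard and host list are not shown on this page.

```typescript
// Hypothetical allowlist; the real policy in the PR may list different hosts.
const ALLOWED_GITHUB_HOSTS = new Set<string>(["github.com", "api.github.com"]);

// Returns true only for HTTPS URLs whose hostname is exactly on the allowlist.
function isAllowedGitHubUrl(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // unparseable input is rejected outright
  }
  // Exact hostname match (no suffix matching), so a lookalike such as
  // "github.com.evil.example" does not pass.
  return url.protocol === "https:" && ALLOWED_GITHUB_HOSTS.has(url.hostname);
}

console.log(isAllowedGitHubUrl("https://github.com/login/device/code")); // true
console.log(isAllowedGitHubUrl("https://github.com.evil.example/login")); // false
```

An exact-match set is the simplest form such a policy could take; a production SSRF guard would typically also validate the resolved IP address, which this sketch does not attempt.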

Verification

  • node scripts/run-vitest.mjs src/gateway/server-channels.test.ts src/infra/net/fetch-guard.ssrf.test.ts extensions/github-copilot/index.test.ts extensions/github-copilot/embeddings.test.ts --run
  • node scripts/check-no-raw-channel-fetch.mjs
  • /Users/steipete/Projects/agent-scripts/skills/codex-review/scripts/codex-review --mode branch
  • git diff --check

Real behavior proof

Behavior addressed: Startup liveness warnings can stay pinned to channels.<id>.start-account for long-lived channel tasks, making Discord/QQ-style startup stalls look like an active startup phase even after the channel task has been handed off.
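
The idea of a bounded handoff span can be sketched as follows: the phase covers only the synchronous launch of the channel task, not the task's lifetime. Everything here (`activePhases`, `measureHandoff`, the `channels.demo` phase name) is a hypothetical stand-in and not the gateway's real tracing API.

```typescript
// Hypothetical registry of currently active diagnostic phases.
const activePhases = new Set<string>();

// Measures only the synchronous launch of a task: the phase is cleared as
// soon as `launch` returns, even if the promise it returned is still pending.
function measureHandoff<T>(phase: string, launch: () => T): T {
  activePhases.add(phase);
  try {
    return launch();
  } finally {
    activePhases.delete(phase);
  }
}

// A long-lived channel task; in this sketch it never settles.
const longLivedTask = measureHandoff(
  "channels.demo.start-account-handoff",
  () => new Promise<void>(() => {}),
);
void longLivedTask; // the task keeps running; only the handoff was measured

console.log(activePhases.size); // 0: the handoff span ended at launch
```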

Real environment tested: No live QQ Bot login is available; maintainer override is applied for code-level startup-trace proof.

Exact steps or command run after this patch: node scripts/run-vitest.mjs src/gateway/server-channels.test.ts src/infra/net/fetch-guard.ssrf.test.ts extensions/github-copilot/index.test.ts extensions/github-copilot/embeddings.test.ts --run; node scripts/check-no-raw-channel-fetch.mjs; /Users/steipete/Projects/agent-scripts/skills/codex-review/scripts/codex-review --mode branch; git diff --check.

Evidence after fix: server-channels.test.ts verifies startAccount is delayed until after the traced handoff tick, channels.discord.start-account-handoff is no longer active while a long-lived channel account task is still running, and a stop that wins the handoff prevents traced account startup. The guarded-fetch/Copilot tests verify the SSRF guard path stays hermetic under mocked fetches and Copilot login still works through the guarded seam.

Observed result after fix: Focused local shard passed: 4 test files, 117 tests. Raw channel fetch boundary check passed. Codex review reported no accepted/actionable findings on the rebased head.

What was not tested: Live QQ Gateway login, because no QQ Bot credentials are available.

Refs #82398.

@openclaw-barnacle (bot) added the labels gateway (Gateway runtime), size: S, and maintainer (Maintainer-authored PR) on May 16, 2026
@steipete added the label proof: override (Maintainer override for the external PR real behavior proof gate) on May 16, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 696ec1296b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +523 to +527
if (startupTrace) {
  await waitForChannelStartupHandoff();
}
let startAccountTask: ReturnType<typeof startAccount> | undefined;
await measureStartup(`channels.${channelId}.start-account-handoff`, () => {

[P1] Re-check abort state before delayed startAccount handoff

When startupTrace is enabled (the default gateway path), this new setImmediate handoff delays startAccount by a full macrotask but never re-checks whether the account was aborted or manually stopped during that window. A quick stopChannel after startChannel can therefore still invoke startAccount after shutdown was requested, causing unwanted reconnect/login work on accounts that were explicitly stopped; this race is introduced by the new delay and should be guarded with an abort/manual-stop check immediately after the handoff wait and before calling startAccount.
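
A minimal sketch of the guard this comment asks for is shown below. It assumes a `setImmediate`-based yield and an abort signal plus a manual-stop set, as the comment describes; the helper names (`waitForHandoff`, `startAfterHandoff`, `demo`) and the `discord:default` key are hypothetical and not taken from the PR diff.

```typescript
// Hypothetical helper mirroring the "yield one macrotask" handoff.
function waitForHandoff(): Promise<void> {
  return new Promise<void>((resolve) => setImmediate(resolve));
}

async function startAfterHandoff(
  signal: AbortSignal,
  manuallyStopped: ReadonlySet<string>,
  key: string,
  startAccount: () => Promise<void>,
): Promise<"started" | "skipped"> {
  await waitForHandoff();
  // Re-check after the yield: a stop may have landed during the macrotask gap.
  if (signal.aborted || manuallyStopped.has(key)) {
    return "skipped";
  }
  await startAccount();
  return "started";
}

// Usage: a stop issued right after start wins the handoff, so provider
// startup is never invoked.
async function demo(): Promise<{ result: "started" | "skipped"; calls: number }> {
  const controller = new AbortController();
  const stopped = new Set<string>();
  let calls = 0;
  const pending = startAfterHandoff(controller.signal, stopped, "discord:default", async () => {
    calls += 1;
  });
  stopped.add("discord:default");
  controller.abort();
  return { result: await pending, calls };
}
```

Without the re-check, the same `demo` sequence would still call `startAccount` once, which is the race the comment describes.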

clawsweeper Bot commented May 16, 2026

Codex review: needs changes before merge.

Summary
The PR bounds gateway channel account startup tracing to a short handoff span, yields one macrotask before traced channel startup, and adds regression coverage plus a changelog entry.

Reproducibility: Yes, for source-level reproduction. Current main wraps long-lived startAccount in startupTrace.measure, and that path keeps the diagnostic phase active until the channel task settles. I did not run a live QQ Bot login, so the live disconnect symptom remains covered by the linked report rather than local E2E proof.

Real behavior proof
Override: A maintainer applied proof: override for this PR.

Next step before merge
There is one narrow, concrete correctness blocker in the PR branch: preserve stop/abort handling across the new startup handoff delay.

Security
Cleared: The diff touches gateway startup control flow, tests, and changelog only; I found no concrete security or supply-chain regression.

Review findings

  • [P1] Re-check stop state after the handoff wait — src/gateway/server-channels.ts:524-528
Review details

Best possible solution:

Keep the handoff-span approach, but add an abort/manual-stop guard immediately after the handoff wait and cover that race with a focused gateway channel test.

Do we have a high-confidence way to reproduce the issue?

Yes, for source-level reproduction: current main wraps long-lived startAccount in startupTrace.measure, and that path keeps the diagnostic phase active until the channel task settles. I did not run a live QQ Bot login, so the live disconnect symptom remains covered by the linked report rather than local E2E proof.

Is this the best way to solve the issue?

No, not yet: bounding the trace span is the right narrow direction, but the added macrotask handoff must re-check stop/abort state before starting provider work.

Full review comments:

  • [P1] Re-check stop state after the handoff wait — src/gateway/server-channels.ts:524-528
    After the new setImmediate handoff, stopChannel() can mark the account manually stopped and abort the controller before this code resumes. The task still calls startAccount afterward, which can start reconnect/login work for an account that was explicitly stopped; add the same abort/manual-stop guard after the wait and before invoking startAccount.
    Confidence: 0.9

Overall correctness: patch is incorrect
Overall confidence: 0.9

Acceptance criteria:

  • node scripts/run-vitest.mjs src/gateway/server-channels.test.ts
  • git diff --check

What I checked:

  • Current stop lifecycle: stopChannel records manual stops, aborts the stored controller, and then waits for the tracked channel task, so a delayed task must re-check stop/abort state before invoking provider startup. (src/gateway/server-channels.ts:680, 1bd10cfee6a4)
  • Current startup guard: Current main checks abort.signal.aborted || manuallyStopped.has(rKey) before runtime/bootstrap and account handoff, which is the guard the PR needs to preserve after adding a later macrotask delay. (src/gateway/server-channels.ts:470, 1bd10cfee6a4)
  • PR diff race: The PR diff adds waitForChannelStartupHandoff() before startAccount, but the provided latest patch does not re-check abort.signal.aborted or manuallyStopped after that wait and before calling startAccount. (src/gateway/server-channels.ts:524, ae54d810bc00)
  • Diagnostic phase contract: startupTrace.measure awaits withDiagnosticPhase, and withDiagnosticPhase keeps the active phase until the wrapped promise settles, which explains why wrapping long-lived startAccount pins startup diagnostics. (src/gateway/server.impl.ts:379, 1bd10cfee6a4)
  • Existing regression shape: Current tests already cover canceling pending startup before startAccount when stopped mid-boot; the PR should add the same protection for the new handoff delay window. (src/gateway/server-channels.test.ts:539, 1bd10cfee6a4)
  • Related report: The linked report describes liveness warnings pinned to channels.qqbot.start-account during Gateway startup, which aligns with the source-level diagnostic-phase behavior.
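
The diagnostic phase contract noted in the list above can be illustrated with a small sketch: a wrapper that awaits the wrapped promise keeps its phase active for as long as that promise is pending. `withPhase`, `active`, and the `channels.demo` phase name are hypothetical analogues, not the actual `withDiagnosticPhase` implementation.

```typescript
// Hypothetical registry of active phases.
const active = new Set<string>();

// Analogue of a phase wrapper that awaits the wrapped work before clearing.
async function withPhase<T>(phase: string, work: () => Promise<T>): Promise<T> {
  active.add(phase);
  try {
    return await work();
  } finally {
    active.delete(phase);
  }
}

// Stand-in for a long-lived channel task that settles only when released.
let release: () => void = () => {};
const channelTask = new Promise<void>((resolve) => {
  release = resolve;
});

const measured = withPhase("channels.demo.start-account", () => channelTask);

// While the channel task is pending, the phase stays pinned as active.
const pinnedWhileRunning = active.has("channels.demo.start-account");
console.log(pinnedWhileRunning); // true

// Only once the long-lived task settles does the phase clear.
release();
const clearedAfterSettle = measured.then(() => !active.has("channels.demo.start-account"));
```

Under this contract, wrapping the whole channel task in the measured span leaves the phase looking active for the task's entire lifetime, which matches the pinned liveness warnings described in the linked report.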

Likely related people:

  • steipete: git shortlog shows Peter Steinberger with the dominant commit count on src/gateway/server-channels.ts and its test, and commit 605a2c87ae... carried gateway restart trace behavior across respawn. (role: heavy area contributor and current PR author; confidence: high; commits: 605a2c87ae80; files: src/gateway/server-channels.ts, src/gateway/server-channels.test.ts, src/gateway/restart-trace.ts)
  • vincentkoc: Current checkout blame for the channel startup/stop region points to Vincent Koc, and nearby history includes gateway readiness/channel work. (role: recent area contributor; confidence: medium; commits: a85cd65775fc, ab5fcfcc0128; files: src/gateway/server-channels.ts, src/gateway/server-channels.test.ts)
  • samzong: Commit 587b06768f... introduced gateway restart trace instrumentation in src/gateway/restart-trace.ts and src/gateway/server.impl.ts, the adjacent diagnostic surface involved in this bug. (role: adjacent trace contributor; confidence: medium; commits: 587b06768f43; files: src/gateway/restart-trace.ts, src/gateway/server.impl.ts)

Remaining risk / open question:

  • Live QQ Gateway startup was not exercised in this read-only review; the PR body uses proof: override.
  • The current patch should not merge until the new handoff delay preserves the existing abort/manual-stop invariant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against b9921e21b93d.

@steipete force-pushed the fix/channel-startup-handoff-liveness-82398 branch from 696ec12 to ae54d81 on May 16, 2026 13:15

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ae54d810bc

Comment on lines +523 to +527
if (startupTrace) {
  await waitForChannelStartupHandoff();
}
let startAccountTask: ReturnType<typeof startAccount> | undefined;
await measureStartup(`channels.${channelId}.start-account-handoff`, () => {

[P1] Re-check stop state after startup handoff delay

When startupTrace is enabled, this path now waits a full setImmediate before invoking startAccount, but it does not re-check abort.signal.aborted or manuallyStopped after that wait. If stopChannel() runs in that handoff window, startAccount is still called for an account that was explicitly stopped, which can trigger unwanted reconnect/login work and stop-timeout behavior even though shutdown was requested.

@steipete force-pushed the fix/channel-startup-handoff-liveness-82398 branch from 0c1b517 to 8a4da89 on May 16, 2026 14:00

Labels

extensions: github-copilot; gateway (Gateway runtime); maintainer (Maintainer-authored PR); proof: override (Maintainer override for the external PR real behavior proof gate); size: S
