Skip to content

fix(daemon): SIGKILL fallback when stale daemon refuses graceful shutdown#1861

Merged
jackwener merged 2 commits into
mainfrom
fix/stale-daemon-sigkill
Jun 5, 2026
Merged

fix(daemon): SIGKILL fallback when stale daemon refuses graceful shutdown#1861
jackwener merged 2 commits into
mainfrom
fix/stale-daemon-sigkill

Conversation

@jackwener
Copy link
Copy Markdown
Owner

Summary

When the CLI detects a stale daemon (daemonVersion !== PKG_VERSION after npm install -g @jackwener/opencli@latest), it currently asks the daemon to exit via POST /shutdown and waits up to 3 seconds for the port to release. If the old daemon hangs, refuses /shutdown, or the endpoint is missing entirely (pre-shutdown-endpoint version), the port stays held and the user sees Stale daemon could not be replaced. They then have to run opencli daemon stop && opencli doctor manually.

99% of "I just upgraded and have to run opencli doctor every time" reports land here.

This patch reads the stale daemon's pid from its existing /status response (src/daemon.ts:252 already exposes pid: process.pid) and falls back to process.kill(pid, 'SIGKILL') after graceful shutdown fails. Cross-platform: Node's process.kill(_, 'SIGKILL') maps to TerminateProcess on Windows, so no taskkill shell-out is needed.

The user-visible Stale daemon could not be replaced error now only fires when both graceful shutdown AND SIGKILL fail (cross-user owner / cross-machine PID, neither of which is reachable from a normal CLI invocation).

Reviewer

@coding-claude-opus (independent cross-check on the deep-investigation thread).

Test plan

  • Added 2 unit tests in src/browser.test.ts:
    • SIGKILL succeeds → stale path is passed; subsequent no-extension wait surfaces (proves the stale branch is no longer terminal).
    • SIGKILL throws EPERM AND waitForDaemonStop keeps returning false → falls through to the original stale-daemon error.
  • All 5 stale-daemon tests pass locally (npx vitest run src/browser.test.ts -t "stale").
  • npm run docs:build clean.
  • npx tsc --noEmit clean.
  • CI green.

Out of scope

  • Extension version mismatch (Chrome Web Store / unpacked dev install lag) — separate concern, opt-out of this PR per @coding-claude-opus's split proposal. Can be a follow-up with status adding extensionVersion if WAWQAQ wants.

…down

When the CLI detects a stale daemon (`daemonVersion !== PKG_VERSION` after
`npm install -g @jackwener/opencli@latest`), it currently asks the daemon to
exit via `POST /shutdown` and waits up to 3 seconds for the port to release.
If the old daemon hangs, refuses /shutdown, or the endpoint is missing
entirely (pre-shutdown-endpoint version), the port stays held and the user
sees `Stale daemon could not be replaced` with a `opencli daemon stop` hint.

99% of "I just upgraded and have to run `opencli doctor` every time" reports
land here: the user upgraded the CLI but the persistent daemon survived from
a previous install, and graceful shutdown is unreliable across versions.

This patch reads the stale daemon's pid from its existing `/status` response
(daemon.ts:252 already exposes `pid: process.pid`) and falls back to
`process.kill(pid, 'SIGKILL')` after graceful shutdown fails, then waits
another 2s for the port to release. Cross-platform: Node's
`process.kill(_, 'SIGKILL')` maps to `TerminateProcess` on Windows, so no
`taskkill` shell-out is needed.

The user-visible "Stale daemon could not be replaced" error only fires when
both graceful shutdown AND SIGKILL fail (cross-user owner / cross-machine
PID — neither is reachable from a normal CLI invocation anyway). The hint
message is updated to reflect that.

Adds 2 tests:
- SIGKILL succeeds → bridge proceeds past the stale block (and eventually
  fails the no-extension wait, proving the stale branch was passed cleanly).
- SIGKILL throws EPERM AND waitForDaemonStop still returns false → falls
  through to the existing stale-daemon error.

Troubleshooting docs note the new auto-fallback.
Copy link
Copy Markdown
Owner Author

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — 22/22 tests pass, typecheck clean. Core fix是干净的 minimal patch(~13 行),用现有 health.status.pid 走 SIGKILL,跨平台正确(Node 在 Windows 把 process.kill(_, 'SIGKILL') 映射成 TerminateProcess)。

两条 非阻塞 observation,是否处理你拍:

Nit 1: ESRCH race — kill 抛异常时漏掉了第二次 poll

bridge.ts:91-98

try {
  process.kill(stalePid, 'SIGKILL');
  portReleased = await waitForDaemonStop(2000);
} catch {
  // EPERM / ESRCH: fall through
}

ESRCH 场景:daemon 在 status 查询和 SIGKILL 之间窗口里自己挂了(很窄但存在)→ process.kill 抛 ESRCH → catch 吞掉 → 不 pollportReleased 留 false → 报错 "Stale daemon could not be replaced"。但此时进程已死、端口大概率已释放,下条命令一定 OK。多一次假错误。

更紧的写法:把 poll 拉到 try/catch 外面,无论 throw 与否都问 OS:

try {
  process.kill(stalePid, 'SIGKILL');
} catch {
  // EPERM (cross-user) / ESRCH (already dead): port may still be free
}
portReleased = await waitForDaemonStop(2000);

三种情况都对:成功→端口释放→poll true;ESRCH→端口本来就释放了→poll true;EPERM→端口仍占→poll false→报错。

Nit 2: 既有 stale tests pid: 1 现在有副作用

PR 前,browser.test.ts:189/210 那两个 stale 测试 mock 了 getDaemonHealth.status.pid = 1 但根本不走 SIGKILL 路径,pid 数字无所谓。PR 后这俩测试会实际执行 process.kill(1, 'SIGKILL') —— 对 init/launchd 发信号,EPERM 兜住所以测试还过。但本地/CI 多了个对 PID 1 的 syscall side effect。

建议这俩既有 test 顺手改成:要么 vi.spyOn(process, 'kill').mockImplementation(() => true),要么 pid: -1(Node 会直接 EINVAL)。


两个都不阻塞 merge。Nit 1 顺手改利落一些;Nit 2 是 hygiene。你拍要不要现在改还是 follow-up。

- bridge.ts: move `await waitForDaemonStop(2000)` out of the try/catch so the
  port poll always runs after `process.kill`, even when the kill itself throws
  ESRCH (target already dead) or EPERM (cross-user owner).
- browser.test.ts: bump the existing 3 stale-daemon test fixtures from
  `pid: 1` to `pid: 999999` so the new SIGKILL fallback no longer fires a
  signal at init when the older tests reach the fallback path.
- browser.test.ts: mock `waitForDaemonStop` in those 3 tests too, since the
  real implementation now polls for 2s in the fallback path (test runtime
  was up to ~6s before; back to ~400ms).
@jackwener jackwener merged commit 3f1a723 into main Jun 5, 2026
11 checks passed
@jackwener jackwener deleted the fix/stale-daemon-sigkill branch June 5, 2026 11:20
@jackwener jackwener mentioned this pull request Jun 5, 2026
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant