fix(daemon): SIGKILL fallback when stale daemon refuses graceful shutdown#1861
Conversation
…down When the CLI detects a stale daemon (`daemonVersion !== PKG_VERSION` after `npm install -g @jackwener/opencli@latest`), it currently asks the daemon to exit via `POST /shutdown` and waits up to 3 seconds for the port to release. If the old daemon hangs, refuses /shutdown, or the endpoint is missing entirely (pre-shutdown-endpoint version), the port stays held and the user sees `Stale daemon could not be replaced` with a `opencli daemon stop` hint. 99% of "I just upgraded and have to run `opencli doctor` every time" reports land here: the user upgraded the CLI but the persistent daemon survived from a previous install, and graceful shutdown is unreliable across versions. This patch reads the stale daemon's pid from its existing `/status` response (daemon.ts:252 already exposes `pid: process.pid`) and falls back to `process.kill(pid, 'SIGKILL')` after graceful shutdown fails, then waits another 2s for the port to release. Cross-platform: Node's `process.kill(_, 'SIGKILL')` maps to `TerminateProcess` on Windows, so no `taskkill` shell-out is needed. The user-visible "Stale daemon could not be replaced" error only fires when both graceful shutdown AND SIGKILL fail (cross-user owner / cross-machine PID — neither is reachable from a normal CLI invocation anyway). The hint message is updated to reflect that. Adds 2 tests: - SIGKILL succeeds → bridge proceeds past the stale block (and eventually fails the no-extension wait, proving the stale branch was passed cleanly). - SIGKILL throws EPERM AND waitForDaemonStop still returns false → falls through to the existing stale-daemon error. Troubleshooting docs note the new auto-fallback.
jackwener
left a comment
There was a problem hiding this comment.
LGTM — 22/22 tests pass, typecheck clean. Core fix是干净的 minimal patch(~13 行),用现有 health.status.pid 走 SIGKILL,跨平台正确(Node 在 Windows 把 process.kill(_, 'SIGKILL') 映射成 TerminateProcess)。
两条 非阻塞 observation,是否处理你拍:
Nit 1: ESRCH race — kill 抛异常时漏掉了第二次 poll
bridge.ts:91-98:
try {
process.kill(stalePid, 'SIGKILL');
portReleased = await waitForDaemonStop(2000);
} catch {
// EPERM / ESRCH: fall through
}ESRCH 场景:daemon 在 status 查询和 SIGKILL 之间窗口里自己挂了(很窄但存在)→ process.kill 抛 ESRCH → catch 吞掉 → 不 poll → portReleased 留 false → 报错 "Stale daemon could not be replaced"。但此时进程已死、端口大概率已释放,下条命令一定 OK。多一次假错误。
更紧的写法:把 poll 拉到 try/catch 外面,无论 throw 与否都问 OS:
try {
process.kill(stalePid, 'SIGKILL');
} catch {
// EPERM (cross-user) / ESRCH (already dead): port may still be free
}
portReleased = await waitForDaemonStop(2000);三种情况都对:成功→端口释放→poll true;ESRCH→端口本来就释放了→poll true;EPERM→端口仍占→poll false→报错。
Nit 2: 既有 stale tests pid: 1 现在有副作用
PR 前,browser.test.ts:189/210 那两个 stale 测试 mock 了 getDaemonHealth.status.pid = 1 但根本不走 SIGKILL 路径,pid 数字无所谓。PR 后这俩测试会实际执行 process.kill(1, 'SIGKILL') —— 对 init/launchd 发信号,EPERM 兜住所以测试还过。但本地/CI 多了个对 PID 1 的 syscall side effect。
建议这俩既有 test 顺手改成:要么 vi.spyOn(process, 'kill').mockImplementation(() => true),要么 pid: -1(Node 会直接 EINVAL)。
两个都不阻塞 merge。Nit 1 顺手改利落一些;Nit 2 是 hygiene。你拍要不要现在改还是 follow-up。
- bridge.ts: move `await waitForDaemonStop(2000)` out of the try/catch so the port poll always runs after `process.kill`, even when the kill itself throws ESRCH (target already dead) or EPERM (cross-user owner). - browser.test.ts: bump the existing 3 stale-daemon test fixtures from `pid: 1` to `pid: 999999` so the new SIGKILL fallback no longer fires a signal at init when the older tests reach the fallback path. - browser.test.ts: mock `waitForDaemonStop` in those 3 tests too, since the real implementation now polls for 2s in the fallback path (test runtime was up to ~6s before; back to ~400ms).
Summary
When the CLI detects a stale daemon (
daemonVersion !== PKG_VERSIONafternpm install -g @jackwener/opencli@latest), it currently asks the daemon to exit viaPOST /shutdownand waits up to 3 seconds for the port to release. If the old daemon hangs, refuses/shutdown, or the endpoint is missing entirely (pre-shutdown-endpoint version), the port stays held and the user seesStale daemon could not be replaced. They then have to runopencli daemon stop && opencli doctormanually.99% of "I just upgraded and have to run
opencli doctorevery time" reports land here.This patch reads the stale daemon's pid from its existing
/statusresponse (src/daemon.ts:252already exposespid: process.pid) and falls back toprocess.kill(pid, 'SIGKILL')after graceful shutdown fails. Cross-platform: Node'sprocess.kill(_, 'SIGKILL')maps toTerminateProcesson Windows, so notaskkillshell-out is needed.The user-visible
Stale daemon could not be replacederror now only fires when both graceful shutdown AND SIGKILL fail (cross-user owner / cross-machine PID, neither of which is reachable from a normal CLI invocation).Reviewer
@coding-claude-opus (independent cross-check on the deep-investigation thread).
Test plan
src/browser.test.ts:waitForDaemonStopkeeps returning false → falls through to the original stale-daemon error.npx vitest run src/browser.test.ts -t "stale").npm run docs:buildclean.npx tsc --noEmitclean.Out of scope
statusaddingextensionVersionif WAWQAQ wants.