feat(core): SIGINT/SIGTERM graceful drain + unhandled-rejection redaction (PER-7855 phase 3/3)#2198

Closed
Shivanshu-07 wants to merge 1 commit into master from feat/per-7855-cli-qos-phase-3
Conversation

@Shivanshu-07
Contributor

@Shivanshu-07 Shivanshu-07 commented Apr 27, 2026

Summary

Final phase of PER-7855 — graceful drain on SIGINT/SIGTERM, global unhandled-rejection / uncaught-exception handlers, and a bonus fix for the existing POSIX child-tree leak in browser.js. Independent of Phase 1 (#2196) and Phase 2 (#2197); all three can land in any order.

R1 — Graceful drain on SIGINT/SIGTERM

  • New module-level shutdownState bag exposed to commands as ctx.shutdown so they can call percy.stop(ctx.shutdown.forced) for graceful-on-first-signal, force-on-second-signal behavior.
  • First signal logs ${signal} received, draining (press Ctrl-C again to force)... to stderr and arms a 30s drain timer that flips forced=true if the runner hasn't completed by then.
  • Second signal (or the 30s timer) flips forced=true and arms a 5s hard-exit safety timer.
  • Production exit codes: SIGINT → 130, SIGTERM → 143, surfaced via process.exit only when definition.exitOnError is true. Tests with exitOnError: false preserve the legacy clean-resolution (AbortError carries exitCode: 0).
  • start.js, snapshot.js, exec.js read ctx.shutdown.forced to choose percy.stop(force). Non-signal errors preserve the original force-stop behavior.
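
The two-stage behavior described above can be sketched as a small state machine. This is a minimal illustration, not the code in `command.js`: `createShutdown`, `onSignal`, `reset`, and the injectable `log` / `exit` options are hypothetical names introduced here; only the state fields (`signal`, `forced`, `drainTimer`, `hardExitTimer`), the 30s / 5s timeouts, the log line, and the exit codes come from this PR.

```javascript
// Hypothetical sketch of graceful-on-first-signal, force-on-second-signal.
const EXIT_CODES = { SIGINT: 130, SIGTERM: 143 }; // 128 + signal number

function createShutdown({
  drainMs = 30000,
  hardExitMs = 5000,
  log = console.error,
  exit = process.exit
} = {}) {
  const state = { signal: null, forced: false, drainTimer: null, hardExitTimer: null };

  // Flip to forced mode and arm the hard-exit safety timer.
  function force() {
    if (state.forced) return;
    state.forced = true;
    clearTimeout(state.drainTimer);
    state.hardExitTimer = setTimeout(() => exit(EXIT_CODES[state.signal] ?? 1), hardExitMs);
    state.hardExitTimer.unref(); // never keep an otherwise-finished process alive
  }

  function onSignal(signal) {
    if (state.signal === null) {
      // First signal: announce the drain and give the runner drainMs to finish.
      state.signal = signal;
      log(`${signal} received, draining (press Ctrl-C again to force)...`);
      state.drainTimer = setTimeout(force, drainMs);
      state.drainTimer.unref();
    } else {
      // Second signal: stop waiting.
      force();
    }
  }

  // Clear timers and state (for example between test specs).
  function reset() {
    clearTimeout(state.drainTimer);
    clearTimeout(state.hardExitTimer);
    Object.assign(state, { signal: null, forced: false, drainTimer: null, hardExitTimer: null });
  }

  return { state, onSignal, reset };
}
```

Under these assumptions, a command would wire it up with `process.on('SIGINT', () => shutdown.onSignal('SIGINT'))` and later call `percy.stop(shutdown.state.forced)`.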

R3 — Global unhandled-rejection / uncaught-exception handlers

  • Attached exactly once per Node process via ensureProcessHandlers().
  • Stack trace routed through redactSecrets() so CDP rejections that include serialized page-script bodies, Authorization headers, or cookie strings cannot leak via the new log path. (Phase 1 deepening security finding.)
  • Tags activeContext.runFailed = true; runs that complete cleanly but saw an unhandled rejection throw a synthetic exit-1 error at the end so CI doesn't see a green build.
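
A rough sketch of the attach-once pattern follows. The names `ensureProcessHandlers`, `activeContext`, and `runFailed` mirror the PR text; everything else is an assumption made for illustration. In particular, `redact` is a stand-in for `redactSecrets()` (whose real patterns are not shown in this PR), and `setActiveContext`, `finishRun`, and the injectable `proc` / `log` options are hypothetical.

```javascript
let attached = false;
let activeContext = null;

// Hypothetical helper: point the handlers at the run currently in progress.
function setActiveContext(ctx) { activeContext = ctx; }

// Stand-in for redactSecrets(); these two patterns are illustrative only.
function redact(text) {
  return String(text)
    .replace(/(authorization:\s*)(bearer\s+|basic\s+)?\S+/gi, '$1[REDACTED]')
    .replace(/(cookie:\s*)[^\n]+/gi, '$1[REDACTED]');
}

// Attach the global handlers at most once; later calls are no-ops.
function ensureProcessHandlers({ proc = process, log = console.error } = {}) {
  if (attached) return false;
  attached = true;

  const handle = kind => err => {
    const detail = err instanceof Error ? (err.stack || err.message) : String(err);
    log(`${kind}: ${redact(detail)}`); // redact before anything reaches the log
    if (activeContext) activeContext.runFailed = true;
  };

  proc.on('unhandledRejection', handle('Unhandled rejection'));
  proc.on('uncaughtException', handle('Uncaught exception'));
  return true;
}

// At the end of an otherwise clean run, turn a recorded failure into exit 1.
function finishRun(ctx) {
  if (ctx.runFailed) {
    throw Object.assign(new Error('Run saw an unhandled rejection'), { exitCode: 1 });
  }
}
```

Passing `proc` explicitly keeps the sketch testable with a plain `EventEmitter`; real code would rely on the default `process`.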

Bonus — POSIX child-tree leak in browser.js:207

The previous this.process.kill('SIGKILL') targeted only the lead Chromium pid. Despite spawning detached at :266, that left renderer / utility / zygote children orphaned on every kill. The fix matches the Puppeteer / Playwright convention: taskkill /pid <pid> /T /F on Windows; process.kill(-pid, 'SIGKILL') on POSIX (negative-pid signals the whole process group). Falls back to the lead-pid kill on either path's error so a missing process doesn't wedge _closed.
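
For illustration, the tree-kill-with-fallback approach might look like the sketch below. `killProcessTree` and its return values are hypothetical names, not the code in `browser.js`; the sketch assumes the child was spawned with `detached: true` so that on POSIX it leads its own process group.

```javascript
const { spawnSync } = require('child_process');

// Kill a child and its descendants; fall back to killing only the lead pid.
// Returns 'tree', 'lead', or 'none' so callers can tell which path ran.
function killProcessTree(child, { platform = process.platform } = {}) {
  const { pid } = child;
  // Guard: kill(-1) would signal every process the user owns.
  if (!Number.isInteger(pid) || pid <= 1) return 'none';

  try {
    if (platform === 'win32') {
      // /T includes child processes, /F forces termination.
      const result = spawnSync('taskkill', ['/pid', String(pid), '/T', '/F']);
      if (result.error || result.status !== 0) {
        throw result.error || new Error(`taskkill exited with ${result.status}`);
      }
    } else {
      // A negative pid targets the whole process group.
      process.kill(-pid, 'SIGKILL');
    }
    return 'tree';
  } catch {
    // The group or process may already be gone; try the lead pid and move on.
    try { child.kill('SIGKILL'); } catch { /* already exited */ }
    return 'lead';
  }
}
```

The fallback never rethrows, which is the property the description relies on: a missing process cannot block the close path.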

HTTP server graceful drain

Server.close() becomes async with a drainMs option (default 5s). Uses Node 18.2+ closeIdleConnections / closeAllConnections when available; falls back to manual socket-set iteration on Node 14 (Windows CI is pinned there). this.draining flag set for future Connection: close middleware.
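
A minimal sketch of this drain-then-destroy sequence, under stated assumptions: `closeWithDrain` and `trackSockets` are hypothetical helpers rather than the actual `Server.close()` implementation, and only `drainMs`, its 5s default, and the `closeIdleConnections` / `closeAllConnections` feature detection come from the description above.

```javascript
const http = require('http');

// Track open sockets so older Node versions can destroy them manually.
function trackSockets(server) {
  const sockets = new Set();
  server.on('connection', socket => {
    sockets.add(socket);
    socket.once('close', () => sockets.delete(socket));
  });
  return sockets;
}

// Stop accepting connections, let in-flight requests finish for up to
// drainMs, then drop whatever is still open.
function closeWithDrain(server, sockets, { drainMs = 5000 } = {}) {
  return new Promise(resolve => {
    let timer;
    server.close(() => { clearTimeout(timer); resolve(); });

    // Idle keep-alive sockets can go immediately where the API exists.
    if (typeof server.closeIdleConnections === 'function') {
      server.closeIdleConnections();
    }

    timer = setTimeout(() => {
      if (typeof server.closeAllConnections === 'function') {
        server.closeAllConnections();
      } else {
        for (const socket of sockets) socket.destroy();
      }
    }, drainMs);
  });
}
```

A caller would create the set once with `const sockets = trackSockets(server)` right after `http.createServer(...)` and later `await closeWithDrain(server, sockets)`.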

Test plan

  • CI green (Linux, macOS, Windows) modulo the same 27 pre-existing failures (21 install Chromium + 5 runDoctorOnFailure + 1 API server) that show on master and Phases 1/2
  • Phase 3: shutdown + unhandled-rejection + exit codes (cli-command/test/shutdown.test.js) — 4 new specs, all pass
  • Manual: `percy start`, Ctrl-C → drain message + clean exit 130
  • Manual: `percy start`, Ctrl-C, Ctrl-C again → forced exit ≤ 2s, no orphan Chromium processes (`ps -A | grep -i chrom`)
  • Manual: kill -TERM → exit 143
  • Windows manual: console Ctrl+C → drain message + clean exit (SIGINT works on Windows; SIGTERM is documented as best-effort because Windows can't deliver it gracefully)

Pre-existing test infrastructure note

cli-snapshot/test/file.test.js has 4 pre-existing ESM-resolver failures (mockfs cannot intercept dynamic import() of test fixtures like pages.js). Unrelated to this PR. Same 4 failures appear on master.

Test infrastructure

  • `_resetShutdownForTest()` exported from `@percy/cli-command` for spec isolation
  • Module-level state auto-resets at the start of each `runCommandWithContext` so back-to-back specs don't leak signal state
  • `try/finally` in `runCommandWithContext` ensures per-run signal listeners are always removed (eliminates the pre-existing `MaxListenersExceededWarning` that surfaced when running these tests)
  • Updated existing assertions in `command.test.js` and `exec.test.js` for the new "draining" stderr line and the absence of the legacy "Stopping percy..." log on graceful single-signal interrupts
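
The listener-cleanup guarantee can be illustrated with a `try/finally` wrapper. This is a hedged sketch: `runWithSignalListeners` and its parameters are hypothetical and stand in for the relevant part of `runCommandWithContext`.

```javascript
// Register per-run signal listeners and always remove them afterwards,
// whether the run resolves, rejects, or throws synchronously.
async function runWithSignalListeners(run, onSignal, proc = process) {
  const handlers = {
    SIGINT: () => onSignal('SIGINT'),
    SIGTERM: () => onSignal('SIGTERM')
  };

  for (const [signal, handler] of Object.entries(handlers)) proc.on(signal, handler);
  try {
    return await run();
  } finally {
    // Without this, repeated runs in one process accumulate listeners and
    // eventually trigger MaxListenersExceededWarning.
    for (const [signal, handler] of Object.entries(handlers)) proc.off(signal, handler);
  }
}
```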

Risks

| Risk | Mitigation |
| --- | --- |
| Tests that emit `process.emit('SIGINT')` and expect empty stderr | Updated to expect the drain announcement |
| Tests that expect `Stopping percy...` log on signal interrupt | Updated; graceful drain doesn't force-stop, so that log doesn't fire |
| `process.exit(130)` in test mode kills the test runner | Production-only via `definition.exitOnError` gate |
| Browser child-tree kill change (POSIX negative-pid) might fail in containers | Fallback to lead-pid SIGKILL preserves at-least-as-good-as-before behavior |
| Drain hangs at `Percy.stop(false)` | 5s hard-exit safety timer after second signal / 30s drain timeout |

Post-Deploy Monitoring & Validation

  • What to monitor/search
    • Logs: orphaned Chromium reports in `#percy-cli` Slack should drop to zero post-merge
    • Logs: any unhandled rejection appearing in build logs — should now have `[REDACTED]` markers replacing previously-leaked secrets in stack traces
    • Metrics: build cancel rate (graceful drain may slightly increase wall-clock time for cancelled builds; should be <30s)
  • Validation checks
    • `ps -A | grep -i chrom` after a SIGINT'd `percy start` — must be empty
    • On Windows: `tasklist | findstr chrome` after Ctrl+C — must be empty
  • Expected healthy behavior
    • SIGINT → drain message + exit 130, no orphans
    • SIGTERM → drain message + exit 143, no orphans
    • SIGKILL on Percy → lockfile reclaimed by next start (Phase 2)
  • Failure signal(s) / rollback trigger
    • Reports of stuck builds that never exit on Ctrl-C — drain hang somewhere in `percy.stop(false)`
    • Any chromium child accumulating in process listings post-Ctrl-C
  • Validation window & owner
    • Window: 1 week post-merge (this PR has the highest production risk of the three phases)
    • Owner: @shivanshu.si

Origin / Plan


Compound Engineering v2.50.0
🤖 Generated with Claude Opus 4.7 (1M context, extended thinking) via Claude Code

…tion (PER-7855)

Phase 3 of PER-7855 CLI QoS hardening, plus a bonus fix for the
existing POSIX child-tree leak in `browser.js`.

R1 — Graceful drain on SIGINT/SIGTERM (`cli-command/src/command.js`):

- New module-level `shutdownState` bag (`signal`, `forced`,
  `drainTimer`, `hardExitTimer`) is exposed to commands as
  `ctx.shutdown` so they can call `percy.stop(ctx.shutdown.forced)`
  for graceful-on-first-signal, force-on-second-signal behavior.
- First SIGINT/SIGTERM logs `${signal} received, draining (press Ctrl-C
  again to force)...` to stderr and arms a 30s drain timer that flips
  `shutdown.forced=true` if the runner hasn't completed.
- Second signal (or the 30s timer) flips `forced=true` immediately and
  arms a 5s hard-exit safety timer to bail if `percy.stop(true)` hangs.
- Production exit codes: SIGINT→130, SIGTERM→143, surfaced via
  `process.exit` only when `definition.exitOnError` is true. Tests
  with `exitOnError:false` preserve the legacy clean-resolution
  behavior because AbortError still carries `exitCode:0`.
- `start.js`, `snapshot.js`, `exec.js` callbacks now read
  `ctx.shutdown.forced` to choose the `percy.stop(force)` argument.
  Non-signal errors preserve the original force-stop behavior.

R3 — Global unhandled-rejection / uncaught-exception handlers:

- Attached exactly once per process by `ensureProcessHandlers()` (called
  on every runner invocation; no-op after first attach).
- Stack trace routed through `redactSecrets()` so CDP rejections that
  include serialized page-script bodies, Authorization headers, or
  cookie strings cannot leak via the new log path.
- Sets `activeContext.runFailed=true`; runs that complete cleanly but
  saw an unhandled rejection now throw a synthetic exit-1 error at
  the end so CI doesn't see a green build.

Bonus — POSIX child-tree leak in `core/src/browser.js:207`:

The previous `this.process.kill('SIGKILL')` targeted only the lead
Chromium pid. Despite spawning detached at `:266`, that left renderer
/ utility / zygote children orphaned on every kill. The fix matches
the Puppeteer / Playwright convention: shell out to `taskkill /T /F`
on Windows; on POSIX use `process.kill(-pid, 'SIGKILL')` to signal
the whole process group. Falls back to the old lead-pid kill on
either path's error so a missing process doesn't wedge `_closed`.

HTTP server graceful drain (`core/src/server.js`):

`Server.close()` becomes async with a `drainMs` option (default 5s).
Uses Node 18.2+ `closeIdleConnections` / `closeAllConnections` when
available; falls back to manual socket-set iteration on Node 14
(Windows CI is pinned there per `.github/workflows/windows.yml:15`).
The `this.draining` flag is set so future request middleware can
emit `Connection: close` headers.

Test infrastructure:

- `_resetShutdownForTest()` exported from `@percy/cli-command` for
  spec isolation; module-level state is also auto-reset at the start
  of each `runCommandWithContext` so back-to-back specs don't leak
  signal state.
- `try/finally` in `runCommandWithContext` ensures per-run signal
  listeners are always removed, even on paths where
  `generatePromise`'s cleanup callback wouldn't fire — eliminates
  the MaxListenersExceededWarning that was a pre-existing concern.
- Updated `command.test.js` and `cli-exec/test/exec.test.js`
  assertions for the new "draining" announcement on stderr and the
  removal of the legacy "Stopping percy..." log on graceful (single-
  signal) interrupts.

Tests added: `cli-command/test/shutdown.test.js` (4 specs) covers
SIGINT→130, SIGTERM→143, `shutdown.forced` transition on first vs
second signal, and the redactSecrets path for unhandled rejections.

Origin:        docs/brainstorms/2026-04-24-per-7855-cli-qos-hardening-requirements.md
Plan:          docs/plans/2026-04-27-001-feat-per-7855-cli-qos-hardening-plan.md
Phase 1:       commit e135e9a (network refactors + redaction + hint)
Phase 2:       commit e8a6d44 (per-port lockfile)

Co-Authored-By: Claude Opus 4.7 (1M context, extended thinking) <noreply@anthropic.com>
@Shivanshu-07
Contributor Author

Closing in favor of consolidated PR #2199, which contains all three commits (the same content) so review can happen against a single diff.
