Skip to content

feat(core): per-port lockfile with stale-lock reclaim (PER-7855 phase 2/3)#2197

Closed
Shivanshu-07 wants to merge 1 commit intomasterfrom
feat/per-7855-cli-qos-phase-2
Closed

feat(core): per-port lockfile with stale-lock reclaim (PER-7855 phase 2/3)#2197
Shivanshu-07 wants to merge 1 commit intomasterfrom
feat/per-7855-cli-qos-phase-2

Conversation

@Shivanshu-07
Copy link
Copy Markdown
Contributor

@Shivanshu-07 Shivanshu-07 commented Apr 27, 2026

Summary

Phase 2 of PER-7855. Independent of Phase 1 (#2196) — different files, different concern. Both can land in either order.

Today, after a Percy crash, the next `percy start` on the same port fails late and noisily with `EADDRINUSE` once `server.listen()` runs. This PR adds an upfront per-port lockfile with stale-lock reclaim so the second invocation refuses cleanly with an actionable message naming the live pid and lock-file path.

What's new

  • `packages/core/src/lock.js` (new):
    • `acquireLock({port})` writes `~/.percy/agent-.lock` atomically via `wx`. Payload `{pid, port, startedAt}`; mode `0o600` on the file, `0o700` on the parent dir.
    • `LockHeldError` carries `{meta, lockPath}` so the refusal message can name the live pid + lock path for manual cleanup.
    • Stale-lock reclaim via `process.kill(pid, 0)` liveness probe: ESRCH = dead → reclaim; EPERM = alive-but-foreign → refuse; self-pid → reclaim (we cannot conflict with ourselves).
    • Reclaim is unlink + retry-`wx`, not rename-based: Windows CI is pinned to Node 14 (`.github/workflows/windows.yml:15`) where `fs.renameSync` over an existing target is unreliable.
  • `Percy.start()` acquires the lock as the first step inside the `try {` (before monitoring, proxy detection, queue starts), so a held-lock fails fast before burning real work.
  • `Percy.start()` registers a one-shot `process.on('exit')` synchronous unlink as last-chance cleanup if the process exits without a normal `stop()`. (Phase 3 will replace this with a signal-driven drain.)
  • `Percy.stop()` releases the lock in the `finally` block.

Backwards compatibility

When the lock is held, the `start()` catch maps `LockHeldError` to the legacy "Percy is already running or the port \$port is in use" message string (downstream tooling may grep for it) AND also `log.error`s the actionable detail (live pid, lockfile path) so users can recover. The existing test at `Percy #start() throws when the port is in use` continues to pass unchanged.

Tests

  • 13 new unit specs (`core/test/unit/lock.test.js`) covering:
    • Lock content (pid, port, startedAt)
    • mkdir-p behavior
    • SC3 stale-pid reclaim (recorded pid is non-existent)
    • SC4 live-foreign refusal with actionable message
    • Self-pid stale optimization (avoids cross-test contamination)
    • EPERM-as-alive (cross-user pid)
    • Corrupt-payload recovery (truncated JSON)
    • SC5 distinct ports lock concurrently
    • Mode bits `0o600`/`0o700` on POSIX
    • Release idempotency (no-op for missing/already-removed handle)
    • Re-acquire after release

Test infra change

`core/test/helpers/index.js` — added `/.percy/agent-` to the mockfs `$bypass` list so lock files go through the real fs rather than the in-memory mock. Files are cleaned by `Percy.stop()`'s release path; the self-pid stale optimization in `lock.js` handles same-process collisions during sequential Jasmine runs.

Test run on this branch: 697 specs, 27 pre-existing failures (same 21 `Unit / Install Chromium` + 5 `runDoctorOnFailure` + 1 `API Server when the server is disabled` as on master). All 13 new Lock tests pass; all Percy tests that exercise start/stop continue to pass.

Test plan

  • CI green (modulo the pre-existing failures above)
  • Manual: `rm -rf ~/.percy/`, run `percy start`, kill -9, run `percy start` again — second succeeds (stale reclaim)
  • Manual: `percy start` in two terminals on the same port — second refuses with the new actionable message
  • Manual: `percy start --port 5338` and `percy start --port 5339` concurrently — both succeed (SC5)
  • Windows CI verifies cross-platform process.kill(pid, 0) and wx+unlink reclaim

Risks

Risk Mitigation
Lock file leaks on hard kill (SIGKILL) Reclaimed on next start via `process.kill(pid, 0)`
Multi-user host: another user's pid happens to match Treated as alive (EPERM); user must manually delete the file (path is in the refusal message)
Test pollution (real lockfiles appearing on dev machines during `yarn test`) Self-pid optimization + stop() always releases; cleaned within the test process. Files appear briefly under `~/.percy/` during a test run.
Restricted CI without writable `$HOME` acquire surfaces EACCES via the catch path with an actionable message; future ticket can add tmpdir fallback if real users hit this

Post-Deploy Monitoring & Validation

  • What to monitor/search
    • Logs: any new `LockHeldError` reports — should drop after legitimate stale locks are reclaimed once
    • Filesystem: stale lockfiles under `~/.percy/` accumulating on dev machines (would indicate stop is not running)
  • Validation checks
    • `ls ~/.percy/` after a clean `percy start && percy stop` — should be empty
    • Manual: kill -9 a running Percy, then start again — confirm reclaim works
  • Expected healthy behavior
    • Second `percy start` on same port refuses cleanly with pid + path
    • Crashed Percy's lock auto-reclaims on next start
  • Failure signal(s) / rollback trigger
    • Users report "Percy is already running" but no Percy is running and `~/.percy/agent-X.lock` is present (liveness check broken — probably PID reuse on a long-running host or the EPERM heuristic is wrong)
  • Validation window & owner
    • Window: 1 week post-merge
    • Owner: @shivanshu.si

Origin / Plan


Compound Engineering v2.50.0
🤖 Generated with Claude Opus 4.7 (1M context, extended thinking) via Claude Code

Phase 2 of PER-7855 CLI QoS hardening — short-circuit "Percy already
running" at command entry instead of failing late and noisily with
EADDRINUSE on `server.listen()`.

New module `core/src/lock.js`:
- `acquireLock({port})` writes `~/.percy/agent-<port>.lock` atomically
  via `wx`. Payload is `{pid, port, startedAt}`; mode `0o600` on the
  file, `0o700` on the parent dir.
- `LockHeldError` carries `{meta, lockPath}` so the refusal message
  can name the live pid + lock path for manual cleanup.
- Stale-lock reclaim: `process.kill(pid, 0)` liveness probe; ESRCH
  treated as dead, EPERM as alive-but-foreign. A self-pid lock (left
  over by an earlier in-process invocation) is reclaimed without
  consulting `process.kill` — we cannot conflict with ourselves.
- Reclaim is unlink + retry-`wx`, NOT rename-based: Windows CI is
  pinned to Node 14 (`.github/workflows/windows.yml:15`), where
  `fs.renameSync` over an existing target is unreliable.

`Percy.start()`:
- Acquires the lock as the first step inside `try {` (before
  monitoring, proxy detection, queue starts), so a held-lock fails
  fast.
- Registers a one-shot `process.on('exit')` synchronous unlink as
  last-chance cleanup if the process exits without a normal `stop()`.
  Phase 3 will replace this with a signal-driven drain.

`Percy.stop()`:
- Releases the lock in the `finally` block, alongside monitoring
  teardown. Idempotent: re-running release on an already-released
  handle is a no-op.

Backwards compatibility: when the lock is held, the start() catch maps
`LockHeldError` to the legacy "Percy is already running or the port X
is in use" message string (downstream tooling may grep for it) AND
also logs the actionable detail (live pid, lockfile path) via
`log.error` so users can recover.

Test infrastructure (`core/test/helpers/index.js`):
- Added `~/.percy/agent-*` to the mockfs `$bypass` list so lock files
  go through the real fs rather than the in-memory mock. Files are
  cleaned by `Percy.stop()`'s release path; the self-pid stale
  optimization handles same-process collisions during sequential
  Jasmine runs.

Tests added: 13 unit specs (`core/test/unit/lock.test.js`) covering
SC3 stale reclaim, SC4 live-foreign refusal, SC5 multi-port,
EPERM-as-alive, corrupt-payload recovery, mkdir-p, mode bits on POSIX,
release idempotency, re-acquire after release.

Origin:        docs/brainstorms/2026-04-24-per-7855-cli-qos-hardening-requirements.md
Plan:          docs/plans/2026-04-27-001-feat-per-7855-cli-qos-hardening-plan.md
Phase 1:       commit e135e9a (network refactors + redaction + hint)
Phase 3 next:  signal drain + unhandled-rejection handlers (PER-7855)

Co-Authored-By: Claude Opus 4.7 (1M context, extended thinking) <noreply@anthropic.com>
Comment thread packages/core/src/lock.js
}

export function lockPathFor(port) {
return join(os.homedir(), '.percy', `agent-${port}.lock`);
@Shivanshu-07
Copy link
Copy Markdown
Contributor Author

Closing in favor of consolidated PR #2199, which contains all three commits (the same content) so review can happen against a single diff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants