Harden deploy memory watchdog; fix phone-signer 'step 5 of 4' #22

Merged
UtkarshBhardwaj007 merged 1 commit into main from ub/deploy-watchdog-and-signing-counter-fixes on Apr 18, 2026

@UtkarshBhardwaj007

Summary

Two-part fix for dot deploy in phone-signer mode:

  • Memory: the 4 GB RSS watchdog now runs in a worker_threads Worker so it can't be starved by main-thread microtask floods (which is how we ended up with a 20 GB RSS and a mystery zsh: killed last round). The watchdog samples every 1 s and force-exits with a clear abort message. This PR also adds DOT_DEPLOY_VERBOSE=1, which streams every bulletin-deploy log line to stderr with [+<seconds>s] timestamps, and destroys the idle Asset Hub client right after preflight.
  • Signing counter: for PoP-gated names, dot deploy --signer phone --playground actually fires 5 txs (setPop + commit + finalize + setCH + playground publish), not 4. The approvals list was hardcoded to 3 DotNS taps, so the summary card advertised "4 approvals" and the phone prompt later said "approve step 5 of 4". Fixed by predicting needsPopUpgrade in the availability check and threading a structured plan through to resolveSignerSetup + the signing proxy.
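
As a rough illustration of how a plan-driven approvals list yields the counts quoted in this PR, here is a minimal sketch. It is not the code in `signerMode.ts`: the `DeployPlan` and `Approval` types, the `buildApprovals` name, and every label except "Grant Proof of Personhood" are assumptions made for the example.

```typescript
// Hypothetical shapes; the real plan type lives in the repo and may differ.
type DeployPlan = { action: "register" | "update"; needsPopUpgrade: boolean };
type Approval = { kind: "dotns" | "playground"; label: string };

// Builds the ordered list of phone approvals for one deploy.
// The summary card and the runtime counter would both read this one list.
function buildApprovals(plan: DeployPlan, publishToPlayground: boolean): Approval[] {
  const approvals: Approval[] = [];
  if (plan.action === "register") {
    if (plan.needsPopUpgrade) {
      // Extra tap for the setUserPopStatus call on PoP-gated labels.
      approvals.push({ kind: "dotns", label: "Grant Proof of Personhood" });
    }
    approvals.push(
      { kind: "dotns", label: "Commit registration" }, // label is illustrative
      { kind: "dotns", label: "Finalize registration" }, // label is illustrative
    );
  }
  // Both register and update end by setting the contenthash.
  approvals.push({ kind: "dotns", label: "Set contenthash" });
  if (publishToPlayground) {
    approvals.push({ kind: "playground", label: "Publish to playground" });
  }
  return approvals;
}
```

With these assumptions, a PoP-gated register with playground publish produces 5 entries, a plain register produces 4, and a re-deploy of an owned domain produces 2 (1 DotNS tap plus the publish).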

Test plan

  • pnpm exec tsc --noEmit clean
  • pnpm exec biome format . clean
  • pnpm test — 170/170 (161 existing + 9 new):
    • 4 in availability.test.ts: needsPopUpgrade prediction, update path, RPC-flake fallback
    • 2 in run.test.ts — 5-approval (PoP upgrade) + 2-approval (re-deploy) phone flows
    • 3 in signingProxy.test.ts — counter clamps total upward when step > total
  • Manual smoke: dot deploy --signer phone --playground on a PoP-gated label — summary card reports 5 approvals, phone prompt shows "step 1..5 of 5", no mystery SIGKILL
  • Manual smoke: re-deploy of an already-owned domain — summary card reports 1 DotNS approval (contenthash only), no register/commit prompts fire
  • Manual smoke: DOT_MEMORY_TRACE=1 DOT_DEPLOY_VERBOSE=1 dot deploy 2>/tmp/dot-trace.log — RSS stays stable, trace contains timestamped bulletin-deploy lines + memory samples

Background

The "step 5 of 4" symptom from the previous diagnostic run surfaced after we added DOT_DEPLOY_VERBOSE=1. That run confirmed that the deploy itself completed fine, that memory stayed under 2 GB, and that the miscounted counter was a separate UX bug tied to setUserPopStatus firing on PoP-gated labels. Both classes of bug are addressed in this PR; the verbose log and worker watchdog remain in place as defense-in-depth in case the 20 GB leak resurfaces.

Two-part fix addressing (1) a 20 GB RSS growth + freeze during chunk upload
in phone-signer mode that macOS jetsam terminated with SIGKILL, and (2) a
phone-signer approval counter that printed "approve step 5 of 4" whenever
the domain required a Proof-of-Personhood upgrade.

Memory hardening
----------------
* `src/utils/process-guard.ts` moves the RSS watchdog into a `worker_threads`
  Worker. The previous `setInterval`-on-main version was starved by
  polkadot-api subscription microtask floods; RSS climbed past 10 GB between
  samples and macOS jetsam delivered SIGKILL before we could print any abort
  message. The worker has its own event loop and samples at 1 s (down from
  5 s). When the cap is crossed, it SIGKILLs the whole process with a clear
  reason; on main-thread shutdown it tears down via `postMessage("stop")`
  with a `terminate()` fallback.
* `src/utils/deploy/storage.ts` gains a `DOT_DEPLOY_VERBOSE=1` env var that
  passes every bulletin-deploy log line through to stderr with a
  `[+<seconds>s]` prefix. Previously the interceptor dropped everything that
  wasn't a phase banner or `[N/M]` chunk line, which made freeze reports
  diagnostically opaque. Pair with `DOT_MEMORY_TRACE=1` to correlate log
  events with RSS growth.
* `src/commands/deploy/index.ts` destroys the Asset Hub client immediately
  after preflight. Nothing in the deploy flow between preflight and
  playground publish uses `getConnection()`, so holding an idle polkadot-api
  client + live best-block subscription for the full deploy window was
  unnecessary background pressure. Publish re-establishes via
  `getConnection()`.
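
The worker-based watchdog from the first bullet can be sketched roughly as follows. This is an illustration, not the contents of `src/utils/process-guard.ts`: the `startRssWatchdog` name, the inline `eval` worker source, the grace period, and the synchronous write to fd 2 are all assumptions chosen to keep the example in one file.

```typescript
import { Worker } from "node:worker_threads";

// Worker body as plain JS. It runs on its own event loop, so a busy
// main thread cannot delay the sampling timer.
const WATCHDOG_SOURCE = `
const { parentPort, workerData } = require("node:worker_threads");
const { writeSync } = require("node:fs");
const timer = setInterval(() => {
  const rss = process.memoryUsage().rss; // RSS is process-wide, not per-thread
  if (rss > workerData.capBytes) {
    // Synchronous write to stderr: worker process.stderr is relayed through
    // the main thread, which may be the thread that is stuck.
    writeSync(2, "aborting: RSS " + rss + " bytes exceeded cap " + workerData.capBytes + "\\n");
    process.kill(process.pid, "SIGKILL");
  }
}, workerData.intervalMs);
parentPort.on("message", (msg) => {
  if (msg === "stop") {
    clearInterval(timer);
    parentPort.close(); // nothing left to do, so the worker exits
  }
});
`;

// Starts the watchdog and returns a stop function for clean shutdown.
function startRssWatchdog(capBytes: number, intervalMs = 1_000, graceMs = 500): () => void {
  const worker = new Worker(WATCHDOG_SOURCE, {
    eval: true,
    workerData: { capBytes, intervalMs },
  });
  worker.unref(); // the watchdog alone should not keep the process alive
  return () => {
    // Ask politely first; fall back to terminate() if the worker does not exit.
    const fallback = setTimeout(() => void worker.terminate(), graceMs);
    worker.once("exit", () => clearTimeout(fallback));
    worker.postMessage("stop");
  };
}
```

Under these assumptions, a 4 GB cap would be started with `const stop = startRssWatchdog(4 * 1024 ** 3);` and `stop()` would be called from the main thread's shutdown path.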

Signing counter
---------------
For a PoP-gated label signed by a lower-tier signer, bulletin-deploy submits
`setUserPopStatus` before `register()`, so `dot deploy --signer phone
--playground` actually fires 5 sigs (setPop + commit + finalize + setCH +
playground publish), not 4. The approvals list was hardcoded to 3 DotNS
taps, so the summary card advertised "4 approvals" and the phone prompt
later said "approve step 5 of 4".

Fix threads a structured `plan: { action: "register" | "update",
needsPopUpgrade }` from the availability check through to the signing proxy:

* `availability.ts` calls `getUserPopStatus(userH160)` + `isTestnet()` and
  predicts `needsPopUpgrade` via a mirrored `simulateUserStatus` rule
  (reproduced locally because the helper isn't exported from bulletin-
  deploy's root; the rule has been stable since 0.6.9-rc.5).
* `signerMode.ts` generates a variable-length approvals list per plan:
  `update` -> only setContenthash (1 tap); `register` -> commit + finalize +
  contenthash (3), plus a prefix "Grant Proof of Personhood" entry when
  `needsPopUpgrade`. Summary card + runtime counter consume the same list
  so they always agree.
* `run.ts` passes the built approvals into `maybeWrapAuthForSigning`
  instead of its old hardcoded 3 labels; the Nth `signTx` call now labels
  itself with the Nth dotns entry of the approvals list.
* `signingProxy.ts` `createSigningCounter` clamps `total` upward when
  `step > total`. Belt-and-braces -- if a future bulletin-deploy version
  adds a sig our prediction missed, the TUI will show "N of N" instead of
  regressing to "N of N-1".
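
The upward clamp in the last bullet amounts to a few lines. The sketch below borrows the `createSigningCounter` name from the bullet, but its signature and return shape are assumptions, not the actual `signingProxy.ts` API.

```typescript
// Hypothetical return shape for illustration only.
type SigningProgress = { step: number; total: number };

// Creates a counter that starts from the predicted number of signatures
// and never reports a step larger than the total.
function createSigningCounter(predictedTotal: number): { next(): SigningProgress } {
  let step = 0;
  let total = predictedTotal;
  return {
    next(): SigningProgress {
      step += 1;
      // If more signatures arrive than predicted, grow the total to match.
      if (step > total) total = step;
      return { step, total };
    },
  };
}

// Formats the phone prompt text from a progress value.
function formatPrompt(progress: SigningProgress): string {
  return `approve step ${progress.step} of ${progress.total}`;
}
```

With a prediction of 4 and a fifth signature arriving, the fifth call to `next()` returns `{ step: 5, total: 5 }`, so the prompt reads "approve step 5 of 5".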

Tests
-----
170/170 pass (161 existing + 9 new):
* 4 in availability.test.ts covering needsPopUpgrade prediction, update
  path, and RPC-flake fallback to safe default
* 2 in run.test.ts covering 5-approval (PoP upgrade) and 2-approval
  (re-deploy) phone paths
* 3 in signingProxy.test.ts covering the counter clamp

`tsc --noEmit` clean, `biome format` clean.

Related CLAUDE.md invariants touched: process-guard watchdog now worker-
based (the "Process-guard safety net" bullet), and deploy log stream now
has a documented verbose opt-in.
@github-actions

Dev build ready — try this branch:

curl -fsSL https://raw.githubusercontent.com/paritytech/playground-cli/main/install.sh | VERSION=dev/ub/deploy-watchdog-and-signing-counter-fixes bash

UtkarshBhardwaj007 merged commit 8746feb into main on Apr 18, 2026
5 checks passed
UtkarshBhardwaj007 deleted the ub/deploy-watchdog-and-signing-counter-fixes branch on April 18, 2026 17:20