Harden deploy memory watchdog; fix phone-signer 'step 5 of 4' #22
Merged
UtkarshBhardwaj007 merged 1 commit into main on Apr 18, 2026
Two-part fix addressing (1) a 20 GB RSS growth + freeze during chunk upload
in phone-signer mode that macOS jetsam terminated with SIGKILL, and (2) a
phone-signer approval counter that printed "approve step 5 of 4" whenever
the domain required a Proof-of-Personhood upgrade.
Memory hardening
----------------
* `src/utils/process-guard.ts` moves the RSS watchdog into a `worker_threads`
Worker. The previous `setInterval`-on-main version was starved by
polkadot-api subscription microtask floods; RSS climbed past 10 GB between
samples and macOS jetsam delivered SIGKILL before we could print any abort
message. The worker has its own event loop and samples at 1 s (down from
5 s). When the cap is crossed, it prints a clear reason and SIGKILLs the
whole process; on main-thread shutdown the worker is torn down via
`postMessage("stop")` with a `terminate()` fallback.
* `src/utils/deploy/storage.ts` gains a `DOT_DEPLOY_VERBOSE=1` env var that
passes every bulletin-deploy log line through to stderr with a
`[+<seconds>s]` prefix. Previously the interceptor dropped everything that
wasn't a phase banner or `[N/M]` chunk line, which made freeze reports
diagnostically opaque. Pair with `DOT_MEMORY_TRACE=1` to correlate log
events with RSS growth.
* `src/commands/deploy/index.ts` destroys the Asset Hub client immediately
after preflight. Nothing in the deploy flow between preflight and
playground publish uses `getConnection()`, so holding an idle polkadot-api
client + live best-block subscription for the full deploy window was
unnecessary background pressure. Publish re-establishes via
`getConnection()`.
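A minimal sketch of the worker-based watchdog idea described above, not the actual `process-guard.ts` code. The names `toWatchdogConfig` and `startWatchdog`, the GB-based cap, and the 500 ms stop timeout are assumptions made for illustration. It also assumes `process.kill` is usable from a worker and writes to fd 2 directly, since a worker's `process.stderr` is relayed through the main thread and may not flush while that thread is starved.

```typescript
import { Worker } from "node:worker_threads";

// Hypothetical helper: validate inputs and convert a GB cap to bytes.
export function toWatchdogConfig(
  capGb: number,
  intervalMs = 1000,
): { capBytes: number; intervalMs: number } {
  if (!(capGb > 0) || !(intervalMs > 0)) {
    throw new RangeError("capGb and intervalMs must be positive");
  }
  return { capBytes: Math.floor(capGb * 1024 ** 3), intervalMs };
}

// Worker body as plain CommonJS so the sketch stays in one file.
const WORKER_SOURCE = `
const { parentPort, workerData } = require("node:worker_threads");
const { writeSync } = require("node:fs");
const timer = setInterval(() => {
  // rss is process-wide even when read from a worker thread.
  const rss = process.memoryUsage().rss;
  if (rss > workerData.capBytes) {
    writeSync(2, "[watchdog] RSS " + rss + " B exceeded cap " + workerData.capBytes + " B; aborting\\n");
    process.kill(process.pid, "SIGKILL");
  }
}, workerData.intervalMs);
parentPort.on("message", (msg) => {
  if (msg === "stop") {
    clearInterval(timer);
    parentPort.close(); // lets the worker exit on its own
  }
});
`;

// Starts the watchdog and returns a stop function for orderly shutdown.
export function startWatchdog(capGb: number, intervalMs = 1000): () => Promise<void> {
  const worker = new Worker(WORKER_SOURCE, {
    eval: true,
    workerData: toWatchdogConfig(capGb, intervalMs),
  });
  worker.unref(); // the watchdog alone should not keep the process alive
  return async () => {
    worker.postMessage("stop");
    const exited = new Promise<void>((resolve) => worker.once("exit", () => resolve()));
    const timedOut = new Promise<void>((resolve) => setTimeout(resolve, 500));
    await Promise.race([exited, timedOut]);
    await worker.terminate(); // fallback if the worker never saw "stop"
  };
}
```

Because the sampling timer lives on the worker's event loop, a microtask flood on the main thread delays neither the sample nor the kill.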
Signing counter
---------------
For a PoP-gated label signed by a lower-tier signer, bulletin-deploy submits
`setUserPopStatus` before `register()`, so `dot deploy --signer phone
--playground` actually fires 5 sigs (setPop + commit + finalize + setCH +
playground publish), not 4. The approvals list was hardcoded to 3 DotNS
taps, so the summary card advertised "4 approvals" and the phone prompt
later said "approve step 5 of 4".
The fix threads a structured `plan: { action: "register" | "update",
needsPopUpgrade }` from the availability check through to the signing proxy:
* `availability.ts` calls `getUserPopStatus(userH160)` + `isTestnet()` and
predicts `needsPopUpgrade` via a mirrored `simulateUserStatus` rule
(reproduced locally because the helper isn't exported from bulletin-
deploy's root; the rule has been stable since 0.6.9-rc.5).
* `signerMode.ts` generates a variable-length approvals list per plan:
`update` -> only setContenthash (1 tap); `register` -> commit + finalize +
contenthash (3), plus a prefix "Grant Proof of Personhood" entry when
`needsPopUpgrade`. Summary card + runtime counter consume the same list
so they always agree.
* `run.ts` passes the built approvals into `maybeWrapAuthForSigning`
instead of its old hardcoded 3 labels; the Nth `signTx` call now labels
itself with the Nth dotns entry of the approvals list.
* `signingProxy.ts` `createSigningCounter` clamps `total` upward when
`step > total`. Belt-and-braces -- if a future bulletin-deploy version
adds a sig our prediction missed, the TUI will show "N of N" instead of
regressing to "N of N-1".
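The per-plan approvals list described above might be built roughly as follows. This is a sketch under assumptions: `buildApprovals`, `buildDotnsApprovals`, the boolean type of `needsPopUpgrade`, and every label string except "Grant Proof of Personhood" are hypothetical and not taken from `signerMode.ts`.

```typescript
// Assumed shape of the plan threaded from the availability check.
type DeployPlan = {
  action: "register" | "update";
  needsPopUpgrade: boolean;
};

// DotNS taps only: 1 for an update, 3 for a register, plus an optional PoP prefix.
export function buildDotnsApprovals(plan: DeployPlan): string[] {
  if (plan.action === "update") {
    return ["Set contenthash"];
  }
  const labels = ["Commit", "Finalize", "Set contenthash"];
  return plan.needsPopUpgrade ? ["Grant Proof of Personhood", ...labels] : labels;
}

// Full list, including the playground publish tap when requested.
// The summary card and the runtime counter would both read this one list.
export function buildApprovals(plan: DeployPlan, playground: boolean): string[] {
  const labels = buildDotnsApprovals(plan);
  return playground ? [...labels, "Publish to playground"] : labels;
}
```

With these rules, a PoP-gated register with playground publish yields 5 entries, and a re-deploy with playground publish yields 2, matching the counts quoted in this PR.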
Tests
-----
170/170 pass (161 existing + 9 new):
* 4 in availability.test.ts covering needsPopUpgrade prediction, update
path, and RPC-flake fallback to safe default
* 2 in run.test.ts covering 5-approval (PoP upgrade) and 2-approval
(re-deploy) phone paths
* 3 in signingProxy.test.ts covering the counter clamp
`tsc --noEmit` clean, `biome format` clean.
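The counter clamp covered by the `signingProxy.test.ts` cases can be illustrated with a small sketch. Only the name `createSigningCounter` comes from this PR; the returned object shape, the `next()` method, and `formatStep` are assumptions.

```typescript
// Hypothetical shape for the counter's state after each signature request.
type CounterState = { step: number; total: number };

export function createSigningCounter(predictedTotal: number): { next(): CounterState } {
  let step = 0;
  let total = predictedTotal;
  return {
    next(): CounterState {
      step += 1;
      // Clamp upward: if more signatures arrive than predicted, grow the
      // total so the prompt never reads "step 5 of 4".
      if (step > total) {
        total = step;
      }
      return { step, total };
    },
  };
}

// Hypothetical prompt formatter used only for this illustration.
export function formatStep(state: CounterState): string {
  return `approve step ${state.step} of ${state.total}`;
}
```

Calling `next()` five times on a counter created with a predicted total of 4 ends at "approve step 5 of 5" instead of "approve step 5 of 4".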
Related CLAUDE.md invariants touched: process-guard watchdog now worker-
based (the "Process-guard safety net" bullet), and deploy log stream now
has a documented verbose opt-in.
Summary
-------
Two-part fix for `dot deploy` in phone-signer mode:
* Memory hardening: the RSS watchdog now runs in a `worker_threads` Worker so
it can't be starved by main-thread microtask floods (how we ended up with a
20 GB RSS and a mystery `zsh: killed` last round). Watchdog samples every
1 s and force-exits with a clear abort message. Adds `DOT_DEPLOY_VERBOSE=1`
for streaming every bulletin-deploy log line to stderr with `[+<seconds>s]`
timestamps, plus destroys the idle Asset Hub client right after preflight.
* Signing counter: for a PoP-gated label signed by a lower-tier signer,
`dot deploy --signer phone --playground` actually fires 5 txs (setPop +
commit + finalize + setCH + playground publish), not 4. The approvals list
was hardcoded to 3 DotNS taps, so the summary card advertised "4 approvals"
and the phone prompt later said "approve step 5 of 4". Fixed by predicting
`needsPopUpgrade` in the availability check and threading a structured
`plan` through to `resolveSignerSetup` + the signing proxy.
Test plan
---------
* `pnpm exec tsc --noEmit` clean
* `pnpm exec biome format .` clean
* `pnpm test`: 170/170 (161 existing + 9 new):
  * `availability.test.ts`: `needsPopUpgrade` prediction, update path,
  RPC-flake fallback
  * `run.test.ts`: 5-approval (PoP upgrade) + 2-approval (re-deploy) phone
  flows
  * `signingProxy.test.ts`: counter clamps `total` upward when
  `step > total`
* `dot deploy --signer phone --playground` on a PoP-gated label: summary card
reports 5 approvals, phone prompt shows "step 1..5 of 5", no mystery SIGKILL
* `DOT_MEMORY_TRACE=1 DOT_DEPLOY_VERBOSE=1 dot deploy 2>/tmp/dot-trace.log`:
RSS stays stable, trace contains timestamped bulletin-deploy lines + memory
samples
Background
----------
The "step 5 of 4" symptom from the previous diagnostic run surfaced after we
added `DOT_DEPLOY_VERBOSE=1`, which confirmed that the deploy itself completed
fine, memory stayed under 2 GB, and the miscounted counter was a separate UX
bug tied to `setUserPopStatus` firing on PoP-gated labels. Both classes of bug
are addressed in this PR; the verbose log + worker watchdog remain in place as
defense-in-depth if the 20 GB leak resurfaces.
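For readers correlating the verbose trace with memory samples, the `[+<seconds>s]` prefix could be produced by something like the sketch below. The function name `makeVerboseLogger`, the injectable clock, and the one-decimal precision are assumptions; the PR only specifies the `[+<seconds>s]` shape and that lines go to stderr.

```typescript
// Returns a logger that prefixes each line with seconds elapsed since creation.
// `write` and `now` are injected so the sketch can be exercised without real I/O.
export function makeVerboseLogger(
  write: (line: string) => void,
  now: () => number = Date.now,
): (line: string) => void {
  const startedAt = now();
  return (line: string): void => {
    const seconds = ((now() - startedAt) / 1000).toFixed(1);
    write(`[+${seconds}s] ${line}`);
  };
}

// Possible wiring (assumed, not taken from storage.ts):
//   if (process.env.DOT_DEPLOY_VERBOSE === "1") {
//     const log = makeVerboseLogger((l) => process.stderr.write(l + "\n"));
//     log("uploading chunk");
//   }
```

Measuring from a single start time keeps the log lines and the memory samples on one shared timeline, which is what makes the two traces comparable.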