docs(mpc): consensus-mode installation + operator runbook + smoke test (closes #1) #2

Open
abhicris wants to merge 312 commits into luxfi:main from abhicris:docs/consensus-mode-runbook

@abhicris
Closes #1.

Follow-up to the issue I filed: rewrites docs/INSTALLATION.md to lead with consensus-mode (ZAP transport, no NATS/Consul deps), moves the legacy NATS+Consul transport to an appendix with a deprecation banner + migration checklist, and lands the three companion docs / scripts the issue asked for.

Files

docs/INSTALLATION.md (rewrite)

  • Leads with the consensus-mode quick start that matches the README (mpcd start --node-id ... --peer ..., no external message bus).
  • Full flag reference for mpcd start — node id, listen/api/api-listen ports, threshold, peers, --hsm-provider, --hsm-signer, --hsm-attest, log level — cross-checked against cmd/mpcd/main.go on main.
  • Peer manifest + identity generation via the mpc CLI (matches cmd/mpc/), event initiator setup, Docker Compose example aligned with the actual compose.yml (SQLite default, Postgres+Valkey commented), Kustomize pointer for k8s/.
  • Production hardening checklist: environment: production + cloud HSM provider (the startup guard in resolveZapDBPassword rejects env/file in production), encrypted identities, share rotation cadence, off-host backups.
  • Appendix A — legacy NATS+Consul transport retained for back-compat, including a step-by-step migration to consensus mode.
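To make the production guard in the hardening checklist concrete, here is a minimal sketch of the rule as described: production plus an env or file provider is rejected. This is not the code in resolveZapDBPassword (which, per the cross-checks, exits fatally rather than returning an error); the function name `checkPasswordProvider`, the sentinel error, and the `"aws"` provider name are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errInsecureProvider is a hypothetical sentinel error for this sketch.
var errInsecureProvider = errors.New("env/file password provider is not allowed in production")

// checkPasswordProvider illustrates the documented rule: when the
// environment is "production", the "env" and "file" providers are rejected.
// Any other provider name is passed through unchecked (assumption).
func checkPasswordProvider(environment, provider string) error {
	if environment != "production" {
		return nil
	}
	switch provider {
	case "env", "file":
		return fmt.Errorf("provider %q: %w", provider, errInsecureProvider)
	}
	return nil
}

func main() {
	// "aws" stands in for a cloud HSM provider; the real name may differ.
	for _, p := range []string{"env", "file", "aws"} {
		fmt.Printf("production + %s -> %v\n", p, checkPasswordProvider("production", p))
	}
}
```

Returning an error instead of exiting keeps the sketch testable; a daemon would typically turn a non-nil result into a startup failure.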

docs/RUNBOOK.md (new)

Operator checklist for the day-to-day:

  • Daily / periodic health cadence table.
  • Kubernetes liveness + readiness probe YAML with rationale on thresholds and initialDelaySeconds.
  • Key rotation: share refresh (public key stays, via POST /v1/wallets/{id}/reshare) vs. wallet rotation (new key); operator preconditions / post-conditions / checklist for each.
  • Node add / remove without resharing (new wallets can use the expanded cohort; existing wallets are cohort-bound).
  • Backup + restore from age-encrypted ZapDB archives; the important caveat that a backup is share-bound to identity + HSM password, not a key backup.
  • Incident triage: ready: false, signing timeouts, suspected node compromise (isolate → reshare → rotate password → destroy backups), HSM/KMS outage.
  • Rollback policy (no mixed-version cohorts for signing).
  • Useful one-liners for cluster-wide status, audit tailing, key counts.
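As an illustration of the cluster-wide status check, the sketch below takes the /healthz response bodies collected from each node and lists the nodes that need attention. The field names come from the health schema in this PR; the Go types (integers for the peer counts) and the helper name `degraded` are assumptions, and fetching the bodies is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// health holds the subset of /healthz fields used here. Field names are
// from the PR description; the Go types are assumptions.
type health struct {
	NodeID         string `json:"node_id"`
	Ready          bool   `json:"ready"`
	ExpectedPeers  int    `json:"expected_peers"`
	ConnectedPeers int    `json:"connected_peers"`
}

// degraded returns the IDs of nodes that report ready=false or that see
// fewer peers than expected. Bodies that fail to parse are reported by
// their index in the input.
func degraded(bodies [][]byte) []string {
	var out []string
	for i, b := range bodies {
		var h health
		if err := json.Unmarshal(b, &h); err != nil {
			out = append(out, fmt.Sprintf("response[%d]: unparseable", i))
			continue
		}
		if !h.Ready || h.ConnectedPeers < h.ExpectedPeers {
			out = append(out, h.NodeID)
		}
	}
	return out
}

func main() {
	bodies := [][]byte{
		[]byte(`{"node_id":"node0","ready":true,"expected_peers":2,"connected_peers":2}`),
		[]byte(`{"node_id":"node1","ready":false,"expected_peers":2,"connected_peers":1}`),
	}
	fmt.Println(degraded(bodies)) // prints [node1]
}
```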

docs/HEALTH.md (new)

Spec for GET /healthz on the internal API listener:

  • Response schema pulled from the healthHandler in cmd/mpcd/main.go: status, node_id, mode, expected_peers, connected_peers, ready, threshold, version.
  • Status code contract: 200 when ready: true, 503 when ready: false.
  • Recommended probe intervals + initialDelaySeconds and the rationale for each.
  • Sampling cadence table by consumer (K8s probes, external monitoring, incident triage) and a don't-poll-faster-than note.
  • Example Prometheus alert rules.
  • Explicit what /healthz does NOT check section so operators add canaries (see scripts/smoke-test.sh) instead of over-instrumenting the probe.
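The status-code contract can be sketched as follows. This is an illustrative handler, not the healthHandler in cmd/mpcd/main.go: the field names match the schema listed above, but the Go types, the `sketchHealthHandler` and `statusFor` names, and the sample values ("degraded", "consensus", "dev") are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthResponse mirrors the documented field names; types are assumed.
type healthResponse struct {
	Status         string `json:"status"`
	NodeID         string `json:"node_id"`
	Mode           string `json:"mode"`
	ExpectedPeers  int    `json:"expected_peers"`
	ConnectedPeers int    `json:"connected_peers"`
	Ready          bool   `json:"ready"`
	Threshold      int    `json:"threshold"`
	Version        string `json:"version"`
}

// statusFor encodes the contract: 200 when ready, 503 otherwise.
func statusFor(ready bool) int {
	if ready {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// sketchHealthHandler serves whatever snapshot() returns as JSON and sets
// the status code from the ready field.
func sketchHealthHandler(snapshot func() healthResponse) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		resp := snapshot()
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(statusFor(resp.Ready))
		_ = json.NewEncoder(w).Encode(resp)
	}
}

func main() {
	h := sketchHealthHandler(func() healthResponse {
		// Sample values only; one of two expected peers is connected.
		return healthResponse{Status: "degraded", NodeID: "node0", Mode: "consensus",
			ExpectedPeers: 2, ConnectedPeers: 1, Ready: false, Threshold: 2, Version: "dev"}
	})
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	fmt.Println(rec.Code) // prints 503
}
```

Because the status code alone carries readiness, a Kubernetes probe or a curl --fail check can rely on it without parsing the body.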

scripts/smoke-test.sh (new, executable)

End-to-end canary:

  • Generates a peer manifest (mpc generate-peers -n 3), per-node identities (mpc generate-identity), an event initiator (mpc generate-initiator).
  • Boots 3 mpcd nodes on loopback (:19651-3 P2P, :19800-2 API).
  • Waits for /healthz ready=true on each node (60 s timeout, configurable).
  • Fires POST /keygen with a random org + wallet id, requires result_type == "success" and a non-empty ecdsa_pub_key.
  • Signs a random 32-byte nonce, structurally verifies signature length (64 or 65 bytes).
  • SKIP_BOOT=1 points the same test at a running staging cluster for periodic canary use.
  • Traps to tear down processes and WORKDIR on exit.
  • Requires: mpcd, mpc, curl, jq, openssl.
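The script itself is shell, but the assertions it makes are easy to state in Go. The sketch below covers the keygen check (result_type == "success", non-empty ecdsa_pub_key), the 32-byte random nonce, and the structural signature check (64 or 65 bytes). The helper names are hypothetical, and the signature is taken as raw bytes because the wire encoding is not specified here.

```go
package main

import (
	"crypto/rand"
	"encoding/json"
	"errors"
	"fmt"
)

// keygenResult holds the two keygen response fields the smoke test inspects.
type keygenResult struct {
	ResultType  string `json:"result_type"`
	ECDSAPubKey string `json:"ecdsa_pub_key"`
}

// checkKeygen fails unless result_type is "success" and a public key is set.
func checkKeygen(body []byte) error {
	var r keygenResult
	if err := json.Unmarshal(body, &r); err != nil {
		return fmt.Errorf("decode keygen response: %w", err)
	}
	if r.ResultType != "success" {
		return fmt.Errorf("result_type = %q, want \"success\"", r.ResultType)
	}
	if r.ECDSAPubKey == "" {
		return errors.New("ecdsa_pub_key is empty")
	}
	return nil
}

// randomNonce returns 32 random bytes to use as the message to sign.
func randomNonce() ([]byte, error) {
	nonce := make([]byte, 32)
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return nonce, nil
}

// checkSignatureLen is a structural check only: it accepts 64-byte or
// 65-byte signatures and does not verify them cryptographically.
func checkSignatureLen(sig []byte) error {
	if n := len(sig); n != 64 && n != 65 {
		return fmt.Errorf("signature is %d bytes, want 64 or 65", n)
	}
	return nil
}

func main() {
	nonce, err := randomNonce()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(nonce)) // prints 32
	fmt.Println(checkKeygen([]byte(`{"result_type":"success","ecdsa_pub_key":"02ab"}`)))
	fmt.Println(checkSignatureLen(make([]byte, 65)))
}
```

A length check catches truncated or malformed responses, but it would not catch a wrong signature; verifying against the returned public key would be a stronger canary.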

Cross-checks done

  • Flag names + env var names verified against cmd/mpcd/main.go (--node-id, --listen, --api, --api-listen, --hsm-provider, MPC_HSM_PROVIDER, --hsm-attest, MPC_HSM_ATTEST, etc.).
  • Endpoint paths verified against mux.HandleFunc calls on main: /healthz, /keys, /backup, /keygen. Response fields verified against the healthHandler map literal.
  • config.yaml schema verified against the existing config.yaml in the repo (mode, environment, mpc_threshold, max_concurrent_keygen, db_path, backup_enabled, backup_period_seconds, backup_dir, event_initiator_pubkey).
  • Build targets verified against the Makefile (mpcd and mpc binaries).
  • Docker Compose example mirrors the actual compose.yml (SQLite default, Postgres + Valkey commented out).
  • K8s manifest reference matches k8s/mpc-statefulset.yaml (5 replicas, readOnlyRootFilesystem, node{0..4}@mpc-node-{0..4}.mpc-node-headless.lux-mpc.svc:9651).
  • Production guard mentioned in INSTALLATION.md matches the explicit fatal in resolveZapDBPassword(): production + env/file provider is rejected.
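For reference, the peer address pattern quoted from k8s/mpc-statefulset.yaml expands as in the sketch below. The `peerAddrs` helper is hypothetical; only the address format and the replica count of 5 come from the manifest reference above.

```go
package main

import (
	"fmt"
	"strings"
)

// peerAddrs expands node{i}@mpc-node-{i}.mpc-node-headless.lux-mpc.svc:9651
// for i in [0, replicas).
func peerAddrs(replicas int) []string {
	peers := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		peers = append(peers,
			fmt.Sprintf("node%d@mpc-node-%d.mpc-node-headless.lux-mpc.svc:9651", i, i))
	}
	return peers
}

func main() {
	// One peer per line for the 5-replica StatefulSet.
	fmt.Println(strings.Join(peerAddrs(5), "\n"))
}
```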

Happy to split any of these into smaller PRs if preferred — e.g. land INSTALLATION + HEALTH first and RUNBOOK + smoke-test as a follow-up. LMK.

anhthii and others added 30 commits July 28, 2025 22:29
* Fix error message when signing with duplicate tx id

* Update error message for duplicate signing requests to be more concise
Major changes:
- Replaced Binance TSS library with luxfi/threshold (supports CGGMP21 & FROST)
- Added XRPL network support for secp256k1 signatures
- Implemented resharing functionality for both ECDSA and EdDSA
- Fixed protocol adapters for CGGMP21 and FROST
- Updated event consumer to use new node methods
- Fixed all build errors in core packages

Technical improvements:
- Simplified party.ID usage (now string-based)
- Added proper message types and encoding
- Updated session interfaces with required methods
- Fixed test suite to work with new library structure

All core package tests passing. Ready for mainnet deployment.
- Remove local replace directive for threshold library
- Fix unused directMessaging variable warning
- Replace undefined logger.NewLogger with zap.NewProduction
- Fix GetMessage method call to use Raw()
- Mark bridge implementation as TODO
- Remove unused imports (os/signal, syscall)
- Fix unused variable warning (msgBytes)
- Set Go version to 1.21 for CI compatibility
- Add .golangci.yml configuration for consistent linting
- Update .gitignore to exclude built binaries
- Keep go.mod at version 1.24.5 as required by the project
- Fix import formatting with goimports
- Remove unnecessary type conversions for syscall.Stdin
- Format imports with correct grouping
- Run goimports on all pkg files
- Ensure consistent import grouping throughout codebase
- Fix imports in cmd and scripts directories
- Ensure consistent import grouping across all Go files
- Update CI workflow to use Go 1.22
- Update go.mod to match CI environment
- This ensures compatibility across local and CI environments
- Update .golangci.yml to use new configuration format
- Set Go version to 1.23 in both go.mod and CI workflow
- Remove deprecated linter settings
- Merge both issues sections into one
- Keep all exclude and configuration settings
- Remove security-scan from build dependencies
- Keep Go version at 1.24.5 in go.mod as required by project
- Focus on getting core CI (test, lint, build) working
- Test job: ✅ All tests passing
- Lint job: ✅ All linting checks passing (golangci-lint)
- CodeQL job: ✅ Security analysis passing

Successfully completed migration to luxfi/threshold library with:
- Full CGGMP21 support for ECDSA (Bitcoin, Ethereum, XRPL)
- Full FROST support for EdDSA (Solana)
- Complete resharing implementation
- All tests passing
- Clean code with proper imports

Remaining issues (non-blocking):
- Security scanning requires Go 1.24
- SARIF file generation needs fixes
- Updates threshold from v1.0.0 to v1.1.0
- Includes all test fixes and improvements
- All MPC tests pass with updated dependency
- Fixed shell script compatibility for macOS
- Updated e2e module dependencies
- Simplified benchmark tests
- Fixed KMS client build
- Go 1.24.5 is not yet released
- golangci-lint is built with Go 1.23
- Fixes CI lint job failure
hanzo-dev and others added 29 commits April 13, 2026 16:22
Matches the sshd/ssh, httpd/http unix pattern. Single Docker image
now ships both binaries: ENTRYPOINT is mpcd (the daemon); override
with 'mpc <subcommand>' for CLI ops (peers, identity, recover, etc).

Also aligns with lux/kms (single binary with subcommands) — operators
get the same shape across all Base-family daemons.
Vite + React SPA with 9 pages (Dashboard, Wallets, Vaults, Policies,
Settlements, Bridge, Payments, Audit, Settings) served at /_/mpc/ via
go:embed. Replaces the separate Next.js dashboard pod — the UI now ships
inside the mpcd binary. Uses @tanstack/react-query, wouter hash router,
dark theme, inline styles. Build output: 272KB.
- ws.go: EventBus pubsub + /v1/ws upgrade handler (stdlib, no deps)
- zap_server.go: StartZAP(port) with opcodes 0x0060-0x0064
  keygen, sign, status, wallets, intents over ZAP binary protocol
- Events field on Server for cross-transport event broadcasting
- Each consensus node serves full API (not separate mpc-api sidecar)
closes luxfi#1)

Rewrite INSTALLATION.md to lead with consensus-mode (ZAP transport, no
NATS/Consul deps), matching the production path described in README.
The legacy NATS+Consul transport is retained in Appendix A with a
deprecation banner and a migration checklist for operators running
older installs.

New docs:
  - docs/RUNBOOK.md   operator checklist: daily health probes, key
                     rotation (share refresh vs. wallet rotation),
                     node add/remove without resharing, backup and
                     restore, incident triage (degraded node, signing
                     timeouts, suspected compromise, HSM outage),
                     rollback procedure, audit log guidance.
  - docs/HEALTH.md   /healthz endpoint contract: response schema,
                     status-code semantics, K8s probe recommendations,
                     sampling cadence, alert rules, and the explicit
                     list of what the endpoint does NOT check (so
                     operators add canaries, not instrumentation).

New script:
  - scripts/smoke-test.sh   end-to-end local canary. Generates a peer
                           manifest, per-node identities, an event
                           initiator, and boots a 3-node consensus-
                           mode cluster on loopback. Waits for
                           /healthz ready=true on each node, runs
                           keygen (CGGMP21 secp256k1), signs a random
                           nonce, structurally verifies the signature,
                           and tears down. Exits 0 on success.
                           SKIP_BOOT=1 repoints at a running staging
                           cluster for periodic canary use.

All flag names, env vars, endpoint paths, response fields, and config
keys cross-checked against cmd/mpcd/main.go and pkg/api/server.go on
main. The CLI binary names (mpcd daemon, mpc helper) match the
Makefile build targets.

Development

Successfully merging this pull request may close these issues.

docs: update INSTALLATION.md for consensus-mode + add operator runbook outline

9 participants