docs(mpc): consensus-mode installation + operator runbook + smoke test (closes #1) #2
Open
abhicris wants to merge 312 commits into luxfi:main from
Implement badger backup recovery

Enable backup mode by default

* Fix error message when signing with duplicate tx id
* Update error message for duplicate signing requests to be more concise

Major changes:
- Replaced Binance TSS library with luxfi/threshold (supports CGGMP21 & FROST)
- Added XRPL network support for secp256k1 signatures
- Implemented resharing functionality for both ECDSA and EdDSA
- Fixed protocol adapters for CGGMP21 and FROST
- Updated event consumer to use new node methods
- Fixed all build errors in core packages
Technical improvements:
- Simplified party.ID usage (now string-based)
- Added proper message types and encoding
- Updated session interfaces with required methods
- Fixed test suite to work with new library structure
All core package tests passing. Ready for mainnet deployment.

- Remove local replace directive for threshold library
- Fix unused directMessaging variable warning
- Replace undefined logger.NewLogger with zap.NewProduction
- Fix GetMessage method call to use Raw()
- Mark bridge implementation as TODO

- Remove unused imports (os/signal, syscall)
- Fix unused variable warning (msgBytes)
- Set Go version to 1.21 for CI compatibility

- Add .golangci.yml configuration for consistent linting
- Update .gitignore to exclude built binaries
- Keep go.mod at version 1.24.5 as required by the project

- Fix import formatting with goimports
- Remove unnecessary type conversions for syscall.Stdin
- Format imports with correct grouping

- Run goimports on all pkg files
- Ensure consistent import grouping throughout codebase

- Fix imports in cmd and scripts directories
- Ensure consistent import grouping across all Go files

- Update CI workflow to use Go 1.22
- Update go.mod to match CI environment
- This ensures compatibility across local and CI environments

- Update .golangci.yml to use new configuration format
- Set Go version to 1.23 in both go.mod and CI workflow
- Remove deprecated linter settings

- Merge both issues sections into one
- Keep all exclude and configuration settings

- Remove security-scan from build dependencies
- Keep Go version at 1.24.5 in go.mod as required by project
- Focus on getting core CI (test, lint, build) working

- Test job: ✅ All tests passing
- Lint job: ✅ All linting checks passing (golangci-lint)
- CodeQL job: ✅ Security analysis passing
Successfully completed migration to luxfi/threshold library with:
- Full CGGMP21 support for ECDSA (Bitcoin, Ethereum, XRPL)
- Full FROST support for EdDSA (Solana)
- Complete resharing implementation
- All tests passing
- Clean code with proper imports
Remaining issues (non-blocking):
- Security scanning requires Go 1.24
- SARIF file generation needs fixes

- Updates threshold from v1.0.0 to v1.1.0
- Includes all test fixes and improvements
- All MPC tests pass with updated dependency

- Fixed shell script compatibility for macOS
- Updated e2e module dependencies
- Simplified benchmark tests
- Fixed KMS client build

- Go 1.24.5 is not yet released
- golangci-lint is built with Go 1.23
- Fixes CI lint job failure

Matches the sshd/ssh, httpd/http unix pattern. Single Docker image now ships both binaries: ENTRYPOINT is mpcd (the daemon); override with 'mpc <subcommand>' for CLI ops (peers, identity, recover, etc). Also aligns with lux/kms (single binary with subcommands), so operators get the same shape across all Base-family daemons.

Vite + React SPA with 9 pages (Dashboard, Wallets, Vaults, Policies, Settlements, Bridge, Payments, Audit, Settings) served at /_/mpc/ via go:embed. Replaces the separate Next.js dashboard pod; the UI now ships inside the mpcd binary. Uses @tanstack/react-query, wouter hash router, dark theme, inline styles. Build output: 272KB.

- ws.go: EventBus pubsub + /v1/ws upgrade handler (stdlib, no deps)
- zap_server.go: StartZAP(port) with opcodes 0x0060-0x0064 keygen, sign, status, wallets, intents over ZAP binary protocol
- Events field on Server for cross-transport event broadcasting
- Each consensus node serves full API (not separate mpc-api sidecar)
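
The commit above describes an `EventBus` pubsub that lets several transports share one event stream. Here is a minimal stdlib-only sketch of that idea, not the code in `ws.go`: the `Event` fields and the `Subscribe`/`Publish` signatures are assumptions made for illustration, and dropping events for slow subscribers is one possible policy among several.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical payload shape; the real type in ws.go may differ.
type Event struct {
	Type    string
	Payload string
}

// EventBus fans each published event out to every current subscriber.
type EventBus struct {
	mu   sync.RWMutex
	subs map[int]chan Event
	next int
}

func NewEventBus() *EventBus {
	return &EventBus{subs: make(map[int]chan Event)}
}

// Subscribe registers a buffered channel and returns it with a cancel func.
func (b *EventBus) Subscribe(buf int) (<-chan Event, func()) {
	b.mu.Lock()
	defer b.mu.Unlock()
	id := b.next
	b.next++
	ch := make(chan Event, buf)
	b.subs[id] = ch
	cancel := func() {
		b.mu.Lock()
		defer b.mu.Unlock()
		if _, ok := b.subs[id]; ok {
			delete(b.subs, id)
			close(ch) // safe: Publish only sends while holding the read lock
		}
	}
	return ch, cancel
}

// Publish offers e to every subscriber without blocking and returns how many
// accepted it. Subscribers with a full buffer miss the event.
func (b *EventBus) Publish(e Event) int {
	b.mu.RLock()
	defer b.mu.RUnlock()
	delivered := 0
	for _, ch := range b.subs {
		select {
		case ch <- e:
			delivered++
		default:
		}
	}
	return delivered
}

func main() {
	bus := NewEventBus()
	ws, stopWS := bus.Subscribe(4)   // stands in for a /v1/ws client
	zap, stopZAP := bus.Subscribe(4) // stands in for a ZAP connection
	defer stopWS()
	defer stopZAP()

	n := bus.Publish(Event{Type: "keygen", Payload: "wallet-1"})
	fmt.Println("delivered to", n, "subscribers")
	fmt.Println("ws got:", (<-ws).Type, "zap got:", (<-zap).Type)
}
```

Non-blocking sends keep a stalled client from holding up a publisher, at the cost of that client missing events.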

closes luxfi#1)

Rewrite INSTALLATION.md to lead with consensus-mode (ZAP transport, no NATS/Consul deps), matching the production path described in README. The legacy NATS+Consul transport is retained in Appendix A with a deprecation banner and a migration checklist for operators running older installs.

New docs:
- docs/RUNBOOK.md: operator checklist covering daily health probes, key rotation (share refresh vs. wallet rotation), node add/remove without resharing, backup and restore, incident triage (degraded node, signing timeouts, suspected compromise, HSM outage), rollback procedure, audit log guidance.
- docs/HEALTH.md: /healthz endpoint contract covering response schema, status-code semantics, K8s probe recommendations, sampling cadence, alert rules, and the explicit list of what the endpoint does NOT check (so operators add canaries, not instrumentation).

New script:
- scripts/smoke-test.sh: end-to-end local canary. Generates a peer manifest, per-node identities, an event initiator, and boots a 3-node consensus-mode cluster on loopback. Waits for /healthz ready=true on each node, runs keygen (CGGMP21 secp256k1), signs a random nonce, structurally verifies the signature, and tears down. Exits 0 on success. SKIP_BOOT=1 repoints at a running staging cluster for periodic canary use.

All flag names, env vars, endpoint paths, response fields, and config keys cross-checked against cmd/mpcd/main.go and pkg/api/server.go on main. The CLI binary names (mpcd daemon, mpc helper) match the Makefile build targets.
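
The "waits for /healthz ready=true" step can be sketched in Go as a poll-until-ready loop. This illustrates the technique only; the script itself is shell, and the names `healthReply`, `parseReady`, `probeOnce`, and `waitReady` are hypothetical. Decoding `ready` as a JSON boolean is an assumption, and the demo uses an in-process stand-in server rather than a real `mpcd` node.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// healthReply keeps only the field this sketch needs (assumed to be a bool).
type healthReply struct {
	Ready bool `json:"ready"`
}

// parseReady reports whether a /healthz body says ready=true.
func parseReady(body []byte) (bool, error) {
	var h healthReply
	if err := json.Unmarshal(body, &h); err != nil {
		return false, err
	}
	return h.Ready, nil
}

// probeOnce does a single GET; transport errors and malformed bodies are
// treated as "not ready yet" so the caller keeps polling.
func probeOnce(ctx context.Context, client *http.Client, url string) bool {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false
	}
	resp, err := client.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false
	}
	ready, err := parseReady(body)
	return err == nil && ready
}

// waitReady polls url every interval until it is ready or ctx expires.
func waitReady(ctx context.Context, client *http.Client, url string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if probeOnce(ctx, client, url) {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("%s not ready: %w", url, ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	// Stand-in node: answers 503 ready=false twice, then 200 ready=true.
	var calls atomic.Int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ready := calls.Add(1) > 2
		if !ready {
			w.WriteHeader(http.StatusServiceUnavailable)
		}
		json.NewEncoder(w).Encode(healthReply{Ready: ready})
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := waitReady(ctx, srv.Client(), srv.URL+"/healthz", 20*time.Millisecond); err != nil {
		fmt.Println("smoke test would abort:", err)
		return
	}
	fmt.Println("node ready after", calls.Load(), "probes")
}
```

Against a real cluster, the same loop would run once per node URL with a longer deadline.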
Closes #1.
Follow-up to the issue I filed: rewrites `docs/INSTALLATION.md` to lead with consensus-mode (ZAP transport, no NATS/Consul deps), moves the legacy NATS+Consul transport to an appendix with a deprecation banner + migration checklist, and lands the three companion docs / scripts the issue asked for.

Files

`docs/INSTALLATION.md` (rewrite)
- […] `mpcd start --node-id ... --peer ...`, no external message bus).
- […] `mpcd start`: node id, listen/api/api-listen ports, threshold, peers, `--hsm-provider`, `--hsm-signer`, `--hsm-attest`, log level; cross-checked against `cmd/mpcd/main.go` on `main`.
- […] `mpc` CLI (matches `cmd/mpc/`), event initiator setup, Docker Compose example aligned with the actual `compose.yml` (SQLite default, Postgres+Valkey commented), Kustomize pointer for `k8s/`.
- […] `environment: production` + cloud HSM provider (the startup guard in `resolveZapDBPassword` rejects `env`/`file` in production), encrypted identities, share rotation cadence, off-host backups.

`docs/RUNBOOK.md` (new)
Operator checklist for the day-to-day:
- […] `initialDelaySeconds`.
- […] `POST /v1/wallets/{id}/reshare`) vs. wallet rotation (new key); operator preconditions / post-conditions / checklist for each.
- […] `ready: false`, signing timeouts, suspected node compromise (isolate → reshare → rotate password → destroy backups), HSM/KMS outage.

`docs/HEALTH.md` (new)
Spec for `GET /healthz` on the internal API listener:
- […] `healthHandler` in `cmd/mpcd/main.go`: `status`, `node_id`, `mode`, `expected_peers`, `connected_peers`, `ready`, `threshold`, `version`.
- `200` when `ready: true`, `503` when `ready: false`.
- […] `initialDelaySeconds` and the rationale for each.
- […] `/healthz` does NOT check section so operators add canaries (see `scripts/smoke-test.sh`) instead of over-instrumenting the probe.

`scripts/smoke-test.sh` (new, executable)
End-to-end canary:
- […] `mpc generate-peers -n 3`), per-node identities (`mpc generate-identity`), an event initiator (`mpc generate-initiator`).
- […] `mpcd` nodes on loopback (`:19651-3` P2P, `:19800-2` API).
- […] `/healthz ready=true` on each node (60 s timeout, configurable).
- `POST /keygen` with a random org + wallet id, requires `result_type == "success"` and a non-empty `ecdsa_pub_key`.
- `SKIP_BOOT=1` points the same test at a running staging cluster for periodic canary use.
- […] `WORKDIR` on exit.
- […] `mpcd`, `mpc`, `curl`, `jq`, `openssl`.

Cross-checks done
- […] `cmd/mpcd/main.go` (`--node-id`, `--listen`, `--api`, `--api-listen`, `--hsm-provider`, `MPC_HSM_PROVIDER`, `--hsm-attest`, `MPC_HSM_ATTEST`, etc.).
- […] `mux.HandleFunc` calls on `main`: `/healthz`, `/keys`, `/backup`, `/keygen`. Response fields verified against the `healthHandler` map literal.
- `config.yaml` schema verified against the existing `config.yaml` in the repo (`mode`, `environment`, `mpc_threshold`, `max_concurrent_keygen`, `db_path`, `backup_enabled`, `backup_period_seconds`, `backup_dir`, `event_initiator_pubkey`).
- […] `Makefile` (`mpcd` and `mpc` binaries).
- […] `compose.yml` (SQLite default, Postgres + Valkey commented out).
- […] `k8s/mpc-statefulset.yaml` (5 replicas, `readOnlyRootFilesystem`, `node{0..4}@mpc-node-{0..4}.mpc-node-headless.lux-mpc.svc:9651`).
- […] `resolveZapDBPassword()`: production + env/file provider is rejected.

Happy to split any of these into smaller PRs if preferred, e.g. land INSTALLATION + HEALTH first and RUNBOOK + smoke-test as a follow-up. LMK.