Hi @a-kuprin! A couple of questions:
Do we really need all of them to be non-private?
Or do we want a single instance to be responsible for the whole single PoC cycle?
Definitely agree with part 1, let's do it.
I support the idea of allowing Kubernetes deployment, but I'd keep the simplest single-instance-per-service setup as the base deploy approach. Overall, that's a solid long-term goal. I'd try to split it into smaller steps for implementation and phases of deploy, and keep them compatible with the deployed version:
Make every node-side service horizontally scalable and rolling-update safe, with no single point of failure, and deliver a Kubernetes-ready deployment packaged as a Helm chart.
This proposal builds on the current architecture
(../high-availability-architecture.md) and the binary-rollout design
(../rolling-update.md). It does not change the
inference-chain.
1. Goals
ideally across machines.
../rolling-update.md).
callbacks) horizontally.
transaction and each block-driven action happens once.
prerequisite for packaging the node stack as a Helm chart: one chart (or
subcharts) per service, externalized shared dependencies (Postgres, NATS,
Redis), and native K8s primitives for scaling, health, drain, and rolling
updates (see §9).
What already meets the bar
- proxy
- edge-api, behind edge-api-router (round-robin)
- versiond + devshardd, behind versiond-router (sticky hash), on shared Postgres

What blocks HA today
decentralized-api (dapi) is a single monolithic process: its chain event
listener / phase engine has no leader election, it embeds a per-process
NATS, holds a local keyring for signing, and queries the chain
directly. Two dapi instances would duplicate transactions and ML commands
(see ../high-availability-architecture.md §5). This proposal restructures
dapi so that it can be made highly available.
2. Target architecture (overview)
Two pillars:
separate edge-api tier gathers block events from the chain, publishes
them to NATS (and other subscribers), and caches chain queries so node
services do not each open their own gRPC/RPC subscriptions. The same surface
can later be reused by dashboards and monitoring (read APIs + event stream)
without coupling observability to dapi or devshardd. Redis backs edge-api's
shared state and leader election.
standalone HA NATS queue, a stateless signer service, Postgres as
the only stateful backend, and stateless REST/echo workers.
3. Pillar A — edge-api as the HA event hub & chain cache
Purpose of a separate edge-api
edge-api is split out as its own service so the node has one chain-facing
tier with two jobs:
NewBlock + per-tx events), normalize them, and transmit the stream to
NATS and other subscribers (dapi node-manager, devshardd, PoC workers).
params, escrows, etc.) so every consumer does not dial gRPC/RPC independently.
That separation keeps dapi and devshardd focused on their domain logic while
edge-api owns how the node talks to the chain. The same HTTP/gRPC read
surface and event fan-out can be reused later by dashboard and monitoring
systems (status pages, ops tooling, external observers) without embedding
chain clients in each product binary.
Today edge-api is a stateless read-only proxy for 22 Tier A routes. We extend it
to be the chain-access layer for the whole node.
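
As a rough illustration of the chain-cache role, the sketch below keeps query results until the next block arrives, so repeated reads from many consumers cost one upstream call per block. The proposal does not specify an invalidation policy; per-block invalidation, the class name `ChainQueryCache`, and the `fetch` callable are all assumptions made for this example.

```python
class ChainQueryCache:
    """Hypothetical per-block cache in front of chain queries (not the real edge-api code)."""

    def __init__(self, fetch):
        self._fetch = fetch        # assumed callable: query name -> value read from the chain
        self._height = 0           # height of the last block seen
        self._entries = {}         # query name -> cached value for the current height
        self.upstream_calls = 0    # counts real chain reads, for illustration only

    def on_new_block(self, height):
        # A new block may change any cached value, so drop everything.
        if height > self._height:
            self._height = height
            self._entries.clear()

    def get(self, query):
        # Serve from cache when possible; otherwise read the chain once and remember it.
        if query not in self._entries:
            self.upstream_calls += 1
            self._entries[query] = self._fetch(query)
        return self._entries[query]


cache = ChainQueryCache(lambda q: f"{q}@chain")
first = cache.get("params")
second = cache.get("params")      # served from cache
cache.on_new_block(1)
third = cache.get("params")       # re-read after the new block
```

A production cache would likely distinguish rarely changing values (params) from per-block state and would need to be shared across instances; this sketch ignores both.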
3.1 Move the event listener into edge-api
(decentralized-api/internal/event_listener/) into edge-api.
NewBlock + RPC BlockResults per-tx events) and re-publishes normalized events to
NATS and other consumers (dapi services, devshardd) via pub/sub.
edge-api events instead of opening their own chain subscriptions. This
removes N independent chain subscriptions and centralizes block processing.
3.2 Leader election (only one instance triggers events)
edge-api scales to N instances, but block-driven side effects must fire
once. So:
cursor), but only the elected leader advances the canonical block cursor
and emits the authoritative event stream.
SET NX PX lease with renewal) and the shared event cursor
(last_processed_height) so a new leader resumes exactly where the old one
stopped: no gaps, no replays.
via pub/sub, so followers and consumers see the same stream the leader
produced. On leader loss, another instance takes the lock within the lease TTL
and continues from the Redis cursor.
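
The lease-plus-cursor behaviour described in this section can be sketched as follows. To stay runnable without a Redis server, the example uses an in-memory stand-in (`LeaseStore`, a hypothetical name) with an injected clock. In real Redis the acquire step corresponds to `SET key value NX PX ttl`; the owner-checked renewal and cursor update would need a script or transaction, which is an assumption here rather than something the proposal specifies.

```python
class LeaseStore:
    """In-memory stand-in for Redis: one leader lease plus the shared cursor."""

    def __init__(self):
        self.now_ms = 0        # injected clock so the sketch is deterministic
        self._lease = None     # (owner, expires_at_ms) or None
        self.cursor = 0        # plays the role of last_processed_height

    def _live_owner(self):
        if self._lease is not None and self._lease[1] > self.now_ms:
            return self._lease[0]
        return None

    def acquire_or_renew(self, owner, ttl_ms):
        # Succeeds if the lease is free, expired, or already held by this owner.
        holder = self._live_owner()
        if holder is None or holder == owner:
            self._lease = (owner, self.now_ms + ttl_ms)
            return True
        return False

    def advance_cursor(self, owner, height):
        # Only the live leader may advance, and only to the next height.
        if self._live_owner() != owner or height != self.cursor + 1:
            return False
        self.cursor = height
        return True


def process_block(store, instance, height, emitted, ttl_ms=3000):
    """Every instance sees the block; only the leader emits the authoritative event."""
    if not store.acquire_or_renew(instance, ttl_ms):
        return False                      # follower: observe only
    if store.advance_cursor(instance, height):
        emitted.append((instance, height))
        return True
    return False                          # already processed: no replay


store, emitted = LeaseStore(), []
process_block(store, "edge-a", 1, emitted)   # edge-a takes the lease
process_block(store, "edge-b", 1, emitted)   # edge-b is a follower
process_block(store, "edge-a", 2, emitted)
store.now_ms += 5000                         # edge-a stops renewing; lease expires
process_block(store, "edge-b", 2, emitted)   # new leader, but height 2 is already done
process_block(store, "edge-b", 3, emitted)   # resumes from the cursor
```

The sketch advances the cursor before emitting and does not handle a crash between those two steps, so consumers would probably still want to deduplicate by block height.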
3.3 Redis as edge-api shared state
last_processed_height)
3.4 Consumers stop querying the chain directly
for chain reads and subscribes to edge-api for events. This shrinks dapi to its
unique responsibilities (below).
new block/phase) rather than maintaining its own chain WebSocket, reducing
per-child chain connections. (devshardd keeps its own gRPC tx path for
disputes, or routes them through the signer queue — see §5.)
4. Pillar B — decentralized-api becomes single-purpose
After Pillar A, dapi sheds chain-query and event-subscription duties. Its
remaining unique responsibilities are:
etc.).
validation, proof serving).
These become independently deployable, mostly-stateless services. The only
mutable backends are Postgres (shared) and the NATS queue.
4.1 Service decomposition
/v2/poc-batches/...) and admin REST events
4.2 Standalone HA NATS queue
Replace the embedded per-process NATS
(decentralized-api/internal/nats/server/server.go) with a standalone,
clustered NATS (JetStream) shared by all instances.
chain.tx), carrying the message and metadata. Producers are any service that needs to
write to chain (node-manager, PoC commit, edge-srv, devshardd disputes).
survives instance restarts and decouples producers from the signer.
4.3 Signer service (warm-key signing, exactly-once consume)
chain.tx: with a queue group, each message is delivered to exactly one signer
instance, so N signers share the load but never double-sign the same message.
authz.MsgExec with feegrant from the cold account where applicable (current model in
cosmosclient/tx_manager), signs, and broadcasts to the chain.
(txs_to_send / txs_to_observe equivalents) so any signer instance can pick
up retries. Idempotency keys (e.g. inference id + msg type) guard against
duplicate submission across retries/failover.
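
A minimal sketch of the idempotency-key guard follows. The key derivation (SHA-256 over inference id and message type), the `TxLedger` class, and the message field names are illustrative assumptions. In the proposal's design the ledger would live in shared Postgres, for example as a table with a unique constraint on the key.

```python
import hashlib


def idempotency_key(inference_id, msg_type):
    """Derive a stable key from the fields the proposal names (inference id + msg type)."""
    return hashlib.sha256(f"{inference_id}|{msg_type}".encode()).hexdigest()


class TxLedger:
    """In-memory stand-in for a shared table with a UNIQUE constraint on the key."""

    def __init__(self):
        self._status = {}

    def claim(self, key):
        # First claimant wins; a redelivered or retried message sees the existing row.
        if key in self._status:
            return False
        self._status[key] = "pending"
        return True

    def mark_broadcast(self, key):
        self._status[key] = "broadcast"


def handle(ledger, msg, broadcast):
    """Sign-and-broadcast path of a hypothetical signer instance."""
    key = idempotency_key(msg["inference_id"], msg["msg_type"])
    if not ledger.claim(key):
        return "duplicate"
    broadcast(msg)                 # stands in for sign + broadcast to chain
    ledger.mark_broadcast(key)
    return "sent"


ledger, sent = TxLedger(), []
m1 = {"inference_id": "inf-1", "msg_type": "MsgFinishInference"}  # hypothetical type name
m2 = {"inference_id": "inf-2", "msg_type": "MsgFinishInference"}
results = [handle(ledger, m, sent.append) for m in (m1, m2, m1)]  # m1 is redelivered
```

A real ledger would also need a policy for rows stuck in "pending" after a signer crash, so that another instance can retry them; that is left out here.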
signer — other services never need the keyring.
4.4 Postgres as the only HA data backend
(payloads, stats, PoC artifacts/commits metadata, config/cursors as needed).
Per-process SQLite KV (e.g. dapi apiconfig last_processed_height) moves to
Postgres/Redis so any instance is interchangeable.
(../high-availability-architecture.md §4).
4.5 Stateless echo workers (PoC callbacks + admin)
it only reads/writes Postgres or publishes to NATS. Therefore it can run
N instances behind the proxy with no coordination.
commands) belong to the leader-elected node-manager, not the echo workers.
5. Rolling updates
Rolling updates apply per service and reuse the design in
../rolling-update.md. The HA stack depends on the same
drain semantics at two layers: binary swap inside a live supervisor, and
whole-host evacuation behind the sticky router.
Rolling-update concepts (summary)
The rolling-update plan defines how we roll out new
binaries without dropping in-flight work. Three operator guarantees:
while work is still running.
instance drains until idle, then exits.
Blue/green + drain inside versiond (Part 1 §1.1). When governance publishes
a binary with the same version name but a new sha256, versiond downloads the
new devshardd, starts it on a new port while the old child keeps serving,
waits for GET /ready (not just TCP accept), atomically swaps the in-process
route table so new requests hit the new child, marks the old child draining
(out of the route table but still alive), polls in-flight count until zero
(or a drain timeout), then sends SIGTERM with a long shutdown grace. Old and new
can overlap only when durable state lives in shared Postgres; SQLite is
single-writer and cannot support concurrent children (Part 1 §1.2).
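
The swap-then-drain sequence above can be sketched as a small state machine. The `Child` and `Supervisor` classes are hypothetical simplifications, not versiond's actual types. Process management, the `/ready` HTTP check, locking around the swap, and the drain timeout are replaced by plain fields or omitted.

```python
class Child:
    """Hypothetical handle on one devshardd child process."""

    def __init__(self, name, port):
        self.name, self.port = name, port
        self.ready = False       # would be set after GET /ready succeeds
        self.in_flight = 0       # requests currently being served
        self.stopped = False     # set once the supervisor has sent SIGTERM


class Supervisor:
    """Route table with one active child and a list of draining children."""

    def __init__(self, active):
        self.active = active
        self.draining = []

    def route(self):
        # New requests always go to the active child.
        self.active.in_flight += 1
        return self.active

    def swap(self, new_child):
        # Refuse to swap until the new child passes its readiness gate.
        if not new_child.ready:
            raise RuntimeError("new child is not ready")
        old, self.active = self.active, new_child
        self.draining.append(old)     # out of the route table, still alive

    def reap(self):
        # One polling pass: stop drained children, keep the busy ones.
        still_busy = []
        for child in self.draining:
            if child.in_flight == 0:
                child.stopped = True  # stands in for SIGTERM with shutdown grace
            else:
                still_busy.append(child)
        self.draining = still_busy


old = Child("devshardd-old", 9001)
old.ready = True
sup = Supervisor(old)
req1 = sup.route()                    # long-running request lands on the old child

new = Child("devshardd-new", 9002)
new.ready = True
sup.swap(new)
req2 = sup.route()                    # new traffic goes to the new child

sup.reap()
stopped_while_busy = old.stopped      # old child is still serving req1
req1.in_flight -= 1                   # req1 finishes
sup.reap()
```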
Two drain layers, not to be conflated (Part 1 §1.7–§1.8).
- Binary swap inside versiond: versiond-router upstream unchanged.
- versiond-router host evacuation: mark upstream down, drain pinned escrows, then stop the host.

During a devshardd binary swap, sticky routing is unchanged: the router still
points at versiond-N:8080; only the child port inside versiond swaps. Router
drain is for when the versiond process itself must leave the pool (scale-down,
host maintenance, versiond binary upgrade).
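
To show why evacuating one host only disturbs the escrows pinned to it, here is a sketch of sticky routing using rendezvous (highest-random-weight) hashing. The proposal says only "sticky hash" and does not name the algorithm, so rendezvous hashing is a substitute chosen for illustration. The upstream names and escrow ids are invented.

```python
import hashlib


def pick_upstream(escrow_id, upstreams):
    """Rendezvous hashing: each escrow goes to the upstream with the highest score."""
    def score(upstream):
        return hashlib.sha256(f"{escrow_id}|{upstream}".encode()).digest()
    return max(upstreams, key=score)


upstreams = ["versiond-0:8080", "versiond-1:8080", "versiond-2:8080"]
escrows = [f"escrow-{i}" for i in range(200)]

# Pinning before the evacuation.
before = {e: pick_upstream(e, upstreams) for e in escrows}

# Mark versiond-1 down: it leaves the candidate set.
remaining = [u for u in upstreams if u != "versiond-1:8080"]
after = {e: pick_upstream(e, remaining) for e in escrows}

# Only escrows that were pinned to the evacuated host are remapped.
moved = [e for e in escrows if before[e] != after[e]]
```

Escrows on the other hosts keep their upstream, which is the property that lets the router drain one host without breaking stickiness elsewhere.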
Signals the plan adds to devshardd: /healthz (liveness), /ready (readiness
gate for route swap), /drain/status (in-flight work), and configurable
DEVSHARD_SHUTDOWN_GRACE so long SSE streams are not cut at 5s.
Kubernetes mapping (Part 2). The same guarantees map to RollingUpdate
(maxUnavailable: 0, maxSurge: 1), readinessProbe → /ready, preStop (drop from
endpoints before SIGTERM), and terminationGracePeriodSeconds aligned with
shutdown grace. Pod/host evacuation maps to Part 1 §1.8 (router drain), not
the in-versiond binary swap.
How rolling updates apply in this HA proposal
rolling update — bring a new instance up, health-check, route to it, drain the
old. Behind their routers / queue groups this is transparent.
may trigger a leader handoff; the Redis lease + cursor make this safe (new
leader resumes from the cursor).
versiond, with the shared Postgres making old+new overlap correct — see
../rolling-update.md §1.
versiond-router (mark upstream down, wait for pinned escrows idle, then
stop the host); see ../rolling-update.md §1.8.
their native rolling procedures, independent of app rollouts.
6. Kubernetes & Helm (deployment target)
Docker Compose overlays (local-test-net/, deploy/join/) prove multi-instance
topology today; production HA should land on Kubernetes with a Helm
chart that encodes the same service boundaries as this proposal.
A Helm chart alone does not make a monolith HA.
Intended chart shape (high level)
- edge-api: Deployment + Service; readinessProbe → /healthz; HPA-friendly
- edge-api event hub
- versiond: Deployment + sticky Service or Ingress consistent-hash on escrow id
- devshardd
- signer: Deployment; NATS queue-group consumer; one message consumed once
- dapi services
- proxy
- values.yaml

Helm deliverable
HA overlays (multi edge-api, multi versiond, external NATS/Redis) via values.
PGHOST, NATS URL, Redis URL, chain gRPC/RPC URLs, replica counts, resource
limits, and graceful shutdown timeouts aligned with inference/SSE duration.
helm template / helm lint on chart changes; optional kind smoke.
Compose remains the developer / integration-test path; Helm is the target
for production Kubernetes once Pillars A–B and phasing steps 1–5 are in place.
7. Notes
JetStream KV suffices. (Proposal assumes Redis per the stated direction.)