
NATS Supercluster

This page is the operational guide for the static sample manifests committed at ci/manifests/nats-supercluster/: a multi-cluster NATS JetStream topology whose clusters are meshed through gateways.

For the underlying Python generator (ClusterTopology / SuperclusterTopology / build_nats_conf / build_cluster_manifests in noetl/core/runtime/nats_topology.py), see the NATS Supercluster page on the noetl/noetl wiki.

What's in `ci/manifests/nats-supercluster/`

| File | Purpose |
| --- | --- |
| `namespace.yaml` | `nats-supercluster` Namespace. Separate from the existing `nats` namespace so both topologies can coexist. |
| `cluster-a.yaml` | ConfigMap + 3-replica StatefulSet + headless Service for cluster `a` in `us-east-1`. |
| `cluster-b.yaml` | Symmetric for cluster `b` in `us-west-2`. |
| `README.md` | Quick-start: apply + verify + regen recipe. |

Sibling: a parameterized renderer playbook at automation/infrastructure/nats_supercluster.yaml deploys one cluster member at a time with overridable parameters. The static 2-cluster manifests in this directory are the opinionated reference for local kind validation; the playbook is the parameterized path for arbitrary deployments.

Cluster vs. supercluster

NATS exposes two related topologies:

Cluster — 3+ NATS servers connected via a cluster {} block; share JetStream state via Raft consensus. Single account namespace, mutual route URLs between members.
Supercluster — multiple clusters connected via NATS gateway connections (gateway {} block). Each cluster has its own JetStream state; gateways enable cross-cluster subject routing without shared Raft.

The static manifests here ship the supercluster shape — two 3-replica clusters with mutual gateway connections.

Generated `nats.conf` (cluster `a`)

port: 4222
http_port: 8222

jetstream {
  store_dir: /data/jetstream
  domain: "tenant_default_org_default_region_us_east_1_cluster_a"
  max_memory_store: 1GB
  max_file_store: 5GB
}

cluster {
  name: "a"
  port: 6222
  routes: [
    nats-route://nats-cluster-a-0.nats-cluster-a.nats-supercluster.svc.cluster.local:6222
    nats-route://nats-cluster-a-1.nats-cluster-a.nats-supercluster.svc.cluster.local:6222
    nats-route://nats-cluster-a-2.nats-cluster-a.nats-supercluster.svc.cluster.local:6222
  ]
}

gateway {
  name: "a"
  port: 7222
  gateways: [
    { name: "b", urls: ["nats://nats-cluster-b.nats-supercluster.svc.cluster.local:7222"] }
  ]
}

accounts {
  $SYS {
    users: [
      { user: sys, password: sys }
    ]
  }
  NOETL {
    jetstream: enabled
    users: [
      { user: noetl, password: noetl }
    ]
  }
}
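The route and gateway URLs in the config above follow the usual StatefulSet DNS pattern: routes address each pod through the headless Service, while the gateway entry addresses the peer cluster's Service name. A minimal sketch of that naming, assuming the `nats-cluster-<id>` naming seen in the config; the helper names `pod_route_urls` and `gateway_url` are hypothetical and are not the generator's API:

```python
# Hypothetical helpers that reproduce the URL shapes shown in the generated
# nats.conf. They are illustrative only, not the real build_nats_conf code.

def pod_route_urls(cluster: str, size: int,
                   namespace: str = "nats-supercluster",
                   port: int = 6222) -> list[str]:
    """One nats-route:// URL per StatefulSet pod, via the headless Service."""
    name = f"nats-cluster-{cluster}"
    return [
        f"nats-route://{name}-{i}.{name}.{namespace}.svc.cluster.local:{port}"
        for i in range(size)
    ]


def gateway_url(cluster: str,
                namespace: str = "nats-supercluster",
                port: int = 7222) -> str:
    """Gateway URL of a peer cluster, resolved through its Service DNS name."""
    name = f"nats-cluster-{cluster}"
    return f"nats://{name}.{namespace}.svc.cluster.local:{port}"


for url in pod_route_urls("a", 3):
    print(url)
print(gateway_url("b"))
```

Calling `pod_route_urls("a", 3)` yields the three entries of the `routes` list, and `gateway_url("b")` yields the single URL in the `gateways` list.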

The accounts block is preserved verbatim from the existing single-node ci/manifests/nats/nats.yaml (on the noetl/noetl repo) so the noetl user — and every client + worker that currently authenticates against it — keeps working against the supercluster without re-issuing credentials. Per-tenant accounts are out-of-phase follow-up work.

The JetStream domain is derived from the URN in ClusterTopology.cluster_urn: take the URN's NATS subject form, strip the leading noetl. prefix, and replace every . and - with _.
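The derivation can be sketched in a few lines. The input string below is an assumption: this page only shows the resulting domain, so the subject form is guessed by working backwards from it, and `jetstream_domain` is a hypothetical name rather than the generator's function.

```python
def jetstream_domain(subject: str, prefix: str = "noetl.") -> str:
    """Strip the leading prefix, then map '.' and '-' to '_'."""
    subject = subject.removeprefix(prefix)  # Python 3.9+
    return subject.replace(".", "_").replace("-", "_")


# Assumed subject form for cluster "a" in us-east-1 (not confirmed by the source).
assumed_subject = "noetl.tenant.default.org.default.region.us-east-1.cluster.a"
print(jetstream_domain(assumed_subject))
```

Under that assumption the result matches the `domain` value in the generated config for cluster `a`.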

Install + verify

Manual one-off cluster setup — not bundled into noetl k8s deploy. The existing single-node deployment in the nats namespace stays untouched; the supercluster is a separate, opt-in topology.

Apply

# From the noetl/ops repo root
kubectl apply -f ci/manifests/nats-supercluster/namespace.yaml
kubectl apply -f ci/manifests/nats-supercluster/cluster-a.yaml
kubectl apply -f ci/manifests/nats-supercluster/cluster-b.yaml

kubectl rollout status statefulset/nats-cluster-a -n nats-supercluster
kubectl rollout status statefulset/nats-cluster-b -n nats-supercluster

Verify

# Pods Running / Ready
kubectl get pods -n nats-supercluster -o wide

# Inspect routes and gateways via the monitoring port.
# port-forward blocks, so run it in a second terminal (or background it with &).
kubectl port-forward -n nats-supercluster nats-cluster-a-0 18222:8222
curl -s http://localhost:18222/routez   | jq '.num_routes'
curl -s http://localhost:18222/gatewayz | jq '{name, outbound: .outbound_gateways|keys, inbound: .inbound_gateways|keys}'

# Or via the nats CLI (server reports need system-account credentials, e.g. the sys user above)
nats server report gateways
nats stream info <stream-name>

In a healthy 2-cluster supercluster you should see:

server_name: unique per pod (e.g. nats-cluster-a-0).
cluster.name: matches the cluster ID (a or b).
cluster.urls: N pod-DNS routes inside the same cluster.
gateway.outbound_gateways: the peer cluster's name.
gateway.inbound_gateways: the peer cluster's name (bidirectional once both clusters are up).

Validation in a local kind cluster confirmed all of the above plus the URN-derived JetStream domain: tenant_default_org_default_region_us_east_1_cluster_a.
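The gateway part of the checklist above can also be checked programmatically. The following is a rough sketch, assuming the port-forward from the Verify step is running on localhost:18222; the function names are hypothetical, and only the `name`, `outbound_gateways`, and `inbound_gateways` fields of the `/gatewayz` response are used.

```python
import json
from urllib.request import urlopen


def fetch_gatewayz(base_url: str = "http://localhost:18222") -> dict:
    """Fetch /gatewayz from the NATS monitoring port (needs the port-forward)."""
    with urlopen(f"{base_url}/gatewayz", timeout=5) as resp:
        return json.load(resp)


def gateway_problems(gatewayz: dict, expected_name: str,
                     expected_peers: set[str]) -> list[str]:
    """Compare a /gatewayz payload with the expected gateway mesh."""
    problems = []
    if gatewayz.get("name") != expected_name:
        problems.append(f"gateway name is {gatewayz.get('name')!r}, "
                        f"expected {expected_name!r}")
    # Both fields are maps keyed by the peer cluster's gateway name.
    outbound = set(gatewayz.get("outbound_gateways") or {})
    inbound = set(gatewayz.get("inbound_gateways") or {})
    for peer in sorted(expected_peers - outbound):
        problems.append(f"no outbound gateway to {peer!r}")
    for peer in sorted(expected_peers - inbound):
        problems.append(f"no inbound gateway from {peer!r}")
    return problems


# Usage against cluster "a" (peer "b"), once the port-forward is up:
#   print(gateway_problems(fetch_gatewayz(), "a", {"b"}) or "gateway mesh OK")
```

An empty list means both directions of the mesh are established; inbound entries can lag briefly until the peer cluster is up.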

Tuning

| Knob | Default | Rule of thumb |
| --- | --- | --- |
| `cluster_size` | `3` | JetStream Raft requires 3 minimum for HA. Bump to 5 for higher fault tolerance; odd sizes only. |
| `region` / `zone` (per cluster) | `None` | Set per-cluster locality so pod labels carry placement metadata for the scheduler. |
| Gateway TLS | not configured | Production deployments should add `tls { ... }` inside the `gateway` block. Out-of-phase follow-up. |
| `max_file_store` / `max_memory_store` | `5GB` / `1GB` | Per-cluster JetStream storage. Match `volumeClaimTemplates.storage`. |
| PVC size | `5Gi` | `volumeClaimTemplates` matches `max_file_store`; bump together. |
| Per-tenant accounts | single `NOETL` | Out-of-phase. The current manifests preserve the existing `NOETL` account; per-tenant accounts wait for the catalog era. |
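The "odd sizes only" rule of thumb for `cluster_size` follows from majority-quorum arithmetic: a Raft group of n members needs n // 2 + 1 of them alive, so an even size adds a pod without adding fault tolerance. A small illustration (the function names are just for this sketch):

```python
def raft_quorum(members: int) -> int:
    """Smallest majority of a Raft group with the given member count."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - raft_quorum(members)


for size in (1, 2, 3, 4, 5):
    print(f"cluster_size={size}: quorum={raft_quorum(size)}, "
          f"tolerates {tolerated_failures(size)} failure(s)")
```

Sizes 3 and 4 both tolerate a single failure, and 5 tolerates two, which is why the step up from 3 goes straight to 5.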

Resource footprint

With the default cluster_size: 3 and 2 clusters, the supercluster runs 6 pods, each requesting cpu: 250m / memory: 512Mi with limits of cpu: 1000m / memory: 2Gi. The full default supercluster therefore requests 1500m CPU and 3Gi memory in total, and can burst up to its combined limits of 6000m CPU and 12Gi memory.
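The totals are plain multiplication over the per-pod values, and the same arithmetic is handy when changing `cluster_size` or the number of clusters. A small sketch using the per-pod figures quoted above (the `footprint` helper is illustrative only):

```python
# Per-pod resources from the default manifests (CPU in millicores, memory in Mi).
REQUEST_CPU_M, REQUEST_MEM_MI = 250, 512
LIMIT_CPU_M, LIMIT_MEM_MI = 1000, 2048


def footprint(clusters: int = 2, cluster_size: int = 3) -> dict:
    """Aggregate requests and limits for the whole supercluster."""
    pods = clusters * cluster_size
    return {
        "pods": pods,
        "cpu_requests_m": pods * REQUEST_CPU_M,
        "mem_requests_mi": pods * REQUEST_MEM_MI,
        "cpu_limits_m": pods * LIMIT_CPU_M,
        "mem_limits_mi": pods * LIMIT_MEM_MI,
    }


print(footprint())       # default: 2 clusters x 3 replicas
print(footprint(2, 1))   # single-replica clusters
```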

Single-node kind warning

A stock kind cluster runs on one Kubernetes node with the podman VM's CPU budget (typically 4 vCPU on Apple Silicon defaults). The full default supercluster alone consumes ~38% of that in CPU requests. Stacked with the rest of NoETL (postgres + single-node nats + noetl-server / projector / outbox-publisher / 3 workers + paginated-api + KEDA operator), the node hits ~96% CPU requests, at which point the scheduler refuses to place additional pods, including a third supercluster replica or a KEDA-driven scale-up of the noetl-worker pool.

Mitigations for local kind validation:

Drop cluster_size to 1 per cluster — sufficient to validate the gateway mesh + JetStream domain derivation, not HA. 2 pods instead of 6.
Scale the supercluster StatefulSets to 0 while running a KEDA scale-up smoke test, then restore them afterwards. This pattern was exercised during local kind validation and works cleanly.
Bump the podman machine to 6+ vCPU and rebuild the kind cluster.
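To compare the mitigations, it helps to express the supercluster's CPU requests as a share of the node's budget. The sketch below treats the budget as the full vCPU count, which is a simplification (the node's allocatable CPU may be somewhat lower), and uses the 250m per-pod request from the default manifests:

```python
def cpu_request_share(pods: int, node_vcpu: int, request_m: int = 250) -> float:
    """Supercluster CPU requests as a percentage of the node's vCPU budget."""
    return 100.0 * pods * request_m / (node_vcpu * 1000)


print(cpu_request_share(pods=6, node_vcpu=4))  # default supercluster, 4 vCPU
print(cpu_request_share(pods=2, node_vcpu=4))  # cluster_size 1 per cluster
print(cpu_request_share(pods=6, node_vcpu=6))  # default supercluster, 6 vCPU
```

The first value corresponds to the ~38% figure above; the other two show how much headroom each mitigation frees for the rest of the stack.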

For production deployments (GKE / EKS / multi-node clusters), the default cluster_size: 3 works as designed — the node-spread that JetStream Raft benefits from happens naturally with multiple nodes available.

Operational notes worth knowing

These notes cover bugs caught during live kind validation; the generator now bakes in the fixes. They are documented here so that anyone hand-editing manifests doesn't accidentally regress them:

- server_name is required for cluster-mode JetStream. Each pod must register under a unique server name. The generator pulls it from the downward API (metadata.name → POD_NAME → --name $(POD_NAME)). Without it, NATS refuses to start with the error "jetstream cluster requires server_name to be set".
- Use the split /healthz endpoints. Plain /healthz reports failure while the JetStream meta layer is recovering, which is normal at startup but lasts long enough for liveness probes to kill pods before the cluster forms. Use:
  - livenessProbe → /healthz?js-server-only=true
  - readinessProbe → /healthz?js-enabled-only=true
  - startupProbe → /healthz?js-server-only=true with a long failureThreshold (60+) to give cluster formation time.
- Headless Service needs publishNotReadyAddresses: true. Gateway URLs resolve through the peer cluster's headless Service DNS. By default a headless Service only publishes Ready pods, which creates a chicken-and-egg problem between peer clusters (each waits for the other to be Ready before its own gateway DNS resolves). publishNotReadyAddresses: true breaks the cycle.
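For hand-edited manifests, the three points above can be turned into a quick lint. This is a sketch under assumptions, not part of the generator: it takes manifests already parsed into dicts, assumes the NATS container is the first container in the pod spec, and detects the server name only by looking for `$(POD_NAME)` in the container's command or args.

```python
EXPECTED_PROBE_PATHS = {
    "livenessProbe": "/healthz?js-server-only=true",
    "readinessProbe": "/healthz?js-enabled-only=true",
    "startupProbe": "/healthz?js-server-only=true",
}


def lint_statefulset(sts: dict) -> list[str]:
    """Return regressions found in a NATS StatefulSet manifest (as a dict)."""
    problems = []
    # Assumption: the NATS server is the first container in the pod.
    container = sts["spec"]["template"]["spec"]["containers"][0]
    argv = list(container.get("command", [])) + list(container.get("args", []))
    if "$(POD_NAME)" not in " ".join(argv):
        problems.append("server name is not wired from POD_NAME")
    for probe, path in EXPECTED_PROBE_PATHS.items():
        actual = container.get(probe, {}).get("httpGet", {}).get("path")
        if actual != path:
            problems.append(f"{probe} path is {actual!r}, expected {path!r}")
    # Kubernetes defaults failureThreshold to 3 when it is not set.
    threshold = container.get("startupProbe", {}).get("failureThreshold", 3)
    if threshold < 60:
        problems.append(f"startupProbe failureThreshold is {threshold}, want 60+")
    return problems


def lint_headless_service(svc: dict) -> list[str]:
    """Return regressions found in the cluster's headless Service manifest."""
    problems = []
    spec = svc.get("spec", {})
    if spec.get("clusterIP") != "None":
        problems.append("Service is not headless (clusterIP: None)")
    if spec.get("publishNotReadyAddresses") is not True:
        problems.append("publishNotReadyAddresses: true is missing")
    return problems
```

Both functions return an empty list for a manifest that keeps all three fixes, so they can be dropped into a pre-apply check or a unit test.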

What this round does NOT do

No client-side rewiring. NoETL's Python publishers / subscribers and the worker ConfigMap keep pointing at nats.nats.svc.cluster.local. Cluster-aware client routing arrives once the catalog can pick the right cluster per request.
No per-tenant NATS accounts. Future round.
No cross-cluster stream mirror / source. The gateway topology enables it; nothing in this round configures a stream to mirror.
No edits to the existing single-node ci/manifests/nats/. The supercluster is opt-in and lives alongside the existing deployment.

Related

KEDA Scaler: worker autoscaling. Currently uses the single NOETL account; per-tenant accounts + account-aware scalers are out-of-phase work that will key off this supercluster topology.
NATS Supercluster (noetl/noetl wiki): Python generator API reference for ClusterTopology / SuperclusterTopology / build_nats_conf / build_cluster_manifests.
Resource Locator (noetl/noetl wiki): URN scheme. JetStream domains derive from this.
NATS supercluster docs: https://docs.nats.io/running-a-nats-service/configuration/gateways
NATS clustering: https://docs.nats.io/running-a-nats-service/configuration/clustering
