Kadyapam edited this page Jun 19, 2026 · 13 revisions

NoETL Ops

This wiki is the operational + deployment companion to noetl/ops. Topics that live in this repository — Kubernetes manifests, deployment playbooks, CI/CD, infrastructure automation — have their reference pages here.

For NoETL application documentation (Python API, DSL semantics, the v2 distributed-runtime spec, etc.) see the noetl/noetl wiki.

Where the manifests live. As of the Scope B consolidation (May 2026), all NoETL operational manifests live exclusively in noetl/ops/ci/manifests/. The previous parallel copy at noetl/noetl/ci/manifests/ was deleted; only a MOVED.md breadcrumb remains there. The automation/development/noetl.yaml playbook reads from local ci/manifests/... paths (no more cross-repo $NOETL_REPO/ci/manifests/...).

Pages

Manifests (ci/manifests/)

| Page | What |
| --- | --- |
| KEDA Scaler | Worker-pool autoscaling via NATS JetStream consumer lag. Install Helm chart + apply ScaledObject. |
| NATS Supercluster | Multi-cluster JetStream topology with gateway-meshed clusters. Apply 2-cluster reference manifest. |
| System worker pool | Proposed. Deploy topology for the privileged worker-system-pool that runs platform-internal logic (auth, RBAC, scheduled cleanups) as WASM-compiled NoETL playbooks. Tracked under noetl/ai-meta#45 + noetl/ai-meta#46. |
| mTLS (Rust stack) | cert-manager-issued mutual TLS between noetl-server-rust + noetl-worker-rust (Secrets Wallet Phase 4, noetl/ai-meta#61). ci/manifests/noetl/tls/. |

Monitoring & alerting

| Page | What |
| --- | --- |
| Production monitoring (GMP) | Prod runs Google Managed Prometheus, not VictoriaMetrics. Covers the PodMonitoring (worker+server scrape) + Rules (materializer-lag) under ci/manifests/noetl/gmp/, query recipes, managedAlertmanager pager wiring, and the CQRS PUBLISH_ONLY flip prep status (noetl/ai-meta#103). |

Automation playbooks (automation/)

| Page | What |
| --- | --- |
| GKE Helm install | Install + upgrade NoETL on GKE via the Helm chart + Cloud SQL + PgBouncer + chart-templated KEDA. The GKE deploy path the project supports. |
| Firestore MCP agent | Firestore document, event-log, replay, and batch helper methods used by domain playbooks. |

(The automation/development/noetl.yaml kind playbook is currently documented inline in the noetl/noetl wiki under its operational sections; it will migrate here in a future Scope B refactor.)

Kind-cluster validation rigs (automation/development/)

Reproducible end-to-end smoke tests for individual feature surfaces. Each rig is <feature>-validation.yaml + validate-<feature>.sh + validate-<feature>.sql.
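The naming convention above can be sketched as a small helper. This is an illustration only: `rig_files` and `RIG_DIR` are hypothetical names, not part of the repository, and the shard rigs listed below ship as a single `.sh` without the other two files, so the convention is not universal.

```python
from pathlib import PurePosixPath

# Directory taken from the section heading; the helper itself is hypothetical.
RIG_DIR = PurePosixPath("automation/development")


def rig_files(feature: str) -> dict[str, PurePosixPath]:
    """Return the three files a rig for `feature` is expected to consist of."""
    return {
        # NoETL playbook that the driver script registers and executes.
        "playbook": RIG_DIR / f"{feature}-validation.yaml",
        # Shell driver that runs the rig end to end.
        "driver": RIG_DIR / f"validate-{feature}.sh",
        # SQL probes run against the event store after the execution finishes.
        "probes": RIG_DIR / f"validate-{feature}.sql",
    }


if __name__ == "__main__":
    for role, path in rig_files("result-fetch").items():
        print(f"{role}: {path}")
```

A check like this could be used to spot a rig that is missing one of its companion files before it is referenced from a rollout checklist.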

| Rig | What it exercises | Worker pool |
| --- | --- | --- |
| rust-worker-r2-validation | R-2.1 cross-node durable PUT, R-2.1 colocated shm cache, R-2.2 Arrow IPC encoding, producer-side credential scrub. Same PIN_RUST_WORKER=1 auto-pinning shape as the other Rust-only rigs (scales Python worker pool → 0, waits for drain, restores on exit via cleanup trap). SQL probes filter by execution_id = :exec_id passed via psql -v, since worker_id only lands on command.claimed events under the post-EE-4 schema. | Rust |
| result-fetch-validation | result_fetch tool kind (noetl-tools 2.11+): producer over-budget Arrow IPC ref → fetch_via_flight + fetch_via_http via the playbook surface. Scales the Python worker pool to 0 and waits for full pod drain to pin commands to the Rust worker (the Phase A over-budget branch only fires on the Rust side). | Rust |
| flight-tls-validation | R-2.3 Phase C2 full trust boundary: server TLS (C2.1) + client TLS (C2.2) + bearer-token middleware (C2.3) + mTLS (C2.4) all on, talking through the result_fetch tool kind. Companion generate-flight-tls.sh bootstraps the certs + Secrets via openssl (private tmpdir, no repo leakage); --off reverts. Production swap to cert-manager is drop-in: the Secret shape stays the same. | Rust |
| validate-shard-drift-guard.sh | Phase F R3b end-to-end: queries noetl-server GET /api/runtime/shard-info (R3b-1) and noetl-gateway GET /sharding/preview (R3b-2) for a battery of (execution_id, shard_count) pairs and asserts shard_index agreement. Catches runtime drift the unit-test pinning can't see: twox-hash crate version split, SHARD_HASH_SEED divergence, i64→bytes endianness flip. No NoETL playbook execution (it probes diagnostic endpoints only); auto-cleans port-forwards on exit. | n/a (control-plane probe) |
| validate-shard-routing-n2.sh | Phase F R4-5 end-to-end: validates the in-server DbPoolMap routing. Creates noetl_shard_0 + noetl_shard_1 + noetl_cluster databases on the existing postgres pod (cheap path: exercises routing without 3 separate Postgres pods; per-pod isolation is a Phase G concern), applies the noetl schema DDL to each, patches the noetl-server-rust deployment with NOETL_SHARDS + NOETL_CLUSTER_DSN env vars, spawns N executions via POST /api/execute, asserts each landed on the predicted shard per shard_for(execution_id, 2) (queries each per-shard DB directly and cross-checks against the R3b-1 endpoint), then re-runs the R3b drift-guard against the sharded server. Auto-reverts the deployment patch on exit (trap EXIT); idempotent re-creation of databases. | Rust |
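The agreement check performed by the drift-guard rig can be sketched in a few lines. This is a minimal illustration under stated assumptions: the hash is a stand-in (BLAKE2b from `hashlib`, not the twox-hash crate the rig actually guards), and `find_drift`, `shard_little_endian`, and `shard_big_endian` are hypothetical names. In the real rig the two sides are the server and gateway HTTP endpoints rather than local functions.

```python
import hashlib
from typing import Callable, Iterable

# A shard function maps (execution_id, shard_count) -> shard_index.
ShardFn = Callable[[int, int], int]


def _shard(execution_id: int, shard_count: int, byteorder: str) -> int:
    # Stand-in hash: NOT the production twox-hash, only deterministic.
    raw = execution_id.to_bytes(8, byteorder, signed=True)  # i64 -> bytes
    digest = hashlib.blake2b(raw, digest_size=8).digest()
    return int.from_bytes(digest, "little") % shard_count


def shard_little_endian(execution_id: int, shard_count: int) -> int:
    return _shard(execution_id, shard_count, "little")


def shard_big_endian(execution_id: int, shard_count: int) -> int:
    # Same hash, but with the i64 -> bytes endianness flipped.
    return _shard(execution_id, shard_count, "big")


def find_drift(
    pairs: Iterable[tuple[int, int]], left: ShardFn, right: ShardFn
) -> list[tuple[int, int, int, int]]:
    """Return (execution_id, shard_count, left_index, right_index) per mismatch."""
    mismatches = []
    for execution_id, shard_count in pairs:
        a = left(execution_id, shard_count)
        b = right(execution_id, shard_count)
        if a != b:
            mismatches.append((execution_id, shard_count, a, b))
    return mismatches


if __name__ == "__main__":
    battery = [(eid, n) for eid in range(1, 51) for n in (2, 16, 1024)]
    same = find_drift(battery, shard_little_endian, shard_little_endian)
    flipped = find_drift(battery, shard_little_endian, shard_big_endian)
    print(f"identical implementations: {len(same)} mismatches")
    print(f"endianness flipped: {len(flipped)} mismatches")
```

Using a battery with several shard counts matters: with a small shard_count two diverging implementations can agree by chance on many IDs, so larger counts make drift much harder to miss.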

Each .sh is self-contained: registers the playbook, kicks off the execution, polls completion, runs the SQL probes, samples the worker /metrics. Pairs with agents/rules/deployment-validation.md — anything that ships in a container image MUST run through one of these before GKE rollout.
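The "polls completion" step in that sequence is a bounded wait on execution status. A minimal sketch of the pattern, assuming hypothetical status strings (`RUNNING`, `COMPLETED`, `FAILED`) and a caller-supplied `get_status` callable; the actual rigs are shell scripts and their status values and endpoints are not shown on this page.

```python
import time
from typing import Callable

# Hypothetical terminal states; the real status vocabulary may differ.
TERMINAL = frozenset({"COMPLETED", "FAILED"})


def poll_until_done(
    get_status: Callable[[], str],
    *,
    terminal: frozenset = TERMINAL,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status() until it returns a terminal state or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"execution still {status!r} after {timeout_s}s")
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated status source standing in for an HTTP status query.
    statuses = iter(["RUNNING", "RUNNING", "COMPLETED"])
    print(poll_until_done(lambda: next(statuses), interval_s=0.01))
```

Bounding the wait with a deadline keeps a stuck execution from hanging the rig, so the SQL probes and metrics sampling only run once a terminal state has actually been observed.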

Cross-references
