Skip to content

runtime shape

Kadyapam edited this page Jun 3, 2026 · 3 revisions

Runtime shape — compiled core and plug-in ring

Status: Design proposal — tracked under noetl/ai-meta#45 (compiled rewrite) and noetl/ai-meta#46 (system pool + WASM plug-ins). Not yet implemented in this crate.

The engineer-facing architecture rationale lives in the docs site: System Worker Pool and WASM Plug-in Surface. This page is the implementation-level companion — what shape this Rust crate takes and which binaries it produces.

The crate produces multiple binaries

The Rust replacement for the Python control plane is one crate that ships one image with four binaries. Each Helm Deployment runs the same image with a different args: to select its role.

repos/server/
├── Cargo.toml              # name = "noetl-server", produces 4 bins
├── src/
│   ├── lib.rs              # shared library — see "Shared library" below
│   ├── bin/
│   │   ├── server.rs       # --mode=server     HTTP control plane
│   │   ├── publisher.rs    # --mode=publisher  Postgres outbox → NATS
│   │   ├── projector.rs    # --mode=projector  NATS → noetl.event
│   │   └── system_pool.rs  # --mode=system     wasmtime host + dispatch
│   └── wasm/               # only compiled into system_pool.rs
│       ├── host.rs
│       ├── cache.rs
│       └── caller.rs

This shape is Postgres-style: one source tree, multiple roles via separate binaries that share the same library. See ADR for the design rationale.

Shared library — what goes in src/lib.rs

Every binary depends on the same set of primitives:

Module Purpose Used by
config env → typed config all bins
db::pool sqlx PgPool with min/max connections server, publisher, projector
db::models noetl.event, noetl.command, noetl.execution row types server, projector
nats::client async-nats wrapper, JetStream context, publish + subscribe helpers server, publisher, projector, system_pool
nats::subjects subject helpers — event_subject_for_exec, command_subject_for_pool, system_subject_for_kind server, publisher, projector, system_pool
envelope the canonical event envelope per PR-EE-3 (snowflake event_id, meta.attempts, status taxonomy) server, projector, system_pool
scrub response-boundary credential redaction (scrub_in_place) server, projector
metrics prometheus-crate registry + the per-binary metric defs all bins
tracing_init structured tracing setup, execution_id span helpers per observability.md all bins
snowflake snowflake-id generator (snowflaked crate, NoETL epoch, machine_id from pod name hash) server, system_pool

The 60% of code that any single Python pod has today lives here once. Each binary's main is a thin shell around the shared modules.

Binary 1 — server (HTTP control plane)

src/bin/server.rs builds the axum app, wires routes, listens on :8082.

Route inventory (parity with the Python noetl.server):

Method Path What
GET /api/health Liveness
POST /api/catalog/list Browse catalog
GET /api/catalog/{path}/ui_schema UI metadata for a playbook
POST /api/catalog/register Register a playbook
GET /api/catalog/resource Fetch raw playbook YAML
POST /api/execute Start a playbook execution
GET /api/executions/{id} Status
GET /api/executions/{id}/events Event stream
POST /api/events Worker writes an event (the put_result boundary)
GET /api/events/{id}/result Fetch a stored result
POST /api/credentials/... Credential CRUD
GET /api/runtime/contract The runtime contract the gateway and SPA depend on
GET /api/worker/pools List worker pool runtime registrations
... ... (see Python noetl.server.api for the full list)

Implementation status (Phase A of noetl/ai-meta#49): the inventory above is the parity target. Currently wired in Rust: everything except /api/catalog/{path}/ui_schema (the YAML inference helper port — tracked as noetl/server#18). /api/runtimes (a no-filter list of all runtime registrations) was a Rust-only extension and is removed in noetl/server#19 until the Python backport lands.

What's the auth+RBAC layer? In the plug-in design, the server's per-route auth middleware delegates to the system/auth and system/rbac playbooks (dispatched onto worker-system-pool). See the ADR for the capability surface. Until the plug-in ring lands, auth stays inline in the server binary.

Binary 2 — publisher (outbox → NATS)

src/bin/publisher.rs runs a single loop:

async fn main() -> anyhow::Result<()> {
    // 1. PG LISTEN on `noetl.event_outbox_notify`
    // 2. On wake (or 100ms tick), SELECT FOR UPDATE SKIP LOCKED ...
    // 3. For each row: PUBLISH to NATS NOETL_EVENTS with the row's subject
    // 4. DELETE the row (or mark published)
    // 5. Loop
}

Latency win over today's Python publisher: LISTEN/NOTIFY instead of poll-tail. When the server inserts into event_outbox inside the same txn as the business write, the post-commit trigger fires NOTIFY, and this publisher wakes within microseconds. The Python publisher polls (because asyncpg's LISTEN inside FastAPI's event loop is awkward); tokio-postgres handles LISTEN cleanly.

Scaling: 1 replica. Outbox publishing is single-writer by design (FOR UPDATE SKIP LOCKED makes it safe to run >1, but there's no throughput gain).

Metrics (per observability.md):

  • noetl_publisher_published_total (counter) — rows published
  • noetl_publisher_lag_seconds (gauge) — time between event_outbox.created_at and NATS publish ack
  • noetl_publisher_publish_duration_seconds (histogram) — per-row publish latency

Binary 3 — projector (NATS → event log)

src/bin/projector.rs consumes NOETL_EVENTS and batch-INSERTs into noetl.event.

async fn main() -> anyhow::Result<()> {
    // 1. Subscribe to NOETL_EVENTS consumer NOETL_PROJECTOR_NATS_CONSUMER
    //    (which equals the pod name — sharded)
    // 2. Accumulate N events or wait T ms, whichever first
    // 3. INSERT INTO noetl.event VALUES (...), (...), ...
    //    with ON CONFLICT (event_id) DO NOTHING (idempotent)
    // 4. ACK the batch
    // 5. Loop
}

Sharding model preserves today's Python projector behaviour: one StatefulSet, each pod's consumer is named for the pod (e.g. noetl-projector-0), each consumer subscribes to a filter subject that hashes by execution_id. Same NATS supercluster manifests in noetl/ops/ci/manifests/nats-supercluster/.

Why this stays compiled (not a plug-in): Throughput. Peak load is thousands of events per second; WASM's 2-5× overhead and the cross-boundary memory copies for each row payload would make this the bottleneck.

Metrics:

  • noetl_projector_inserted_total (counter)
  • noetl_projector_batch_size (histogram)
  • noetl_projector_lag_seconds (gauge) — NATS-consumer-side lag
  • noetl_projector_insert_duration_seconds (histogram)

Binary 4 — system_pool (wasmtime host + dispatch)

src/bin/system_pool.rs is a NATS worker — same pull-claim-execute shape as noetl-worker-rust — but its dispatch target is a WASM module loaded from the catalog rather than a tool kind from noetl-tools.

async fn main() -> anyhow::Result<()> {
    let engine = wasmtime::Engine::default();
    let module_cache = ModuleCache::new(&engine); // (path, version) -> Module

    loop {
        let cmd = claim_next_command().await?;          // from NOETL_COMMANDS
        let playbook = fetch_catalog_entry(cmd.path).await?;
        let module = module_cache.get_or_compile(&playbook).await?;
        let store = wasmtime::Store::new(&engine, /* host capabilities */);
        let instance = linker.instantiate(&mut store, &module).await?;
        let result = instance.get_func("execute")
            .unwrap()
            .call_async(&mut store, ...).await?;
        write_result(cmd, result).await?;
    }
}

Capability surface (granted to system WASM modules via wasmtime::Linker):

  • host_put_event(envelope) — emit an event (server boundary)
  • host_get_credential(name) — read from keychain
  • host_query_pg(sql, params) — execute SQL (logged + audited)
  • host_read_event_log(execution_id, after_id) — read events
  • host_mutate_catalog(path, yaml) — register / update catalog entry
  • host_system_call(name, args) — extensible kernel-like surface

Tenant-supplied WASM modules (e.g. acme/system/auth_with_saml) run with a restricted linker that omits host_mutate_catalog, host_read_event_log (except own), and host_system_call.

Module cache invalidation: keyed on (path, version, digest). Catalog version bump → cache miss → recompile → next claim picks up the new module. Hot reload without process restart.

Helm deployment shape

# Single ConfigMap shared by all four deployments
apiVersion: v1
kind: ConfigMap
metadata:
  name: noetl-server-config
data:
  POSTGRES_HOST: postgres.postgres.svc.cluster.local
  NATS_URL: nats://nats.nats.svc.cluster.local:4222
  ...
---
# Deployment 1 — HTTP server
apiVersion: apps/v1
kind: Deployment
metadata:
  name: noetl-server
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: server
        image: ghcr.io/noetl/server:<v>
        args: ["--mode=server"]
        envFrom:
        - configMapRef: { name: noetl-server-config }
---
# Deployment 2 — outbox publisher
apiVersion: apps/v1
kind: Deployment
metadata:
  name: noetl-outbox-publisher
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: publisher
        image: ghcr.io/noetl/server:<v>   # same image
        args: ["--mode=publisher"]
---
# Deployment 3 — projector (StatefulSet for stable pod names → consumer names)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: noetl-projector
spec:
  serviceName: noetl-projector
  replicas: 1    # KEDA scales N
  template:
    spec:
      containers:
      - name: projector
        image: ghcr.io/noetl/server:<v>
        args: ["--mode=projector"]
        env:
        - name: NOETL_PROJECTOR_SHARD_ID
          valueFrom: { fieldRef: { fieldPath: metadata.name } }
---
# Deployment 4 — system worker pool (KEDA-scaled on NATS lag)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: noetl-worker-system-pool
spec:
  replicas: 1   # KEDA scales 1-N
  template:
    spec:
      containers:
      - name: system-pool
        image: ghcr.io/noetl/server:<v>
        args: ["--mode=system"]
        env:
        - name: NATS_CONSUMER
          value: noetl_worker_pool_system
        - name: NATS_FILTER_SUBJECT
          value: noetl.commands.system.>

Operations team owns one image lifecycle, one chart, one set of ConfigMaps. See the noetl-ops wiki — system worker pool deploy page for the full manifest reference.

Sequencing

Smallest surface first to amortise the shared-library scaffolding:

  1. publisher (smallest LoC, narrowest surface). Adds the shared library skeleton.
  2. projector (reuses the shared envelope + NATS code).
  3. server (largest; everything else is now proven).
  4. system_pool (depends on server for catalog reads).

Steps 1 + 2 drop two of the three Python pods even before server ships — useful intermediate checkpoint.

Step 4 unblocks the plug-in ring per the ADR and noetl/ai-meta#46.

Related

Clone this wiki locally