Skip to content

Umbrella Rust Server Port

Kadyapam edited this page Jun 2, 2026 · 57 revisions

Umbrella — Rust Server Port (PRIMARY)

ai-task: noetl/ai-meta#49 · Opened: 2026-06-02 · Last update: 2026-06-02 · Priority: PRIMARY (interlocked with Umbrella: System Pool Design — #46) · Target crate: noetl/server (currently v2.0.1, early skeleton) · Source: noetl/noetl/server/ Python FastAPI

Goal

Port the Python noetl-server (FastAPI / uvicorn, ~15-20k LoC, 87 route decorators) to the existing Rust noetl/server crate. Full HTTP API parity so the gateway + workers + CLI don't notice a swap. Cutover via strangler-fig at the ingress layer.

Visual — interlock with #46

   ┌──────────────────────┐          ┌──────────────────────────┐
   │  noetl/ai-meta #49   │          │   noetl/ai-meta #46      │
   │  Rust server port    │          │   System pool playbooks  │
   │  (THIS UMBRELLA)     │          │                          │
   │                      │          │                          │
   │  Phase A: reads      │          │                          │
   │  Phase B: writes     │          │                          │
   │  Phase C: internal   ├─────────►│  Phase 2:                │
   │   endpoints          │ unblocks │   system/outbox_publisher│
   │                      │          │   system/projector       │
   │  Phase D: engine     │          │                          │
   │  Phase E: SSE etc    │          │  Phase 1.b: deployment   │
   │  Phase F: shards     │          │                          │
   └──────────┬───────────┘          └──────────┬───────────────┘
              │                                 │
              │ produces                        │ consumes
              ▼                                 ▼
   ┌──────────────────────────────────────────────────────────────┐
   │ HTTP API surface (server is the data gatekeeper)             │
   │                                                              │
   │  POST /api/internal/outbox/claim                             │
   │  POST /api/internal/outbox/mark-published                    │
   │  POST /api/internal/outbox/mark-failed                       │
   │  GET  /api/internal/outbox/pending-count                     │
   │  POST /api/internal/events/project                           │
   └──────────────────────────────────────────────────────────────┘

Phase C lands on both Python (noetl/noetl) and Rust (noetl/server) in parallel — system playbooks call HTTP, not DB, so they don't care which server is responding. This is the single PR that unblocks both umbrellas' next phases.

Why now

Three architectural decisions in the same 2026-06-02 session converge to make this a top priority:

  1. System worker pool requires /api/internal/* endpoints (per data-access-boundary rule) — workers don't touch noetl.* direct; they call the server. Those endpoints don't exist yet in Python OR Rust.
  2. Sharding readiness — the platform's path to multi-region / multi-tenant scale runs through a sharded server. Re-engineering the Python FastAPI server for sharding is comparable cost to a Rust port that does sharding correctly from day one.
  3. Python footprint reduction — after the publisher + projector retire via #46 system playbooks, the FastAPI server is the largest remaining Python service. Porting it closes the loop on the runtime hot path.

Constraints (from latest architectural decisions)

Rule Implication for this port
Data access boundary Rust server is the only thing that talks to noetl.* directly; new /api/internal/* endpoints land here for the system pool
Execution model Server stays the gatekeeper for data + the orchestrator of state machines; doesn't move to playbooks
Observability Every endpoint ships with span + metric + execution_id correlation in the same change set
Deployment validation Kind-first per endpoint port; production cutover via ingress flip on prod-shaped env
API contract preserved Rust request/response shapes are byte-identical to Python's during migration; no drift; no "new and improved" during port
Sharding-first Every endpoint that touches per-execution state derives execution_id; routing layer built in from day one
Strangler-fig cutover Endpoint-by-endpoint flip via ingress, never big-bang

Phases

Phase A — Read endpoints (no orchestrator dependency)

Routes that just read DB state. Lowest risk; biggest test of the read path. Many already scaffolded in repos/server/src/handlers/.

  • GET /api/health, /api/pool/status (already wired)
  • GET /api/catalog/{path}/ui_schema (already wired)
  • POST /api/catalog/list (already wired)
  • GET /api/catalog/resource
  • GET /api/executions/{id}
  • GET /api/executions/{id}/events
  • GET /api/events/{id}/result
  • GET /api/runtime/contract
  • GET /api/variables/...
  • GET /api/credentials/...
  • GET /api/keychain/...

Acceptance: every read endpoint returns byte-identical JSON to the Python version against the same DB state. Diff harness in kind validation.

Phase B — Worker write boundary

Endpoints the Rust worker uses to emit results. Must be solid.

  • POST /api/events (worker's put_result) — already wired; verify under load
  • POST /api/catalog/register — already wired
  • POST /api/credentials (encrypted-at-rest write)
  • POST /api/keychain
  • POST /api/runtime/heartbeat
  • POST /api/runtime/register

Acceptance: Rust worker pointed at Rust server completes a full playbook execution against kind with event log identical to Python-server-pointed run.

Phase C — Internal endpoints for system pool (UNBLOCKS #46 Phase 2)

NEW endpoints — Python doesn't have them today. Lands on BOTH Python AND Rust so the system pool can deploy against either during migration.

Python side ✅ landed + kind-validated 2026-06-02 via #659 (v4.10.0) + #660 (v4.10.1 with kind-validation fixes).

Rust side ✅ landed 2026-06-02 via #12 (v2.1.0) + #13 (v2.1.1 with schema fix). Kind validation deferred — requires side-by-side deploy with Python.

Endpoint Python Rust
POST /api/internal/outbox/claim?limit=N ✅ kind-validated ✅ unit-tested
POST /api/internal/outbox/mark-published ✅ kind-validated ✅ unit-tested
POST /api/internal/outbox/mark-failed ✅ kind-validated (backoff confirmed) ✅ unit-tested
GET /api/internal/outbox/pending-count (KEDA scaler source) ✅ kind-validated ✅ unit-tested
POST /api/internal/events/project ✅ kind-validated (fresh + idempotent) ✅ unit-tested
ServiceAccount bearer-token auth gate ✅ kind-validated (403 + 503 paths) ✅ unit-tested
Span (tracing::instrument) + execution_id per endpoint ⚠️ partial ✅ tracing spans (Prometheus metrics deferred)

Three real-world bugs found + fixed during kind validation (see Sessions Log 2026-06-02 (late evening)):

  1. Python router prefix double-prefix — /api/internal/internal (Python-only).
  2. Python dict-row tuple subscript in pending-count (Python-only).
  3. noetl.event schema mismatch — timestamp column missing, NOT NULL columns absent, partitioned table doesn't support ON CONFLICT (event_id) (both Python + Rust).

Validation harness: noetl/ops automation/development/validate-internal-api.sh (PR awaiting merge).

Acceptance: system worker pool on kind runs system/outbox_publisher end-to-end against either the Python or Rust server's internal endpoints. Python side validated; full pipeline lands when #46 Phase 1.b deploys the system pool.

Phase D — Orchestrator engine port (the big lift)

Python's catalog → command-generation → state-machine logic. ~5-8k LoC Python in repos/noetl/noetl/server/. Rust skeleton at repos/server/src/engine/ (~1,967 LoC).

  • POST /api/execute — kick off execution (full port, not the skeleton)
  • State-machine orchestrator (_handle_event_inner family)
  • Reuse noetl-executor crate where it overlaps with CLI's local-mode runner

Acceptance: full execution lifecycle handled by Rust server, replayable against the same event log as Python.

Phase E — SSE + remaining endpoints

  • GET /api/executions/{id}/events/stream — SSE for the gateway (axum has SSE support built-in)
  • Remaining ~20-30 Python @router routes triaged; port the ones with callers; drop the ones without

Phase F — Sharding design + cutover

  • shard_id = hash(execution_id) % N
  • StatefulSet deployment (replaces today's Deployment)
  • Inter-shard coordination (catalog + credentials shared; executions sharded)
  • Gateway/load-balancer extension to route by execution_id header
  • Migration path: single-replica StatefulSet → scale to N → cutover
  • Helm chart values: server.replicasserver.shards
  • Production cutover — flip ingress; Python server retires

Sharding sequence — request flow with N shards

Client          Gateway          Server-Shard-0    Server-Shard-1     Postgres
  │                │                   │                 │                │
  │ POST /api/execute                  │                 │                │
  │ (no execution_id yet)              │                 │                │
  ├───────────────►│                   │                 │                │
  │                │ pick any shard    │                 │                │
  │                │ (load-balance)    │                 │                │
  │                ├──────────────────►│                 │                │
  │                │                   │ generate eid    │                │
  │                │                   │ (snowflake)     │                │
  │                │                   │ INSERT execution│                │
  │                │                   ├─────────────────┼───────────────►│
  │                │ ◄─────────────────┤                 │                │
  │ ◄──────────────┤ {execution_id: 12345}               │                │
  │                │                                     │                │
  │ GET /api/executions/12345                            │                │
  │ X-Execution-ID: 12345                                │                │
  ├───────────────►│                                     │                │
  │                │ shard = 12345 % 2 = 1               │                │
  │                ├─────────────────────────────────────►│                │
  │                │                                     │ SELECT state    │
  │                │                                     ├────────────────►│
  │                │ ◄───────────────────────────────────┤                │
  │ ◄──────────────┤ {status: RUNNING, ...}              │                │
  │                                                                       │
  │  Phase C internal endpoints (system pool calls)                       │
  │                                                                       │
  │  POST /api/internal/outbox/claim                                      │
  │  X-Execution-ID: not required (claim is shard-aware via worker pod)   │
  ├───────────────►│                                                      │
  │                │  outbox claim fans out across all shards via         │
  │                │  per-shard `outbox-publisher-<shard>` subscription   │
  │                │  (each system pool worker pod owns a shard slice    │
  │                │   like the projector does today)                     │
  ▼                ▼                                                      ▼

Sharding rules:

Resource Strategy
noetl.execution / noetl.event / noetl.command / noetl.outbox Per-execution_id sharding (write to owning shard)
noetl.catalog / noetl.credential / noetl.keychain Shared (read from any shard, write to designated leader shard)
noetl.runtime (worker heartbeats) Per-pool sharding (worker_id hash)

Migration to sharded mode is N=1 first (no functional change), then scale to N=3 with executionID % N routing.

Out of scope

First three sub-issues (next session opens these)

Per the issue-tracking convention, file these against noetl/server when work begins:

  1. Phase A read-endpoint parity audit + diff harness — surfaces drift between Rust and Python responses for already-wired endpoints.
  2. Phase C internal endpoints (LANDS FIRST — even before Phase A finishes) — /api/internal/outbox/* + /api/internal/events/project on BOTH Python and Rust. Unblocks #46 Phase 2.
  3. Event envelope crate (EE-4 from TaskList #51)noetl-events shared crate that worker + executor + server depend on.

Recent activity

Date Event
2026-06-02 Umbrella filed during the architecture-pivot session. Priority PRIMARY alongside #46.
2026-06-02 Cross-linked from #46 — the two umbrellas interlock (#49 provides API endpoints #46's playbooks consume).
2026-06-02 No code work started. Phase 1 plan + first three sub-issues defined; ready to pick up.

Next concrete steps

  1. File Phase C sub-issue on noetl/server/api/internal/outbox/* + /api/internal/events/project. Same endpoints land on noetl/noetl (Python) in parallel so the system pool can deploy against either.
  2. Phase A diff harnessrepos/ops/automation/development/server-parity-validation.yaml rig that hits the same endpoint on Python and Rust, diffs the responses.
  3. EE-4 envelope crate — extract shared event types into noetl-events crate; worker + executor + server depend.
  4. Phase B/C/D sub-issues filed as each phase begins.

Related

NoETL Dashboard

Active Umbrellas

Closed Umbrellas

Conventions

Per-repo wikis

Clone this wiki locally