Umbrella Rust Server Port
ai-task: noetl/ai-meta#49
· Opened: 2026-06-02
· Last update: 2026-06-02
· Priority: PRIMARY (interlocked with Umbrella: System Pool Design — #46)
· Target crate: noetl/server (currently v2.0.1, early skeleton)
· Source: noetl/noetl/server/ Python FastAPI
Port the Python noetl-server (FastAPI / uvicorn, ~15-20k LoC, 87 route decorators) to the existing Rust noetl/server crate. Full HTTP API parity so the gateway + workers + CLI don't notice a swap. Cutover via strangler-fig at the ingress layer.
┌──────────────────────┐ ┌──────────────────────────┐
│ noetl/ai-meta #49 │ │ noetl/ai-meta #46 │
│ Rust server port │ │ System pool playbooks │
│ (THIS UMBRELLA) │ │ │
│ │ │ │
│ Phase A: reads │ │ │
│ Phase B: writes │ │ │
│ Phase C: internal ├─────────►│ Phase 2: │
│ endpoints │ unblocks │ system/outbox_publisher│
│ │ │ system/projector │
│ Phase D: engine │ │ │
│ Phase E: SSE etc │ │ Phase 1.b: deployment │
│ Phase F: shards │ │ │
└──────────┬───────────┘ └──────────┬───────────────┘
│ │
│ produces │ consumes
▼ ▼
┌──────────────────────────────────────────────────────────────┐
│ HTTP API surface (server is the data gatekeeper) │
│ │
│ POST /api/internal/outbox/claim │
│ POST /api/internal/outbox/mark-published │
│ POST /api/internal/outbox/mark-failed │
│ GET /api/internal/outbox/pending-count │
│ POST /api/internal/events/project │
└──────────────────────────────────────────────────────────────┘
Phase C lands on both Python (noetl/noetl) and Rust (noetl/server) in parallel — system playbooks call HTTP, not DB, so they don't care which server is responding. This is the single PR that unblocks both umbrellas' next phases.
Three architectural decisions in the same 2026-06-02 session converge to make this a top priority:
- System worker pool requires `/api/internal/*` endpoints (per data-access-boundary rule): workers don't touch `noetl.*` direct; they call the server. Those endpoints don't exist yet in Python OR Rust.
- Sharding readiness: the platform's path to multi-region / multi-tenant scale runs through a sharded server. Re-engineering the Python FastAPI server for sharding is comparable cost to a Rust port that does sharding correctly from day one.
- Python footprint reduction: after the publisher + projector retire via #46 system playbooks, the FastAPI server is the largest remaining Python service. Porting it closes the loop on the runtime hot path.
| Rule | Implication for this port |
|---|---|
| Data access boundary | Rust server is the only thing that talks to noetl.* directly; new /api/internal/* endpoints land here for the system pool |
| Execution model | Server stays the gatekeeper for data + the orchestrator of state machines; doesn't move to playbooks |
| Observability | Every endpoint ships with span + metric + execution_id correlation in the same change set |
| Deployment validation | Kind-first per endpoint port; production cutover via ingress flip on prod-shaped env |
| API contract preserved | Rust request/response shapes are byte-identical to Python's during migration; no drift; no "new and improved" during port |
| Sharding-first | Every endpoint that touches per-execution state derives execution_id; routing layer built in from day one |
| Strangler-fig cutover | Endpoint-by-endpoint flip via ingress, never big-bang |
Routes that just read DB state. Lowest risk; biggest test of the read path. Many already scaffolded in repos/server/src/handlers/.
- `GET /api/health`, `/api/pool/status` (already wired)
- `GET /api/catalog/{path}/ui_schema` (already wired)
- `POST /api/catalog/list` (already wired)
- `GET /api/catalog/resource`
- `GET /api/executions/{id}`
- `GET /api/executions/{id}/events`
- `GET /api/events/{id}/result`
- `GET /api/runtime/contract`
- `GET /api/variables/...`
- `GET /api/credentials/...`
- `GET /api/keychain/...`
Acceptance: every read endpoint returns byte-identical JSON to the Python version against the same DB state. Diff harness in kind validation.
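The byte-identical acceptance check can be illustrated with a small comparison helper. This is a minimal sketch, not the actual diff harness: `BodyDiff` and `compare_bodies` are hypothetical names, and it assumes the harness has already fetched the raw response bodies from both servers against the same DB state.

```rust
/// Outcome of comparing two raw HTTP response bodies byte for byte.
/// (Hypothetical type for illustration; not part of noetl/server.)
#[derive(Debug, PartialEq)]
pub enum BodyDiff {
    /// Both bodies contain exactly the same bytes.
    Identical,
    /// The bodies differ; `offset` is the first differing byte, or the
    /// length of the shorter body when one is a strict prefix of the other.
    DiffersAt {
        offset: usize,
        python_len: usize,
        rust_len: usize,
    },
}

/// Compare the Python server's body with the Rust server's body.
/// No JSON normalisation is applied: key order and whitespace count as drift.
pub fn compare_bodies(python_body: &[u8], rust_body: &[u8]) -> BodyDiff {
    if python_body == rust_body {
        return BodyDiff::Identical;
    }
    let offset = python_body
        .iter()
        .zip(rust_body.iter())
        .position(|(a, b)| a != b)
        .unwrap_or_else(|| python_body.len().min(rust_body.len()));
    BodyDiff::DiffersAt {
        offset,
        python_len: python_body.len(),
        rust_len: rust_body.len(),
    }
}
```

Comparing raw bytes rather than parsed JSON is deliberately strict: a serializer that emits `{"a": 1}` where Python emits `{"a":1}` is reported as drift at offset 5, which matches the "no drift" contract rule above.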
Endpoints the Rust worker uses to emit results. Must be solid.
- `POST /api/events` (worker's `put_result`): already wired; verify under load
- `POST /api/catalog/register`: already wired
- `POST /api/credentials` (encrypted-at-rest write)
- `POST /api/keychain`
- `POST /api/runtime/heartbeat`
- `POST /api/runtime/register`
Acceptance: Rust worker pointed at Rust server completes a full playbook execution against kind with event log identical to Python-server-pointed run.
NEW endpoints — Python doesn't have them today. Lands on BOTH Python AND Rust so the system pool can deploy against either during migration.
Python side ✅ landed + kind-validated 2026-06-02 via #659 (v4.10.0) + #660 (v4.10.1 with kind-validation fixes).
Rust side ✅ landed + kind-validated 2026-06-02 via #12 (v2.1.0) + #13 (v2.1.1 with schema fix) + #14 (axum 0.8 route-syntax fix — without this v2.1.1 panics in Router::route() at startup before binding the HTTP listener) + noetl/ops#147 (kind deployment manifest).
All 11 assertions of automation/development/validate-internal-api.sh pass against the Rust server (same harness that validated Python; identical pass rate).
| Endpoint | Python | Rust |
|---|---|---|
| `POST /api/internal/outbox/claim?limit=N` | ✅ kind-validated | ✅ kind-validated |
| `POST /api/internal/outbox/mark-published` | ✅ kind-validated | ✅ kind-validated |
| `POST /api/internal/outbox/mark-failed` | ✅ kind-validated (backoff confirmed) | ✅ kind-validated (1s backoff for attempts=1) |
| `GET /api/internal/outbox/pending-count` (KEDA scaler source) | ✅ kind-validated | ✅ kind-validated |
| `POST /api/internal/events/project` | ✅ kind-validated (fresh + idempotent) | ✅ kind-validated (fresh + idempotent) |
| ServiceAccount bearer-token auth gate | ✅ kind-validated (403 + 503 paths) | ✅ kind-validated (403 on no-auth / wrong-token / Basic-scheme) |
| Span (`tracing::instrument`) + execution_id per endpoint | ✅ tracing spans (Prometheus metrics deferred) |
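The three 403 cases validated for the auth gate (no header, wrong token, Basic scheme) can be sketched as a pure function over the `Authorization` header value. This is an illustrative sketch only: `AuthDecision`, `check_bearer`, and `bytes_equal_no_early_exit` are hypothetical names, the way the expected ServiceAccount token is obtained is out of scope, and the 503 path listed for Python is not modelled because the page does not say what triggers it.

```rust
/// Decision of the internal-API auth gate (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum AuthDecision {
    Allow,
    /// Maps to HTTP 403 in the table above.
    Forbidden,
}

/// Check an `Authorization` header value against the expected bearer token.
/// `authorization` is `None` when the request carries no such header.
pub fn check_bearer(authorization: Option<&str>, expected_token: &str) -> AuthDecision {
    // Assumption: an empty expected token never authorises anything.
    if expected_token.is_empty() {
        return AuthDecision::Forbidden;
    }
    let Some(header) = authorization else {
        return AuthDecision::Forbidden; // no-auth case
    };
    let Some((scheme, token)) = header.split_once(' ') else {
        return AuthDecision::Forbidden; // malformed header
    };
    if !scheme.eq_ignore_ascii_case("Bearer") {
        return AuthDecision::Forbidden; // e.g. the Basic-scheme case
    }
    if bytes_equal_no_early_exit(token.trim().as_bytes(), expected_token.as_bytes()) {
        AuthDecision::Allow
    } else {
        AuthDecision::Forbidden // wrong-token case
    }
}

/// Compare two byte slices without returning at the first mismatch.
/// This avoids an obvious early exit; it is not a vetted constant-time primitive.
fn bytes_equal_no_early_exit(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}
```

Keeping the gate a pure function makes the three rejection paths easy to unit-test independently of the HTTP framework.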
Three real-world bugs found + fixed during kind validation (see Sessions Log 2026-06-02 (late evening)):
- Python router prefix double-prefix: `/api/internal` → `/internal` (Python-only).
- Python dict-row tuple subscript in `pending-count` (Python-only).
- `noetl.event` schema mismatch: `timestamp` column missing, NOT NULL columns absent, partitioned table doesn't support `ON CONFLICT (event_id)` (both Python + Rust).
Validation harness: noetl/ops automation/development/validate-internal-api.sh (PR awaiting merge).
Acceptance: system worker pool on kind runs system/outbox_publisher end-to-end against either the Python or Rust server's internal endpoints. Python side validated; full pipeline lands when #46 Phase 1.b deploys the system pool.
Python's catalog → command-generation → state-machine logic. ~5-8k LoC Python in repos/noetl/noetl/server/. Rust skeleton at repos/server/src/engine/ (~1,967 LoC).
- `POST /api/execute` full port: persists initial command, publishes NATS notification, snowflake-generated event_id (noetl/server#27, #28, #29)
- State-machine orchestrator wired into event ingest: `trigger_orchestrator` loads events, calls `WorkflowOrchestrator::evaluate`, persists generated events + commands, emits terminal `playbook.completed` / `playbook.failed` (noetl/server#31, Phase D R2)
- `persist_engine_command` extracted as shared helper for `/api/execute` and orchestrator paths
- `noetl-executor` crate already feeds `WorkflowOrchestrator` (Phase D R1 survey closed the gap)
Acceptance: full execution lifecycle handled by Rust server, replayable against the same event log as Python. Kind validation tests/fixtures/r2_two_step 2-step linear playbook runs end-to-end, terminates at playbook.completed (Phase D R2, ai-meta@TBD). Phase D materially complete — remaining work is conditional/iterator/parallel control-flow coverage (future rounds), not the core engine wiring.
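The page only states that `event_id` is snowflake-generated; it does not give the bit layout. The sketch below assumes the classic layout (millisecond timestamp, then 10 bits of node id, then 12 bits of per-millisecond sequence) purely for illustration. `Snowflake`, `NODE_BITS`, and `SEQ_BITS` are hypothetical names, and the caller supplies the clock so the behaviour stays deterministic.

```rust
/// Assumed layout (not confirmed by the source): timestamp | node | sequence.
const NODE_BITS: u32 = 10;
const SEQ_BITS: u32 = 12;
const MAX_SEQ: u64 = (1 << SEQ_BITS) - 1;

/// Minimal snowflake-style id generator (hypothetical, single-threaded).
pub struct Snowflake {
    node_id: u64,
    last_ms: u64,
    seq: u64,
}

impl Snowflake {
    pub fn new(node_id: u64) -> Self {
        assert!(node_id < (1 << NODE_BITS), "node_id must fit in NODE_BITS");
        Self { node_id, last_ms: 0, seq: 0 }
    }

    /// Produce the next id for the given clock reading in milliseconds.
    /// Returns `None` if the clock moved backwards or the per-millisecond
    /// sequence is exhausted; the caller decides whether to wait and retry.
    pub fn next_id(&mut self, now_ms: u64) -> Option<u64> {
        if now_ms < self.last_ms {
            return None;
        }
        if now_ms == self.last_ms {
            if self.seq == MAX_SEQ {
                return None;
            }
            self.seq += 1;
        } else {
            self.last_ms = now_ms;
            self.seq = 0;
        }
        Some((now_ms << (NODE_BITS + SEQ_BITS)) | (self.node_id << SEQ_BITS) | self.seq)
    }
}
```

Ids from one generator increase strictly, so sorting by id preserves emission order. Under sharding, each shard would need a distinct `node_id` to keep ids unique across shards.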
- `GET /api/executions/{id}/events/stream`: SSE for the gateway (axum has SSE support built-in)
- Remaining ~20-30 Python `@router` routes triaged; port the ones with callers; drop the ones without
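For parity testing of the SSE stream it helps to see what one event looks like on the wire. The sketch below formats a single frame using the standard `id:`, `event:`, and `data:` fields of the `text/event-stream` format. `sse_frame` is a hypothetical helper; the field values the real stream emits are not specified on this page, and in the Rust server axum's built-in SSE support would normally do this framing.

```rust
/// Format one server-sent event as it appears on the wire.
/// Multi-line payloads are split into one `data:` line per line,
/// and the frame is terminated by a blank line.
/// Assumes `event` and `id` contain no newline characters.
pub fn sse_frame(event: &str, id: Option<&str>, data: &str) -> String {
    let mut out = String::new();
    if let Some(id) = id {
        out.push_str("id: ");
        out.push_str(id);
        out.push('\n');
    }
    out.push_str("event: ");
    out.push_str(event);
    out.push('\n');
    for line in data.split('\n') {
        out.push_str("data: ");
        out.push_str(line);
        out.push('\n');
    }
    out.push('\n'); // blank line ends the event
    out
}
```

A diff harness could capture frames from both servers for the same execution and compare them with the same byte-level check used for the read endpoints.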
- `shard_id = hash(execution_id) % N`
- StatefulSet deployment (replaces today's Deployment)
- Inter-shard coordination (catalog + credentials shared; executions sharded)
- Gateway/load-balancer extension to route by `execution_id` header
- Migration path: single-replica StatefulSet → scale to N → cutover
- Helm chart values: `server.replicas` → `server.shards`
- Production cutover: flip ingress; Python server retires
Client Gateway Server-Shard-0 Server-Shard-1 Postgres
│ │ │ │ │
│ POST /api/execute │ │ │
│ (no execution_id yet) │ │ │
├───────────────►│ │ │ │
│ │ pick any shard │ │ │
│ │ (load-balance) │ │ │
│ ├──────────────────►│ │ │
│ │ │ generate eid │ │
│ │ │ (snowflake) │ │
│ │ │ INSERT execution│ │
│ │ ├─────────────────┼───────────────►│
│ │ ◄─────────────────┤ │ │
│ ◄──────────────┤ {execution_id: 12345} │ │
│ │ │ │
│ GET /api/executions/12345 │ │
│ X-Execution-ID: 12345 │ │
├───────────────►│ │ │
│ │ shard = 12345 % 2 = 1 │ │
│ ├─────────────────────────────────────►│ │
│ │ │ SELECT state │
│ │ ├────────────────►│
│ │ ◄───────────────────────────────────┤ │
│ ◄──────────────┤ {status: RUNNING, ...} │ │
│ │
│ Phase C internal endpoints (system pool calls) │
│ │
│ POST /api/internal/outbox/claim │
│ X-Execution-ID: not required (claim is shard-aware via worker pod) │
├───────────────►│ │
│ │ outbox claim fans out across all shards via │
│ │ per-shard `outbox-publisher-<shard>` subscription │
│ │ (each system pool worker pod owns a shard slice │
│ │ like the projector does today) │
▼ ▼ ▼
Sharding rules:
| Resource | Strategy |
|---|---|
| `noetl.execution` / `noetl.event` / `noetl.command` / `noetl.outbox` | Per-execution_id sharding (write to owning shard) |
| `noetl.catalog` / `noetl.credential` / `noetl.keychain` | Shared (read from any shard, write to designated leader shard) |
| `noetl.runtime` (worker heartbeats) | Per-pool sharding (worker_id hash) |
Migration to sharded mode is N=1 first (no functional change), then scale to N=3 with `execution_id % N` routing.
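The gateway-side routing decision from the sequence diagram can be sketched as follows. This is a hedged illustration, not the gateway's implementation: `Target`, `shard_for`, and `route` are hypothetical names, the header value is assumed to be a decimal integer, and plain modulo is used as in the diagram (`12345 % 2 = 1`). Whether a hash is applied before the modulo is left open here.

```rust
/// Where the gateway should send a request (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Target {
    /// No execution id yet (e.g. `POST /api/execute`): any shard may serve it.
    AnyShard,
    /// The shard that owns this execution's state.
    Shard(u32),
    /// The `X-Execution-ID` header was present but not a valid integer.
    BadRequest,
}

/// Map an execution id to its owning shard with plain modulo.
pub fn shard_for(execution_id: u64, shards: u32) -> u32 {
    assert!(shards > 0, "at least one shard is required");
    (execution_id % u64::from(shards)) as u32
}

/// Decide the target from the optional `X-Execution-ID` header value.
pub fn route(x_execution_id: Option<&str>, shards: u32) -> Target {
    match x_execution_id {
        None => Target::AnyShard,
        Some(raw) => match raw.trim().parse::<u64>() {
            Ok(id) => Target::Shard(shard_for(id, shards)),
            Err(_) => Target::BadRequest,
        },
    }
}
```

With `shards = 1` every execution maps to shard 0, which is why the N=1 step introduces no functional change. Note that changing N remaps existing executions under plain modulo, so scaling from 1 to 3 needs a story for in-flight executions.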
- System worker pool runtime + system playbooks → Umbrella: System Pool Design (#46). This umbrella provides the endpoints #46 needs; #46 builds the deployment + playbooks.
- Rust worker tool-kind gaps → Umbrella: Rust Worker Parity Gaps (#47 + #48). Orthogonal.
- Container tool kind callback → Umbrella: Container Tool Callback (#43). Orthogonal.
- DSL parser / `noetl-tools` / `noetl-executor`: separate Rust crates; not part of the server port.
Per the issue-tracking convention, file these against noetl/server when work begins:
- Phase A read-endpoint parity audit + diff harness — surfaces drift between Rust and Python responses for already-wired endpoints.
- Phase C internal endpoints (LANDS FIRST, even before Phase A finishes): `/api/internal/outbox/*` + `/api/internal/events/project` on BOTH Python and Rust. Unblocks #46 Phase 2.
- Event envelope crate (EE-4 from TaskList #51): `noetl-events` shared crate that worker + executor + server depend on.
| Date | Event |
|---|---|
| 2026-06-02 | Umbrella filed during the architecture-pivot session. Priority PRIMARY alongside #46. |
| 2026-06-02 | Cross-linked from #46 — the two umbrellas interlock (#49 provides API endpoints #46's playbooks consume). |
| 2026-06-02 | No code work started. Phase 1 plan + first three sub-issues defined; ready to pick up. |
| 2026-06-02 | Phase A read-endpoint parity harness landed (noetl/server#18 + #20). Phase C internal endpoints wired (#17, #19, #25). v2.2.0 shipped. |
| 2026-06-03 (morning) | Phase B write-boundary parity complete — Prometheus surface + all 6 write endpoints instrumented (noetl/server#21, #23). Rust worker → Rust server e2e validated; 60k/60k load smoke at 920 req/s, p99 164ms. v2.4.0 shipped. |
| 2026-06-03 (afternoon) | `/api/execute` full port (noetl/server#27) + `args:null` fix (#28) + result-envelope shape compliance (#29 → noetl/server#30). Phase D R1 survey closed; template resolution already wired via existing `TemplateRenderer.render_value`. |
| 2026-06-03 (evening) | Phase E SSE port (GET /api/executions/{id}/events/stream) shipped. Phase D R2 orchestrator wired into event ingest (noetl/server#31). Kind validation: 2-step linear playbook (tests/fixtures/r2_two_step) runs end-to-end on Rust server, event log terminates at playbook.completed. v2.4.3 + 2 commits. |
- Phase D R3: multi-step playbook coverage beyond linear: conditionals (`when` / `unless`), iterators (`for_each`), parallel (`parallel`). Each gets a focused round with kind-val on a fixture playbook.
- Dual-worker NATS subject race: both Rust + Python workers consume the shared subject in kind today, so each step shows two claim cycles in the event log. Cosmetic for the orchestrator wiring (terminal lifecycle converges) but worth a routing follow-up that switches the Rust worker to a Rust-only subject prefix or queues the Python worker via a `system-only` filter.
- EE-4 envelope crate: extract shared event types into `noetl-events` crate; worker + executor + server depend. Still queued behind Phase D coverage.
- Phase F sharding: design + cutover (next major phase after Phase D coverage rounds finish).
- Umbrella: System Pool Design (#46) — interlocks with this umbrella.
- Umbrella: Container Tool Callback (#43) — orthogonal.
- Umbrella: Rust Worker Parity Gaps (#47, #48) — orthogonal.
- ADR: System Worker Pool and WASM Plug-in Surface — covers the data access boundary that shapes Phase C.
- noetl-server wiki: Runtime shape — implementation-level companion.
- Data Access Boundary — the rule that shapes Phase C.