-
Notifications
You must be signed in to change notification settings - Fork 0
Umbrella Container Tool Callback
ai-task: noetl/ai-meta#43 · Opened: 2026-06-02 · Last update: 2026-06-07 · Status: Design settled (Option B chosen); sub-issue tree opened across noetl/ops + noetl/server + noetl/tools + noetl/noetl + noetl/e2e · Parent umbrella: Rust Worker Migration (specifically R-3 Phase C-2)
Design + implement the callback pattern that lets the Rust worker dispatch a container tool kind as a Kubernetes Job and resume the playbook when the Job's container completes — without holding a worker slot for the duration of the container run.
This is the canonical instance of the Callback / hook rule from execution-model.md:
A block must not hold a worker slot waiting for an external operation that takes more than a few seconds.
The Python noetl-worker has tool: kind: container
(noetl/tools/container/executor.py,
1146 LoC across 4 files) that creates a K8s Job + waits for completion
via a polling loop:
def _wait_for_job_completion(batch_api, namespace, job_name, timeout_seconds):
deadline = time.time() + timeout_seconds
while time.time() < deadline:
job = batch_api.read_namespaced_job_status(name=job_name, namespace=namespace)
# ... check status ...
time.sleep(POLL_INTERVAL_SECONDS)A K8s Job can run for minutes or hours; the worker slot is blocked for the full duration. Porting verbatim to Rust would propagate the anti-pattern.
Settled 2026-06-07. Three options were considered (see noetl/ai-meta#43 body for full rationale):
- Option A (rejected) — sidecar / final command in the Job POSTs completion to noetl-server. Brittle to Job failure modes (the callback might never fire if the Job crashes before reaching the final command).
-
Option B (chosen) — separate
noetl-k8s-watcherdeployment uses the K8s watch API to observe Job state transitions across thenoetlnamespace; on terminal-state transition it POSTs the resume event to noetl-server. Handles all Job termination modes (success, failure, eviction); no Job-spec changes required. -
Option C (rejected) — K8s 1.31+
successPolicy/ built-in event watchers hooked into noetl-server directly. Requires K8s version pin + adds K8s API surface to the server; deferred.
The watcher is small (~200 LoC Rust); RBAC + K8s watch are
well-trodden territory. The Python container tool keeps the
slot-holding poll loop as the "legacy" path until both worker
pools dispatch via the new callback shape, then it gets deleted
in a follow-up cleanup.
Six concrete acceptance criteria, mapped to per-repo sub-issues:
| # | Sub-issue | Repo | Round shape |
|---|---|---|---|
| 1 |
noetl-k8s-watcher deployment manifest + RBAC + dev kind validation (noetl/ops#166) |
noetl/ops |
Round 1 — small Rust binary or shell-wrapped kubectl get --watch; manifests under ci/manifests/k8s-watcher/; cluster-scoped RBAC for Jobs/watch
|
| 2 |
Server-side callback handler at POST /api/internal/container-callback/<execution_id>/<step> consuming the watcher's POST + emitting the resume event (noetl/server#140) |
noetl/server |
Round 2 — single new handler + sub-route + parser for the standard Job-terminal-state payload (success / failure / oomkilled / image-pull-back-off / timeout) |
| 3 |
Tool::Container in noetl-tools that creates the Job (kube client) + labels it noetl.execution-id=<id> + noetl.step-name=<name> + returns immediately; worker slot frees on the create-Job RPC return (noetl/tools#36) |
noetl/tools |
Round 3 — new tool kind alongside result_fetch/artifact; depends on existing keychain / catalog resolution for image refs |
| 4 |
Python tools/container/executor.py updated to follow the same callback shape — drops _wait_for_job_completion, keeps the create-Job side, returns immediately |
noetl/noetl |
Round 4 — Python work; off-limits per the Rust-only standing direction. Defer this round; track separately under a Python-revival umbrella if/when needed. The Rust path (rounds 1–3) is sufficient for new playbooks |
| 5 |
E2E kind validation: a playbook with a kind: container step runs on the Rust worker; Job creation completes within a few seconds; the resume event fires when the Job terminates (noetl/e2e#29) |
noetl/e2e |
Round 5 — kind_validate_container_callback.sh rig; trivial sleep + echo Job; validates both happy path + an OOMKilled failure mode |
| 6 |
Wiki updates on noetl-tools (Container tool kind page), noetl-server (callback endpoint page + deployment-specification env vars if any), noetl-ops (k8s-watcher deployment page) |
wikis | Lands per-round with the matching code change set (Rule 0a / Rule 2) |
Round 1 (ops watcher) and Round 3 (tools tool kind) can ship in either order — they communicate through the server's callback handler in Round 2. Round 2 should land first so the watcher has somewhere to POST. Recommended:
- Round 2 — server callback endpoint (handler returns 202 even if no execution is in flight; just emits the resume event when one matches).
-
Round 1 — k8s-watcher deployment + RBAC, configured to
POST to the Round-2 endpoint. Kind-validate by manually
kubectl apply-ing a Job with the labels; watcher fires; server logs the resume event. -
Round 3 — Rust
Tool::Containerthat creates labeled Jobs. Now end-to-end works. - Round 5 — e2e kind validation rig.
- Round 4 (Python) — deferred per standing Rust-only direction.
Settled. YAML:
tool:
kind: container
image: gcr.io/my-project/long-running:v1.2.3
command: ["./run.sh"]
args: ["--input", "{{ start.input_path }}"]
env:
- name: API_KEY
valueFrom: "{{ auth_alias }}" # resolves via the keychain
resources:
requests: { cpu: 500m, memory: 1Gi }
limits: { cpu: 2, memory: 4Gi }
timeout_seconds: 3600
service_account: noetl-container-job # optional; per-step overrideThe catalog's existing image-ref pinning + signing logic (used
by noetl/ops and noetl/script tool kinds today) is reused
unchanged.
The watcher POSTs the K8s-reported terminal state mapped to a
structured call.done status:
| K8s Job condition |
call.done status |
Notes |
|---|---|---|
Complete (exit 0) |
succeeded |
Standard happy path |
Failed with BackoffLimitExceeded
|
failed |
Container exited non-zero N times |
Failed with ImagePullBackOff (init never reached) |
failed_image_pull |
Distinguished — alert-worthy when sustained |
Pod OOMKilled
|
failed_oom |
Memory limit too low — playbook author actionable |
Pod Evicted (node lost) |
failed_node_lost |
Transient; orchestrator may retry |
Job.spec.activeDeadlineSeconds exceeded |
failed_timeout |
Distinguishes from per-step retry timeout |
Each status fingerprint flows into the Job's call.done event;
playbook authors can branch on the specific value.
| Date | Event |
|---|---|
| 2026-06-02 | Issue filed as part of R-3 phase C-2 (after #42 closed with the agent-tool-kind routing decision). |
| 2026-06-07 | Design settled — Option B (external watcher) chosen; sub-issue tree mapped across noetl/ops + noetl/server + noetl/tools + noetl/noetl + noetl/e2e. Round ordering documented. Failure-mode taxonomy specified. Python round (4) deferred per Rust-only standing direction. |
| 2026-06-07 |
Round 2 shipped — server#141 (closes server#140; v2.48.0). POST /api/internal/container-callback/{execution_id}/{step} consumes watcher payloads, checks staleness via a single indexed SELECT on noetl.event, emits call.done on match (or bumps noetl_container_callback_stale_total{state} + returns 202 on stale). 6 TerminalState variants survive in meta.terminal_state. 7 new unit tests; lib 487/0. Unblocks Round 1 (watcher Deployment) + Round 3 (Tool::Container). |
| 2026-06-07 |
Round 1 shipped — ops#167 (closes ops#166; ops@8892043). noetl-k8s-watcher Deployment + RBAC + shell watcher script in ci/manifests/k8s-watcher/. 5 files (527 lines). Shell MVP — kubectl get jobs --watch -o json piped through jq + curl; a pure-Rust binary is a clean follow-up. Cluster-scoped read-only RBAC (Jobs/Pods get,list,watch). Single-replica Deployment, Recreate strategy. jq classifier maps Job conditions to the six TerminalState enum variants. 3× retry with backoff on 5xx/transport; never on 4xx. In-memory dedup by Job UID. Authorization: Bearer from the existing noetl-internal-api-token Secret. kubectl kustomize renders 327 lines of valid YAML; sh -n watcher.sh clean; jq classification dry-run resolves Complete → succeeded. Rounds 1 + 2 chain can be kind-validated end-to-end by manually kubectl apply-ing a labeled Job. |
Rounds 1, 2, 4 done (Round 4 = the closed Python parity round that's deferred per the Rust-only standing direction). Remaining Rust rounds:
-
Ship Round 2 (server callback endpoint)— DONE (server#141, v2.48.0). -
Ship Round 1 (k8s-watcher Deployment + RBAC + dev kind validation)— DONE (ops#167, commit8892043). - Ship Round 3 (tools#36)
— Rust
Tool::Containerthat creates labeled Jobs + introduces thePendingCallbackmarker so the worker doesn't emitcall.donefrom the tool (the watcher's callback handler will). - Ship Round 5 (e2e#29) — kind validation rig with happy-path + OOMKilled fixtures. Closes the umbrella once green.
- Umbrella: Rust Worker Migration — parent.
- Execution model rule — the design constraint.
- noetl-worker wiki: https://github.com/noetl/worker/wiki.
- noetl-tools wiki: https://github.com/noetl/tools/wiki.
- Home — overview
- Repo Map
- Releases
- Sessions Log
- Domain-Specific SLM Platform (#139) — RFC (design); travel#63 is the reference impl
- Secrets Wallet (#61) — SECURITY (design)
- Rust Server Port (#49) — PRIMARY
- Decoupled Context + Event Chain (#115) — RFC (design), reframes #101
- Orchestrator Scaling (#101) — reframed by #115; consume side = #115 Phase 1
- Event WAL + Derivable Storage (#104) — Round 01 (locator) PR open
- WASM Plug-in Compilation (#105) — system-pool plug-in hot-reload (ADR Phase 4)
- System Pool Design (#46) — PRIMARY
- Regression Baseline Migration (#98) — e2e
- Subscription / Listener Tool (#90) — RFC
- Container Tool Callback (#43)
- Rust Worker Parity Gaps (#47 · #48)
- Event Envelope Reconciliation (#51 in TaskList)
- Cursor Loop Mode (#100) — server v3.8.0 + tools v3.10.1, 2026-06-15
- Transfer Tool Credentials (#99) — tools v3.10.0 + worker v5.22.0, 2026-06-14
- Explicit Input Binding (#77) — v3.0.0 shipped 2026-06-09
- Rust Worker Migration (#30)
- Python Services → Rust (#45)
- Issue Tracking
- Wiki Convention
- Handoffs
- Deployment Validation
- Execution Model
- Data Access Boundary
- Observability
- noetl/noetl wiki — app + DSL
- noetl/server wiki — Rust control plane
- noetl/worker wiki — Rust pull worker
- noetl/tools wiki — tool registry crate
- noetl/cli wiki — CLI + local mode
- noetl/gateway wiki — gatekeeper
- noetl/ops wiki — Helm + manifests
- noetl/travel wiki — domain SPA reference
- Docs site — engineer-facing architecture