-
Notifications
You must be signed in to change notification settings - Fork 0
Umbrella Container Tool Callback
ai-task: noetl/ai-meta#43 · Opened: 2026-06-02 · Last update: 2026-06-07 · Status: Design settled (Option B chosen); sub-issue tree opened across noetl/ops + noetl/server + noetl/tools + noetl/noetl + noetl/e2e · Parent umbrella: Rust Worker Migration (specifically R-3 Phase C-2)
Design + implement the callback pattern that lets the Rust worker dispatch a container tool kind as a Kubernetes Job and resume the playbook when the Job's container completes — without holding a worker slot for the duration of the container run.
This is the canonical instance of the Callback / hook rule from execution-model.md:
A block must not hold a worker slot waiting for an external operation that takes more than a few seconds.
The Python noetl-worker has tool: kind: container
(noetl/tools/container/executor.py,
1146 LoC across 4 files) that creates a K8s Job + waits for completion
via a polling loop:
def _wait_for_job_completion(batch_api, namespace, job_name, timeout_seconds):
deadline = time.time() + timeout_seconds
while time.time() < deadline:
job = batch_api.read_namespaced_job_status(name=job_name, namespace=namespace)
# ... check status ...
time.sleep(POLL_INTERVAL_SECONDS)A K8s Job can run for minutes or hours; the worker slot is blocked for the full duration. Porting verbatim to Rust would propagate the anti-pattern.
Settled 2026-06-07. Three options were considered (see noetl/ai-meta#43 body for full rationale):
- Option A (rejected) — sidecar / final command in the Job POSTs completion to noetl-server. Brittle to Job failure modes (the callback might never fire if the Job crashes before reaching the final command).
-
Option B (chosen) — separate
noetl-k8s-watcherdeployment uses the K8s watch API to observe Job state transitions across thenoetlnamespace; on terminal-state transition it POSTs the resume event to noetl-server. Handles all Job termination modes (success, failure, eviction); no Job-spec changes required. -
Option C (rejected) — K8s 1.31+
successPolicy/ built-in event watchers hooked into noetl-server directly. Requires K8s version pin + adds K8s API surface to the server; deferred.
The watcher is small (~200 LoC Rust); RBAC + K8s watch are
well-trodden territory. The Python container tool keeps the
slot-holding poll loop as the "legacy" path until both worker
pools dispatch via the new callback shape, then it gets deleted
in a follow-up cleanup.
Six concrete acceptance criteria, mapped to per-repo sub-issues:
| # | Sub-issue | Repo | Round shape |
|---|---|---|---|
| 1 |
noetl-k8s-watcher deployment manifest + RBAC + dev kind validation (noetl/ops#166) |
noetl/ops |
Round 1 — small Rust binary or shell-wrapped kubectl get --watch; manifests under ci/manifests/k8s-watcher/; cluster-scoped RBAC for Jobs/watch
|
| 2 |
Server-side callback handler at POST /api/internal/container-callback/<execution_id>/<step> consuming the watcher's POST + emitting the resume event (noetl/server#140) |
noetl/server |
Round 2 — single new handler + sub-route + parser for the standard Job-terminal-state payload (success / failure / oomkilled / image-pull-back-off / timeout) |
| 3 |
Tool::Container in noetl-tools that creates the Job (kube client) + labels it noetl.execution-id=<id> + noetl.step-name=<name> + returns immediately; worker slot frees on the create-Job RPC return (noetl/tools#36) |
noetl/tools |
Round 3 — new tool kind alongside result_fetch/artifact; depends on existing keychain / catalog resolution for image refs |
| 4 |
Python tools/container/executor.py updated to follow the same callback shape — drops _wait_for_job_completion, keeps the create-Job side, returns immediately |
noetl/noetl |
Round 4 — Python work; off-limits per the Rust-only standing direction. Defer this round; track separately under a Python-revival umbrella if/when needed. The Rust path (rounds 1–3) is sufficient for new playbooks |
| 5 |
E2E kind validation: a playbook with a kind: container step runs on the Rust worker; Job creation completes within a few seconds; the resume event fires when the Job terminates (noetl/e2e#29) |
noetl/e2e |
Round 5 — kind_validate_container_callback.sh rig; trivial sleep + echo Job; validates both happy path + an OOMKilled failure mode |
| 6 |
Wiki updates on noetl-tools (Container tool kind page), noetl-server (callback endpoint page + deployment-specification env vars if any), noetl-ops (k8s-watcher deployment page) |
wikis | Lands per-round with the matching code change set (Rule 0a / Rule 2) |
Round 1 (ops watcher) and Round 3 (tools tool kind) can ship in either order — they communicate through the server's callback handler in Round 2. Round 2 should land first so the watcher has somewhere to POST. Recommended:
- Round 2 — server callback endpoint (handler returns 202 even if no execution is in flight; just emits the resume event when one matches).
-
Round 1 — k8s-watcher deployment + RBAC, configured to
POST to the Round-2 endpoint. Kind-validate by manually
kubectl apply-ing a Job with the labels; watcher fires; server logs the resume event. -
Round 3 — Rust
Tool::Containerthat creates labeled Jobs. Now end-to-end works. - Round 5 — e2e kind validation rig.
- Round 4 (Python) — deferred per standing Rust-only direction.
Settled. YAML:
tool:
kind: container
image: gcr.io/my-project/long-running:v1.2.3
command: ["./run.sh"]
args: ["--input", "{{ start.input_path }}"]
env:
- name: API_KEY
valueFrom: "{{ auth_alias }}" # resolves via the keychain
resources:
requests: { cpu: 500m, memory: 1Gi }
limits: { cpu: 2, memory: 4Gi }
timeout_seconds: 3600
service_account: noetl-container-job # optional; per-step overrideThe catalog's existing image-ref pinning + signing logic (used
by noetl/ops and noetl/script tool kinds today) is reused
unchanged.
The watcher POSTs the K8s-reported terminal state mapped to a
structured call.done status:
| K8s Job condition |
call.done status |
Notes |
|---|---|---|
Complete (exit 0) |
succeeded |
Standard happy path |
Failed with BackoffLimitExceeded
|
failed |
Container exited non-zero N times |
Failed with ImagePullBackOff (init never reached) |
failed_image_pull |
Distinguished — alert-worthy when sustained |
Pod OOMKilled
|
failed_oom |
Memory limit too low — playbook author actionable |
Pod Evicted (node lost) |
failed_node_lost |
Transient; orchestrator may retry |
Job.spec.activeDeadlineSeconds exceeded |
failed_timeout |
Distinguishes from per-step retry timeout |
Each status fingerprint flows into the Job's call.done event;
playbook authors can branch on the specific value.
| Date | Event |
|---|---|
| 2026-06-02 | Issue filed as part of R-3 phase C-2 (after #42 closed with the agent-tool-kind routing decision). |
| 2026-06-07 | Design settled — Option B (external watcher) chosen; sub-issue tree mapped across noetl/ops + noetl/server + noetl/tools + noetl/noetl + noetl/e2e. Round ordering documented. Failure-mode taxonomy specified. Python round (4) deferred per Rust-only standing direction. |
| 2026-06-07 |
Round 2 shipped — server#141 (closes server#140; v2.48.0). POST /api/internal/container-callback/{execution_id}/{step} consumes watcher payloads, checks staleness via a single indexed SELECT on noetl.event, emits call.done on match (or bumps noetl_container_callback_stale_total{state} + returns 202 on stale). 6 TerminalState variants survive in meta.terminal_state. 7 new unit tests; lib 487/0. Unblocks Round 1 (watcher Deployment) + Round 3 (Tool::Container). |
| 2026-06-07 |
Round 1 shipped — ops#167 (closes ops#166; ops@8892043). noetl-k8s-watcher Deployment + RBAC + shell watcher script in ci/manifests/k8s-watcher/. 5 files (527 lines). Shell MVP — kubectl get jobs --watch -o json piped through jq + curl; a pure-Rust binary is a clean follow-up. Cluster-scoped read-only RBAC (Jobs/Pods get,list,watch). Single-replica Deployment, Recreate strategy. jq classifier maps Job conditions to the six TerminalState enum variants. 3× retry with backoff on 5xx/transport; never on 4xx. In-memory dedup by Job UID. Authorization: Bearer from the existing noetl-internal-api-token Secret. kubectl kustomize renders 327 lines of valid YAML; sh -n watcher.sh clean; jq classification dry-run resolves Complete → succeeded. Rounds 1 + 2 chain can be kind-validated end-to-end by manually kubectl apply-ing a labeled Job. |
| 2026-06-07 |
Round 3 shipped — tools#37 (closes tools#36; v2.21.0). Tool::Container creates a labeled K8s Job and returns immediately — worker slot frees as soon as api.create() returns. ContainerConfig mirrors the catalog YAML shape; default ns noetl, default backoffLimit: 0, default restartPolicy: Never; generateName: noetl-container-<slug>-<eid>- (DNS-1123-safe). Additive ToolResult.pending_callback: Option<bool> marker — set by Container to suppress the worker's own call.done emit. Worker-side adoption is a coordinated follow-up; until then the watcher's callback is treated as stale (recorded by noetl_container_callback_stale_total), which is a harmless race during the transition. 10 existing struct-literal sites backfilled with pending_callback: None. 17 new unit tests; lib 258/0. This closes the last code round. Only Round 5 remains. |
Rounds 1, 2, 3, 4 done (Round 4 = the closed Python parity round that's deferred per the Rust-only standing direction). Remaining Rust rounds:
-
Ship Round 2 (server callback endpoint)— DONE (server#141, v2.48.0). -
Ship Round 1 (k8s-watcher Deployment + RBAC + dev kind validation)— DONE (ops#167, commit8892043). -
Ship Round 3 (Rust— DONE (tools#37, v2.21.0).Tool::Containerthat creates labeled Jobs + introduces thePendingCallbackmarker) - Ship Round 5 (e2e#29) — kind validation rig with happy-path + OOMKilled fixtures. Closes the umbrella once green.
Round 3 added the pending_callback marker on ToolResult,
but the worker still emits its own call.done regardless of
the marker. Until the worker is updated to recognise the
marker and skip the emit, the watcher's callback will be
treated as stale by the server (recorded by
noetl_container_callback_stale_total). Harmless during the
transition. Tracked as a comment on this umbrella; will be
filed as a noetl/worker sub-issue once the worker repo's
catalog is ready.
- Umbrella: Rust Worker Migration — parent.
- Execution model rule — the design constraint.
- noetl-worker wiki: https://github.com/noetl/worker/wiki.
- noetl-tools wiki: https://github.com/noetl/tools/wiki.
- Home — overview
- Repo Map
- Releases
- Sessions Log
- Domain-Specific SLM Platform (#139) — RFC (design); travel#63 is the reference impl
- Secrets Wallet (#61) — SECURITY (design)
- Rust Server Port (#49) — PRIMARY
- Decoupled Context + Event Chain (#115) — RFC (design), reframes #101
- Orchestrator Scaling (#101) — reframed by #115; consume side = #115 Phase 1
- Event WAL + Derivable Storage (#104) — Round 01 (locator) PR open
- WASM Plug-in Compilation (#105) — system-pool plug-in hot-reload (ADR Phase 4)
- System Pool Design (#46) — PRIMARY
- Regression Baseline Migration (#98) — e2e
- Subscription / Listener Tool (#90) — RFC
- Container Tool Callback (#43)
- Rust Worker Parity Gaps (#47 · #48)
- Event Envelope Reconciliation (#51 in TaskList)
- Cursor Loop Mode (#100) — server v3.8.0 + tools v3.10.1, 2026-06-15
- Transfer Tool Credentials (#99) — tools v3.10.0 + worker v5.22.0, 2026-06-14
- Explicit Input Binding (#77) — v3.0.0 shipped 2026-06-09
- Rust Worker Migration (#30)
- Python Services → Rust (#45)
- Issue Tracking
- Wiki Convention
- Handoffs
- Deployment Validation
- Execution Model
- Data Access Boundary
- Observability
- noetl/noetl wiki — app + DSL
- noetl/server wiki — Rust control plane
- noetl/worker wiki — Rust pull worker
- noetl/tools wiki — tool registry crate
- noetl/cli wiki — CLI + local mode
- noetl/gateway wiki — gatekeeper
- noetl/ops wiki — Helm + manifests
- noetl/travel wiki — domain SPA reference
- Docs site — engineer-facing architecture