-
Notifications
You must be signed in to change notification settings - Fork 0
Umbrella Container Tool Callback
ai-task: noetl/ai-meta#43 · Opened: 2026-06-02 · Closed: 2026-06-07 · Status: CLOSED — all four Rust rounds shipped (1 ops watcher / 2 server callback endpoint / 3 Tool::Container / 5 e2e kind-val rig). Round 4 (Python parity) parked per Rust-only standing direction. · Parent umbrella: Rust Worker Migration (specifically R-3 Phase C-2)
All four Rust rounds in. The container tool kind now ships as a
self-contained chain — Tool::Container creates a labeled K8s
Job and returns immediately; the noetl-k8s-watcher Deployment
observes Job state transitions and POSTs to noetl-server's
container-callback endpoint; the server emits call.done
(matched in-flight) or bumps the stale-counter (transition
state until the worker adopts the pending_callback marker).
| Round | Repo | Landing | Sub-issue |
|---|---|---|---|
| 1 | noetl/ops |
#167 → 8892043 (ci/manifests/k8s-watcher/) |
ops#166 |
| 2 | noetl/server | #141 → v2.48.0 | server#140 |
| 3 | noetl/tools | #37 → v2.21.0 | tools#36 |
| 4 | noetl/noetl | Parked per Rust-only direction | — |
| 5 | noetl/e2e |
#30 → 17de21d
|
e2e#29 |
Worker-side pending_callback adoption (suppressing the
worker's own call.done emit when the marker is set) remains as
the only coordinated follow-up. Harmless during the transition:
the watcher's callback is recorded by
noetl_container_callback_stale_total, which is the migration
dashboard signal. Tracked as a comment on this umbrella issue
until the worker repo's catalog is ready for a sub-issue.
Acceptance criterion #5 (kind validation) is now GREEN both
probes: repos/e2e/scripts/kind_validate_container_callback.sh
asserts happy_path → succeeded and oom → failed_oom, and both
counter-delta probes pass on the local kind cluster.
Reaching green past the watcher-deploy point (#80) surfaced and fixed three layered bugs that the Round-1/2/5 wiring had left latent:
| Layer | Symptom | Fix |
|---|---|---|
| Watcher image |
watcher.sh: curl: not found → callback POST HTTP 000 (the retired bitnami/kubectl:1.30.3 + a runtime install that never put curl on PATH) |
image → alpine/k8s:1.30.3 (kubectl + jq + curl baked in); drop the install hack — ops#168 → cacc513
|
| Server insert | every callback 500'd: column "attempt" of relation "event" does not exist — the handler inserted call.done via a stale query targeting columns the deployed noetl.event doesn't have |
inline INSERT matching handlers::events; outcome in a chk_event_result_shape-shaped result — server#173 (sub-issue server#174) → v3.0.3
|
| OOM path | watcher could never emit failed_oom (only read Job-level conditions); fixture's bytes(40MiB) was calloc-lazy so it never OOM'd; completed_at fallback now → HTTP 422 |
pod-level OOMKilled→failed_oom classification + RFC3339 now | todate (ops#168); fixture dirties pages via bytearray (e2e#38 → 6aaf06e) |
Final run:
kind-val: PASS — happy_path (state=succeeded counter delta = 1)
kind-val: PASS — oom (state=failed_oom counter delta = 1)
kind-val: ALL PROBES PASS — Container Tool Callback chain green
The watcher now also classifies failed_image_pull
(ImagePullBackOff) via the same pod-status path, matching the
six-variant TerminalState taxonomy the server's Round-2 endpoint
already accepted.
Design + implement the callback pattern that lets the Rust worker dispatch a container tool kind as a Kubernetes Job and resume the playbook when the Job's container completes — without holding a worker slot for the duration of the container run.
This is the canonical instance of the Callback / hook rule from execution-model.md:
A block must not hold a worker slot waiting for an external operation that takes more than a few seconds.
The Python noetl-worker has tool: kind: container
(noetl/tools/container/executor.py,
1146 LoC across 4 files) that creates a K8s Job + waits for completion
via a polling loop:
def _wait_for_job_completion(batch_api, namespace, job_name, timeout_seconds):
deadline = time.time() + timeout_seconds
while time.time() < deadline:
job = batch_api.read_namespaced_job_status(name=job_name, namespace=namespace)
# ... check status ...
time.sleep(POLL_INTERVAL_SECONDS)A K8s Job can run for minutes or hours; the worker slot is blocked for the full duration. Porting verbatim to Rust would propagate the anti-pattern.
Settled 2026-06-07. Three options were considered (see noetl/ai-meta#43 body for full rationale):
- Option A (rejected) — sidecar / final command in the Job POSTs completion to noetl-server. Brittle to Job failure modes (the callback might never fire if the Job crashes before reaching the final command).
-
Option B (chosen) — separate
noetl-k8s-watcherdeployment uses the K8s watch API to observe Job state transitions across thenoetlnamespace; on terminal-state transition it POSTs the resume event to noetl-server. Handles all Job termination modes (success, failure, eviction); no Job-spec changes required. -
Option C (rejected) — K8s 1.31+
successPolicy/ built-in event watchers hooked into noetl-server directly. Requires K8s version pin + adds K8s API surface to the server; deferred.
The watcher is small (~200 LoC Rust); RBAC + K8s watch are
well-trodden territory. The Python container tool keeps the
slot-holding poll loop as the "legacy" path until both worker
pools dispatch via the new callback shape, then it gets deleted
in a follow-up cleanup.
Six concrete acceptance criteria, mapped to per-repo sub-issues:
| # | Sub-issue | Repo | Round shape |
|---|---|---|---|
| 1 |
noetl-k8s-watcher deployment manifest + RBAC + dev kind validation (noetl/ops#166) |
noetl/ops |
Round 1 — small Rust binary or shell-wrapped kubectl get --watch; manifests under ci/manifests/k8s-watcher/; cluster-scoped RBAC for Jobs/watch
|
| 2 |
Server-side callback handler at POST /api/internal/container-callback/<execution_id>/<step> consuming the watcher's POST + emitting the resume event (noetl/server#140) |
noetl/server |
Round 2 — single new handler + sub-route + parser for the standard Job-terminal-state payload (success / failure / oomkilled / image-pull-back-off / timeout) |
| 3 |
Tool::Container in noetl-tools that creates the Job (kube client) + labels it noetl.execution-id=<id> + noetl.step-name=<name> + returns immediately; worker slot frees on the create-Job RPC return (noetl/tools#36) |
noetl/tools |
Round 3 — new tool kind alongside result_fetch/artifact; depends on existing keychain / catalog resolution for image refs |
| 4 |
Python tools/container/executor.py updated to follow the same callback shape — drops _wait_for_job_completion, keeps the create-Job side, returns immediately |
noetl/noetl |
Round 4 — Python work; off-limits per the Rust-only standing direction. Defer this round; track separately under a Python-revival umbrella if/when needed. The Rust path (rounds 1–3) is sufficient for new playbooks |
| 5 |
E2E kind validation: a playbook with a kind: container step runs on the Rust worker; Job creation completes within a few seconds; the resume event fires when the Job terminates (noetl/e2e#29) |
noetl/e2e |
Round 5 — kind_validate_container_callback.sh rig; trivial sleep + echo Job; validates both happy path + an OOMKilled failure mode |
| 6 |
Wiki updates on noetl-tools (Container tool kind page), noetl-server (callback endpoint page + deployment-specification env vars if any), noetl-ops (k8s-watcher deployment page) |
wikis | Lands per-round with the matching code change set (Rule 0a / Rule 2) |
Round 1 (ops watcher) and Round 3 (tools tool kind) can ship in either order — they communicate through the server's callback handler in Round 2. Round 2 should land first so the watcher has somewhere to POST. Recommended:
- Round 2 — server callback endpoint (handler returns 202 even if no execution is in flight; just emits the resume event when one matches).
-
Round 1 — k8s-watcher deployment + RBAC, configured to
POST to the Round-2 endpoint. Kind-validate by manually
kubectl apply-ing a Job with the labels; watcher fires; server logs the resume event. -
Round 3 — Rust
Tool::Containerthat creates labeled Jobs. Now end-to-end works. - Round 5 — e2e kind validation rig.
- Round 4 (Python) — deferred per standing Rust-only direction.
Settled. YAML:
tool:
kind: container
image: gcr.io/my-project/long-running:v1.2.3
command: ["./run.sh"]
args: ["--input", "{{ start.input_path }}"]
env:
- name: API_KEY
valueFrom: "{{ auth_alias }}" # resolves via the keychain
resources:
requests: { cpu: 500m, memory: 1Gi }
limits: { cpu: 2, memory: 4Gi }
timeout_seconds: 3600
service_account: noetl-container-job # optional; per-step overrideThe catalog's existing image-ref pinning + signing logic (used
by noetl/ops and noetl/script tool kinds today) is reused
unchanged.
The watcher POSTs the K8s-reported terminal state mapped to a
structured call.done status:
| K8s Job condition |
call.done status |
Notes |
|---|---|---|
Complete (exit 0) |
succeeded |
Standard happy path |
Failed with BackoffLimitExceeded
|
failed |
Container exited non-zero N times |
Failed with ImagePullBackOff (init never reached) |
failed_image_pull |
Distinguished — alert-worthy when sustained |
Pod OOMKilled
|
failed_oom |
Memory limit too low — playbook author actionable |
Pod Evicted (node lost) |
failed_node_lost |
Transient; orchestrator may retry |
Job.spec.activeDeadlineSeconds exceeded |
failed_timeout |
Distinguishes from per-step retry timeout |
Each status fingerprint flows into the Job's call.done event;
playbook authors can branch on the specific value.
| Date | Event |
|---|---|
| 2026-06-02 | Issue filed as part of R-3 phase C-2 (after #42 closed with the agent-tool-kind routing decision). |
| 2026-06-07 | Design settled — Option B (external watcher) chosen; sub-issue tree mapped across noetl/ops + noetl/server + noetl/tools + noetl/noetl + noetl/e2e. Round ordering documented. Failure-mode taxonomy specified. Python round (4) deferred per Rust-only standing direction. |
| 2026-06-07 |
Round 2 shipped — server#141 (closes server#140; v2.48.0). POST /api/internal/container-callback/{execution_id}/{step} consumes watcher payloads, checks staleness via a single indexed SELECT on noetl.event, emits call.done on match (or bumps noetl_container_callback_stale_total{state} + returns 202 on stale). 6 TerminalState variants survive in meta.terminal_state. 7 new unit tests; lib 487/0. Unblocks Round 1 (watcher Deployment) + Round 3 (Tool::Container). |
| 2026-06-07 |
Round 1 shipped — ops#167 (closes ops#166; ops@8892043). noetl-k8s-watcher Deployment + RBAC + shell watcher script in ci/manifests/k8s-watcher/. 5 files (527 lines). Shell MVP — kubectl get jobs --watch -o json piped through jq + curl; a pure-Rust binary is a clean follow-up. Cluster-scoped read-only RBAC (Jobs/Pods get,list,watch). Single-replica Deployment, Recreate strategy. jq classifier maps Job conditions to the six TerminalState enum variants. 3× retry with backoff on 5xx/transport; never on 4xx. In-memory dedup by Job UID. Authorization: Bearer from the existing noetl-internal-api-token Secret. kubectl kustomize renders 327 lines of valid YAML; sh -n watcher.sh clean; jq classification dry-run resolves Complete → succeeded. Rounds 1 + 2 chain can be kind-validated end-to-end by manually kubectl apply-ing a labeled Job. |
| 2026-06-07 |
Round 3 shipped — tools#37 (closes tools#36; v2.21.0). Tool::Container creates a labeled K8s Job and returns immediately — worker slot frees as soon as api.create() returns. ContainerConfig mirrors the catalog YAML shape; default ns noetl, default backoffLimit: 0, default restartPolicy: Never; generateName: noetl-container-<slug>-<eid>- (DNS-1123-safe). Additive ToolResult.pending_callback: Option<bool> marker — set by Container to suppress the worker's own call.done emit. Worker-side adoption is a coordinated follow-up; until then the watcher's callback is treated as stale (recorded by noetl_container_callback_stale_total), which is a harmless race during the transition. 10 existing struct-literal sites backfilled with pending_callback: None. 17 new unit tests; lib 258/0. This closes the last code round. Only Round 5 remains. |
| 2026-06-10 |
Chain unblocked — container-tool command type contradiction fixed (noetl/ai-meta#81; server#172; v3.0.2). Validating the chain on kind surfaced that the container tool kind couldn't execute with ANY command value: server ToolSpec.command was Option<String> (an array command: ["/bin/sh","-c"] failed the ToolDefinition untagged-enum match → 400) while the worker's ContainerConfig.command is Option<Vec<String>> (a scalar → expected a sequence). Fix types server-side command as Option<serde_json::Value> (same as args); ToolCall::from_spec forwards the value verbatim — scalar stays a JSON string for shell/db tools, array passes through to the worker's Vec<String>. Kind-val GREEN end-to-end: server accepts the array, worker dispatches Tool::Container, K8s Job noetl-container-dispatchcontainer-… reaches Complete 1/1 (pre-fix kubectl get jobs stayed empty). The chain's terminal-state counter-bump validation (succeeded / failed_oom) still depends on the resume path + #79 runner-CLI refresh. |
| 2026-06-07 |
Round 5 shipped — umbrella CLOSES — e2e#30 (closes e2e#29; commit 17de21d). Two fixtures (container_callback_happy_path/ + container_callback_oom/) + scripts/kind_validate_container_callback.sh. Happy-path uses alpine:3.19 sleep+echo; OOM uses python:3.12-alpine with a 40 MiB bytes() allocation under a 32Mi memory limit. Rig preflights (kubectl + noetl + curl; watcher Deployment exists + rolled out), registers + executes each fixture, scrapes the server's /metrics BEFORE + AFTER, asserts the sum of noetl_container_callback_total{state=...} + noetl_container_callback_stale_total{state=...} moved by ≥ 1 (the sum-both-counters strategy handles the worker-side pending_callback adoption transition). On failure: dumps watcher logs (tail 50) + server logs filtered to /container-callback/. Returns 0 if both green; 1 otherwise. Bash + YAML validation clean. |
All four Rust rounds done. Umbrella CLOSED. See the Closing summary at the top of this page for the per-round landing inventory.
Status (2026-06-07): worker-side adoption shipped; kind-val against a fresh worker image is the only remaining housekeeping step.
| Repo | Sub-issue | PR | Status |
|---|---|---|---|
noetl/cli |
cli#55 (closed) | cli#56 MERGED (cli@77be8be, v4.10.0) | noetl-executor 0.4.1 patch propagates pending_callback through reshape_duckdb_result; 102/0 unit. Released via the release-cli workflow to crates.io. |
noetl/worker |
worker#59 (closed) | worker#60 MERGED (worker@f96da71, v5.14.0) |
executor::command checks tool_result.pending_callback after success: when Some(true) logs INFO + bumps noetl_worker_call_done_skipped_pending_callback_total{tool_kind} + skips own call.done emit; when None the existing emit path is preserved bit-for-bit. Cargo.toml: noetl-tools = "2.18" → "2.21", noetl-executor = "0.3" → "0.4" (Cargo.lock resolves the published 0.4.1). 126/0 worker lib tests against published deps. |
Done sequence:
- ✅ Merged cli#56.
- ✅ noetl-executor 0.4.1 published to crates.io.
- ✅ worker#60 CI green; un-drafted.
- ✅ Merged worker#60.
- ⏳ Cloud Build + load a fresh worker image into kind, re-run
e2e/scripts/kind_validate_container_callback.sh. Expected dashboard fingerprint after Round 4 lands: server'snoetl_container_callback_stale_total{state}stops moving (the race window closes — the worker no longer emits earlycall.done); worker'snoetl_worker_call_done_skipped_pending_callback_total{tool_kind="container"}≈noetl_container_callback_total{state=...}becomes the healthy-steady-state signal. - ✅ Bumped worker pointer in ai-meta (
repos/worker→f96da71).
The worker-side adoption was the LAST code piece of this umbrella. The umbrella was closed earlier when all four Rust rounds shipped; this section is the follow-up trail that wraps the post-close coordination work.
- Umbrella: Rust Worker Migration — parent.
- Execution model rule — the design constraint.
- noetl-worker wiki: https://github.com/noetl/worker/wiki.
- noetl-tools wiki: https://github.com/noetl/tools/wiki.
- Home — overview
- Repo Map
- Releases
- Sessions Log
- Secrets Wallet (#61) — SECURITY (design)
- Rust Server Port (#49) — PRIMARY
- Decoupled Context + Event Chain (#115) — RFC (design), reframes #101
- Orchestrator Scaling (#101) — reframed by #115; consume side = #115 Phase 1
- Event WAL + Derivable Storage (#104) — Round 01 (locator) PR open
- WASM Plug-in Compilation (#105) — system-pool plug-in hot-reload (ADR Phase 4)
- System Pool Design (#46) — PRIMARY
- Regression Baseline Migration (#98) — e2e
- Subscription / Listener Tool (#90) — RFC
- Container Tool Callback (#43)
- Rust Worker Parity Gaps (#47 · #48)
- Event Envelope Reconciliation (#51 in TaskList)
- Cursor Loop Mode (#100) — server v3.8.0 + tools v3.10.1, 2026-06-15
- Transfer Tool Credentials (#99) — tools v3.10.0 + worker v5.22.0, 2026-06-14
- Explicit Input Binding (#77) — v3.0.0 shipped 2026-06-09
- Rust Worker Migration (#30)
- Python Services → Rust (#45)
- Issue Tracking
- Wiki Convention
- Handoffs
- Deployment Validation
- Execution Model
- Data Access Boundary
- Observability
- noetl/noetl wiki — app + DSL
- noetl/server wiki — Rust control plane
- noetl/worker wiki — Rust pull worker
- noetl/tools wiki — tool registry crate
- noetl/cli wiki — CLI + local mode
- noetl/gateway wiki — gatekeeper
- noetl/ops wiki — Helm + manifests
- noetl/travel wiki — domain SPA reference
- Docs site — engineer-facing architecture