Skip to content

Umbrella Container Tool Callback

Kadyapam edited this page Jun 10, 2026 · 13 revisions

Umbrella — Container Tool Kind: K8s Job Callback Design

ai-task: noetl/ai-meta#43 · Opened: 2026-06-02 · Closed: 2026-06-07 · Status: CLOSED — all four Rust rounds shipped (1 ops watcher / 2 server callback endpoint / 3 Tool::Container / 5 e2e kind-val rig). Round 4 (Python parity) parked per Rust-only standing direction. · Parent umbrella: Rust Worker Migration (specifically R-3 Phase C-2)

Closing summary (2026-06-07)

All four Rust rounds in. The container tool kind now ships as a self-contained chain — Tool::Container creates a labeled K8s Job and returns immediately; the noetl-k8s-watcher Deployment observes Job state transitions and POSTs to noetl-server's container-callback endpoint; the server emits call.done (matched in-flight) or bumps the stale-counter (transition state until the worker adopts the pending_callback marker).

Round Repo Landing Sub-issue
1 noetl/ops #1678892043 (ci/manifests/k8s-watcher/) ops#166
2 noetl/server #141v2.48.0 server#140
3 noetl/tools #37v2.21.0 tools#36
4 noetl/noetl Parked per Rust-only direction
5 noetl/e2e #3017de21d e2e#29

Worker-side pending_callback adoption (suppressing the worker's own call.done emit when the marker is set) remains as the only coordinated follow-up. Harmless during the transition: the watcher's callback is recorded by noetl_container_callback_stale_total, which is the migration dashboard signal. Tracked as a comment on this umbrella issue until the worker repo's catalog is ready for a sub-issue.

Validation closeout (2026-06-10) — chain green end to end

Acceptance criterion #5 (kind validation) is now GREEN both probes: repos/e2e/scripts/kind_validate_container_callback.sh asserts happy_path → succeeded and oom → failed_oom, and both counter-delta probes pass on the local kind cluster.

Reaching green past the watcher-deploy point (#80) surfaced and fixed three layered bugs that the Round-1/2/5 wiring had left latent:

Layer Symptom Fix
Watcher image watcher.sh: curl: not found → callback POST HTTP 000 (the retired bitnami/kubectl:1.30.3 + a runtime install that never put curl on PATH) image → alpine/k8s:1.30.3 (kubectl + jq + curl baked in); drop the install hack — ops#168cacc513
Server insert every callback 500'd: column "attempt" of relation "event" does not exist — the handler inserted call.done via a stale query targeting columns the deployed noetl.event doesn't have inline INSERT matching handlers::events; outcome in a chk_event_result_shape-shaped resultserver#173 (sub-issue server#174) → v3.0.3
OOM path watcher could never emit failed_oom (only read Job-level conditions); fixture's bytes(40MiB) was calloc-lazy so it never OOM'd; completed_at fallback now → HTTP 422 pod-level OOMKilledfailed_oom classification + RFC3339 now | todate (ops#168); fixture dirties pages via bytearray (e2e#386aaf06e)

Final run:

kind-val: PASS — happy_path   (state=succeeded  counter delta = 1)
kind-val: PASS — oom          (state=failed_oom counter delta = 1)
kind-val: ALL PROBES PASS — Container Tool Callback chain green

The watcher now also classifies failed_image_pull (ImagePullBackOff) via the same pod-status path, matching the six-variant TerminalState taxonomy the server's Round-2 endpoint already accepted.

Goal

Design + implement the callback pattern that lets the Rust worker dispatch a container tool kind as a Kubernetes Job and resume the playbook when the Job's container completes — without holding a worker slot for the duration of the container run.

This is the canonical instance of the Callback / hook rule from execution-model.md:

A block must not hold a worker slot waiting for an external operation that takes more than a few seconds.

Why the existing Python tool violates the rule

The Python noetl-worker has tool: kind: container (noetl/tools/container/executor.py, 1146 LoC across 4 files) that creates a K8s Job + waits for completion via a polling loop:

def _wait_for_job_completion(batch_api, namespace, job_name, timeout_seconds):
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        job = batch_api.read_namespaced_job_status(name=job_name, namespace=namespace)
        # ... check status ...
        time.sleep(POLL_INTERVAL_SECONDS)

A K8s Job can run for minutes or hours; the worker slot is blocked for the full duration. Porting verbatim to Rust would propagate the anti-pattern.

Design decision — Option B (external K8s watcher)

Settled 2026-06-07. Three options were considered (see noetl/ai-meta#43 body for full rationale):

  • Option A (rejected) — sidecar / final command in the Job POSTs completion to noetl-server. Brittle to Job failure modes (the callback might never fire if the Job crashes before reaching the final command).
  • Option B (chosen) — separate noetl-k8s-watcher deployment uses the K8s watch API to observe Job state transitions across the noetl namespace; on terminal-state transition it POSTs the resume event to noetl-server. Handles all Job termination modes (success, failure, eviction); no Job-spec changes required.
  • Option C (rejected) — K8s 1.31+ successPolicy / built-in event watchers hooked into noetl-server directly. Requires K8s version pin + adds K8s API surface to the server; deferred.

The watcher is small (~200 LoC Rust); RBAC + K8s watch are well-trodden territory. The Python container tool keeps the slot-holding poll loop as the "legacy" path until both worker pools dispatch via the new callback shape, then it gets deleted in a follow-up cleanup.

Sub-issue tree (Option B implementation)

Six concrete acceptance criteria, mapped to per-repo sub-issues:

# Sub-issue Repo Round shape
1 noetl-k8s-watcher deployment manifest + RBAC + dev kind validation (noetl/ops#166) noetl/ops Round 1 — small Rust binary or shell-wrapped kubectl get --watch; manifests under ci/manifests/k8s-watcher/; cluster-scoped RBAC for Jobs/watch
2 Server-side callback handler at POST /api/internal/container-callback/<execution_id>/<step> consuming the watcher's POST + emitting the resume event (noetl/server#140) noetl/server Round 2 — single new handler + sub-route + parser for the standard Job-terminal-state payload (success / failure / oomkilled / image-pull-back-off / timeout)
3 Tool::Container in noetl-tools that creates the Job (kube client) + labels it noetl.execution-id=<id> + noetl.step-name=<name> + returns immediately; worker slot frees on the create-Job RPC return (noetl/tools#36) noetl/tools Round 3 — new tool kind alongside result_fetch/artifact; depends on existing keychain / catalog resolution for image refs
4 Python tools/container/executor.py updated to follow the same callback shape — drops _wait_for_job_completion, keeps the create-Job side, returns immediately noetl/noetl Round 4 — Python work; off-limits per the Rust-only standing direction. Defer this round; track separately under a Python-revival umbrella if/when needed. The Rust path (rounds 1–3) is sufficient for new playbooks
5 E2E kind validation: a playbook with a kind: container step runs on the Rust worker; Job creation completes within a few seconds; the resume event fires when the Job terminates (noetl/e2e#29) noetl/e2e Round 5 — kind_validate_container_callback.sh rig; trivial sleep + echo Job; validates both happy path + an OOMKilled failure mode
6 Wiki updates on noetl-tools (Container tool kind page), noetl-server (callback endpoint page + deployment-specification env vars if any), noetl-ops (k8s-watcher deployment page) wikis Lands per-round with the matching code change set (Rule 0a / Rule 2)

Round ordering

Round 1 (ops watcher) and Round 3 (tools tool kind) can ship in either order — they communicate through the server's callback handler in Round 2. Round 2 should land first so the watcher has somewhere to POST. Recommended:

  1. Round 2 — server callback endpoint (handler returns 202 even if no execution is in flight; just emits the resume event when one matches).
  2. Round 1 — k8s-watcher deployment + RBAC, configured to POST to the Round-2 endpoint. Kind-validate by manually kubectl apply-ing a Job with the labels; watcher fires; server logs the resume event.
  3. Round 3 — Rust Tool::Container that creates labeled Jobs. Now end-to-end works.
  4. Round 5 — e2e kind validation rig.
  5. Round 4 (Python) — deferred per standing Rust-only direction.

Catalog entry shape

Settled. YAML:

tool:
  kind: container
  image: gcr.io/my-project/long-running:v1.2.3
  command: ["./run.sh"]
  args: ["--input", "{{ start.input_path }}"]
  env:
    - name: API_KEY
      valueFrom: "{{ auth_alias }}"  # resolves via the keychain
  resources:
    requests: { cpu: 500m, memory: 1Gi }
    limits:   { cpu: 2, memory: 4Gi }
  timeout_seconds: 3600
  service_account: noetl-container-job  # optional; per-step override

The catalog's existing image-ref pinning + signing logic (used by noetl/ops and noetl/script tool kinds today) is reused unchanged.

Failure-mode taxonomy

The watcher POSTs the K8s-reported terminal state mapped to a structured call.done status:

K8s Job condition call.done status Notes
Complete (exit 0) succeeded Standard happy path
Failed with BackoffLimitExceeded failed Container exited non-zero N times
Failed with ImagePullBackOff (init never reached) failed_image_pull Distinguished — alert-worthy when sustained
Pod OOMKilled failed_oom Memory limit too low — playbook author actionable
Pod Evicted (node lost) failed_node_lost Transient; orchestrator may retry
Job.spec.activeDeadlineSeconds exceeded failed_timeout Distinguishes from per-step retry timeout

Each status fingerprint flows into the Job's call.done event; playbook authors can branch on the specific value.

Recent activity

Date Event
2026-06-02 Issue filed as part of R-3 phase C-2 (after #42 closed with the agent-tool-kind routing decision).
2026-06-07 Design settled — Option B (external watcher) chosen; sub-issue tree mapped across noetl/ops + noetl/server + noetl/tools + noetl/noetl + noetl/e2e. Round ordering documented. Failure-mode taxonomy specified. Python round (4) deferred per Rust-only standing direction.
2026-06-07 Round 2 shippedserver#141 (closes server#140; v2.48.0). POST /api/internal/container-callback/{execution_id}/{step} consumes watcher payloads, checks staleness via a single indexed SELECT on noetl.event, emits call.done on match (or bumps noetl_container_callback_stale_total{state} + returns 202 on stale). 6 TerminalState variants survive in meta.terminal_state. 7 new unit tests; lib 487/0. Unblocks Round 1 (watcher Deployment) + Round 3 (Tool::Container).
2026-06-07 Round 1 shippedops#167 (closes ops#166; ops@8892043). noetl-k8s-watcher Deployment + RBAC + shell watcher script in ci/manifests/k8s-watcher/. 5 files (527 lines). Shell MVP — kubectl get jobs --watch -o json piped through jq + curl; a pure-Rust binary is a clean follow-up. Cluster-scoped read-only RBAC (Jobs/Pods get,list,watch). Single-replica Deployment, Recreate strategy. jq classifier maps Job conditions to the six TerminalState enum variants. 3× retry with backoff on 5xx/transport; never on 4xx. In-memory dedup by Job UID. Authorization: Bearer from the existing noetl-internal-api-token Secret. kubectl kustomize renders 327 lines of valid YAML; sh -n watcher.sh clean; jq classification dry-run resolves Completesucceeded. Rounds 1 + 2 chain can be kind-validated end-to-end by manually kubectl apply-ing a labeled Job.
2026-06-07 Round 3 shippedtools#37 (closes tools#36; v2.21.0). Tool::Container creates a labeled K8s Job and returns immediately — worker slot frees as soon as api.create() returns. ContainerConfig mirrors the catalog YAML shape; default ns noetl, default backoffLimit: 0, default restartPolicy: Never; generateName: noetl-container-<slug>-<eid>- (DNS-1123-safe). Additive ToolResult.pending_callback: Option<bool> marker — set by Container to suppress the worker's own call.done emit. Worker-side adoption is a coordinated follow-up; until then the watcher's callback is treated as stale (recorded by noetl_container_callback_stale_total), which is a harmless race during the transition. 10 existing struct-literal sites backfilled with pending_callback: None. 17 new unit tests; lib 258/0. This closes the last code round. Only Round 5 remains.
2026-06-10 Chain unblocked — container-tool command type contradiction fixed (noetl/ai-meta#81; server#172; v3.0.2). Validating the chain on kind surfaced that the container tool kind couldn't execute with ANY command value: server ToolSpec.command was Option<String> (an array command: ["/bin/sh","-c"] failed the ToolDefinition untagged-enum match → 400) while the worker's ContainerConfig.command is Option<Vec<String>> (a scalar → expected a sequence). Fix types server-side command as Option<serde_json::Value> (same as args); ToolCall::from_spec forwards the value verbatim — scalar stays a JSON string for shell/db tools, array passes through to the worker's Vec<String>. Kind-val GREEN end-to-end: server accepts the array, worker dispatches Tool::Container, K8s Job noetl-container-dispatchcontainer-… reaches Complete 1/1 (pre-fix kubectl get jobs stayed empty). The chain's terminal-state counter-bump validation (succeeded / failed_oom) still depends on the resume path + #79 runner-CLI refresh.
2026-06-07 Round 5 shipped — umbrella CLOSESe2e#30 (closes e2e#29; commit 17de21d). Two fixtures (container_callback_happy_path/ + container_callback_oom/) + scripts/kind_validate_container_callback.sh. Happy-path uses alpine:3.19 sleep+echo; OOM uses python:3.12-alpine with a 40 MiB bytes() allocation under a 32Mi memory limit. Rig preflights (kubectl + noetl + curl; watcher Deployment exists + rolled out), registers + executes each fixture, scrapes the server's /metrics BEFORE + AFTER, asserts the sum of noetl_container_callback_total{state=...} + noetl_container_callback_stale_total{state=...} moved by ≥ 1 (the sum-both-counters strategy handles the worker-side pending_callback adoption transition). On failure: dumps watcher logs (tail 50) + server logs filtered to /container-callback/. Returns 0 if both green; 1 otherwise. Bash + YAML validation clean.

Next concrete steps

All four Rust rounds done. Umbrella CLOSED. See the Closing summary at the top of this page for the per-round landing inventory.

Remaining follow-up — worker-side pending_callback adoption ✅ DONE

Status (2026-06-07): worker-side adoption shipped; kind-val against a fresh worker image is the only remaining housekeeping step.

Repo Sub-issue PR Status
noetl/cli cli#55 (closed) cli#56 MERGED (cli@77be8be, v4.10.0) noetl-executor 0.4.1 patch propagates pending_callback through reshape_duckdb_result; 102/0 unit. Released via the release-cli workflow to crates.io.
noetl/worker worker#59 (closed) worker#60 MERGED (worker@f96da71, v5.14.0) executor::command checks tool_result.pending_callback after success: when Some(true) logs INFO + bumps noetl_worker_call_done_skipped_pending_callback_total{tool_kind} + skips own call.done emit; when None the existing emit path is preserved bit-for-bit. Cargo.toml: noetl-tools = "2.18""2.21", noetl-executor = "0.3""0.4" (Cargo.lock resolves the published 0.4.1). 126/0 worker lib tests against published deps.

Done sequence:

  1. ✅ Merged cli#56.
  2. ✅ noetl-executor 0.4.1 published to crates.io.
  3. ✅ worker#60 CI green; un-drafted.
  4. ✅ Merged worker#60.
  5. ⏳ Cloud Build + load a fresh worker image into kind, re-run e2e/scripts/kind_validate_container_callback.sh. Expected dashboard fingerprint after Round 4 lands: server's noetl_container_callback_stale_total{state} stops moving (the race window closes — the worker no longer emits early call.done); worker's noetl_worker_call_done_skipped_pending_callback_total{tool_kind="container"}noetl_container_callback_total{state=...} becomes the healthy-steady-state signal.
  6. ✅ Bumped worker pointer in ai-meta (repos/workerf96da71).

The worker-side adoption was the LAST code piece of this umbrella. The umbrella was closed earlier when all four Rust rounds shipped; this section is the follow-up trail that wraps the post-close coordination work.

Related

NoETL Dashboard

Active Umbrellas

Closed Umbrellas

Conventions

Per-repo wikis

Clone this wiki locally