Umbrella Container Tool Callback

Umbrella — Container Tool Kind: K8s Job Callback Design

ai-task: noetl/ai-meta#43 · Opened: 2026-06-02 · Last update: 2026-06-02 · Status: In flight (design conversation; no implementation) · Parent umbrella: Rust Worker Migration (specifically R-3 Phase C-2)

Goal

Design the callback pattern that lets the Rust worker dispatch a container tool kind as a Kubernetes Job and resume the playbook when the Job's container completes — without holding a worker slot for the duration of the container run.

This is the canonical instance of the Callback / hook rule from execution-model.md:

A block must not hold a worker slot waiting for an external operation that takes more than a few seconds.

Why this is hard

A container tool kind is fundamentally different from python, http, postgres:

- Container startup is 5-30 seconds (image pull + scheduling).
- Container runtime can be seconds to hours (training jobs, long-running tasks).
- Holding a worker slot for that duration breaks the atomic-block model and starves real-time playbooks.

The fix: dispatch the K8s Job, release the worker slot immediately, capture execution_id + step in the Job's annotations, and arrange a callback that fires when the Job's container exits.
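The dispatch step can be sketched as follows. This is a minimal illustration, assuming the worker submits a standard `batch/v1` Job manifest. The annotation keys (`noetl.io/execution-id`, `noetl.io/step`), the function names, and the Job naming scheme are hypothetical and not part of the design; only the Kubernetes field names (`activeDeadlineSeconds`, `backoffLimit`, `restartPolicy`) are real.

```python
import hashlib

# Hypothetical annotation keys; the design has not fixed these names.
ANNOTATION_EXECUTION_ID = "noetl.io/execution-id"
ANNOTATION_STEP = "noetl.io/step"


def build_job_manifest(execution_id: str, step: str, tool: dict) -> dict:
    """Build a batch/v1 Job manifest that carries the resume coordinates."""
    # Deterministic name, so a retried dispatch maps to the same Job.
    digest = hashlib.sha256(f"{execution_id}/{step}".encode()).hexdigest()[:10]
    container = {
        "name": "tool",
        "image": tool["image"],
        "command": list(tool.get("command", [])),
        "resources": tool.get("resources", {}),
    }
    spec = {
        "backoffLimit": 0,  # assumption: retries are the orchestrator's job
        "template": {
            "spec": {"restartPolicy": "Never", "containers": [container]},
        },
    }
    if "timeout_seconds" in tool:
        # Kubernetes fails the Job once this deadline passes.
        spec["activeDeadlineSeconds"] = tool["timeout_seconds"]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"noetl-tool-{digest}",
            "annotations": {
                ANNOTATION_EXECUTION_ID: execution_id,
                ANNOTATION_STEP: step,
            },
        },
        "spec": spec,
    }


def resume_coordinates(job: dict) -> tuple:
    """Recover (execution_id, step) from a Job, as a watcher would."""
    annotations = job["metadata"]["annotations"]
    return annotations[ANNOTATION_EXECUTION_ID], annotations[ANNOTATION_STEP]
```

Because the resume coordinates travel with the Job object itself, whichever component observes completion needs no shared state with the worker that dispatched it.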

Open design questions

Who watches the Job? Options:
- A separate operator pod with watch access on Jobs in the cluster.
- A sidecar in the worker that subscribes to a K8s informer.
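- The Job itself emits the completion event back to the server (e.g. via a wait-and-post sidecar container or a wrapper around the entrypoint). Note that a regular init container finishes before the main container starts, and Kubernetes has no postStop hook: only postStart and preStop exist, and preStop does not run when a container exits on its own.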
What's the resume signal? The server's /api/events POST with the Job's outcome → orchestrator picks up and continues the playbook.
How to handle Job failure modes? OOMKilled, image pull error, node lost, timeout. Each maps to a structured call.done status.
Image source. Catalog stores image reference; how is it pinned / signed / cached?
Resource requests + limits. Per-step config or per-tool defaults?
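To make the failure-mode question concrete, here is a hedged sketch of how observed Kubernetes state might be folded into a single structured status. The reason strings `OOMKilled`, `ErrImagePull`, `ImagePullBackOff`, and `DeadlineExceeded` are values Kubernetes reports; the status names, the `retryable` flag, and the treatment of a vanished pod as "node lost" are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

# Container "waiting" reasons that Kubernetes reports for image pull problems.
IMAGE_PULL_REASONS = {"ErrImagePull", "ImagePullBackOff"}


@dataclass(frozen=True)
class Outcome:
    status: str      # hypothetical call.done status value
    retryable: bool  # hypothetical hint for the orchestrator


def classify_job_outcome(
    job_failed_reason: Optional[str] = None,
    waiting_reason: Optional[str] = None,
    terminated_reason: Optional[str] = None,
    exit_code: Optional[int] = None,
    pod_missing: bool = False,
) -> Outcome:
    """Map raw Job / container state to one structured outcome."""
    # Most specific signals first, so a generic non-zero exit code
    # does not mask the real cause.
    if terminated_reason == "OOMKilled":
        return Outcome("oom_killed", retryable=False)
    if waiting_reason in IMAGE_PULL_REASONS:
        return Outcome("image_pull_error", retryable=False)
    if job_failed_reason == "DeadlineExceeded":
        return Outcome("timeout", retryable=False)
    if pod_missing:
        # Assumption: a pod that vanished without a terminated state
        # is reported as a lost node.
        return Outcome("node_lost", retryable=True)
    if exit_code == 0:
        return Outcome("succeeded", retryable=False)
    if exit_code is None:
        return Outcome("unknown", retryable=True)
    return Outcome("failed", retryable=False)
```

Whether any of these outcomes should be retried automatically is itself an open question; the flag above only shows where such a decision could live.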

Recent activity

Date	Event
2026-06-02	Issue filed as part of R-3 phase C-2 (after #42 closed with the agent-tool-kind routing decision).
2026-06-02	No implementation yet; design conversation ongoing. Connects to Umbrella: Python Services to Rust — the container watcher is a candidate for the four-binary `noetl/server` crate layout.

Next concrete steps

Pick the "who watches" model from the three options above. Recommended: a separate noetl-container-watcher Deployment that is part of the noetl/server crate (same image, --mode=container-watcher flag, similar to the four-binary shape proposed in Umbrella: Python Services to Rust).

Sketch the catalog entry shape for a container tool kind:

tool:
  kind: container
  image: gcr.io/my-project/long-running:v1.2.3
  command: ["./run.sh"]
  env: [...]
  resources:
    requests: { cpu: 500m, memory: 1Gi }
    limits:   { cpu: 2, memory: 4Gi }
  timeout_seconds: 3600

Define the event-shape contract for callback resume.
Validate on kind cluster with a trivial sleep + echo container playbook before tackling real workloads.
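As a starting point for the event-shape contract, the sketch below shows one possible body for the POST to /api/events. Every field name, the `idempotency_key` idea, and the helper names are assumptions for discussion; only the endpoint path and the `call.done` name come from this page.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

EVENTS_PATH = "/api/events"  # endpoint named on this page


@dataclass(frozen=True)
class ContainerCallbackEvent:
    """Hypothetical callback event; field names are not yet agreed."""
    execution_id: str
    step: str
    status: str               # e.g. "succeeded", "oom_killed", "timeout"
    exit_code: Optional[int]  # None when the container never started
    job_name: str
    event_type: str = "call.done"

    def idempotency_key(self) -> str:
        # Lets the server drop duplicates if completion is reported twice.
        return f"{self.execution_id}:{self.step}:{self.job_name}"


def to_request_body(event: ContainerCallbackEvent) -> str:
    """Serialize the event as the JSON body of the POST to EVENTS_PATH."""
    payload = asdict(event)
    payload["idempotency_key"] = event.idempotency_key()
    # Sorted keys keep the body stable across runs, which helps when diffing.
    return json.dumps(payload, sort_keys=True)
```

For the sleep + echo validation playbook, a successful run would produce an event with `status="succeeded"` and `exit_code=0`, which is enough to check that the orchestrator resumes the right step.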
