
Umbrella Container Tool Callback

Kadyapam edited this page Jun 2, 2026 · 13 revisions

Umbrella — Container Tool Kind: K8s Job Callback Design

ai-task: noetl/ai-meta#43 · Opened: 2026-06-02 · Last update: 2026-06-02 · Status: In flight (design conversation; no implementation) · Parent umbrella: Rust Worker Migration (specifically R-3 Phase C-2)

Goal

Design the callback pattern that lets the Rust worker dispatch a container tool kind as a Kubernetes Job and resume the playbook when the Job's container completes — without holding a worker slot for the duration of the container run.

This is the canonical instance of the Callback / hook rule from execution-model.md:

A block must not hold a worker slot waiting for an external operation that takes more than a few seconds.

Why this is hard

A container tool kind is fundamentally different from the python, http, and postgres tool kinds:

  • Container startup is 5-30 seconds (image pull + scheduling).
  • Container runtime can be seconds to hours (training jobs, long-running tasks).
  • Holding a worker slot for that duration breaks the atomic-block model and starves real-time playbooks.

The fix: dispatch the K8s Job, release the worker slot immediately, capture execution_id + step in the Job's annotations, and arrange a callback that fires when the Job's container exits.
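
As a minimal sketch of the dispatch half of that fix, the function below builds a batch/v1 Job manifest that carries the resume context in its annotations. The annotation keys (`noetl.io/execution-id`, `noetl.io/step`), the `generateName` prefix, the container name, and the choice of `backoffLimit: 0` are illustrative assumptions; the design only states that execution_id and step are captured in the Job's annotations. The sketch is in Python for brevity and is not the Rust worker's implementation.

```python
# Hypothetical annotation keys: the design does not name them yet.
ANNOTATION_EXECUTION_ID = "noetl.io/execution-id"
ANNOTATION_STEP = "noetl.io/step"


def build_job_manifest(execution_id: str, step: str, tool: dict) -> dict:
    """Build a batch/v1 Job manifest that carries the resume context."""
    container = {
        "name": "tool",  # assumed container name
        "image": tool["image"],
    }
    # Only set optional fields when the catalog entry provides them.
    if "command" in tool:
        container["command"] = tool["command"]
    if "resources" in tool:
        container["resources"] = tool["resources"]

    spec = {
        # Assumption: retries are decided by the playbook, not by the Job.
        "backoffLimit": 0,
        "template": {
            "spec": {"restartPolicy": "Never", "containers": [container]},
        },
    }
    # A per-step timeout maps naturally onto the Job's activeDeadlineSeconds.
    if "timeout_seconds" in tool:
        spec["activeDeadlineSeconds"] = tool["timeout_seconds"]

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "generateName": "noetl-container-",  # assumed name prefix
            # Annotation values must be strings.
            "annotations": {
                ANNOTATION_EXECUTION_ID: str(execution_id),
                ANNOTATION_STEP: str(step),
            },
        },
        "spec": spec,
    }
```

Once the manifest is submitted, the worker holds no further state for the step: whoever observes the Job's completion can read execution_id and step back from the annotations.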

Open design questions

  1. Who watches the Job? Options:
    • A separate operator pod with watch access on Jobs in the cluster.
    • A sidecar in the worker that subscribes to a K8s informer.
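    • The Job itself emits the completion event back to the server (e.g. via a wait-and-post sidecar container or a wrapper entrypoint that posts the exit status). An init container cannot do this because it runs before the main container, and Kubernetes offers only postStart and preStop lifecycle hooks, not postStop.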
  2. What's the resume signal? Proposed: a POST to the server's /api/events endpoint carrying the Job's outcome, after which the orchestrator picks up and continues the playbook.
  3. How to handle Job failure modes? OOMKilled, image pull error, node lost, timeout. Each maps to a structured call.done status.
  4. Image source. Catalog stores image reference; how is it pinned / signed / cached?
  5. Resource requests + limits. Per-step config or per-tool defaults?
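
For question 3, one possible mapping from Kubernetes status to a structured outcome could look like the sketch below. It takes the Job's and the Pod's `status` objects as plain dicts. The reason strings it checks (`DeadlineExceeded`, `OOMKilled`, `ErrImagePull`, `ImagePullBackOff`, `NodeLost`) are values Kubernetes reports, but whether a watcher reliably sees each of them, in particular `NodeLost`, is an assumption to verify. The output field names and values (`status`, `reason`, `exit_code`, `oom_killed`, and so on) are hypothetical.

```python
IMAGE_PULL_REASONS = ("ErrImagePull", "ImagePullBackOff")


def _outcome(status, reason=None, exit_code=None):
    # Hypothetical structured shape for the call.done status.
    return {"status": status, "reason": reason, "exit_code": exit_code}


def classify_job_outcome(job_status: dict, pod_status: dict) -> dict:
    """Map raw Job and Pod status dicts to a structured outcome."""
    # Job-level timeout: checked first, because the killed container
    # would otherwise look like an ordinary non-zero exit.
    for cond in job_status.get("conditions", []):
        if (
            cond.get("type") == "Failed"
            and cond.get("status") == "True"
            and cond.get("reason") == "DeadlineExceeded"
        ):
            return _outcome("failed", "timeout")

    # Pod-level signal that the node went away (assumed to be observable).
    if pod_status.get("reason") == "NodeLost":
        return _outcome("failed", "node_lost")

    # Container-level signals. The sketch assumes a single-container Pod,
    # so the first container with a decisive state wins.
    for cs in pod_status.get("containerStatuses", []):
        state = cs.get("state", {})
        waiting = state.get("waiting") or {}
        if waiting.get("reason") in IMAGE_PULL_REASONS:
            return _outcome("failed", "image_pull_error")
        terminated = state.get("terminated")
        if terminated:
            code = terminated.get("exitCode")
            if terminated.get("reason") == "OOMKilled":
                return _outcome("failed", "oom_killed", code)
            if code == 0:
                return _outcome("completed", None, 0)
            return _outcome("failed", "nonzero_exit", code)

    # Nothing decisive yet: the Job is still pending or running.
    return _outcome("running")
```

Keeping the classification a pure function of the two status objects means it can be unit-tested without a cluster, whichever "who watches" model is chosen.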

Recent activity

Date | Event
2026-06-02 | Issue filed as part of R-3 phase C-2 (after #42 closed with the agent-tool-kind routing decision).
2026-06-02 | No implementation yet; design conversation ongoing. Connects to Umbrella: Python Services to Rust: the container watcher is a candidate for the four-binary noetl/server crate layout.

Next concrete steps

  1. Pick the "who watches" model from the three options above. Recommended: the first option, implemented as a separate noetl-container-watcher Deployment that is part of the noetl/server crate (same image, --mode=container-watcher flag, similar to the four-binary shape proposed in Umbrella: Python Services to Rust).
  2. Sketch the catalog entry shape for a container tool kind:
    tool:
      kind: container
      image: gcr.io/my-project/long-running:v1.2.3
      command: ["./run.sh"]
      env: [...]
      resources:
        requests: { cpu: 500m, memory: 1Gi }
        limits:   { cpu: 2, memory: 4Gi }
      timeout_seconds: 3600
  3. Define the event-shape contract for callback resume.
  4. Validate on a kind cluster with a trivial sleep + echo container playbook before tackling real workloads.
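
As a starting point for step 3, the sketch below shows one possible shape for the callback resume event. It reads execution_id and step back from the Job's annotations and builds, without sending, a POST request to the server's /api/events endpoint. Every payload field name, the `event_type` value layout, the annotation keys, and the server URL in the usage note are assumptions for discussion, not an agreed contract.

```python
import json
import urllib.request

# Hypothetical annotation keys; they must match whatever the dispatcher writes.
ANNOTATION_EXECUTION_ID = "noetl.io/execution-id"
ANNOTATION_STEP = "noetl.io/step"


def build_resume_request(
    server_url: str, annotations: dict, outcome: dict
) -> urllib.request.Request:
    """Build (but do not send) the POST that resumes the playbook."""
    payload = {
        "event_type": "call.done",  # assumed field name for the event kind
        "execution_id": annotations[ANNOTATION_EXECUTION_ID],
        "step": annotations[ANNOTATION_STEP],
        "status": outcome["status"],
        "reason": outcome.get("reason"),
        "exit_code": outcome.get("exit_code"),
    }
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/api/events",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A watcher would pass the result to `urllib.request.urlopen` (or an equivalent HTTP client), for example `build_resume_request("http://noetl-server:8082", job_annotations, outcome)`, where the host and port are placeholders. Separating request construction from sending keeps the contract testable on its own.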
