Umbrella Domain SLM Platform

Umbrella — Domain-Specific SLM Platform (RFC, design)

ai-task: noetl/ai-meta#139
Opened: 2026-06-25
Last update: 2026-06-25 (Phase A #140 landed: dataset_build + eval generic templates proven on travel + a 2nd domain; review-only PRs)
Status: Phase A in progress (un-gated stages built + validated; no model trained, no infra, no prod change)
RFC: docs/rfc/domain-slm-platform.md
Reference implementation: noetl/travel#63 (the travel-domain SLM)
Generalizes: repos/travel/docs/rfc/travel-slm.md
Built on: #104 (result tier → registry substrate), #107 (distributed-OS program → execution substrate)

Tooling for organizations to build AND continuously improve their own domain-specific Small Language Models — entirely orchestrated by NoETL playbooks (MLOps-as-playbooks). Travel (#63) proves it end-to-end; this umbrella extracts the generic, config-driven framework from it so a second domain stands up its SLM by writing one config object, not a pipeline.

Goal

Generalize the travel SLM's MLOps-as-playbooks pattern into a reusable, multi-domain framework with three parts:

A parameterized template pack automation/mlops/slm/<stage> (dataset_build, finetune, eval, shadow_eval, package, deploy, retrain_orchestrator + new traffic_capture, drift_monitor) instantiated by one org-facing slm.config.yaml (I/O contract + data sources + teacher(s) + base-model/recipe + eval metrics/targets + serving target + improvement cadence/gates). The org provides config + schemas + prompts + seed data; the framework supplies every stage's step DAG.
A continuous-improvement loop: production-traffic capture → quality/drift monitoring → teacher-assisted / human-in-the-loop relabel → scheduled retrain → shadow-eval vs incumbent + a deterministic floor → gated auto-promotion with model-registry versioning + instant rollback. All stages are NoETL playbooks on a schedule.
Domain-agnostic platform foundations generalizing travel #70/#71/#72: G1 container/GPU job dispatch, G2 long-async orchestration, G3 artifact storage + a versioned model/dataset/eval registry (on #104), plus G4 experiment store, G5 lineage, G6 cost controls.

The success test: a second domain runs the same templates by config only, zero framework edits.
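The success test can be pictured in miniature as follows. This is only a sketch: the stage names and the `automation/mlops/slm` root come from the template pack described above, while `StageRun`, `plan`, and the `domain` key are hypothetical names invented for illustration, not part of NoETL.

```python
from dataclasses import dataclass

# Stage names and root path as listed for the template pack above.
TEMPLATE_ROOT = "automation/mlops/slm"
STAGES = (
    "dataset_build", "finetune", "eval", "shadow_eval", "package",
    "deploy", "retrain_orchestrator", "traffic_capture", "drift_monitor",
)


@dataclass(frozen=True)
class StageRun:
    template: str   # shared, domain-agnostic playbook path
    domain: str     # hypothetical key naming the org config that fills it


def plan(config: dict) -> list:
    """Bind one org config to every stage template (hypothetical helper)."""
    return [StageRun(f"{TEMPLATE_ROOT}/{stage}", config["domain"]) for stage in STAGES]


travel = plan({"domain": "travel"})
triage = plan({"domain": "support_triage"})

# The extraction test in miniature: same templates, different config.
same_templates = [r.template for r in travel] == [r.template for r in triage]
print(same_templates, travel[0].template)
```

The point of the sketch is that nothing under `TEMPLATE_ROOT` varies by domain; only the config passed to `plan` does.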

The org-facing config surface (what makes it "tooling")

An org writes one declarative config, with no pipeline code. It has seven blocks:
I/O contract: role schemas + prompts + decoding grammar + optional deterministic oracle
data sources: event-log replay scope + seed/adversarial corpora + redaction policy
teacher(s): labeling ceiling, keychain alias
model/train: base family/size + recipe + role layout
eval: floor + ceiling + metrics/targets + eval sets
serving: CPU/GPU target + the consuming playbook's engine flags
continuous improvement: capture rate + schedule + retrain/promotion gates + governance/budget

The framework supplies the stage DAGs, teacher fan-out + oracle cross-check + schema-validation, GPU-job dispatch + long-async, registry + metric computation + gating, deploy + flag-flip + rollback, capture + drift + scheduled retrain, and the observability/data-access/secrets compliance. See RFC §2.2 for the full annotated config.
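To make the seven-block shape concrete, here is a minimal sketch of a completeness check over a config loaded as a Python dict. The key names (`io_contract`, `data_sources`, and so on) and every value in `example_config` are assumptions made for illustration; the authoritative schema is the annotated config in RFC §2.2.

```python
# Hypothetical key names for the seven blocks; the real schema is in RFC §2.2.
REQUIRED_BLOCKS = (
    "io_contract", "data_sources", "teachers", "model",
    "eval", "serving", "continuous_improvement",
)


def missing_blocks(config: dict) -> list:
    """Return the required top-level blocks that the config does not define."""
    return [block for block in REQUIRED_BLOCKS if block not in config]


# Placeholder values only; none of these come from a real slm.config.yaml.
example_config = {
    "io_contract": {"roles": ["extract"], "oracle": None},
    "data_sources": {"seed_corpus": "seed.jsonl", "redaction": "default"},
    "teachers": [{"provider": "example-teacher"}],
    "model": {"base_family": "example-base", "recipe": "example-recipe"},
    "eval": {"metrics": {"schema_validity_pct": 100.0}},
    "serving": {"target": "cpu"},
    "continuous_improvement": {"capture_rate": 0.1, "schedule": "weekly"},
}

incomplete = {k: v for k, v in example_config.items() if k != "serving"}
print(missing_blocks(example_config), missing_blocks(incomplete))
```

A check like this would run before any stage is dispatched, so a domain with an incomplete config fails fast instead of partway through a pipeline.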

The continuous-improvement loop + its gates

live traffic → traffic_capture → drift_monitor → [retrain TRIGGER?]
                                                        │ yes
   relabel (teacher + oracle, human-in-loop) ──────────┘
        → finetune (challenger) → eval (vs floor + incumbent)
        → shadow_eval (live) → [PROMOTE gates] → deploy (registry version + flag flip)
                                     │ fail → registered, human notified
   rollback = re-flip flag to prior registry version (instant)
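The last line of the diagram, rollback as a flag re-flip, works only because every version stays registered. A minimal in-memory sketch of that idea follows; `ModelRegistry` and its methods are hypothetical stand-ins, not the G3 registry API.

```python
class ModelRegistry:
    """In-memory stand-in for a versioned registry plus the serving flag."""

    def __init__(self):
        self.versions = []   # every registered version is retained
        self.history = []    # versions that have held the serving flag, oldest first
        self.audit = []      # (event, version) pairs, one per promote/rollback

    @property
    def active(self):
        return self.history[-1] if self.history else None

    def register(self, version):
        if version in self.versions:
            raise ValueError(f"{version} already registered")
        self.versions.append(version)

    def promote(self, version):
        if version not in self.versions:
            raise KeyError(version)
        self.history.append(version)
        self.audit.append(("promote", version))

    def rollback(self):
        if len(self.history) < 2:
            raise RuntimeError("no prior version to roll back to")
        self.history.pop()
        self.audit.append(("rollback", self.active))
        return self.active


registry = ModelRegistry()
for v in ("v1", "v2"):
    registry.register(v)
    registry.promote(v)
registry.register("v3")          # failed challenger: registered, never promoted
restored = registry.rollback()   # flag returns to v1; v2 stays in the registry
```

Rollback here moves a pointer and writes an audit entry; no artifact is rebuilt or redeployed, which is what makes it instant.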

Retrain triggers (RFC §4.1):
quality regression: > 2pts
outcome regression: > 1.5× baseline corrections/errors
schema-validity dip: < 99.5%
distribution drift: > 5% new intents/tools/entities
data volume: ≥ N new labels
contract change: forces retrain
cadence floor: ≥ max staleness
Every trigger is additionally subject to a G6 budget check: an over-budget retrain defers + alerts, and is never silently skipped.

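The trigger list above can be read as a set of independent checks followed by a budget check. The sketch below encodes it that way, using the thresholds quoted above; the `Signals` fields, the function names, and the way cost is compared to budget are assumptions for illustration, since RFC §4.1 is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    quality_drop_pts: float      # quality lost versus the incumbent, in points
    corrections_ratio: float     # current corrections/errors divided by baseline
    schema_validity_pct: float
    novel_share_pct: float       # share of new intents/tools/entities
    new_labels: int
    contract_changed: bool
    days_since_train: int


def fired_triggers(s, min_new_labels, max_staleness_days):
    """Names of the triggers that fire, using the thresholds listed above."""
    checks = {
        "quality_regression": s.quality_drop_pts > 2.0,
        "outcome_regression": s.corrections_ratio > 1.5,
        "schema_validity_dip": s.schema_validity_pct < 99.5,
        "distribution_drift": s.novel_share_pct > 5.0,
        "data_volume": s.new_labels >= min_new_labels,
        "contract_change": s.contract_changed,
        "cadence_floor": s.days_since_train >= max_staleness_days,
    }
    return [name for name, hit in checks.items() if hit]


def decide(s, *, min_new_labels, max_staleness_days, est_cost, budget_left):
    fired = fired_triggers(s, min_new_labels, max_staleness_days)
    if not fired:
        return "no_retrain", fired
    if est_cost > budget_left:
        return "defer_and_alert", fired   # over budget: never a silent skip
    return "retrain", fired


signals = Signals(0.4, 1.0, 99.1, 1.2, 10, False, 3)
decision = decide(signals, min_new_labels=500, max_staleness_days=30,
                  est_cost=40.0, budget_left=100.0)
```

With these made-up signals only the schema-validity check fires, so `decision` is a retrain; the same signals with less remaining budget than `est_cost` would yield `defer_and_alert` instead.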
Promotion gates (RFC §4.2):
floor: ≥ deterministic baseline on every metric
incumbent-offline: ≥ incumbent, no regression > tolerance
schema-validity: == 100%
shadow-live: match-rate + p95 ≤ incumbent
latency/cost: ≤ budget + ≤ incumbent cost
no-harm-slice: no critical slice regresses
Promotion is per role; every promote/rollback is an auditable event; rollback is a flag re-flip because every version stays in the registry.
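A sketch of how the six promotion gates might combine into a single per-role decision is given below. It assumes every metric is higher-is-better, that the shadow-live match-rate is compared against a configured minimum, and that results arrive as plain dicts; those choices, and all field and function names, are illustrative assumptions rather than the RFC §4.2 definitions.

```python
def gate_results(challenger, incumbent, floor, *, tolerance, min_match_rate,
                 latency_budget_ms, cost_budget, critical_slices):
    """Evaluate each promotion gate; all metrics are assumed higher-is-better."""
    names = floor["metrics"].keys()
    c, i = challenger, incumbent
    return {
        "floor": all(c["metrics"][m] >= floor["metrics"][m] for m in names),
        "incumbent_offline": all(
            c["metrics"][m] >= i["metrics"][m] - tolerance for m in names
        ),
        "schema_validity": c["schema_validity_pct"] == 100.0,
        "shadow_live": (c["shadow_match_rate"] >= min_match_rate
                        and c["p95_ms"] <= i["p95_ms"]),
        "latency_cost": (c["p95_ms"] <= latency_budget_ms
                         and c["cost"] <= cost_budget
                         and c["cost"] <= i["cost"]),
        "no_harm_slice": all(c["slices"][s] >= i["slices"][s] for s in critical_slices),
    }


def failed_gates(results):
    return [gate for gate, ok in results.items() if not ok]


# Made-up numbers for one role.
floor = {"metrics": {"intent_acc": 0.80, "tool_acc": 0.75}}
incumbent = {"metrics": {"intent_acc": 0.90, "tool_acc": 0.85},
             "p95_ms": 120.0, "cost": 1.0, "slices": {"payments": 0.88}}
challenger = {"metrics": {"intent_acc": 0.92, "tool_acc": 0.845},
              "schema_validity_pct": 100.0, "shadow_match_rate": 0.97,
              "p95_ms": 110.0, "cost": 0.9, "slices": {"payments": 0.90}}
limits = dict(tolerance=0.01, min_match_rate=0.95, latency_budget_ms=150.0,
              cost_budget=2.0, critical_slices=["payments"])

promote = not failed_gates(gate_results(challenger, incumbent, floor, **limits))
```

Returning the per-gate results rather than a bare boolean keeps the failure path informative: a challenger that fails is still registered, and the notification to a human can name the gate that blocked it.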

Recent activity

Date	What	Pointers
2026-06-26	G1/G2/G3 foundation merged + released (review-only); SLM teacher swapped OpenAI→Vertex Gemini. tools v3.19.0 (G1 GPU placement + G2 poll helper), worker v5.47.0 (G2 watcher-free poll completion fallback, `NOETL_CONTAINER_COMPLETION_POLL` default-off), server v3.46.0 (G3 versioned model/dataset/eval registry, `NOETL_REGISTRY_ENABLED` default-off) — all three merged, released, and ai-meta pointers bumped (`2e79af6`). The SLM teacher engine became a pluggable provider seam + `VertexGeminiProvider` (gemini-2.5-pro→flash, GKE-metadata WI token mint, no key/secret; OpenAI keychain dropped from `dataset_build.yaml`); travel config repointed at it. No prod deploy (flags default-off); no on-cluster Vertex run yet (needs Vertex AI enabled + worker WI binding).	tools #82 `f36d020` · worker #139 `32b0e96` · server #266 `1d86464` · teacher PRs ops #216 + travel #75 · ops#214 held
2026-06-25	Platform foundations G1 (#144) + G2 (#145) — first increment (design + flag-gated/additive code, review-only). Major finding: the Container Tool Callback umbrella (#43) already ships most of G1+G2 (the `container` tool + the watcher→server-callback→worker-suppress async path, deployed on kind). This increment fills the SLM gaps: G1 GPU placement (`node_selector`/`tolerations`/`volumes`) on the container tool; G2 a watcher-free poll completion fallback (detached poller emits the resume `call.done`, slot stays free; `NOETL_CONTAINER_COMPLETION_POLL` default off). Artifact in/out refs defer to G3 (#146) via `noetl://` env URNs. Worker depends on tools v3.19.0 release. Live-kind deferred (tools-release ordering + a degraded dev cluster); unit-tested, clippy clean.	RFC `docs/rfc/g1-g2-container-job-async.md` · tools PR noetl/tools#82 · worker PR noetl/worker#139 · #144 · #145
2026-06-25	Phase A (#140) — dataset_build + eval shipped as generic, config-driven NoETL playbooks; baseline produced. The two un-gated stages run from one org `slm.config.yaml`. Travel instance = config #1 (deterministic oracle as zero-cost label source — no OpenAI spend). Config-only proven: a 2nd toy domain (support-triage, no tools/widgets) runs the same playbooks with zero framework edits (RFC §2.2 extraction test). Validated via `noetl exec -r local`. Baseline (45 turns → 29/16): 100% schema validity (extract/render/48 widget envelopes), eval gate passed, floor latency sub-ms, 9 intents/4 tools/10 widget types covered.	ops PR #212 (generic pack) · travel PR noetl/travel#73 (travel instance) · #140 · noetl/travel#64
2026-06-25	RFC + umbrella + 11 children filed (design only).	RFC `docs/rfc/domain-slm-platform.md` · #139

Baseline reading (de-risks the model choice): schema validity is already 100% at the floor, so the SLM must match it rather than improve on it; floor latency is essentially free, so an SLM must justify itself on accuracy over the deterministic floor, not on latency. The candidate-ranking number (agreement with the OpenAI ceiling) needs teacher labels (Phase 1 next). finetune/package/serve stay gated on G1/G2/G3; model choice is deferred until this baseline informs it.
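The candidate-ranking number mentioned above is an agreement rate against the teacher ceiling. A minimal sketch of that computation follows, assuming labels are compared by exact equality per turn; the function name and the sample labels are made up for illustration and are not baseline results.

```python
def agreement_rate(candidate, ceiling):
    """Fraction of turns where the candidate's label equals the ceiling's label."""
    if len(candidate) != len(ceiling):
        raise ValueError("label lists must cover the same turns")
    if not ceiling:
        raise ValueError("need at least one labelled turn")
    matches = sum(c == t for c, t in zip(candidate, ceiling))
    return matches / len(ceiling)


# Made-up labels for four turns; not taken from the Phase A baseline.
ceiling_labels = ["book", "cancel", "search", "book"]
floor_labels = ["book", "cancel", "search", "search"]

floor_agreement = agreement_rate(floor_labels, ceiling_labels)
gap = 1.0 - floor_agreement   # headroom a candidate SLM could close
print(floor_agreement, gap)
```

Under this reading, a candidate is worth training only if its agreement with the ceiling exceeds the floor's; a small gap means the deterministic floor already captures most of what the teacher knows.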

Children

#	Title	Role
#140	Phase A: prove the framework on travel via the generic templates	phase
#141	Phase B: extract + parameterize the template pack + land G1/G2/G3	phase
#142	Phase C: continuous-improvement loop + registry/experiment/lineage	phase
#143	Phase D: packaging / plugin + setup skill for external orgs	phase
#144	G1 container/GPU k8s-Job dispatch tool kind (generalizes travel#70)	foundation
#145	G2 long-running async job orchestration (generalizes travel#71)	foundation
#146	G3 artifact storage + model/dataset/eval registry (generalizes travel#72; on #104)	foundation
#147	G4 experiment / eval-metrics store	foundation
#148	G5 model lineage / provenance	foundation
#149	G6 cost / quota controls	foundation
#150	continuous-improvement-loop engine	loop

Relationship to travel (the reference implementation)

Travel#63 is the already-in-flight worked example that proves the framework, not a sub-project built later. The generic slm/<stage> templates are extracted from travel's travel-slm/<stage> playbooks (travel is config instance #1); G1/G2/G3 are the generalization of travel#70/#71/#72 (those become the seeds; the platform issues are where the domain-agnostic feature lands; travel consumes it). Travel's contract is the first filled-in config surface, validating the §2.2 schema. Travel's issues + wiki get a "reference implementation of #139" note; no scope change to travel. RFC §7.

Next concrete steps

Platform owner answers the open decisions (RFC §10): internal-capability vs external-marketplace, self-hosted vs managed, supported base-model families, build-vs-buy the registry/experiment store, human-in-the-loop scope, teacher policy + budget, auto-promotion default, GPU-pool ownership.
Phase A (#140) — dataset_build + eval done (ops PR #212 + travel PR #73, review-only). Next within Phase A: enable the OpenAI teacher (decision #6) for the ceiling + better labels; wire event-log replay (decision #8) for a golden-replay eval set; merge the PRs + bump pointers.
Phase B (#141) — extract the remaining stages + land G1/G2/G3. The extraction test is already passing (the examples/support_triage toy domain runs the Phase-A templates config-only); B finishes finetune/package on G1/G2/G3.
G1/G2/G3 (#144/#145/#146) — foundation MERGED + RELEASED (review-only): tools#82 → noetl-tools v3.19.0 (crates.io), worker#139 → noetl-worker v5.47.0 (its CI resolved noetl-tools ^3.19 after the tools release; crate + multi-arch image published), server#266 → noetl-server v3.46.0; ai-meta pointers bumped (2e79af6). Remaining: (a) run the poll-path kind validation (RFC "Poll-path kind validation steps") once the dev cluster is drained; (b) G1 — GPU node pool on real GKE; G2 — durable resume across worker restart, terminal-dedup guard (watcher+poll coexistence), cancellation (kube delete), log streaming, bounded poller concurrency; (c) G3 — the registry playbook client (ops#214, held open, stacked on the SLM Phase-A branch).
SLM teacher OpenAI→Vertex Gemini (re-derived, review-only). The teacher engine is now a pluggable provider seam + VertexGeminiProvider (Vertex generateContent, gemini-2.5-pro→flash, GKE-metadata WI token mint, no key/secret); the travel teacher config points at it (ops #216 + travel #75, Phase-A track). Next: the on-cluster ceiling run needs Vertex AI enabled on noetl-demo-19700101 + the worker pod's WI bound to a Vertex-enabled service account (operator step), then dataset_build produces the oracle↔teacher gap that ranks SLM candidates.
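One of the remaining G2 items listed above is a terminal-dedup guard for watcher+poll coexistence: when both paths observe the same job finishing, only one resume event should be emitted. The sketch below shows the idea in a single process; `TerminalDedup`, `report_completion`, and the tuple passed to `emit` are hypothetical, and a real guard would presumably need shared, durable state rather than an in-memory set.

```python
import threading


class TerminalDedup:
    """Lets exactly one reporter claim the terminal event for each job."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = set()

    def claim(self, job_id):
        # Check-and-add under one lock so two reporters cannot both win.
        with self._lock:
            if job_id in self._done:
                return False
            self._done.add(job_id)
            return True


def report_completion(guard, job_id, source, emit):
    """Emit the completion once; later reports for the same job are dropped."""
    if guard.claim(job_id):
        emit((job_id, source))
        return True
    return False


guard = TerminalDedup()
emitted = []
first = report_completion(guard, "job-1", "watcher", emitted.append)
second = report_completion(guard, "job-1", "poll", emitted.append)
```

Claiming before emitting trades one failure mode for another: a duplicate can never be sent, but if `emit` raises after a successful claim the event is lost unless the caller retries, so a production version would need to decide which of the two it prefers.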

Open decisions (for the platform owner)

Internal capability vs external product/marketplace offering.
Self-hosted-only vs managed control plane.
Which base-model families to support (pin a set vs open per domain).
Build vs buy the registry + experiment tracking (native catalog on #104 vs MLflow/W&B/DVC behind the playbook interface).
Human-in-the-loop labeling scope.
Teacher model policy + per-domain token budget.
Auto-promotion default on (gated) vs human-approves-the-flip.
GPU pool ownership + serving-target default (CPU-only vs GPU).

Domain-Specific SLM Platform (#139) — RFC (design); travel#63 is the reference impl
Secrets Wallet (#61) — SECURITY (design)
Rust Server Port (#49) — PRIMARY
Decoupled Context + Event Chain (#115) — RFC (design), reframes #101
Orchestrator Scaling (#101) — reframed by #115; consume side = #115 Phase 1
Event WAL + Derivable Storage (#104) — Round 01 (locator) PR open
WASM Plug-in Compilation (#105) — system-pool plug-in hot-reload (ADR Phase 4)
System Pool Design (#46) — PRIMARY
Regression Baseline Migration (#98) — e2e
Subscription / Listener Tool (#90) — RFC
Container Tool Callback (#43)
Rust Worker Parity Gaps (#47 · #48)
Event Envelope Reconciliation (#51 in TaskList)

Closed Umbrellas

Cursor Loop Mode (#100) — server v3.8.0 + tools v3.10.1, 2026-06-15
Transfer Tool Credentials (#99) — tools v3.10.0 + worker v5.22.0, 2026-06-14
Explicit Input Binding (#77) — v3.0.0 shipped 2026-06-09
Rust Worker Migration (#30)
Python Services → Rust (#45)

Conventions

Per-repo wikis

noetl/noetl wiki — app + DSL
noetl/server wiki — Rust control plane
noetl/worker wiki — Rust pull worker
noetl/tools wiki — tool registry crate
noetl/cli wiki — CLI + local mode
noetl/gateway wiki — gatekeeper
noetl/ops wiki — Helm + manifests
noetl/travel wiki — domain SPA reference
Docs site — engineer-facing architecture
