
Umbrella Domain SLM Platform

Kadyapam edited this page Jun 26, 2026 · 5 revisions

Umbrella — Domain-Specific SLM Platform (RFC, design)

  • ai-task: noetl/ai-meta#139
  • Opened: 2026-06-25
  • Last update: 2026-06-25 (Phase A #140 landed: dataset_build + eval generic templates proven on travel + a 2nd domain; review-only PRs)
  • Status: Phase A in progress (un-gated stages built + validated; no model trained, no infra, no prod change)
  • RFC: docs/rfc/domain-slm-platform.md
  • Reference implementation: noetl/travel#63 (the travel-domain SLM)
  • Generalizes: repos/travel/docs/rfc/travel-slm.md
  • Built on: #104 (result tier → registry substrate), #107 (distributed-OS program → execution substrate)

Tooling for organizations to build AND continuously improve their own domain-specific Small Language Models — entirely orchestrated by NoETL playbooks (MLOps-as-playbooks). Travel (#63) proves it end-to-end; this umbrella extracts the generic, config-driven framework from it so a second domain stands up its SLM by writing one config object, not a pipeline.

Goal

Generalize the travel SLM's MLOps-as-playbooks pattern into a reusable, multi-domain framework with three parts:

  1. A parameterized template pack automation/mlops/slm/<stage> (dataset_build, finetune, eval, shadow_eval, package, deploy, retrain_orchestrator + new traffic_capture, drift_monitor) instantiated by one org-facing slm.config.yaml (I/O contract + data sources + teacher(s) + base-model/recipe + eval metrics/targets + serving target + improvement cadence/gates). The org provides config + schemas + prompts + seed data; the framework supplies every stage's step DAG.
  2. A continuous-improvement loop: production-traffic capture → quality/drift monitoring → teacher-assisted / human-in-the-loop relabel → scheduled retrain → shadow-eval vs incumbent + a deterministic floor → gated auto-promotion with model-registry versioning + instant rollback. All stages are NoETL playbooks on a schedule.
  3. Domain-agnostic platform foundations generalizing travel #70/#71/#72: G1 container/GPU job dispatch, G2 long-async orchestration, G3 artifact storage + a versioned model/dataset/eval registry (on #104), plus G4 experiment store, G5 lineage, G6 cost controls.

The success test: a second domain runs the same templates by config only, zero framework edits.

The org-facing config surface (what makes it "tooling")

An org writes one declarative config, with no pipeline code. It has seven blocks:

  • I/O contract: role schemas + prompts + decoding grammar + optional deterministic oracle
  • Data sources: event-log replay scope + seed/adversarial corpora + redaction policy
  • Teacher(s): labeling ceiling, keychain alias
  • Model/train: base family/size + recipe + role layout
  • Eval: floor + ceiling + metrics/targets + eval sets
  • Serving: CPU/GPU target + the consuming playbook's engine flags
  • Continuous improvement: capture rate + schedule + retrain/promotion gates + governance/budget

The framework supplies the stage DAGs, teacher fan-out + oracle cross-check + schema-validation, GPU-job dispatch + long-async, registry + metric computation + gating, deploy + flag-flip + rollback, capture + drift + scheduled retrain, and the observability/data-access/secrets compliance. See RFC §2.2 for the full annotated config.
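As a rough illustration of the shape of that config surface, the sketch below checks that a config carries all seven blocks before any stage would run. It is written in Python for brevity. The block keys, the `missing_blocks` helper, and every value in `toy_config` are hypothetical; they are not taken from the RFC §2.2 schema, which remains the authority on the real field names.

```python
from typing import Any, Mapping

# Hypothetical keys that mirror the seven blocks named in this section.
# The real key names live in the RFC §2.2 annotated config.
REQUIRED_BLOCKS = (
    "io_contract",
    "data_sources",
    "teachers",
    "model_train",
    "eval",
    "serving",
    "continuous_improvement",
)


def missing_blocks(config: Mapping[str, Any]) -> list[str]:
    """Return the required top-level blocks that are absent or empty."""
    return [name for name in REQUIRED_BLOCKS if not config.get(name)]


# A toy config loosely in the spirit of a small second domain.
# Every value here is illustrative only.
toy_config = {
    "io_contract": {"roles": ["extract"], "oracle": "rules_v1"},
    "data_sources": {"seed_corpus": "seed.jsonl", "redaction": "default"},
    "teachers": [{"provider": "vertex-gemini"}],
    "model_train": {"base_family": "unspecified", "recipe": "unspecified"},
    "eval": {"targets": {"schema_validity": 1.0}},
    "serving": {"target": "cpu"},
    "continuous_improvement": {"capture_rate": 0.1, "schedule": "weekly"},
}

incomplete_config = {"eval": {"targets": {"schema_validity": 1.0}}}
```

Calling `missing_blocks(toy_config)` returns an empty list, while `missing_blocks(incomplete_config)` names the six blocks still to be written. A check like this is one way a framework could fail fast on an incomplete config instead of failing midway through a stage.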

The continuous-improvement loop + its gates

live traffic → traffic_capture → drift_monitor → [retrain TRIGGER?]
                                                        │ yes
   relabel (teacher + oracle, human-in-loop) ──────────┘
        → finetune (challenger) → eval (vs floor + incumbent)
        → shadow_eval (live) → [PROMOTE gates] → deploy (registry version + flag flip)
                                     │ fail → registered, human notified
   rollback = re-flip flag to prior registry version (instant)
  • Retrain triggers (RFC §4.1) — quality regression (> 2pts), outcome regression (> 1.5× baseline corrections/errors), schema-validity dip (< 99.5%), distribution drift (> 5% new intents/tools/entities), data volume (≥ N new labels), contract change (forces retrain), cadence floor (≥ max staleness) — AND a G6 budget check (over-budget defers + alerts, never silently skips).
  • Promotion gates (RFC §4.2) — floor (≥ deterministic baseline on every metric), incumbent-offline (≥ incumbent, no regression > tolerance), schema-validity == 100%, shadow-live (match-rate + p95 ≤ incumbent), latency/cost (≤ budget + ≤ incumbent cost), no-harm-slice (no critical slice regresses). Per-role promotion; every promote/rollback is an auditable event; rollback is a flag re-flip because every version stays in the registry.
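The two bullets above can be read as a pair of pure decision functions. The following is a minimal sketch under stated assumptions, not the framework's implementation: the thresholds come from the lists above, but the field names, the rule that any single trigger is enough to request a retrain, and the assumption that every metric is higher-is-better are all illustrative. Only three of the six promotion gates (floor, incumbent-offline, schema validity) are modeled.

```python
from dataclasses import dataclass


@dataclass
class LoopSignals:
    """Monitoring signals for one role; all field names are hypothetical."""
    quality_drop_pts: float        # offline quality regression, in points
    correction_ratio: float        # corrections/errors relative to baseline
    schema_validity: float         # fraction of schema-valid outputs, 0..1
    new_distribution_share: float  # share of new intents/tools/entities, 0..1
    new_labels: int                # labels gathered since the last train
    contract_changed: bool
    days_since_train: int


def retrain_decision(s: LoopSignals, min_labels: int, max_staleness_days: int,
                     within_budget: bool) -> tuple[str, list[str]]:
    """Return ("retrain" | "defer" | "skip", names of the triggers that fired)."""
    checks = {
        "quality_regression": s.quality_drop_pts > 2.0,
        "outcome_regression": s.correction_ratio > 1.5,
        "schema_validity_dip": s.schema_validity < 0.995,
        "distribution_drift": s.new_distribution_share > 0.05,
        "data_volume": s.new_labels >= min_labels,
        "contract_change": s.contract_changed,
        "cadence_floor": s.days_since_train >= max_staleness_days,
    }
    fired = [name for name, hit in checks.items() if hit]
    if not fired:
        return "skip", fired
    # Over budget defers (the caller should alert); it never silently skips.
    return ("retrain" if within_budget else "defer"), fired


def promotion_failures(challenger: dict[str, float], floor: dict[str, float],
                       incumbent: dict[str, float], tolerance: float,
                       schema_validity: float) -> list[str]:
    """Return the failed gates; an empty list means these three gates pass."""
    failures = []
    for metric, floor_value in floor.items():
        if challenger[metric] < floor_value:
            failures.append(f"floor:{metric}")
    for metric, incumbent_value in incumbent.items():
        if challenger[metric] < incumbent_value - tolerance:
            failures.append(f"incumbent:{metric}")
    if schema_validity < 1.0:
        failures.append("schema_validity")
    return failures
```

Returning the list of fired triggers and failed gates, instead of a bare boolean, keeps each decision explainable, which fits the requirement that every promote/rollback be an auditable event.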

Recent activity

| Date | What | Pointers |
| --- | --- | --- |
| 2026-06-26 | G1/G2/G3 foundation merged + released (review-only); SLM teacher swapped OpenAI→Vertex Gemini. tools v3.19.0 (G1 GPU placement + G2 poll helper), worker v5.47.0 (G2 watcher-free poll completion fallback, NOETL_CONTAINER_COMPLETION_POLL default-off), server v3.46.0 (G3 versioned model/dataset/eval registry, NOETL_REGISTRY_ENABLED default-off): all three merged, released, and ai-meta pointers bumped (2e79af6). The SLM teacher engine became a pluggable provider seam + VertexGeminiProvider (gemini-2.5-pro→flash, GKE-metadata WI token mint, no key/secret; OpenAI keychain dropped from dataset_build.yaml); travel config repointed at it. No prod deploy (flags default-off); no on-cluster Vertex run yet (needs Vertex AI enabled + worker WI binding). | tools #82 f36d020 · worker #139 32b0e96 · server #266 1d86464 · teacher PRs ops #216 + travel #75 · ops#214 held |
| 2026-06-25 | Platform foundations G1 (#144) + G2 (#145): first increment (design + flag-gated/additive code, review-only). Major finding: the Container Tool Callback umbrella (#43) already ships most of G1+G2 (the container tool + the watcher→server-callback→worker-suppress async path, deployed on kind). This increment fills the SLM gaps: G1 GPU placement (node_selector/tolerations/volumes) on the container tool; G2 a watcher-free poll completion fallback (detached poller emits the resume call.done, slot stays free; NOETL_CONTAINER_COMPLETION_POLL default off). Artifact in/out refs defer to G3 (#146) via noetl:// env URNs. Worker depends on tools v3.19.0 release. Live-kind deferred (tools-release ordering + a degraded dev cluster); unit-tested, clippy clean. | RFC docs/rfc/g1-g2-container-job-async.md · tools PR noetl/tools#82 · worker PR noetl/worker#139 · #144 · #145 |
| 2026-06-25 | Phase A (#140): dataset_build + eval shipped as generic, config-driven NoETL playbooks; baseline produced. The two un-gated stages run from one org slm.config.yaml. Travel instance = config #1 (deterministic oracle as zero-cost label source, no OpenAI spend). Config-only proven: a 2nd toy domain (support-triage, no tools/widgets) runs the same playbooks with zero framework edits (RFC §2.2 extraction test). Validated via noetl exec -r local. Baseline (45 turns → 29/16): 100% schema validity (extract/render/48 widget envelopes), eval gate passed, floor latency sub-ms, 9 intents/4 tools/10 widget types covered. | ops PR #212 (generic pack) · travel PR noetl/travel#73 (travel instance) · #140 · noetl/travel#64 |
| 2026-06-25 | RFC + umbrella + 11 children filed (design only). | RFC docs/rfc/domain-slm-platform.md · #139 |

Baseline reading (de-risks the model choice): schema validity is already 100% at the floor, so the SLM must match it, not improve on it; floor latency is essentially free, so an SLM must justify itself on accuracy over the deterministic floor, not on latency. The candidate-ranking number, agreement with the teacher ceiling (OpenAI at the time of this baseline, since swapped to Vertex Gemini), needs teacher labels (Phase 1 next). finetune/package/serve stay gated on G1/G2/G3; the model choice is deferred until this baseline informs it.
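To make the candidate-ranking number concrete, here is a minimal sketch of an agreement rate against the ceiling labels. It assumes exact-match agreement on a single label per turn, which is a simplification chosen for illustration; the actual metric definition is not stated on this page. The function name and all labels below are hypothetical and are not results from the baseline run.

```python
def agreement_rate(predicted: list[str], reference: list[str]) -> float:
    """Fraction of turns where the predicted label equals the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("label lists must be aligned turn by turn")
    if not reference:
        raise ValueError("need at least one labeled turn")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Illustrative labels only (four made-up turns).
teacher_labels = ["book_flight", "cancel", "book_hotel", "other"]
floor_labels = ["book_flight", "cancel", "other", "other"]
candidate_labels = ["book_flight", "cancel", "book_hotel", "other"]

floor_agreement = agreement_rate(floor_labels, teacher_labels)          # 3 of 4 turns
candidate_agreement = agreement_rate(candidate_labels, teacher_labels)  # 4 of 4 turns

# What a candidate adds over the deterministic floor, measured against the ceiling.
gain_over_floor = candidate_agreement - floor_agreement
```

Under this reading, a candidate is only interesting if its agreement with the ceiling exceeds the floor's, since the floor already wins on latency and schema validity.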

Children

| # | Title | Role |
| --- | --- | --- |
| #140 | Phase A: prove the framework on travel via the generic templates | phase |
| #141 | Phase B: extract + parameterize the template pack + land G1/G2/G3 | phase |
| #142 | Phase C: continuous-improvement loop + registry/experiment/lineage | phase |
| #143 | Phase D: packaging / plugin + setup skill for external orgs | phase |
| #144 | G1 container/GPU k8s-Job dispatch tool kind (generalizes travel#70) | foundation |
| #145 | G2 long-running async job orchestration (generalizes travel#71) | foundation |
| #146 | G3 artifact storage + model/dataset/eval registry (generalizes travel#72; on #104) | foundation |
| #147 | G4 experiment / eval-metrics store | foundation |
| #148 | G5 model lineage / provenance | foundation |
| #149 | G6 cost / quota controls | foundation |
| #150 | continuous-improvement-loop engine | loop |
#150 continuous-improvement-loop engine loop

Relationship to travel (the reference implementation)

Travel#63 is the already-in-flight worked example that proves the framework, not a sub-project built later. The generic slm/<stage> templates are extracted from travel's travel-slm/<stage> playbooks (travel is config instance #1); G1/G2/G3 are the generalization of travel#70/#71/#72 (those become the seeds; the platform issues are where the domain-agnostic feature lands; travel consumes it). Travel's contract is the first filled-in config surface, validating the §2.2 schema. Travel's issues + wiki get a "reference implementation of #139" note; no scope change to travel. RFC §7.

Next concrete steps

  1. Platform owner answers the open decisions (RFC §10): internal-capability vs external-marketplace, self-hosted vs managed, supported base-model families, build-vs-buy the registry/experiment store, human-in-the-loop scope, teacher policy + budget, auto-promotion default, GPU-pool ownership.
  2. Phase A (#140): dataset_build + eval done (ops PR #212 + travel PR #73, review-only). Next within Phase A: enable the teacher (decision #6; originally OpenAI, now Vertex Gemini per step 5) for the ceiling + better labels; wire event-log replay (decision #8) for a golden-replay eval set; merge the PRs + bump pointers.
  3. Phase B (#141): extract the remaining stages + land G1/G2/G3. The extraction test is already passing (the examples/support_triage toy domain runs the Phase-A templates config-only); B finishes finetune/package on G1/G2/G3.
  4. G1/G2/G3 (#144/#145/#146): foundation MERGED + RELEASED (review-only). Releases: tools#82 → noetl-tools v3.19.0 (crates.io), worker#139 → noetl-worker v5.47.0 (its CI resolved noetl-tools ^3.19 after the tools release; crate + multi-arch image published), server#266 → noetl-server v3.46.0; ai-meta pointers bumped (2e79af6). Remaining:
     (a) run the poll-path kind validation (RFC "Poll-path kind validation steps") once the dev cluster is drained;
     (b) G1: GPU node pool on real GKE; G2: durable resume across worker restart, terminal-dedup guard (watcher+poll coexistence), cancellation (kube delete), log streaming, bounded poller concurrency;
     (c) G3: the registry playbook client (ops#214, held open, stacked on the SLM Phase-A branch).
  5. SLM teacher OpenAI→Vertex Gemini (re-derived, review-only). The teacher engine is now a pluggable provider seam + VertexGeminiProvider (Vertex generateContent, gemini-2.5-pro→flash, GKE-metadata WI token mint, no key/secret); the travel teacher config points at it (ops #216 + travel #75, Phase-A track). Next: the on-cluster ceiling run needs Vertex AI enabled on noetl-demo-19700101 + the worker pod's WI bound to a Vertex-enabled service account (operator step), then dataset_build produces the oracle↔teacher gap that ranks SLM candidates.
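Step 4 lists a terminal-dedup guard for watcher+poll coexistence among the remaining G2 work. The idea can be sketched as follows: whichever completion path reports a job's terminal state first gets to emit the resume event, and any later report for the same job is dropped. This is an in-memory illustration only, with hypothetical class and function names; it is not the worker's code, and a durable version (needed for resume across worker restart) would have to persist the claimed set.

```python
import threading


class TerminalDedupGuard:
    """Lets only the first terminal report per job through (in-memory sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._claimed: set[str] = set()

    def claim(self, job_id: str) -> bool:
        """Return True exactly once per job_id and False for every later report."""
        with self._lock:
            if job_id in self._claimed:
                return False
            self._claimed.add(job_id)
            return True


# Stand-in for the event sink; a real worker would emit the resume call.done here.
emitted: list[tuple[str, str]] = []


def report_terminal(guard: TerminalDedupGuard, job_id: str, source: str) -> None:
    """Called by either completion path (hypothetical); emits at most once per job."""
    if guard.claim(job_id):
        emitted.append((job_id, source))


guard = TerminalDedupGuard()
report_terminal(guard, "job-1", "poll")     # first report wins
report_terminal(guard, "job-1", "watcher")  # duplicate, dropped
report_terminal(guard, "job-2", "watcher")  # a different job is unaffected
```

After these calls `emitted` holds one entry per job, regardless of how many paths reported it. The lock makes the check-and-claim step atomic, so the guard also holds when the watcher callback and the poller run on different threads.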

Open decisions (for the platform owner)

  1. Internal capability vs external product/marketplace offering.
  2. Self-hosted-only vs managed control plane.
  3. Which base-model families to support (pin a set vs open per domain).
  4. Build vs buy the registry + experiment tracking (native catalog on #104 vs MLflow/W&B/DVC behind the playbook interface).
  5. Human-in-the-loop labeling scope.
  6. Teacher model policy + per-domain token budget.
  7. Auto-promotion default on (gated) vs human-approves-the-flip.
  8. GPU pool ownership + serving-target default (CPU-only vs GPU).
