Umbrella Domain SLM Platform
ai-task: noetl/ai-meta#139
· Opened: 2026-06-25
· Last update: 2026-06-25 (Phase A #140 landed — dataset_build + eval generic templates proven on travel + a 2nd domain; review-only PRs)
· Status: Phase A in progress (un-gated stages built + validated; no model trained, no infra, no prod change)
· RFC: docs/rfc/domain-slm-platform.md
· Reference implementation: noetl/travel#63 (the travel-domain SLM)
· Generalizes: repos/travel/docs/rfc/travel-slm.md
· Built on: #104 (result tier → registry substrate), #107 (distributed-OS program → execution substrate)
Tooling for organizations to build AND continuously improve their own domain-specific Small Language Models — entirely orchestrated by NoETL playbooks (MLOps-as-playbooks). Travel (#63) proves it end-to-end; this umbrella extracts the generic, config-driven framework from it so a second domain stands up its SLM by writing one config object, not a pipeline.
Generalize the travel SLM's MLOps-as-playbooks pattern into a reusable, multi-domain framework with three parts:
- A parameterized template pack automation/mlops/slm/<stage> (dataset_build, finetune, eval, shadow_eval, package, deploy, retrain_orchestrator + new traffic_capture, drift_monitor) instantiated by one org-facing slm.config.yaml (I/O contract + data sources + teacher(s) + base-model/recipe + eval metrics/targets + serving target + improvement cadence/gates). The org provides config + schemas + prompts + seed data; the framework supplies every stage's step DAG.
- A continuous-improvement loop: production-traffic capture → quality/drift monitoring → teacher-assisted / human-in-the-loop relabel → scheduled retrain → shadow-eval vs incumbent + a deterministic floor → gated auto-promotion with model-registry versioning + instant rollback. All stages are NoETL playbooks on a schedule.
- Domain-agnostic platform foundations generalizing travel #70/#71/#72: G1 container/GPU job dispatch, G2 long-async orchestration, G3 artifact storage + a versioned model/dataset/eval registry (on #104), plus G4 experiment store, G5 lineage, G6 cost controls.
The success test: a second domain runs the same templates by config only, zero framework edits.
An org writes one declarative config — no pipeline code. Seven blocks: I/O contract (role schemas + prompts + decoding grammar + optional deterministic oracle), data sources (event-log replay scope + seed/adversarial corpora + redaction policy), teacher(s) (labeling ceiling, keychain alias), model/train (base family/size + recipe + role layout), eval (floor + ceiling + metrics/targets + eval sets), serving (CPU/GPU target + the consuming playbook's engine flags), continuous improvement (capture rate + schedule + retrain/promotion gates + governance/budget). The framework supplies the stage DAGs, teacher fan-out + oracle cross-check + schema-validation, GPU-job dispatch + long-async, registry + metric computation + gating, deploy + flag-flip + rollback, capture + drift + scheduled retrain, and the observability/data-access/secrets compliance. See RFC §2.2 for the full annotated config.
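As a rough illustration of the "seven blocks" idea, the sketch below checks that a config carries all seven top-level blocks before any stage runs. The block names and the toy values are assumptions made for this example only; the authoritative schema is the annotated config in RFC §2.2.

```python
# Hypothetical block names, one per block described above.
# The real keys are defined by the annotated config in RFC §2.2.
REQUIRED_BLOCKS = (
    "io_contract",
    "data_sources",
    "teachers",
    "model_train",
    "eval",
    "serving",
    "continuous_improvement",
)


def missing_blocks(config: dict) -> list[str]:
    """Return the required top-level blocks that are absent or empty."""
    return [name for name in REQUIRED_BLOCKS if not config.get(name)]


# Toy second-domain config; every value here is illustrative, not a real setting.
support_triage = {
    "io_contract": {"roles": ["extract"], "oracle": "rules"},
    "data_sources": {"seed_corpus": "seed.jsonl", "redaction": "default"},
    "teachers": [{"provider": "example-teacher"}],
    "model_train": {"base_family": "example-base", "recipe": "example-recipe"},
    "eval": {"metrics": ["schema_validity"], "floor": "oracle"},
    "serving": {"target": "cpu"},
    "continuous_improvement": {"capture_rate": 0.1, "schedule": "weekly"},
}

if __name__ == "__main__":
    print(missing_blocks(support_triage))
```

A check like this would fail fast on an incomplete config, which fits the stated goal that a second domain is stood up by config alone.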
live traffic → traffic_capture → drift_monitor → [retrain TRIGGER?]
│ yes
relabel (teacher + oracle, human-in-loop) ──────────┘
→ finetune (challenger) → eval (vs floor + incumbent)
→ shadow_eval (live) → [PROMOTE gates] → deploy (registry version + flag flip)
│ fail → registered, human notified
rollback = re-flip flag to prior registry version (instant)
- Retrain triggers (RFC §4.1) — quality regression (> 2pts), outcome regression (> 1.5× baseline corrections/errors), schema-validity dip (< 99.5%), distribution drift (> 5% new intents/tools/entities), data volume (≥ N new labels), contract change (forces retrain), cadence floor (≥ max staleness) — AND a G6 budget check (over-budget defers + alerts, never silently skips).
- Promotion gates (RFC §4.2) — floor (≥ deterministic baseline on every metric), incumbent-offline (≥ incumbent, no regression > tolerance), schema-validity == 100%, shadow-live (match-rate + p95 ≤ incumbent), latency/cost (≤ budget + ≤ incumbent cost), no-harm-slice (no critical slice regresses). Per-role promotion; every promote/rollback is an auditable event; rollback is a flag re-flip because every version stays in the registry.
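The retrain triggers in the first bullet can be read as a set of independent threshold checks followed by the G6 budget check. The sketch below is one possible encoding, not the framework's implementation: the `Signals` fields, the function names, and the decision strings are hypothetical, while the numeric thresholds are the ones quoted above. The values of N (minimum new labels) and the maximum staleness are not given in this page, so they are passed in as parameters.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical monitoring snapshot; field names are illustrative."""
    quality_drop_pts: float     # quality metric drop vs. baseline, in points
    error_ratio: float          # corrections/errors as a multiple of baseline
    schema_validity: float      # fraction of schema-valid outputs (0..1)
    new_category_share: float   # share of new intents/tools/entities (0..1)
    new_labels: int             # labels collected since the last training run
    contract_changed: bool      # I/O contract changed since the last run
    days_since_train: int       # age of the incumbent model in days


def fired_triggers(s: Signals, min_new_labels: int, max_staleness_days: int) -> list[str]:
    """Return the names of all retrain triggers that currently fire."""
    checks = {
        "quality_regression": s.quality_drop_pts > 2.0,
        "outcome_regression": s.error_ratio > 1.5,
        "schema_validity_dip": s.schema_validity < 0.995,
        "distribution_drift": s.new_category_share > 0.05,
        "data_volume": s.new_labels >= min_new_labels,
        "contract_change": s.contract_changed,
        "cadence_floor": s.days_since_train >= max_staleness_days,
    }
    return [name for name, fired in checks.items() if fired]


def retrain_decision(s: Signals, min_new_labels: int, max_staleness_days: int,
                     run_cost: float, budget_left: float) -> tuple[str, list[str]]:
    """Combine the triggers with the budget check.

    An over-budget retrain is deferred with an alert, never silently skipped.
    """
    fired = fired_triggers(s, min_new_labels, max_staleness_days)
    if not fired:
        return "no_retrain", fired
    if run_cost > budget_left:
        return "defer_and_alert", fired
    return "retrain", fired
```

Returning the list of fired triggers alongside the decision keeps the reason for each retrain available for the audit trail that the promotion bullet calls for.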
| Date | What | Pointers |
|---|---|---|
| 2026-06-26 | G1/G2/G3 foundation merged + released (review-only); SLM teacher swapped OpenAI→Vertex Gemini. tools v3.19.0 (G1 GPU placement + G2 poll helper), worker v5.47.0 (G2 watcher-free poll completion fallback, NOETL_CONTAINER_COMPLETION_POLL default-off), server v3.46.0 (G3 versioned model/dataset/eval registry, NOETL_REGISTRY_ENABLED default-off): all three merged, released, and ai-meta pointers bumped (2e79af6). The SLM teacher engine became a pluggable provider seam + VertexGeminiProvider (gemini-2.5-pro→flash, GKE-metadata WI token mint, no key/secret; OpenAI keychain dropped from dataset_build.yaml); travel config repointed at it. No prod deploy (flags default-off); no on-cluster Vertex run yet (needs Vertex AI enabled + worker WI binding). | tools #82 f36d020 · worker #139 32b0e96 · server #266 1d86464 · teacher PRs ops #216 + travel #75 · ops#214 held |
| 2026-06-25 | Platform foundations G1 (#144) + G2 (#145): first increment (design + flag-gated/additive code, review-only). Major finding: the Container Tool Callback umbrella (#43) already ships most of G1+G2 (the container tool + the watcher→server-callback→worker-suppress async path, deployed on kind). This increment fills the SLM gaps: G1 GPU placement (node_selector/tolerations/volumes) on the container tool; G2 a watcher-free poll completion fallback (detached poller emits the resume call.done, slot stays free; NOETL_CONTAINER_COMPLETION_POLL default off). Artifact in/out refs defer to G3 (#146) via noetl:// env URNs. Worker depends on tools v3.19.0 release. Live-kind deferred (tools-release ordering + a degraded dev cluster); unit-tested, clippy clean. | RFC docs/rfc/g1-g2-container-job-async.md · tools PR noetl/tools#82 · worker PR noetl/worker#139 · #144 · #145 |
| 2026-06-25 | Phase A (#140): dataset_build + eval shipped as generic, config-driven NoETL playbooks; baseline produced. The two un-gated stages run from one org slm.config.yaml. Travel instance = config #1 (deterministic oracle as zero-cost label source, no OpenAI spend). Config-only proven: a 2nd toy domain (support-triage, no tools/widgets) runs the same playbooks with zero framework edits (RFC §2.2 extraction test). Validated via noetl exec -r local. Baseline (45 turns → 29/16): 100% schema validity (extract/render/48 widget envelopes), eval gate passed, floor latency sub-ms, 9 intents/4 tools/10 widget types covered. | ops PR #212 (generic pack) · travel PR noetl/travel#73 (travel instance) · #140 · noetl/travel#64 |
| 2026-06-25 | RFC + umbrella + 11 children filed (design only). | RFC docs/rfc/domain-slm-platform.md · #139 |
Baseline reading (de-risks the model choice): schema validity is already 100% at the floor (the SLM must match, not improve it); the floor latency is essentially free, so an SLM must justify itself on accuracy over the deterministic floor, not latency. The candidate-ranking number — agreement with the OpenAI ceiling — needs teacher labels (Phase 1 next). finetune/package/serve stay gated on G1/G2/G3; model-choice deferred until this baseline informs it.
| # | Title | Role |
|---|---|---|
| #140 | Phase A: prove the framework on travel via the generic templates | phase |
| #141 | Phase B: extract + parameterize the template pack + land G1/G2/G3 | phase |
| #142 | Phase C: continuous-improvement loop + registry/experiment/lineage | phase |
| #143 | Phase D: packaging / plugin + setup skill for external orgs | phase |
| #144 | G1 container/GPU k8s-Job dispatch tool kind (generalizes travel#70) | foundation |
| #145 | G2 long-running async job orchestration (generalizes travel#71) | foundation |
| #146 | G3 artifact storage + model/dataset/eval registry (generalizes travel#72; on #104) | foundation |
| #147 | G4 experiment / eval-metrics store | foundation |
| #148 | G5 model lineage / provenance | foundation |
| #149 | G6 cost / quota controls | foundation |
| #150 | continuous-improvement-loop engine | loop |
Travel#63 is the already-in-flight worked example that proves the framework, not a sub-project built later. The generic slm/<stage> templates are extracted from travel's travel-slm/<stage> playbooks (travel is config instance #1); G1/G2/G3 are the generalization of travel#70/#71/#72 (those become the seeds; the platform issues are where the domain-agnostic feature lands; travel consumes it). Travel's contract is the first filled-in config surface, validating the §2.2 schema. Travel's issues + wiki get a "reference implementation of #139" note; no scope change to travel. RFC §7.
- Platform owner answers the open decisions (RFC §10): internal-capability vs external-marketplace, self-hosted vs managed, supported base-model families, build-vs-buy the registry/experiment store, human-in-the-loop scope, teacher policy + budget, auto-promotion default, GPU-pool ownership.
- Phase A (#140): dataset_build + eval done (ops PR #212 + travel PR #73, review-only). Next within Phase A: enable the OpenAI teacher (decision #6) for the ceiling + better labels; wire event-log replay (decision #8) for a golden-replay eval set; merge the PRs + bump pointers.
- Phase B (#141): extract the remaining stages + land G1/G2/G3. The extraction test is already passing (the examples/support_triage toy domain runs the Phase-A templates config-only); B finishes finetune/package on G1/G2/G3.
- G1/G2/G3 (#144/#145/#146): foundation MERGED + RELEASED (review-only): tools#82 → noetl-tools v3.19.0 (crates.io), worker#139 → noetl-worker v5.47.0 (its CI resolved noetl-tools ^3.19 after the tools release; crate + multi-arch image published), server#266 → noetl-server v3.46.0; ai-meta pointers bumped (2e79af6). Remaining: (a) run the poll-path kind validation (RFC "Poll-path kind validation steps") once the dev cluster is drained; (b) G1: GPU node pool on real GKE; G2: durable resume across worker restart, terminal-dedup guard (watcher+poll coexistence), cancellation (kube delete), log streaming, bounded poller concurrency; (c) G3: the registry playbook client (ops#214, held open, stacked on the SLM Phase-A branch).
- SLM teacher OpenAI→Vertex Gemini (re-derived, review-only). The teacher engine is now a pluggable provider seam + VertexGeminiProvider (Vertex generateContent, gemini-2.5-pro→flash, GKE-metadata WI token mint, no key/secret); the travel teacher config points at it (ops #216 + travel #75, Phase-A track). Next: the on-cluster ceiling run needs Vertex AI enabled on noetl-demo-19700101 + the worker pod's WI bound to a Vertex-enabled service account (operator step), then dataset_build produces the oracle↔teacher gap that ranks SLM candidates.
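To make the "pluggable provider seam" concrete, here is a minimal sketch of how a teacher provider could be selected from config so that swapping teachers needs no change to the labeling code. It is an assumption-laden illustration, not the actual engine: `TeacherProvider`, `StubTeacher`, `PROVIDERS`, `build_teacher`, and the config keys are all hypothetical, and the stub makes no network calls (a real provider such as VertexGeminiProvider would call the model API inside `label`).

```python
from typing import Protocol


class TeacherProvider(Protocol):
    """The seam: anything with a label() method can act as a teacher."""

    def label(self, prompt: str) -> str:
        ...


class StubTeacher:
    """Offline stand-in for a real provider; returns a deterministic string."""

    def __init__(self, model: str):
        self.model = model

    def label(self, prompt: str) -> str:
        return f"[{self.model}] label for: {prompt}"


# Hypothetical registry keyed by the provider name used in the config.
PROVIDERS = {"stub": StubTeacher}


def build_teacher(teacher_cfg: dict) -> TeacherProvider:
    """Instantiate the provider named in the (hypothetical) teacher config."""
    factory = PROVIDERS[teacher_cfg["provider"]]
    return factory(**teacher_cfg.get("options", {}))


def label_turns(teacher: TeacherProvider, prompts: list[str]) -> list[str]:
    """Label a batch of prompts; this code never names a concrete provider."""
    return [teacher.label(p) for p in prompts]


if __name__ == "__main__":
    teacher = build_teacher({"provider": "stub", "options": {"model": "demo"}})
    print(label_turns(teacher, ["book a flight"]))
```

Under this shape, a teacher swap amounts to registering one more class and changing the provider name in the config.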
- Internal capability vs external product/marketplace offering.
- Self-hosted-only vs managed control plane.
- Which base-model families to support (pin a set vs open per domain).
- Build vs buy the registry + experiment tracking (native catalog on #104 vs MLflow/W&B/DVC behind the playbook interface).
- Human-in-the-loop labeling scope.
- Teacher model policy + per-domain token budget.
- Auto-promotion default on (gated) vs human-approves-the-flip.
- GPU pool ownership + serving-target default (CPU-only vs GPU).
- RFC: docs/rfc/domain-slm-platform.md
- Reference impl RFC: repos/travel/docs/rfc/travel-slm.md · travel wiki travel-slm
- Umbrella: Event WAL Storage (#104, registry substrate)
- Execution Model · Data Access Boundary · Observability