travel slm
Status (2026-06-26): Phases A + B executed. The MLOps pipeline (dataset → finetune → eval → package, with G3 registry lineage) is validated end-to-end, and the first real fine-tune (Apple-Silicon MLX, Qwen2.5-1.5B LoRA) trains and holds 100% schema validity — but does not yet beat the deterministic floor on per-field accuracy (root cause: 58 training examples). The planner is unchanged; nothing shipped to prod.
Two companion pages now carry the live detail:
- Travel SLM journey — the track record: chronology, the teacher-ceiling finding, the dataset, the local LoRA run, the honest verdict, and what's next.
- Training the Travel SLM — the reproducible engineer guide: dataset build, local MLX + GKE GPU training paths, eval, registry lineage, exact commands.
This page below is the original Phase-0 design summary, kept for context.
This page tracks the initiative to replace the two OpenAI calls in the
Muno itinerary-planner with a local, travel-domain Small Language
Model (SLM). Full design lives in the repo RFC:
docs/rfc/travel-slm.md.
Reference implementation of the NoETL Domain-Specific SLM platform. This travel SLM is the worked example that proves the generalized, multi-domain Domain-Specific SLM platform (umbrella noetl/ai-meta#139) end-to-end. That platform extracts this work's MLOps-as-playbooks pattern (automation/mlops/travel-slm/<stage> + the RFC §6A capability gaps) into a reusable, config-driven framework that any org instantiates by writing one slm.config.yaml. The generic automation/mlops/slm/<stage> templates are extracted from travel's playbooks; the §6A capability gaps #70 / #71 / #72 are the seeds of the domain-agnostic platform foundations G1 / G2 / G3 (ai-meta #144 / #145 / #146). No scope change here: travel stays the travel SLM. See the platform RFC docs/rfc/domain-slm-platform.md.
- Umbrella issue: noetl/travel#63 (epic/slm/ml) · children #64–#68 (one per phase)
- Platform umbrella (reference impl): noetl/ai-meta#139, the Domain-Specific SLM platform
- Related: Python → Rust migration · noetl/ai-meta#137 (provider auth) · noetl/ai-meta#130 / #136 (per-hop latency)
The planner is designed around two OpenAI passes. A domain SLM replaces them to cut cost, remove the external dependency + secret from the critical path, own the latency budget in-cluster, and specialize on the narrow travel intent→tool→widget task. It is a swap, not a rewrite: the SLM emits byte-compatible JSON for the two contracts below.
| Pass | Model today | In → Out |
|---|---|---|
| extract_turn | gpt-4o, JSON mode, temp 0 | (user event, slot_state, thread) → {slot_updates, tool_requests, render_intent} |
| render_widget_chat | gpt-4o-mini | (slot_state, extraction, tool_summary, render_intent) → {bot_message, widgets[]} |
The exact I/O contract (tool catalog, argument keys, slot-state schema,
render_intent enum, the 24 widget types and their selection rules) is
specified in the RFC §2, derived from playbooks/agent/system_prompt_extraction.md,
playbooks/agent/system_prompt_chat.md, the widget-contract/*.schema.json
set, and the deterministic Python reference in itinerary-planner.yaml.
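As a rough illustration of what "byte-compatible" means at the top level, the sketch below checks a candidate payload against the output keys named in the pass table. Only the key names come from this page. The helper name `check_top_level`, the `CONTRACT_KEYS` mapping, and the sample payload (including its `render_intent` value) are hypothetical; the real contract, with nested shapes, enums, and widget schemas, is the one in RFC §2.

```python
from typing import Any

# Output keys per pass, copied from the pass table on this page.
# Nested shapes are specified in RFC §2 and are not modelled here.
CONTRACT_KEYS: dict[str, frozenset[str]] = {
    "extract_turn": frozenset({"slot_updates", "tool_requests", "render_intent"}),
    "render_widget_chat": frozenset({"bot_message", "widgets"}),
}


def check_top_level(pass_name: str, payload: dict[str, Any]) -> list[str]:
    """Return problems with the payload's top-level keys; an empty list means OK.

    Hypothetical helper, for illustration only.
    """
    expected = CONTRACT_KEYS[pass_name]
    present = set(payload)
    problems = [f"missing key: {k}" for k in sorted(expected - present)]
    problems += [f"unexpected key: {k}" for k in sorted(present - expected)]
    return problems


# Made-up payload shaped like an extract_turn result.
sample = {"slot_updates": {}, "tool_requests": [], "render_intent": "chat"}
print(check_top_level("extract_turn", sample))
```

A check like this could run on the OpenAI output, the deterministic output, and the SLM output alike, since all three are meant to satisfy the same contract.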
Useful fact for the dataset plan: the current planner branch runs a deterministic Python implementation of both contracts (the step marks
llm_contract.fallback_used: true). That gives the SLM project a working reference baseline and a free second labeling oracle alongside the OpenAI teacher.
- Model: hybrid. A fine-tuned small instruct model (~0.5–3B; default Qwen2.5-1.5B / Llama-3.2-1B, to confirm after a Phase-1 baseline) plus grammar-guided decoding, so output is always schema-valid and tool/widget fields are always in-enum. A deterministic post-processor fills mechanical fields (geo coords, duffel_env: test, CTA ids) and is the safety fallback.
- Serving: mirror the MCP-provider pattern. An in-cluster inference service over HTTP, called from a new automation/agents/mcp/travel-slm playbook, so it slots into the catalog like another provider. (A worker-embedded model is a fallback only if the HTTP hop proves to be the bottleneck.)
- Integration: the planner calls the SLM playbook behind a flag (extraction_engine / render_engine ∈ openai|slm|deterministic, plus slm_shadow) so we can A/B and shadow-eval before any cutover. Routing arcs and widget contracts stay unchanged.
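The flag scheme can be pictured with a small resolver. This is a sketch under assumptions, not the planner's actual logic: the three engine values and the slm_shadow flag come from this page, while the function name `resolve_engines` and the rule that the shadow run is skipped when the SLM is already the primary engine are hypothetical.

```python
VALID_ENGINES = ("openai", "slm", "deterministic")


def resolve_engines(engine: str, slm_shadow: bool) -> tuple[str, list[str]]:
    """Return (primary engine, engines to run in shadow) for one pass.

    Assumption: shadowing the SLM is pointless when it is already primary,
    so the shadow list is empty in that case.
    """
    if engine not in VALID_ENGINES:
        raise ValueError(f"unknown engine {engine!r}; expected one of {VALID_ENGINES}")
    shadow = ["slm"] if slm_shadow and engine != "slm" else []
    return engine, shadow


# Per-pass flags, e.g. extraction on OpenAI with the SLM shadowing it.
print(resolve_engines("openai", slm_shadow=True))
print(resolve_engines("deterministic", slm_shadow=False))
```

Keeping the choice per pass (one call for extraction_engine, one for render_engine) matches the per-pass cutover described in the phase plan.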
Hard requirement: every MLOps stage — not just serving — runs as a NoETL
playbook, never an external script. The SLM dogfoods the platform; its own
lifecycle is event-sourced, replayable, and observable like everything else.
Playbooks live under automation/mlops/travel-slm/.
| Stage | Playbook | Capability needed | Gap |
|---|---|---|---|
| Dataset | dataset_build | teacher API (http + keychain) · event-log replay · large-artifact storage · dataset registry | G3 |
| Train | finetune | GPU container/k8s-Job dispatch · long-running async (hours) · artifact storage + model registry | G1, G2, G3 |
| Eval | eval | http/python metrics vs OpenAI ceiling + deterministic floor · model registry | G3 |
| Shadow | shadow_eval | event-log read via server API (planner slm_shadow flag) | none |
| Package | package | quantize/build job · artifact storage + model registry | G1, G3 |
| Deploy + A/B flip | deploy | ops deploy automation · catalog flag-flip · http smoke | none |
| Cron | retrain_orchestrator | schedule/cron · playbook composition | none |
Playbook-based MLOps at training scale needs three platform capabilities NoETL doesn't have today — flagged, not built by Phase 0:
- G1 (#70): GPU container / k8s-Job dispatch tool (noetl/tools + worker + ops).
- G2 (#71): long-running async job orchestration (callback/poll for hours-long jobs; noetl/server + worker).
- G3 (#72): large binary artifact storage + model/dataset registry in the catalog (noetl/server + noetl; builds on the result tier, ai-meta#104).
dataset_build, eval, shadow_eval run on existing tool kinds today, so
Phase 1 is not blocked; finetune/package are gated on G1–G3 (buildable in
parallel with Phase 1).
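To make the G2 gap concrete, the poll half of "callback/poll" might look roughly like the loop below. This is a generic sketch, not NoETL's design: `poll_until_done`, the status strings, and the default interval and timeout are all hypothetical, and `get_status` stands in for whatever call would report the state of a training job.

```python
import time
from typing import Callable

TERMINAL_STATES = {"succeeded", "failed"}  # hypothetical status names


def poll_until_done(
    get_status: Callable[[], str],
    interval_s: float = 30.0,
    timeout_s: float = 6 * 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until it returns a terminal state or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s} s")
        sleep(interval_s)


# Simulated job that finishes on the third poll; sleeping is stubbed out.
states = iter(["running", "running", "succeeded"])
print(poll_until_done(lambda: next(states), interval_s=0.0, sleep=lambda _s: None))
```

Injecting `sleep` and `clock` keeps the loop testable without waiting. In an event-sourced setting a callback would likely be preferable to holding a worker in a loop for hours, which is part of why this is a platform gap rather than a script.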
| Phase | Playbook(s) | What | Touches prod? |
|---|---|---|---|
| 0: RFC | none | This design + contract + decisions | No |
| 1: Dataset + baseline (#64) | dataset_build, eval | Teacher-bootstrap dataset (OpenAI + deterministic oracle), eval harness, measure ceiling/floor/baseline | No |
| 2: Train / fine-tune (#65) | finetune, package | LoRA/QLoRA chosen model + grammar decoding; hit offline targets (gated on G1–G3) | No |
| 3: Serve + integrate (#66) | deploy + mcp/travel-slm | In-cluster inference service + playbook + planner flags; kind-validate | Flagged, default OpenAI |
| 4: Shadow-eval (#67) | shadow_eval | slm_shadow: true vs OpenAI; match-rate + latency + widget-validity | Shadow only |
| 5: Gated cutover (#68) | deploy (flag-flip) | Flip engine flags per-pass on metric bar; instant rollback | Gated |
| Ongoing: retrain/eval cron | retrain_orchestrator | Scheduled drift-check → conditional retrain → eval → shadow | No (until cutover) |
Target metrics: tool-selection match ≥ 98% · argument fidelity ≥ 95% (100% schema-valid) · slot-update correctness ≥ 95% · render-intent match ≥ 98% · widget-type selection ≥ 98% · widget-schema validity 100% (grammar-enforced) · latency p50/p95 ≤ OpenAI · end-to-end turn equivalence ≥ 95%.
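The thresholds above lend themselves to a mechanical gate. The sketch below encodes them as fractions and reports which ones a measured run misses. The numbers come from this page; the metric key names and the helpers `failed_targets` and `latency_within_openai` are hypothetical, and the example measurements are made up.

```python
# Minimum bars from the target list on this page, as fractions.
# Key names are hypothetical; the eval playbook may name them differently.
TARGETS: dict[str, float] = {
    "tool_selection_match": 0.98,
    "argument_fidelity": 0.95,
    "argument_schema_validity": 1.00,
    "slot_update_correctness": 0.95,
    "render_intent_match": 0.98,
    "widget_type_selection": 0.98,
    "widget_schema_validity": 1.00,
    "turn_equivalence": 0.95,
}


def failed_targets(measured: dict[str, float]) -> list[str]:
    """Names of targets that are missing from `measured` or below their bar."""
    return [name for name, bar in TARGETS.items() if measured.get(name, 0.0) < bar]


def latency_within_openai(slm_ms: dict[str, float], openai_ms: dict[str, float]) -> bool:
    """True when SLM p50 and p95 are both at or below the OpenAI figures."""
    return all(slm_ms[p] <= openai_ms[p] for p in ("p50", "p95"))


# Made-up run: every bar met except tool selection.
run = dict(TARGETS, tool_selection_match=0.97)
print(failed_targets(run))
print(latency_within_openai({"p50": 120.0, "p95": 400.0}, {"p50": 350.0, "p95": 900.0}))
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a gate that silently passes on absent data would defeat the purpose of a metric bar.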
Model size/family · CPU vs GPU serving · fine-tune vs prompt-small-instruct · one model two roles vs two models · teacher token budget + seed-corpus size · cutover aggressiveness · serving stack (vLLM/TGI vs llama.cpp/Ollama) · event-log replay access for the dataset. See RFC §10.