Travel SLM Journey
This page is the chronological record of the travel-domain Small Language Model (SLM) effort: the goal, the decisions, the dead ends, the numbers, and the honest current verdict. It is the companion to the step-by-step Training the Travel SLM guide — read this page for why and what happened, that page for how to do it.
For the original design, see the RFC (`docs/rfc/travel-slm.md`) and the design-stage Travel-domain SLM page.
- Umbrella issue: noetl/travel#63 (epic/slm/ml)
- Platform umbrella (this is the reference impl): noetl/ai-meta#139
- Phase tracking: dataset+baseline #64 · train/fine-tune #65 · serve #66 · shadow #67 · cutover #68
| | |
|---|---|
| Pipeline (dataset → finetune → eval → package, with G3 registry lineage) | Validated end-to-end: offline CPU stub + real local LoRA |
| First real fine-tune (Apple-Silicon MLX, Qwen2.5-1.5B LoRA) | Done: holds 100% schema validity |
| Per-field accuracy vs the deterministic floor | Not beaten yet (tool 0.56 / intent 0.56 / widget_type 0.38) |
| Muno-ready (can replace the OpenAI/Gemini path) | No: blocked on training data volume (58 examples) |
| Next lever | event-log replay to ~1000 turns, richer render conditioning, true grammar-constrained decode |
Nothing here touched prod. The planner still runs the OpenAI/deterministic path; no GKE, IAM, secret, or flag change shipped.
The Muno itinerary-planner is built around two LLM calls:
| Pass | Model today | In → Out |
|---|---|---|
| `extract_turn` | `gpt-4o`, JSON mode, temp 0 | `(user event, slot_state, thread) → {slot_updates, tool_requests, render_intent}` |
| `render_widget_chat` | `gpt-4o-mini` | `(slot_state, extraction, tool_summary, render_intent) → {bot_message, widgets[]}` |
A fine-tuned, in-cluster SLM replaces those two passes to cut cost,
remove the external dependency and its secret from the critical path,
own the latency budget in-cluster, and specialize on the narrow
travel intent → tool → widget task. It is a swap, not a rewrite: the
SLM emits byte-compatible JSON for the same two contracts, behind engine
flags (extraction_engine / render_engine ∈ openai|slm|deterministic)
so we can A/B and shadow-eval before any cutover.
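As a rough illustration of the engine-flag idea, the sketch below routes one pass to whichever backend the flag names. Only the three flag values come from the text above; `make_dispatcher`, `fake_extract`, and the request shape are hypothetical stand-ins, not the planner's actual dispatch code.

```python
from typing import Callable, Dict

ENGINES = ("openai", "slm", "deterministic")  # flag values named in the text

Handler = Callable[[dict], dict]


def make_dispatcher(handlers: Dict[str, Handler]) -> Callable[[str, dict], dict]:
    """Build a router that sends one pass to the handler selected by the flag."""
    missing = [name for name in ENGINES if name not in handlers]
    if missing:
        raise ValueError(f"no handler registered for: {missing}")

    def dispatch(engine: str, request: dict) -> dict:
        if engine not in ENGINES:
            raise ValueError(f"unknown engine {engine!r}")
        return handlers[engine](request)

    return dispatch


def fake_extract(request: dict) -> dict:
    # Hypothetical stand-in: every engine must return the same contract shape.
    return {"slot_updates": {}, "tool_requests": [], "render_intent": "collect_missing"}


extract_turn = make_dispatcher({name: fake_extract for name in ENGINES})
result = extract_turn("slm", {"user_event": "plan a trip"})
print(result["render_intent"])
```

Because every handler returns the same shape, a shadow evaluation could call two engines on the same request and diff the outputs without touching the caller.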
The exact I/O contract (4 tools, 12 render intents, 24 widget types, the slot-state schema) is pinned in the RFC §2 and reproduced as a deterministic Python module — which becomes the floor (next section).
Before any model, we built a deterministic oracle:
`automation/mlops/slm/travel/oracle.py`.
It is a pure-stdlib, rule-based reimplementation of both passes, derived
from the planner's own deterministic fallback logic (the steps that emit
`llm_contract.fallback_used: true`) and the two system prompts
(`playbooks/agent/system_prompt_extraction.md`, `system_prompt_chat.md`).
The oracle plays two roles:
- Label source — it produces training labels at zero token cost.
- Quality floor — every model candidate is scored against the oracle's decisions, and the oracle is the safety fallback a model must not regress below.
The floor is 100% schema-valid by construction and effectively instant (deterministic Python, sub-millisecond per turn). That sets the bar precisely: a model wins only by matching more of the oracle's field-level decisions than a trivial baseline does, while holding the 100% validity — it can't win on latency or validity, only on accuracy of the harder, context-dependent turns.
The plan was to bootstrap a richer dataset with a strong teacher model. We chose Vertex Gemini, with the access token minted from the worker pod's Workload Identity via the GKE metadata server: no API key, no Secret Manager. OpenAI was dropped as the teacher (RFC decision #6).
The first ceiling run was the turning point. Raw gemini-2.5-pro,
with no output-schema enforcement (exec 328967313246134272), scored:
- 0% valid widget envelopes
- 49% valid extractions
Both scores are below the deterministic floor. The model was not weak: it
picked sensible widget types, but it emitted the wrong tool-request keys
(`tool_id` instead of `tool`) and empty payloads, so the envelopes failed
the contract schema.
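To make that failure mode concrete, here is a minimal stdlib check for the two problems named above. The `tool` key comes from the text; the payload key name `args` and the tool name `search_flights` are assumptions for illustration, and the real contract schema in the RFC is richer than this.

```python
from typing import List

REQUIRED_KEY = "tool"  # from the text: the contract key is `tool`, not `tool_id`
PAYLOAD_KEY = "args"   # hypothetical name for the payload field


def tool_request_errors(request: dict) -> List[str]:
    """Return human-readable contract violations for one tool request."""
    errors = []
    if REQUIRED_KEY not in request:
        errors.append(f"missing key {REQUIRED_KEY!r}")
    for key in request:
        if key not in (REQUIRED_KEY, PAYLOAD_KEY):
            errors.append(f"unexpected key {key!r}")
    if not request.get(PAYLOAD_KEY):
        errors.append(f"empty or missing {PAYLOAD_KEY!r}")
    return errors


# The shape of mistake described above: wrong key name plus an empty payload.
bad = {"tool_id": "search_flights", "args": {}}
good = {"tool": "search_flights", "args": {"origin": "SFO"}}

print(tool_request_errors(bad))
print(tool_request_errors(good))
```

A request can name a perfectly reasonable tool and still fail on every count, which is why an unconstrained model scored below the floor.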
The conclusion that reframed the whole effort: the lever is not a
bigger model. It is schema/grammar-constrained decoding. When we
handed the teacher a Vertex generationConfig.responseSchema derived from
the contract schemas (lib/slm_schema.py converts draft-07 → Vertex
responseSchema), the cheaper gemini-2.5-flash became schema-valid
by construction — 100% valid widget envelopes — at a fraction of the
cost. The Phase-1 dataset build spent $0.068 total (85 flash calls, 5
transport timeouts) to label 40 of 45 turns.
This is the design center of everything downstream: constrain the output, don't enlarge the model.
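The conversion step can be pictured roughly as follows. This is a simplified sketch, not the contents of `lib/slm_schema.py`: it assumes the upper-case type names used by Vertex's OpenAPI-style schema, drops a few draft-07 keywords that are treated here as unsupported, and ignores `$ref`, `oneOf`, and type unions. The function name and the placeholder enum values are hypothetical.

```python
from typing import Any, Dict

_TYPE_MAP = {
    "string": "STRING", "number": "NUMBER", "integer": "INTEGER",
    "boolean": "BOOLEAN", "array": "ARRAY", "object": "OBJECT",
}
# Keywords dropped in this sketch (assumed not to be accepted by the target).
_DROPPED = {"$schema", "$id", "definitions", "additionalProperties"}


def to_response_schema(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively map a simple draft-07 schema to a responseSchema-style dict."""
    out: Dict[str, Any] = {}
    for key, value in schema.items():
        if key in _DROPPED:
            continue
        if key == "type":
            out["type"] = _TYPE_MAP[value]
        elif key == "properties":
            out["properties"] = {n: to_response_schema(s) for n, s in value.items()}
        elif key == "items":
            out["items"] = to_response_schema(value)
        else:
            out[key] = value  # enum, required, description pass through unchanged
    return out


draft07 = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "additionalProperties": False,
    "required": ["tool"],
    "properties": {
        "tool": {"type": "string", "enum": ["tool_a", "tool_b"]},  # placeholder names
        "args": {"type": "array", "items": {"type": "string"}},
    },
}
converted = to_response_schema(draft07)
print(converted["properties"]["tool"])
```

The enum survives the conversion, which is the point: the decoder can only emit one of the contract's tool names under the `tool` key.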
The constrained teacher is augmentation, not the authoritative target. Each teacher label is validated and repaired toward the oracle contract before inclusion (3 render envelopes were repaired from the oracle on schema failure). The authoritative training target stays the deterministic oracle.
Built by the generic dataset_build playbook from the travel seed corpus,
versioned as v1_constrained:
| | |
|---|---|
| Total turns | 45 (29 train / 16 eval, 30% holdout, split seed `muno-travel-13`) |
| Multitask examples | 58 (one extract + one render per turn) |
| Authoritative labels | deterministic oracle |
| Augmentation | constrained gemini-2.5-flash (40 turns labeled, repaired toward oracle) |
| Schema validity (final) | 100% widget envelopes + extracts + vocab |
| Vocab | 4 tools, 12 render intents, 10 widget types observed |
Per-record fields carry all three label sets side by side: labels
(oracle, authoritative), labels_teacher (raw constrained teacher),
labels_teacher_repaired (teacher render payloads repaired to the oracle
on schema failure).
The dataset also revealed the floor↔ceiling gap that ranks candidates: floor and teacher agree on widget_type 93% of the time but on tool only 36% and render_intent only 29% — i.e. the context-dependent turns ("that flight", "those", multi-turn slot fills) are exactly where a model must earn its keep.
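An agreement rate of this kind can be computed directly from the side-by-side label sets. The sketch below assumes each record holds `labels` and `labels_teacher` as flat dicts keyed by field name; that inner layout and the toy records are assumptions, and only the two top-level field names come from the text.

```python
from typing import Iterable, Mapping


def agreement_rate(records: Iterable[Mapping], field: str) -> float:
    """Fraction of teacher-labeled records where oracle and teacher agree on `field`."""
    compared = agreed = 0
    for record in records:
        teacher = record.get("labels_teacher")
        if not teacher or field not in teacher:
            continue  # skip turns the teacher never labeled
        compared += 1
        agreed += record["labels"].get(field) == teacher[field]
    return agreed / compared if compared else 0.0


# Toy records, not taken from the real dataset.
records = [
    {"labels": {"widget_type": "a", "tool": "x"}, "labels_teacher": {"widget_type": "a", "tool": "y"}},
    {"labels": {"widget_type": "b", "tool": "x"}, "labels_teacher": {"widget_type": "b", "tool": "x"}},
    {"labels": {"widget_type": "c", "tool": "z"}},  # teacher call timed out
]
print(agreement_rate(records, "widget_type"), agreement_rate(records, "tool"))
```

Unlabeled turns are excluded from the denominator, so the rate describes only the turns where both label sources actually made a decision.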
A hard requirement of the platform: every MLOps stage runs as a NoETL playbook, event-sourced and replayable like everything else — no external training scripts. Phase B (noetl/ai-meta#141, ops #219) shipped the three training-side stages:
| Playbook | Role |
|---|---|
| `finetune.yaml` | Train a single multitask LoRA → write the adapter → register a G3 model (lineage → dataset) |
| `eval.yaml` (`candidate=slm`) | Pull the model, score it under schema-constrained decoding vs the floor → register a G3 eval (lineage → model) |
| `package.yaml` | Export the model, write a model card, bundle → register a G3 release (lineage → model + eval) |
These sit on the three platform foundations: G1 (GPU container / k8s-Job dispatch), G2 (long-running async poll/callback), G3 (artifact store + model/dataset/eval/release registry; ops #217).
Three finetune backends share one artifact contract:
- `stub` (CPU, pure stdlib): a nearest-prototype retrieval "model" that proves the whole orchestration end-to-end with no GPU. It holds 100% schema validity and honestly fails the match gate (tool_match≈0.69); the failing gate is the signal a real model must close, not a number faked to 1.0.
- `mlx` (Apple Silicon): the real local LoRA (next section).
- `peft` (CUDA): the real GPU LoRA, gated on the operator provisioning the GPU pool + trainer image + flags.
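For intuition about the `stub` backend, a nearest-prototype retrieval "model" can be as small as the sketch below. The similarity measure (token-set Jaccard) and the record layout are assumptions chosen for illustration; the text does not say how the real stub measures nearness.

```python
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict(utterance: str, prototypes: List[Dict]) -> Dict:
    """Copy the label of the training example whose text overlaps most."""
    query = tokens(utterance)
    best = max(prototypes, key=lambda p: jaccard(query, tokens(p["text"])))
    return best["label"]


# Two toy prototypes; the intents are ones mentioned elsewhere on this page.
prototypes = [
    {"text": "plan a trip to paris", "label": {"render_intent": "collect_missing"}},
    {"text": "what flights are available", "label": {"render_intent": "show_flights"}},
]
prediction = predict("which flights are available tomorrow", prototypes)
print(prediction)
```

Because it copies a stored label verbatim, such a stub is schema-valid whenever its training labels are, but it cannot adapt arguments or slots to a new turn.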
ops #220 added a mode=mlx
backend so a 1–1.5B LoRA trains in unified memory on a Mac — no GPU node
pool, no container — and ran it on the travel dataset. Committed artifacts:
examples/travel-mlx-v1/
(RESULTS.md, MODEL_CARD.md, eval_report.json).
Training

| | |
|---|---|
| Host | Mac Studio (Mac13,1), Apple M1 Max (10 cores 8P/2E), 32 GB unified, macOS 26.3.1 |
| Runtime | `mlx-lm` 0.31.3 in an arm64 Python 3.12 venv |
| Base model | `Qwen/Qwen2.5-1.5B-Instruct` (bf16, ≈3 GB from HF) |
| Recipe | single multitask LoRA (extract + render in one adapter) |
| Trainable params | 5.276 M / 1543.714 M (0.342%) |
| Hyperparams | iters 800, batch 1, top-16 layers, lr 1e-4, max-seq 2048, `--mask-prompt` |
| Peak memory | 9.7 GB |
| Wall-clock | ≈25 min |
| Val loss curve | 0.811 → 0.077 → 0.079 → 0.088 → 0.089 |
It trained on the oracle labels only (49 train / 9 valid examples),
because the eval scores per-field exact match against the oracle labels
— training directly on that target is the cleanest test of "can the SLM
learn the floor's decisions".
Eval (16-example v1_constrained holdout, schema-constrained decoding):
| metric | SLM (mlx) | oracle floor | CPU stub |
|---|---|---|---|
| widget_schema_validity | 1.0000 | 1.0 | 1.0 |
| extract_schema_validity | 1.0000 | 1.0 | 1.0 |
| tool_vocab_validity | 1.0000 | 1.0 | 1.0 |
| render_intent_vocab_validity | 1.0000 | 1.0 | 1.0 |
| tool_match | 0.5625 | 1.0 | 0.6875 |
| render_intent_match | 0.5625 | 1.0 | 0.6250 |
| arg_fidelity | 0.5625 | 1.0 | 0.4375 |
| slot_update_match | 0.6250 | 1.0 | 0.5000 |
| widget_type_match | 0.3750 | 1.0 | 0.5000 |
Candidate latency (unoptimized local serving, 2 generations/turn, greedy): p50 9.2 s / p95 13.1 s. Gate: FAIL (match targets 0.95–0.98 not met).
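The gate decision can be reproduced from the table with a few lines. The v1 scores are copied from the SLM column above, and the thresholds are the per-field gates listed in the later v1/v2/v3 comparison table; the function name and report shape are hypothetical, not the `eval.yaml` implementation.

```python
from typing import Dict

GATES = {
    "tool_match": 0.98, "render_intent_match": 0.98, "widget_type_match": 0.98,
    "arg_fidelity": 0.95, "slot_update_match": 0.95,
}
V1_SLM = {
    "tool_match": 0.5625, "render_intent_match": 0.5625, "widget_type_match": 0.3750,
    "arg_fidelity": 0.5625, "slot_update_match": 0.6250,
}


def shortfalls(metrics: Dict[str, float], gates: Dict[str, float]) -> Dict[str, float]:
    """Return each failing field with its distance below the gate."""
    return {
        name: round(gate - metrics[name], 4)
        for name, gate in gates.items()
        if metrics[name] < gate
    }


report = shortfalls(V1_SLM, GATES)
verdict = "PASS" if not report else "FAIL"
print(verdict, report)
```

If each score is a mean over the 16 holdout examples, one example is worth 0.0625, so 0.5625 corresponds to 9 of 16 exact matches and 0.3750 to 6 of 16.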
The fine-tune holds the floor's 100% schema validity — every emitted
extract and widget envelope is contract-valid by construction — and the
generations are genuinely sensible: exact oracle match on clear turns
("plan a trip" → collect_missing, "confirm and purchase" →
create_order/order_confirmation), correct tool+intent on "what flights
are available".
But on per-field match it does not beat the deterministic floor, and
against the retrieval CPU stub it only trades wins — better on
arg_fidelity and slot_update (it generalizes structure rather than copying
a neighbour), worse on tool/intent/widget (where the stub copies a
memorised label that happens to match on this distribution-overlapping
holdout). The weakest pass is render (widget_type_match 0.375): on turns
with no tool data yet (e.g. show_flights before offers exist) the model
under-produces the widget the oracle emits.
Root cause: 58 examples is tiny. This is a data-volume problem, not a model-architecture problem. The bounded vocab (4 tools, 12 intents) generalizes — but only with coverage of the context-dependent turns the model currently misroutes.
So: pipeline proven, local training proven, model not yet Muno-ready.
The run wrote a full dataset → model → eval → release lineage into the G3 registry (local file-backed backend for this offline run):
dataset registry://muno/travel/dataset/travel_v1_constrained/1
model registry://muno/travel/model/travel_slm_multitask/1 (mlx LoRA adapter, ≈39 MB tar)
eval registry://muno/travel/eval/travel_slm_multitask/1 (lineage → model)
release registry://muno/travel/release/travel_slm_multitask/1 (lineage → [model, eval])
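The four references share one shape, which the sketch below parses into its parts. The names given to the first two path segments (`tenant`, `domain`) are guesses made for illustration; the registry's own terminology is not stated here.

```python
from typing import NamedTuple


class RegistryRef(NamedTuple):
    tenant: str   # hypothetical name for the first segment ("muno")
    domain: str   # hypothetical name for the second segment ("travel")
    kind: str     # dataset | model | eval | release
    name: str
    version: int


def parse_ref(uri: str) -> RegistryRef:
    """Split registry://<tenant>/<domain>/<kind>/<name>/<version> into fields."""
    prefix = "registry://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a registry URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 5:
        raise ValueError(f"expected 5 path segments, got {len(parts)}")
    tenant, domain, kind, name, version = parts
    return RegistryRef(tenant, domain, kind, name, int(version))


model = parse_ref("registry://muno/travel/model/travel_slm_multitask/1")
release = parse_ref("registry://muno/travel/release/travel_slm_multitask/1")
print(model.kind, model.name, model.version)
```

Model, eval, and release share the same name and version in this run, so the `kind` segment is what distinguishes them when walking the lineage.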
In priority order (from RESULTS.md §"What a next iteration needs"):
1. More data. Wire the config's `event_log_replay` (cap 1000 turns of real Muno planner turns, already `status: enabled` in `slm.config.yaml`) into the train set. This is the single biggest lever.
2. Richer render conditioning. The render pass needs the `tool_summary` / slot context the oracle's `render` fn sees, not just `turn` + `extraction`, so it can emit the right widget on data-bearing turns.
3. True grammar-constrained generation at decode time (logit masking / outlines over the contract enums) rather than only post-hoc repair; this lifts the in-vocab-but-wrong tool/intent cases the repair lever can't fix.
4. Teacher-augmented + curriculum training once (1)–(3) land, and a larger candidate (e.g. qwen2.5-3B) if 1.5B plateaus below the gate.
The next-iteration list above was executed on the local Apple-Silicon
mode=mlx path. Every match field climbed while schema validity stayed
pinned at 100%:
| field | v1 (45 turns) | v2 (701 turns) | v3 (950 turns + constrained decode) | gate |
|---|---|---|---|---|
| tool_match | 0.5625 | 0.8017 | 0.9444 | 0.98 |
| render_intent_match | 0.5625 | 0.8595 | 0.9236 | 0.98 |
| widget_type_match | 0.3750 | 0.6529 | 0.7917 | 0.98 |
| arg_fidelity | 0.5625 | 0.7934 | 0.9444 | 0.95 |
| slot_update_match | 0.6250 | 0.9091 | 0.9375 | 0.95 |
| schema validity (widget/extract) | 1.0 | 1.0 | 1.0 | 1.0 ✅ |
- v2: data scaling. A 701-turn oracle-routed synthetic corpus (leak-free disjoint train/eval) + render conditioning on the `tool_summary` the oracle sees. Lifted every field +0.23–0.30; widget_type nearly doubled.
- v3: targeted data + grammar-constrained decode. Over-sampled the v2-weak `show_places` slice + contrast neighbours + the multi-widget `summary` sequence, plus true logit-level constrained decoding on the extract pass (`lm-format-enforcer`, `lib/slm_constrain.py`). An A/B isolates the levers: the data lever is the big mover (`show_places` 3/14 → 27/27 correct tool), the decode lever guarantees 100% extract validity + a small tool_match bump but adds no semantic gain.
- 3B probe aborted: capacity was not the bottleneck (the v3 1.5B result confirmed data/decode were).
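The difference between logit-level constraint and post-hoc repair can be shown with a toy single-step decoder. This is a pure-Python illustration of the masking idea only; it does not use or imitate the `lm-format-enforcer` API, and the token strings and scores are invented.

```python
import math
from typing import Dict, Iterable


def constrained_argmax(logits: Dict[str, float], allowed: Iterable[str]) -> str:
    """Pick the highest-scoring token after masking everything the grammar forbids."""
    allowed_set = set(allowed)
    masked = {
        token: (score if token in allowed_set else -math.inf)
        for token, score in logits.items()
    }
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("grammar allows no token present in the vocabulary")
    return best


# Invented scores: the unconstrained model prefers the wrong key name.
step_logits = {"tool_id": 2.1, "tool": 1.7, "intent": 0.3}

unconstrained = max(step_logits, key=step_logits.get)
constrained = constrained_argmax(step_logits, allowed={"tool"})
print(unconstrained, constrained)
```

Masking guarantees a valid key at every step, but among the allowed tokens the model's own ranking still decides, which fits the A/B finding that the decode lever secures validity while the data lever moves the semantic match.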
Verdict: not yet Muno-ready by the strict gate. widget_type_match
0.79 (render generation for data-bearing widgets) is the ~0.19-short blocker;
the other four match fields are within ~0.006–0.057 of target. Teacher spend
stayed $0 (the constrained Gemini teacher only runs on the in-cluster
worker WI, not locally). Full write-ups: v2 · v3.
| When | What | Where |
|---|---|---|
| Phase 0 | RFC + contract + I/O spec + deterministic-oracle design | travel#69, RFC `docs/rfc/travel-slm.md` |
| Phase A | Travel instance: oracle + seed corpus + `slm.config.yaml` | travel#73 (merged) |
| Phase A | Teacher ceiling + pluggable Vertex Gemini provider | ops#216 (merged) |
| Phase A | Point teacher at `vertex_gemini` (WI token, OpenAI dropped) | travel#75 (merged) |
| Foundation | G3 registry surface for the SLM template pack | ops#217 (merged) |
| Phase B | finetune + eval(SLM) + package playbooks + libs | ops#219 (merged) |
| Phase B | `mode=mlx` real LoRA on Apple Silicon + first (v1) results | ops#220 (open) |
| Phase B | v2 data scaling (701 turns) + render `tool_summary` conditioning | ops#220 (open) |
| Phase B | v3 targeted data (950 turns) + logit-level constrained decode | ops#220 (open) |
- Training the Travel SLM — the reproducible how-to.
- Travel-domain SLM — the design-stage RFC summary.
- Playbook: itinerary-planner — the consuming playbook.
- Widget contract — the render-output schema the SLM must satisfy.
- Python → Rust migration — the runtime the SLM serves into.
- Model cards: v1 · v3
- Results: v2 · v3
- Template pack README: `automation/mlops/slm/README.md`