Travel SLM Journey

Travel SLM — track record (what we built and what we learned)

This page is the chronological record of the travel-domain Small Language Model (SLM) effort: the goal, the decisions, the dead ends, the numbers, and the honest current verdict. It is the companion to the step-by-step Training the Travel SLM guide — read this page for why and what happened, that page for how to do it.

For the original design, see the RFC docs/rfc/travel-slm.md and the design-stage Travel-domain SLM page.

Umbrella issue: noetl/travel#63 (epic/slm/ml)
Platform umbrella (this is the reference impl): noetl/ai-meta#139
Phase tracking: dataset+baseline #64 · train/fine-tune #65 · serve #66 · shadow #67 · cutover #68

Status at a glance


Pipeline (dataset → finetune → eval → package, with G3 registry lineage)	Validated end-to-end — offline CPU stub + real local LoRA
First real fine-tune (Apple-Silicon MLX, Qwen2.5-1.5B LoRA)	Done — holds 100% schema validity
Per-field accuracy vs the deterministic floor	Not beaten yet (tool 0.56 / intent 0.56 / widget_type 0.38)
Muno-ready (can replace the OpenAI/Gemini path)	No — blocked on training data volume (58 examples)
Next lever	event-log replay to ~1000 turns, richer render conditioning, true grammar-constrained decode

Nothing here touched prod. The planner still runs the OpenAI/deterministic path; no GKE, IAM, secret, or flag change shipped.

Why a domain SLM

The Muno itinerary-planner is built around two LLM calls:

Pass	Model today	In → Out
`extract_turn`	`gpt-4o`, JSON mode, temp 0	(user event, slot_state, thread) → `{slot_updates, tool_requests, render_intent}`
`render_widget_chat`	`gpt-4o-mini`	(slot_state, extraction, tool_summary, render_intent) → `{bot_message, widgets[]}`

A fine-tuned, in-cluster SLM replaces those two passes to cut cost, remove the external dependency and its secret from the critical path, own the latency budget in-cluster, and specialize on the narrow travel intent → tool → widget task. It is a swap, not a rewrite: the SLM emits byte-compatible JSON for the same two contracts, behind engine flags (extraction_engine / render_engine ∈ openai|slm|deterministic) so we can A/B and shadow-eval before any cutover.
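The flag-gated swap can be pictured with a minimal sketch. Only the flag values (openai, slm, deterministic), the three extraction keys, and the llm_contract.fallback_used marker come from this page; the function names, the backend callables, and the fallback-on-error behaviour are assumptions made for illustration.

```python
from typing import Callable, Dict

Backend = Callable[[dict], dict]

# Flag values named in the text; everything else in this sketch is hypothetical.
ENGINES = ("openai", "slm", "deterministic")


def extract_with_flag(turn: dict, engine: str, backends: Dict[str, Backend]) -> dict:
    """Run the engine chosen by the flag, falling back to the deterministic backend."""
    if engine not in ENGINES:
        raise ValueError(f"unknown engine {engine!r}; expected one of {ENGINES}")
    try:
        return backends[engine](turn)
    except Exception:
        if engine == "deterministic":
            raise  # nothing left to fall back to
        result = dict(backends["deterministic"](turn))
        # Assumption: the real contract records the fallback under this key.
        result["llm_contract"] = {"fallback_used": True}
        return result


def deterministic_stub(turn: dict) -> dict:
    """Placeholder for the oracle; returns the three keys of the extraction contract."""
    return {"slot_updates": {}, "tool_requests": [], "render_intent": "collect_missing"}


def unavailable_slm(turn: dict) -> dict:
    """Simulates an SLM backend that is down."""
    raise RuntimeError("slm backend unavailable")


backends = {"openai": unavailable_slm, "slm": unavailable_slm, "deterministic": deterministic_stub}
result = extract_with_flag({"text": "plan a trip"}, "slm", backends)
print(result["render_intent"], result["llm_contract"])
```

Because every backend returns the same contract shape, the caller does not need to know which engine answered, which is what makes shadow evaluation and A/B runs possible before a cutover.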

The exact I/O contract (4 tools, 12 render intents, 24 widget types, the slot-state schema) is pinned in the RFC §2 and reproduced as a deterministic Python module — which becomes the floor (next section).

The deterministic oracle floor

Before any model, we built a deterministic oracle: automation/mlops/slm/travel/oracle.py. It is a pure-stdlib, rule-based reimplementation of both passes, derived from the planner's own deterministic fallback logic (the steps that emit llm_contract.fallback_used: true) and the two system prompts (playbooks/agent/system_prompt_extraction.md, system_prompt_chat.md).

The oracle plays two roles:

Label source — it produces training labels at zero token cost.
Quality floor — every model candidate is scored against the oracle's decisions, and the oracle is the safety fallback a model must not regress below.

The floor is 100% schema-valid by construction and effectively instant (deterministic Python, sub-millisecond per turn). That sets the bar precisely. A model cannot win on latency or validity; it wins only by matching more of the oracle's field-level decisions than a trivial baseline does while holding 100% validity, which means winning on accuracy on the harder, context-dependent turns.

The teacher ceiling — and the finding that reframed the project

The plan was to bootstrap a richer dataset with a strong teacher model. We chose Vertex Gemini, authenticated with a token minted from the worker pod's Workload Identity via the GKE metadata server, so the teacher needs no API key and no Secret Manager entry. OpenAI was dropped as the teacher (RFC decision #6).

The first ceiling run was the turning point. Raw gemini-2.5-pro, with no output-schema enforcement (exec 328967313246134272), scored:

0% valid widget envelopes
49% valid extractions

Both figures are below the deterministic floor. The model was not weak: it picked sensible widget types, but it emitted the wrong tool-request keys (tool_id instead of tool) and empty payloads, so the envelopes failed the contract schema.
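The two failure modes named above (a tool_id key where the contract expects tool, and an empty payload) are the kind of thing a small structural check catches. The sketch below is illustrative only: the payload key name and the function are hypothetical, and the real contract is the JSON schema pinned in the RFC.

```python
def tool_request_errors(request: dict, tool_vocab: set, payload_key: str = "payload") -> list:
    """Return human-readable contract violations for one tool request."""
    errors = []
    if "tool" not in request:
        hint = " (found 'tool_id')" if "tool_id" in request else ""
        errors.append("missing required key 'tool'" + hint)
    elif request["tool"] not in tool_vocab:
        errors.append(f"tool {request['tool']!r} not in vocabulary")
    # Assumption: the payload lives under `payload_key` and must be non-empty.
    if not request.get(payload_key):
        errors.append(f"empty or missing {payload_key!r}")
    return errors


vocab = {"create_order"}  # one of the four tools; the others are not listed on this page
bad = {"tool_id": "create_order", "payload": {}}
good = {"tool": "create_order", "payload": {"offer": "abc"}}
print(tool_request_errors(bad, vocab))
print(tool_request_errors(good, vocab))
```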

The conclusion that reframed the whole effort: the lever is not a bigger model. It is schema/grammar-constrained decoding. When we handed the teacher a Vertex generationConfig.responseSchema derived from the contract schemas (lib/slm_schema.py converts draft-07 → Vertex responseSchema), the cheaper gemini-2.5-flash became schema-valid by construction — 100% valid widget envelopes — at a fraction of the cost. The Phase-1 dataset build spent $0.068 total (85 flash calls, 5 transport timeouts) to label 40 of 45 turns.
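As a rough illustration of the draft-07 to responseSchema step, the sketch below keeps a small subset of keywords, upper-cases type names in the style Vertex uses (STRING, OBJECT, ARRAY), and turns a nullable type list into a nullable flag. It is not the contents of lib/slm_schema.py; the function name and the choice of which keywords to keep or drop are assumptions.

```python
# Keywords copied through unchanged; everything not handled below is dropped.
_PASS_THROUGH = ("description", "enum", "required", "format")


def to_response_schema(schema: dict) -> dict:
    """Convert a (simple) JSON Schema draft-07 node into a responseSchema-style node."""
    out = {}
    node_type = schema.get("type")
    if isinstance(node_type, list):  # e.g. ["string", "null"]
        if "null" in node_type:
            out["nullable"] = True
        non_null = [t for t in node_type if t != "null"]
        node_type = non_null[0] if non_null else None
    if node_type:
        out["type"] = node_type.upper()
    if "const" in schema:  # a constant becomes a single-value enum
        out["enum"] = [schema["const"]]
    for key in _PASS_THROUGH:
        if key in schema:
            out[key] = schema[key]
    if "properties" in schema:
        out["properties"] = {k: to_response_schema(v) for k, v in schema["properties"].items()}
    if "items" in schema:
        out["items"] = to_response_schema(schema["items"])
    return out


draft07 = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "additionalProperties": False,
    "required": ["tool"],
    "properties": {
        "tool": {"type": "string", "enum": ["create_order"]},
        "note": {"type": ["string", "null"]},
    },
}
converted = to_response_schema(draft07)
print(converted)
```

The point of the conversion is that the enum lists travel with the schema, so the decoder can only emit a tool or intent that exists in the contract.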

This is the design center of everything downstream: constrain the output, don't enlarge the model.

The constrained teacher is augmentation, not the authoritative target. Each teacher label is validated and repaired toward the oracle contract before inclusion (3 render envelopes were repaired from the oracle on schema failure). The authoritative training target stays the deterministic oracle.
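A minimal sketch of the validate-then-repair step, assuming each record carries a render envelope under the labels (oracle) and labels_teacher fields used by the dataset records on this page. The validity check here is a stand-in that only looks for the two top-level keys of the render contract (bot_message, widgets); the real check is the contract schema.

```python
def is_valid_envelope(envelope) -> bool:
    """Stand-in validity check: a dict with a string bot_message and a widgets list."""
    return (
        isinstance(envelope, dict)
        and isinstance(envelope.get("bot_message"), str)
        and isinstance(envelope.get("widgets"), list)
    )


def repair_toward_oracle(records: list) -> tuple:
    """Add labels_teacher_repaired to each record; count how many fell back to the oracle."""
    repaired_count = 0
    out = []
    for record in records:
        teacher = record["labels_teacher"]
        if is_valid_envelope(teacher):
            chosen = teacher
        else:
            chosen = record["labels"]  # the oracle label stays authoritative
            repaired_count += 1
        out.append({**record, "labels_teacher_repaired": chosen})
    return out, repaired_count


records = [
    {"labels": {"bot_message": "ok", "widgets": []},
     "labels_teacher": {"bot_message": "Sure!", "widgets": [{"type": "text"}]}},
    {"labels": {"bot_message": "ok", "widgets": []},
     "labels_teacher": {"bot_message": "Broken"}},  # no widgets list, so invalid
]
fixed, n_repaired = repair_toward_oracle(records)
print(n_repaired)
```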

Phase 1 — the dataset

Built by the generic dataset_build playbook from the travel seed corpus, versioned as v1_constrained:


Total turns	45 (29 train / 16 eval, 30% holdout, split seed `muno-travel-13`)
Multitask examples	58 (one extract + one render per turn)
Authoritative labels	deterministic oracle
Augmentation	constrained `gemini-2.5-flash` (40 turns labeled, repaired toward oracle)
Schema validity (final)	100% widget envelopes + extracts + vocab
Vocab	4 tools, 12 render intents, 10 widget types observed

Per-record fields carry all three label sets side by side: labels (oracle, authoritative), labels_teacher (raw constrained teacher), labels_teacher_repaired (teacher render payloads repaired to the oracle on schema failure).

The dataset also revealed the floor↔ceiling gap that ranks candidates: floor and teacher agree on widget_type 93% of the time but on tool only 36% and render_intent only 29% — i.e. the context-dependent turns ("that flight", "those", multi-turn slot fills) are exactly where a model must earn its keep.
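The floor-to-teacher agreement figures are simple per-field match rates over the turns that have a teacher label. A sketch, assuming flat label dicts under the labels and labels_teacher fields (the real records may nest these fields differently):

```python
def agreement(records: list, field: str) -> float:
    """Share of teacher-labelled records where oracle and teacher agree on `field`."""
    pairs = [
        (r["labels"].get(field), r["labels_teacher"].get(field))
        for r in records
        if "labels_teacher" in r
    ]
    if not pairs:
        return 0.0
    return sum(1 for oracle, teacher in pairs if oracle == teacher) / len(pairs)


# Toy records, not taken from the dataset.
toy = [
    {"labels": {"tool": "a", "widget_type": "w1"}, "labels_teacher": {"tool": "a", "widget_type": "w1"}},
    {"labels": {"tool": "a", "widget_type": "w2"}, "labels_teacher": {"tool": "b", "widget_type": "w2"}},
    {"labels": {"tool": "c", "widget_type": "w3"}},  # no teacher label, so skipped
]
print(agreement(toy, "tool"), agreement(toy, "widget_type"))
```

A low agreement rate on a field does not say which side is right; it marks where the two label sources diverge and therefore where a candidate's choices are most informative.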

Phase B — the MLOps pipeline (MLOps-as-playbooks)

A hard requirement of the platform: every MLOps stage runs as a NoETL playbook, event-sourced and replayable like everything else — no external training scripts. Phase B (noetl/ai-meta#141, ops #219) shipped the three training-side stages:

Playbook	Role
`finetune.yaml`	Train a single multitask LoRA → write the adapter → register a G3 model (lineage → dataset)
`eval.yaml` (`candidate=slm`)	Pull the model, score it under schema-constrained decoding vs the floor → register a G3 eval (lineage → model)
`package.yaml`	Export the model, write a model card, bundle → register a G3 release (lineage → model + eval)

These sit on the three platform foundations: G1 (GPU container / k8s-Job dispatch), G2 (long-running async poll/callback), G3 (artifact store + model/dataset/eval/release registry; ops #217).

Three finetune backends share one artifact contract:

stub (CPU, pure stdlib) — a nearest-prototype retrieval "model" that proves the whole orchestration end-to-end with no GPU. It holds 100% schema validity and honestly fails the match gate (tool_match≈0.69). The failure is reported as-is rather than faked to 1.0, because it marks the gap a real model must close.
mlx (Apple Silicon) — the real local LoRA (next section).
peft (CUDA) — the real GPU LoRA, gated on the operator provisioning the GPU pool + trainer image + flags.
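To make the stub backend concrete, here is one way a nearest-prototype retrieval "model" could look in pure stdlib. The similarity measure (Jaccard overlap of lower-cased tokens) and the class name are assumptions; the page only states that the stub retrieves the nearest prototype.

```python
def tokens(text: str) -> set:
    """Lower-cased whitespace tokens."""
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    """Token-set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class PrototypeStub:
    """Memorises (text, label) pairs and returns the label of the most similar text."""

    def __init__(self, examples: list):
        self.prototypes = [(tokens(text), label) for text, label in examples]

    def predict(self, text: str) -> dict:
        query = tokens(text)
        _, label = max(self.prototypes, key=lambda proto: jaccard(query, proto[0]))
        return label


stub = PrototypeStub([
    ("plan a trip to rome", {"render_intent": "collect_missing"}),
    ("confirm and purchase", {"tool": "create_order"}),
])
print(stub.predict("plan a trip"))
print(stub.predict("please confirm purchase"))
```

Since the stub only ever returns a label it has already seen, its output is schema-valid whenever the training labels are, which is consistent with it holding validity while missing on match.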

The first real fine-tune — local Apple-Silicon (MLX)

ops #220 added a mode=mlx backend so a 1–1.5B LoRA trains in unified memory on a Mac — no GPU node pool, no container — and ran it on the travel dataset. Committed artifacts: examples/travel-mlx-v1/ (RESULTS.md, MODEL_CARD.md, eval_report.json).

Training


Host	Mac Studio (Mac13,1), Apple M1 Max (10 cores 8P/2E), 32 GB unified, macOS 26.3.1
Runtime	`mlx-lm` 0.31.3 in an arm64 Python 3.12 venv
Base model	`Qwen/Qwen2.5-1.5B-Instruct` (bf16, ≈3 GB from HF)
Recipe	single multitask LoRA (extract + render in one adapter)
Trainable params	5.276 M / 1543.714 M (0.342%)
Hyperparams	iters 800, batch 1, top-16 layers, lr 1e-4, max-seq 2048, `--mask-prompt`
Peak memory	9.7 GB
Wall-clock	≈25 min
Val loss curve	0.811 → 0.077 → 0.079 → 0.088 → 0.089

It trained on the oracle labels only (49 train / 9 validation examples), because the eval scores per-field exact match against the oracle labels, so training directly on that target is the cleanest test of whether the SLM can learn the floor's decisions.

Eval (16-turn v1_constrained holdout, schema-constrained decoding):

metric	SLM (mlx)	oracle floor	CPU stub
widget_schema_validity	1.0000	1.0	1.0
extract_schema_validity	1.0000	1.0	1.0
tool_vocab_validity	1.0000	1.0	1.0
render_intent_vocab_validity	1.0000	1.0	1.0
tool_match	0.5625	1.0	0.6875
render_intent_match	0.5625	1.0	0.6250
arg_fidelity	0.5625	1.0	0.4375
slot_update_match	0.6250	1.0	0.5000
widget_type_match	0.3750	1.0	0.5000

Candidate latency (unoptimized local serving, 2 generations/turn, greedy): p50 9.2 s / p95 13.1 s. Gate: FAIL (match targets 0.95–0.98 not met).
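The match metrics in the table are per-field exact-match rates against the oracle. The sketch below shows the computation under the assumption that predictions and oracle labels are flat dicts keyed by field name (the real eval may compare nested structures). It also checks that the reported values are multiples of 1/16, as expected for a 16-turn holdout.

```python
from fractions import Fraction


def match_rate(predictions: list, oracle_labels: list, field: str) -> float:
    """Fraction of turns where the candidate's value for `field` equals the oracle's."""
    if len(predictions) != len(oracle_labels):
        raise ValueError("predictions and oracle labels must align turn by turn")
    hits = sum(1 for p, o in zip(predictions, oracle_labels) if p.get(field) == o.get(field))
    return hits / len(oracle_labels)


# Toy data: 16 turns, the candidate agrees with the oracle on the first 9.
oracle = [{"tool": "t"} for _ in range(16)]
candidate = [{"tool": "t"} if i < 9 else {"tool": "other"} for i in range(16)]
print(match_rate(candidate, oracle, "tool"))  # 9 of 16 turns

# The reported v1 scores as exact fractions of 16 turns.
print(Fraction(0.5625), Fraction(0.375), Fraction(0.6875))
```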

The honest verdict

The fine-tune holds the floor's 100% schema validity — every emitted extract and widget envelope is contract-valid by construction — and the generations are genuinely sensible: exact oracle match on clear turns ("plan a trip" → collect_missing, "confirm and purchase" → create_order/order_confirmation), correct tool+intent on "what flights are available".

But on per-field match it does not beat the deterministic floor, and against the retrieval CPU stub it only trades wins — better on arg_fidelity and slot_update (it generalizes structure rather than copying a neighbour), worse on tool/intent/widget (where the stub copies a memorised label that happens to match on this distribution-overlapping holdout). The weakest pass is render (widget_type_match 0.375): on turns with no tool data yet (e.g. show_flights before offers exist) the model under-produces the widget the oracle emits.

Root cause: 58 examples is tiny. This is a data-volume problem, not a model-architecture problem. The bounded vocab (4 tools, 12 intents) generalizes — but only with coverage of the context-dependent turns the model currently misroutes.

So: pipeline proven, local training proven, model not yet Muno-ready.

Registry lineage

The run wrote a full dataset → model → eval → release lineage into the G3 registry (local file-backed backend for this offline run):

dataset  registry://muno/travel/dataset/travel_v1_constrained/1
model    registry://muno/travel/model/travel_slm_multitask/1     (mlx LoRA adapter, ≈39 MB tar)
eval     registry://muno/travel/eval/travel_slm_multitask/1      (lineage → model)
release  registry://muno/travel/release/travel_slm_multitask/1   (lineage → [model, eval])
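The four references above form a small lineage graph. A sketch of parsing the URIs and walking a release back to its dataset follows; the names given to the first two path segments (namespace, domain) and the helper functions are assumptions, while the lineage edges are the ones stated on this page.

```python
from typing import Dict, List, NamedTuple


class RegistryRef(NamedTuple):
    namespace: str  # assumption: what the first path segment means
    domain: str     # assumption: what the second path segment means
    kind: str       # dataset | model | eval | release
    name: str
    version: int


def parse_ref(uri: str) -> RegistryRef:
    """Split registry://<namespace>/<domain>/<kind>/<name>/<version> into fields."""
    prefix = "registry://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a registry URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 5:
        raise ValueError(f"expected 5 path segments, got {len(parts)}")
    namespace, domain, kind, name, version = parts
    return RegistryRef(namespace, domain, kind, name, int(version))


DATASET = "registry://muno/travel/dataset/travel_v1_constrained/1"
MODEL = "registry://muno/travel/model/travel_slm_multitask/1"
EVAL = "registry://muno/travel/eval/travel_slm_multitask/1"
RELEASE = "registry://muno/travel/release/travel_slm_multitask/1"

# Edges as described: model -> dataset, eval -> model, release -> [model, eval].
LINEAGE: Dict[str, List[str]] = {MODEL: [DATASET], EVAL: [MODEL], RELEASE: [MODEL, EVAL]}


def ancestors(uri: str, lineage: Dict[str, List[str]]) -> List[str]:
    """All artifacts reachable by following lineage edges from `uri`."""
    seen: List[str] = []
    stack = list(lineage.get(uri, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(lineage.get(parent, []))
    return seen


print(sorted(parse_ref(u).kind for u in ancestors(RELEASE, LINEAGE)))
```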

What the next iteration needs

In priority order (from RESULTS.md §"What a next iteration needs"):

1. More data. Wire the config's event_log_replay (capped at 1000 real Muno planner turns, already status: enabled in slm.config.yaml) into the train set. This is the single biggest lever.
2. Richer render conditioning. The render pass needs the tool_summary / slot context the oracle's render fn sees, not just turn + extraction, so it can emit the right widget on data-bearing turns.
3. True grammar-constrained generation at decode time (logit masking / outlines over the contract enums) rather than only post-hoc repair. This lifts the in-vocab-but-wrong tool/intent cases the repair lever can't fix.
4. Teacher-augmented + curriculum training once (1)–(3) land, and a larger candidate (e.g. qwen2.5-3B) if 1.5B plateaus below the gate.
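The decode-time constraint in the third item can be illustrated with a toy, character-level version of logit masking over an enum: at each step only symbols that keep the output a prefix of some allowed value are eligible, which is equivalent to setting every other logit to minus infinity. Real decoders work on tokenizer tokens and full JSON grammars; the scoring function and the second tool name below are hypothetical.

```python
from typing import Callable, Sequence


def allowed_next(prefix: str, choices: Sequence[str]) -> list:
    """Symbols that keep `prefix` a prefix of some choice; "" means "stop here"."""
    options = set()
    for choice in choices:
        if choice.startswith(prefix):
            options.add(choice[len(prefix)] if len(choice) > len(prefix) else "")
    return sorted(options)  # sorted so ties break deterministically


def constrained_greedy(score: Callable[[str, str], float], choices: Sequence[str]) -> str:
    """Greedy decode restricted to `choices`; score(prefix, symbol) plays the model's logit."""
    prefix = ""
    while True:
        options = allowed_next(prefix, choices)
        if not options:
            raise ValueError("no choice matches the current prefix")
        # Taking the argmax over allowed symbols only is the masking step.
        best = max(options, key=lambda symbol: score(prefix, symbol))
        if best == "":
            return prefix
        prefix += best


TOOLS = ["create_order", "search_flights"]  # "search_flights" is a made-up example name


def prefers_s(prefix: str, symbol: str) -> float:
    """Toy model that likes the letter 's' and is indifferent to everything else."""
    return 1.0 if symbol == "s" else 0.0


print(constrained_greedy(prefers_s, TOOLS))
```

The constraint guarantees an in-vocabulary value whatever the scores are; which in-vocabulary value is chosen still depends entirely on the model.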

v2 + v3 results — the data + decode levers landed

The next-iteration list above was executed on the local Apple-Silicon mode=mlx path. Every match field climbed while schema validity stayed pinned at 100%:

field	v1 (45 turns)	v2 (701 turns)	v3 (950 turns + constrained decode)	gate
tool_match	0.5625	0.8017	0.9444	0.98
render_intent_match	0.5625	0.8595	0.9236	0.98
widget_type_match	0.3750	0.6529	0.7917	0.98
arg_fidelity	0.5625	0.7934	0.9444	0.95
slot_update_match	0.6250	0.9091	0.9375	0.95
schema validity (widget/extract)	1.0	1.0	1.0	1.0 ✅

v2 — data scaling. A 701-turn oracle-routed synthetic corpus (leak-free disjoint train/eval) + render conditioning on the tool_summary the oracle sees. Lifted every field +0.23–0.30; widget_type nearly doubled.
v3 — targeted data + grammar-constrained decode. Over-sampled the v2-weak show_places slice + contrast neighbours + the multi-widget summary sequence, plus true logit-level constrained decoding on the extract pass (lm-format-enforcer, lib/slm_constrain.py). An A/B isolates the levers: the data lever is the big mover (show_places 3/14 → 27/27 correct tool), the decode lever guarantees 100% extract validity + a small tool_match bump but adds no semantic gain.
3B probe aborted — capacity was not the bottleneck (the v3 1.5B result confirmed data/decode were).

Verdict: not yet Muno-ready by the strict gate. widget_type_match at 0.79 (render generation for data-bearing widgets) is the blocker, about 0.19 short of its 0.98 target; the other four match fields are within about 0.006 to 0.057 of theirs. Teacher spend stayed at $0 (the constrained Gemini teacher runs only on the in-cluster worker WI, not locally). Full write-ups: v2 · v3.
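The distances to the gate quoted in the verdict follow directly from the v3 column and the gate column of the results table. A short check, with hypothetical helper names:

```python
# v3 scores and gate targets copied from the results table.
V3 = {
    "tool_match": 0.9444,
    "render_intent_match": 0.9236,
    "widget_type_match": 0.7917,
    "arg_fidelity": 0.9444,
    "slot_update_match": 0.9375,
}
GATE = {
    "tool_match": 0.98,
    "render_intent_match": 0.98,
    "widget_type_match": 0.98,
    "arg_fidelity": 0.95,
    "slot_update_match": 0.95,
}


def gate_gaps(scores: dict, gate: dict) -> dict:
    """Shortfall per field (target minus score), rounded to 4 decimals."""
    return {field: round(gate[field] - scores[field], 4) for field in gate}


def passes_gate(scores: dict, gate: dict) -> bool:
    """True only if every field meets or exceeds its target."""
    return all(scores[field] >= gate[field] for field in gate)


gaps = gate_gaps(V3, GATE)
for field, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{field:22s} short by {gap:.4f}")
print("gate passed:", passes_gate(V3, GATE))
```

Sorting by shortfall puts arg_fidelity closest to its target and widget_type_match furthest, matching the ranking in the verdict.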

Timeline of decisions and PRs

When	What	Where
Phase 0	RFC + contract + I/O spec + deterministic-oracle design	travel#69, RFC `docs/rfc/travel-slm.md`
Phase A	Travel instance: oracle + seed corpus + `slm.config.yaml`	travel#73 (merged)
Phase A	Teacher ceiling + pluggable Vertex Gemini provider	ops#216 (merged)
Phase A	Point teacher at `vertex_gemini` (WI token, OpenAI dropped)	travel#75 (merged)
Foundation	G3 registry surface for the SLM template pack	ops#217 (merged)
Phase B	finetune + eval(SLM) + package playbooks + libs	ops#219 (merged)
Phase B	`mode=mlx` real LoRA on Apple Silicon + first (v1) results	ops#220 (open)
Phase B	v2 data scaling (701 turns) + render `tool_summary` conditioning	ops#220 (open)
Phase B	v3 targeted data (950 turns) + logit-level constrained decode	ops#220 (open)

Related

Training the Travel SLM — the reproducible how-to.
Travel-domain SLM — the design-stage RFC summary.
Playbook: itinerary-planner — the consuming playbook.
Widget contract — the render-output schema the SLM must satisfy.
Python → Rust migration — the runtime the SLM serves into.
Model cards: v1 · v3
Results: v2 · v3
Template pack README: automation/mlops/slm/README.md
