
Travel SLM Journey

Kadyapam edited this page Jun 27, 2026 · 2 revisions

Travel SLM — track record (what we built and what we learned)

This page is the chronological record of the travel-domain Small Language Model (SLM) effort: the goal, the decisions, the dead ends, the numbers, and the honest current verdict. It is the companion to the step-by-step Training the Travel SLM guide — read this page for why and what happened, that page for how to do it.

For the original design, see the RFC docs/rfc/travel-slm.md and the design-stage Travel-domain SLM page.

Status at a glance

| Area | Status |
|---|---|
| Pipeline (dataset → finetune → eval → package, with G3 registry lineage) | Validated end-to-end: offline CPU stub + real local LoRA |
| First real fine-tune (Apple-Silicon MLX, Qwen2.5-1.5B LoRA) | Done; holds 100% schema validity |
| Per-field accuracy vs the deterministic floor | Not beaten yet (tool 0.56 / intent 0.56 / widget_type 0.38) |
| Muno-ready (can replace the OpenAI/Gemini path) | No; blocked on training data volume (58 examples) |
| Next lever | event-log replay to ~1000 turns, richer render conditioning, true grammar-constrained decode |

Nothing here touched prod. The planner still runs the OpenAI/deterministic path; no GKE, IAM, secret, or flag change shipped.

Why a domain SLM

The Muno itinerary-planner is built around two LLM calls:

| Pass | Model today | In → Out |
|---|---|---|
| extract_turn | gpt-4o, JSON mode, temp 0 | (user event, slot_state, thread) → {slot_updates, tool_requests, render_intent} |
| render_widget_chat | gpt-4o-mini | (slot_state, extraction, tool_summary, render_intent) → {bot_message, widgets[]} |

A fine-tuned, in-cluster SLM would replace those two passes to cut cost, remove the external dependency and its secret from the critical path, own the latency budget in-cluster, and specialize on the narrow travel intent → tool → widget task. It is a swap, not a rewrite: the SLM emits byte-compatible JSON for the same two contracts, behind engine flags (extraction_engine / render_engine, each set to openai|slm|deterministic) so we can A/B and shadow-eval before any cutover.
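A minimal sketch of how the two engine flags could gate the swap. The flag names and the three values come from the text above; the `resolve_engine` helper and its fallback to the deterministic path on an unknown value are assumptions made for illustration, not the planner's actual code.

```python
from typing import Mapping

# The three engine values named in the text.
ENGINES = ("openai", "slm", "deterministic")


def resolve_engine(flags: Mapping[str, str], flag_name: str) -> str:
    """Return the engine selected by `flag_name`.

    Hypothetical helper: a missing or unknown value falls back to the
    deterministic path (an assumption of this sketch).
    """
    value = flags.get(flag_name, "deterministic")
    return value if value in ENGINES else "deterministic"


# Example: shadow the SLM on extraction while rendering stays on OpenAI.
flags = {"extraction_engine": "slm", "render_engine": "openai"}
extract_with = resolve_engine(flags, "extraction_engine")
render_with = resolve_engine(flags, "render_engine")
```

Because each pass has its own flag, the two passes can be cut over (or rolled back) independently.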

The exact I/O contract (4 tools, 12 render intents, 24 widget types, the slot-state schema) is pinned in the RFC §2 and reproduced as a deterministic Python module — which becomes the floor (next section).

The deterministic oracle floor

Before any model, we built a deterministic oracle: automation/mlops/slm/travel/oracle.py. It is a pure-stdlib, rule-based reimplementation of both passes, derived from the planner's own deterministic fallback logic (the steps that emit llm_contract.fallback_used: true) and the two system prompts (playbooks/agent/system_prompt_extraction.md, system_prompt_chat.md).

The oracle plays two roles:

  1. Label source — it produces training labels at zero token cost.
  2. Quality floor — every model candidate is scored against the oracle's decisions, and the oracle is the safety fallback a model must not regress below.

The floor is 100% schema-valid by construction and effectively instant (deterministic Python, sub-millisecond per turn). That sets the bar precisely: a model wins only by matching more of the oracle's field-level decisions than a trivial baseline does, while holding the 100% validity — it can't win on latency or validity, only on accuracy of the harder, context-dependent turns.

The teacher ceiling — and the finding that reframed the project

The plan was to bootstrap a richer dataset with a strong teacher model. We chose Vertex Gemini (token minted from the worker pod's Workload Identity via the GKE metadata server — no API key, no Secret Manager, OpenAI dropped as the teacher; RFC decision #6).
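For readers unfamiliar with Workload Identity tokens, the sketch below shows the usual way a pod asks the metadata server for a short-lived access token. The endpoint and the `Metadata-Flavor` header are the standard GCE/GKE ones; the function names are hypothetical, and this is not taken from the project's provider code.

```python
import json
import urllib.request

# Standard GCE/GKE metadata-server endpoint for the default service account.
TOKEN_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/token"
)


def build_token_request() -> urllib.request.Request:
    """Build (but do not send) the metadata-server token request."""
    return urllib.request.Request(TOKEN_URL, headers={"Metadata-Flavor": "Google"})


def fetch_access_token(timeout: float = 5.0) -> str:
    """Fetch a short-lived access token; only works inside a GCP workload."""
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        return json.load(resp)["access_token"]
```

No key material is stored anywhere: the token exists only for the lifetime of the call, which is what lets the teacher run without an API key or a Secret Manager entry.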

The first ceiling run was the turning point. Raw gemini-2.5-pro, with no output-schema enforcement (exec 328967313246134272), scored:

  • 0% valid widget envelopes
  • 49% valid extractions

Both scores are below the deterministic floor. The cause was not a weak model: it picked sensible widget types but emitted the wrong tool-request keys (tool_id instead of tool) and empty payloads, so the envelopes failed the contract schema.

The conclusion that reframed the whole effort: the lever is not a bigger model. It is schema/grammar-constrained decoding. When we handed the teacher a Vertex generationConfig.responseSchema derived from the contract schemas (lib/slm_schema.py converts draft-07 → Vertex responseSchema), the cheaper gemini-2.5-flash became schema-valid by construction — 100% valid widget envelopes — at a fraction of the cost. The Phase-1 dataset build spent $0.068 total (85 flash calls, 5 transport timeouts) to label 40 of 45 turns.
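lib/slm_schema.py itself is not reproduced on this page. The following is only a sketch of what a draft-07 → responseSchema conversion involves, assuming the target accepts an OpenAPI-style subset (type, properties, required, items, enum, description, nullable) with uppercase type names. The function name, the kept-keyword list, and the example tool value are all assumptions.

```python
from typing import Any

# Keywords this sketch assumes the target schema dialect understands as-is.
KEPT = ("description", "enum", "required", "nullable")


def to_response_schema(schema: dict[str, Any]) -> dict[str, Any]:
    """Convert a draft-07 subset into an OpenAPI-style response schema.

    Unsupported keywords ($schema, additionalProperties, ...) are dropped.
    Only single string-valued "type" entries are handled in this sketch.
    """
    out: dict[str, Any] = {k: schema[k] for k in KEPT if k in schema}
    if "type" in schema:
        out["type"] = str(schema["type"]).upper()
    if "properties" in schema:
        out["properties"] = {
            name: to_response_schema(sub)
            for name, sub in schema["properties"].items()
        }
    if "items" in schema:
        out["items"] = to_response_schema(schema["items"])
    return out


# Toy contract fragment; "create_order" stands in for a real tool name.
draft07 = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "additionalProperties": False,
    "required": ["tool"],
    "properties": {"tool": {"type": "string", "enum": ["create_order"]}},
}
converted = to_response_schema(draft07)
```

With the enum carried into the response schema, a key such as tool_id or an out-of-vocabulary tool name cannot appear in the output at all, which is why validity stops depending on model size.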

This is the design center of everything downstream: constrain the output, don't enlarge the model.

The constrained teacher is augmentation, not the authoritative target. Each teacher label is validated and repaired toward the oracle contract before inclusion (3 render envelopes were repaired from the oracle on schema failure). The authoritative training target stays the deterministic oracle.
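The validate-then-repair rule can be sketched as follows. The helper names, the toy validator, and the `flight_list` widget type are hypothetical; the real pipeline validates against the contract schemas rather than this simplified check.

```python
from typing import Any, Callable

Label = dict[str, Any]


def repair_toward_oracle(
    teacher: Label, oracle: Label, is_valid: Callable[[Label], bool]
) -> tuple[Label, bool]:
    """Keep the teacher label if it validates, else substitute the oracle's.

    Returns (label, repaired_flag).
    """
    if is_valid(teacher):
        return teacher, False
    return oracle, True


def has_widgets(label: Label) -> bool:
    """Toy validator (assumption): a render envelope needs a non-empty widgets list."""
    widgets = label.get("widgets")
    return isinstance(widgets, list) and len(widgets) > 0


oracle_label = {"bot_message": "Here are your options.", "widgets": [{"type": "flight_list"}]}
bad_teacher = {"bot_message": "Here are your options.", "widgets": []}
label, repaired = repair_toward_oracle(bad_teacher, oracle_label, has_widgets)
```

Returning the repaired flag alongside the label makes it cheap to count how many teacher outputs had to be replaced in a given build.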

Phase 1 — the dataset

Built by the generic dataset_build playbook from the travel seed corpus, versioned as v1_constrained:

| Property | Value |
|---|---|
| Total turns | 45 (29 train / 16 eval, 30% holdout, split seed muno-travel-13) |
| Multitask examples | 58 (one extract + one render per turn) |
| Authoritative labels | deterministic oracle |
| Augmentation | constrained gemini-2.5-flash (40 turns labeled, repaired toward oracle) |
| Schema validity (final) | 100% widget envelopes + extracts + vocab |
| Vocab | 4 tools, 12 render intents, 10 widget types observed |

Per-record fields carry all three label sets side by side: labels (oracle, authoritative), labels_teacher (raw constrained teacher), labels_teacher_repaired (teacher render payloads repaired to the oracle on schema failure).

The dataset also revealed the floor↔ceiling gap that ranks candidates: floor and teacher agree on widget_type 93% of the time but on tool only 36% and render_intent only 29% — i.e. the context-dependent turns ("that flight", "those", multi-turn slot fills) are exactly where a model must earn its keep.

Phase B — the MLOps pipeline (MLOps-as-playbooks)

A hard requirement of the platform: every MLOps stage runs as a NoETL playbook, event-sourced and replayable like everything else — no external training scripts. Phase B (noetl/ai-meta#141, ops #219) shipped the three training-side stages:

| Playbook | Role |
|---|---|
| finetune.yaml | Train a single multitask LoRA → write the adapter → register a G3 model (lineage → dataset) |
| eval.yaml (candidate=slm) | Pull the model, score it under schema-constrained decoding vs the floor → register a G3 eval (lineage → model) |
| package.yaml | Export the model, write a model card, bundle → register a G3 release (lineage → model + eval) |

These sit on the three platform foundations: G1 (GPU container / k8s-Job dispatch), G2 (long-running async poll/callback), G3 (artifact store + model/dataset/eval/release registry; ops #217).

Three finetune backends share one artifact contract:

  • stub (CPU, pure stdlib) — a nearest-prototype retrieval "model" that proves the whole orchestration end-to-end with no GPU. It holds 100% schema validity and honestly fails the match gate (tool_match≈0.69) — the failing gate is the signal a real model must close, not faked to 1.0.
  • mlx (Apple Silicon) — the real local LoRA (next section).
  • peft (CUDA) — the real GPU LoRA, gated on the operator provisioning the GPU pool + trainer image + flags.
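To make the stub backend concrete, here is a minimal sketch of a nearest-prototype retrieval "model": it returns the label of the stored training turn most similar to the query. Token-level Jaccard similarity is an assumption of this sketch (the page does not say which similarity the stub uses), and the prototype labels are illustrative.

```python
def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens of a turn."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_prototype(query: str, prototypes: list[tuple[str, dict]]) -> dict:
    """Return the label of the stored training turn most similar to `query`."""
    query_tokens = tokens(query)
    _, best_label = max(prototypes, key=lambda p: jaccard(query_tokens, tokens(p[0])))
    return best_label


# Illustrative prototypes: (training turn text, stored label).
prototypes = [
    ("plan a trip", {"render_intent": "collect_missing"}),
    ("what flights are available", {"render_intent": "show_flights"}),
]
prediction = nearest_prototype("plan a trip to Lisbon", prototypes)
```

Because the output is always a copy of a stored label, it is schema-valid whenever the stored labels are, but it can only be right when a near-identical turn was seen in training.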

The first real fine-tune — local Apple-Silicon (MLX)

ops #220 added a mode=mlx backend so a 1–1.5B LoRA trains in unified memory on a Mac — no GPU node pool, no container — and ran it on the travel dataset. Committed artifacts: examples/travel-mlx-v1/ (RESULTS.md, MODEL_CARD.md, eval_report.json).

Training

| Setting | Value |
|---|---|
| Host | Mac Studio (Mac13,1), Apple M1 Max (10 cores 8P/2E), 32 GB unified, macOS 26.3.1 |
| Runtime | mlx-lm 0.31.3 in an arm64 Python 3.12 venv |
| Base model | Qwen/Qwen2.5-1.5B-Instruct (bf16, ≈3 GB from HF) |
| Recipe | single multitask LoRA (extract + render in one adapter) |
| Trainable params | 5.276 M / 1543.714 M (0.342%) |
| Hyperparams | iters 800, batch 1, top-16 layers, lr 1e-4, max-seq 2048, --mask-prompt |
| Peak memory | 9.7 GB |
| Wall-clock | ≈25 min |
| Val loss curve | 0.811 → 0.077 → 0.079 → 0.088 → 0.089 |

It trained on the oracle labels only (49 train / 9 valid examples), because the eval scores per-field exact match against the oracle labels — training directly on that target is the cleanest test of "can the SLM learn the floor's decisions".

Eval (16-example v1_constrained holdout, schema-constrained decoding):

| metric | SLM (mlx) | oracle floor | CPU stub |
|---|---|---|---|
| widget_schema_validity | 1.0000 | 1.0 | 1.0 |
| extract_schema_validity | 1.0000 | 1.0 | 1.0 |
| tool_vocab_validity | 1.0000 | 1.0 | 1.0 |
| render_intent_vocab_validity | 1.0000 | 1.0 | 1.0 |
| tool_match | 0.5625 | 1.0 | 0.6875 |
| render_intent_match | 0.5625 | 1.0 | 0.6250 |
| arg_fidelity | 0.5625 | 1.0 | 0.4375 |
| slot_update_match | 0.6250 | 1.0 | 0.5000 |
| widget_type_match | 0.3750 | 1.0 | 0.5000 |

Candidate latency (unoptimized local serving, 2 generations/turn, greedy): p50 9.2 s / p95 13.1 s. Gate: FAIL (match targets 0.95–0.98 not met).
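The per-field match metrics are exact-match rates against the oracle labels. A minimal sketch of that computation follows, assuming one flat record per holdout turn; the record layout and the function name are assumptions, not the eval playbook's real data model.

```python
from typing import Any

FIELDS = ("tool", "render_intent", "widget_type")


def field_match_rates(
    predictions: list[dict[str, Any]], oracle: list[dict[str, Any]]
) -> dict[str, float]:
    """Fraction of turns where each predicted field equals the oracle's value."""
    if len(predictions) != len(oracle) or not oracle:
        raise ValueError("need equally sized, non-empty prediction and oracle lists")
    return {
        f"{field}_match": sum(
            p.get(field) == o.get(field) for p, o in zip(predictions, oracle)
        )
        / len(oracle)
        for field in FIELDS
    }


# Synthetic 16-turn holdout: 9 turns match on tool/intent, 6 on widget type.
oracle = [{"tool": "t", "render_intent": "r", "widget_type": "w"}] * 16
preds = [
    {
        "tool": "t" if i < 9 else "x",
        "render_intent": "r" if i < 9 else "x",
        "widget_type": "w" if i < 6 else "x",
    }
    for i in range(16)
]
rates = field_match_rates(preds, oracle)
```

If each metric is scored once per holdout turn, every rate on a 16-turn holdout is a multiple of 1/16, so the reported 0.5625 is consistent with 9 of 16 turns matching and 0.3750 with 6 of 16.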

The honest verdict

The fine-tune holds the floor's 100% schema validity — every emitted extract and widget envelope is contract-valid by construction — and the generations are genuinely sensible: exact oracle match on clear turns ("plan a trip" → collect_missing, "confirm and purchase" → create_order/order_confirmation), correct tool+intent on "what flights are available".

But on per-field match it does not beat the deterministic floor, and against the retrieval CPU stub it only trades wins — better on arg_fidelity and slot_update (it generalizes structure rather than copying a neighbour), worse on tool/intent/widget (where the stub copies a memorised label that happens to match on this distribution-overlapping holdout). The weakest pass is render (widget_type_match 0.375): on turns with no tool data yet (e.g. show_flights before offers exist) the model under-produces the widget the oracle emits.

Root cause: 58 examples is tiny. This is a data-volume problem, not a model-architecture problem. The bounded vocab (4 tools, 12 intents) generalizes — but only with coverage of the context-dependent turns the model currently misroutes.

So: pipeline proven, local training proven, model not yet Muno-ready.

Registry lineage

The run wrote a full dataset → model → eval → release lineage into the G3 registry (local file-backed backend for this offline run):

dataset  registry://muno/travel/dataset/travel_v1_constrained/1
model    registry://muno/travel/model/travel_slm_multitask/1     (mlx LoRA adapter, ≈39 MB tar)
eval     registry://muno/travel/eval/travel_slm_multitask/1      (lineage → model)
release  registry://muno/travel/release/travel_slm_multitask/1   (lineage → [model, eval])
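The lineage above forms a small directed graph. The sketch below models it with a hypothetical `Artifact` record and walks the edges upstream from the release; the URIs are the ones listed above, while the data structure is an assumption and not the G3 registry's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    uri: str
    parents: tuple[str, ...] = ()


def ancestors(uri: str, registry: dict[str, Artifact]) -> list[str]:
    """Walk lineage edges depth-first and return every upstream URI once."""
    seen: list[str] = []
    stack = list(registry[uri].parents)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(registry[current].parents)
    return seen


DATASET = "registry://muno/travel/dataset/travel_v1_constrained/1"
MODEL = "registry://muno/travel/model/travel_slm_multitask/1"
EVAL = "registry://muno/travel/eval/travel_slm_multitask/1"
RELEASE = "registry://muno/travel/release/travel_slm_multitask/1"

registry = {
    DATASET: Artifact(DATASET),
    MODEL: Artifact(MODEL, (DATASET,)),
    EVAL: Artifact(EVAL, (MODEL,)),
    RELEASE: Artifact(RELEASE, (MODEL, EVAL)),
}
upstream = ancestors(RELEASE, registry)
```

Starting from the release, the walk reaches the eval, the model, and finally the dataset, which is the property that makes a shipped bundle traceable back to the exact training data.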

What the next iteration needs

In priority order (from RESULTS.md §"What a next iteration needs"):

  1. More data. Wire the config's event_log_replay (cap 1000 turns of real Muno planner turns, already status: enabled in slm.config.yaml) into the train set. This is the single biggest lever.
  2. Richer render conditioning. The render pass needs the tool_summary / slot context the oracle's render fn sees, not just turn + extraction, so it can emit the right widget on data-bearing turns.
  3. True grammar-constrained generation at decode time (logit masking / outlines over the contract enums) rather than only post-hoc repair — this lifts the in-vocab-but-wrong tool/intent cases the repair lever can't fix.
  4. Teacher-augmented + curriculum training once (1)–(3) land, and a larger candidate (e.g. qwen2.5-3B) if 1.5B plateaus below the gate.
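Item 3 distinguishes decode-time constraint from post-hoc repair. The core of logit masking can be sketched in a few lines: at each step, tokens the grammar does not allow get a logit of negative infinity, so they can never be selected. The toy vocabulary and function names are illustrative; a real implementation derives the allowed set per step from the contract enums and the tokenizer.

```python
import math


def mask_logits(logits: list[float], allowed_ids: set[int]) -> list[float]:
    """Set every disallowed token's logit to -inf so it can never be chosen."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]


def constrained_greedy(logits: list[float], allowed_ids: set[int]) -> int:
    """Greedy pick restricted to the allowed token ids."""
    if not allowed_ids:
        raise ValueError("grammar allows no token at this step")
    masked = mask_logits(logits, allowed_ids)
    return max(range(len(masked)), key=masked.__getitem__)


# Toy step: token 2 has the highest raw logit but is outside the enum.
vocab = ["search", "book", "banana", "create"]
logits = [1.2, 0.3, 2.5, 0.9]
allowed = {0, 1, 3}
unconstrained = max(range(len(logits)), key=logits.__getitem__)
constrained = constrained_greedy(logits, allowed)
```

Masking guarantees the chosen token is in-vocabulary; which in-vocabulary token wins still depends on the model's own scores.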

v2 + v3 results — the data + decode levers landed

The next-iteration list above was executed on the local Apple-Silicon mode=mlx path. Every match field climbed while schema validity stayed pinned at 100%:

| field | v1 (45 turns) | v2 (701 turns) | v3 (950 turns + constrained decode) | gate |
|---|---|---|---|---|
| tool_match | 0.5625 | 0.8017 | 0.9444 | 0.98 |
| render_intent_match | 0.5625 | 0.8595 | 0.9236 | 0.98 |
| widget_type_match | 0.3750 | 0.6529 | 0.7917 | 0.98 |
| arg_fidelity | 0.5625 | 0.7934 | 0.9444 | 0.95 |
| slot_update_match | 0.6250 | 0.9091 | 0.9375 | 0.95 |
| schema validity (widget/extract) | 1.0 | 1.0 | 1.0 | 1.0 ✅ |

  • v2 — data scaling. A 701-turn oracle-routed synthetic corpus (leak-free disjoint train/eval) + render conditioning on the tool_summary the oracle sees. Lifted every field +0.23–0.30; widget_type nearly doubled.
  • v3 — targeted data + grammar-constrained decode. Over-sampled the v2-weak show_places slice + contrast neighbours + the multi-widget summary sequence, plus true logit-level constrained decoding on the extract pass (lm-format-enforcer, lib/slm_constrain.py). An A/B isolates the levers: the data lever is the big mover (show_places 3/14 → 27/27 correct tool), the decode lever guarantees 100% extract validity + a small tool_match bump but adds no semantic gain.
  • 3B probe aborted — capacity was not the bottleneck (the v3 1.5B result confirmed data/decode were).

Verdict: not yet Muno-ready by the strict gate. widget_type_match 0.79 (render generation for data-bearing widgets) is the ~0.19-short blocker; the other four match fields are within ~0.006–0.057 of target. Teacher spend stayed $0 (the constrained Gemini teacher only runs on the in-cluster worker WI, not locally). Full write-ups: v2 · v3.
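The gate arithmetic behind this verdict can be checked directly from the v3 column and the gate column of the results table. The helper name is hypothetical; the numbers are the ones reported above.

```python
GATE = {
    "tool_match": 0.98,
    "render_intent_match": 0.98,
    "widget_type_match": 0.98,
    "arg_fidelity": 0.95,
    "slot_update_match": 0.95,
}
V3 = {
    "tool_match": 0.9444,
    "render_intent_match": 0.9236,
    "widget_type_match": 0.7917,
    "arg_fidelity": 0.9444,
    "slot_update_match": 0.9375,
}


def shortfalls(scores: dict[str, float], gate: dict[str, float]) -> dict[str, float]:
    """Return gate minus score for every field that misses its target."""
    return {k: round(gate[k] - scores[k], 4) for k in gate if scores[k] < gate[k]}


gaps = shortfalls(V3, GATE)
passed = not gaps  # the gate passes only when no field falls short
worst = max(gaps, key=gaps.get)
```

All five fields still miss their targets. The largest gap is widget_type_match at about 0.19, and the other four fall between roughly 0.006 (arg_fidelity) and 0.057 (render_intent_match).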

Timeline of decisions and PRs

| When | What | Where |
|---|---|---|
| Phase 0 | RFC + contract + I/O spec + deterministic-oracle design | travel#69, RFC docs/rfc/travel-slm.md |
| Phase A | Travel instance: oracle + seed corpus + slm.config.yaml | travel#73 (merged) |
| Phase A | Teacher ceiling + pluggable Vertex Gemini provider | ops#216 (merged) |
| Phase A | Point teacher at vertex_gemini (WI token, OpenAI dropped) | travel#75 (merged) |
| Foundation | G3 registry surface for the SLM template pack | ops#217 (merged) |
| Phase B | finetune + eval(SLM) + package playbooks + libs | ops#219 (merged) |
| Phase B | mode=mlx real LoRA on Apple Silicon + first (v1) results | ops#220 (open) |
| Phase B | v2 data scaling (701 turns) + render tool_summary conditioning | ops#220 (open) |
| Phase B | v3 targeted data (950 turns) + logit-level constrained decode | ops#220 (open) |
