travel slm

Travel-domain SLM — replace OpenAI in the itinerary-planner

Status (2026-06-26): Phases A + B executed. The MLOps pipeline (dataset → finetune → eval → package, with G3 registry lineage) is validated end-to-end, and the first real fine-tune (Apple-Silicon MLX, Qwen2.5-1.5B LoRA) trains and holds 100% schema validity — but does not yet beat the deterministic floor on per-field accuracy (root cause: 58 training examples). The planner is unchanged; nothing shipped to prod.

Two companion pages now carry the live detail:

Travel SLM journey — the track record: chronology, the teacher-ceiling finding, the dataset, the local LoRA run, the honest verdict, and what's next.

Training the Travel SLM — the reproducible engineer guide: dataset build, local MLX + GKE GPU training paths, eval, registry lineage, exact commands.

The rest of this page is the original Phase-0 design summary, kept for context.

This page tracks the initiative to replace the two OpenAI calls in the Muno itinerary-planner with a local, travel-domain Small Language Model (SLM). Full design lives in the repo RFC: docs/rfc/travel-slm.md.

Reference implementation of the NoETL Domain-Specific SLM platform. This travel SLM is the worked example that proves the generalized, multi-domain platform (umbrella noetl/ai-meta#139) end-to-end. The platform extracts this work's MLOps-as-playbooks pattern (automation/mlops/travel-slm/<stage> plus the RFC §6A capability gaps) into a reusable, config-driven framework that any org instantiates by writing one slm.config.yaml. The generic automation/mlops/slm/<stage> templates are extracted from travel's playbooks, and the §6A capability gaps #70 / #71 / #72 are the seeds of the domain-agnostic platform foundations G1 / G2 / G3 (ai-meta #144 / #145 / #146). The scope here does not change: travel stays the travel SLM. See the platform RFC docs/rfc/domain-slm-platform.md.

Umbrella issue: noetl/travel#63 (epic / slm / ml) · children #64–#68 (one per phase)
Platform umbrella (reference impl): noetl/ai-meta#139 — Domain-Specific SLM platform
Related: Python → Rust migration · noetl/ai-meta#137 (provider auth) · noetl/ai-meta#130 / #136 (per-hop latency)

Why

The planner is designed around two OpenAI passes. A domain SLM replaces them to cut cost, remove the external dependency + secret from the critical path, own the latency budget in-cluster, and specialize on the narrow travel intent→tool→widget task. It is a swap, not a rewrite: the SLM emits byte-compatible JSON for the two contracts below.

The two OpenAI roles the SLM replaces

| Pass | Model today | In → Out |
| --- | --- | --- |
| `extract_turn` | `gpt-4o`, JSON mode, temp 0 | (user event, slot_state, thread) → `{slot_updates, tool_requests, render_intent}` |
| `render_widget_chat` | `gpt-4o-mini` | (slot_state, extraction, tool_summary, render_intent) → `{bot_message, widgets[]}` |
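
As a rough illustration of what "byte-compatible JSON" implies at the outermost level, the sketch below checks only the top-level keys named in the table. The full contract (tool catalog, argument keys, enums, widget schemas) lives in RFC §2 and is not reproduced here. The function name `check_top_level` and the two key-set constants are hypothetical, introduced only for this example.

```python
import json
from typing import Any

# Top-level keys taken from the contract table above. Everything nested
# below them (enums, argument keys, widget schemas) is out of scope here.
EXTRACT_TURN_KEYS = {"slot_updates", "tool_requests", "render_intent"}
RENDER_WIDGET_CHAT_KEYS = {"bot_message", "widgets"}


def check_top_level(raw: str, expected_keys: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the shape matches."""
    try:
        payload: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top level is not a JSON object"]

    problems: list[str] = []
    present = set(payload)
    missing = expected_keys - present
    extra = present - expected_keys
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    return problems
```

A check like this would sit in front of the real schema validation, so that a malformed response fails fast with a readable message.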

The exact I/O contract (tool catalog, argument keys, slot-state schema, render_intent enum, the 24 widget types and their selection rules) is specified in the RFC §2, derived from playbooks/agent/system_prompt_extraction.md, playbooks/agent/system_prompt_chat.md, the widget-contract/*.schema.json set, and the deterministic Python reference in itinerary-planner.yaml.

Useful fact for the dataset plan: the current planner branch runs a deterministic Python implementation of both contracts (the step marks llm_contract.fallback_used: true). That gives the SLM project a working reference baseline and a free second labeling oracle alongside the OpenAI teacher.
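
One way to use the two oracles together is to label every input with both and record where they disagree, so that disagreements can be reviewed or filtered before training. The sketch below assumes each labeler is a plain callable from a turn input to an output dict. `label_example`, the `Labeler` alias, and the record layout are hypothetical and are not taken from the `dataset_build` playbook.

```python
from typing import Any, Callable

# Hypothetical signature: a labeler maps one turn input to one contract output.
Labeler = Callable[[dict[str, Any]], dict[str, Any]]


def label_example(
    turn_input: dict[str, Any],
    teacher: Labeler,
    oracle: Labeler,
) -> dict[str, Any]:
    """Label one input with both sources and list the fields that differ."""
    teacher_out = teacher(turn_input)
    oracle_out = oracle(turn_input)
    # Compare field by field across the union of keys, so a field that one
    # source omits entirely also counts as a disagreement.
    fields = sorted(set(teacher_out) | set(oracle_out))
    disagreements = [f for f in fields if teacher_out.get(f) != oracle_out.get(f)]
    return {
        "input": turn_input,
        "teacher": teacher_out,
        "oracle": oracle_out,
        "disagreements": disagreements,
    }
```

Records with an empty `disagreements` list are the cheapest candidates for high-confidence training examples. The rest show where the teacher and the deterministic reference diverge.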

Recommended approach (from the RFC)

Model: hybrid — a fine-tuned small instruct model (~0.5–3B; default Qwen2.5-1.5B / Llama-3.2-1B to confirm after a Phase-1 baseline) plus grammar-guided decoding so output is always schema-valid and tool/widget fields are always in-enum. A deterministic post-processor fills mechanical fields (geo coords, duffel_env: test, CTA ids) and is the safety fallback.
Serving: mirror the MCP-provider pattern with an in-cluster inference service over HTTP, called from a new automation/agents/mcp/travel-slm playbook, so it slots into the catalog like another provider. (A worker-embedded model is a fallback only if the HTTP hop proves to be the bottleneck.)
Integration: the planner calls the SLM playbook behind a flag (extraction_engine / render_engine ∈ openai|slm|deterministic, plus slm_shadow) so we can A/B test and shadow-eval before any cutover. Routing arcs and widget contracts stay unchanged.
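
The flag-based integration can be pictured as a small dispatcher: the engine flag selects which implementation answers the turn, and `slm_shadow` additionally runs the SLM on the side without affecting the response. This is a minimal sketch under assumptions. The flag values come from the text above, but `run_pass`, the `engines` mapping, and the shadow record fields are hypothetical, and the real routing lives in the planner playbook rather than in Python like this.

```python
from typing import Any, Callable, Optional

Engine = Callable[[dict[str, Any]], dict[str, Any]]
VALID_ENGINES = ("openai", "slm", "deterministic")


def run_pass(
    turn_input: dict[str, Any],
    engines: dict[str, Engine],
    engine_flag: str,
    slm_shadow: bool = False,
    shadow_log: Optional[list[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Answer with the flagged engine; optionally shadow-run the SLM."""
    if engine_flag not in VALID_ENGINES:
        raise ValueError(f"unknown engine: {engine_flag!r}")
    primary = engines[engine_flag](turn_input)

    # Shadow mode only makes sense when the SLM is not already primary.
    if slm_shadow and engine_flag != "slm":
        try:
            shadow = engines["slm"](turn_input)
            record = {"primary_engine": engine_flag,
                      "match": shadow == primary,
                      "shadow_error": None}
        except Exception as exc:  # a shadow failure must never break the turn
            record = {"primary_engine": engine_flag,
                      "match": False,
                      "shadow_error": repr(exc)}
        if shadow_log is not None:
            shadow_log.append(record)
    return primary
```

The property this sketch is meant to show is that the shadow path can only add a log record. The returned value always comes from the flagged engine, which is what makes rollback a flag change.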

MLOps via NoETL playbooks (dogfooding requirement)

Hard requirement: every MLOps stage — not just serving — runs as a NoETL playbook, never an external script. The SLM dogfoods the platform; its own lifecycle is event-sourced, replayable, and observable like everything else. Playbooks live under automation/mlops/travel-slm/.

| Stage | Playbook | Capability needed | Gap |
| --- | --- | --- | --- |
| Dataset | `dataset_build` | teacher API (http + keychain) · event-log replay · large-artifact storage · dataset registry | G3 |
| Train | `finetune` | GPU container/k8s-Job dispatch · long-running async (hours) · artifact storage + model registry | G1, G2, G3 |
| Eval | `eval` | http/python metrics vs OpenAI ceiling + deterministic floor · model registry | G3 |
| Shadow | `shadow_eval` | event-log read via server API (planner `slm_shadow` flag) | — |
| Package | `package` | quantize/build job · artifact storage + model registry | G1, G3 |
| Deploy + A/B flip | `deploy` | ops deploy automation · catalog flag-flip · http smoke | — |
| Cron | `retrain_orchestrator` | schedule/cron · playbook composition | — |

NoETL runtime capability gaps (design-flagged, tracked)

Playbook-based MLOps at training scale needs three platform capabilities that NoETL does not have today. Phase 0 flags them but does not build them:

G1 (#70) — GPU container / k8s-Job dispatch tool (noetl/tools+worker+ops).
G2 (#71) — long-running async job orchestration (callback/poll for hours-long jobs; noetl/server+worker).
G3 (#72) — large binary artifact storage + model/dataset registry in the catalog (noetl/server+noetl; builds on the result tier, ai-meta#104).

dataset_build, eval, and shadow_eval run on existing tool kinds today, so Phase 1 is not blocked; finetune and package are gated on G1–G3 (buildable in parallel with Phase 1).
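
The gating rule can be expressed as a set-containment check: a stage is runnable once every capability it hard-depends on is available. In the sketch below, the gap sets for `finetune` and `package` are taken from the stage table. Treating `dataset_build`, `eval`, and `shadow_eval` as having no hard gate follows the sentence above and is an interpretation, since the table also lists G3 against the first two. `STAGE_GATES` and `runnable_stages` are hypothetical names.

```python
# Hard gates per stage. Assumption: G3 is not a hard blocker for
# dataset_build and eval, because the text says they run on existing
# tool kinds today.
STAGE_GATES: dict[str, set[str]] = {
    "dataset_build": set(),
    "eval": set(),
    "shadow_eval": set(),
    "finetune": {"G1", "G2", "G3"},
    "package": {"G1", "G3"},
}


def runnable_stages(available: set[str]) -> list[str]:
    """Return the stages whose required capabilities are all available."""
    return sorted(
        stage for stage, gaps in STAGE_GATES.items() if gaps <= available
    )
```

With no gaps closed, this returns only the three Phase 1 and shadow stages. Once G1 and G3 are available, `package` is added, and `finetune` follows only once G2 is closed as well.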

Phases (each is a child issue)

| Phase | Playbook(s) | What | Touches prod? |
| --- | --- | --- | --- |
| 0 — RFC | — | This design + contract + decisions | No |
| 1 — Dataset + baseline (#64) | `dataset_build`, `eval` | Teacher-bootstrap dataset (OpenAI + deterministic oracle), eval harness, measure ceiling/floor/baseline | No |
| 2 — Train / fine-tune (#65) | `finetune`, `package` | LoRA/QLoRA chosen model + grammar decoding; hit offline targets — gated on G1–G3 | No |
| 3 — Serve + integrate (#66) | `deploy` + `mcp/travel-slm` | In-cluster inference service + playbook + planner flags; kind-validate | Flagged, default OpenAI |
| 4 — Shadow-eval (#67) | `shadow_eval` | `slm_shadow: true` vs OpenAI; match-rate + latency + widget-validity | Shadow only |
| 5 — Gated cutover (#68) | `deploy` (flag-flip) | Flip engine flags per-pass on metric bar; instant rollback | Gated |
| Ongoing — retrain/eval cron | `retrain_orchestrator` | Scheduled drift-check → conditional retrain → eval → shadow | No (until cutover) |

Success metrics (vs OpenAI baseline)

Tool-selection match ≥ 98% · argument fidelity ≥ 95% (100% schema-valid) · slot-update correctness ≥ 95% · render-intent match ≥ 98% · widget-type selection ≥ 98% · widget-schema validity 100% (grammar-enforced) · latency p50/p95 ≤ OpenAI · end-to-end turn equivalence ≥ 95%.
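
The metric bar lends itself to a mechanical gate that an eval or cutover step could apply. The sketch below encodes the thresholds listed above as fractions and compares latency against a measured OpenAI baseline. The metric key names, the millisecond units, and the function `failed_metrics` are assumptions made for illustration; the actual eval harness may name and compute these differently.

```python
# Minimum acceptable values, taken from the success-metric list above.
THRESHOLDS: dict[str, float] = {
    "tool_selection_match": 0.98,
    "argument_fidelity": 0.95,
    "argument_schema_validity": 1.00,
    "slot_update_correctness": 0.95,
    "render_intent_match": 0.98,
    "widget_type_selection": 0.98,
    "widget_schema_validity": 1.00,
    "turn_equivalence": 0.95,
}
# Latency must not exceed the OpenAI baseline at either percentile.
LATENCY_KEYS = ("latency_p50_ms", "latency_p95_ms")


def failed_metrics(
    slm: dict[str, float], openai_baseline: dict[str, float]
) -> list[str]:
    """Return the names of metrics that miss the bar (empty means pass)."""
    failures = []
    for name, minimum in THRESHOLDS.items():
        # A metric that was not measured counts as a failure.
        if slm.get(name, 0.0) < minimum:
            failures.append(name)
    for key in LATENCY_KEYS:
        if slm.get(key, float("inf")) > openai_baseline[key]:
            failures.append(key)
    return failures
```

Returning the list of failing metric names, rather than a single boolean, keeps the gate's verdict explainable when a cutover is refused.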

Decisions needed to start Phase 1

Model size/family · CPU vs GPU serving · fine-tune vs prompt-small-instruct · one model two roles vs two models · teacher token budget + seed-corpus size · cutover aggressiveness · serving stack (vLLM/TGI vs llama.cpp/Ollama) · event-log replay access for the dataset. See RFC §10.
