travel slm

Kadyapam edited this page Jun 27, 2026 · 4 revisions

Travel-domain SLM — replace OpenAI in the itinerary-planner

Status (2026-06-26): Phases A + B executed. The MLOps pipeline (dataset → finetune → eval → package, with G3 registry lineage) is validated end-to-end, and the first real fine-tune (Apple-Silicon MLX, Qwen2.5-1.5B LoRA) trains and holds 100% schema validity — but does not yet beat the deterministic floor on per-field accuracy (root cause: 58 training examples). The planner is unchanged; nothing shipped to prod.

Two companion pages now carry the live detail:

  • Travel SLM journey — the track record: chronology, the teacher-ceiling finding, the dataset, the local LoRA run, the honest verdict, and what's next.
  • Training the Travel SLM — the reproducible engineer guide: dataset build, local MLX + GKE GPU training paths, eval, registry lineage, exact commands.

The rest of this page is the original Phase-0 design summary, kept for context.

This page tracks the initiative to replace the two OpenAI calls in the Muno itinerary-planner with a local, travel-domain Small Language Model (SLM). Full design lives in the repo RFC: docs/rfc/travel-slm.md.

Reference implementation of the NoETL Domain-Specific SLM platform. This travel SLM is the worked example that proves the generalized, multi-domain platform (umbrella noetl/ai-meta#139) end-to-end. The platform extracts this work's MLOps-as-playbooks pattern (automation/mlops/travel-slm/<stage> plus the RFC §6A capability gaps) into a reusable, config-driven framework that any org instantiates by writing one slm.config.yaml. The generic automation/mlops/slm/<stage> templates are extracted from travel's playbooks, and the §6A capability gaps #70 / #71 / #72 seed the domain-agnostic platform foundations G1 / G2 / G3 (ai-meta #144 / #145 / #146). None of this changes scope here: travel stays the travel SLM. See the platform RFC, docs/rfc/domain-slm-platform.md.

  • Umbrella issue: noetl/travel#63 (epic / slm / ml) · children #64–#68 (one per phase)
  • Platform umbrella (reference impl): noetl/ai-meta#139 — Domain-Specific SLM platform
  • Related: Python → Rust migration · noetl/ai-meta#137 (provider auth) · noetl/ai-meta#130 / #136 (per-hop latency)

Why

The planner is designed around two OpenAI passes. A domain SLM replaces them to cut cost, remove the external dependency + secret from the critical path, own the latency budget in-cluster, and specialize on the narrow travel intent→tool→widget task. It is a swap, not a rewrite: the SLM emits byte-compatible JSON for the two contracts below.

The two OpenAI roles the SLM replaces

| Pass | Model today | In → Out |
| --- | --- | --- |
| extract_turn | gpt-4o, JSON mode, temp 0 | (user event, slot_state, thread) → {slot_updates, tool_requests, render_intent} |
| render_widget_chat | gpt-4o-mini | (slot_state, extraction, tool_summary, render_intent) → {bot_message, widgets[]} |

The exact I/O contract (tool catalog, argument keys, slot-state schema, render_intent enum, the 24 widget types and their selection rules) is specified in the RFC §2, derived from playbooks/agent/system_prompt_extraction.md, playbooks/agent/system_prompt_chat.md, the widget-contract/*.schema.json set, and the deterministic Python reference in itinerary-planner.yaml.
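
As a rough illustration of what "byte-compatible JSON" implies at the outermost level, the sketch below checks only the top-level keys of each pass's output. The key names come from the two contracts above; the function name `contract_key_errors` and the choice to reject unexpected keys are assumptions made for this example. The full contract (tool catalog, enums, widget schemas) lives in the RFC §2 and is not reproduced here.

```python
# Top-level output keys as listed for the two passes; nothing deeper is checked.
CONTRACT_KEYS = {
    "extract_turn": frozenset({"slot_updates", "tool_requests", "render_intent"}),
    "render_widget_chat": frozenset({"bot_message", "widgets"}),
}


def contract_key_errors(pass_name: str, payload: dict) -> list[str]:
    """Return top-level key problems for one pass; an empty list means the shape matches.

    Hypothetical helper: rejecting unexpected keys is an assumption, not an RFC rule.
    """
    expected = CONTRACT_KEYS[pass_name]
    actual = set(payload)
    errors = [f"missing key: {key}" for key in sorted(expected - actual)]
    errors += [f"unexpected key: {key}" for key in sorted(actual - expected)]
    return errors
```

A check like this is only a first gate in front of the real schema validation; it catches a model that drops or renames a top-level field before any per-field scoring runs.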

Useful fact for the dataset plan: the current planner branch runs a deterministic Python implementation of both contracts (the step marks llm_contract.fallback_used: true). That gives the SLM project a working reference baseline and a free second labeling oracle alongside the OpenAI teacher.
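
One way the second oracle could be used is to measure, per output field, how often the OpenAI teacher and the deterministic reference agree on the same turn. The sketch below is a minimal version of that idea; `agreement_rates` is a hypothetical helper, and comparing fields by plain equality is an assumption (the real eval harness may normalize values first).

```python
# Fields of the extract_turn output, as named in the contract table.
FIELDS = ("slot_updates", "tool_requests", "render_intent")


def agreement_rates(pairs: list[tuple[dict, dict]]) -> dict[str, float]:
    """Fraction of turns on which teacher and oracle labels agree, per field.

    Each pair is (teacher_label, oracle_label) for the same input turn.
    """
    if not pairs:
        raise ValueError("need at least one labeled pair")
    return {
        field: sum(teacher.get(field) == oracle.get(field) for teacher, oracle in pairs)
        / len(pairs)
        for field in FIELDS
    }
```

Turns where the two labelers disagree are natural candidates for manual review before they enter the training set.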

Recommended approach (from the RFC)

  • Model: hybrid — a fine-tuned small instruct model (~0.5–3B; default Qwen2.5-1.5B / Llama-3.2-1B to confirm after a Phase-1 baseline) plus grammar-guided decoding so output is always schema-valid and tool/widget fields are always in-enum. A deterministic post-processor fills mechanical fields (geo coords, duffel_env: test, CTA ids) and is the safety fallback.
  • Serving: mirror the MCP-provider pattern: an in-cluster inference service over HTTP, called from a new automation/agents/mcp/travel-slm playbook, so it slots into the catalog like another provider. (A worker-embedded model is a fallback only if the HTTP hop proves to be the bottleneck.)
  • Integration: the planner calls the SLM playbook behind a flag (extraction_engine / render_engine, each set to openai|slm|deterministic, plus slm_shadow) so we can A/B and shadow-eval before any cutover. Routing arcs and widget contracts stay unchanged.
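
The flag behavior described in the Integration bullet can be sketched as a small dispatcher. This is an illustration only: the engine values and the `slm_shadow` flag come from the text, while `run_pass`, the `engines` mapping, and the shape of the shadow error record are hypothetical names chosen for the example. In the real system this routing would live in the planner playbook, not in Python.

```python
from typing import Callable, Optional

Engine = Callable[[dict], dict]
VALID_ENGINES = ("openai", "slm", "deterministic")


def run_pass(
    turn: dict,
    engine: str,
    engines: dict[str, Engine],
    slm_shadow: bool = False,
) -> tuple[dict, Optional[dict]]:
    """Return (served_output, shadow_output); shadow_output is None when no shadow ran."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    served = engines[engine](turn)
    shadow = None
    # Shadowing only makes sense when the SLM is not already the serving engine.
    if slm_shadow and engine != "slm":
        try:
            shadow = engines["slm"](turn)
        except Exception as exc:
            # A shadow failure is recorded but never changes the served answer.
            shadow = {"shadow_error": repr(exc)}
    return served, shadow
```

The user only ever sees `served`; `shadow` is what a shadow-eval step would later compare against the OpenAI output.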

MLOps via NoETL playbooks (dogfooding requirement)

Hard requirement: every MLOps stage — not just serving — runs as a NoETL playbook, never an external script. The SLM dogfoods the platform; its own lifecycle is event-sourced, replayable, and observable like everything else. Playbooks live under automation/mlops/travel-slm/.

| Stage | Playbook | Capability needed | Gap |
| --- | --- | --- | --- |
| Dataset | dataset_build | teacher API (http + keychain) · event-log replay · large-artifact storage · dataset registry | G3 |
| Train | finetune | GPU container/k8s-Job dispatch · long-running async (hours) · artifact storage + model registry | G1, G2, G3 |
| Eval | eval | http/python metrics vs OpenAI ceiling + deterministic floor · model registry | G3 |
| Shadow | shadow_eval | event-log read via server API (planner slm_shadow flag) | |
| Package | package | quantize/build job · artifact storage + model registry | G1, G3 |
| Deploy + A/B flip | deploy | ops deploy automation · catalog flag-flip · http smoke | |
| Cron | retrain_orchestrator | schedule/cron · playbook composition | |

NoETL runtime capability gaps (design-flagged, tracked)

Playbook-based MLOps at training scale needs three platform capabilities NoETL doesn't have today — flagged, not built by Phase 0:

  • G1 (#70) — GPU container / k8s-Job dispatch tool (noetl/tools+worker+ops).
  • G2 (#71) — long-running async job orchestration (callback/poll for hours-long jobs; noetl/server+worker).
  • G3 (#72) — large binary artifact storage + model/dataset registry in the catalog (noetl/server+noetl; builds on the result tier, ai-meta#104).

dataset_build, eval, shadow_eval run on existing tool kinds today, so Phase 1 is not blocked; finetune/package are gated on G1–G3 (buildable in parallel with Phase 1).
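
The gating rule in the paragraph above can be written down as a small lookup. The sketch follows the prose rather than the Gap column: the table lists G3 against dataset_build and eval, but the text says those stages run on existing tool kinds today, so only finetune and package are treated as gated here. `blocking_gaps` is a hypothetical helper, and it does not model retrain_orchestrator composing gated stages.

```python
# Gap requirements for the two gated stages, taken from the stage table.
GATED_ON = {
    "finetune": frozenset({"G1", "G2", "G3"}),
    "package": frozenset({"G1", "G3"}),
}


def blocking_gaps(stage: str, delivered: set[str]) -> list[str]:
    """Gaps that still block `stage`; an empty list means the stage can run."""
    return sorted(GATED_ON.get(stage, frozenset()) - set(delivered))
```

For example, with only G1 delivered, `blocking_gaps("finetune", {"G1"})` reports G2 and G3 as outstanding, while `blocking_gaps("dataset_build", set())` reports nothing.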

Phases (each is a child issue)

| Phase | Playbook(s) | What | Touches prod? |
| --- | --- | --- | --- |
| 0: RFC | | This design + contract + decisions | No |
| 1: Dataset + baseline (#64) | dataset_build, eval | Teacher-bootstrap dataset (OpenAI + deterministic oracle), eval harness, measure ceiling/floor/baseline | No |
| 2: Train / fine-tune (#65) | finetune, package | LoRA/QLoRA chosen model + grammar decoding; hit offline targets; gated on G1–G3 | No |
| 3: Serve + integrate (#66) | deploy + mcp/travel-slm | In-cluster inference service + playbook + planner flags; kind-validate | Flagged, default OpenAI |
| 4: Shadow-eval (#67) | shadow_eval | slm_shadow: true vs OpenAI; match-rate + latency + widget-validity | Shadow only |
| 5: Gated cutover (#68) | deploy (flag-flip) | Flip engine flags per-pass on metric bar; instant rollback | Gated |
| Ongoing: retrain/eval cron | retrain_orchestrator | Scheduled drift-check → conditional retrain → eval → shadow | No (until cutover) |

Success metrics (vs OpenAI baseline)

Tool-selection match ≥ 98% · argument fidelity ≥ 95% (100% schema-valid) · slot-update correctness ≥ 95% · render-intent match ≥ 98% · widget-type selection ≥ 98% · widget-schema validity 100% (grammar-enforced) · latency p50/p95 ≤ OpenAI · end-to-end turn equivalence ≥ 95%.
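
These thresholds translate directly into a pass/fail gate. The sketch below is one possible shape for it: the numeric floors are copied from the list above, but the metric key names, the millisecond latency keys, and the `failed_metrics` function are hypothetical and would need to match whatever the eval playbook actually emits.

```python
# Minimum rates from the success-metrics list; key names are illustrative only.
MIN_THRESHOLDS = {
    "tool_selection_match": 0.98,
    "argument_fidelity": 0.95,
    "argument_schema_valid": 1.00,
    "slot_update_correctness": 0.95,
    "render_intent_match": 0.98,
    "widget_type_selection": 0.98,
    "widget_schema_validity": 1.00,
    "turn_equivalence": 0.95,
}
# Latency must be no worse than the OpenAI baseline at both percentiles.
LATENCY_KEYS = ("latency_p50_ms", "latency_p95_ms")


def failed_metrics(slm: dict, openai_baseline: dict) -> list[str]:
    """Names of metrics on which the SLM misses the bar; empty means the gate passes."""
    failures = [name for name, floor in MIN_THRESHOLDS.items() if slm[name] < floor]
    failures += [key for key in LATENCY_KEYS if slm[key] > openai_baseline[key]]
    return failures
```

Returning the list of failing metrics, rather than a single boolean, keeps the per-pass cutover decision explainable when a flag flip is refused.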

Decisions needed to start Phase 1

Model size/family · CPU vs GPU serving · fine-tune vs prompt-small-instruct · one model two roles vs two models · teacher token budget + seed-corpus size · cutover aggressiveness · serving stack (vLLM/TGI vs llama.cpp/Ollama) · event-log replay access for the dataset. See RFC §10.
