Training the Travel SLM

A step-by-step, reproducible guide to building the travel-domain dataset, fine-tuning a small model on it, evaluating it against the deterministic floor, and registering the lineage. Written for an engineer who has never seen this code before.

For the why and the result history, read the Travel SLM journey page first. This page is the how.

Two training paths are covered:

Path A — local on Apple Silicon (MLX). No GPU pool, no container. This is what produced the first real fine-tune; start here.
Path B — GKE GPU container (PEFT/CUDA). The production path, gated on infra the operator must provision.

Everything runs as NoETL playbooks (MLOps-as-playbooks). The engine logic lives in repos/ops/automation/mlops/slm/lib/; the playbooks are thin wrappers. You can also call the libs directly (the guide shows both).

0. Mental model

The lifecycle is a four-stage pipeline, each stage a playbook, each writing a versioned entry into the G3 registry with lineage to the previous:

dataset_build   →  finetune     →  eval(candidate=slm)  →  package
(seed+teacher)     (LoRA adapter)   (vs oracle floor)       (model card + release)
   ↓ dataset          ↓ model           ↓ eval                ↓ release
        registry://muno/travel/{dataset,model,eval,release}/<version>

The design center: constrain the output, don't enlarge the model. Both the stub and the real model propose JSON; the contract schemas dispose (_constrain_extract / _constrain_render repair toward a minimal schema-valid form). That is why schema validity stays at 100% regardless of model quality — the match metrics are where real quality shows.
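
The propose/dispose split can be pictured as a small repair function. The sketch below is illustrative only: `constrain_extract`, the `ALLOWED_KINDS` enum, and the `text` fallback are hypothetical stand-ins, not the real `_constrain_extract` or the contract schema. It only shows why a repaired payload is always schema-valid even when the proposal is poor.

```python
# Hypothetical enum and fallback; the real values come from the contract schemas.
ALLOWED_KINDS = ("calendar_live", "text")
FALLBACK_KIND = "text"


def constrain_extract(proposed):
    """Repair a proposed extract payload toward a minimal schema-valid form."""
    if not isinstance(proposed, dict):
        proposed = {}  # unusable proposal: rebuild from defaults
    intent = proposed.get("render_intent")
    kind = intent.get("kind") if isinstance(intent, dict) else None
    if kind not in ALLOWED_KINDS:
        kind = FALLBACK_KIND  # out-of-vocabulary values are replaced, not rejected
    slot_updates = proposed.get("slot_updates")
    tool_requests = proposed.get("tool_requests")
    return {
        "render_intent": {"kind": kind},
        "slot_updates": slot_updates if isinstance(slot_updates, dict) else {},
        "tool_requests": tool_requests if isinstance(tool_requests, list) else [],
    }
```

Because the function always returns the full minimal shape, validity is a property of the repair step, while whether `kind` matches the oracle remains a property of the model.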

1. Prerequisites

Repos (already present as ai-meta submodules)

Path	Holds
`repos/ops/automation/mlops/slm/`	the generic template pack: playbooks + `lib/` engines
`repos/travel/automation/mlops/slm/travel/`	the travel instance: `slm.config.yaml`, `oracle.py`, contracts, seed corpus, built datasets

The one object the travel domain writes to drive the generic pack is slm.config.yaml. It declares the two roles (extract/render), their schemas, the oracle module, the vocab enums, the data sources, the teacher, the model candidates, and the eval gates. Read it once before starting.

Tooling

noetl CLI (for the playbook path) — -r local runs the embedded Rust interpreter, no server needed.
Python 3.12 — the lib/ engines are pure-stdlib for dataset/eval (they ship their own draft-07 JSON-Schema validator because the runtime has no jsonschema).
For Path A: an arm64 Python venv with mlx-lm (details below).
For Path B: the operator-provisioned GPU infra (§6).

The registry backend switch

Every stage writes to the registry. Pick the backend with one env var:

NOETL_REGISTRY_BACKEND=local + NOETL_REGISTRY_LOCAL_DIR=<dir> — file-backed mirror of the G3 server semantics. Use this for all local work (Path A and the offline smoke). Nothing touches the server.
unset (server) — real G3. Needs NOETL_SERVER_URL + NOETL_INTERNAL_API_TOKEN and a server started with NOETL_REGISTRY_ENABLED=true. This is the Path B / in-cluster mode.
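
As a rough illustration of the switch, the following sketch resolves the backend from the environment variables named above. The function name `resolve_registry_backend` and the returned dict shape are assumptions for this example; the real logic lives in `lib/slm_registry.py` and may differ (for instance, it may apply a default local directory).

```python
import os


def resolve_registry_backend(env=None):
    """Describe which registry backend the environment selects."""
    env = os.environ if env is None else env
    if env.get("NOETL_REGISTRY_BACKEND") == "local":
        # File-backed mirror; the directory may be None if the real lib defaults it.
        return {"backend": "local", "dir": env.get("NOETL_REGISTRY_LOCAL_DIR")}
    # Unset means the real G3 server, which needs a URL and the internal token.
    required = ("NOETL_SERVER_URL", "NOETL_INTERNAL_API_TOKEN")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise ValueError("server backend needs: " + ", ".join(missing))
    return {"backend": "server", "url": env["NOETL_SERVER_URL"]}
```

Running a check like this before a long training job surfaces a missing token up front rather than at registration time.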

2. Build (or locate) the dataset

The latest built dataset is committed at repos/travel/automation/mlops/slm/travel/datasets/build/travel/v1_constrained/ (train.jsonl, eval.jsonl, manifest.json). If it is present you can skip straight to training.

To rebuild it from the seed corpus:

# from the ai-meta root
noetl exec repos/ops/automation/mlops/slm/dataset_build.yaml -r local \
  --set config=repos/travel/automation/mlops/slm/travel/slm.config.yaml

What this does: reads the seed corpus, labels every turn with the deterministic oracle (authoritative), optionally augments with the constrained teacher (gemini-2.5-flash via Vertex — only if the teacher block is enabled and you have WI/Vertex access), validates every output against the contract schemas, repairs teacher render payloads toward the oracle on schema failure, splits 70/30 (seed muno-travel-13), and writes versioned JSONL + a manifest.json.

The result you want is v1_constrained: 45 turns, of which 29 land in the train split and become 58 multitask examples (one extract + one render per turn) and 16 form the holdout used for eval, all at 100% schema validity.

Teacher note. The teacher is optional for training (the model trains on oracle labels). Rebuilding without Vertex access still produces a valid oracle-labeled dataset — the teacher columns just stay empty. The teacher run that produced v1_constrained cost $0.068 (85 flash calls). The token is minted in-python from the worker pod's Workload Identity; there is no API key to set.
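
One common way to get a reproducible split from a string seed is to hash the seed together with each turn id. The sketch below assumes that approach; the guide does not say how the dataset engine actually derives its split from `muno-travel-13`, and `assign_split` is a hypothetical name.

```python
import hashlib


def assign_split(turn_id, seed="muno-travel-13", train_fraction=0.7):
    """Deterministically map a turn id to 'train' or 'eval'."""
    digest = hashlib.sha256(f"{seed}:{turn_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "train" if bucket < train_fraction else "eval"
```

With a per-turn hash, the same turn always lands in the same split across rebuilds, but on a corpus this small the realized ratio only approximates the nominal fraction.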

Dataset record shape

Each line of train.jsonl / eval.jsonl:

{
  "id": "t002",
  "input": { ...user event + slot_state + thread context... },
  "intent_label": "...",
  "label_source": "deterministic_oracle",
  "labels":                 { "extract": {...}, "render": {...} },   // authoritative target
  "labels_teacher":         { "extract": {...}, "render": {...} },   // raw constrained teacher
  "labels_teacher_repaired":{ "extract": {...}, "render": {...} },   // teacher repaired to oracle
  "teacher_valid": true,
  "valid": true
}

The finetune engine flattens each turn into two prompt/completion training examples — one for the extract role, one for the render role — using the role system prompts as the prompt and the (JSON) oracle label as the completion. Example MLX training line:

{"prompt": "# Muno Itinerary Planner Extraction Prompt\n...", "completion": "{\"render_intent\": {\"kind\": \"calendar_live\"}, \"slot_updates\": {}, \"tool_requests\": []}"}
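
The flattening step can be sketched as follows. This is a minimal illustration, not the finetune engine's code: `flatten_record` is a hypothetical name, and appending the serialized `input` to the role system prompt is an assumption about how the prompt is assembled.

```python
import json


def flatten_record(record, system_prompts):
    """Turn one dataset record into one prompt/completion example per role."""
    context = json.dumps(record["input"], sort_keys=True)
    examples = []
    for role in ("extract", "render"):
        examples.append({
            # Assumed layout: role system prompt followed by the turn context.
            "prompt": system_prompts[role] + "\n" + context,
            # The completion is the authoritative oracle label, serialized as JSON.
            "completion": json.dumps(record["labels"][role], sort_keys=True),
        })
    return examples
```

Each returned dict can be written as one line of the MLX training file with `json.dumps`.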

3. Path A — train locally on Apple Silicon (MLX)

This is the path that produced the first real travel SLM. A 1–1.5B LoRA fits in unified memory, so the whole dataset → finetune → eval → package spine runs on a Mac with the file-backed registry.

3.1 One-time venv

# arm64 Python venv with mlx-lm
python3 -m venv .slm-venv
.slm-venv/bin/pip install mlx-lm

mlx-lm 0.31.3 was used for the reference run. The base weights (Qwen/Qwen2.5-1.5B-Instruct, ≈3 GB) download from Hugging Face on first run and cache locally.

3.2 Environment

export SLM_DATASET_VERSION=v1_constrained
export NOETL_REGISTRY_BACKEND=local
export NOETL_REGISTRY_LOCAL_DIR=$PWD/.slm_registry
CONFIG=$PWD/repos/travel/automation/mlops/slm/travel/slm.config.yaml
LIB=repos/ops/automation/mlops/slm/lib

3.3 Train

Calling the engine directly (the exact reference invocation):

.slm-venv/bin/python $LIB/slm_finetune.py --config "$CONFIG" --backend mlx \
  --tenant muno --project travel --learning-rate 1e-4 \
  --mlx-iters 800 --mlx-batch-size 1 --mlx-num-layers 16 --mlx-max-seq-length 2048

Or via the playbook (python_bin must point at the arm64 venv):

noetl exec repos/ops/automation/mlops/slm/finetune.yaml -r local \
  --set mode=mlx \
  --set python_bin=.slm-venv/bin/python \
  --set dataset_version=v1_constrained \
  --set learning_rate=1e-4 \
  --set mlx_iters=800 --set mlx_batch_size=1 \
  --set mlx_num_layers=16 --set mlx_max_seq_length=2048

Hyperparameters used for the reference run (and why they're a sane default for this dataset size):

Knob	Value	Note
base model	`Qwen/Qwen2.5-1.5B-Instruct`	small enough for unified memory; bounded vocab generalizes
iters	800	the val-loss curve flattens by ~iter 200 (0.811 → 0.089); 800 is comfortably past convergence on 58 examples
batch size	1	tiny dataset; keep it simple
num layers	16 (top-16)	LoRA only the top layers → 5.276 M trainable params (0.342%)
learning rate	1e-4	standard LoRA lr
max seq length	2048	the prompts (system prompt + context) fit comfortably
`--mask-prompt`	on	train on the completion only, not the prompt

Expected cost on an M1 Max / 32 GB: ~25 min wall-clock, ~9.7 GB peak memory, final val loss ≈0.089. The engine writes an adapter-only artifact, tars it, and registers a G3 model (lineage → the dataset entry).
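
A quick back-of-the-envelope check of the numbers in the table, assuming each iteration consumes exactly `batch_size` of the 58 training examples:

```python
iters, batch_size, n_examples = 800, 1, 58

# Number of passes over the training set made by the reference run.
epochs = iters * batch_size / n_examples
print(f"epochs: {epochs:.1f}")

# Total parameter count implied by the reported trainable share.
trainable_params = 5.276e6
trainable_fraction = 0.342 / 100
implied_total = trainable_params / trainable_fraction
print(f"implied total params: {implied_total / 1e9:.2f}B")
```

Roughly 14 passes over 58 examples is consistent with the val loss flattening early, and the implied total of about 1.54B parameters matches a 1.5B-class base model.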

The adapter and train_report.json land under datasets/build/travel/v1_constrained/models/travel_slm_multitask-mlx/ (git-ignored — large binary).

3.4 Eval

.slm-venv/bin/python $LIB/slm_eval.py --config "$CONFIG" --candidate slm \
  --model-ref latest --tenant muno --project travel --register

Or via the playbook:

noetl exec repos/ops/automation/mlops/slm/eval.yaml -r local \
  --set candidate=slm --set model_ref=latest --set register=true \
  --set dataset_version=v1_constrained

This pulls the registered model, runs it on the 16-example holdout under the same schema-constrained decoding the teacher used, computes the match/validity/latency metrics, gates them against the config targets, and (with --register) writes a G3 eval entry (lineage → model). It prints the gate result and writes eval_report.json.

3.5 Package

.slm-venv/bin/python $LIB/slm_package.py --config "$CONFIG" \
  --model-ref latest --eval-ref latest --tenant muno --project travel

Or via the playbook:

noetl exec repos/ops/automation/mlops/slm/package.yaml -r local \
  --set model_ref=latest --set eval_ref=latest

Exports the model, writes a MODEL_CARD.md (metrics included), bundles model + card + eval report, and registers a G3 release (lineage → [model, eval]).

3.6 Offline smoke (sanity check the whole spine without a model)

To verify the orchestration end-to-end using the CPU stub backend (no MLX, no GPU, seconds not minutes):

cd repos/ops/automation/mlops/slm
NOETL_REGISTRY_BACKEND=local \
python3 lib/slm_pipeline_smoke.py \
  --config /abs/path/repos/travel/automation/mlops/slm/travel/slm.config.yaml \
  --dataset-version v1_constrained

It asserts the dataset → model → eval → release lineage DAG and that constrained decoding holds schema validity at 1.0. Exits non-zero on any failure. The stub intentionally fails the match gate — that proves the gate is real, not faked.
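
The lineage check the smoke performs can be pictured with a toy graph. The sketch below is not the code in `slm_pipeline_smoke.py`; it only encodes the edges described in this guide and walks them upward, with `LINEAGE` and `ancestors` as hypothetical names.

```python
# Lineage edges as described in this guide: child kind -> parent kinds.
LINEAGE = {
    "dataset": [],
    "model": ["dataset"],
    "eval": ["model"],
    "release": ["model", "eval"],
}


def ancestors(kind, lineage):
    """Return every kind reachable by following lineage edges upward."""
    seen = set()
    stack = list(lineage[kind])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(lineage[parent])
    return seen
```

A release whose ancestors do not include the dataset would indicate a broken link somewhere in the chain.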

4. How eval works (reading the numbers)

The eval scores the candidate's per-field decisions against the oracle labels (the authoritative target) on the holdout split. Two families:

Validity (target 1.0, hard): widget_schema_validity, extract_schema_validity, tool_vocab_validity, render_intent_vocab_validity. These are 100% by construction because of constrained decoding — if one drops below 1.0, the constrain/repair lever is broken, not the model.
Match (targets 0.95–0.98): tool_match, render_intent_match, widget_type_match, arg_fidelity, slot_update_match. This is where real model quality lives.
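
A match metric of this kind reduces to an agreement rate over the holdout. The sketch below is a simplified stand-in for the real scoring in `slm_eval.py`; `field_match_rate` and the flat field lookup are assumptions made for illustration.

```python
def field_match_rate(pairs, field):
    """Fraction of (candidate, oracle) pairs that agree on one field."""
    if not pairs:
        return 0.0
    hits = sum(1 for candidate, oracle in pairs if candidate.get(field) == oracle.get(field))
    return hits / len(pairs)


# Toy holdout of 16 turns where the candidate agrees with the oracle on 9.
pairs = [({"tool": "search"}, {"tool": "search"})] * 9
pairs += [({"tool": "none"}, {"tool": "search"})] * 7
rate = field_match_rate(pairs, "tool")
```

On a 16-example holdout every metric moves in steps of 1/16 = 0.0625, so a single turn flipping changes the score visibly.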

Interpreting results vs the floor: the deterministic oracle scores 1.0 on every match metric by definition (it is the label source). So a model never "beats" the oracle — it approaches it. The honest question is how much of the 0.95–0.98 gate the model closes, and whether it beats the trivial baselines (the CPU stub's retrieval). The reference run:

gate: FAIL
  tool_match        = 0.5625  < 0.98
  render_intent_match = 0.5625 < 0.98
  widget_type_match = 0.3750  < 0.98
  arg_fidelity      = 0.5625  < 0.95
  slot_update_match = 0.6250  < 0.95
  (all four validity rows = 1.0000, PASS)
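
The gate itself is a per-metric threshold comparison. The following sketch replays the reference numbers above against the stated targets; the `gate` function and the dict layout are illustrative, not the eval engine's actual interface.

```python
TARGETS = {
    "widget_schema_validity": 1.0, "extract_schema_validity": 1.0,
    "tool_vocab_validity": 1.0, "render_intent_vocab_validity": 1.0,
    "tool_match": 0.98, "render_intent_match": 0.98, "widget_type_match": 0.98,
    "arg_fidelity": 0.95, "slot_update_match": 0.95,
}
REFERENCE_RUN = {
    "widget_schema_validity": 1.0, "extract_schema_validity": 1.0,
    "tool_vocab_validity": 1.0, "render_intent_vocab_validity": 1.0,
    "tool_match": 0.5625, "render_intent_match": 0.5625, "widget_type_match": 0.3750,
    "arg_fidelity": 0.5625, "slot_update_match": 0.6250,
}


def gate(metrics, targets):
    """Return ('PASS' | 'FAIL', {metric: (value, target)}) for failing metrics."""
    failures = {}
    for name, target in targets.items():
        value = metrics.get(name, 0.0)  # a missing metric counts as a failure
        if value < target:
            failures[name] = (value, target)
    return ("PASS" if not failures else "FAIL"), failures


status, failures = gate(REFERENCE_RUN, TARGETS)
```

Here `status` is `FAIL` with exactly the five match metrics listed as failures; on the 16-example holdout, 0.5625 corresponds to 9 of 16 turns and 0.3750 to 6 of 16.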

eval_report.json also carries divergences[] — the concrete turns where candidate and floor disagree, with diverged_fields. That list is your debugging surface for the next iteration (e.g. t010 "what flights are available" diverged on tool/intent/arg).
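
To see which fields diverge most often, the report can be tallied in a few lines. This sketch assumes `divergences` is a list of objects that each carry a `diverged_fields` list, as described above; the `id` key and the field names in the sample are hypothetical.

```python
from collections import Counter


def tally_diverged_fields(report):
    """Count how often each field appears across the report's divergences."""
    counts = Counter()
    for item in report.get("divergences", []):
        counts.update(item.get("diverged_fields", []))
    return counts


# Hypothetical report fragment for illustration only.
sample_report = {
    "divergences": [
        {"id": "t010", "diverged_fields": ["tool", "intent", "arg"]},
        {"id": "t021", "diverged_fields": ["tool"]},
    ]
}
counts = tally_diverged_fields(sample_report)
```

With a real report, load it first with `json.load(open("eval_report.json"))` and pass the resulting dict in.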

5. The registry lineage

Every stage writes a versioned entry addressed by a URN: registry://<kind>/<name>/<version> (or fully-qualified registry://<tenant>/<project>/<kind>/<name>/<version>; <version> is an integer or latest). The reference run produced:

dataset  registry://muno/travel/dataset/travel_v1_constrained/1
model    registry://muno/travel/model/travel_slm_multitask/1     (lineage → dataset)
eval     registry://muno/travel/eval/travel_slm_multitask/1      (lineage → model)
release  registry://muno/travel/release/travel_slm_multitask/1   (lineage → [model, eval])
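
A small parser makes the two URN forms concrete. This is a sketch written from the format described above, not the parsing code in `lib/slm_registry.py`; `RegistryRef` and `parse_registry_urn` are hypothetical names.

```python
from typing import NamedTuple, Optional


class RegistryRef(NamedTuple):
    tenant: Optional[str]
    project: Optional[str]
    kind: str
    name: str
    version: str  # an integer string or "latest"


def parse_registry_urn(urn):
    """Parse the short (3-part) or fully-qualified (5-part) registry URN."""
    prefix = "registry://"
    if not urn.startswith(prefix):
        raise ValueError(f"not a registry URN: {urn!r}")
    parts = urn[len(prefix):].split("/")
    if len(parts) == 3:
        tenant = project = None
        kind, name, version = parts
    elif len(parts) == 5:
        tenant, project, kind, name, version = parts
    else:
        raise ValueError(f"expected 3 or 5 path segments, got {len(parts)}")
    if version != "latest" and not version.isdigit():
        raise ValueError(f"version must be an integer or 'latest': {version!r}")
    return RegistryRef(tenant, project, kind, name, version)
```

For example, `parse_registry_urn("registry://muno/travel/model/travel_slm_multitask/1")` yields tenant `muno`, project `travel`, kind `model`, and version `1`.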

Large bytes (datasets, adapter weights, eval reports) live in the object store; the registry entry records where (artifact_uri) and how it was produced (metadata + lineage). The client lib is lib/slm_registry.py; for the real backend it is server-mediated (it calls /api/internal/registry/* and /api/internal/objects/* with the internal token, and never touches the DB or object store directly, per the data-access boundary).

6. Path B — GKE GPU container (PEFT/CUDA)

The mode=container path dispatches the same engine with --backend peft as a G1 GPU k8s Job; the worker frees its slot via G2 poll-to-terminal so the hours-long run doesn't hold a slot; the Job registers the model into G3 from inside the cluster.

It is gated — none of this is on prod. The operator must provision and approve all of the following before flipping mode=container:

GPU node pool — a GKE pool with GPUs (e.g. nvidia-l4), tainted nvidia.com/gpu=present:NoSchedule, labelled cloud.google.com/gke-accelerator=nvidia-l4, with the NVIDIA device-plugin DaemonSet.
Training image ghcr.io/noetl/slm-trainer:<tag> — python3 + CUDA torch + transformers + peft + datasets + accelerate + the lib/ engine at /opt/slm/lib; base-model weights baked or runtime-pullable.
Dataset PVC slm-data — pre-populated with slm.config.yaml + the built dataset under /data/datasets/build/<project>/<version>/, writable for the /data/models output.
ServiceAccount noetl-slm-trainer — Workload-Identity-bound to a GSA with registry + GCS write, carrying the internal-API token Secret noetl-internal-api.
Platform flags (review-only today, NOT on prod) — server NOETL_REGISTRY_ENABLED=true (G3); worker NOETL_CONTAINER_COMPLETION_POLL=true (G2); the G1 container tool (tools ≥ 3.19.0) on the worker image.
Worker RBAC — the worker SA needs batch/jobs.create in the Job namespace.

Dispatch (after the checklist is satisfied):

noetl exec automation/mlops/slm/finetune.yaml -r distributed \
  --set mode=container \
  --set dataset_version=v1_constrained \
  --set train_image=ghcr.io/noetl/slm-trainer:<tag> \
  --set gpu_accelerator=nvidia-l4 --set gpu_count=1 \
  --set data_pvc=slm-data --set service_account=noetl-slm-trainer

The container step's node selector, GPU tolerations, resource requests, PVC mount, and env (server URL, registry backend server, internal token from the Secret) are all declared in finetune.yaml (step: finetune_container) — see the # PROD-GPU CHECKLIST block at the bottom of that file.

7. Reproduce the latest run (copy-paste)

# from the ai-meta root
python3 -m venv .slm-venv && .slm-venv/bin/pip install mlx-lm

export SLM_DATASET_VERSION=v1_constrained
export NOETL_REGISTRY_BACKEND=local
export NOETL_REGISTRY_LOCAL_DIR=$PWD/.slm_registry
CONFIG=$PWD/repos/travel/automation/mlops/slm/travel/slm.config.yaml
LIB=repos/ops/automation/mlops/slm/lib

# train  (≈25 min on M1 Max)
.slm-venv/bin/python $LIB/slm_finetune.py --config "$CONFIG" --backend mlx \
  --tenant muno --project travel --learning-rate 1e-4 \
  --mlx-iters 800 --mlx-batch-size 1 --mlx-num-layers 16 --mlx-max-seq-length 2048

# eval   (writes eval_report.json, prints the gate)
.slm-venv/bin/python $LIB/slm_eval.py --config "$CONFIG" --candidate slm \
  --model-ref latest --tenant muno --project travel --register

# package (model card + release)
.slm-venv/bin/python $LIB/slm_package.py --config "$CONFIG" \
  --model-ref latest --eval-ref latest --tenant muno --project travel

8. Next iteration — where to put the effort

The reference run is data-bound, not architecture-bound. Highest-leverage work, in order:

1. Event-log replay → ~1000 turns. The config's event_log_replay block is already status: enabled (cap_turns: 1000, reads real Muno planner extract_turn results via the server API, PII-redacted). Wiring that into the train set is the single biggest lever.
2. Richer render conditioning: feed the render pass the tool_summary / slot context the oracle sees, so it emits the right widget on data-bearing turns (fixes the weak widget_type_match).
3. True grammar-constrained generation at decode (logit masking / outlines over the contract enums) instead of only post-hoc repair.
4. Larger candidate (qwen2.5-3B) only if 1.5B plateaus below the gate after (1)–(3).

Related

Travel SLM journey — the track record + result analysis.
Travel-domain SLM — the design-stage RFC summary.
Widget contract — the render-output schema the SLM emits.
Playbook: itinerary-planner — the consuming playbook.
Template pack README: automation/mlops/slm/README.md
Playbooks: finetune.yaml · eval.yaml · package.yaml · dataset_build.yaml
Model card: examples/travel-mlx-v1/MODEL_CARD.md
