Training the Travel SLM
A step-by-step, reproducible guide to building the travel-domain dataset, fine-tuning a small model on it, evaluating it against the deterministic floor, and registering the lineage. Written for an engineer who has never seen this code before.
For the why and the result history, read the Travel SLM journey page first. This page is the how.
Two training paths are covered:
- Path A — local on Apple Silicon (MLX). No GPU pool, no container. This is what produced the first real fine-tune; start here.
- Path B — GKE GPU container (PEFT/CUDA). The production path, gated on infra the operator must provision.
Everything runs as NoETL playbooks (MLOps-as-playbooks). The engine
logic lives in repos/ops/automation/mlops/slm/lib/; the playbooks are
thin wrappers. You can also call the libs directly (the guide shows both).
The lifecycle is a four-stage pipeline, each stage a playbook, each writing a versioned entry into the G3 registry with lineage to the previous:
dataset_build → finetune → eval(candidate=slm) → package
(seed+teacher) (LoRA adapter) (vs oracle floor) (model card + release)
↓ dataset ↓ model ↓ eval ↓ release
registry://muno/travel/{dataset,model,eval,release}/<version>
The design center: constrain the output, don't enlarge the model. Both
the stub and the real model propose JSON; the contract schemas dispose
(_constrain_extract / _constrain_render repair toward a minimal
schema-valid form). That is why schema validity stays at 100% regardless of
model quality — the match metrics are where real quality shows.
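A minimal sketch of the propose/dispose idea, assuming a toy render vocabulary. It is not the real `_constrain_extract` / `_constrain_render` code: the function name `constrain_proposal`, the enum values other than `calendar_live`, and the `text` fallback are hypothetical stand-ins for whatever the contract schemas actually declare.

```python
import json

# Hypothetical vocab for illustration; the real enums come from slm.config.yaml.
RENDER_KINDS = {"calendar_live", "flight_list", "text"}
FALLBACK_KIND = "text"  # assumed minimal schema-valid default


def constrain_proposal(raw: str) -> dict:
    """Repair a model's raw output toward a minimal schema-valid payload."""
    try:
        proposal = json.loads(raw)
    except json.JSONDecodeError:
        proposal = {}
    if not isinstance(proposal, dict):
        proposal = {}

    intent = proposal.get("render_intent")
    kind = intent.get("kind") if isinstance(intent, dict) else None
    if not isinstance(kind, str) or kind not in RENDER_KINDS:
        kind = FALLBACK_KIND  # out-of-vocab or missing: fall back, never fail

    slot_updates = proposal.get("slot_updates")
    tool_requests = proposal.get("tool_requests")
    return {
        "render_intent": {"kind": kind},
        "slot_updates": slot_updates if isinstance(slot_updates, dict) else {},
        "tool_requests": tool_requests if isinstance(tool_requests, list) else [],
    }
```

Whatever the model emits, the returned payload has the same three keys and an in-vocab kind, which is why validity cannot measure model quality in a design like this.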
| Path | Holds |
|---|---|
| repos/ops/automation/mlops/slm/ | the generic template pack: playbooks + lib/ engines |
| repos/travel/automation/mlops/slm/travel/ | the travel instance: slm.config.yaml, oracle.py, contracts, seed corpus, built datasets |
The one object the travel domain writes to drive the generic pack is
slm.config.yaml.
It declares the two roles (extract/render), their schemas, the oracle
module, the vocab enums, the data sources, the teacher, the model
candidates, and the eval gates. Read it once before starting.
- noetl CLI (for the playbook path): -r local runs the embedded Rust interpreter, no server needed.
- Python 3.12: the lib/ engines are pure-stdlib for dataset/eval (they ship their own draft-07 JSON-Schema validator because the runtime has no jsonschema).
- For Path A: an arm64 Python venv with mlx-lm (details below).
- For Path B: the operator-provisioned GPU infra (§6).
Every stage writes to the registry. Pick the backend with one env var:
- NOETL_REGISTRY_BACKEND=local + NOETL_REGISTRY_LOCAL_DIR=<dir>: file-backed mirror of the G3 server semantics. Use this for all local work (Path A and the offline smoke). Nothing touches the server.
- unset (server): real G3. Needs NOETL_SERVER_URL + NOETL_INTERNAL_API_TOKEN and a server started with NOETL_REGISTRY_ENABLED=true. This is the Path B / in-cluster mode.
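The selection rule above can be pictured as a small resolver. This is a sketch, not the logic in lib/slm_registry.py: the function name `describe_registry_backend` is hypothetical, and treating `NOETL_REGISTRY_LOCAL_DIR` as mandatory and rejecting unknown backend values are assumptions of this example.

```python
from typing import Mapping


def describe_registry_backend(env: Mapping[str, str]) -> str:
    """Return a short description of the backend the env vars select."""
    backend = env.get("NOETL_REGISTRY_BACKEND", "")
    if backend == "local":
        local_dir = env.get("NOETL_REGISTRY_LOCAL_DIR")
        if not local_dir:
            # Assumption: this sketch requires an explicit directory.
            raise ValueError("NOETL_REGISTRY_LOCAL_DIR must be set for the local backend")
        return f"local:{local_dir}"
    if backend not in ("", "server"):
        raise ValueError(f"unknown registry backend: {backend!r}")
    # Unset (or "server") means the real G3 server, which needs URL + token.
    missing = [k for k in ("NOETL_SERVER_URL", "NOETL_INTERNAL_API_TOKEN") if not env.get(k)]
    if missing:
        raise ValueError("server backend needs: " + ", ".join(missing))
    return f"server:{env['NOETL_SERVER_URL']}"
```

Calling `describe_registry_backend(os.environ)` before a long run is a cheap way to confirm that local work will not reach the server.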
The latest built dataset is committed at
repos/travel/automation/mlops/slm/travel/datasets/build/travel/v1_constrained/
(train.jsonl, eval.jsonl, manifest.json). If it is present you can
skip straight to training.
To rebuild it from the seed corpus:
# from the ai-meta root
noetl exec repos/ops/automation/mlops/slm/dataset_build.yaml -r local \
--set config=repos/travel/automation/mlops/slm/travel/slm.config.yaml

What this does: reads the seed corpus, labels every turn with the
deterministic oracle (authoritative), optionally augments with the
constrained teacher (gemini-2.5-flash via Vertex — only if the
teacher block is enabled and you have WI/Vertex access), validates every
output against the contract schemas, repairs teacher render payloads toward
the oracle on schema failure, splits 70/30 (seed muno-travel-13), and
writes versioned JSONL + a manifest.json.
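One way to get a reproducible seeded split is to hash each turn id together with the seed. This is a sketch of that technique only. Whether the dataset engine hashes or shuffles is not stated here, and `assign_split` is a hypothetical name. A hash-based split only approximates the nominal ratio on a small corpus.

```python
import hashlib


def assign_split(turn_id: str, seed: str = "muno-travel-13", train_pct: int = 70) -> str:
    """Deterministically place a turn in 'train' or 'eval' from (seed, id)."""
    digest = hashlib.sha256(f"{seed}:{turn_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "train" if bucket < train_pct else "eval"


# The same id always lands in the same split, independent of corpus order.
turn_ids = [f"t{i:03d}" for i in range(1, 46)]
splits = {tid: assign_split(tid) for tid in turn_ids}
```

Because the assignment depends only on the seed and the id, adding new turns later never moves an existing turn between train and eval.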
The result you want is v1_constrained: 45 turns → 58 multitask
examples (one extract + one render per turn), 100% schema validity.
Teacher note. The teacher is optional for training (the model trains on oracle labels). Rebuilding without Vertex access still produces a valid oracle-labeled dataset; the teacher columns just stay empty. The teacher run that produced
v1_constrained cost $0.068 (85 flash calls). The token is minted in-python from the worker pod's Workload Identity; there is no API key to set.
Each line of train.jsonl / eval.jsonl:
{
"id": "t002",
"input": { ...user event + slot_state + thread context... },
"intent_label": "...",
"label_source": "deterministic_oracle",
"labels": { "extract": {...}, "render": {...} }, // authoritative target
"labels_teacher": { "extract": {...}, "render": {...} }, // raw constrained teacher
"labels_teacher_repaired":{ "extract": {...}, "render": {...} }, // teacher repaired to oracle
"teacher_valid": true,
"valid": true
}

The finetune engine flattens each turn into two prompt/completion training examples (one for the extract role, one for the render role), using the role system prompts as the prompt and the (JSON) oracle label as the completion. Example MLX training line:

{"prompt": "# Muno Itinerary Planner Extraction Prompt\n...", "completion": "{\"render_intent\": {\"kind\": \"calendar_live\"}, \"slot_updates\": {}, \"tool_requests\": []}"}

This is the path that produced the first real travel SLM. A 1–1.5B LoRA
fits in unified memory, so the whole dataset → finetune → eval → package
spine runs on a Mac with the file-backed registry.
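Before training, it helps to picture the flattening step that turns one dataset record into an extract example and a render example. The sketch below is illustrative, not the slm_finetune.py code: `flatten_turn` is a hypothetical name, and appending the serialized `input` after the system prompt is an assumption about how the prompt is assembled.

```python
import json


def flatten_turn(turn: dict, system_prompts: dict) -> list[dict]:
    """Turn one dataset record into one training example per role."""
    context = json.dumps(turn["input"], sort_keys=True)
    examples = []
    for role in ("extract", "render"):
        examples.append({
            # Assumed layout: role system prompt, blank line, serialized input.
            "prompt": f"{system_prompts[role]}\n\n{context}",
            # The completion is the authoritative oracle label, as JSON text.
            "completion": json.dumps(turn["labels"][role], sort_keys=True),
        })
    return examples


turn = {
    "id": "t002",
    "input": {"text": "show my calendar"},  # made-up input for the example
    "labels": {
        "extract": {"slot_updates": {}, "tool_requests": []},
        "render": {"render_intent": {"kind": "calendar_live"}},
    },
}
prompts = {"extract": "# Extraction Prompt", "render": "# Render Prompt"}
examples = flatten_turn(turn, prompts)
```

Each example's completion round-trips through `json.loads` back to the oracle label, so the model is only ever trained to emit parseable JSON.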
# arm64 Python venv with mlx-lm
python3 -m venv .slm-venv
.slm-venv/bin/pip install mlx-lm

mlx-lm 0.31.3 was used for the reference run. The base weights
(Qwen/Qwen2.5-1.5B-Instruct, ≈3 GB) download from Hugging Face on first
run and cache locally.
export SLM_DATASET_VERSION=v1_constrained
export NOETL_REGISTRY_BACKEND=local
export NOETL_REGISTRY_LOCAL_DIR=$PWD/.slm_registry
CONFIG=$PWD/repos/travel/automation/mlops/slm/travel/slm.config.yaml
LIB=repos/ops/automation/mlops/slm/lib

Calling the engine directly (the exact reference invocation):
.slm-venv/bin/python $LIB/slm_finetune.py --config "$CONFIG" --backend mlx \
--tenant muno --project travel --learning-rate 1e-4 \
--mlx-iters 800 --mlx-batch-size 1 --mlx-num-layers 16 --mlx-max-seq-length 2048

Or via the playbook (python_bin must point at the arm64 venv):
noetl exec repos/ops/automation/mlops/slm/finetune.yaml -r local \
--set mode=mlx \
--set python_bin=.slm-venv/bin/python \
--set dataset_version=v1_constrained \
--set learning_rate=1e-4 \
--set mlx_iters=800 --set mlx_batch_size=1 \
--set mlx_num_layers=16 --set mlx_max_seq_length=2048

Hyperparameters used for the reference run (and why they're a sane default for this dataset size):
| Knob | Value | Note |
|---|---|---|
| base model | Qwen/Qwen2.5-1.5B-Instruct | small enough for unified memory; bounded vocab generalizes |
| iters | 800 | the val-loss curve flattens by ~iter 200 (0.811 → 0.089); 800 is comfortably past convergence on 58 examples |
| batch size | 1 | tiny dataset; keep it simple |
| num layers | 16 (top-16) | LoRA only the top layers → 5.276 M trainable params (0.342%) |
| learning rate | 1e-4 | standard LoRA lr |
| max seq length | 2048 | the prompts (system prompt + context) fit comfortably |
| --mask-prompt | on | train on the completion only, not the prompt |
Expected cost on an M1 Max / 32 GB: ~25 min wall-clock, ~9.7 GB peak memory, final val loss ≈0.089. The engine writes an adapter-only artifact, tars it, and registers a G3 model (lineage → the dataset entry).
The adapter and train_report.json land under
datasets/build/travel/v1_constrained/models/travel_slm_multitask-mlx/
(git-ignored — large binary).
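A quick sanity check on the numbers in the hyperparameter table: dividing the trainable parameter count by the trainable fraction should land near the size of the base model. This is plain arithmetic on the figures quoted above. The helper name `implied_total_params` is made up for the example.

```python
def implied_total_params(trainable_params: float, trainable_pct: float) -> float:
    """Total parameter count implied by a trainable count and its percentage."""
    return trainable_params / (trainable_pct / 100.0)


total = implied_total_params(5.276e6, 0.342)
print(f"{total / 1e9:.2f}B parameters")  # prints "1.54B parameters"
```

If a future train_report.json shows a trainable fraction that implies a very different total, the adapter was probably built against a different base model or layer count.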
.slm-venv/bin/python $LIB/slm_eval.py --config "$CONFIG" --candidate slm \
--model-ref latest --tenant muno --project travel --register

Or via the playbook:
noetl exec repos/ops/automation/mlops/slm/eval.yaml -r local \
--set candidate=slm --set model_ref=latest --set register=true \
--set dataset_version=v1_constrained

This pulls the registered model, runs it on the 16-example holdout under
the same schema-constrained decoding the teacher used, computes the
match/validity/latency metrics, gates them against the config targets, and
(with --register) writes a G3 eval entry (lineage → model). It prints
the gate result and writes eval_report.json.
.slm-venv/bin/python $LIB/slm_package.py --config "$CONFIG" \
--model-ref latest --eval-ref latest --tenant muno --project travel

Or via the playbook:
noetl exec repos/ops/automation/mlops/slm/package.yaml -r local \
--set model_ref=latest --set eval_ref=latest

Exports the model, writes a MODEL_CARD.md (metrics included), bundles
model + card + eval report, and registers a G3 release (lineage →
[model, eval]).
To verify the orchestration end-to-end using the CPU stub backend (no MLX, no GPU, seconds not minutes):
cd repos/ops/automation/mlops/slm
NOETL_REGISTRY_BACKEND=local \
python3 lib/slm_pipeline_smoke.py \
--config /abs/path/repos/travel/automation/mlops/slm/travel/slm.config.yaml \
--dataset-version v1_constrained

It asserts the dataset → model → eval → release lineage DAG and that constrained decoding holds schema validity at 1.0. Exits non-zero on any failure. The stub intentionally fails the match gate, which proves the gate is real, not faked.
The eval scores the candidate's per-field decisions against the oracle
labels (the authoritative target) on the holdout split. Two families:
- Validity (target 1.0, hard): widget_schema_validity, extract_schema_validity, tool_vocab_validity, render_intent_vocab_validity. These are 100% by construction because of constrained decoding; if one drops below 1.0, the constrain/repair lever is broken, not the model.
- Match (targets 0.95–0.98): tool_match, render_intent_match, widget_type_match, arg_fidelity, slot_update_match. This is where real model quality lives.
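A match metric of this kind can be read as the fraction of holdout turns where the candidate's field equals the oracle's. The sketch below shows that reading with exact equality. The real slm_eval.py scoring may normalize or compare fields differently, and `match_rate` plus the row layout are hypothetical.

```python
def match_rate(candidate_rows: list[dict], oracle_rows: list[dict], field: str) -> float:
    """Fraction of aligned rows where the candidate's field equals the oracle's."""
    if len(candidate_rows) != len(oracle_rows):
        raise ValueError("candidate and oracle rows must be aligned one-to-one")
    if not oracle_rows:
        return 0.0
    hits = sum(1 for cand, gold in zip(candidate_rows, oracle_rows)
               if cand.get(field) == gold.get(field))
    return hits / len(oracle_rows)


# 16 holdout turns, 9 of which agree on the tool: 9 / 16 = 0.5625.
oracle = [{"tool": "search_flights"} for _ in range(16)]  # made-up tool name
candidate = [{"tool": "search_flights"}] * 9 + [{"tool": "none"}] * 7
rate = match_rate(candidate, oracle, "tool")
```

On a 16-turn holdout every score is a multiple of 1/16, so a single turn flipping moves a metric by 0.0625.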
Interpreting results vs the floor: the deterministic oracle scores 1.0 on every match metric by definition (it is the label source). So a model never "beats" the oracle — it approaches it. The honest question is how much of the 0.95–0.98 gate the model closes, and whether it beats the trivial baselines (the CPU stub's retrieval). The reference run:
gate: FAIL
tool_match = 0.5625 < 0.98
render_intent_match = 0.5625 < 0.98
widget_type_match = 0.3750 < 0.98
arg_fidelity = 0.5625 < 0.95
slot_update_match = 0.6250 < 0.95
(all four validity rows = 1.0000, PASS)
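The gate itself is a threshold comparison per metric. The sketch below replays the reference-run numbers against the targets shown above. The function name `evaluate_gate` and the dict layout are assumptions for illustration, not the format of the eval gates in slm.config.yaml.

```python
TARGETS = {
    "widget_schema_validity": 1.0, "extract_schema_validity": 1.0,
    "tool_vocab_validity": 1.0, "render_intent_vocab_validity": 1.0,
    "tool_match": 0.98, "render_intent_match": 0.98, "widget_type_match": 0.98,
    "arg_fidelity": 0.95, "slot_update_match": 0.95,
}

REFERENCE_RUN = {
    "widget_schema_validity": 1.0, "extract_schema_validity": 1.0,
    "tool_vocab_validity": 1.0, "render_intent_vocab_validity": 1.0,
    "tool_match": 0.5625, "render_intent_match": 0.5625, "widget_type_match": 0.3750,
    "arg_fidelity": 0.5625, "slot_update_match": 0.6250,
}


def evaluate_gate(metrics: dict, targets: dict) -> tuple[bool, dict]:
    """Return (passed, failures); a missing metric counts as 0.0 and fails."""
    failures = {
        name: (metrics.get(name, 0.0), target)
        for name, target in targets.items()
        if metrics.get(name, 0.0) < target
    }
    return (not failures), failures


passed, failures = evaluate_gate(REFERENCE_RUN, TARGETS)
for name, (value, target) in failures.items():
    print(f"{name} = {value:.4f} < {target}")
```

Treating a missing metric as a failure keeps a silently dropped column from turning into a pass.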
eval_report.json also carries divergences[] — the concrete turns where
candidate and floor disagree, with diverged_fields. That list is your
debugging surface for the next iteration (e.g. t010 "what flights are available" diverged on tool/intent/arg).
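A simple way to mine that list is to count how often each field diverges, which shows where the next iteration should focus. This sketch assumes each divergence entry is a dict with an `id` and a `diverged_fields` list. Only `divergences` and `diverged_fields` are named on this page, so the rest of the shape is a guess, and `tally_diverged_fields` is a hypothetical helper.

```python
from collections import Counter


def tally_diverged_fields(report: dict) -> Counter:
    """Count how many divergent turns involve each field."""
    counts = Counter()
    for entry in report.get("divergences", []):
        counts.update(entry.get("diverged_fields", []))
    return counts


# Illustrative fragment only, not real eval output; "t999" is made up.
sample_report = {
    "divergences": [
        {"id": "t010", "diverged_fields": ["tool", "intent", "arg"]},
        {"id": "t999", "diverged_fields": ["tool"]},
    ]
}
field_counts = tally_diverged_fields(sample_report)
```

In practice the report would be loaded with `json.load` from eval_report.json and the tally sorted with `field_counts.most_common()`.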
Every stage writes a versioned entry addressed by a URN:
registry://<kind>/<name>/<version> (or fully-qualified
registry://<tenant>/<project>/<kind>/<name>/<version>; <version> is an
integer or latest). The reference run produced:
dataset registry://muno/travel/dataset/travel_v1_constrained/1
model registry://muno/travel/model/travel_slm_multitask/1 (lineage → dataset)
eval registry://muno/travel/eval/travel_slm_multitask/1 (lineage → model)
release registry://muno/travel/release/travel_slm_multitask/1 (lineage → [model, eval])
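The two URN shapes can be parsed with a few lines of string handling. This is a sketch written from the format described above, not the parser in lib/slm_registry.py. `RegistryRef` and `parse_registry_urn` are hypothetical names.

```python
from typing import NamedTuple, Optional, Union


class RegistryRef(NamedTuple):
    tenant: Optional[str]
    project: Optional[str]
    kind: str
    name: str
    version: Union[int, str]  # an integer, or the literal "latest"


def parse_registry_urn(urn: str) -> RegistryRef:
    """Parse the short (3-part) or fully-qualified (5-part) registry URN."""
    prefix = "registry://"
    if not urn.startswith(prefix):
        raise ValueError(f"not a registry URN: {urn!r}")
    parts = urn[len(prefix):].split("/")
    if any(not part for part in parts) or len(parts) not in (3, 5):
        raise ValueError(f"expected 3 or 5 non-empty segments: {urn!r}")
    tenant, project = (parts[0], parts[1]) if len(parts) == 5 else (None, None)
    kind, name, raw_version = parts[-3:]
    if raw_version == "latest":
        version: Union[int, str] = "latest"
    elif raw_version.isdigit():
        version = int(raw_version)
    else:
        raise ValueError(f"version must be an integer or 'latest': {raw_version!r}")
    return RegistryRef(tenant, project, kind, name, version)


ref = parse_registry_urn("registry://muno/travel/model/travel_slm_multitask/1")
```

Keeping `latest` as a string rather than resolving it here leaves resolution to the registry, which is the only place that knows the current highest version.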
Large bytes (datasets, adapter weights, eval reports) live in the object
store; the registry entry records where (artifact_uri) and how it was
produced (metadata + lineage). The client lib is lib/slm_registry.py.
It is server-mediated for the real backend (calls /api/internal/registry/* +
/api/internal/objects/* with the internal token; never direct DB/object access, per the data-access boundary).
The mode=container path dispatches the same engine with
--backend peft as a G1 GPU k8s Job; the worker frees its slot via
G2 poll-to-terminal so the hours-long run doesn't hold a slot; the Job
registers the model into G3 from inside the cluster.
It is gated — none of this is on prod. The operator must provision and
approve all of the following before flipping mode=container:
- GPU node pool: a GKE pool with GPUs (e.g. nvidia-l4), tainted nvidia.com/gpu=present:NoSchedule, labelled cloud.google.com/gke-accelerator=nvidia-l4, with the NVIDIA device-plugin DaemonSet.
- Training image (ghcr.io/noetl/slm-trainer:<tag>): python3 + CUDA torch + transformers + peft + datasets + accelerate + the lib/ engine at /opt/slm/lib; base-model weights baked or runtime-pullable.
- Dataset PVC (slm-data): pre-populated with slm.config.yaml + the built dataset under /data/datasets/build/<project>/<version>/, writable for the /data/models output.
- ServiceAccount (noetl-slm-trainer): Workload-Identity-bound to a GSA with registry + GCS write, carrying the internal-API token Secret noetl-internal-api.
- Platform flags (review-only today, NOT on prod): server NOETL_REGISTRY_ENABLED=true (G3); worker NOETL_CONTAINER_COMPLETION_POLL=true (G2); the G1 container tool (tools ≥ 3.19.0) on the worker image.
- Worker RBAC: the worker SA needs batch/jobs.create in the Job namespace.
Dispatch (after the checklist is satisfied):
noetl exec automation/mlops/slm/finetune.yaml -r distributed \
--set mode=container \
--set dataset_version=v1_constrained \
--set train_image=ghcr.io/noetl/slm-trainer:<tag> \
--set gpu_accelerator=nvidia-l4 --set gpu_count=1 \
--set data_pvc=slm-data --set service_account=noetl-slm-trainer

The container step's node selector, GPU tolerations, resource requests,
PVC mount, and env (server URL, registry backend server, internal token
from the Secret) are all declared in finetune.yaml
(step: finetune_container); see the # PROD-GPU CHECKLIST block at the
bottom of that file.
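To make the checklist concrete, here is a sketch of how those values map onto standard Kubernetes pod-spec fields. It is not a copy of finetune.yaml. The function name `gpu_pod_spec`, the container name, and the volume name are invented for the example, and env wiring is left out.

```python
def gpu_pod_spec(train_image: str, accelerator: str = "nvidia-l4", gpu_count: int = 1,
                 data_pvc: str = "slm-data",
                 service_account: str = "noetl-slm-trainer") -> dict:
    """Build a pod spec dict that lands on the tainted GPU pool."""
    return {
        "serviceAccountName": service_account,
        "restartPolicy": "Never",
        # Matches the node label from the checklist.
        "nodeSelector": {"cloud.google.com/gke-accelerator": accelerator},
        # Tolerates the nvidia.com/gpu=present:NoSchedule taint.
        "tolerations": [{
            "key": "nvidia.com/gpu", "operator": "Equal",
            "value": "present", "effect": "NoSchedule",
        }],
        "containers": [{
            "name": "slm-trainer",  # hypothetical container name
            "image": train_image,
            "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
            # /data holds the built dataset and receives /data/models output.
            "volumeMounts": [{"name": "data", "mountPath": "/data"}],
        }],
        "volumes": [{"name": "data", "persistentVolumeClaim": {"claimName": data_pvc}}],
    }


spec = gpu_pod_spec("ghcr.io/noetl/slm-trainer:<tag>")
```

A pod with the toleration but without the node selector could still be scheduled onto a non-GPU node, which is why the checklist calls for both the taint and the label.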
# from the ai-meta root
python3 -m venv .slm-venv && .slm-venv/bin/pip install mlx-lm
export SLM_DATASET_VERSION=v1_constrained
export NOETL_REGISTRY_BACKEND=local
export NOETL_REGISTRY_LOCAL_DIR=$PWD/.slm_registry
CONFIG=$PWD/repos/travel/automation/mlops/slm/travel/slm.config.yaml
LIB=repos/ops/automation/mlops/slm/lib
# train (≈25 min on M1 Max)
.slm-venv/bin/python $LIB/slm_finetune.py --config "$CONFIG" --backend mlx \
--tenant muno --project travel --learning-rate 1e-4 \
--mlx-iters 800 --mlx-batch-size 1 --mlx-num-layers 16 --mlx-max-seq-length 2048
# eval (writes eval_report.json, prints the gate)
.slm-venv/bin/python $LIB/slm_eval.py --config "$CONFIG" --candidate slm \
--model-ref latest --tenant muno --project travel --register
# package (model card + release)
.slm-venv/bin/python $LIB/slm_package.py --config "$CONFIG" \
--model-ref latest --eval-ref latest --tenant muno --project travel

The reference run is data-bound, not architecture-bound. Highest-leverage work, in order:
1. Event-log replay → ~1000 turns. The config's event_log_replay block is already status: enabled (cap_turns: 1000, reads real Muno planner extract_turn results via the server API, PII-redacted). Wiring that into the train set is the single biggest lever.
2. Richer render conditioning: feed the render pass the tool_summary / slot context the oracle sees, so it emits the right widget on data-bearing turns (fixes the weak widget_type_match).
3. True grammar-constrained generation at decode (logit masking / outlines over the contract enums) instead of only post-hoc repair.
4. Larger candidate (qwen2.5-3B) only if 1.5B plateaus below the gate after (1)–(3).
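To illustrate what logit masking means in the grammar-constrained item above, here is a single-step toy. It treats each enum value as one token, where real decoders work over sub-word token ids and a grammar state. The function name `masked_argmax` and the scores are invented. It shows the difference from post-hoc repair: the out-of-vocab choice is never sampled in the first place.

```python
import math


def masked_argmax(logits: dict[str, float], allowed: set[str]) -> str:
    """Pick the highest-scoring token after masking everything not allowed."""
    masked = {tok: (score if tok in allowed else -math.inf)
              for tok, score in logits.items()}
    if not masked or max(masked.values()) == -math.inf:
        raise ValueError("no allowed token has a finite score")
    return max(masked, key=masked.get)


# Made-up scores: the model prefers an out-of-vocab value.
logits = {"calendar_live": 1.2, "banana": 3.0, "text": 0.1}
choice = masked_argmax(logits, allowed={"calendar_live", "text"})
```

With the mask applied, the best in-vocab candidate wins even though an invalid token scored higher, so the output reflects the model's own ranking rather than a repair default.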
- Travel SLM journey — the track record + result analysis.
- Travel-domain SLM — the design-stage RFC summary.
- Widget contract — the render-output schema the SLM emits.
- Playbook: itinerary-planner — the consuming playbook.
- Template pack README: automation/mlops/slm/README.md
- Playbooks: finetune.yaml · eval.yaml · package.yaml · dataset_build.yaml
- Model card: examples/travel-mlx-v1/MODEL_CARD.md