
refactor: clean up legacy template fields and surface remaining knobs #34

Merged
timzsu merged 11 commits into main from zsu/rfc28-cleanup-rewire
May 12, 2026

Conversation


@timzsu timzsu commented May 11, 2026

Purpose

PR-2 of RFC #28. Removes template fields the runtime ignores, surfaces remaining legacy knobs under names that match the underlying backend, and unifies YAML scalar quoting across examples/templates/. Also fixes a latent crash in the single-GPU PPO save wrapper that surfaced when validating the PPO config wiring end-to-end. PPO target_kl / early_stopping are deferred to PR-3 because they need a real KL-watching early-stop loop; templates keep the fields and the executor carries a TODO marker so the spec doesn't churn between PRs.

Changes

Template cleanup:

  • examples/templates/*.yaml — drop resources.replicas, spec.sloSeconds, and spec.parallel blocks; none of those fields reaches the dispatcher or worker selection.
  • examples/templates/ppo_*.yaml — drop generation.do_sample and training.optimize_cuda_cache; TRL's PPOTrainer.train hardcodes do_sample=True, and the cache flag has no step-based contract under the custom PPO loop.
  • examples/templates/*.yaml — unify YAML scalars on bare form: drop unnecessary quotes, promote quoted ints (e.g. cpu: "8" → cpu: 8), and keep quotes only where YAML grammar requires them (split: "train[:1%]", save_strategy: "no" so it doesn't parse as False, etc.; see the parsing snippet after this list).
  • examples/templates/inference_vllm_all_args.yaml — move revision from model.vllm to model.source so all vLLM templates use the same canonical revision location.
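
The save_strategy: "no" caveat in the quoting rule is easy to trip over; a quick PyYAML check illustrates why those quotes stay (PyYAML's default YAML 1.1 resolution is assumed here, not verified against the repo's actual loader):

```python
import yaml  # PyYAML; its default resolvers follow YAML 1.1

print(yaml.safe_load("save_strategy: no"))    # {'save_strategy': False} -- boolean!
print(yaml.safe_load('save_strategy: "no"'))  # {'save_strategy': 'no'}  -- string, as HF expects
print(yaml.safe_load('cpu: "8"'))             # {'cpu': '8'} -- quoted int arrives as a string
print(yaml.safe_load("cpu: 8"))               # {'cpu': 8}   -- bare form, the promoted style
```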

Executor surfacing:

  • src/worker/executors/agent_executor.py — honour spec.agent.timeout (default 600 s); requires a positive integer, rejecting floats outright so sub-second values like 0.5 don't silently truncate to 0 (validation sketch after this list). Threaded through both the single-task and batch streaming paths.
  • src/worker/executors/ppo_executor.py + PPO templates — adopt the canonical training.report_to and training.project field names directly (matching PPOConfig); a small helper normalises null / empty values to TRL's "none" sentinel (sketch after this list). training.padding_side was already wired through build_hf_load_kwargs; locked down with a test. A TODO(rfc28-pr3) next to the kl_coef read flags training.target_kl / training.early_stopping as intentionally unwired pending PR-3.
  • src/worker/executors/vllm_executor.py — stamp revision from spec.model_revision (the canonical model.source.revision); drop the redundant model.vllm.revision lookup and the resolver helper, aligning vLLM with the other executors. tokenizer_revision stays under model.vllm since it's a distinct knob.
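
A minimal sketch of the timeout contract described above — the helper name is hypothetical, and ValueError stands in for the executor's ExecutionError:

```python
def resolve_agent_timeout(agent_spec: dict, default: int = 600) -> int:
    """Return spec.agent.timeout as a positive int; reject everything else."""
    raw = agent_spec.get("timeout", default)
    # bool is a subclass of int in Python, so screen it out explicitly;
    # floats (even whole-valued ones like 600.0) are rejected rather than
    # truncated, so 0.5 fails loudly instead of silently becoming 0.
    if isinstance(raw, bool) or not isinstance(raw, int) or raw <= 0:
        raise ValueError(f"spec.agent.timeout must be a positive integer, got {raw!r}")
    return raw
```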

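And a sketch of the report_to normalisation (function name hypothetical; the empty-list fallback is an assumption about edge-case handling, not confirmed behaviour):

```python
def normalize_report_to(value):
    # null / "" / [] disable logging via TRL's "none" sentinel
    if value is None or value == "" or value == []:
        return "none"
    # string mode returns the string unchanged
    if isinstance(value, str):
        return value
    # list mode keeps only str items; assume "none" if nothing survives
    if isinstance(value, list):
        kept = [v for v in value if isinstance(v, str)]
        return kept if kept else "none"
    return "none"  # assumption: any other type disables logging
```
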
PPO save fix:

  • src/worker/executors/ppo_executor.py — _wrapped_save_model unconditionally read ppo_trainer.deepspeed, but HF Trainer only sets that attribute when DeepSpeed is enabled, so single-GPU PPO crashed on the first _save_checkpoint with AttributeError. The backup/restore now lives inside the existing is_deepspeed_enabled guard (simplified sketch below).
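
Simplified shape of the fix — names are illustrative, and whatever the wrapper does with the engine reference during the save itself is elided:

```python
def _wrapped_save_model(ppo_trainer, save_fn, *args, **kwargs):
    if ppo_trainer.is_deepspeed_enabled:
        backup = ppo_trainer.deepspeed      # safe here: HF Trainer sets this under DeepSpeed
        try:
            save_fn(*args, **kwargs)
        finally:
            ppo_trainer.deepspeed = backup  # restore lives inside the same guard
    else:
        save_fn(*args, **kwargs)            # single-GPU path: attribute never touched
```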

Tests:

  • tests/worker/test_agent_timeout.py — agent timeout resolution (int-only contract; illustrative shape below).
  • tests/worker/test_ppo_config_mapping.py — report_to / project / padding_side wiring.
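
Illustrative shape of the timeout test, reusing the hypothetical helper sketched under Executor surfacing (the real file's structure may differ):

```python
import pytest

@pytest.mark.parametrize("raw, ok", [
    (600, True), (1, True),          # positive ints pass
    (0, False), (-5, False),         # non-positive ints rejected
    (0.5, False), (600.0, False),    # floats rejected, even whole-valued
])
def test_agent_timeout_contract(raw, ok):
    if ok:
        assert resolve_agent_timeout({"timeout": raw}) == raw
    else:
        with pytest.raises(ValueError):
            resolve_agent_timeout({"timeout": raw})
```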

Design

The fields handled here split into two groups:

  • Dead. Schema accepts them, runtime ignores them. Removing from templates avoids advertising unsupported behaviour to users.
  • Live but unwired (or under the wrong name). The backend (TRL PPOConfig, vLLM engine, AgentExecutor) supports the knob, but the worker never threaded the template value through, or the template used a non-canonical name. The fix is a small, local change; where the legacy template names diverged from the backend (log_with, tracker_project_name, vLLM-scoped revision), we align on the backend name rather than maintaining an alias.

target_kl / early_stopping are a separate case — TRL doesn't take them directly, and FlowMesh-owned KL early-stopping is a non-trivial change that warrants its own PR. The TODO marker keeps the connection between templates and executor explicit without dropping the fields.

Test Plan

  • uv run pytest tests/server tests/shared tests/sdk tests/cli plus the focused unit tests for the executor wiring and the existing template-parse coverage in tests/server/task/test_template_validation.py.
  • Live spot-check on the local stack: echo_local, inference_hf_tiny, inference_vllm_tiny, inference_vllm_all_args (after the revision relocation), lora_sft_llama, ppo_training_llama_1b end-to-end on a single GPU worker.

Test Result

  • 474 passed (server + shared + sdk + cli) + unit tests on the touched executor paths.
  • Live e2e on local stack (1 GPU worker):
    • echo_local — DONE.
    • inference_hf_tiny — DONE; transformers path unchanged.
    • inference_vllm_tiny — DONE; vLLM picked up the revision from model.source.revision.
    • inference_vllm_all_args — DONE; confirms model.source.revision is the only revision source after the helper was removed.
    • lora_sft_llama — DONE; training_successful=True, final_lora_archive produced (padding_side wiring exercised).
    • ppo_training_llama_1b — DONE; training_successful=True, runtime ~122 s, ran past save_steps=25/50, confirming both the PPOConfig wiring and the _save_checkpoint fix.
    • agent_simple_test — skipped; pre-existing requirement on the worker's secrets.yaml (UTU_LLM_API_KEY), unrelated to this PR.

Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run pre-commit run --all-files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that uv run pytest tests/ passes locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
  • If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen).
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
  • I have updated documentation or config examples if user-facing behavior changed.

timzsu added 8 commits May 11, 2026 12:52
Remove resources.replicas, spec.sloSeconds, and spec.parallel from
templates. Per RFC #28, none of these fields are consumed by the
dispatcher or worker selection at runtime, so leaving them in
example workflows misrepresents supported behavior.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Remove generation.do_sample and training.optimize_cuda_cache from PPO
templates. TRL's PPOTrainer.train() hardcodes do_sample=True, and the
optimize_cuda_cache flag has no step-based contract under the custom
PPO loop, so neither field reaches a backend. Inference templates
keep do_sample because transformers_executor consumes it.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
AgentExecutor resolves spec.agent.timeout (default 600s) once per run
and threads it into both the single-task and batch streaming paths.
Validates that the value is a positive number and surfaces a clear
ExecutionError otherwise.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
PPOExecutor now maps the legacy template knobs onto TRL PPOConfig:
training.log_with -> report_to (null/empty disables logging via
"none"; strings/lists pass through). training.tracker_project_name
-> project. training.padding_side already threaded through the
tokenizer kwargs by build_hf_load_kwargs; locked down with a test.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
VLLMExecutor only forwarded "revision" when set under model.vllm,
silently dropping the common "model.source.revision" templates were
already setting. Pick the vllm-scoped key first, otherwise fall back
to the source revision, otherwise omit the kwarg.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Drop quotes from string scalars where YAML allows, and convert quoted
ints (cpu, count) to bare integers. Quotes are retained where the YAML
parser needs them (structural chars, leading whitespace, embedded
escapes, apostrophes, reserved literals like "no" used as a string
value).

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
The save_model wrapper unconditionally read ppo_trainer.deepspeed to
back it up around save, but HF Trainer only sets that attribute when
deepspeed is actually enabled, so the single-GPU PPO path crashed on
the first _save_checkpoint with AttributeError. Read via getattr with
a None fallback; the restore is already guarded by
is_deepspeed_enabled, so the placeholder is never read back.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
- Templates: rename training.log_with to training.report_to and
  training.tracker_project_name to training.project so the YAML keys
  match the TRL PPOConfig fields verbatim.
- PPO executor: read the canonical keys, simplify _resolve_report_to
  (string mode returns the string, list mode keeps only str items),
  and remove the getattr in the save wrapper by moving the deepspeed
  backup inside the existing is_deepspeed_enabled guard.
- vLLM revision helper: tighten the parameter type to str|None and
  drop redundant str() casts (callers already produce strings).

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
@timzsu timzsu mentioned this pull request May 12, 2026
timzsu added 2 commits May 12, 2026 06:47
Other executors only read `spec.model_revision` (i.e. model.source.revision);
vLLM kept a redundant model.vllm.revision lookup that no template used.
Stamp the engine kwarg directly from spec.model_revision, drop the
helper and the duplicate accepted_engine_args entry, and move the
revision line in inference_vllm_all_args.yaml under model.source so
it documents the only supported location.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
- AgentExecutor: spec.agent.timeout must be a positive int; reject
  floats outright (including whole-valued floats like 600.0) so we
  don't silently truncate sub-second values like 0.5 to 0.
- PPO _build_ppo_config: add a TODO marker next to the kl_coef read
  noting that training.target_kl and training.early_stopping are
  intentionally unwired pending RFC #28 PR-3; templates keep the
  fields so the spec doesn't churn between PRs.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
@timzsu timzsu marked this pull request as ready for review May 12, 2026 07:10
@timzsu timzsu requested a review from kaiitunnz May 12, 2026 07:11
Comment thread examples/templates/n8n/dag_inference.json Outdated
Comment thread examples/templates/n8n/sft_then_inference.json Outdated
Comment thread examples/templates/inference_vllm_all_args.yaml Outdated
Comment thread src/worker/executors/agent_executor.py
Comment thread src/worker/executors/agent_executor.py Outdated
Comment thread src/worker/executors/ppo_executor.py
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
@timzsu timzsu requested a review from kaiitunnz May 12, 2026 09:41

@kaiitunnz kaiitunnz left a comment


LGTM.

@timzsu timzsu merged commit e3e1285 into main May 12, 2026
11 checks passed
@timzsu timzsu deleted the zsu/rfc28-cleanup-rewire branch May 12, 2026 09:54
@timzsu timzsu linked an issue May 12, 2026 that may be closed by this pull request: [RFC]: Cleanup workflow templates.