refactor: clean up legacy template fields and surface remaining knobs (#34)
Merged
Remove resources.replicas, spec.sloSeconds, and spec.parallel from templates. Per RFC #28, none of these fields are consumed by the dispatcher or worker selection at runtime, so leaving them in example workflows misrepresents supported behavior. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Remove generation.do_sample and training.optimize_cuda_cache from PPO templates. TRL's PPOTrainer.train() hardcodes do_sample=True, and the optimize_cuda_cache flag has no step-based contract under the custom PPO loop, so neither field reaches a backend. Inference templates keep do_sample because transformers_executor consumes it. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
AgentExecutor resolves spec.agent.timeout (default 600s) once per run and threads it into both the single-task and batch streaming paths. Validates that the value is a positive number and surfaces a clear ExecutionError otherwise. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
PPOExecutor now maps the legacy template knobs onto TRL PPOConfig: training.log_with -> report_to (null/empty disables logging via "none"; strings/lists pass through). training.tracker_project_name -> project. training.padding_side already threaded through the tokenizer kwargs by build_hf_load_kwargs; locked down with a test. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
VLLMExecutor only forwarded "revision" when set under model.vllm, silently dropping the "model.source.revision" that templates were already setting. Pick the vllm-scoped key first, otherwise fall back to the source revision, otherwise omit the kwarg. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Drop quotes from string scalars where YAML allows, and convert quoted ints (cpu, count) to bare integers. Quotes are retained where the YAML parser needs them (structural chars, leading whitespace, embedded escapes, apostrophes, reserved literals like "no" used as a string value). Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
The save_model wrapper unconditionally read ppo_trainer.deepspeed to back it up around save, but HF Trainer only sets that attribute when deepspeed is actually enabled, so the single-GPU PPO path crashed on the first _save_checkpoint with AttributeError. Read via getattr with a None fallback; the restore is already guarded by is_deepspeed_enabled, so the placeholder is never read back. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
- Templates: rename `training.log_with` to `training.report_to` and `training.tracker_project_name` to `training.project` so the YAML keys match the TRL `PPOConfig` fields verbatim.
- PPO executor: read the canonical keys, simplify `_resolve_report_to` (string mode returns the string, list mode keeps only str items), and remove the `getattr` in the save wrapper by moving the deepspeed backup inside the existing `is_deepspeed_enabled` guard.
- vLLM revision helper: tighten the parameter type to `str | None` and drop redundant `str()` casts (callers already produce strings).

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Other executors only read `spec.model_revision` (i.e. model.source.revision); vLLM kept a redundant model.vllm.revision lookup that no template used. Stamp the engine kwarg directly from spec.model_revision, drop the helper and the duplicate accepted_engine_args entry, and move the revision line in inference_vllm_all_args.yaml under model.source so it documents the only supported location. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
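For reference, a minimal sketch of the resulting vLLM resolution, assuming a dict-like `spec` and an `engine_kwargs` dict (both names illustrative, not the executor's real structures):

```python
def stamp_revision(spec: dict, engine_kwargs: dict) -> None:
    # Hypothetical helper: forward the canonical model.source.revision
    # (exposed to executors as spec.model_revision) to the engine kwargs.
    # No vllm-scoped fallback remains; the kwarg is omitted when unset.
    revision = spec.get("model_revision")
    if revision is not None:
        engine_kwargs["revision"] = revision
```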
- AgentExecutor: `spec.agent.timeout` must be a positive int; reject floats outright (including whole-valued floats like `600.0`) so we don't silently truncate sub-second values like `0.5` to `0`.
- PPO `_build_ppo_config`: add a TODO marker next to the `kl_coef` read noting that `training.target_kl` and `training.early_stopping` are intentionally unwired pending RFC #28 PR-3; templates keep the fields so the spec doesn't churn between PRs.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
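A minimal sketch of the int-only timeout contract described above, assuming the raw template value and the executor's `ExecutionError` type (the resolver name and exact messages are illustrative):

```python
class ExecutionError(Exception):
    """Stand-in for the worker's real error type."""


def resolve_agent_timeout(raw: object, default: int = 600) -> int:
    """Resolve spec.agent.timeout to a positive integer number of seconds."""
    if raw is None:
        return default
    # bool is a subclass of int in Python, so reject it explicitly; floats
    # are rejected outright (even whole-valued ones like 600.0) so that a
    # sub-second value like 0.5 can never silently truncate to 0.
    if isinstance(raw, bool) or not isinstance(raw, int):
        raise ExecutionError(
            f"spec.agent.timeout must be a positive integer, got {raw!r}"
        )
    if raw <= 0:
        raise ExecutionError(f"spec.agent.timeout must be positive, got {raw}")
    return raw
```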
kaiitunnz requested changes on May 12, 2026.
Purpose
PR-2 of RFC #28. Removes template fields the runtime ignores, surfaces remaining legacy knobs under names that match the underlying backend, and unifies YAML scalar quoting across
`examples/templates/`. Also fixes a latent crash in the single-GPU PPO save wrapper that surfaced when validating the PPO config wiring end-to-end. PPO `target_kl`/`early_stopping` are deferred to PR-3 because they need a real KL-watching early-stop loop; templates keep the fields and the executor carries a `TODO` marker so the spec doesn't churn between PRs.

Changes
Template cleanup:
- `examples/templates/*.yaml` — drop `resources.replicas`, `spec.sloSeconds`, and `spec.parallel` blocks; none of those fields reaches the dispatcher or worker selection.
- `examples/templates/ppo_*.yaml` — drop `generation.do_sample` and `training.optimize_cuda_cache`; TRL's `PPOTrainer.train` hardcodes `do_sample=True`, and the cache flag has no step-based contract under the custom PPO loop.
- `examples/templates/*.yaml` — unify YAML scalars on bare form: drop unnecessary quotes, promote quoted ints (e.g. `cpu: "8"` → `cpu: 8`), keep quotes only where YAML grammar requires them (`split: "train[:1%]"`, `save_strategy: "no"` so it doesn't parse as `False`, etc.; see the PyYAML check after this list).
- `examples/templates/inference_vllm_all_args.yaml` — move `revision` from `model.vllm` to `model.source` so all vLLM templates use the same canonical revision location.
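The reserved-literal rule is easy to verify with PyYAML, which (like other YAML 1.1 parsers) resolves bare `no` to a boolean:

```python
import yaml

# Unquoted "no" is a YAML 1.1 boolean; quoting keeps it a string.
print(yaml.safe_load("save_strategy: no"))    # {'save_strategy': False}
print(yaml.safe_load('save_strategy: "no"'))  # {'save_strategy': 'no'}

# Quoted ints parse as strings; bare ints parse as ints.
print(yaml.safe_load('cpu: "8"'))  # {'cpu': '8'}
print(yaml.safe_load("cpu: 8"))    # {'cpu': 8}
```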
Executor surfacing:

- `src/worker/executors/agent_executor.py` — honour `spec.agent.timeout` (default 600 s); requires a positive integer, rejecting floats outright so sub-second values like `0.5` don't silently truncate to `0`. Threaded through both the single-task and batch streaming paths.
- `src/worker/executors/ppo_executor.py` + PPO templates — adopt the canonical `training.report_to` and `training.project` field names directly (matching `PPOConfig`); a small helper normalises `null`/empty values to TRL's `"none"` sentinel (sketched after this list). `training.padding_side` was already wired through `build_hf_load_kwargs`; locked down with a test. A `TODO(rfc28-pr3)` next to the `kl_coef` read flags `training.target_kl`/`training.early_stopping` as intentionally unwired pending PR-3.
- `src/worker/executors/vllm_executor.py` — stamp `revision` from `spec.model_revision` (the canonical `model.source.revision`); drop the redundant `model.vllm.revision` lookup and the resolver helper, aligning vLLM with the other executors. `tokenizer_revision` stays under `model.vllm` since it's a distinct knob.
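A sketch of the `report_to` normalisation, with the helper name taken from the commits and the exact edge-case handling (e.g. a list with no string items) assumed:

```python
def _resolve_report_to(value: object) -> str | list[str]:
    """Normalise training.report_to for TRL's PPOConfig.

    null / "" disable logging via TRL's "none" sentinel; a string passes
    through unchanged; a list keeps only its str items.
    """
    if value is None or value == "":
        return "none"
    if isinstance(value, str):
        return value
    if isinstance(value, list):
        kept = [item for item in value if isinstance(item, str)]
        return kept if kept else "none"  # assumption: empty list disables too
    return "none"  # assumption: any other type also disables logging
```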
PPO save fix:

- `src/worker/executors/ppo_executor.py` — `_wrapped_save_model` unconditionally read `ppo_trainer.deepspeed`, but HF Trainer only sets that attribute when DeepSpeed is enabled. Single-GPU PPO therefore crashed on the first `_save_checkpoint` with `AttributeError`. The backup/restore now lives inside the existing `is_deepspeed_enabled` guard.
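A sketch of the fixed wrapper's shape; `original_save_model` and the exact work done around the save are illustrative, but the guard mirrors the description above:

```python
def _wrapped_save_model(ppo_trainer, original_save_model, *args, **kwargs):
    # Only touch ppo_trainer.deepspeed when DeepSpeed is actually enabled:
    # HF Trainer never sets the attribute otherwise, which is what crashed
    # single-GPU PPO with AttributeError on the first _save_checkpoint.
    if ppo_trainer.is_deepspeed_enabled:
        backup = ppo_trainer.deepspeed
        try:
            return original_save_model(*args, **kwargs)
        finally:
            ppo_trainer.deepspeed = backup
    return original_save_model(*args, **kwargs)
```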
Tests:

- `tests/worker/test_agent_timeout.py` — agent timeout resolution (int-only contract).
- `tests/worker/test_ppo_config_mapping.py` — `report_to`/`project`/`padding_side` wiring.

Design
The fields handled here split into two groups:
- Dead fields (`resources.replicas`, `spec.sloSeconds`, `spec.parallel`, `generation.do_sample`, `training.optimize_cuda_cache`): nothing in the runtime consumes them, so the templates simply stop advertising them.
- Supported but unwired: the backend (`PPOConfig`, vLLM engine, `AgentExecutor`) supports the knob, but the worker never threaded the template value through, or the template used a non-canonical name. The fix is a small, local change; where the legacy template names diverged from the backend (`log_with`, `tracker_project_name`, vLLM-scoped `revision`), we align on the backend name rather than maintaining an alias.

`target_kl`/`early_stopping` are a separate case — TRL doesn't take them directly, and FlowMesh-owned KL early-stopping is a non-trivial change that warrants its own PR. The TODO marker keeps the connection between templates and executor explicit without dropping the fields.

Test Plan
- `uv run pytest tests/server tests/shared tests/sdk tests/cli` plus the focused unit tests for the executor wiring and the existing template-parse coverage in `tests/server/task/test_template_validation.py`.
- `echo_local`, `inference_hf_tiny`, `inference_vllm_tiny`, `inference_vllm_all_args` (after the revision relocation), `lora_sft_llama`, `ppo_training_llama_1b` end-to-end on a single GPU worker.

Test Result
- `echo_local` — DONE.
- `inference_hf_tiny` — DONE; transformers path unchanged.
- `inference_vllm_tiny` — DONE; vLLM picked up the revision from `model.source.revision`.
- `inference_vllm_all_args` — DONE; confirms `model.source.revision` is the only revision source after the helper was removed.
- `lora_sft_llama` — DONE; `training_successful=True`, `final_lora_archive` produced (`padding_side` wiring exercised).
- `ppo_training_llama_1b` — DONE; `training_successful=True`, runtime ~122 s, ran past `save_steps=25`/`50`, confirming both the `PPOConfig` wiring and the `_save_checkpoint` fix.
- `agent_simple_test` — skipped; pre-existing requirement on the worker's `secrets.yaml` (`UTU_LLM_API_KEY`), unrelated to this PR.

Pre-submission Checklist
- Ran `pre-commit run --all-files` and fixed any issues.
- `uv run pytest tests/` passes locally.
- Dependencies are in sync with the frozen lockfile (`uv sync --all-packages --group ci --frozen`).
- If breaking, the PR is titled `[BREAKING]` and migration steps are described above.