[RFC]: Clean up workflow templates #28

@timzsu

Motivation

Some workflow templates include fields that either fail validation or have no runtime effect. Because templates are our main examples, these fields make unsupported behavior look official. This RFC proposes grouping those fields by treatment: removing no-op fields, documenting metadata-only fields, and deciding whether any should become a supported API.

Proposed change

Group legacy fields by treatment instead of handling each field independently.

Remove no-op scheduler controls from templates

These fields are accepted by the schema but not consumed by the server scheduler. They should be removed from templates unless we explicitly decide to implement them.

| Field group | Active template uses | Proposed treatment | Reason |
| --- | --- | --- | --- |
| resources.replicas | 44 | Remove | Worker matching uses resources.hardware; no runtime path fans out, scales, or duplicates tasks from replicas. |
| spec.sloSeconds | 3 | Remove | Accepted on inference/LoRA specs, but dispatcher and worker selection do not read it. |
| spec.parallel and spec.parallel.max_shards | 4 / 2 | Remove | No parser/runtime path expands this into shards; actual dataset sharding support is through spec.shard. |
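
As an illustration of the removal, here is a minimal sketch that strips the three no-op field groups from a parsed task mapping. The helper names and the assumption that these dotted paths are relative to a single task dictionary are hypothetical; real templates may nest the fields differently.

```python
from copy import deepcopy
from typing import Any

# Dotted paths the table above lists as accepted by the schema but ignored at
# runtime. Their exact nesting in real templates is an assumption of this sketch.
NOOP_PATHS = ("resources.replicas", "spec.sloSeconds", "spec.parallel")


def _remove_path(node: Any, parts: list[str]) -> bool:
    """Delete the key named by parts from nested dicts; return True if removed."""
    if not isinstance(node, dict) or parts[0] not in node:
        return False
    if len(parts) == 1:
        del node[parts[0]]
        return True
    return _remove_path(node[parts[0]], parts[1:])


def strip_noop_fields(task: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Return a cleaned copy of a task mapping and the paths that were removed."""
    cleaned = deepcopy(task)
    removed = [p for p in NOOP_PATHS if _remove_path(cleaned, p.split("."))]
    return cleaned, removed
```

Removing spec.parallel also drops spec.parallel.max_shards, while spec.shard and resources.hardware are left untouched. The returned list of removed paths could be printed in the cleanup PR to make the diff easier to review.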

Support legacy executor config keys with explicit mappings

These fields live inside executor-owned config dictionaries, so each one should either map to a real backend/FlowMesh behavior or be removed from templates. In the uv-managed environment, TRL 0.23.0 PPOConfig directly supports report_to and project, which can cover legacy logging keys; it does not directly support target_kl, early_stopping, optimize_cuda_cache, padding_side, or generation.do_sample.

| Field group | Active template uses | Proposed treatment | Reason |
| --- | --- | --- | --- |
| spec.agent.timeout | 5 | Support | AgentSpec already accepts this field and templates already set it; AgentExecutor can replace hardcoded per-task execution timeouts with a validated value from spec.agent.timeout. |
| PPO training.target_kl and training.early_stopping | 3 each | Support | TRL support: false, but the behavior is important for real PPO stability. Implement as FlowMesh-owned early stopping based on observed KL; do not alias target_kl to kl_coef. |
| PPO generation.do_sample | 3 | Remove | TRL PPOConfig support: false. TRL PPOTrainer.train() hardcodes do_sample=True, so there is no direct config mapping. |
| PPO training.padding_side | 3 | Support | TRL support: false, but FlowMesh owns tokenizer setup. |
| PPO training.optimize_cuda_cache | 3 | Remove | TRL exact-name support: false. torch_empty_cache_steps exists in PPOConfig, but PPOTrainer's custom loop already calls empty_cache() directly and the boolean template field has no clear step-based contract. |
| PPO training.log_with and training.tracker_project_name | 3 each | Support | TRL exact-name support: false, but direct replacements exist: map log_with to report_to and tracker_project_name to project. |
| vLLM model.source.revision | 6 | Support | vLLM templates set the common source revision field, but VLLMExecutor only forwards revision and tokenizer_revision from model.vllm. Other executors do consume model.source.revision, so this is a vLLM-specific mismatch. |
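
The logging-key mapping in the table could look roughly like the sketch below. The function name and the conflict rule are assumptions made for illustration; only the two key renames (log_with to report_to, tracker_project_name to project) come from this RFC.

```python
from typing import Any

# Legacy template key -> PPOConfig argument name, as proposed in the table above.
LEGACY_LOGGING_KEYS = {"log_with": "report_to", "tracker_project_name": "project"}


def map_legacy_logging_keys(training: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the training config with legacy logging keys renamed.

    Raises ValueError when a legacy key and its replacement are both set to
    different values, since silently picking one would hide a template bug.
    (This conflict rule is an assumption of the sketch, not decided behavior.)
    """
    mapped = dict(training)
    for old, new in LEGACY_LOGGING_KEYS.items():
        if old not in mapped:
            continue
        value = mapped.pop(old)
        if new in mapped and mapped[new] != value:
            raise ValueError(f"training.{old} conflicts with training.{new}")
        mapped[new] = value
    return mapped
```

Keeping the mapping in one explicit table makes it easy to see which legacy keys are translated and which are rejected, rather than scattering renames across the executor.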

Fix schema/runtime mismatches

These cases are not just no-ops; they expose inconsistent contracts between template examples, schema, and executor code.

| Field group | Active template uses | Proposed treatment | Reason |
| --- | --- | --- | --- |
| spec.stages[].spec.model.adapters[].url | 1 | Support | templates/lora_then_inference.yaml fails validation because the adapter schema allows only type, path, name, and kwargs, while vllm_lora_executor appears to support URL/task-based LoRA adapters. |
| metadata.project | 1 | Remove | templates/agent_paper_collector.yaml fails strict workflow metadata validation because of this field. |

Document metadata-only fields

These fields are acceptable to keep, but docs should state clearly that they do not control scheduling or execution.

| Field group | Active template uses | Proposed treatment | Reason |
| --- | --- | --- | --- |
| metadata.owner | 29 | Keep | Runtime owner comes from submit/auth context, not template metadata. |
| metadata.annotations.description | 34 | Keep | Useful for humans; no runtime behavior. |
| apiVersion and kind | 49 each | Keep | Required by envelope/schema shape, but not used for task dispatch semantics. |
| model.source.type | 35 | Keep | Reserved discriminator for future model-source backends such as Hugging Face, local paths, object storage, internal registries, or task artifacts; current executors do not route on it yet. |

Add guardrails

Add a template validation CI test that runs every templates/*.yaml file through parse_workflow using the uv-managed environment.
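
A minimal sketch of such a check follows. The helper name is hypothetical, and the parse argument stands in for parse_workflow under the assumption that it accepts the template text and raises on invalid input; the real signature may differ.

```python
from pathlib import Path
from typing import Callable


def find_invalid_templates(
    template_dir: Path, parse: Callable[[str], object]
) -> dict[str, str]:
    """Parse every *.yaml file in template_dir and collect failures.

    parse stands in for parse_workflow; this sketch assumes it accepts the
    template text and raises an exception when the template is invalid.
    """
    failures: dict[str, str] = {}
    for path in sorted(template_dir.glob("*.yaml")):
        try:
            parse(path.read_text(encoding="utf-8"))
        except Exception as exc:  # report every bad template, not only the first
            failures[path.name] = f"{type(exc).__name__}: {exc}"
    return failures
```

A CI test could then assert that the returned mapping is empty for the templates directory, so that a single run lists every failing template rather than stopping at the first one.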

Implementation plan

We plan to do the cleanup via four PRs:

  1. Add a CI test that verifies all templates parse correctly, and fix the two existing unparsable workflows. (fix: support LoRA archive URL outputs #29)
  2. Move templates under examples. (refactor: move templates under examples #32)
  3. Clean up dead configs that are easy to remove or support. Also standardize on double quotes in all templates. (refactor: clean up legacy template fields and surface remaining knobs #34)
  4. Support training.target_kl and training.early_stopping in the PPO executor. They change the training dynamics and are worth a standalone PR. (feat: wire training.target_kl + training.early_stopping #37)
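
For item 4, the FlowMesh-owned early stopping could be as small as the sketch below. The class name and the tolerance multiplier are hypothetical and not part of this RFC; only target_kl and early_stopping correspond to template fields, and the real PPO executor may compute or aggregate the observed KL differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KLEarlyStopper:
    """Decide when to stop PPO updates based on the observed policy KL.

    target_kl and early_stopping mirror the template fields; tolerance is a
    hypothetical knob for this sketch and not a field named in the RFC.
    """

    target_kl: Optional[float] = None
    early_stopping: bool = False
    tolerance: float = 1.5

    def __post_init__(self) -> None:
        # Fail at config time rather than mid-training.
        if self.tolerance <= 0:
            raise ValueError("tolerance must be positive")
        if self.early_stopping and (self.target_kl is None or self.target_kl <= 0):
            raise ValueError("early_stopping requires a positive target_kl")

    def should_stop(self, observed_kl: float) -> bool:
        """Return True once the observed KL exceeds tolerance * target_kl."""
        if not self.early_stopping:
            return False
        return observed_kl > self.tolerance * self.target_kl
```

Note that target_kl is only used as a stopping threshold here and never feeds into kl_coef, which matches the intent of not aliasing the two.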

Alternatives considered

No response

Migration / compatibility

No response

Feedback period

No response

CC list

@kaiitunnz

Before submitting

  • I have searched existing issues and confirmed this is not a duplicate.
