feat: wire training.target_kl + training.early_stopping #37

Merged
timzsu merged 3 commits into main from zsu/rfc28-ppo-early-stop on May 12, 2026

Conversation


@timzsu timzsu commented May 12, 2026

Purpose

Closes the TODO from #34: wire training.target_kl and training.early_stopping end-to-end in PPO. When early_stopping=true and per-step KL exceeds target_kl, training halts cleanly and is reported as successful.

Changes

  • src/worker/executors/ppo_executor.py: add _EarlyStopSignal and _EarlyStopPPOTrainer(PPOTrainer). The subclass overrides log() and raises _EarlyStopSignal when logs["objective/kl"] > self.target_kl. Both PPOTrainer(...) sites in build_trainer() now produce the subclass; _install_kl_early_stopping stamps the threshold, and ppo_trainer.train() is wrapped in try/except _EarlyStopSignal.
  • examples/templates/ppo_training_llama_1b.yaml: flip early_stopping: true so the smoke template exercises the new path; the multi-gpu and ministral templates stay false to keep the disabled case demonstrable.
  • tests/worker/test_ppo_early_stopping.py (new): above-target raises; below / equal / missing key / threshold-unset don't (a rough sketch follows this list).
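A rough sketch of the shape of these tests, assuming an src-layout import path and bypassing PPOTrainer.__init__ for unit isolation; the actual test file may structure things differently:

```python
import pytest
from unittest.mock import patch

# Import path assumed from the src layout; adjust to the real package name if different.
from worker.executors.ppo_executor import _EarlyStopPPOTrainer, _EarlyStopSignal


def _trainer_with(target_kl):
    # Bypass PPOTrainer.__init__ so no model, tokenizer, or dataset is needed here.
    trainer = _EarlyStopPPOTrainer.__new__(_EarlyStopPPOTrainer)
    trainer.target_kl = target_kl
    return trainer


@patch("trl.PPOTrainer.log", lambda self, logs, *args, **kwargs: None)  # exercise only the override
def test_above_target_raises():
    with pytest.raises(_EarlyStopSignal):
        _trainer_with(0.1).log({"objective/kl": 0.5})


@pytest.mark.parametrize(
    "logs, target_kl",
    [
        ({"objective/kl": 0.05}, 0.1),  # below threshold
        ({"objective/kl": 0.1}, 0.1),   # equal to threshold
        ({}, 0.1),                      # objective/kl key missing
        ({"objective/kl": 0.5}, None),  # threshold unset
    ],
)
@patch("trl.PPOTrainer.log", lambda self, logs, *args, **kwargs: None)
def test_no_early_stop(logs, target_kl):
    _trainer_with(target_kl).log(logs)  # must not raise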

Design

TRL's PPOTrainer.train() calls self.log(metrics) per step but never checks control.should_training_stop, so a stock TrainerCallback cannot end PPO training. Raising from an overridden log() lets the exception unwind the loop naturally; the executor catches it so the save / metadata path runs as usual.
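For reference, a minimal sketch of that mechanism. Class and attribute names follow this PR; the exact log() signature and logging details depend on the installed TRL/Transformers versions:

```python
from trl import PPOTrainer


class _EarlyStopSignal(Exception):
    """Raised from log() to unwind TRL's PPO training loop once KL crosses the threshold."""


class _EarlyStopPPOTrainer(PPOTrainer):
    """PPOTrainer that stops training when objective/kl exceeds the stamped target_kl."""

    target_kl = None  # float threshold stamped by _install_kl_early_stopping; None = disabled

    def log(self, logs, *args, **kwargs):
        super().log(logs, *args, **kwargs)
        kl = logs.get("objective/kl")
        if self.target_kl is not None and kl is not None and kl > self.target_kl:
            # Strictly greater than: an exactly-equal KL does not trip the stop.
            raise _EarlyStopSignal(
                f"PPO early stop: objective/kl={kl:.4f} > target_kl={self.target_kl:.4f}"
            )
```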

Activation rules:

  • early_stopping: true + target_kl > 0 → arm.
  • early_stopping: false (or missing) → no-op; non-zero target_kl is logged as ignored.
  • early_stopping: true without a positive target_kl → ExecutionError (see the sketch after this list).
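A minimal sketch of how _install_kl_early_stopping applies these rules. to_bool comes from shared.utils.parsing (see the follow-up commit below); safe_float and ExecutionError are project-internal helpers whose imports are omitted here, and the exact signatures are illustrative:

```python
def _install_kl_early_stopping(trainer, training_config, logger):
    """Arm KL early stopping on the trainer according to the activation rules above (sketch)."""
    enabled = to_bool(training_config.get("early_stopping", False))
    target_kl = safe_float(training_config.get("target_kl"))  # minimum=0 dropped during review

    if not enabled:
        if target_kl:
            # A non-zero target_kl without the flag is not an error; it is logged and ignored.
            logger.info("training.target_kl=%s ignored: early_stopping is disabled", target_kl)
        return

    if target_kl is None or target_kl <= 0:
        raise ExecutionError("early_stopping=true requires a positive training.target_kl")

    trainer.target_kl = target_kl  # read by _EarlyStopPPOTrainer.log()
    logger.info("PPO KL early-stop enabled at target_kl=%.4f", target_kl)
```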

Test Plan

  • uv run pytest tests/worker/test_ppo_early_stopping.py tests/worker/test_ppo_config_mapping.py tests/server tests/shared tests/sdk tests/cli.
  • Live e2e: rebuild flowmesh_worker_gpu, restart one GPU worker, submit examples/templates/ppo_training_llama_1b.yaml, expect the early-stop log lines and a shorter runtime.

Test Result

  • 492 server/shared/sdk/cli + 20 PPO unit tests pass; mypy clean.
  • Live e2e on ppo_training_llama_1b.yaml (1 GPU worker): training_successful=True, runtime 59 s vs ~153 s in #34 (refactor: clean up legacy template fields and surface remaining knobs). Log lines observed:
    • PPO KL early-stop enabled at target_kl=0.1000
    • PPO early stop: objective/kl=0.4548 > target_kl=0.1000
    • PPO training stopped early by KL threshold

Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run pre-commit run --all-files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that uv run pytest tests/ passes locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
  • If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen).
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
  • I have updated documentation or config examples if user-facing behavior changed.

PPO-side early stopping via a PPOTrainer subclass. RFC #28 PR-3 closes
the TODO carried over from PR-34.

- ppo_executor: add _EarlyStopSignal exception and _EarlyStopPPOTrainer
  subclass that overrides log(); it calls super().log() then raises
  _EarlyStopSignal when objective/kl exceeds the trainer's target_kl.
  TRL's PPO loop ignores TrainerControl.should_training_stop, so a
  stock TrainerCallback can't end training — letting an exception
  propagate out of log() unwinds the loop cleanly. Both build_trainer
  construction sites now produce the subclass; it behaves identically
  when target_kl is unset.
- _install_kl_early_stopping reads training.early_stopping and
  training.target_kl and stamps target_kl on the trainer. Misconfig
  (early_stopping=true without a positive target_kl) raises
  ExecutionError; target_kl set without the flag is logged as ignored.
- ppo_trainer.train() now runs under try/except _EarlyStopSignal so
  the trip is logged and training_successful stays True.
- examples/templates/ppo_training_llama_1b.yaml: flip early_stopping
  to true so the smoke template exercises the new path; the multi-gpu
  and ministral templates stay false to demonstrate the disabled
  case.
- tests/worker/test_ppo_early_stopping.py: five focused tests
  (above-target raises, below/equal/missing-key/threshold-unset don't)
  that patch PPOTrainer.log so only the override is exercised.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
@timzsu timzsu changed the title from feat(ppo): wire training.target_kl + training.early_stopping to feat: wire training.target_kl + training.early_stopping on May 12, 2026
@timzsu timzsu marked this pull request as ready for review May 12, 2026 10:29
- ppo_executor: bool(training_config.get('early_stopping', False)) read
  string flags like 'false' as truthy. Switch to to_bool from
  shared.utils.parsing so YAML-stringified booleans canonicalise
  correctly (illustrated below this commit message).
- tests: add coverage for _install_kl_early_stopping activation —
  flag missing / disabled across False, 'false', 'False', 0, 'no',
  'off'; armed across True, 'true', 'True', 1, 'yes', 'on'; enabled
  with no positive target_kl raises ExecutionError.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
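For context on the truthiness pitfall fixed above, a quick illustration; only the to_bool behaviour paraphrased from the commit is shown, the helper itself lives in shared.utils.parsing:

```python
# The old check treated any non-empty string as True:
bool("false")   # True  -> early stopping armed even though the template said "false"

# to_bool canonicalises YAML-stringified flags instead (per the commit above):
#   "false" / "False" / 0 / "no" / "off"  -> False
#   "true"  / "True"  / 1 / "yes" / "on"  -> True
```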
@timzsu timzsu requested a review from kaiitunnz May 12, 2026 10:57

@kaiitunnz kaiitunnz left a comment


Looks good overall with minor comments.

Two review comment threads on src/worker/executors/ppo_executor.py (now outdated).
… log

- _install_kl_early_stopping: drop minimum=0 on the safe_float call;
  the explicit `target_kl <= 0` check below already rejects bad
  values without silently coercing negatives to 0.
- Remove the outer "PPO training stopped early by KL threshold" log;
  the inner log from _EarlyStopPPOTrainer.log ("PPO early stop:
  objective/kl=... > target_kl=...") already records the trip with
  the actual KL value.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
@timzsu timzsu requested a review from kaiitunnz May 12, 2026 13:10

@kaiitunnz kaiitunnz left a comment


LGTM.

@timzsu timzsu merged commit b5f80d1 into main May 12, 2026
11 checks passed
@timzsu timzsu deleted the zsu/rfc28-ppo-early-stop branch May 12, 2026 13:15
@timzsu timzsu mentioned this pull request May 12, 2026
@timzsu timzsu linked an issue May 12, 2026 that may be closed by this pull request

Successfully merging this pull request may close this issue: [RFC]: Cleanup workflow templates.
