feat: wire training.target_kl + training.early_stopping #37
Merged
Conversation
PPO-side early stopping via a PPOTrainer subclass. RFC #28 PR-3 closes the TODO carried over from PR #34.

- ppo_executor: add `_EarlyStopSignal` exception and `_EarlyStopPPOTrainer` subclass that overrides `log()`; it calls `super().log()` then raises `_EarlyStopSignal` when `objective/kl` exceeds the trainer's `target_kl`. TRL's PPO loop ignores `TrainerControl.should_training_stop`, so a stock `TrainerCallback` can't end training; letting an exception propagate out of `log()` unwinds the loop cleanly. Both `build_trainer` construction sites now produce the subclass; it behaves identically when `target_kl` is unset.
- `_install_kl_early_stopping` reads `training.early_stopping` and `training.target_kl` and stamps `target_kl` on the trainer. Misconfig (`early_stopping=true` without a positive `target_kl`) raises `ExecutionError`; `target_kl` set without the flag is logged as ignored.
- `ppo_trainer.train()` now runs under `try/except _EarlyStopSignal` so the trip is logged and `training_successful` stays `True`.
- examples/templates/ppo_training_llama_1b.yaml: flip `early_stopping` to `true` so the smoke template exercises the new path; the multi-gpu and ministral templates stay `false` to demonstrate the disabled case.
- tests/worker/test_ppo_early_stopping.py: five focused tests (above-target raises; below / equal / missing-key / threshold-unset don't) that patch `PPOTrainer.log` so only the override is exercised.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
- ppo_executor: bool(training_config.get('early_stopping', False)) read
string flags like 'false' as truthy. Switch to to_bool from
shared.utils.parsing so YAML-stringified booleans canonicalise
correctly.
- tests: add coverage for _install_kl_early_stopping activation —
flag missing / disabled across False, 'false', 'False', 0, 'no',
'off'; armed across True, 'true', 'True', 1, 'yes', 'on'; enabled
with no positive target_kl raises ExecutionError.
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
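The commit above can be illustrated with a sketch. The real helper lives in `shared.utils.parsing` and its exact behaviour may differ; this is only a plausible implementation covering the value matrix listed in the test coverage:

```python
# Hypothetical to_bool: canonicalise YAML-stringified booleans, since
# bool('false') is truthy in Python (any non-empty string is).

_TRUTHY = {"true", "1", "yes", "on"}
_FALSY = {"false", "0", "no", "off", ""}


def to_bool(value):
    if isinstance(value, bool):  # check before int: bool subclasses int
        return value
    if isinstance(value, (int, float)):
        return value != 0
    if isinstance(value, str):
        text = value.strip().lower()
        if text in _TRUTHY:
            return True
        if text in _FALSY:
            return False
    raise ValueError(f"not a boolean flag: {value!r}")


# The matrix from the test coverage above:
assert all(not to_bool(v) for v in (False, "false", "False", 0, "no", "off"))
assert all(to_bool(v) for v in (True, "true", "True", 1, "yes", "on"))
```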
kaiitunnz (Collaborator) requested changes on May 12, 2026:
Looks good overall with minor comments.
… log
- _install_kl_early_stopping: drop minimum=0 on the safe_float call;
the explicit `target_kl <= 0` check below already rejects bad
values without silently coercing negatives to 0.
- Remove the outer "PPO training stopped early by KL threshold" log;
the inner log from _EarlyStopPPOTrainer.log ("PPO early stop:
objective/kl=... > target_kl=...") already records the trip with
the actual KL value.
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Purpose
Closes the TODO from #34: wire `training.target_kl` and `training.early_stopping` end-to-end in PPO. When `early_stopping=true` and per-step KL exceeds `target_kl`, training halts cleanly and is reported as successful.

Changes

- src/worker/executors/ppo_executor.py: add `_EarlyStopSignal` and `_EarlyStopPPOTrainer(PPOTrainer)`. The subclass overrides `log()` and raises `_EarlyStopSignal` when `logs["objective/kl"] > self.target_kl`. Both `PPOTrainer(...)` sites in `build_trainer()` now produce the subclass; `_install_kl_early_stopping` stamps the threshold, and `ppo_trainer.train()` is wrapped in `try/except _EarlyStopSignal`.
- examples/templates/ppo_training_llama_1b.yaml: flip `early_stopping: true` so the smoke template exercises the new path; the multi-gpu and ministral templates stay `false` to keep the disabled case demonstrable.
- tests/worker/test_ppo_early_stopping.py (new): above-target raises; below / equal / missing key / threshold-unset don't.

Design
TRL's `PPOTrainer.train()` calls `self.log(metrics)` per step but never checks `control.should_training_stop`, so a stock `TrainerCallback` cannot end PPO training. Raising from an overridden `log()` lets the exception unwind the loop naturally; the executor catches it so the save / metadata path runs as usual.

Activation rules:
- `early_stopping: true` + `target_kl > 0` → arm.
- `early_stopping: false` (or missing) → no-op; a non-zero `target_kl` is logged as ignored.
- `early_stopping: true` without a positive `target_kl` → `ExecutionError`.

Test Plan
- `uv run pytest tests/worker/test_ppo_early_stopping.py tests/worker/test_ppo_config_mapping.py tests/server tests/shared tests/sdk tests/cli`.
- `flowmesh_worker_gpu`: restart one GPU worker, submit `examples/templates/ppo_training_llama_1b.yaml`, expect the early-stop log lines and a shorter runtime.

Test Result
`ppo_training_llama_1b.yaml` (1 GPU worker): `training_successful=True`, runtime 59 s vs ~153 s in #34 (refactor: clean up legacy template fields and surface remaining knobs). Log lines observed:

- `PPO KL early-stop enabled at target_kl=0.1000`
- `PPO early stop: objective/kl=0.4548 > target_kl=0.1000`
- `PPO training stopped early by KL threshold`

Pre-submission Checklist
- Ran `pre-commit run --all-files` and fixed any issues.
- `uv run pytest tests/` passes locally.
- Dependencies in sync (`uv sync --all-packages --group ci --frozen`).
- Breaking changes marked `[BREAKING]` and migration steps described above.