refactor: in-process torchrun + DeepSpeed launchers; fix multi-GPU SFT/DPO/PPO spawn#12
Merged
Conversation
…n.main

The training executors (SFT / DPO / PPO) launched multi-GPU runs by shelling out through ``subprocess.check_call(["torchrun", ...])``, which required a top-level ``import subprocess`` (B404, suppressed via ``# nosec``) and an implicit dependency on ``torchrun`` being on $PATH.

Adds ``worker.executors.utils.distributed.run_torchrun`` — a thin wrapper that calls ``torch.distributed.run.main`` directly, the same entry point ``torchrun`` uses. Worker ranks are still spawned by torch's elastic agent under the hood, so the runtime semantics are unchanged. The ``PYTHONPATH`` and launcher-flag env mutations are scoped to the call, so reusing an executor instance does not see the launcher flag pre-set.

DPO and PPO now have no subprocess imports at all. SFT still uses subprocess for the DeepSpeed-CLI branch (handled separately under its own nosec/TODO); the torchrun branch routes through the new helper.

Closes the torchrun half of the bandit B404 follow-up tracked in #6.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
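A minimal sketch of the helper's shape, assuming it takes the launcher argv plus a dict of launcher-only env flags; the real signature in ``worker.executors.utils.distributed`` may differ:

```python
import os

from torch.distributed import run as torch_distributed_run


def run_torchrun(argv: list[str], extra_env: dict[str, str] | None = None) -> None:
    """Launch a torchrun-style argv via torch.distributed.run.main,
    the same entry point the torchrun console script invokes.

    Hypothetical signature; the project's actual helper may differ.
    """
    saved = dict(os.environ)
    try:
        # Scope PYTHONPATH / launcher-flag mutations to this call so a reused
        # executor instance does not see them pre-set afterwards.
        os.environ.update(extra_env or {})
        torch_distributed_run.main(argv)
    finally:
        os.environ.clear()
        os.environ.update(saved)
```

Called, for example, as ``run_torchrun(["--standalone", "--nproc_per_node=2", "entry.py", "--spec", "/tmp/spec.json"], extra_env={"PYTHONPATH": "/path/to/src"})``; worker ranks are still forked by the elastic agent exactly as with the CLI.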
…her.runner
Twin to the in-process torchrun migration. The DeepSpeed branch of SFT's
multi-GPU launcher now calls ``deepspeed.launcher.runner.main`` directly
(the canonical entry point — ``.venv/bin/deepspeed`` is literally
``from deepspeed.launcher.runner import main; main()``). Worker ranks are
still spawned by DeepSpeed's launcher under the hood, so the runtime
semantics are unchanged.
``shutil.which("deepspeed")`` is replaced with ``importlib.util.find_spec``
so the capability check matches the actual import path the in-process
call needs. The check tolerates DeepSpeed's CUDA-op probe at package
init: a CUDA-less host raises ``MissingCUDAException`` during spec
resolution, which we treat as "not available" and fall back to torchrun.
After this commit, ``sft_executor.py`` no longer imports ``subprocess``
or ``shutil``, closing the SFT torchrun + DeepSpeed B404 nosec entirely.
Three remaining B404 sites are gone (``sft_executor``, ``dpo_executor``,
``ppo_executor``); ``checkpoints.archive_model_dir`` still has its
pigz/tar accelerator (separate PR will install ``pigz`` in the GPU image
and document the B404 rationale instead of dropping the feature).
Adds three launcher unit tests covering ``run_deepspeed`` argv,
env-scoping, and the ``deepspeed_available()`` capability check on
hosts where DeepSpeed's package init fails.
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
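A sketch of the two helpers under the assumptions above; the names match the commit, the internals are illustrative:

```python
import importlib.util


def deepspeed_available() -> bool:
    """Best-effort capability check matching the in-process import path.

    DeepSpeed probes its CUDA ops at package init, so on a CUDA-less host the
    spec resolution itself can raise (e.g. MissingCUDAException); treat any
    failure as "not available" and let the caller fall back to torchrun.
    """
    try:
        return importlib.util.find_spec("deepspeed.launcher.runner") is not None
    except Exception:
        return False


def run_deepspeed(argv: list[str]) -> None:
    """Invoke the canonical entry point the deepspeed console script wraps."""
    # Lazy import: only the DeepSpeed branch pays the package-init cost.
    from deepspeed.launcher.runner import main

    main(argv)  # assumption: main(args) accepts an argv list like the CLI does
```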
Brings the bandit policy in line with how subprocess shellouts are actually governed:

- ``B404`` (subprocess module import) joins the project-skip list. The blanket import warning has no follow-up action — the actually-dangerous patterns are caught by ``B602`` (``shell=True``) and ``B607`` (partial executable path), both still enforced.
- ``B603`` (subprocess.* called with argv list) leaves the project-skip list. Every call site now needs a per-line ``# nosec B603`` with a one-line written rationale, so each shellout is visible at the call line rather than hidden behind a blanket skip.

After this commit, the only remaining ``# nosec`` annotations in ``src/`` are the two on the pigz/tar accelerator in ``checkpoints.archive_model_dir``, both with rationale (no TODO): ``argv list, no shell=True, absolute path via shutil.which()``. The audit no longer carries any "TODO" nosecs — every silenced rule is either a project-skip with rationale in ``[tool.bandit]`` or a per-line nosec with rationale at the call site.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
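Illustrative shape of a per-line waiver under the new policy (the real call sites are the two pigz/tar ``subprocess.run`` calls in ``checkpoints.archive_model_dir``; the filename and flags below are made up):

```python
import shutil
import subprocess

pigz = shutil.which("pigz")  # absolute path, so no PATH lookup at exec time
if pigz is not None:
    subprocess.run(  # nosec B603: argv list, no shell=True, absolute path via shutil.which()
        [pigz, "--fast", "model_dir.tar"],
        check=True,
    )
```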
The pre-existing multi-GPU spawn path was untested ('multi-GPU worker is
not supported officially') and broken in three places that only show up
when the path actually fires.
- SFT spawn dumped the raw `WorkerTaskMessage` pydantic model with
`json.dump(task, fh)`, which raises 'Object of type WorkerTaskMessage
is not JSON serializable'. Match the DPO/PPO line:
`json.dump(task.model_dump(mode='json', by_alias=True), fh)`.
- All three `*_dist_entry.py` modules deserialize the spec back via
vanilla `json.load(fh)` and pass the resulting dict to
`executor.run`, which calls `task.spec` and crashes with
`'dict' object has no attribute 'spec'`. `WorkerTaskMessage`
serializes through `dedup_json` in `mode='json'` and has a
`model_validator(mode='before')` that auto-restores the deduped
form on `model_validate`. Route the loaded JSON through
`WorkerTaskMessage.model_validate` so the dist entry reaches
`executor.run` with a real envelope.
- `run_torchrun` did not pass `--tee`/`--redirects`, so when a
rank crashed the elastic agent reported only an opaque
`ChildFailedError` with `error_file: <N/A>` and the rank stderr
was silently lost. Pass `--tee 3` so rank stdout+stderr land on
the parent console and in the per-rank elastic log dir.
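For reference, a sketch of the launcher argv with the new flag (everything other than ``--tee 3`` is illustrative):

```python
# torchrun's --tee takes the same codes as --redirects: 1 = stdout, 2 = stderr,
# 3 = both. "--tee 3" mirrors every rank's output to the parent console while
# the elastic agent also keeps the per-rank copies in its log directory.
argv = [
    "--standalone",
    "--nproc_per_node=2",
    "--tee", "3",
    "sft_dist_entry.py",        # illustrative entry-module path
    "--spec", "/tmp/task.json",
]
```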
Adds a round-trip test that locks the writer/reader contract and
ships three multi-GPU templates (SFT/DPO/PPO) so the path stays
reachable.
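A minimal sketch of that writer/reader contract, assuming ``WorkerTaskMessage`` is imported from the project (the glue code here is illustrative):

```python
import json
from pathlib import Path

# WorkerTaskMessage is the project's envelope model; its import path is omitted here.


def write_spec(task: "WorkerTaskMessage", path: Path) -> None:
    # Parent process: serialize through pydantic so dedup_json and field
    # aliases apply, exactly like the DPO/PPO writers.
    with path.open("w", encoding="utf-8") as fh:
        json.dump(task.model_dump(mode="json", by_alias=True), fh)


def read_spec(path: Path) -> "WorkerTaskMessage":
    # Dist entry: re-validate so the model_validator(mode="before") restores
    # the deduped form and executor.run receives a real envelope, not a dict.
    with path.open("r", encoding="utf-8") as fh:
        return WorkerTaskMessage.model_validate(json.load(fh))
```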
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
kaiitunnz (Collaborator) requested changes on May 3, 2026 and left a comment:

Minor comments. PTAL.
…riptions

Replaces the changelog-style narration with standalone documentation per review feedback (kaiitunnz): the module and function docs now describe what the helpers do rather than what they replace, and ``deepspeed_available`` no longer references "the previous implementation". Renames ``_REPO_ROOT`` to ``_SRC_DIR`` to match the path it resolves to.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Drops the dict round-trip through ``json.dump(task.model_dump(...))`` in favour of ``fh.write(task.model_dump_json(by_alias=True))`` per review suggestion (kaiitunnz). Same on-disk shape, one fewer hop.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
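A tiny equivalence check of the two forms, using a stand-in model (``WorkerTaskMessage`` itself is the project's envelope and is not reproduced here):

```python
import io
import json

from pydantic import BaseModel, Field


class Envelope(BaseModel):
    """Stand-in for the real envelope; one aliased field is enough to show it."""

    model_config = {"populate_by_name": True}
    task_id: str = Field(alias="taskId")


task = Envelope(task_id="t-1")

old = io.StringIO()
json.dump(task.model_dump(mode="json", by_alias=True), old)  # old: model -> dict -> json.dump
new = task.model_dump_json(by_alias=True)                    # new: pydantic writes the JSON string itself

assert json.loads(old.getvalue()) == json.loads(new)         # same on-disk shape, one fewer hop
```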
- Add ``gpu.memory`` minimums to the three multi-GPU smoke templates (16Gi for SFT, 24Gi for DPO, 32Gi for PPO) so the scheduler picks a worker with enough VRAM for each trainer's working set (kaiitunnz on dpo_training_llama_1b_multi_gpu.yaml).
- Extract the test-only launcher env-var placeholder into a named constant ``_TEST_LAUNCHER_FLAG`` with a comment noting why the value is arbitrary (it just needs to not collide with the runtime ``KV_*_DISTRIBUTED`` flags) (kaiitunnz on test_distributed_launcher.py).

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Purpose
Bandit follow-up (#6). Migrate the SFT/DPO/PPO distributed launchers to in-process `torch.distributed.run.main` and `deepspeed.launcher.runner.main` (the canonical entry points the `torchrun` and `deepspeed` console scripts themselves invoke). Swap `B404` ↔ `B603` in the bandit skip list so every subprocess call site has a per-line written rationale instead of a blanket skip. After this PR, no `# nosec TODO` remains in `src/`.

Once the launcher swap was in, the multi-GPU spawn path was actually exercised end-to-end for the first time. Three pre-existing bugs surfaced and are fixed in this PR — see the second bullet group under Changes.
Changes
Launcher swap (the original PR scope)
- `src/worker/executors/utils/distributed.py` (new) — `run_torchrun()`, `run_deepspeed()`, `deepspeed_available()`. Lazy DeepSpeed import; capability check tolerates the package's CUDA-op probe at init. `run_torchrun` passes `--tee 3` so rank stdout+stderr land on the parent console and in the per-rank elastic log dir; without it the elastic agent reports rank crashes only as an opaque `ChildFailedError` with `error_file: <N/A>`.
- `sft_executor.py` / `dpo_executor.py` / `ppo_executor.py` — route the multi-GPU spawn through the new helpers; drop `import subprocess` (and `import shutil` from SFT).
- `pyproject.toml` — `[tool.bandit]` `skips`: add `B404`, drop `B603`. Add `deepspeed.*` to mypy `follow_untyped_imports`.
- `checkpoints.py` — drop the `# nosec B404 — TODO` on the import (now project-skipped); add per-line `# nosec B603 — argv list, no shell=True, absolute path via shutil.which()` on the two pigz/tar `subprocess.run` calls.
- `AGENTS.md` — adds `B603` rule bullet.

Multi-GPU path fixes (newly tested e2e)
- `sft_executor.py` — SFT spawn was dumping the raw `WorkerTaskMessage` pydantic model with `json.dump(task, fh)`, raising `Object of type WorkerTaskMessage is not JSON serializable` on every multi-GPU SFT submission. Match the DPO/PPO line: `json.dump(task.model_dump(mode="json", by_alias=True), fh)`.
- `*_dist_entry.py` (all three) — deserializer was `task = json.load(fh)`, then `executor.run(task, ...)` called `task.spec` and crashed with `'dict' object has no attribute 'spec'`. `WorkerTaskMessage` serializes through `dedup_json` in `mode="json"` and has a `model_validator(mode="before")` that auto-restores the deduped form on `model_validate`. Route the loaded JSON through `WorkerTaskMessage.model_validate` so the dist entry hands `executor.run` a real envelope.
- `tests/worker/test_distributed_launcher.py` — round-trip test that locks the writer/reader contract; updated argv assertion to include `--tee 3`.
- `templates/{sft,dpo_training,ppo_training}_llama_1b_multi_gpu.yaml` — three smoke templates so the multi-GPU path stays reachable from `flowmesh workflow submit`. Bring up a single worker with multiple `cuda_devices` via `flowmesh stack worker up --config <yaml>` (the `up gpu --targets a,b` shortcut creates one worker per GPU, not one multi-GPU worker).

Test Plan
`PPOTrainer.train()`, and then hang inside `accelerate.synchronize_rng_state` (see caveat below).

Caveat — PPO hangs upstream
PPO does not finish under `run_torchrun` with our current `trl`/`accelerate` pin: both ranks reach `PPOTrainer.train()`, the data loader iterator calls `accelerate.synchronize_rng_state`, and that collective never returns. Reproduced on two different GPU pairs ([0,2] and [0,3]), so it is not an NVLink-topology quirk on this host. Stack from `py-spy` (both ranks identical):

This is not a regression from this PR: DPO uses the same `run_torchrun` entry point and finishes cleanly, and SFT (DeepSpeed branch) finishes too. The bug is in TRL's `PPOTrainer.repeat_generator` interaction with `accelerate.synchronize_rng_state` and is independent of the launcher swap. PR #12's scope (launcher swap + the three multi-GPU bugs the e2e surfaced) is verified end-to-end; the PPO upstream hang is left as a known issue to be addressed in a follow-up dep bump or trainer patch.

Pre-submission Checklist
- Ran `pre-commit run --all-files` and fixed any issues.
- `uv run pytest tests/` passes locally.
- `[BREAKING]`: not breaking.
- `B603` rule bullet added (`AGENTS.md`).