Improve inference error handling, clean up evaluation pool, remove Mooncake memory contribution from trainer #2
Merged
Conversation
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Contributor
Pull request overview
This pull request enhances the robustness and memory management of the training and inference pipeline by implementing comprehensive error propagation, improving shutdown behavior, optimizing dataset handling with Ray ObjectRefs, and preventing memory leaks in evaluation tracking.
Changes:
- Implemented fatal error propagation from inference engine to training controller to halt training on unrecoverable failures
- Improved shutdown behavior by discarding buffered prompts instead of attempting to process them during shutdown
- Refactored dataset handling to use Ray ObjectRefs for better memory efficiency with large datasets
- Added evaluation pool cleanup to prevent unbounded memory growth from leftover samples
- Configured Mooncake store to prevent trainers from contributing to the global memory pool
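The fatal-error propagation described above can be sketched as follows. `set_inference_error` and the dispatch-time `RuntimeError` are named in this PR; the class internals (lock, field name, `try_dispatch_batch` body) are a hypothetical reconstruction, not the actual `training_controller.py` code.

```python
import threading


class TrainingController:
    """Minimal sketch of fail-fast error propagation from inference to training."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inference_error = None  # set once an unrecoverable failure occurs

    def set_inference_error(self, err: Exception) -> None:
        # Called by the inference manager when it detects an unrecoverable
        # failure (e.g. a dead Ray actor).
        with self._lock:
            self._inference_error = err

    def try_dispatch_batch(self, batch) -> bool:
        # Fail fast instead of silently dispatching into a broken engine.
        with self._lock:
            if self._inference_error is not None:
                raise RuntimeError(
                    "inference engine failed; halting training"
                ) from self._inference_error
        # ... dispatch `batch` to the inference engine here ...
        return True
```

The design choice is to record the error once and raise on every later dispatch attempt, so the training loop stops at its next step rather than hanging on results that will never arrive.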
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| torchspec/training/trainer.py | Sets global_segment_size=0 in MooncakeConfig to prevent trainer memory contribution |
| torchspec/controller/training_controller.py | Adds error tracking field and set_inference_error method; raises RuntimeError in try_dispatch_batch when inference fails |
| torchspec/controller/inference_manager.py | Catches RayActorError, sets error on controller, improves shutdown behavior by discarding prompts, and makes pool wait interruptible |
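The shutdown behavior listed for `inference_manager.py` can be sketched with a `threading.Event`. `_drain` and `_await_pool_capacity` are the method names from the PR; the surrounding class, the `submit`/`shutdown` helpers, and the polling interval are assumptions for illustration.

```python
import logging
import threading
from queue import Empty, Queue

logger = logging.getLogger(__name__)


class InferenceManagerSketch:
    """Sketch: interruptible capacity wait and discard-on-shutdown drain."""

    def __init__(self, pool_capacity: int = 4) -> None:
        self._prompts: Queue = Queue()
        self._shutdown = threading.Event()
        self._pool_capacity = pool_capacity
        self._in_flight = 0  # requests currently occupying pool slots

    def submit(self, prompt) -> None:
        self._prompts.put(prompt)

    def _await_pool_capacity(self) -> bool:
        # Block until a pool slot frees up, but wake periodically so a
        # shutdown request can interrupt the wait; log if aborted early.
        while self._in_flight >= self._pool_capacity:
            if self._shutdown.wait(timeout=0.05):
                logger.warning("pool-capacity wait aborted by shutdown")
                return False
        return True

    def _drain(self) -> None:
        # On shutdown, discard buffered prompts rather than processing
        # them; trying to flush them could hang the shutdown path.
        discarded = 0
        while True:
            try:
                self._prompts.get_nowait()
                discarded += 1
            except Empty:
                break
        if discarded:
            logger.info("discarded %d buffered prompts on shutdown", discarded)

    def shutdown(self) -> None:
        self._shutdown.set()
        self._drain()
```

The key point is that the capacity wait polls the shutdown event instead of blocking indefinitely, so `shutdown()` never deadlocks against a producer stuck waiting for a slot.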
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
cicirori
pushed a commit
to cicirori/TorchSpec
that referenced
this pull request
Feb 26, 2026
* improve loss to select valid idx Signed-off-by: Yubo Wang <yubowang2019@gmail.com> * reduce compile Signed-off-by: Yubo Wang <yubowang2019@gmail.com> * trigger CI Signed-off-by: Yubo Wang <yubowang2019@gmail.com> --------- Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
zhubohao911
pushed a commit
to zhubohao911/TorchSpec
that referenced
this pull request
Mar 23, 2026
…e memory of trainer (lightseekorg#2) * avoid reserialize, clear eval for ray Signed-off-by: Yubo Wang <yubowang2019@gmail.com> * handle inference error Signed-off-by: Yubo Wang <yubowang2019@gmail.com> * bug fix Signed-off-by: Yubo Wang <yubowang2019@gmail.com> * Update torchspec/controller/inference_manager.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Signed-off-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
9 tasks
This pull request introduces several important improvements and bug fixes to the training and inference management workflow, focusing on robust error handling, memory management, and better integration with Ray object storage. The main changes include enhanced fatal error propagation from the inference manager to the training controller, improved shutdown and buffer handling, refactoring of dataset handling to use Ray ObjectRefs, and cleanup of evaluation tracking to prevent unbounded growth.
Error handling and shutdown improvements:
When a fatal error occurs in the inference engine, it is reported to the training controller via `set_inference_error`, and the controller raises a `RuntimeError` on subsequent dispatch attempts. This prevents silent failures and ensures the training loop halts on unrecoverable inference errors. [1] [2] [3] [4]

The `_drain` method in `InferenceManager` now logs and discards any buffered prompts on shutdown instead of attempting to process them, avoiding hangs during shutdown. `_await_pool_capacity` is now interruptible by shutdown, and logs if aborted early. [1] [2]

Dataset and evaluation handling:
The dataset is now passed as a Ray `ObjectRef` (`dataset_ref`) instead of a raw list, improving compatibility with distributed setups and large datasets. The code handles both legacy lists and `ObjectRef`s, and requires `dataset_size` for the latter. [1] [2] [3] Workers now resolve `dataset_ref` instead of receiving the raw dataset. [1] [2] Related call sites now take `dataset_size` for consistency with the new dataset handling.

Evaluation pool cleanup:

Leftover samples are now cleared from the evaluation tracking pool, preventing unbounded memory growth from evaluation tracking over long runs.
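The dual-path dataset resolution described above can be sketched as follows. To keep the example self-contained, a small stand-in class replaces `ray.ObjectRef`; in the real code the type check would be `isinstance(dataset, ray.ObjectRef)` and resolution would go through `ray.get`. The helper name `resolve_dataset` is hypothetical.

```python
class ObjectRef:
    """Stand-in for ray.ObjectRef so this sketch runs without Ray installed."""

    def __init__(self, value):
        self._value = value

    def get(self):
        # In real Ray code this materialization is ray.get(ref).
        return self._value


def resolve_dataset(dataset, dataset_size=None):
    """Accept either a legacy in-memory list or an ObjectRef handle.

    For an ObjectRef the caller must supply `dataset_size`, because the
    length is not known without materializing the object.
    """
    if isinstance(dataset, ObjectRef):
        if dataset_size is None:
            raise ValueError("dataset_size is required when passing an ObjectRef")
        return dataset.get(), dataset_size
    # Legacy path: a plain list carries its own length.
    return dataset, len(dataset)
```

Shipping only the small handle to each worker and resolving it lazily avoids re-serializing a large dataset into every task, which is the memory win this PR describes.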
Miscellaneous:
Sets `global_segment_size=0` in the trainer's Mooncake configuration so trainers do not contribute segments to the global memory pool, improving reproducibility and debugging. [1] [2]

These changes collectively improve the robustness, scalability, and maintainability of the training and inference pipeline.
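For reference, the trainer-side Mooncake setting can be sketched as below. Only `global_segment_size=0` comes from this PR; the config class shape and the other field are assumptions for illustration, not the actual `MooncakeConfig` definition.

```python
from dataclasses import dataclass


@dataclass
class MooncakeConfig:
    """Illustrative subset of a Mooncake store configuration."""

    # Local buffer for the process's own reads/writes (size is illustrative).
    local_buffer_size: int = 128 * 1024 * 1024
    # 0 means this process registers no segment with the global memory
    # pool, so the trainer consumes from the store without contributing.
    global_segment_size: int = 0


trainer_cfg = MooncakeConfig()
```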