
Improve handle inference error, clean evaluation pool, remove mooncake memory of trainer #2

Merged

yubofredwang merged 5 commits into main from ywang/handle-inference-error on Feb 26, 2026

Conversation

@yubofredwang
Collaborator

This pull request introduces several important improvements and bug fixes to the training and inference management workflow, focusing on robust error handling, memory management, and better integration with Ray object storage. The main changes include enhanced fatal error propagation from the inference manager to the training controller, improved shutdown and buffer handling, refactoring of dataset handling to use Ray ObjectRefs, and cleanup of evaluation tracking to prevent unbounded growth.

Error handling and shutdown improvements:

  • Fatal errors in the inference engine (e.g., Ray actor crashes) are now propagated to the training controller. The inference manager sets an error message via set_inference_error, and the controller raises a RuntimeError on subsequent dispatch attempts. This prevents silent failures and ensures the training loop halts on unrecoverable inference errors.
  • The _drain method in InferenceManager now logs and discards any buffered prompts on shutdown instead of attempting to process them, avoiding hangs during shutdown.
  • The pool-capacity wait loop in _await_pool_capacity is now interruptible by shutdown and logs a message if aborted early.
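The error-propagation and interruptible-wait behavior above can be sketched as follows. The names set_inference_error, try_dispatch_batch, and _await_pool_capacity come from the PR; everything else (the field names, the threading.Event shutdown flag, the report_fatal_error helper) is illustrative and not the actual TorchSpec implementation:

```python
import threading


class TrainingController:
    """Sketch: the controller fails fast once the inference side reports a fatal error."""

    def __init__(self):
        self._inference_error = None  # set by the inference manager on fatal errors

    def set_inference_error(self, message: str) -> None:
        self._inference_error = message

    def try_dispatch_batch(self, batch):
        # Raise instead of silently dispatching into a dead engine.
        if self._inference_error is not None:
            raise RuntimeError(f"inference engine failed: {self._inference_error}")
        return batch  # placeholder for the real dispatch logic


class InferenceManager:
    """Sketch: report fatal errors upward and make the capacity wait interruptible."""

    def __init__(self, controller, capacity: int = 4):
        self.controller = controller
        self.capacity = capacity
        self.in_flight = 0
        self._shutdown = threading.Event()

    def report_fatal_error(self, exc: Exception) -> None:
        # Called e.g. when a ray.exceptions.RayActorError is caught.
        self.controller.set_inference_error(str(exc))

    def _await_pool_capacity(self, poll_s: float = 0.1) -> bool:
        # Wait until a pool slot frees up; return False if shutdown aborts the wait.
        while self.in_flight >= self.capacity:
            if self._shutdown.wait(poll_s):
                print("pool-capacity wait aborted by shutdown")
                return False
        return True
```

Using threading.Event.wait as the sleep primitive is what makes the loop interruptible: a plain time.sleep would have to finish its interval before noticing shutdown.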

Dataset and evaluation handling:

  • The training loop now expects the dataset as a Ray ObjectRef (dataset_ref) instead of a raw list, improving compatibility with distributed setups and large datasets. The code handles both legacy lists and ObjectRefs, and requires dataset_size for the latter.
  • All places where the dataset is reloaded (e.g., after exhaustion) now use dataset_ref instead of the raw dataset.
  • Steps per epoch are now computed using dataset_size for consistency with the new dataset handling.
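The dual-path dataset handling can be sketched with a hypothetical resolve_dataset helper. In the real code the dereferencing function would be ray.get; it is injected here as a plain callable so the sketch runs without Ray installed, and steps_per_epoch shows the dataset_size-based computation:

```python
def resolve_dataset(dataset, dataset_size=None, get=None):
    """Accept either a raw list (legacy path) or an ObjectRef-like handle.

    `get` stands in for ray.get; it is injected so this sketch has no Ray
    dependency. For a handle, dataset_size must be supplied because the
    length is not known until the object is fetched.
    """
    if isinstance(dataset, list):
        return dataset, len(dataset)
    if dataset_size is None:
        raise ValueError("dataset_size is required when passing an ObjectRef")
    return get(dataset), dataset_size


def steps_per_epoch(dataset_size: int, batch_size: int) -> int:
    # Full batches only; a trailing partial batch does not count as a step.
    return dataset_size // batch_size
```

Passing an ObjectRef rather than the materialized list means reloads after exhaustion re-fetch from the Ray object store instead of reserializing the dataset into each task.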

Evaluation pool cleanup:

  • After evaluation batches are dispatched, any leftover samples that do not fill a batch are now dropped and logged, and all evaluation tracking state is cleared to prevent unbounded memory growth.
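The cleanup amounts to batching what fits, logging and dropping the remainder, and clearing the tracking state. flush_eval_pool below is a hypothetical helper illustrating that pattern, not the project's code:

```python
def flush_eval_pool(samples: list, batch_size: int) -> list:
    """Split samples into full batches; drop the partial remainder and
    clear the pool so tracking state cannot grow without bound."""
    full = len(samples) - len(samples) % batch_size
    batches = [samples[i:i + batch_size] for i in range(0, full, batch_size)]
    leftover = len(samples) - full
    if leftover:
        print(f"dropping {leftover} leftover eval samples that do not fill a batch")
    samples.clear()  # clear tracking state even when there is no leftover
    return batches
```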

Miscellaneous:

  • The Mooncake store is now always initialized with global_segment_size=0, so the trainer no longer contributes memory to Mooncake's global segment pool, improving reproducibility and debugging.
  • Minor import and typing improvements for clarity and correctness.
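As a config fragment, the Mooncake change above is roughly the following. Only the global_segment_size field is named in the PR; the dataclass itself and any other fields are stand-ins for the real MooncakeConfig:

```python
from dataclasses import dataclass


@dataclass
class MooncakeConfig:
    """Stand-in for the real config; only global_segment_size is from the PR."""
    # 0 means the trainer process registers no memory with the global
    # segment pool, so it only reads from the store.
    global_segment_size: int = 0


cfg = MooncakeConfig()
```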

These changes collectively improve the robustness, scalability, and maintainability of the training and inference pipeline.

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
@yubofredwang yubofredwang marked this pull request as ready for review February 25, 2026 23:38
@yubofredwang yubofredwang changed the title from "Improve handle inference error, clean evaluation pool, reduce memory of global_segment_size" to "Improve handle inference error, clean evaluation pool, remove mooncake memory of trainer" on Feb 25, 2026
@yubofredwang yubofredwang requested a review from Copilot February 25, 2026 23:51
Contributor

Copilot AI left a comment

Pull request overview

This pull request enhances the robustness and memory management of the training and inference pipeline by implementing comprehensive error propagation, improving shutdown behavior, optimizing dataset handling with Ray ObjectRefs, and preventing memory leaks in evaluation tracking.

Changes:

  • Implemented fatal error propagation from inference engine to training controller to halt training on unrecoverable failures
  • Improved shutdown behavior by discarding buffered prompts instead of attempting to process them during shutdown
  • Refactored dataset handling to use Ray ObjectRefs for better memory efficiency with large datasets
  • Added evaluation pool cleanup to prevent unbounded memory growth from leftover samples
  • Configured Mooncake store to prevent trainers from contributing to the global memory pool

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

  • torchspec/training/trainer.py: Sets global_segment_size=0 in MooncakeConfig to prevent trainer memory contribution
  • torchspec/controller/training_controller.py: Adds error tracking field and set_inference_error method; raises RuntimeError in try_dispatch_batch when inference fails
  • torchspec/controller/inference_manager.py: Catches RayActorError, sets error on controller, improves shutdown behavior by discarding prompts, and makes pool wait interruptible


Comment thread on torchspec/controller/inference_manager.py (outdated)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@yubofredwang yubofredwang merged commit 64beeb5 into main Feb 26, 2026
1 check passed
@yubofredwang yubofredwang deleted the ywang/handle-inference-error branch February 26, 2026 01:54
cicirori pushed a commit to cicirori/TorchSpec that referenced this pull request Feb 26, 2026
* improve loss to select valid idx

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>

* reduce compile

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>

* trigger CI

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>

---------

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
zhubohao911 pushed a commit to zhubohao911/TorchSpec that referenced this pull request Mar 23, 2026
…e memory of trainer (lightseekorg#2)

* avoid reserialize, clear eval for ray

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>

* handle inference error

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>

* bug fix

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>

* Update torchspec/controller/inference_manager.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>


2 participants