
Improve handle inference error, clean evaluation pool, remove mooncake memory of trainer #2

Merged

yubofredwang merged 5 commits into main from ywang/handle-inference-error on Feb 26, 2026

Conversation

@yubofredwang
Collaborator

This pull request introduces several important improvements and bug fixes to the training and inference management workflow, focusing on robust error handling, memory management, and better integration with Ray object storage. The main changes include enhanced fatal error propagation from the inference manager to the training controller, improved shutdown and buffer handling, refactoring of dataset handling to use Ray ObjectRefs, and cleanup of evaluation tracking to prevent unbounded growth.

Error handling and shutdown improvements:

  • Fatal errors in the inference engine (e.g., Ray actor crashes) are now propagated to the training controller. The inference manager sets an error message via set_inference_error, and the controller raises a RuntimeError on subsequent dispatch attempts. This prevents silent failures and ensures the training loop halts on unrecoverable inference errors.
  • The _drain method in InferenceManager now logs and discards any buffered prompts on shutdown instead of attempting to process them, avoiding hangs during shutdown.
  • The pool-capacity wait loop in _await_pool_capacity is now interruptible by shutdown and logs a message if aborted early.
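The error-propagation and interruptible-wait behavior above can be sketched as follows. The names set_inference_error, try_dispatch_batch, and _await_pool_capacity come from the PR; everything else (the field names, the threading.Event shutdown flag, the report_fatal_error helper) is illustrative and not the actual TorchSpec implementation:

```python
import threading


class TrainingController:
    """Sketch: the controller fails fast once the inference side reports a fatal error."""

    def __init__(self):
        self._inference_error = None  # set by the inference manager on fatal errors

    def set_inference_error(self, message: str) -> None:
        self._inference_error = message

    def try_dispatch_batch(self, batch):
        # Raise instead of silently dispatching into a dead engine.
        if self._inference_error is not None:
            raise RuntimeError(f"inference engine failed: {self._inference_error}")
        return batch  # placeholder for the real dispatch logic


class InferenceManager:
    """Sketch: report fatal errors upward and make the capacity wait interruptible."""

    def __init__(self, controller, capacity: int = 4):
        self.controller = controller
        self.capacity = capacity
        self.in_flight = 0
        self._shutdown = threading.Event()

    def report_fatal_error(self, exc: Exception) -> None:
        # Called e.g. when a ray.exceptions.RayActorError is caught.
        self.controller.set_inference_error(str(exc))

    def _await_pool_capacity(self, poll_s: float = 0.1) -> bool:
        # Wait until a pool slot frees up; return False if shutdown aborts the wait.
        while self.in_flight >= self.capacity:
            if self._shutdown.wait(poll_s):
                print("pool-capacity wait aborted by shutdown")
                return False
        return True
```

Using threading.Event.wait as the sleep primitive is what makes the loop interruptible: a plain time.sleep would have to finish its interval before noticing shutdown.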

Dataset and evaluation handling:

  • The training loop now expects the dataset as a Ray ObjectRef (dataset_ref) instead of a raw list, improving compatibility with distributed setups and large datasets. The code handles both legacy lists and ObjectRefs, and requires dataset_size for the latter.
  • All places where the dataset is reloaded (e.g., after exhaustion) now use dataset_ref instead of the raw dataset.
  • Steps per epoch are now computed using dataset_size for consistency with the new dataset handling.
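The dual-path dataset handling can be sketched with a hypothetical resolve_dataset helper. In the real code the dereferencing function would be ray.get; it is injected here as a plain callable so the sketch runs without Ray installed, and steps_per_epoch shows the dataset_size-based computation:

```python
def resolve_dataset(dataset, dataset_size=None, get=None):
    """Accept either a raw list (legacy path) or an ObjectRef-like handle.

    `get` stands in for ray.get; it is injected so this sketch has no Ray
    dependency. For a handle, dataset_size must be supplied because the
    length is not known until the object is fetched.
    """
    if isinstance(dataset, list):
        return dataset, len(dataset)
    if dataset_size is None:
        raise ValueError("dataset_size is required when passing an ObjectRef")
    return get(dataset), dataset_size


def steps_per_epoch(dataset_size: int, batch_size: int) -> int:
    # Full batches only; a trailing partial batch does not count as a step.
    return dataset_size // batch_size
```

Passing an ObjectRef rather than the materialized list means reloads after exhaustion re-fetch from the Ray object store instead of reserializing the dataset into each task.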

Evaluation pool cleanup:

  • After evaluation batches are dispatched, any leftover samples that do not fill a batch are now dropped and logged, and all evaluation tracking state is cleared to prevent unbounded memory growth.
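The cleanup amounts to batching what fits, logging and dropping the remainder, and clearing the tracking state. flush_eval_pool below is a hypothetical helper illustrating that pattern, not the project's code:

```python
def flush_eval_pool(samples: list, batch_size: int) -> list:
    """Split samples into full batches; drop the partial remainder and
    clear the pool so tracking state cannot grow without bound."""
    full = len(samples) - len(samples) % batch_size
    batches = [samples[i:i + batch_size] for i in range(0, full, batch_size)]
    leftover = len(samples) - full
    if leftover:
        print(f"dropping {leftover} leftover eval samples that do not fill a batch")
    samples.clear()  # clear tracking state even when there is no leftover
    return batches
```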

Miscellaneous:

  • The Mooncake store is now always initialized with global_segment_size=0, so the trainer no longer contributes memory to Mooncake's global segment pool, improving reproducibility and debugging.
  • Minor import and typing improvements for clarity and correctness.
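As a config fragment, the Mooncake change above is roughly the following. Only the global_segment_size field is named in the PR; the dataclass itself and any other fields are stand-ins for the real MooncakeConfig:

```python
from dataclasses import dataclass


@dataclass
class MooncakeConfig:
    """Stand-in for the real config; only global_segment_size is from the PR."""
    # 0 means the trainer process registers no memory with the global
    # segment pool, so it only reads from the store.
    global_segment_size: int = 0


cfg = MooncakeConfig()
```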

These changes collectively improve the robustness, scalability, and maintainability of the training and inference pipeline.

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
@yubofredwang yubofredwang marked this pull request as ready for review February 25, 2026 23:38
@yubofredwang yubofredwang changed the title from "Improve handle inference error, clean evaluation pool, reduce memory of global_segment_size" to "Improve handle inference error, clean evaluation pool, remove mooncake memory of trainer" on Feb 25, 2026
@yubofredwang yubofredwang requested a review from Copilot February 25, 2026 23:51
Contributor

Copilot AI left a comment

Pull request overview

This pull request enhances the robustness and memory management of the training and inference pipeline by implementing comprehensive error propagation, improving shutdown behavior, optimizing dataset handling with Ray ObjectRefs, and preventing memory leaks in evaluation tracking.

Changes:

  • Implemented fatal error propagation from inference engine to training controller to halt training on unrecoverable failures
  • Improved shutdown behavior by discarding buffered prompts instead of attempting to process them during shutdown
  • Refactored dataset handling to use Ray ObjectRefs for better memory efficiency with large datasets
  • Added evaluation pool cleanup to prevent unbounded memory growth from leftover samples
  • Configured Mooncake store to prevent trainers from contributing to the global memory pool

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

  • torchspec/training/trainer.py: Sets global_segment_size=0 in MooncakeConfig to prevent trainer memory contribution
  • torchspec/controller/training_controller.py: Adds error tracking field and set_inference_error method; raises RuntimeError in try_dispatch_batch when inference fails
  • torchspec/controller/inference_manager.py: Catches RayActorError, sets error on controller, improves shutdown behavior by discarding prompts, and makes pool wait interruptible


Comment thread on torchspec/controller/inference_manager.py (outdated)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@yubofredwang yubofredwang merged commit 64beeb5 into main Feb 26, 2026
1 check passed
@yubofredwang yubofredwang deleted the ywang/handle-inference-error branch February 26, 2026 01:54
cicirori pushed a commit to cicirori/TorchSpec that referenced this pull request Feb 26, 2026
* improve loss to select valid idx

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>

* reduce compile

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>

* trigger CI

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>

---------

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
zhubohao911 pushed a commit to zhubohao911/TorchSpec that referenced this pull request Mar 23, 2026
…e memory of trainer (lightseekorg#2)

* avoid reserialize, clear eval for ray

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>

* handle inference error

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>

* bug fix

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>

* Update torchspec/controller/inference_manager.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>


2 participants