Fix rayjob will not resume after preempted #1156
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
/lgtm
Just a non-blocking nit
/assign @tenzen-y
LGTM label has been added. Git tree hash: 6edd292534a5db3e477d0162dc94221b9c20e053
@kerthcet please also set the release note, maybe "Fix resuming of RayJob after preempted".
We should probably add unit tests for the case where the RayJob has finished.
However, I think we can work on that in a follow-up.
- return condition, rayjobapi.IsJobTerminal(j.Status.JobStatus)
+ return condition, j.Status.JobStatus == rayjobapi.JobStatusFailed || j.Status.JobStatus == rayjobapi.JobStatusSucceeded
I'm not sure whether this is a bug or intended behavior on the kuberay side.
My concern is that this behavior might change across kuberay versions.
Is there any documentation or API comment that describes it?
It sounds like it should be fixed on the rayjob side.
In kuberay, when stopping a job, it first checks whether the job is already stopped, succeeded, or failed; if not, kuberay stops the job. That is why IsJobTerminal
includes the stopped status. But here, we should only treat succeeded and failed as terminated. The name is puzzling anyway.
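To make the difference concrete, here is a minimal sketch of the two checks side by side. The `JobStatus` type, its string values, and the `isJobTerminal` / `isFinished` helpers are local stand-ins written for illustration, not the actual kuberay or Kueue code; `isJobTerminal` only mimics the behavior described in the comment above (stopped counts as terminal).

```go
package main

import "fmt"

// JobStatus is a local stand-in for the kuberay job status type (assumption).
type JobStatus string

// Status values assumed for illustration only.
const (
	JobStatusRunning   JobStatus = "RUNNING"
	JobStatusStopped   JobStatus = "STOPPED"
	JobStatusSucceeded JobStatus = "SUCCEEDED"
	JobStatusFailed    JobStatus = "FAILED"
)

// isJobTerminal mimics the behavior described in the thread:
// a stopped job is treated as terminal, alongside succeeded and failed.
func isJobTerminal(s JobStatus) bool {
	switch s {
	case JobStatusStopped, JobStatusSucceeded, JobStatusFailed:
		return true
	}
	return false
}

// isFinished is the narrower check from the diff in this PR:
// only succeeded and failed count, so a stopped (suspended) job can resume.
func isFinished(s JobStatus) bool {
	return s == JobStatusFailed || s == JobStatusSucceeded
}

func main() {
	statuses := []JobStatus{JobStatusRunning, JobStatusStopped, JobStatusSucceeded, JobStatusFailed}
	for _, s := range statuses {
		fmt.Printf("%-9s terminal=%-5t finished=%t\n", s, isJobTerminal(s), isFinished(s))
	}
}
```

The two helpers disagree only for the stopped status, which is the state a preempted (suspended) RayJob ends up in.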
Ah, I see. So stopped means suspended.
It still sounds like the name or the logic could be fixed in kuberay.
It sounds like a great suggestion!
/release-note edit Fix resuming of RayJob after preempted.
/release-note-edit
Let me try:
This command doesn't seem to work 😞
I'll get back to this PR later, as I'm at KubeCon these days.
/cc
Signed-off-by: kerthcet <kerthcet@gmail.com>
73b72b4 to 7623656
Updated based on the comments.
I think we should cherry-pick this bug fix.
LGTM label has been added. Git tree hash: 2943b24fbfc79d0c3d964391a2edb2f9ef3e36bc
/assign @tenzen-y
Thanks!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: kerthcet, tenzen-y
The full list of commands accepted by this bot can be found here.
The pull request process is described here
…mpted (#1190)
* Fix rayjob will not resume after preempted
  Signed-off-by: kerthcet <kerthcet@gmail.com>
* Fix integration test error
  Signed-off-by: kerthcet <kerthcet@gmail.com>
---------
Signed-off-by: kerthcet <kerthcet@gmail.com>
Title: Fix rayjob will not resume after preempted
Summary: |-
  This PR fixes an issue where a RayJob would not resume after it was preempted.
  The issue was caused by a bug in the RayJob controller, which prevented the job from being resumed after it was preempted.
  This PR introduces a fix for this issue, which ensures that a preempted RayJob is resumed correctly.
Types:
- bugfix
Main Files Walkthrough:
- filename: pkg/controller/jobs/rayjob/rayjob_controller.go
  changes in file:
  - Fixes a bug in the RayJob controller that prevented preempted RayJobs from being resumed correctly.
  - Adds a new function to the RayJob controller that handles the resumption of preempted RayJobs.
  - Updates the existing function to use the new function to handle preempted RayJobs.
- filename: test/integration/controller/jobs/rayjob/rayjob_controller_test.go
  changes in file:
  - Adds a new test case that verifies the fix for the issue.
  - Updates the existing test cases to use the new test case.
@kerthcet what's the above?
Sorry, I was benchmarking a code LM and forgot to delete the comment. 🥺
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #1146
Fixes #1150
Special notes for your reviewer:
Does this PR introduce a user-facing change?