Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix rayjob will not resume after preempted #1156

Merged
merged 1 commit into from Oct 9, 2023

Conversation

kerthcet
Copy link
Contributor

@kerthcet kerthcet commented Sep 23, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #1146
Fixes #1150

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix resuming of RayJob after preempted.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 23, 2023
@netlify
Copy link

netlify bot commented Sep 23, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 7623656
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/6522d1108019cc000809bc96

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 23, 2023
Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
Just a non-blocking nit
/assign @tenzen-y

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 25, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 6edd292534a5db3e477d0162dc94221b9c20e053

@mimowo
Copy link
Contributor

mimowo commented Sep 25, 2023

@kerthcet please also set the release note, maybe "Fix resuming of RayJob after preempted".

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add unit tests for RayJob in the case of finished.
However, I think we can work on that at a follow-up.

Comment on lines -166 to +167
return condition, rayjobapi.IsJobTerminal(j.Status.JobStatus)

return condition, j.Status.JobStatus == rayjobapi.JobStatusFailed || j.Status.JobStatus == rayjobapi.JobStatusSucceeded
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is a bug or the specification on the kuberay side.
My concern is this behavior might be changed according to the kuberay versions.

Are there any documentation or API comments?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it sounds like it should be fixed on the rayjob side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In kuberay, when stopping a job, it will check whether the job is stopped, succeeded or failed, if no, kuberay will stop the job, so IsJobTerminal includes stop status. But here, we should only consider the terminated status. The name is puzzling anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. The stopped means suspended.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still sounds like the name or the logic could be fixed in kuberay.

@tenzen-y
Copy link
Member

Fix resuming of RayJob after preempted"

It sounds like a great suggestion!

/release-note edit Fix resuming of RayJob after preempted.

@tenzen-y
Copy link
Member

/release-note-edit Fix resuming of RayJob after preempted.

@mimowo
Copy link
Contributor

mimowo commented Sep 25, 2023

/release-note edit Fix resuming of RayJob after preempted.

Let me try:
/release-note-edit Fix resuming of RayJob after preempted.

@tenzen-y
Copy link
Member

This command wouldn't be work fine 😞

@kerthcet
Copy link
Contributor Author

I'm back to this pr later as I'm on kubecon these days.

@andrewsykim
Copy link
Member

/cc

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Oct 8, 2023
Signed-off-by: kerthcet <kerthcet@gmail.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 8, 2023
@kerthcet
Copy link
Contributor Author

kerthcet commented Oct 8, 2023

Updated based on the comments.

@kerthcet
Copy link
Contributor Author

kerthcet commented Oct 8, 2023

I think we should cherry-pick this bug fix.

@mimowo
Copy link
Contributor

mimowo commented Oct 9, 2023

/lgtm
@assign @tenzen-y

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 9, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 2943b24fbfc79d0c3d964391a2edb2f9ef3e36bc

@mimowo
Copy link
Contributor

mimowo commented Oct 9, 2023

/assign @tenzen-y

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
/lgtm
/approve

Comment on lines -166 to +167
return condition, rayjobapi.IsJobTerminal(j.Status.JobStatus)

return condition, j.Status.JobStatus == rayjobapi.JobStatusFailed || j.Status.JobStatus == rayjobapi.JobStatusSucceeded
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. The stopped means suspended.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kerthcet, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 9, 2023
@k8s-ci-robot k8s-ci-robot merged commit 938d118 into kubernetes-sigs:main Oct 9, 2023
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.5 milestone Oct 9, 2023
@kerthcet kerthcet deleted the fix/rayjob-resume branch October 10, 2023 01:42
k8s-ci-robot pushed a commit that referenced this pull request Oct 10, 2023
…mpted (#1190)

* Fix rayjob will not resume after preempted

Signed-off-by: kerthcet <kerthcet@gmail.com>

* Fix integration test error

Signed-off-by: kerthcet <kerthcet@gmail.com>

---------

Signed-off-by: kerthcet <kerthcet@gmail.com>
@kerthcet
Copy link
Contributor Author

Title: Fix rayjob will not resume after preempted
Summary: |-
 This PR fixes an issue where a RayJob would not resume after it was preempted.
 The issue was caused by a bug in the RayJob controller, which prevented the job from being resumed after it was preempted.
 This PR introduces a fix for this issue, which ensures that a preempted RayJob is resumed correctly.
Types:
 - bugfix
Main Files Walkthrough:
 - filename: pkg/controller/jobs/rayjob/rayjob_controller.go
   changes in file:
     - Fixes a bug in the RayJob controller that prevented preempted RayJobs from being resumed correctly.
     - Adds a new function to the RayJob controller that handles the resumption of preempted RayJobs.
     - Updates the existing function to use the new function to handle preempted RayJobs.
 - filename: test/integration/controller/jobs/rayjob/rayjob_controller_test.go
   changes in file:
     - Adds a new test case that verifies the fix for the issue.
     - Updates the existing test cases to use the new test case.

@alculquicondor
Copy link
Contributor

@kerthcet what's the above?

@kerthcet
Copy link
Contributor Author

@kerthcet what's the above?

Sorry, I'm benchmarking the code LM, forgot to delete the comment.🥺

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RayJob] Update the workload status when suspended Preempted RayJob will not resume when resource reclaimed
6 participants