add job suspend run Policy #193
base: master
Conversation
/assign @terrytangyuan
I am not sure if this is a common use case. Could you elaborate?
The ability to suspend and resume Jobs is often desired when cluster resources are limited and a higher-priority Job needs to execute in the place of another Job.
/ok-to-test
What are the changes you are trying to make to the training operator?
Add some logic to the PyTorchJob lifecycle: delete the pods when a job is suspended and create them again when it is resumed. Also tidy up the PyTorchJob status management module so it still behaves correctly after the suspend/resume states are added. A rough sketch of the idea is shown below.
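To make the lifecycle change concrete, here is a minimal Go sketch of how a controller could react to a suspend flag in the run policy. The type and function names (`RunPolicy`, `PyTorchJob`, `reconcileSuspend`, `podManager`) are illustrative assumptions and do not match the real training-operator API; the actual change would live inside the existing PyTorchJob reconcile loop and its status handling.

```go
package main

import "fmt"

// RunPolicy carries an optional Suspend flag; nil or false means "run normally".
// This is a hypothetical stand-in, not the real training-operator type.
type RunPolicy struct {
	Suspend *bool
}

type JobState string

const (
	JobRunning   JobState = "Running"
	JobSuspended JobState = "Suspended"
)

// PyTorchJob is a simplified stand-in for the real CRD object.
type PyTorchJob struct {
	Name      string
	RunPolicy RunPolicy
	State     JobState
}

// podManager abstracts the pod create/delete calls the controller would issue.
type podManager interface {
	DeletePodsFor(job *PyTorchJob) error
	CreatePodsFor(job *PyTorchJob) error
}

// reconcileSuspend applies the transition described in the PR: delete pods when
// the job is suspended, recreate them when it is resumed, and keep the job
// state consistent with the new condition.
func reconcileSuspend(job *PyTorchJob, pods podManager) error {
	suspended := job.RunPolicy.Suspend != nil && *job.RunPolicy.Suspend

	switch {
	case suspended && job.State != JobSuspended:
		// Job was asked to suspend: tear down its pods and record the state.
		if err := pods.DeletePodsFor(job); err != nil {
			return fmt.Errorf("suspend %s: %w", job.Name, err)
		}
		job.State = JobSuspended
	case !suspended && job.State == JobSuspended:
		// Job was resumed: bring the pods back and mark it running again.
		if err := pods.CreatePodsFor(job); err != nil {
			return fmt.Errorf("resume %s: %w", job.Name, err)
		}
		job.State = JobRunning
	}
	return nil
}

// fakePods is a no-op podManager used only to exercise the sketch.
type fakePods struct{}

func (fakePods) DeletePodsFor(job *PyTorchJob) error {
	fmt.Println("deleting pods for", job.Name)
	return nil
}

func (fakePods) CreatePodsFor(job *PyTorchJob) error {
	fmt.Println("creating pods for", job.Name)
	return nil
}

func main() {
	suspend := true
	job := &PyTorchJob{Name: "demo", RunPolicy: RunPolicy{Suspend: &suspend}, State: JobRunning}

	_ = reconcileSuspend(job, fakePods{}) // deletes pods, state -> Suspended
	*job.RunPolicy.Suspend = false
	_ = reconcileSuspend(job, fakePods{}) // recreates pods, state -> Running
}
```

In the real controller the same decision would be made inside the existing reconcile loop, with pod deletion and creation going through the Kubernetes API and the suspend/resume transition reflected in the job's status conditions.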
add job partial success status
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing the approval command in a comment.
I am not sure if suspend is common in distributed training jobs. There will be side effects depending on the training framework, especially when pods are deleted and recreated.
This is not about the training job itself.