This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

add job suspend run Policy #193

Open
wants to merge 1 commit into base: master

Conversation

PeterChg

add job partial success status

@PeterChg
Author

/assign @terrytangyuan

@terrytangyuan
Member

I am not sure if this is a common use case. Could you elaborate?

@PeterChg
Author

PeterChg commented May 23, 2022

I am not sure if this is a common use case. Could you elaborate?

The ability to suspend and resume Jobs is often desired when cluster resources are limited and a higher priority Job needs to execute in the place of another Job.
Given the kubeflow/training-operator project architecture, the kubeflow/common project needs to be modified first.
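
A minimal sketch of the RunPolicy change this would imply in kubeflow/common, mirroring the `suspend` field of the built-in batch/v1 Job API. The field name and comments here are an assumption for illustration, not the final design in this PR:

```go
package v1

// RunPolicy encapsulates runtime policies of a distributed training job,
// e.g. how to clean up resources and how long the job can stay active.
type RunPolicy struct {
	// ... existing fields such as CleanPodPolicy, BackoffLimit, etc.

	// Suspend specifies whether the job controller should create pods for
	// this job. Defaults to false. If it is set to true after the job has
	// started, the controller deletes the job's running pods so the freed
	// resources can be used by higher-priority jobs; setting it back to
	// false resumes the job.
	// +optional
	Suspend *bool `json:"suspend,omitempty"`
}
```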

@gaocegege
Member

/ok-to-test

@terrytangyuan
Member

What are the changes you are trying to make to training operator?

@PeterChg
Author

What are the changes you are trying to make to training operator?

Add some logic to the PyTorch job lifecycle: delete pods when the job is suspended and create pods when the job is resumed. Also optimize the PyTorchJob status management module so it keeps working correctly after the suspend/resume states are added.
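
A rough, self-contained sketch of that control flow, with toy types standing in for the real PyTorchJob controller. The names and the simplified pod counter are hypothetical; only the suspend/resume branching is the point:

```go
package main

import "fmt"

// RunPolicy carries the proposed field: a nil or false Suspend means "run".
type RunPolicy struct {
	Suspend *bool
}

// PyTorchJob is a heavily simplified stand-in for the real CRD type.
type PyTorchJob struct {
	Name   string
	Policy RunPolicy
	Pods   int // number of running master/worker pods
}

func isSuspended(p RunPolicy) bool {
	return p.Suspend != nil && *p.Suspend
}

// reconcile applies the lifecycle rule above: delete pods while the job is
// suspended, (re)create them once it is resumed.
func reconcile(job *PyTorchJob, desired int) {
	if isSuspended(job.Policy) {
		if job.Pods > 0 {
			fmt.Printf("%s suspended: deleting %d pods\n", job.Name, job.Pods)
			job.Pods = 0
		}
		return
	}
	if job.Pods < desired {
		fmt.Printf("%s running: creating %d pods\n", job.Name, desired-job.Pods)
		job.Pods = desired
	}
}

func main() {
	suspend := true
	job := &PyTorchJob{Name: "pytorch-dist-mnist"}

	reconcile(job, 3) // normal reconcile creates the pods
	job.Policy.Suspend = &suspend
	reconcile(job, 3) // suspend frees the pods' resources
	job.Policy.Suspend = nil
	reconcile(job, 3) // resume recreates the pods
}
```

In the real controller this would also need a matching condition in the PyTorchJob status so the status management stays consistent across suspend/resume, which is the second part of the change described above.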

add job partial success status
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from gaocegege after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@terrytangyuan
Member

What are the changes you are trying to make to training operator?

Add some logic to the PyTorch job lifecycle: delete pods when the job is suspended and create pods when the job is resumed. Also optimize the PyTorchJob status management module so it keeps working correctly after the suspend/resume states are added.

I am not sure if suspend is common in distributed training jobs. There will be side effects depending on the training framework, especially when pods are deleted and recreated.

@alculquicondor
Contributor

This is not about the training job itself.
This is about a cluster having scarce resources. If there is a higher-priority job that needs the resources, suspend provides a way to free them. The training job will have a chance to checkpoint, if it supports that; otherwise it will just fail and be retried later.
