This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

add job suspend run Policy #193

Open
wants to merge 1 commit into base: master

Conversation

PeterChg

add job partial success status

@PeterChg
Author

/assign @terrytangyuan

@terrytangyuan
Member

I am not sure if this is a common use case. Could you elaborate?

@PeterChg
Author

PeterChg commented May 23, 2022

I am not sure if this is a common use case. Could you elaborate?

The ability to suspend and resume Jobs is often desired when cluster resources are limited and a higher priority Job needs to execute in the place of another Job.
Given the kubeflow/training-operator project architecture, the kubeflow/common project needs to be modified first.
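
A minimal sketch of the RunPolicy change this would imply in kubeflow/common, mirroring the `suspend` field of the built-in batch/v1 Job API. The field name and comments here are an assumption for illustration, not the final design in this PR:

```go
package v1

// RunPolicy encapsulates runtime policies of a distributed training job,
// e.g. how to clean up resources and how long the job can stay active.
type RunPolicy struct {
	// ... existing fields such as CleanPodPolicy, BackoffLimit, etc.

	// Suspend specifies whether the job controller should create pods for
	// this job. Defaults to false. If it is set to true after the job has
	// started, the controller deletes the job's running pods so the freed
	// resources can be used by higher-priority jobs; setting it back to
	// false resumes the job.
	// +optional
	Suspend *bool `json:"suspend,omitempty"`
}
```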

@gaocegege
Member

/ok-to-test

@terrytangyuan
Member

What are the changes you are trying to make to training operator?

@PeterChg
Author

What are the changes you are trying to make to training operator?

Add some logic to the PyTorch job lifecycle: delete pods when the job is suspended and create pods when the job is resumed. Also optimize the PyTorchJob status management module so it keeps working correctly after the suspend/resume states are added.
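
A rough, self-contained sketch of that control flow, with toy types standing in for the real PyTorchJob controller. The names and the simplified pod counter are hypothetical; only the suspend/resume branching is the point:

```go
package main

import "fmt"

// RunPolicy carries the proposed field: a nil or false Suspend means "run".
type RunPolicy struct {
	Suspend *bool
}

// PyTorchJob is a heavily simplified stand-in for the real CRD type.
type PyTorchJob struct {
	Name   string
	Policy RunPolicy
	Pods   int // number of running master/worker pods
}

func isSuspended(p RunPolicy) bool {
	return p.Suspend != nil && *p.Suspend
}

// reconcile applies the lifecycle rule above: delete pods while the job is
// suspended, (re)create them once it is resumed.
func reconcile(job *PyTorchJob, desired int) {
	if isSuspended(job.Policy) {
		if job.Pods > 0 {
			fmt.Printf("%s suspended: deleting %d pods\n", job.Name, job.Pods)
			job.Pods = 0
		}
		return
	}
	if job.Pods < desired {
		fmt.Printf("%s running: creating %d pods\n", job.Name, desired-job.Pods)
		job.Pods = desired
	}
}

func main() {
	suspend := true
	job := &PyTorchJob{Name: "pytorch-dist-mnist"}

	reconcile(job, 3) // normal reconcile creates the pods
	job.Policy.Suspend = &suspend
	reconcile(job, 3) // suspend frees the pods' resources
	job.Policy.Suspend = nil
	reconcile(job, 3) // resume recreates the pods
}
```

In the real controller this would also need a matching condition in the PyTorchJob status so the status management stays consistent across suspend/resume, which is the second part of the change described above.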

add job partial success status
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from gaocegege after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@terrytangyuan
Member

What are the changes you are trying to make to training operator?

Add some logic to the PyTorch job lifecycle: delete pods when the job is suspended and create pods when the job is resumed. Also optimize the PyTorchJob status management module so it keeps working correctly after the suspend/resume states are added.

I am not sure if suspend is common in distributed training jobs. There will be side effects depending on the training framework, especially when pods are deleted and recreated.

@alculquicondor
Contributor

This is not about the training job itself.
This is about a cluster having scarce resources. If there is a higher-priority job that needs the resources, suspend provides a way to free them. The training job will have a chance to checkpoint, if it supports that; otherwise it will just fail and be retried later.
