Implement suspend semantics #1859

@tenzen-y (Member) commented Jul 11, 2023

What this PR does / why we need it:
I implemented suspend semantics for PyTorchJob, similar to those of batch/job and MPIJob v2beta1. With these semantics, an external controller can stop the operator from creating pods. For example, this is useful for adapting Kubeflow TrainingJob to a job queueing system.

When runPolicy.suspend is true, the training operator removes the following resources regardless of runPolicy.cleanPodPolicy:

  1. Pods
  2. Services
  3. HorizontalPodAutoscalers
  4. PodGroups (for volcano / scheduler-plugins)
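
The rule described above can be sketched roughly as follows. This is a minimal illustration, not the operator's code: `RunPolicy`, `CleanPodPolicy`, `isSuspended`, and `resourcesToDeleteOnSuspend` are simplified, hypothetical stand-ins for the real API types and controller logic.

```go
package main

import "fmt"

// CleanPodPolicy is a simplified stand-in for runPolicy.cleanPodPolicy.
type CleanPodPolicy string

const (
	CleanPodPolicyNone    CleanPodPolicy = "None"
	CleanPodPolicyRunning CleanPodPolicy = "Running"
	CleanPodPolicyAll     CleanPodPolicy = "All"
)

// RunPolicy is a hypothetical, trimmed-down version of the job's run policy.
type RunPolicy struct {
	Suspend        *bool
	CleanPodPolicy CleanPodPolicy
}

// isSuspended reports whether the job asks to be suspended.
// A nil pointer is treated as "not suspended".
func isSuspended(rp RunPolicy) bool {
	return rp.Suspend != nil && *rp.Suspend
}

// resourcesToDeleteOnSuspend returns the resource kinds removed while the job
// is suspended. It deliberately ignores CleanPodPolicy, mirroring the
// description in this PR.
func resourcesToDeleteOnSuspend(rp RunPolicy) []string {
	if !isSuspended(rp) {
		return nil
	}
	return []string{"Pods", "Services", "HorizontalPodAutoscalers", "PodGroups"}
}

func main() {
	suspend := true
	// Even with CleanPodPolicy "None", a suspended job has its resources removed.
	rp := RunPolicy{Suspend: &suspend, CleanPodPolicy: CleanPodPolicyNone}
	fmt.Println(resourcesToDeleteOnSuspend(rp))
}
```

The point of the sketch is only that the suspend check comes first and the clean-up policy is not consulted on that path.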

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Part-of #1519
Related to: #1853

Checklist:

  • Docs included if any changes are user facing

@coveralls

Pull Request Test Coverage Report for Build 5520115950

  • 60 of 169 (35.5%) changed or added relevant lines in 15 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.5%) to 33.498%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/controller.v1/common/pod.go | 0 | 1 | 0.0% |
| pkg/controller.v1/mpi/mpijob_controller.go | 4 | 5 | 80.0% |
| pkg/controller.v1/pytorch/hpa.go | 13 | 15 | 86.67% |
| pkg/apis/kubeflow.org/v1/zz_generated.deepcopy.go | 0 | 5 | 0.0% |
| pkg/controller.v1/xgboost/xgboostjob_controller.go | 0 | 5 | 0.0% |
| pkg/controller.v1/pytorch/pytorchjob_controller.go | 10 | 16 | 62.5% |
| pkg/apis/kubeflow.org/v1/openapi_generated.go | 0 | 7 | 0.0% |
| pkg/controller.v1/mxnet/mxjob_controller.go | 0 | 10 | 0.0% |
| pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go | 4 | 14 | 28.57% |
| pkg/reconciler.v1/common/job.go | 0 | 11 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controller.v1/mpi/mpijob_controller.go | 2 | 79.36% |

| Totals | |
|---|---|
| Change from base Build 5511536982 | 0.5% |
| Covered Lines | 3257 |
| Relevant Lines | 9723 |

💛 - Coveralls

@coveralls
coveralls commented Jul 11, 2023

Pull Request Test Coverage Report for Build 5613234102

  • 65 of 176 (36.93%) changed or added relevant lines in 15 files are covered.
  • 11 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.3%) to 33.538%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/controller.v1/common/pod.go | 0 | 1 | 0.0% |
| pkg/controller.v1/mpi/mpijob_controller.go | 4 | 5 | 80.0% |
| pkg/controller.v1/pytorch/hpa.go | 13 | 14 | 92.86% |
| pkg/apis/kubeflow.org/v1/zz_generated.deepcopy.go | 0 | 5 | 0.0% |
| pkg/controller.v1/xgboost/xgboostjob_controller.go | 0 | 5 | 0.0% |
| pkg/controller.v1/pytorch/pytorchjob_controller.go | 8 | 14 | 57.14% |
| pkg/apis/kubeflow.org/v1/openapi_generated.go | 0 | 7 | 0.0% |
| pkg/controller.v1/mxnet/mxjob_controller.go | 0 | 10 | 0.0% |
| pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go | 4 | 14 | 28.57% |
| pkg/reconciler.v1/common/job.go | 0 | 11 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controller.v1/mpi/mpijob_controller.go | 2 | 79.73% |
| pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go | 9 | 55.87% |

| Totals | |
|---|---|
| Change from base Build 5545906377 | 0.3% |
| Covered Lines | 3280 |
| Relevant Lines | 9780 |

💛 - Coveralls

@tenzen-y
Member Author

/assign @johnugeorge

cc: @alculquicondor @mimowo This PR is the first step to support suspend semantics in the kubeflow/training-operator.

Member Author

Here is the core logic for suspend semantics.

@alculquicondor

cc @trasc

pkg/controller.v1/common/job.go (outdated, resolved)
}

func IsSuspend(status apiv1.JobStatus) bool {
return hasCondition(status, apiv1.JobSuspended)

The name hasCondition is now misleading, and the function appears to do the same thing as apimachinery/pkg/api/meta.IsStatusConditionTrue.

@tenzen-y (Member, Author) commented Jul 11, 2023

Unfortunately, our condition.status is typed as corev1.ConditionStatus:

Status v1.ConditionStatus `json:"status"`

So apimachinery/pkg/api/meta.IsStatusConditionTrue doesn't work :(
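
To illustrate the type mismatch: `meta.IsStatusConditionTrue` accepts a `[]metav1.Condition`, so a condition slice with its own element type needs its own helper. The sketch below uses local stand-in types (`ConditionStatus`, `JobCondition`, `JobConditionType`, and the `"Suspended"` value are assumptions made for illustration, not the operator's actual definitions) and shows what a project-local `isStatusConditionTrue` might look like.

```go
package main

import "fmt"

// ConditionStatus stands in for corev1.ConditionStatus.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

// JobConditionType and JobCondition are simplified stand-ins for the
// operator's own condition types.
type JobConditionType string

// JobSuspended is an assumed value used only in this sketch.
const JobSuspended JobConditionType = "Suspended"

type JobCondition struct {
	Type   JobConditionType
	Status ConditionStatus
}

// isStatusConditionTrue returns true only when a condition of the given type
// exists and its status is True. A condition that is present with status
// False does not count, which is what a name like hasCondition would hide.
func isStatusConditionTrue(conds []JobCondition, t JobConditionType) bool {
	for _, c := range conds {
		if c.Type == t {
			return c.Status == ConditionTrue
		}
	}
	return false
}

func main() {
	resumed := []JobCondition{{Type: JobSuspended, Status: ConditionFalse}}
	suspended := []JobCondition{{Type: JobSuspended, Status: ConditionTrue}}
	fmt.Println(isStatusConditionTrue(resumed, JobSuspended))
	fmt.Println(isStatusConditionTrue(suspended, JobSuspended))
}
```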

you could still rename the function to isConditionTrue

Member Author

Ah, I misunderstood @trasc's comment.
I will replace all of these functions with IsStatusConditionTrue. Thanks!

Member Author

Done.

@tenzen-y force-pushed the support-suspend-semantics-for-pytorchjob branch from 42b1cc9 to 1ab6390 on July 11, 2023 15:32
pkg/util/status.go (outdated, resolved)
@tenzen-y force-pushed the support-suspend-semantics-for-pytorchjob branch from 1ab6390 to c1af716 on July 13, 2023 10:07
@tenzen-y
Member Author

/hold for @alculquicondor's comment.

@mimowo left a comment

Generally lgtm, but I would prefer to make it more similar to Job & MPIJob.

pkg/controller.v1/pytorch/hpa.go (outdated, resolved)
pkg/controller.v1/pytorch/hpa.go (outdated, resolved)
pkg/util/status_test.go (outdated, resolved)
pkg/controller.v1/common/job.go (outdated, resolved)
return err
}
for rType := range jobStatus.ReplicaStatuses {
jobStatus.ReplicaStatuses[rType].Active = 0

I guess this would be set anyway by the other code once the replica pods are cleaned up. This is the approach we take in MPIJob and Job, and I would like us to apply the same approach here.

Member Author

We don't have any other code that resets the Active field; replicaStatus[*].Active is reset only by:

if commonutil.IsSucceeded(jobStatus) {
	for rtype := range jobStatus.ReplicaStatuses {
		jobStatus.ReplicaStatuses[rtype].Succeeded += jobStatus.ReplicaStatuses[rtype].Active
		jobStatus.ReplicaStatuses[rtype].Active = 0
	}
}

So we need to reset the Active field here if the job is suspended.

However, I think we should reset the Active field when cleaning up replica pods, so I will do that refactoring in a follow-up.
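
As a rough, self-contained illustration of the reset discussed in this thread (the `ReplicaStatus` and `JobStatus` types and the `resetActiveOnSuspend` name are simplified, hypothetical stand-ins, not the operator's API):

```go
package main

import "fmt"

// ReplicaStatus and JobStatus are trimmed-down stand-ins for the API types.
type ReplicaStatus struct {
	Active, Succeeded, Failed int32
}

type JobStatus struct {
	ReplicaStatuses map[string]*ReplicaStatus
}

// resetActiveOnSuspend zeroes the Active counter of every replica type.
// Succeeded and Failed are left untouched. Nothing else in this sketch
// decrements Active, which is why the reset is explicit.
func resetActiveOnSuspend(status *JobStatus) {
	for rType := range status.ReplicaStatuses {
		if status.ReplicaStatuses[rType] == nil {
			continue
		}
		status.ReplicaStatuses[rType].Active = 0
	}
}

func main() {
	status := JobStatus{ReplicaStatuses: map[string]*ReplicaStatus{
		"Master": {Active: 1},
		"Worker": {Active: 3, Failed: 1},
	}}
	resetActiveOnSuspend(&status)
	fmt.Println(status.ReplicaStatuses["Worker"].Active, status.ReplicaStatuses["Worker"].Failed)
}
```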

is there any code that sets Active to non zero?

Member Author

is there any code that sets Active to non zero?

Yes, here:

updateJobReplicaStatuses(jobStatus, rType, pod)
->
func updateJobReplicaStatuses(jobStatus *apiv1.JobStatus, rtype apiv1.ReplicaType, pod *corev1.Pod) {
->
func UpdateJobReplicaStatuses(jobStatus *apiv1.JobStatus, rtype apiv1.ReplicaType, pod *corev1.Pod) {
	switch pod.Status.Phase {
	case corev1.PodRunning:
		if pod.DeletionTimestamp != nil {
			// when node is not ready, the pod will be in terminating state.
			// Count deleted Pods as failures to account for orphan Pods that
			// never have a chance to reach the Failed phase.
			jobStatus.ReplicaStatuses[rtype].Failed++
		} else {
			jobStatus.ReplicaStatuses[rtype].Active++
		}
	case corev1.PodSucceeded:
		jobStatus.ReplicaStatuses[rtype].Succeeded++
	case corev1.PodFailed:
		jobStatus.ReplicaStatuses[rtype].Failed++
	}
}

so those functions wouldn't be called in the next reconcile, essentially resetting the number of active pods?

Member Author

so those functions wouldn't be called in the next reconcile

Yes. If the job is suspended, JobController never calls ReconcilePods(): https://github.com/tenzen-y/training-operator/blob/ce7259ecfaacbd529b6b1095dd6b632517dac0d0/pkg/controller.v1/common/job.go#L147-L173

Member

Instead of setting counts manually, why can't it be derived from the status of all pods? Since we have already cleaned up, active pods will be zero. We can do this refactoring separately as well.
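
One way to picture the "derive it from the pods" idea is to recompute the counters from scratch on each pass. The sketch below is only an illustration under simplifying assumptions: `Pod`, `PodPhase`, `ReplicaStatus`, and `deriveReplicaStatus` are hypothetical stand-ins, and the `Terminating` flag stands in for a non-nil `DeletionTimestamp`.

```go
package main

import "fmt"

// PodPhase and Pod are simplified stand-ins for the corev1 types.
type PodPhase string

const (
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

type Pod struct {
	Phase       PodPhase
	Terminating bool // stands in for pod.DeletionTimestamp != nil
}

type ReplicaStatus struct {
	Active, Succeeded, Failed int32
}

// deriveReplicaStatus recomputes all counters from the pods that currently
// exist, so an empty pod list naturally yields Active == 0.
func deriveReplicaStatus(pods []Pod) ReplicaStatus {
	var rs ReplicaStatus
	for _, p := range pods {
		switch p.Phase {
		case PodRunning:
			if p.Terminating {
				// Running but being deleted: counted as failed.
				rs.Failed++
			} else {
				rs.Active++
			}
		case PodSucceeded:
			rs.Succeeded++
		case PodFailed:
			rs.Failed++
		}
	}
	return rs
}

func main() {
	pods := []Pod{{Phase: PodRunning}, {Phase: PodRunning, Terminating: true}}
	fmt.Printf("%+v\n", deriveReplicaStatus(pods))
	// After cleanup there are no pods left, so every counter is zero.
	fmt.Printf("%+v\n", deriveReplicaStatus(nil))
}
```

Note that recomputing from scratch also zeroes Succeeded and Failed once the pods are gone, so a real refactoring would have to decide how those counters are preserved.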

Member Author

We don't have functions that decrement the Active count; we only have functions that reset it. So I think we should refactor ReconcileJob(). However, that refactoring will also affect the Succeeded and Failed conditions, so I would like to do it in a separate PR.

}
jc.Recorder.Event(runtimeObject, corev1.EventTypeNormal, commonutil.NewReason(jobKind, commonutil.JobSuspendedReason), msg)
if !reflect.DeepEqual(*oldStatus, jobStatus) {
return jc.Controller.UpdateJobStatusInApiServer(job, &jobStatus)

Would it be possible to let the reconcile function continue, so that other status fields are updated, such as Active (mentioned above)? I think it is preferable not to update here but let other fields be updated too, so that we can update as much as possible in a single reconciliation run.

@tenzen-y (Member, Author) commented Jul 16, 2023

I think it is preferable not to update here but let other fields be updated too, so that we can update as much as possible in a single reconciliation run.

That makes sense. As I said above, we need to refactor ReconcilePods:

func (jc *JobController) ReconcilePods(

So I will address your suggestion in a follow-up.
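
The reviewer's suggestion (apply every status change first, then write once if anything changed) could look roughly like this. It is a sketch only: `JobStatus`, `statusWriter`, and `reconcileStatus` are hypothetical names, and the status struct is reduced to two plain fields so that a simple assignment copies it.

```go
package main

import (
	"fmt"
	"reflect"
)

// JobStatus is a deliberately tiny stand-in with value fields only.
type JobStatus struct {
	Suspended bool
	Active    int32
}

// statusWriter stands in for the call that persists status to the API server.
type statusWriter func(JobStatus) error

// reconcileStatus applies all status changes for this pass and issues at most
// one write, and only if the status actually changed.
func reconcileStatus(current JobStatus, suspend bool, write statusWriter) error {
	old := current // value fields only, so this is a full copy

	current.Suspended = suspend
	if suspend {
		current.Active = 0
	}

	if reflect.DeepEqual(old, current) {
		return nil // nothing changed, skip the write
	}
	return write(current)
}

func main() {
	writes := 0
	write := func(JobStatus) error { writes++; return nil }

	_ = reconcileStatus(JobStatus{Active: 2}, true, write)      // changes status
	_ = reconcileStatus(JobStatus{Suspended: true}, true, write) // no change
	fmt.Println("writes:", writes)
}
```

With a real status type that contains maps or slices, the copy of the old status would need to be a deep copy for the comparison to be meaningful.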

pkg/controller.v1/pytorch/pytorchjob_controller.go (outdated, resolved)
@tenzen-y
Member Author

@mimowo I updated this PR. PTAL.

pkg/util/status.go (resolved)
pkg/controller.v1/common/job.go (outdated, resolved)
pkg/controller.v1/pytorch/pytorchjob_controller_test.go (outdated, resolved)
pkg/controller.v1/pytorch/pytorchjob_controller_test.go (outdated, resolved)
pkg/controller.v1/pytorch/pytorchjob_controller_test.go (outdated, resolved)
pkg/controller.v1/pytorch/pytorchjob_controller_test.go (outdated, resolved)
@tenzen-y
Member Author

I have rebased.

@google-oss-prow

@mimowo: changing LGTM is restricted to collaborators

In response to this:

/lgtm
/assign @alculquicondor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alculquicondor

Is the title accurate? It says PyTorchJob, but I see API updates in every CRD.

@tenzen-y changed the title from "Implement suspend semantics to PyTorchJob" to "Implement suspend semantics" on Jul 18, 2023
@tenzen-y
Member Author

Is the title accurate? It says PyTorchJob, but I see API updates in every CRD.

Sure.

@alculquicondor left a comment

LGTM overall
I just have side questions and a nit

  continue
  }
- if err := jc.PodControl.DeletePod(pod.Namespace, pod.Name, job.(runtime.Object)); err != nil {
+ if err := jc.PodControl.DeletePod(pod.Namespace, pod.Name, runtimeObject); err != nil {
  return err
  }
  // Pod and service have the same name, thus the service could be deleted using pod's name.

side question: why is there a service per pod? That sounds like unnecessary load.

Member Author

IIUC, ML framework configs need a different FQDN for each pod.

For example, tensorflow ClusterSpec: https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec

Wouldn't a single headless Service allow that? Similar to this: https://kubernetes.io/docs/tasks/job/job-with-pod-to-pod-communication/

Member Author

Ah, right.
Actually, using a single headless Service was planned, although the issue is now closed :(

#1030
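
For context on the headless Service idea: when a pod sets `hostname` and a `subdomain` that matches a headless Service in the same namespace, Kubernetes DNS gives it a name of the form `<hostname>.<subdomain>.<namespace>.svc.<cluster-domain>`. The sketch below builds such per-pod addresses. The helper names, the `<job>-<replica type>-<index>` hostname pattern, the use of the job name as the Service name, and the `cluster.local` domain are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// podFQDN builds the DNS name of a pod that sets hostname and subdomain,
// where subdomain is the name of a headless Service in the same namespace.
func podFQDN(hostname, subdomain, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.%s.svc.%s", hostname, subdomain, namespace, clusterDomain)
}

// replicaAddresses lists host:port entries for every replica of one type,
// all resolved through a single headless Service named after the job
// (an assumption of this sketch).
func replicaAddresses(jobName, replicaType, namespace, clusterDomain string, replicas, port int) []string {
	addrs := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		hostname := fmt.Sprintf("%s-%s-%d", jobName, strings.ToLower(replicaType), i)
		addrs = append(addrs, fmt.Sprintf("%s:%d", podFQDN(hostname, jobName, namespace, clusterDomain), port))
	}
	return addrs
}

func main() {
	for _, a := range replicaAddresses("mnist", "Worker", "default", "cluster.local", 2, 2222) {
		fmt.Println(a)
	}
}
```

Each replica still gets a distinct, stable name, which is what a framework config such as a cluster spec needs, without creating one Service per pod.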

@tenzen-y force-pushed the support-suspend-semantics-for-pytorchjob branch from 77ec73e to 4c28296 on July 19, 2023 05:53
@alculquicondor

LGTM

@tenzen-y
Member Author

Thanks everyone!

/hold cancel
/assign @johnugeorge

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y force-pushed the support-suspend-semantics-for-pytorchjob branch from 4c28296 to e4bf325 on July 20, 2023 15:59
@tenzen-y
Member Author

@johnugeorge I addressed your comments and squashed commits into one. PTAL.

@johnugeorge
Member

Thanks for this awesome feature!
/lgtm
/approve

@google-oss-prow bot added the lgtm label on Jul 20, 2023
@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
