
fix https://github.com/kubeflow/training-operator/issues/1704 #1705

Merged (3 commits) on Jan 25, 2023

Conversation

@HeGaoYuan (Contributor) commented Dec 23, 2022

Which issue(s) this PR fixes: Fixes #1704

Checklist:

  • Docs included if any changes are user facing

Also, I found the event reason constants a little "messy", so I used a string literal for now; I am waiting to rebase my code.

@google-cla bot commented Dec 23, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@@ -133,7 +133,8 @@ func (jc *MPIJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
}

if err = kubeflowv1.ValidateV1MpiJobSpec(&mpijob.Spec); err != nil {
-	logger.Info(err.Error(), "MPIJob failed validation", req.NamespacedName.String())
+	logger.Error(err, "MPIJob failed validation")
Member

It would be nice to add req.NamespacedName.String() in the log as well.

@HeGaoYuan (Contributor, Author) Dec 23, 2022

I think adding req.NamespacedName.String() is redundant, and it looks strange as the value of a key/value pair.
As shown below, the first line comes from logger.Info(err.Error(), "MPIJob failed validation", req.NamespacedName.String()), and the second from logger.Error(err, "MPIJob failed validation").
By the way, this is also part of the log format problem that I think we should optimize.

1.6717681860476494e+09  INFO    PyTorchReplicaType is Master2 but must be one of [Master Worker]        {"pytorchjob": "default/pytorch-test-validate", "PyTorchJob failed validation": "default/pytorch-test-validate"}
1.6717681860476797e+09  ERROR   PyTorchJob failed validation    {"pytorchjob": "default/pytorch-test-validate", "error": "PyTorchReplicaType is Master2 but must be one of [Master Worker]"}
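For context on why the first line looks odd: with logr (the logging interface used by controller-runtime), the variadic arguments after the message are treated as key/value pairs. A minimal illustration of the two calls above (a sketch for explanation, not code from this PR):

// logr treats trailing arguments as key/value pairs, so the human-readable
// sentence becomes a key and the namespaced name becomes its value.
logger.Info(err.Error(), "MPIJob failed validation", req.NamespacedName.String())
// => msg: "<validation error>", extra pair: {"MPIJob failed validation": "<namespace/name>"}

// logger.Error logs the message and attaches the error under the "error" key.
logger.Error(err, "MPIJob failed validation")
// => msg: "MPIJob failed validation", extra pair: {"error": "<validation error>"}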

Member

SGTM

@tenzen-y (Member)

@HeGaoYuan Can you sign the CLA?

@HeGaoYuan (Contributor, Author)

@johnugeorge, what is your suggestion about the event reason constants?

I found the event reason constants a little "messy". Sorry, I am a bit of a code clean freak.

r.Recorder.Event(pytorchjob, corev1.EventTypeNormal, commonutil.JobSucceededReason, msg)

r.recorder.Event(tfJob, corev1.EventTypeNormal, tfJobSucceededReason, msg)

@coveralls commented Dec 23, 2022

Pull Request Test Coverage Report for Build 4007614566

  • 0 of 17 (0.0%) changed or added relevant lines in 6 files are covered.
  • 9 unchanged lines in 6 files lost coverage.
  • Overall coverage increased (+0.07%) to 39.033%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
pkg/controller.v1/mpi/mpijob_controller.go | 0 | 2 | 0.0%
pkg/controller.v1/mxnet/mxjob_controller.go | 0 | 3 | 0.0%
pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go | 0 | 3 | 0.0%
pkg/controller.v1/pytorch/pytorchjob_controller.go | 0 | 3 | 0.0%
pkg/controller.v1/tensorflow/tfjob_controller.go | 0 | 3 | 0.0%
pkg/controller.v1/xgboost/xgboostjob_controller.go | 0 | 3 | 0.0%

Files with Coverage Reduction | New Missed Lines | %
pkg/controller.v1/mxnet/mxjob_controller.go | 1 | 0%
pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go | 1 | 54.12%
pkg/controller.v1/pytorch/pytorchjob_controller.go | 1 | 60.15%
pkg/controller.v1/tensorflow/tfjob_controller.go | 1 | 69.47%
pkg/controller.v1/xgboost/xgboostjob_controller.go | 1 | 0%
pkg/controller.v1/mpi/mpijob_controller.go | 4 | 76.79%
Totals Coverage Status
Change from base Build 4003012644: 0.07%
Covered Lines: 2687
Relevant Lines: 6884

💛 - Coveralls

@johnugeorge (Member) commented Dec 23, 2022

One problem that I see is: if we return an error after a validation failure, the job will still be reconciled continuously even though the error is not recoverable. Should we mark the job as failed?

/cc @gaocegege

@johnugeorge (Member)

/cc @tenzen-y

@HeGaoYuan (Contributor, Author)

One problem that I see is: if we return an error after a validation failure, the job will still be reconciled continuously even though the error is not recoverable. Should we mark the job as failed?

/cc @gaocegege

Yes, I also noticed this problem. Marking the job as failed relates to the "state transition table" problem I mentioned. As I said, the "state transition table" is not yet clear, so we should be careful about adding new state transitions. Continuous reconciliation is common and not a big problem? We can decide this later, once we settle the "state transition table"?

@johnugeorge (Member)

Can we create an issue to track this? A validation failure is a non-recoverable error, and I don't see any value in wasting resources on continuous reconciliation. We may track it in a different PR. Others, thoughts?

/cc @kubeflow/wg-training-leads @kubeflow/common-team

google-oss-prow bot requested a review from a team on December 23, 2022, 07:11
@tenzen-y (Member)

One problem that I see is: if we return an error after a validation failure, the job will still be reconciled continuously even though the error is not recoverable. Should we mark the job as failed?

/cc @gaocegege

@johnugeorge Does that mean that, in the following validation step, the training-operator should mark the JobCondition as Failed if validation fails, and then skip reconciling the custom job (e.g. TFJob) if the JobCondition is Failed?

if err = kubeflowv1.ValidateV1TFJobSpec(&tfjob.Spec); err != nil {
logger.Info(err.Error(), "TFJob failed validation", req.NamespacedName.String())
}

@johnugeorge (Member)

@johnugeorge, what is your suggestion about the event reason constants?

I found the event reason constants a little "messy". Sorry, I am a bit of a code clean freak.

r.Recorder.Event(pytorchjob, corev1.EventTypeNormal, commonutil.JobSucceededReason, msg)

r.recorder.Event(tfJob, corev1.EventTypeNormal, tfJobSucceededReason, msg)

Some inconsistencies happened because operators from multiple repos were merged into the training operator a couple of releases ago. We can use commonutil.JobSucceededReason.

@tenzen-y (Member)

@johnugeorge, what is your suggestion about the event reason constants?
I found the event reason constants a little "messy". Sorry, I am a bit of a code clean freak.

r.Recorder.Event(pytorchjob, corev1.EventTypeNormal, commonutil.JobSucceededReason, msg)

r.recorder.Event(tfJob, corev1.EventTypeNormal, tfJobSucceededReason, msg)

Some inconsistencies happened because operators from multiple repos were merged into the training operator a couple of releases ago. We can use commonutil.JobSucceededReason.

I see. I agree with using commonutil.JobSucceededReason. We might need to do refactoring across controllers.

@johnugeorge (Member) commented Dec 23, 2022

One problem that I see is: if we return an error after a validation failure, the job will still be reconciled continuously even though the error is not recoverable. Should we mark the job as failed?
/cc @gaocegege

@johnugeorge Does that mean that, in the following validation step, the training-operator should mark the JobCondition as Failed if validation fails, and then skip reconciling the custom job (e.g. TFJob) if the JobCondition is Failed?

if err = kubeflowv1.ValidateV1TFJobSpec(&tfjob.Spec); err != nil {
logger.Info(err.Error(), "TFJob failed validation", req.NamespacedName.String())
}

If we don't return an error for a ValidationError, reconciliation won't happen again. Is there a better solution?

@HeGaoYuan (Contributor, Author) commented Dec 23, 2022

Another possible solution is to return ctrl.Result{}, nil instead of ctrl.Result{}, err after validation fails. Then controller-runtime will not reconcile continuously. @johnugeorge @tenzen-y
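As a minimal sketch of that suggestion, reusing the MPIJob validation block from the diff above (this is an illustration, not the code merged in this PR):

// Sketch of the "swallow the validation error" option: returning a nil error
// with an empty Result tells controller-runtime the request is done, so the
// permanently invalid job is not requeued and reconciled over and over.
if err = kubeflowv1.ValidateV1MpiJobSpec(&mpijob.Spec); err != nil {
	logger.Error(err, "MPIJob failed validation")
	return ctrl.Result{}, nil // do not requeue a non-recoverable validation failure
}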

@johnugeorge (Member)

Another possible solution is to return ctrl.Result{}, nil instead of ctrl.Result{}, err after validation fails. Then controller-runtime will not reconcile continuously. @johnugeorge @tenzen-y

Yeah, I referred to that earlier. We should do that, as this error is non-recoverable anyway.

@HeGaoYuan (Contributor, Author)

@johnugeorge, what is your suggestion about the event reason constants?
I found the event reason constants a little "messy". Sorry, I am a bit of a code clean freak.

r.Recorder.Event(pytorchjob, corev1.EventTypeNormal, commonutil.JobSucceededReason, msg)

r.recorder.Event(tfJob, corev1.EventTypeNormal, tfJobSucceededReason, msg)

Some inconsistencies happened because operators from multiple repos were merged into the training operator a couple of releases ago. We can use commonutil.JobSucceededReason.

I see. I agree with using commonutil.JobSucceededReason. We might need to do refactoring across controllers.

If you recommend using commonutil.JobSucceededReason, then I need to open a PR in kubeflow/common to add a JobFailedValidation reason constant. @johnugeorge @tenzen-y

@tenzen-y (Member)

One problem that I see is: if we return an error after a validation failure, the job will still be reconciled continuously even though the error is not recoverable. Should we mark the job as failed?
/cc @gaocegege

@johnugeorge Does that mean that, in the following validation step, the training-operator should mark the JobCondition as Failed if validation fails, and then skip reconciling the custom job (e.g. TFJob) if the JobCondition is Failed?

if err = kubeflowv1.ValidateV1TFJobSpec(&tfjob.Spec); err != nil {
logger.Info(err.Error(), "TFJob failed validation", req.NamespacedName.String())
}

If we don't return an error for a ValidationError, reconciliation won't happen again. Is there a better solution?

@johnugeorge Another option: if a validation error occurs, add a special annotation to the target custom resource (e.g. TFJob) and then return ctrl.Result{}, err. The watcher then rejects (using predicates) enqueuing objects that carry the special annotation into the workqueue.

// using onOwnerCreateFunc is easier to set defaults
if err = c.Watch(&source.Kind{Type: &kubeflowv1.TFJob{}}, &handler.EnqueueRequestForObject{},
predicate.Funcs{CreateFunc: r.onOwnerCreateFunc()},
); err != nil {
return err
}
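As a rough sketch of that predicate idea (the annotation key and the skipInvalid name below are hypothetical, not anything that exists in training-operator):

// Hypothetical annotation used to mark a job that already failed validation.
const failedValidationAnnotation = "kubeflow.org/validation-failed"

// Predicate that filters out objects carrying the marker, so they are not
// enqueued again until the user fixes the spec and removes the annotation.
skipInvalid := predicate.NewPredicateFuncs(func(obj client.Object) bool {
	_, marked := obj.GetAnnotations()[failedValidationAnnotation]
	return !marked
})

if err = c.Watch(&source.Kind{Type: &kubeflowv1.TFJob{}}, &handler.EnqueueRequestForObject{},
	predicate.Funcs{CreateFunc: r.onOwnerCreateFunc()},
	skipInvalid,
); err != nil {
	return err
}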

@HeGaoYuan (Contributor, Author)

One problem that I see is: if we return an error after a validation failure, the job will still be reconciled continuously even though the error is not recoverable. Should we mark the job as failed?
/cc @gaocegege

@johnugeorge Does that mean that, in the following validation step, the training-operator should mark the JobCondition as Failed if validation fails, and then skip reconciling the custom job (e.g. TFJob) if the JobCondition is Failed?

if err = kubeflowv1.ValidateV1TFJobSpec(&tfjob.Spec); err != nil {
logger.Info(err.Error(), "TFJob failed validation", req.NamespacedName.String())
}

If we don't return an error for a ValidationError, reconciliation won't happen again. Is there a better solution?

@johnugeorge Another option: if a validation error occurs, add a special annotation to the target custom resource (e.g. TFJob) and then return ctrl.Result{}, err. The watcher then rejects (using predicates) enqueuing objects that carry the special annotation into the workqueue.

// using onOwnerCreateFunc is easier to set defaults
if err = c.Watch(&source.Kind{Type: &kubeflowv1.TFJob{}}, &handler.EnqueueRequestForObject{},
predicate.Funcs{CreateFunc: r.onOwnerCreateFunc()},
); err != nil {
return err
}

A little complex 😂.
Imagine this situation: users find that their Job spec has a problem, and then they fix it. They would need to know about the annotation and remove it, or our code would need additional logic.

@HeGaoYuan (Contributor, Author)

If you recommend using commonutil.JobSucceededReason, then I need to open a PR in kubeflow/common to add a JobFailedValidation reason constant. @johnugeorge @tenzen-y

@HeGaoYuan For now, we can go ahead by adding a new constant in kubeflow/common and then get this PR merged. We will do further cleanup post 1.6 release after merging kubeflow/common as described in #1714 (comment)

I got it

@johnugeorge (Member)

@HeGaoYuan Can you update it?

@HeGaoYuan (Contributor, Author)

Another possible solution is to return ctrl.Result{}, nil instead of ctrl.Result{}, err after validation fails. Then controller-runtime will not reconcile continuously. @johnugeorge @tenzen-y

@johnugeorge Yes, I can update it. But then what about this point? Should I keep return ctrl.Result{}, err or change it to return ctrl.Result{}, nil?

@johnugeorge (Member) commented Jan 22, 2023

I would recommend updating the PR to use JobFailedValidationReason (https://github.com/kubeflow/common/blob/9ec55d141f90faaf52fd6df271e987e5a6781945/pkg/util/status.go#L21) and keeping return ctrl.Result{}, err (as in the current PR), so as to remain consistent with all controllers.

In the next release, we can discuss and implement the state change in #1711

/cc @tenzen-y What do you think?
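Concretely, the recommended shape of the validation block would look something like the following sketch (the Recorder field name and the event message are assumptions for illustration; commonutil.JobFailedValidationReason comes from kubeflow/common as linked above):

// Sketch of the recommendation: log the error, emit a warning event with the
// shared commonutil.JobFailedValidationReason constant, and keep returning the
// error so the behavior stays consistent with the other controllers.
if err = kubeflowv1.ValidateV1MpiJobSpec(&mpijob.Spec); err != nil {
	logger.Error(err, "MPIJob failed validation")
	// jc.Recorder is assumed to be the controller's event recorder.
	jc.Recorder.Eventf(mpijob, corev1.EventTypeWarning, commonutil.JobFailedValidationReason,
		"MPIJob failed validation because %v", err)
	return ctrl.Result{}, err
}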

@johnugeorge (Member)

@HeGaoYuan We are creating a release tomorrow. Can you update this PR and rebase?

@tenzen-y (Member)

I would recommend updating the PR to use JobFailedValidationReason (https://github.com/kubeflow/common/blob/9ec55d141f90faaf52fd6df271e987e5a6781945/pkg/util/status.go#L21) and keeping return ctrl.Result{}, err (as in the current PR), so as to remain consistent with all controllers.

In the next release, we can discuss and implement the state change in #1711

/cc @tenzen-y What do you think?

Sorry for the late response. I missed the notification.

That makes sense. It would be better to discuss that after the next release, since we should handle the behavior of Job conditions carefully.

So it would be better to change only the error reason.

@tenzen-y (Member)

@johnugeorge Would you like to take over this PR before we cut the new release? Or should we postpone this improvement until after the next release?

@review-notebook-app bot: Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@HeGaoYuan (Contributor, Author) commented Jan 25, 2023

@johnugeorge @tenzen-y
I am so sorry for the late reply; it has been Spring Festival here recently.
I have updated the PR by merging the master branch.
BTW, if I rebase the 3 commits you see on this page into 1 commit, this page will show that I have changed 200+ files, so I am keeping it as it is.

@tenzen-y (Member) left a comment

@HeGaoYuan Thanks for the updates!
/lgtm

/assign @johnugeorge

@johnugeorge (Member)

@tenzen-y Have you noticed that tests are really flaky now?

@tenzen-y (Member)

@tenzen-y Have you noticed that tests are really flaky now?

Is it E2E?

@johnugeorge (Member)

@tenzen-y Have you noticed that tests are really flaky now?

Is it E2E?

There are e2e failures. Also, the Publish Images workflows are taking longer.

@johnugeorge (Member)

Thanks @HeGaoYuan

/approve

@google-oss-prow bot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HeGaoYuan, johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot merged commit d0fb5c0 into kubeflow:master on Jan 25, 2023
@tenzen-y (Member)

@tenzen-y Have you noticed that tests are really flaky now?

Is it E2E?

There are e2e failures. Also, the Publish Images workflows are taking longer.

@johnugeorge I don't think these changes caused the flaky tests.

There are e2e failures

As far as I can see, the training jobs sometimes fail... We might need to improve the sample training code.

the Publish Images workflows are taking longer

I guess the build jobs have taken longer since #1692.

Successfully merging this pull request may close these issues.

Inconsistent implementation about when the validation of job's spec failed