
fix https://github.com/kubeflow/training-operator/issues/1704 #1705

Merged (3 commits) on Jan 25, 2023

Conversation

@HeGaoYuan (Contributor) commented Dec 23, 2022

Which issue(s) this PR fixes: Fixes #1704

Checklist:

  • Docs included if any changes are user facing

Also, I found the event reason constants a little "messy", so I used a string literal for now; I am waiting to rebase my code.

@google-cla bot commented Dec 23, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@@ -133,7 +133,8 @@ func (jc *MPIJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
}

if err = kubeflowv1.ValidateV1MpiJobSpec(&mpijob.Spec); err != nil {
-	logger.Info(err.Error(), "MPIJob failed validation", req.NamespacedName.String())
+	logger.Error(err, "MPIJob failed validation")
Member

It would be nice to add req.NamespacedName.String() in the log as well.

@HeGaoYuan (Contributor, Author) Dec 23, 2022

I think adding req.NamespacedName.String() is redundant, and it looks strange as the value of a key/value pair.
As shown below, the first line comes from logger.Info(err.Error(), "MPIJob failed validation", req.NamespacedName.String()), and the second from logger.Error(err, "MPIJob failed validation").
By the way, this is also part of the log format problem that I think we should optimize.

1.6717681860476494e+09  INFO    PyTorchReplicaType is Master2 but must be one of [Master Worker]        {"pytorchjob": "default/pytorch-test-validate", "PyTorchJob failed validation": "default/pytorch-test-validate"}
1.6717681860476797e+09  ERROR   PyTorchJob failed validation    {"pytorchjob": "default/pytorch-test-validate", "error": "PyTorchReplicaType is Master2 but must be one of [Master Worker]"}
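For context on why the first line looks odd: with logr (the logging interface used by controller-runtime), the variadic arguments after the message are treated as key/value pairs. A minimal illustration of the two calls above (a sketch for explanation, not code from this PR):

// logr treats trailing arguments as key/value pairs, so the human-readable
// sentence becomes a key and the namespaced name becomes its value.
logger.Info(err.Error(), "MPIJob failed validation", req.NamespacedName.String())
// => msg: "<validation error>", extra pair: {"MPIJob failed validation": "<namespace/name>"}

// logger.Error logs the message and attaches the error under the "error" key.
logger.Error(err, "MPIJob failed validation")
// => msg: "MPIJob failed validation", extra pair: {"error": "<validation error>"}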

Member

SGTM

@tenzen-y (Member)

@HeGaoYuan Can you sign the CLA?

@HeGaoYuan (Contributor, Author)

@johnugeorge, what is your suggestion about the event reason constants?

I found the event reason constants a little "messy". Sorry, I am a bit of a code clean freak.

r.Recorder.Event(pytorchjob, corev1.EventTypeNormal, commonutil.JobSucceededReason, msg)

r.recorder.Event(tfJob, corev1.EventTypeNormal, tfJobSucceededReason, msg)

@coveralls commented Dec 23, 2022

Pull Request Test Coverage Report for Build 4007614566

  • 0 of 17 (0.0%) changed or added relevant lines in 6 files are covered.
  • 9 unchanged lines in 6 files lost coverage.
  • Overall coverage increased (+0.07%) to 39.033%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
pkg/controller.v1/mpi/mpijob_controller.go | 0 | 2 | 0.0%
pkg/controller.v1/mxnet/mxjob_controller.go | 0 | 3 | 0.0%
pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go | 0 | 3 | 0.0%
pkg/controller.v1/pytorch/pytorchjob_controller.go | 0 | 3 | 0.0%
pkg/controller.v1/tensorflow/tfjob_controller.go | 0 | 3 | 0.0%
pkg/controller.v1/xgboost/xgboostjob_controller.go | 0 | 3 | 0.0%

Files with Coverage Reduction | New Missed Lines | %
pkg/controller.v1/mxnet/mxjob_controller.go | 1 | 0%
pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go | 1 | 54.12%
pkg/controller.v1/pytorch/pytorchjob_controller.go | 1 | 60.15%
pkg/controller.v1/tensorflow/tfjob_controller.go | 1 | 69.47%
pkg/controller.v1/xgboost/xgboostjob_controller.go | 1 | 0%
pkg/controller.v1/mpi/mpijob_controller.go | 4 | 76.79%
Totals Coverage Status
Change from base Build 4003012644: 0.07%
Covered Lines: 2687
Relevant Lines: 6884

💛 - Coveralls

@johnugeorge (Member) commented Dec 23, 2022

One problem that I see is: if we return an error after a validation failure, the job will still be reconciled continuously even though the error is not recoverable. Should we mark the job as failed?

/cc @gaocegege

@johnugeorge (Member)

/cc @tenzen-y

@HeGaoYuan (Contributor, Author)

One problem that I see is: if we return an error after a validation failure, the job will still be reconciled continuously even though the error is not recoverable. Should we mark the job as failed?

/cc @gaocegege

Yes, I also noticed this problem. Marking the job as failed relates to the "state transition table" problem I mentioned. As I said, the "state transition table" is not yet clear, so we should be careful about adding new state transitions. Continuous reconciliation is common and not a big problem? We can decide this later, once we settle the "state transition table"?

@johnugeorge (Member)

Can we create an issue to track this? A validation failure is a non-recoverable error, and I don't see any value in wasting resources on continuous reconciliation. We may track it in a different PR. Others, thoughts?

/cc @kubeflow/wg-training-leads @kubeflow/common-team

google-oss-prow bot requested a review from a team on December 23, 2022, 07:11
@tenzen-y (Member)

One problem that I see is: if we return an error after a validation failure, the job will still be reconciled continuously even though the error is not recoverable. Should we mark the job as failed?

/cc @gaocegege

@johnugeorge Does that mean that, in the following validation step, the training-operator should mark the JobCondition as Failed if validation fails, and then skip reconciling the custom job (e.g. TFJob) if the JobCondition is Failed?

if err = kubeflowv1.ValidateV1TFJobSpec(&tfjob.Spec); err != nil {
logger.Info(err.Error(), "TFJob failed validation", req.NamespacedName.String())
}

@johnugeorge (Member)

@johnugeorge, what is your suggestion about the event reason constants?

I found the event reason constants a little "messy". Sorry, I am a bit of a code clean freak.

r.Recorder.Event(pytorchjob, corev1.EventTypeNormal, commonutil.JobSucceededReason, msg)

r.recorder.Event(tfJob, corev1.EventTypeNormal, tfJobSucceededReason, msg)

Some inconsistencies happened because operators from multiple repos were merged into the training operator a couple of releases ago. We can use commonutil.JobSucceededReason.

@tenzen-y (Member)

@johnugeorge, what is your suggestion about the event reason constants?
I found the event reason constants a little "messy". Sorry, I am a bit of a code clean freak.

r.Recorder.Event(pytorchjob, corev1.EventTypeNormal, commonutil.JobSucceededReason, msg)

r.recorder.Event(tfJob, corev1.EventTypeNormal, tfJobSucceededReason, msg)

Some inconsistencies happened because operators from multiple repos were merged into the training operator a couple of releases ago. We can use commonutil.JobSucceededReason.

I see. I agree with using commonutil.JobSucceededReason. We might need to do refactoring across controllers.

@johnugeorge (Member) commented Dec 23, 2022

One problem that I see is: if we return an error after a validation failure, the job will still be reconciled continuously even though the error is not recoverable. Should we mark the job as failed?
/cc @gaocegege

@johnugeorge Does that mean that, in the following validation step, the training-operator should mark the JobCondition as Failed if validation fails, and then skip reconciling the custom job (e.g. TFJob) if the JobCondition is Failed?

if err = kubeflowv1.ValidateV1TFJobSpec(&tfjob.Spec); err != nil {
logger.Info(err.Error(), "TFJob failed validation", req.NamespacedName.String())
}

If we don't return an error for a ValidationError, reconciliation won't happen again. Is there a better solution?

@HeGaoYuan (Contributor, Author) commented Dec 23, 2022

Another possible solution is to return ctrl.Result{}, nil instead of ctrl.Result{}, err after validation fails. Then controller-runtime will not reconcile continuously. @johnugeorge @tenzen-y
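As a minimal sketch of that suggestion, reusing the MPIJob validation block from the diff above (this is an illustration, not the code merged in this PR):

// Sketch of the "swallow the validation error" option: returning a nil error
// with an empty Result tells controller-runtime the request is done, so the
// permanently invalid job is not requeued and reconciled over and over.
if err = kubeflowv1.ValidateV1MpiJobSpec(&mpijob.Spec); err != nil {
	logger.Error(err, "MPIJob failed validation")
	return ctrl.Result{}, nil // do not requeue a non-recoverable validation failure
}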

@johnugeorge (Member)

Another possible solution is to return ctrl.Result{}, nil instead of ctrl.Result{}, err after validation fails. Then controller-runtime will not reconcile continuously. @johnugeorge @tenzen-y

Yeah, I referred to that earlier. We should do that, as this error is non-recoverable anyway.

@HeGaoYuan (Contributor, Author)

@johnugeorge, what is your suggestion about the event reason constants?
I found the event reason constants a little "messy". Sorry, I am a bit of a code clean freak.

r.Recorder.Event(pytorchjob, corev1.EventTypeNormal, commonutil.JobSucceededReason, msg)

r.recorder.Event(tfJob, corev1.EventTypeNormal, tfJobSucceededReason, msg)

Some inconsistencies happened because operators from multiple repos were merged into the training operator a couple of releases ago. We can use commonutil.JobSucceededReason.

I see. I agree with using commonutil.JobSucceededReason. We might need to do refactoring across controllers.

If you recommend using commonutil.JobSucceededReason, then I need to open a PR in kubeflow/common to add a JobFailedValidation reason constant. @johnugeorge @tenzen-y

@tenzen-y (Member)

One problem that I see is: if we return an error after a validation failure, the job will still be reconciled continuously even though the error is not recoverable. Should we mark the job as failed?
/cc @gaocegege

@johnugeorge Does that mean that, in the following validation step, the training-operator should mark the JobCondition as Failed if validation fails, and then skip reconciling the custom job (e.g. TFJob) if the JobCondition is Failed?

if err = kubeflowv1.ValidateV1TFJobSpec(&tfjob.Spec); err != nil {
logger.Info(err.Error(), "TFJob failed validation", req.NamespacedName.String())
}

If we don't return an error for a ValidationError, reconciliation won't happen again. Is there a better solution?

@johnugeorge Another option: if a validation error occurs, add a special annotation to the target custom resource (e.g. TFJob) and then return ctrl.Result{}, err. The watcher then rejects (using predicates) enqueuing objects that carry the special annotation into the workqueue.

// using onOwnerCreateFunc is easier to set defaults
if err = c.Watch(&source.Kind{Type: &kubeflowv1.TFJob{}}, &handler.EnqueueRequestForObject{},
predicate.Funcs{CreateFunc: r.onOwnerCreateFunc()},
); err != nil {
return err
}
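As a rough sketch of that predicate idea (the annotation key and the skipInvalid name below are hypothetical, not anything that exists in training-operator):

// Hypothetical annotation used to mark a job that already failed validation.
const failedValidationAnnotation = "kubeflow.org/validation-failed"

// Predicate that filters out objects carrying the marker, so they are not
// enqueued again until the user fixes the spec and removes the annotation.
skipInvalid := predicate.NewPredicateFuncs(func(obj client.Object) bool {
	_, marked := obj.GetAnnotations()[failedValidationAnnotation]
	return !marked
})

if err = c.Watch(&source.Kind{Type: &kubeflowv1.TFJob{}}, &handler.EnqueueRequestForObject{},
	predicate.Funcs{CreateFunc: r.onOwnerCreateFunc()},
	skipInvalid,
); err != nil {
	return err
}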

@HeGaoYuan (Contributor, Author)

One problem that I see is: if we return an error after a validation failure, the job will still be reconciled continuously even though the error is not recoverable. Should we mark the job as failed?
/cc @gaocegege

@johnugeorge Does that mean that, in the following validation step, the training-operator should mark the JobCondition as Failed if validation fails, and then skip reconciling the custom job (e.g. TFJob) if the JobCondition is Failed?

if err = kubeflowv1.ValidateV1TFJobSpec(&tfjob.Spec); err != nil {
logger.Info(err.Error(), "TFJob failed validation", req.NamespacedName.String())
}

If we don't return an error for a ValidationError, reconciliation won't happen again. Is there a better solution?

@johnugeorge Another option: if a validation error occurs, add a special annotation to the target custom resource (e.g. TFJob) and then return ctrl.Result{}, err. The watcher then rejects (using predicates) enqueuing objects that carry the special annotation into the workqueue.

// using onOwnerCreateFunc is easier to set defaults
if err = c.Watch(&source.Kind{Type: &kubeflowv1.TFJob{}}, &handler.EnqueueRequestForObject{},
predicate.Funcs{CreateFunc: r.onOwnerCreateFunc()},
); err != nil {
return err
}

A little complex 😂.
Imagine this situation: users find that their Job spec has a problem, and then they fix it. They would need to know about the annotation and remove it, or our code would need additional logic.

@HeGaoYuan (Contributor, Author)

If you recommend using commonutil.JobSucceededReason, then I need to open a PR in kubeflow/common to add a JobFailedValidation reason constant. @johnugeorge @tenzen-y

@HeGaoYuan For now, we can go ahead by adding a new constant in kubeflow/common and then get this PR merged. We will do further cleanup post 1.6 release after merging kubeflow/common as described in #1714 (comment)

I got it

@johnugeorge (Member)

@HeGaoYuan Can you update it?

@HeGaoYuan (Contributor, Author)

Another possible solution is to return ctrl.Result{}, nil instead of ctrl.Result{}, err after validation fails. Then controller-runtime will not reconcile continuously. @johnugeorge @tenzen-y

@johnugeorge Yes, I can update it. But then what about this point? Should I keep return ctrl.Result{}, err or change it to return ctrl.Result{}, nil?

@johnugeorge (Member) commented Jan 22, 2023

I would recommend updating the PR to use JobFailedValidationReason (https://github.com/kubeflow/common/blob/9ec55d141f90faaf52fd6df271e987e5a6781945/pkg/util/status.go#L21) and keeping return ctrl.Result{}, err (as in the current PR), so as to remain consistent with all controllers.

In the next release, we can discuss and implement the state change in #1711

/cc @tenzen-y What do you think?
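Concretely, the recommended shape of the validation block would look something like the following sketch (the Recorder field name and the event message are assumptions for illustration; commonutil.JobFailedValidationReason comes from kubeflow/common as linked above):

// Sketch of the recommendation: log the error, emit a warning event with the
// shared commonutil.JobFailedValidationReason constant, and keep returning the
// error so the behavior stays consistent with the other controllers.
if err = kubeflowv1.ValidateV1MpiJobSpec(&mpijob.Spec); err != nil {
	logger.Error(err, "MPIJob failed validation")
	// jc.Recorder is assumed to be the controller's event recorder.
	jc.Recorder.Eventf(mpijob, corev1.EventTypeWarning, commonutil.JobFailedValidationReason,
		"MPIJob failed validation because %v", err)
	return ctrl.Result{}, err
}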

@johnugeorge (Member)

@HeGaoYuan We are creating a release tomorrow. Can you update this PR and rebase?

@tenzen-y (Member)

I would recommend updating the PR to use JobFailedValidationReason (https://github.com/kubeflow/common/blob/9ec55d141f90faaf52fd6df271e987e5a6781945/pkg/util/status.go#L21) and keeping return ctrl.Result{}, err (as in the current PR), so as to remain consistent with all controllers.

In the next release, we can discuss and implement the state change in #1711

/cc @tenzen-y What do you think?

Sorry for the late response. I missed the notification.

That makes sense. It would be better to discuss that after the next release, since we should handle the behavior of Job conditions carefully.

So it would be better to change only the error reason.

@tenzen-y (Member)

@johnugeorge Would you like to take over this PR before we cut the new release? Or should we postpone this improvement until after the next release?

@review-notebook-app bot: Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@HeGaoYuan (Contributor, Author) commented Jan 25, 2023

@johnugeorge @tenzen-y
I am so sorry for the late reply; it has been Spring Festival here recently.
I have updated the PR by merging the master branch.
BTW, if I rebase the 3 commits you see on this page into 1 commit, this page will show that I have changed 200+ files, so I am keeping it as it is.

@tenzen-y (Member) left a comment

@HeGaoYuan Thanks for the updates!
/lgtm

/assign @johnugeorge

@johnugeorge (Member)

@tenzen-y Have you noticed that tests are really flaky now?

@tenzen-y (Member)

@tenzen-y Have you noticed that tests are really flaky now?

Is it E2E?

@johnugeorge (Member)

@tenzen-y Have you noticed that tests are really flaky now?

Is it E2E?

There are e2e failures. Also, the Publish Images workflows are taking longer.

@johnugeorge (Member)

Thanks @HeGaoYuan

/approve

@google-oss-prow bot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HeGaoYuan, johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot merged commit d0fb5c0 into kubeflow:master on Jan 25, 2023
@tenzen-y (Member)

@tenzen-y Have you noticed that tests are really flaky now?

Is it E2E?

There are e2e failures. Also, the Publish Images workflows are taking longer.

@johnugeorge I don't think these changes caused the flaky tests.

There are e2e failures

As far as I can see, the training jobs sometimes fail... We might need to improve the sample training code.

the Publish Images workflows are taking longer

I guess the build jobs have taken longer since #1692.

Successfully merging this pull request may close these issues.

Inconsistent implementation about when the validation of job's spec failed