Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add mpi-operator(v1) to the unified operator #1457

Merged
merged 3 commits into from
Nov 19, 2021

Conversation

hackerboy01
Copy link
Member

add mpi-operator(v1) to the unified operator

@aws-kf-ci-bot
Copy link
Contributor

Hi @hackerboy01. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gaocegege
Copy link
Member

/ok-to-test

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wha happened to the previous PR for this?

go.mod Outdated Show resolved Hide resolved
@gaocegege
Copy link
Member

Hi @hackerboy01

Is it ready to review?

Copy link
Member

@zw0610 zw0610 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally looks great. We might still need tiny fix.

pkg/apis/mxnet/v1/zz_generated.deepcopy.go Outdated Show resolved Hide resolved
//+kubebuilder:rbac:groups="",resources=pods,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups="",resources=serviceaccount,verbs=get;list;watch;create;delete
//+kubebuilder:rbac:groups="",resources=configmaps,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups="rbac.authorization.k8s.io",resources=roles,verbs=get;list;watch;create;delete
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To support elastic feature (horovod-elastic), we also need permission to update role: https://github.com/kubeflow/mpi-operator/blob/master/pkg/controllers/v1/mpi_job_controller.go#L777

utilruntime.HandleError(fmt.Errorf("couldn't get jobKey for job object %#v: %v", mpijob, err))
}
replicaTypes := util.GetReplicaTypes(mpijob.Spec.MPIReplicaSpecs)
needReconcile := util.SatisfiedExpectations(r.Expectations, jobKey, replicaTypes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function SatisfiedExpectations only checks Pod and Service, which is sufficient for other job API. However, I wonder if MPIJob needs a more sophisticated SatisfiedExpectations implementation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this ever checked?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alculquicondor you mean mpi-operator had not ever checked excpectations?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, did the author check @zw0610's questions?

return err
}

// inject watching for job related service
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for mpi-controller.v1, there won't be job-related service. Instead, ServiceAccount, Role, RoleBinding and other resource may need be watched.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the controller works properly without solving Wang's comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not addressed? @hackerboy01

pkg/controller.v1/mpi/mpijob_controller.go Outdated Show resolved Hide resolved
Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Let's make sure we don't introduce additional enhancements and keep this PR almost like copy-paste from the existing controller code. This way it's easy to track/review changes.

@terrytangyuan
Copy link
Member

terrytangyuan commented Nov 8, 2021

/cc @alculquicondor

@alculquicondor
Copy link

can you also move the unit tests please?

msg := fmt.Sprintf("MPIReplicaSpec is not valid: Image is undefined in the container of %v", rType)
return fmt.Errorf(msg)
}
// if container.Name == mpiv1.DefaultContainerName {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove these code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alculquicondor We will move the unit tests in the next pr.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this commented code if it's not actually needed.

fix: controller-gen

fix:fmt
@terrytangyuan
Copy link
Member

@alculquicondor We will move the unit tests in the next pr.

@hackerboy01 I think you'll want to include the tests in this PR. Otherwise, we have no way to verify the correctness of the migrated code.

@zw0610
Copy link
Member

zw0610 commented Nov 10, 2021

@alculquicondor We will move the unit tests in the next pr.

@hackerboy01 I think you'll want to include the tests in this PR. Otherwise, we have no way to verify the correctness of the migrated code.

As we changed from the controller mode to reconciler mode, I would suggest only moving unit tests which do not use the old controller directly. Tests with the old controller mode does not verify the correctness of this new reconciler.

For the next pr, we can follow the suite_test.go and tensorflow_test.go and add unit tests for the reconciler.

What do you think? @terrytangyuan

Meanwhile, it seems we also need to add more tests to tensorflow reconciler as well. @Jeffwan

@terrytangyuan
Copy link
Member

terrytangyuan commented Nov 10, 2021

Why adding tests in the next PR instead of having it as part of this PR?

@zw0610
Copy link
Member

zw0610 commented Nov 10, 2021

Why adding tests in the next PR instead of having it as part of this PR?

Because we lack a good example among existing reconcilers. Anyone could add tests for tf/pytorch reconciler so mpi reconciler can follow the path?

@alculquicondor
Copy link

As we changed from the controller mode to reconciler mode, I would suggest only moving unit tests which do not use the old controller directly.

Yes, the unit tests need to be adapted, just like the code was adapted. But we shouldn't add such a significant amount of code without tests in the same PR. The contributor could decide to leave and then we have untested code to maintain.

@coveralls
Copy link

coveralls commented Nov 11, 2021

Pull Request Test Coverage Report for Build 1460365743

  • 888 of 1149 (77.28%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+16.0%) to 24.149%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/controller.v1/mpi/mpijob.go 102 108 94.44%
pkg/controller.v1/mpi/mpijob_controller.go 786 1041 75.5%
Totals Coverage Status
Change from base Build 1431720272: 16.0%
Covered Lines: 1199
Relevant Lines: 4965

💛 - Coveralls

@zw0610
Copy link
Member

zw0610 commented Nov 15, 2021

We just converted all test for v1 controller from mpi-operator to this pr.

@terrytangyuan
Copy link
Member

LGTM. Thank you for adding the tests!

/cc @alculquicondor Would you like to take another look before we merge?

@zw0610
Copy link
Member

zw0610 commented Nov 19, 2021

Can we move this pr forward? @terrytangyuan

@terrytangyuan
Copy link
Member

/approve
/lgtm

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hackerboy01, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 63a6067 into kubeflow:master Nov 19, 2021
@alculquicondor
Copy link

Sorry, I was on vacation. I'll take a look regardless and hopefully we can fix any problems in a follow up.

@alculquicondor
Copy link

@hackerboy01 are you going to be migrating the v2 operator as well?

@terrytangyuan
Copy link
Member

terrytangyuan commented Nov 22, 2021

@hackerboy01 are you going to be migrating the v2 operator as well?

Let's start the discussion on v2 migration in #1479 instead of on this merged PR.

Copy link

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I have more capacity to look at the v1 controller changes, but I'm not convinced that this PR went through a proper review cycle. There are a handful of important open questions that were never looked at.

manifests/base/cluster-role.yaml Show resolved Hide resolved
manifests/base/cluster-role.yaml Show resolved Hide resolved
- apiGroups:
- policy
resources:
- poddisruptionbudgets

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall a need for this either

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -0,0 +1,38 @@
// Copyright 2019 The Kubeflow Authors

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2021 here and other files

// Package v1 contains API Schema definitions for the kubeflow.org v1 API group
//+kubebuilder:object:generate=true
//+groupName=kubeflow.org
package v1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have a doc.go file if there is already this comment here?

utilruntime.HandleError(fmt.Errorf("couldn't get jobKey for job object %#v: %v", mpijob, err))
}
replicaTypes := util.GetReplicaTypes(mpijob.Spec.MPIReplicaSpecs)
needReconcile := util.SatisfiedExpectations(r.Expectations, jobKey, replicaTypes)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this ever checked?

return err
}

// inject watching for job related service

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the controller works properly without solving Wang's comment.

}

//mpijob not need delete services
func (r *MPIJobReconciler) DeletePodsAndServices(runPolicy *commonv1.RunPolicy, job interface{}, pods []*corev1.Pod) error {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which services?

// reconcileServices checks and updates services for each given ReplicaSpec.
// It will requeue the job in case of an error while creating/deleting services.
// mpijob not need services
func (jc *MPIJobReconciler) ReconcileServices(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to implement this function?

return true
}

r.Scheme.Default(mpiJob)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that this is safe. If a user changes an MPIJob before the controller had a chance to update the job, I think the defaults would be cleared. And maybe there could be a race condition in the client cache, but I'm not sure how the cache is implemented in controller-runtime.

cc @Jeffwan who did the same for tf-operator

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alculquicondor
I did same thing for tf. Currently, it's still using a lightweight solution, the ideal way is to move the logics to webhooks. If user changes the job, the event sequence for the same object is guaranteed but since they are different kind of events and handled by different methods, I would say there's chance to get into the race condition.

T1: onOwnerCreateFunc is revoked and it tries to set default RunPolicy
T2: onUpdate() is revoked and it updates similar fields
T3. In the end, onOwnerCreateFunc set defaults.

Honestly, I have not played with this case. I agree even if we won't update that fast but the case is still there theoretically. Is there a better way to handle this without webhook? Maybe we can move most of logics to Defaulting https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#defaulting

@zw0610
Copy link
Member

zw0610 commented Nov 24, 2021

@hackerboy01 and I have reviewed the comments above. Fixing pull requests will follow this week.

- apiGroups:
- apiextensions.k8s.io
resources:
- customresourcedefinitions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remind me what CRD needs to be created?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm... I wouldn't expect the controller to create CRDs

- apiGroups:
- policy
resources:
- poddisruptionbudgets
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

}

if err = validation.ValidateV1MpiJobSpec(&mpijob.Spec); err != nil {
logger.Info(err.Error(), "MPIJob failed validation", req.NamespacedName.String())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. We return put the object back to the queue. It's worthless to move it forward.
Another thing is I think the original validation logic is kind of outdated. Personally I don't see too many specific fields to validate. With latest controller tools, I think most of these invalid cases would be blocked by CRD validation.

return true
}

r.Scheme.Default(mpiJob)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alculquicondor
I did same thing for tf. Currently, it's still using a lightweight solution, the ideal way is to move the logics to webhooks. If user changes the job, the event sequence for the same object is guaranteed but since they are different kind of events and handled by different methods, I would say there's chance to get into the race condition.

T1: onOwnerCreateFunc is revoked and it tries to set default RunPolicy
T2: onUpdate() is revoked and it updates similar fields
T3. In the end, onOwnerCreateFunc set defaults.

Honestly, I have not played with this case. I agree even if we won't update that fast but the case is still there theoretically. Is there a better way to handle this without webhook? Maybe we can move most of logics to Defaulting https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#defaulting

@Jeffwan
Copy link
Member

Jeffwan commented Nov 25, 2021

@hackerboy01 Great work. Please help address the comments from @alculquicondor @zw0610 and me. Let's create a reliable and stable version for the users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants