Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support mpi jobs in universal operator #1345

Closed
Jeffwan opened this issue Aug 10, 2021 · 28 comments
Closed

Support mpi jobs in universal operator #1345

Jeffwan opened this issue Aug 10, 2021 · 28 comments

Comments

@Jeffwan
Copy link
Member

Jeffwan commented Aug 10, 2021

We've migrated pytorch, mxnet(byteps), xgboost (lightbgm). The next step is to consider moving mpi apis and controllers to this repo. (for people who are not familiar with the background, check https://docs.google.com/document/d/1x1JPDQfDMIbnoQRftDH1IzGU0qvHGSU4W6Jl4rJLPhI/edit#heading=h.e33ufidnl8z6)

The reason we didn't make it in the first draft is the code structure of mpi-operator is a little bit different from rest operators.

However, we do see strong needs to adopt it in universal operator so users can easily leverage it to run horovod and horovod elastic jobs. Another big motivation is DGL support can be built on top of MPI-operator. This would be a big gain for graph users.

/cc @kubeflow/wg-training-leads @ryantd @carmark

@Jeffwan
Copy link
Member Author

Jeffwan commented Aug 13, 2021

/cc @alculquicondor

@gaocegege
Copy link
Member

cc @zw0610

@Jeffwan
Copy link
Member Author

Jeffwan commented Aug 14, 2021

@zw0610 had some discussion with @carmark and @ryantd in the past. Do you have the conclusion? BTW, what's the plan for v2 MPI? Is it ok to adopt v2 directly?

@zw0610
Copy link
Member

zw0610 commented Aug 15, 2021

On the last community meeting, I've explained that internally we prefer using the APIs that are already production-ready. From my understanding, the v2 API as long as its controller is still under completion and has not been widely adopted by the community users. Merging v2 version directly may not benefit many users who persist with v1 API.

@alculquicondor
Copy link

I estimate the v2 controller and API will be ready in a month at most. Then we can produce a release.

The new controller has more coverage at unit, integration and e2e level, so it's in good track to be production-ready :)

@alculquicondor
Copy link

Also, I'm interested in helping with the migration of mpi-operator to the universal operator, but only after the v2 release. I don't think I would be able to implement everything myself, but I can at the very least help with reviews.

@johnugeorge
Copy link
Member

How about getting v2 API directly in the universal operator while we continue using v1 from current MPI repo? This way, both will be isolated and switching from v1 to v2 will be easy?

@alculquicondor
Copy link

I vote yes to that. We can migrate only the v2 API. But we are not ready yet.

@Jeffwan
Copy link
Member Author

Jeffwan commented Aug 19, 2021

@johnugeorge

@zw0610 proposed to merge two version in this repo. Since mpi uses different API versions, it's ok for them to coexist. There're lots of v1 users as I know and they will not immediately migrate to v2 because of both operator and api changes

@alculquicondor
Copy link

One problem I see is that we don't have a conversion webhook. The v2 controller can't support v1 APIs, and even if it could, it cannot handle the missing resources that jobs created with a v1 controller wouldn't have (like the workers Service). I think it's reasonable to only have support for v2. If users don't want to upgrade yet, they can keep using the separate v1 controller.

@Jeffwan
Copy link
Member Author

Jeffwan commented Aug 19, 2021

One problem I see is that we don't have a conversion webhook. The v2 controller can't support v1 APIs, and even if it could, it cannot handle the missing resources that jobs created with a v1 controller wouldn't have (like the workers Service). I think it's reasonable to only have support for v2. If users don't want to upgrade yet, they can keep using the separate v1 controller.

I mean two apis + two controllers. Just treat mpiv2 as a separate framework. In that case, we don't necessarily need conversions in that case.

@alculquicondor
Copy link

I don't see much value from adding the two controllers, considering the extra effort.

@alculquicondor
Copy link

I think we should take this as an opportunity to encourage users to upgrade and simplify their infrastructure.

@gaocegege
Copy link
Member

Is there any progress? Recently some users expect that horovod can be supported in the universal operator.

@gaocegege
Copy link
Member

/cc @qiankunli

@carmark
Copy link
Member

carmark commented Sep 9, 2021

Our team have been working on this integration, hope that we can submit a draft PR next week.

@alculquicondor
Copy link

Are the integration and E2E tests being migrated as well?

@terrytangyuan
Copy link
Member

terrytangyuan commented Sep 23, 2021

#1413 (comment)

From @alculquicondor

That sounds like a maintenance nightmare. The new controller supports the same features as the old controller and more, it's more robust and has greater coverage. I don't see why we should migrate the old controller. If people really want to use it, they can refer to the implementation in the old repository.

Let's continue this conversion here.

Is everyone okay with ONLY migrating the NEW V2 controller? I can see things will be hard to maintain if we migrate two versions. The old version will be deprecated at some point in the future anyways so we will have to come up with a migration/deprecation plan and schedule. Any functionality we are missing in v2?

@alculquicondor
Copy link

First reference question:
I suppose the first kubeflow version to have the migrated controller would be at least 1.5. When is it supposed to be released?

@zw0610
Copy link
Member

zw0610 commented Sep 23, 2021

#1413 (comment)

From @alculquicondor

That sounds like a maintenance nightmare. The new controller supports the same features as the old controller and more, it's more robust and has greater coverage. I don't see why we should migrate the old controller. If people really want to use it, they can refer to the implementation in the old repository.

Let's continue this conversion here.

Is everyone okay with ONLY migrating the NEW V2 controller? I can see things will be hard to maintain if we migrate two versions. The old version will be deprecated at some point in the future anyways so we will have to come up with a migration/deprecation plan and schedule. Any functionality we are missing in v2?

No. I still see many users using the v1 API. Forcing users to adopt v2 API seems not a good choice for our internal users and external customers. As I brought in community meeting before, it will be great to provide solid user cases for v2 API at scale, which I can use as examples to our users showing the v2 controller is production ready.

@zw0610
Copy link
Member

zw0610 commented Sep 23, 2021

Meanwhile, I think one of the benefits we use one-manager-multi-contorller mode is we can support multiple controllers in one repository. To me, mpi-controller v1 and mpi-controller v2 are just two controllers, which does not need to be maintained by the same group of contributors. I do not fully make sense of the exclusive decision to keep one instead of both.

@terrytangyuan
Copy link
Member

Seems like we are facing the chicken or the egg problem. We need adoptions to show production readiness but production readiness comes from a wider range of testing from users.

@alculquicondor You mentioned the maintenance efforts for two controllers. I think at least @zw0610's team can help maintain v1 controller. Any other concerns?

@alculquicondor
Copy link

alculquicondor commented Sep 23, 2021

No. I still see many users using the v1 API.

They can still use the v1 API. We just have to provide a conversion webhook. That seems like a much better direction to me. Still, without the conversion webhook, the jobs mostly work, except for a couple of fields.

@alculquicondor
Copy link

Depending on the date for the kubeflow 1.5 release, we can have this plan:

  • Migrate the v2 controller only
  • Wait for signal for users of the v2 controller from the mpi-operator repository.
  • Fix any performance issue that might came up
  • If it really doesn't scale, migrate the v1 controller as well.

@terrytangyuan
Copy link
Member

FYI I added this to Oct 6th AutoML and Training WG Meeting agenda to discuss. cc @andreyvelich

@andreyvelich
Copy link
Member

Thank you @terrytangyuan.
@alculquicondor @zw0610 Please try to attend this meeting.
This is the meeting notes: https://bit.ly/2PWVCkV

@alculquicondor
Copy link

Thanks, I'll be there

@stale
Copy link

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants