Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add paddle operator proposal to kubeflow community. #502

Merged
merged 1 commit into from
Jun 21, 2021

Conversation

tizhou86
Copy link
Contributor

Add paddle operator proposal to kubeflow community.


## Alternatives Considered

One option is to add PaddlePaddle support to the existing tf-operator, but the parameters and operations between two frameworks are quite different. Combining them may make the user experience unnecessarily complicated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. What about the implementation of the operator? Could you leverage https://github.com/kubeflow/common's interface that's used for other operators?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using kube-builder as the skeleton for the paddle-operator, part of the kubeflow common components are not necessary for our project, like job controller etc. We'll see whether we can leverage on kubeflow common for advanced features.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your reply, we are aware of your concern, technically, we are building our operator with the newest version of kubebuilder which take care of almost all the staff like Informer/Indexer/clienset etc. It leaves out the Reconcile function to be implemented, it also provides CRUD operations directly on resources with context, which means that the kubeflow/common may not necessary in this circumstances.

@terrytangyuan
Copy link
Member

terrytangyuan commented Mar 19, 2021

cc @kubeflow/wg-training-leads @kubeflow/wg-automl-leads

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM from me. Please take another look @kubeflow/wg-training-leads @kubeflow/wg-automl-leads as well.

@tizhou86 @kuizhiqing Are you planning to transfer the existing repo https://github.com/PaddleFlow/paddle-operator to Kubeflow org once this is approved? @Bobgy @theadactyl Any questions or concerns on this?

@terrytangyuan
Copy link
Member

/hold

Copy link
Member

@PatrickXYS PatrickXYS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

Only nits comment

Comment on lines +223 to +225
* Paddle Operator Architecture on Kubernetes, please check out [design-arch](https://github.com/PaddleFlow/paddle-operator/blob/main/docs/design-arch.md)
* Paddle training job instance fault tolerant, please check out [design-fault-tolerant](https://github.com/PaddleFlow/paddle-operator/blob/main/docs/design-fault-tolerant.md)
* Co-scheduling training job to prevent job instances from resource deadlock, please check out [design-coschedule](https://github.com/PaddleFlow/paddle-operator/blob/main/docs/design-coschedule.md)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All link expired

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we've just merged the new refactoring, I'll fix it soon.

@Jeffwan
Copy link
Member

Jeffwan commented Mar 25, 2021

We briefly talked about this story in AutoML and training bi-weekly meeting. People are interested in support plan as well. For example, if there's active maintainers for this project. If it want to be part of kubeflow release, etc.

@tizhou86
Copy link
Contributor Author

This LGTM from me. Please take another look @kubeflow/wg-training-leads @kubeflow/wg-automl-leads as well.

@tizhou86 @kuizhiqing Are you planning to transfer the existing repo https://github.com/PaddleFlow/paddle-operator to Kubeflow org once this is approved? @Bobgy @theadactyl Any questions or concerns on this?

Yes, Paddle-operator is one of our ecosystem projects that we want to transfer to kubeflow org for maintenance, It's approved by the Paddle team already, we think the open governance will be good for the Paddle ecosystem projects.

@tizhou86
Copy link
Contributor Author

We briefly talked about this story in AutoML and training bi-weekly meeting. People are interested in support plan as well. For example, if there's active maintainers for this project. If it want to be part of kubeflow release, etc.

Great! If possible, we can do a presentations on kubeflow weekly meeting, and then we can discuss about the future plan in the meeting.

@gaocegege
Copy link
Member

/assign @Jeffwan

/cc @zw0610

@google-oss-robot
Copy link

@gaocegege: GitHub didn't allow me to request PR reviews from the following users: zw0610.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/assign @Jeffwan

/cc @zw0610

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Jeffwan
Copy link
Member

Jeffwan commented Apr 13, 2021

@gaocegege Last WG-training meeting was CN time friendly and I miss it. Any conclusion?

@gaocegege
Copy link
Member

We think we can accept it. If we decide to merge into one controller, we can refactor the paddle operator later.

WDYT

@Jeffwan
Copy link
Member

Jeffwan commented Apr 13, 2021

We think we can accept it. If we decide to merge into one controller, we can refactor the paddle operator later.

WDYT

Agree. Let's kick off the transfer process.

/lgtm
/approve

/cc @Bobgy @theadactyl

@gaocegege @terrytangyuan and I help review the proposal from WG-Training side and it looks good to us. Can you help take a review?

@gaocegege
Copy link
Member

/cc @Bobgy

Could you please help us create the repository for the paddle-operator?

@tizhou86
Copy link
Contributor Author

Hi Bob, please let us know if anything should be done before the transfer process, thanks!

@tizhou86
Copy link
Contributor Author

Hi Bob, please let us know if anything should be done before the transfer process, thanks!

The original repository is: https://github.com/PaddleFlow/paddle-operator , the new repository I suppose will be : https://github.com/kubeflow/paddle-operator , we've recently added more user guides and documentations for developers.

@tizhou86
Copy link
Contributor Author

/cc @Bobgy for fear that this thread may have escaped from attention. :-)

@Bobgy
Copy link
Contributor

Bobgy commented Apr 29, 2021

Sorry for the delay, @kubeflow/wg-training-leads.

I discussed with @kubeflow/project-steering-group and we would suggest merging all training operators into one monorepo. How do feel about that?

The rationale:
we need to do a complex release process each time a training operator repo is created. It seems that the number of frameworks will only keep increase in the future. However, if I understand correctly, most operators share quite some stuff in common. So I think it might be a lower effort thing for both of us to create one central training operators repo once, and keep adding framework-specific operators to it.

@Bobgy
Copy link
Contributor

Bobgy commented Apr 29, 2021

I think the proposal also relates to #512

@terrytangyuan
Copy link
Member

terrytangyuan commented Apr 29, 2021

@Bobgy Please see our discussion on merging the operators in kubeflow/common#103. Although I think this should be discussed separately and should not block the acceptance of paddle-operator. Release is also worth a separate discussion as it's the release process in general that's complex but not specific to training operators.

@Bobgy
Copy link
Contributor

Bobgy commented May 12, 2021

@terrytangyuan regardless of whether operators are merged, what about using a monorepo?
I'm not talking about the engineering process for release, but actually Google's internal release process, any new repo in Kubeflow org needs to go through Google release process and we want to avoid this by using a monorepo (anyway, only in case when monorepo makes sense for Training WG).

Looks like we can wait for kubeflow/common#103 (comment) to see more updates?

@Jeffwan
Copy link
Member

Jeffwan commented May 18, 2021

any new repo in Kubeflow org needs to go through Google release process

em. Since this is internal process and we don't have any clues, could you help us understand the efforts? It helps WG leads to make deliberate decision at the time we have these kind of repo requests.

@tizhou86
Copy link
Contributor Author

Yeah, we are really looking forward to have paddle operator to integrated into Kubeflow community, some of our end users are building the PaddlePaddle deep learning system leveraging on Kubeflow and paddle operator, we want to make it more straightforward to our users. And with the impact of the PaddlePaddle community(15.4k for the paddle main repo and 50k+for the paddle org projects), we think it will be a win-win solution for both community.

@Bobgy
Copy link
Contributor

Bobgy commented Jun 8, 2021

Sorry for the long delay, let me give an update:

  1. I have learned what the process for adopting a new repo roughly looks like, but I'll have to do it once to understand how much efforts we need each time. Due to policy issues, I cannot share more details.
  2. I had some discussions within PSG, but haven't reached a final conclusion yet. I'll update again next week.

@Bobgy
Copy link
Contributor

Bobgy commented Jun 16, 2021

Hi all, @kubeflow/project-steering-group agreed to accept this proposal and start a new repo for paddle-operator without waiting for potential merge of Training WG operators.

@kubeflow/wg-training-leads may I confirm that you are still willing to accept the proposal too, right? because it's been a while.

If everyone agrees, I can start working with @tizhou86 on the process to adopt the paddle-operator repo.

@terrytangyuan
Copy link
Member

+1 from me. @kubeflow/wg-training-leads Any concerns?

@johnugeorge
Copy link
Member

+1

@gaocegege
Copy link
Member

SGTM

@tizhou86
Copy link
Contributor Author

Great, thanks @Bobgy ! Please let me know if anything is needed from PaddlePaddle team.

@Bobgy
Copy link
Contributor

Bobgy commented Jun 21, 2021

/lgtm
/approve
Let's get the proposal in first.

@google-oss-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Bobgy, gaocegege, Jeffwan, terrytangyuan, tizhou86

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Bobgy
Copy link
Contributor

Bobgy commented Jun 21, 2021

/unhold

UPDATE: for adoption progress, refer to #520

@google-oss-robot google-oss-robot merged commit 8d2838c into kubeflow:master Jun 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants