Add paddle operator proposal to kubeflow community. #502

tizhou86 · 2021-03-17T12:21:21Z

Add paddle operator proposal to kubeflow community.

terrytangyuan · 2021-03-17T13:22:59Z

proposals/paddle-operator-proposal.md

+
+## Alternatives Considered
+
+One option is to add PaddlePaddle support to the existing tf-operator, but the parameters and operations between two frameworks are quite different. Combining them may make the user experience unnecessarily complicated.


This makes sense. What about the implementation of the operator? Could you leverage https://github.com/kubeflow/common's interface that's used for other operators?

tizhou86 · 2021-03-18

We are using kubebuilder as the skeleton for the paddle-operator. Some of the kubeflow/common components, such as the job controller, are not necessary for our project. We'll see whether we can leverage kubeflow/common for advanced features.

kuizhiqing · 2021-03-22

Thanks for your reply. We understand your concern. Technically, we are building our operator with the newest version of kubebuilder, which takes care of almost all the plumbing, such as the Informer, Indexer, and clientset. It leaves only the Reconcile function to be implemented, and it provides CRUD operations directly on resources with a context, so kubeflow/common may not be necessary in this case.
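
To make the "only Reconcile is left to implement" point concrete, here is a minimal, standard-library-only sketch of the reconcile pattern. It is not the paddle-operator's code and does not use the real kubebuilder or controller-runtime packages: `PaddleJob`, `Client`, `fakeCluster`, and `runDemo` are hypothetical names invented for illustration. A real kubebuilder reconciler would work against the framework's own client and request types instead.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ErrNotFound is returned by the hypothetical client when a job does not exist.
var ErrNotFound = errors.New("job not found")

// PaddleJob is a hypothetical, heavily simplified custom resource.
type PaddleJob struct {
	Name    string
	Workers int // desired number of worker pods
}

// Client is a hypothetical stand-in for the client a framework hands to a
// reconciler; every CRUD call takes a context.
type Client interface {
	GetJob(ctx context.Context, name string) (*PaddleJob, error)
	ListPods(ctx context.Context, job string) ([]string, error)
	CreatePod(ctx context.Context, job, pod string) error
}

// Reconciler holds only what Reconcile needs.
type Reconciler struct {
	Client Client
}

// Reconcile compares desired and observed state for one job and creates any
// missing worker pods. It reports whether it changed anything.
func (r *Reconciler) Reconcile(ctx context.Context, name string) (bool, error) {
	job, err := r.Client.GetJob(ctx, name)
	if errors.Is(err, ErrNotFound) {
		return false, nil // the job was deleted; nothing to do
	}
	if err != nil {
		return false, err
	}
	pods, err := r.Client.ListPods(ctx, job.Name)
	if err != nil {
		return false, err
	}
	changed := false
	for i := len(pods); i < job.Workers; i++ {
		pod := fmt.Sprintf("%s-worker-%d", job.Name, i)
		if err := r.Client.CreatePod(ctx, job.Name, pod); err != nil {
			return changed, err
		}
		changed = true
	}
	return changed, nil
}

// fakeCluster is an in-memory Client used only for this demonstration.
type fakeCluster struct {
	jobs map[string]*PaddleJob
	pods map[string][]string
}

func (f *fakeCluster) GetJob(_ context.Context, name string) (*PaddleJob, error) {
	job, ok := f.jobs[name]
	if !ok {
		return nil, ErrNotFound
	}
	return job, nil
}

func (f *fakeCluster) ListPods(_ context.Context, job string) ([]string, error) {
	return f.pods[job], nil
}

func (f *fakeCluster) CreatePod(_ context.Context, job, pod string) error {
	f.pods[job] = append(f.pods[job], pod)
	return nil
}

// runDemo reconciles the same job twice and returns both results plus the
// pods that exist afterwards.
func runDemo() (first, second bool, pods []string) {
	cluster := &fakeCluster{
		jobs: map[string]*PaddleJob{"mnist": {Name: "mnist", Workers: 2}},
		pods: map[string][]string{},
	}
	r := &Reconciler{Client: cluster}
	ctx := context.Background()
	first, _ = r.Reconcile(ctx, "mnist")
	second, _ = r.Reconcile(ctx, "mnist")
	return first, second, cluster.pods["mnist"]
}

func main() {
	first, second, pods := runDemo()
	fmt.Println(first, second, pods)
}
```

The first pass creates the two missing workers and the second pass is a no-op. That idempotence is the property a reconcile loop relies on, and it is why the surrounding watch and cache machinery can be left to the framework.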

terrytangyuan · 2021-03-19T21:21:06Z

cc @kubeflow/wg-training-leads @kubeflow/wg-automl-leads

terrytangyuan

This LGTM from me. Please take another look @kubeflow/wg-training-leads @kubeflow/wg-automl-leads as well.

@tizhou86 @kuizhiqing Are you planning to transfer the existing repo https://github.com/PaddleFlow/paddle-operator to Kubeflow org once this is approved? @Bobgy @theadactyl Any questions or concerns on this?

terrytangyuan · 2021-03-22T15:07:35Z

/hold

PatrickXYS

Awesome!

Only nit comments.

PatrickXYS · 2021-03-22T18:03:19Z

proposals/paddle-operator-proposal.md

+* Paddle Operator Architecture on Kubernetes, please check out [design-arch](https://github.com/PaddleFlow/paddle-operator/blob/main/docs/design-arch.md)  
+* Paddle training job instance fault tolerant, please check out [design-fault-tolerant](https://github.com/PaddleFlow/paddle-operator/blob/main/docs/design-fault-tolerant.md)
+* Co-scheduling training job to prevent job instances from resource deadlock, please check out [design-coschedule](https://github.com/PaddleFlow/paddle-operator/blob/main/docs/design-coschedule.md)


All links have expired.

tizhou86 · 2021-03-26

Since we've just merged the new refactoring, I'll fix them soon.

Jeffwan · 2021-03-25T07:10:45Z

We briefly talked about this in the AutoML and Training bi-weekly meeting. People are interested in the support plan as well: for example, whether there are active maintainers for this project, whether it wants to be part of the Kubeflow release, etc.

tizhou86 · 2021-03-26T10:58:07Z

> This LGTM from me. Please take another look @kubeflow/wg-training-leads @kubeflow/wg-automl-leads as well.
>
> @tizhou86 @kuizhiqing Are you planning to transfer the existing repo https://github.com/PaddleFlow/paddle-operator to Kubeflow org once this is approved? @Bobgy @theadactyl Any questions or concerns on this?

Yes, paddle-operator is one of our ecosystem projects that we want to transfer to the Kubeflow org for maintenance. This has already been approved by the Paddle team; we think open governance will be good for the Paddle ecosystem projects.

tizhou86 · 2021-03-26T10:59:38Z

> We briefly talked about this in the AutoML and Training bi-weekly meeting. People are interested in the support plan as well: for example, whether there are active maintainers for this project, whether it wants to be part of the Kubeflow release, etc.

Great! If possible, we can give a presentation at the Kubeflow weekly meeting and then discuss the future plan there.

gaocegege · 2021-04-12T10:56:38Z

/assign @Jeffwan

/cc @zw0610

google-oss-robot · 2021-04-12T10:56:39Z

@gaocegege: GitHub didn't allow me to request PR reviews from the following users: zw0610.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

> /assign @Jeffwan
>
> /cc @zw0610

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Jeffwan · 2021-04-13T02:43:16Z

@gaocegege The last WG-Training meeting was at a CN-friendly time and I missed it. Any conclusion?

gaocegege · 2021-04-13T03:08:39Z

We think we can accept it. If we decide to merge into one controller, we can refactor the paddle operator later.

WDYT

Jeffwan · 2021-04-13T06:14:34Z

> We think we can accept it. If we decide to merge into one controller, we can refactor the paddle operator later.
>
> WDYT

Agree. Let's kick off the transfer process.

/lgtm
/approve

/cc @Bobgy @theadactyl

@gaocegege, @terrytangyuan, and I helped review the proposal from the WG-Training side, and it looks good to us. Could you help review it as well?

gaocegege · 2021-04-27T02:27:42Z

/cc @Bobgy

Could you please help us create the repository for the paddle-operator?

tizhou86 · 2021-04-27T03:51:24Z

Hi Bob, please let us know if anything should be done before the transfer process, thanks!

tizhou86 · 2021-04-27T08:33:12Z

> Hi Bob, please let us know if anything should be done before the transfer process, thanks!

The original repository is https://github.com/PaddleFlow/paddle-operator, and I suppose the new repository will be https://github.com/kubeflow/paddle-operator. We've recently added more user guides and documentation for developers.

tizhou86 · 2021-04-29T07:05:48Z

/cc @Bobgy in case this thread has escaped attention. :-)

Bobgy · 2021-04-29T07:12:08Z

Sorry for the delay, @kubeflow/wg-training-leads.

I discussed this with @kubeflow/project-steering-group, and we would suggest merging all training operators into one monorepo. How do you feel about that?

The rationale:
We need to go through a complex release process each time a training operator repo is created, and the number of frameworks will likely keep increasing. However, if I understand correctly, most operators have quite a lot in common. So I think it would be less effort for both of us to create one central training operators repo once and keep adding framework-specific operators to it.

Bobgy · 2021-04-29T07:14:38Z

I think the proposal also relates to #512

terrytangyuan · 2021-04-29T12:03:15Z

@Bobgy Please see our discussion on merging the operators in kubeflow/common#103. However, I think this should be discussed separately and should not block the acceptance of paddle-operator. The release process is also worth a separate discussion, since the complexity lies in the release process in general and is not specific to training operators.

Bobgy · 2021-05-12T03:23:16Z

@terrytangyuan Regardless of whether the operators are merged, what about using a monorepo?
I'm not talking about the engineering process for a release, but about Google's internal release process: any new repo in the Kubeflow org needs to go through it, and we want to avoid that by using a monorepo (only if a monorepo makes sense for the Training WG, of course).

Looks like we can wait for kubeflow/common#103 (comment) to see more updates?

Jeffwan · 2021-05-18T23:50:19Z

> any new repo in Kubeflow org needs to go through Google release process

Hmm. Since this is an internal process and we have no visibility into it, could you help us understand the effort involved? That would help the WG leads make an informed decision when we get this kind of repo request.

tizhou86 · 2021-05-19T00:32:12Z

Yeah, we are really looking forward to having paddle operator integrated into the Kubeflow community. Some of our end users are building PaddlePaddle deep learning systems on top of Kubeflow and paddle operator, and we want to make this more straightforward for them. Given the reach of the PaddlePaddle community (15.4k for the main Paddle repo and 50k+ across the Paddle org projects), we think it will be a win-win for both communities.

Bobgy · 2021-06-08T12:45:02Z

Sorry for the long delay, let me give an update:

I have learned roughly what the process for adopting a new repo looks like, but I'll have to go through it once to understand how much effort it takes each time. Due to policy issues, I cannot share more details.
I had some discussions within the PSG, but we haven't reached a final conclusion yet. I'll update again next week.

Bobgy · 2021-06-16T03:33:07Z

Hi all, @kubeflow/project-steering-group agreed to accept this proposal and start a new repo for paddle-operator without waiting for a potential merge of the Training WG operators.

@kubeflow/wg-training-leads May I confirm that you are still willing to accept the proposal too? It's been a while.

If everyone agrees, I can start working with @tizhou86 on the process to adopt the paddle-operator repo.

terrytangyuan · 2021-06-16T04:02:10Z

+1 from me. @kubeflow/wg-training-leads Any concerns?

johnugeorge · 2021-06-16T05:26:49Z

+1

gaocegege · 2021-06-17T02:32:42Z

SGTM

tizhou86 · 2021-06-17T03:10:16Z

Great, thanks @Bobgy ! Please let me know if anything is needed from PaddlePaddle team.

Bobgy · 2021-06-21T04:29:07Z

/lgtm
/approve
Let's get the proposal in first.

google-oss-robot · 2021-06-21T04:29:13Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Bobgy, gaocegege, Jeffwan, terrytangyuan, tizhou86

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Bobgy]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Bobgy · 2021-06-21T05:33:45Z

/unhold

UPDATE: for adoption progress, refer to #520

Add paddle operator proposal to kubeflow community.

d1e9ae0

google-cla bot added the cla: yes label Mar 17, 2021

google-oss-robot requested review from Bobgy and theadactyl March 17, 2021 12:21

google-oss-robot added the size/L label Mar 17, 2021

tizhou86 mentioned this pull request Mar 17, 2021

Proposal for a Paddle Operator #503

Closed

terrytangyuan reviewed Mar 17, 2021

gaocegege approved these changes Mar 20, 2021

terrytangyuan approved these changes Mar 22, 2021

google-oss-robot added the do-not-merge/hold label Mar 22, 2021

PatrickXYS reviewed Mar 22, 2021

google-oss-robot assigned Jeffwan Apr 12, 2021

google-oss-robot added the lgtm label Apr 13, 2021

google-oss-robot assigned Bobgy Jun 21, 2021

google-oss-robot added the approved label Jun 21, 2021

Bobgy mentioned this pull request Jun 21, 2021

Process to adopt paddle/operator #520

Closed

google-oss-robot removed the do-not-merge/hold label Jun 21, 2021

google-oss-robot merged commit 8d2838c into kubeflow:master Jun 21, 2021

