[WIP] Caffe2 operator proposal #41

carmark · 2018-03-18T13:04:36Z

This change is

googlebot · 2018-03-18T13:04:39Z

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for the commit author(s). If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

k8s-ci-robot · 2018-03-18T13:04:43Z

Hi @carmark. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

googlebot · 2018-03-18T13:08:38Z

CLAs look good, thanks!

gaocegege

🎉 Thanks for your awesome work! I couldn't help reviewing the PR even though it is still WIP.

gaocegege · 2018-03-18T13:39:42Z

proposals/caffe2-operator-proposal.md

+spec:
+  backend: "redis"
+  replicaSpecs:
+    - replicas: 1


We are using a map instead of an array in v1alpha2 of tf-operator; maybe we could stay consistent here.

Sure, I will consider doing that.
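
For illustration, a map-keyed version of the fragment under review might look like the sketch below. The layout is an assumption modeled on the reviewer's description of tf-operator v1alpha2, not something taken from the proposal; in particular, using the replica type as the map key (in place of the ReplicaType field) is hypothetical.

```yaml
# Hypothetical sketch: replicaSpecs as a map keyed by replica type,
# rather than an array whose entries each carry a ReplicaType field.
spec:
  backend: "redis"
  replicaSpecs:
    WORKER:              # assumed key, replaces "ReplicaType: WORKER"
      replicas: 1
      template:
        spec:
          containers:
            - image: carmark/caffe2:latest
              name: caffe2
          restartPolicy: Never
    HELPER:              # assumed key, replaces "ReplicaType: HELPER"
      replicas: 1
```

With a map, each replica type can appear at most once and can be looked up directly by name, which is presumably why staying consistent with tf-operator is attractive.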

jlewi

Thanks. This looks pretty good. Just a couple questions.

jlewi · 2018-03-18T23:43:39Z

proposals/caffe2-operator-proposal.md

+metadata:
+  name: "example-job"
+spec:
+  backend: "redis"


What is the backend?

The backend field defines the type of distributed communication that the Caffe2 master and workers use when initializing the worker group. Information on the different backends (and the functions they support) can be found here.

PyTorch has a similar issue: kubeflow/pytorch-operator#7 (comment)

jlewi · 2018-03-18T23:44:08Z

proposals/caffe2-operator-proposal.md

+              name: caffe2
+          restartPolicy: Never
+    - replicas: 1
+      ReplicaType: HELPER


What is the helper?

The HELPER replica type is used for service discovery with the Redis backend; it is not needed for the Gloo backend.

jlewi · 2018-03-18T23:45:12Z

proposals/caffe2-operator-proposal.md

+Caffe2 is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.
+
+## Goals
+A Kubeflow user should be able to run training using Caffe2 as easily as they can using Tensorflow.  This proposal is centered around a Kubernetes operator for Caffe2. A user should be able to run both single node and distributed training jobs with Caffe2.


Can you provide some motivation for creating a custom operator? Why are the built in operators unsuitable?

Thanks for the review. I have added the following motivation:

Unlike TensorFlow, Caffe2 has no parameter server for distributed training, so it has to use Redis or MPI to find the other nodes it needs to communicate with.

jlewi · 2018-03-20T02:35:14Z

Thanks.

/lgtm

gaocegege · 2018-03-20T02:50:14Z

/ok-to-test
/hold cancel
/lgtm

k8s-ci-robot · 2018-03-20T02:50:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege, jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jlewi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gaocegege · 2018-03-20T02:51:49Z

@carmark Is the PR still WIP?

carmark · 2018-03-20T03:13:53Z

@gaocegege Yeah, I would like to add a diagram and make the design clearer.

jose5918 · 2018-03-20T17:55:58Z

proposals/caffe2-operator-proposal.md

+          hostNetwork: true
+          containers:
+            - image: carmark/caffe2:latest
+              securityContext:


Why does the process need increased privileges?

MPI needs this, but I am not sure about the Redis backend.

jose5918 · 2018-03-20T17:59:33Z

proposals/caffe2-operator-proposal.md

+      ReplicaType: WORKER
+      template:
+        spec:
+          hostNetwork: true


What is the reasoning for using hostNetwork?

In our case, we use InfiniBand (IB) to communicate, so hostNetwork is necessary.

jose5918 · 2018-04-06T18:00:21Z

@carmark Were you still working on this?
@jlewi Have we seen prow merge things that haven't passed status checks anywhere else?

carmark · 2018-04-07T00:56:57Z

@jose5918 Yeah, I am still working on this.

jlewi · 2018-04-07T02:12:06Z

Tide shouldn't have merged it because it has "WIP" in the title. The tide status even reports that.

I think what happened is that @kunmingg synced our labels recently, which may have removed the do-not-merge/work-in-progress label and caused the PR to get merged.

kunmingg · 2018-04-07T16:56:02Z

@jlewi
Yes, we had a "do-not-merge/work-in-progress" -> "status/in progress" sync action...
Shall I revert the merge?
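
To make the failure mode concrete, here is a minimal sketch of a Tide merge query in Prow's configuration. The repository name and the exact label lists are assumptions for illustration only; the actual Kubeflow Prow config is not shown in this thread.

```yaml
# Illustrative Tide query; the repo and label values below are assumed.
tide:
  queries:
    - repos:
        - kubeflow/community            # assumed repository
      labels:                           # a PR must carry all of these
        - lgtm
        - approved
      missingLabels:                    # a PR must carry none of these
        - do-not-merge/work-in-progress
        - do-not-merge/hold
```

Under a query like this, a PR that has lgtm and approved is held back only while it still carries one of the missingLabels entries. If a sync renames do-not-merge/work-in-progress to "status/in progress", the PR no longer matches any blocking label and becomes eligible to merge, which is consistent with what is described above.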

kunmingg · 2018-04-07T17:23:18Z

@jlewi
Potentially affected repos (incomplete PRs might have been merged):

reporting
examples
community
pytorch-operator
caffe2-operator
experimental-beagle
hp-tuning

These repos are not affected because the bot user lacks permission (this includes our most active repos):

kubeflow
tf-operator
testing
example-seldon
experimental-kvc

k8s-ci-robot added the do-not-merge/work-in-progress label Mar 18, 2018

k8s-ci-robot requested review from gaocegege and jlewi March 18, 2018 13:04

k8s-ci-robot added the needs-ok-to-test label Mar 18, 2018

k8s-ci-robot added the size/L label Mar 18, 2018

carmark force-pushed the caffe2 branch 2 times, most recently from f1bc646 to 15708f5 Compare March 18, 2018 13:08

gaocegege reviewed Mar 18, 2018

jlewi reviewed Mar 18, 2018

[WIP] Caffe2 operator proposal

aafa861

carmark force-pushed the caffe2 branch from 15708f5 to aafa861 Compare March 19, 2018 05:12

k8s-ci-robot assigned jlewi Mar 20, 2018

k8s-ci-robot added lgtm approved labels Mar 20, 2018

k8s-ci-robot assigned gaocegege Mar 20, 2018

k8s-ci-robot removed the needs-ok-to-test label Mar 20, 2018

gaocegege mentioned this pull request Mar 20, 2018

[Repository Reuqest] Create kubeflow/caffe2-operator #43

Closed

jose5918 reviewed Mar 20, 2018

k8s-ci-robot merged commit a43fab6 into kubeflow:master Apr 5, 2018

This was referenced Apr 7, 2018

Sync common labels to every repository #24

Closed

Fix prow merge's incompatibility with new synced labels. kubeflow/testing#102

Closed
