-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Caffe2 operator proposal #41
Conversation
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for the commit author(s). If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. |
Hi @carmark. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
f1bc646
to
15708f5
Compare
CLAs look good, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉 Thanks for your awesome work! I can't help reviewing the PR although it WIP.
spec: | ||
backend: "redis" | ||
replicaSpecs: | ||
- replicas: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using map instead of array in v1alpha2 in tf-operator, maybe we could keep consistent here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, will consider do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. This looks pretty good. Just a couple questions.
metadata: | ||
name: "example-job" | ||
spec: | ||
backend: "redis" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the backend?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
backend
Defines the distributed type the Caffe2 master and workers will use to communicate when initializing the worker group. Information on the different backends (and the functions they support) can be found here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyTorch has a similar issue: kubeflow/pytorch-operator#7 (comment)
name: caffe2 | ||
restartPolicy: Never | ||
- replicas: 1 | ||
ReplicaType: HELPER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the helper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HELPER
replica type will be used to service finding for redis
backend, and will be useless for gloo
backend.
Caffe2 is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow. | ||
|
||
## Goals | ||
A Kubeflow user should be able to run training using Caffe2 as easily as they can using Tensorflow. This proposal is centered around a Kubernetes operator for Caffe2. A user should be able to run both single node and distributed training jobs with Caffe2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide some motivation for creating a custom operator? Why are the built in operators unsuitable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for review, and add the follow motivation:
For distributed training, Caffe2 has no parameter server compared with Tensorflow, so it has to use Redis/MPI to find the other nodes to communicate.
Thanks. /lgtm |
/ok-to-test |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gaocegege, jlewi The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@carmark Is the PR still WIP? |
@gaocegege Yeah, I would like to add a diagram and make the design more clear. |
hostNetwork: true | ||
containers: | ||
- image: carmark/caffe2:latest | ||
securityContext: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the does the process need increased privilege?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MPI needs this, but I am not sure for Redis backend.
ReplicaType: WORKER | ||
template: | ||
spec: | ||
hostNetwork: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reasoning for using hostNetwork?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our case, we use IB to communicate, so the hostNetwork
is necessary.
@jose5918 yeah, still working on this. |
Tide shouldn't have merged it because it has "WIP" in the title. The tide status even reports that. I think what happened is @kunmingg sync'd are labels recently and I think that may have removed the label do-not-merge/work-in-progress which caused it to get merged. |
@jlewi |
@jlewi
Those repos are not affected due to bot user's lack of permission (include our most active repos):
|
This change is![Reviewable](https://camo.githubusercontent.com/23b05f5fb48215c989e92cc44cf6512512d083132bd3daf689867c8d9d386888/68747470733a2f2f72657669657761626c652e696f2f7265766965775f627574746f6e2e737667)