[WIP] Caffe2 operator proposal #41

Merged 1 commit on Apr 5, 2018

Member

@carmark carmark commented Mar 18, 2018

This change is Reviewable

@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for the commit author(s). If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

@k8s-ci-robot

Hi @carmark. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@googlebot

CLAs look good, thanks!

Member

@gaocegege gaocegege left a comment

🎉 Thanks for your awesome work! I couldn't help reviewing the PR even though it is still WIP.

spec:
  backend: "redis"
  replicaSpecs:
    - replicas: 1
Member

tf-operator v1alpha2 uses a map instead of an array for the replica specs; maybe we could stay consistent with that here.

Member Author

Sure, I will consider doing that.
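
To illustrate the suggestion above, here is a minimal Python sketch of the array-to-map change. It assumes the map would be keyed by the `ReplicaType` value that appears in the proposal's YAML. The helper name `replica_specs_to_map` is hypothetical, and the exact tf-operator v1alpha2 schema is not shown in this thread, so treat the resulting layout as an assumption rather than the real API.

```python
from typing import Any, Dict, List


def replica_specs_to_map(
    replica_specs: List[Dict[str, Any]],
) -> Dict[str, Dict[str, Any]]:
    """Convert a list of replica specs into a map keyed by replica type.

    Hypothetical helper: the key name "ReplicaType" is taken from the
    proposal's YAML; the target layout is an assumption.
    """
    by_type: Dict[str, Dict[str, Any]] = {}
    for spec in replica_specs:
        replica_type = spec["ReplicaType"]
        if replica_type in by_type:
            # An array can list the same type twice; a map cannot.
            raise ValueError(f"duplicate replica type: {replica_type}")
        # The type becomes the key, so it is dropped from the value.
        by_type[replica_type] = {
            key: value for key, value in spec.items() if key != "ReplicaType"
        }
    return by_type
```

With a map, a controller can look up a role such as `WORKER` directly instead of scanning the list, and a duplicated replica type can no longer be expressed.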

Contributor

@jlewi jlewi left a comment

Thanks. This looks pretty good. Just a couple questions.

metadata:
  name: "example-job"
spec:
  backend: "redis"
Contributor

What is the backend?

Member Author

`backend` defines the distributed communication type that the Caffe2 master and workers use when initializing the worker group. Information on the different backends (and the functions they support) can be found here.

Member

PyTorch has a similar issue: kubeflow/pytorch-operator#7 (comment)

          name: caffe2
      restartPolicy: Never
- replicas: 1
  ReplicaType: HELPER
Contributor

What is the helper?

Member Author

The HELPER replica type is used for service discovery with the redis backend; it is not needed with the gloo backend.

Caffe2 is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.

## Goals
A Kubeflow user should be able to run training using Caffe2 as easily as they can using TensorFlow. This proposal is centered around a Kubernetes operator for Caffe2. A user should be able to run both single node and distributed training jobs with Caffe2.
Contributor

Can you provide some motivation for creating a custom operator? Why are the built in operators unsuitable?

Member Author

Thanks for the review. I added the following motivation:

For distributed training, Caffe2, unlike TensorFlow, has no parameter server, so it has to use Redis or MPI to find the other nodes to communicate with.
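
As a rough illustration of the node-discovery idea described above, the following Python sketch uses an in-memory dictionary as a stand-in for a shared store such as Redis. The names `KVStore`, `register`, and `discover_peers`, as well as the key layout, are hypothetical; this is not Caffe2's API, only a sketch of how replicas could find each other through a shared store.

```python
from typing import Dict


class KVStore:
    """In-memory stand-in for a shared key-value store such as Redis."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def items_with_prefix(self, prefix: str) -> Dict[str, str]:
        return {k: v for k, v in self._data.items() if k.startswith(prefix)}


def register(store: KVStore, job: str, rank: int, address: str) -> None:
    # Each replica publishes its own address under a per-job, per-rank key.
    store.set(f"{job}/rank/{rank}", address)


def discover_peers(store: KVStore, job: str, world_size: int) -> Dict[int, str]:
    # Read back every registered address; fail if some replicas are missing.
    prefix = f"{job}/rank/"
    found = store.items_with_prefix(prefix)
    if len(found) < world_size:
        raise RuntimeError(
            f"only {len(found)} of {world_size} replicas registered"
        )
    return {int(key[len(prefix):]): addr for key, addr in found.items()}
```

In a real job each replica would call something like `register` once at startup and then retry `discover_peers` until every rank has appeared; with a parameter server that shared meeting point would not be needed.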

@jlewi
Contributor

jlewi commented Mar 20, 2018

Thanks.

/lgtm

@gaocegege
Member

/ok-to-test
/hold cancel
/lgtm

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege, jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gaocegege
Member

@carmark Is the PR still WIP?

@carmark
Member Author

carmark commented Mar 20, 2018

@gaocegege Yeah, I would like to add a diagram and make the design clearer.

hostNetwork: true
containers:
  - image: carmark/caffe2:latest
    securityContext:
Contributor

Why does the process need increased privileges?

Member Author

MPI needs this, but I am not sure about the Redis backend.

ReplicaType: WORKER
template:
  spec:
    hostNetwork: true
Contributor

What is the reasoning for using hostNetwork?

Member Author

In our case, we use InfiniBand (IB) to communicate, so hostNetwork is necessary.

@k8s-ci-robot k8s-ci-robot merged commit a43fab6 into kubeflow:master Apr 5, 2018
@jose5918
Contributor

jose5918 commented Apr 6, 2018

@carmark Were you still working on this?
@jlewi Have we seen prow merge things that haven't passed status checks anywhere else?

@carmark
Member Author

carmark commented Apr 7, 2018

@jose5918 yeah, still working on this.

@jlewi
Contributor

jlewi commented Apr 7, 2018

Tide shouldn't have merged it because it has "WIP" in the title. The tide status even reports that.

I think what happened is that @kunmingg synced our labels recently, which may have removed the do-not-merge/work-in-progress label and caused the PR to get merged.

@kunmingg

kunmingg commented Apr 7, 2018

@jlewi
Yes, we had a "do-not-merge/work-in-progress" -> "status/in progress" sync action...
Shall I revert the merge?

@kunmingg

kunmingg commented Apr 7, 2018

@jlewi
Potentially affected repos (incomplete PRs might have been merged):

reporting
examples
community
pytorch-operator
caffe2-operator
experimental-beagle
hp-tuning

These repos are not affected because the bot user lacks permission (this includes our most active repos):

kubeflow
tf-operator
testing
example-seldon
experimental-kvc
