Commit f1bc646 (parent 099c277), authored by Lei Xue, committed Mar 18, 2018. 1 changed file with 130 additions and 0 deletions.
## Motivation

Caffe2 is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal defines what that operator should look like and adds it to Kubeflow.

## Goals

A Kubeflow user should be able to run training with Caffe2 as easily as with TensorFlow. This proposal is centered around a Kubernetes operator for Caffe2. A user should be able to run both single-node and distributed training jobs with Caffe2.

This proposal defines the following:
- A Caffe2 operator
- A way to deploy the operator with kubectl
- A single-pod Caffe2 example
- A distributed Caffe2 example
- A distributed Caffe2 proposal with the [batchd scheduler](https://github.com/kubernetes-incubator/kube-arbitrator)
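Deploying the operator with kubectl could look something like the following Deployment manifest (a sketch only; the image, namespace, and service-account names are assumptions, not part of this proposal):

```yaml
# Hypothetical manifest for deploying the Caffe2 operator itself.
# Image, namespace, and RBAC names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: caffe2-operator
  namespace: kubeflow
spec:
  replicas: 1
  selector:
    matchLabels:
      name: caffe2-operator
  template:
    metadata:
      labels:
        name: caffe2-operator
    spec:
      serviceAccountName: caffe2-operator
      containers:
      - name: caffe2-operator
        image: kubeflow/caffe2-operator:latest  # placeholder image
```

It would then be installed with `kubectl apply -f caffe2-operator.yaml`.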
## Non-Goals

For the scope of this proposal, we won't address how the trained model is served.
## API (CRD and resulting objects)

### Custom Resource Definition

The custom resource submitted to the Kubernetes API would look something like this:
```yaml
apiVersion: "kubeflow.org/v1alpha1"
kind: "Caffe2Job"
metadata:
  name: "example-job"
spec:
  backend: "redis"
  replicaSpecs:
    - replicas: 1
      ReplicaType: MASTER
      template:
        spec:
          hostNetwork: true
          containers:
            - image: carmark/caffe2:latest
              name: caffe2
              securityContext:
                capabilities:
                  add: ["ALL"]
          restartPolicy: Never
    - replicas: 2
      ReplicaType: WORKER
      template:
        spec:
          hostNetwork: true
          containers:
            - image: carmark/caffe2:latest
              name: caffe2
              securityContext:
                capabilities:
                  add: ["ALL"]
          restartPolicy: Never
    - replicas: 1
      ReplicaType: HELPER
      template:
        spec:
          containers:
            - image: redis:latest
              name: redis
              ports:
                - containerPort: 6379
          restartPolicy: Never
```
This Caffe2Job resembles the existing TFJob for the tf-operator. The main differences are the omission of the parameter-server replica type and the addition of the `backend` option.

`backend` defines the distributed store the Caffe2 master and workers will use to communicate when initializing the worker group. Information on the different backends (and the functions they support) can be found [here](https://caffe2.ai/docs/distributed-training.html).
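Before Caffe2Job resources can be created, the CRD itself would need to be registered with the API server. A sketch of that registration, matching the group and version above (the plural and singular names are assumptions):

```yaml
# Hypothetical CRD registration for Caffe2Job; everything beyond the
# kubeflow.org/v1alpha1 group/version and the Caffe2Job kind is a placeholder.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: caffe2jobs.kubeflow.org
spec:
  group: kubeflow.org
  version: v1alpha1
  scope: Namespaced
  names:
    kind: Caffe2Job
    singular: caffe2job
    plural: caffe2jobs
```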
### Resulting Master

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: caffe2-master-${job_id}
  labels:
    app: caffe2-job-xx
    caffe2_job_name: example-job
    controller-uid: dc3669c6-29f1-11e8-9ccd-ac1f6b8040c6
    job-name: example-job-master-20lm-1
    job_type: MASTER
    kubeflow.org: ""
    runtime_id: 20lm
    task_index: "0"
spec:
  containers:
  - image: carmark/caffe2:latest
    imagePullPolicy: IfNotPresent
    name: caffe2
  restartPolicy: Never
```
The labels provided are used when initializing a distributed process group with Caffe2. `task_index` is determined from the number of replicas across the `MASTER` and `WORKER` replicaSpecs. `job_type` is `MASTER` for the master.
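One possible way to hand these labels to the training script is the Kubernetes downward API; this fragment (an illustration, not part of the proposal, and the env var names are placeholders) would surface them as environment variables inside the container:

```yaml
# Illustrative only: expose the pod's labels to the Caffe2 process via
# the downward API so the training script can read its rank and role.
env:
- name: TASK_INDEX
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['task_index']
- name: JOB_TYPE
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['job_type']
```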
### Resulting Worker

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: caffe2-worker-${job_id}
  labels:
    app: caffe2-job-xx
    caffe2_job_name: example-job
    controller-uid: dc3669c6-29f1-11e8-9ccd-ac1f6b8040c6
    job-name: example-job-worker-20lm-0
    job_type: WORKER
    kubeflow.org: ""
    runtime_id: 20lm
    task_index: "0"
spec:
  containers:
  - image: carmark/caffe2:latest
    imagePullPolicy: IfNotPresent
    name: caffe2
  restartPolicy: Never
```
Each worker spec generates a pod. The workers communicate with the master through the Redis helper's service name.
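For that to work, the operator would create a Service in front of the helper's Redis pod so the master and workers have a stable DNS name to rendezvous on. A sketch, assuming the selector uses the job labels shown above (the service name is an assumption):

```yaml
# Hypothetical Service fronting the HELPER Redis pod; the name and
# selector labels are placeholders for whatever the operator generates.
apiVersion: v1
kind: Service
metadata:
  name: example-job-redis
spec:
  selector:
    caffe2_job_name: example-job
    job_type: HELPER
  ports:
  - port: 6379
    targetPort: 6379
```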
## Design

This is an implementation of the Caffe2 distributed design patterns, found [here](https://caffe2.ai/docs/SynchronousSGD.html), through the lens of the TFJob operator found [here](https://github.com/kubeflow/tf-operator).

Diagram pending
## Other backends

As described [here](https://caffe2.ai/docs/distributed-training.html), Caffe2 also supports [Gloo](https://github.com/facebookincubator/gloo), another communications library for multi-machine training. For Gloo with MPI, we do not need Redis for rendezvous; the master and workers communicate over SSH. In that case it would be better to expose an sshd port in the containers, start the worker containers first, and then start the master container.

To achieve this startup ordering, we may involve the [batchd scheduler](https://github.com/kubernetes-incubator/kube-arbitrator) and use priority classes to define the priority.
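A sketch of that priority-based ordering, assuming the cluster has PriorityClass support enabled (the class names and values are assumptions):

```yaml
# Hypothetical priority classes: workers get a higher value so the
# scheduler places them before the master.
apiVersion: scheduling.k8s.io/v1alpha1
kind: PriorityClass
metadata:
  name: caffe2-worker-priority
value: 1000
---
apiVersion: scheduling.k8s.io/v1alpha1
kind: PriorityClass
metadata:
  name: caffe2-master-priority
value: 100
```

The worker pod template would then set `priorityClassName: caffe2-worker-priority` and the master `priorityClassName: caffe2-master-priority`.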