Merge branch 'master' into e2e_test
jlewi committed Oct 17, 2017
2 parents 014dcc7 + bfd4de5 commit b21bc3a
Showing 5 changed files with 112 additions and 31 deletions.
58 changes: 32 additions & 26 deletions README.md
@@ -12,31 +12,31 @@ Custom Resources require Kubernetes 1.7

## Motivation

Distributed TensorFlow training jobs require managing multiple sets of TensorFlow replicas.
Each set of replicas usually has a different role in the job. For example, one set acts
as parameter servers, another provides workers and another provides a controller.

K8s makes it easy to configure and deploy each set of TF replicas. Various tools like
[helm](https://github.com/kubernetes/helm) and [ksonnet](http://ksonnet.heptio.com/) can
be used to simplify generating the configs for a TF job.

However, in addition to generating the configs we need some custom control logic because
K8s built-in controllers (Jobs, ReplicaSets, StatefulSets, etc...) don't provide the semantics
needed for managing TF jobs.

To solve this we define a
[K8S Custom Resource](https://kubernetes.io/docs/concepts/api-extension/custom-resources/)
and [Operator](https://coreos.com/blog/introducing-operators.html) to manage a TensorFlow
job on K8s.


TfJob provides a K8s resource representing a single, distributed, TensorFlow job.
The Spec and Status (defined in [tf_job.go](https://github.com/jlewi/mlkube.io/blob/master/pkg/spec/tf_job.go))
are customized for TensorFlow. The spec allows specifying the Docker image and arguments to use for each TensorFlow
replica (i.e. master, worker, and parameter server). The status provides relevant information such as the number of
replicas in various states.

-Using a TPR gives users the ability to create and manage TF Jobs just like builtin K8s resources. For example to
+Using a CRD gives users the ability to create and manage TF Jobs just like builtin K8s resources. For example to
create a job

```
@@ -56,17 +56,17 @@ example-job TfJob.v1beta1.mlkube.io

The code is closely modeled on CoreOS's [etcd-operator](https://github.com/coreos/etcd-operator).

The TfJob Spec (defined in [tf_job.go](https://github.com/jlewi/mlkube.io/blob/master/pkg/spec/tf_job.go))
reuses the existing Kubernetes structure PodTemplateSpec to describe TensorFlow processes.
We use PodTemplateSpec because we want to make it easy for users to
configure the processes; for example setting resource requirements or adding volumes.
We expect
helm or ksonnet could be used to add syntactic sugar to create more convenient APIs for users not familiar
with Kubernetes.

Leader election allows a K8s deployment resource to be used to upgrade the operator.

-## Installing the TPR and operator on your k8s cluster
+## Installing the CRD and operator on your k8s cluster

1. Clone the repository

@@ -76,15 +76,21 @@ Leader election allows a K8s deployment resource to be used to upgrade the opera

1. Deploy the operator

For non-RBAC enabled clusters:
```
helm install tf-job-operator-chart -n tf-job --wait --replace
```

For RBAC-enabled clusters:
```
-helm install tf-job-chart/ -n tf-job --wait --replace
+helm install tf-job-operator-chart -n tf-job --wait --replace --set rbac.install=true
```

1. Make sure the operator is running

```
kubectl get pods
NAME READY STATUS RESTARTS AGE
tf-job-operator-3083500267-wxj43 1/1 Running 0 48m
@@ -100,7 +106,7 @@ Leader election allows a K8s deployment resource to be used to upgrade the opera
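
If it is unclear which of the two install commands above applies, one heuristic is to check whether the cluster serves the `rbac.authorization.k8s.io` API group. The sketch below is only an illustration: `rbac_flag` is a hypothetical helper that is not part of the chart, and a served RBAC API group suggests, but does not guarantee, that RBAC authorization is actually enforced.

```shell
#!/bin/sh
# Hypothetical helper (not part of the chart): reads `kubectl api-versions`
# output on stdin and prints the extra helm flag when the RBAC API group is listed.
# Prints nothing when the group is absent.
rbac_flag() {
  if grep -q '^rbac\.authorization\.k8s\.io/'; then
    echo "--set rbac.install=true"
  fi
}

# Usage against a live cluster (requires kubectl and helm):
#   helm install tf-job-operator-chart -n tf-job --wait --replace $(kubectl api-versions | rbac_flag)
```

The helper only inspects text, so it can be tried on saved `kubectl api-versions` output before touching a cluster.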

## Using GPUs

The use of GPUs and K8s is still in flux. The following works with GKE & K8s 1.7.2. If this doesn't work on
your setup please consider opening an issue.

### Prerequisites
@@ -133,7 +139,7 @@ any container which uses this resource should have the volumes mentioned mounted
from the host.

The config is usually specified using a K8s ConfigMap and then passing the config into the controller via
the --controller_config_file.

The helm package for the controller includes a config map suitable for GKE. This ConfigMap may need to be modified
for your cluster if you aren't using GKE.
@@ -170,10 +176,10 @@ Here are the configuration options for TensorBoard:

| Name | Description | Required | Default |
|---|---|---|---|
| `logDir` | Specifies the directory where TensorBoard will look to find TensorFlow event files that it can display | Yes | `None` |
| `volumes` | `Volumes` information that will be passed to the TensorBoard `deployment` | No | [] |
| `volumeMounts` | `VolumeMounts` information that will be passed to the TensorBoard `deployment` | No | [] |
| `serviceType` | `ServiceType` information that will be passed to the TensorBoard `service`| No | `ClusterIP` |

For example:

@@ -208,20 +214,20 @@ spec:
volumeMounts:
- mountPath: /tmp/tensorflow
name: azurefile
```


## Run the example

A simplistic TF program is in the directory tf_sample.

1. Start the example

```
helm install --name=tf-job ./examples/tf_job
```

1. Check the job

```
@@ -244,7 +250,7 @@ replica that produced the log entry. There are two issues here
* Using Python's sitecustomize.py might facilitate injecting a custom log handler that outputs JSON entries.
* For parameter servers, we might want to just run the TensorFlow standard server, and it's not clear how we
would convert those logs to JSON.

1. Integrate with Kubernetes cluster level logging.

* We'd like the logs to integrate nicely with whatever cluster level logging users configure.
@@ -261,7 +267,7 @@ kubectl logs
So that users don't need to depend on cluster level logging just to see basic logs.

In the current implementation, pods aren't deleted until the TfJob is deleted. This allows standard out/error to be fetched
via kubectl. Unfortunately, this leaves pods in the Running state when the TfJob is marked as done, which is confusing.

### Status information

@@ -315,7 +321,7 @@ go install github.com/jlewi/mlkube.io/cmd/tf_operator

Running the operator locally (as opposed to deploying it on a K8s cluster) is convenient for debugging/development.

We can configure the operator to run locally using the configuration available in your kubeconfig to communicate with
a K8s cluster.

Set your environment
@@ -14,9 +14,7 @@ data:
- name: nvidia-debug-tools # optional
mountPath: /usr/local/bin/nvidia
hostPath: /home/kubernetes/bin/nvidia/bin
---

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
@@ -28,6 +26,9 @@ spec:
labels:
name: tf-job-operator
spec:
{{- if .Values.rbac.install }}
serviceAccountName: tf-job-operator
{{- end }}
containers:
- name: tf-job-operator
image: {{ .Values.image }}
65 changes: 65 additions & 0 deletions tf-job-operator-chart/templates/rbac.yaml
@@ -0,0 +1,65 @@
{{ if .Values.rbac.install }}
apiVersion: rbac.authorization.k8s.io/{{ required "A valid .Values.rbac.apiVersion entry required!" .Values.rbac.apiVersion }}
kind: ClusterRole
metadata:
name: tf-job-operator
labels:
app: tf-job-operator
rules:
- apiGroups:
- mlkube.io
resources:
- tfjobs
verbs:
- "*"
- apiGroups:
- apiextensions.k8s.io
resources:
- customresourcedefinitions
verbs:
- "*"
- apiGroups:
- storage.k8s.io
resources:
- storageclasses
verbs:
- "*"
- apiGroups:
- batch
resources:
- jobs
verbs:
- "*"
- apiGroups:
- ""
resources:
- pods
- services
- endpoints
- persistentvolumeclaims
- events
verbs:
- "*"
- apiGroups:
- apps
- extensions
resources:
- deployments
verbs:
- "*"
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/{{ required "A valid .Values.rbac.apiVersion entry required!" .Values.rbac.apiVersion }}
metadata:
name: tf-job-operator
labels:
app: tf-job-operator
subjects:
- kind: ServiceAccount
name: tf-job-operator
namespace: {{ .Release.Namespace }}
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: tf-job-operator
{{ end }}
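
After installing with `rbac.install=true`, one way to sanity-check the ClusterRole and ClusterRoleBinding above is to impersonate the operator's service account and ask the API server whether it may act on the listed resources. This is a sketch under assumptions: the helper names are hypothetical, the `default` namespace fallback is an assumption, and it presumes a kubectl version that provides `auth can-i`. The function only prints the commands; it does not run them.

```shell
#!/bin/sh
# Hypothetical helper: builds the user name Kubernetes assigns to a service account.
sa_user() {
  echo "system:serviceaccount:$1:$2"
}

# Prints (does not run) impersonated permission checks for the operator's
# service account. $1 is the release namespace; "default" is an assumed fallback.
print_rbac_checks() {
  ns="${1:-default}"
  user="$(sa_user "$ns" tf-job-operator)"
  for resource in tfjobs.mlkube.io customresourcedefinitions.apiextensions.k8s.io jobs.batch pods services; do
    echo "kubectl auth can-i create $resource --as=$user"
  done
}

# Usage: review the commands, then pipe them to a shell to run them.
#   print_rbac_checks default
#   print_rbac_checks default | sh
```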
8 changes: 8 additions & 0 deletions tf-job-operator-chart/templates/service-account.yaml
@@ -0,0 +1,8 @@
{{ if .Values.rbac.install }}
apiVersion: v1
kind: ServiceAccount
metadata:
name: tf-job-operator
labels:
app: tf-job-operator
{{ end }}
7 changes: 4 additions & 3 deletions tf-job-operator-chart/values.yaml
@@ -2,6 +2,7 @@
image: gcr.io/tf-on-k8s-dogfood/tf_operator:10b10fd
test_image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff




## Install Default RBAC roles and bindings
rbac:
install: false
apiVersion: v1beta1
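
The `rbac.apiVersion` value is appended to `rbac.authorization.k8s.io/` in templates/rbac.yaml, and the `required` call there fails rendering when the value is empty. The sketch below mirrors that composition in plain shell purely as an illustration; `rbac_api_version` is a hypothetical name and not something the chart provides.

```shell
#!/bin/sh
# Hypothetical illustration of how templates/rbac.yaml derives its apiVersion
# from .Values.rbac.apiVersion, including the failure on an empty value.
rbac_api_version() {
  if [ -z "$1" ]; then
    echo "A valid .Values.rbac.apiVersion entry required!" >&2
    return 1
  fi
  echo "rbac.authorization.k8s.io/$1"
}

# Usage: overriding both values at install time on a cluster that serves
# rbac.authorization.k8s.io/v1 might look like:
#   helm install tf-job-operator-chart -n tf-job --wait --replace \
#     --set rbac.install=true,rbac.apiVersion=v1
```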
