This repository has been archived by the owner on Feb 22, 2022. It is now read-only.
[stable/horovod] Add chart for horovod (#5415)
* [incubator/horovod] support horovod
* fix lint check
* update the readme
* Separate files per resource
* add use secrets
* add env for using secrets
* move to stable
* change to official docker image
1 parent 6489136, commit dd642bf
Showing 12 changed files with 623 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
# Patterns to ignore when building packages. | ||
# This supports shell glob matching, relative path matching, and | ||
# negation (prefixed with !). Only one pattern per line. | ||
.DS_Store | ||
# Common VCS dirs | ||
.git/ | ||
.gitignore | ||
.bzr/ | ||
.bzrignore | ||
.hg/ | ||
.hgignore | ||
.svn/ | ||
# Common backup files | ||
*.swp | ||
*.bak | ||
*.tmp | ||
*~ | ||
# Various IDEs | ||
.project | ||
.idea/ | ||
*.tmproj |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
apiVersion: v1 | ||
description: A Helm chart for deploying Horovod | ||
name: horovod | ||
version: "0.1.1" | ||
appVersion: "0.12.1" | ||
sources: | ||
- https://github.com/uber/horovod | ||
- https://github.com/uber/horovod/blob/master/docs/docker.md | ||
home: https://eng.uber.com/horovod/ | ||
maintainers: | ||
- name: cheyang | ||
email: cheyang@163.com |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,111 @@ | ||
# Horovod | ||
|
||
[Horovod](https://eng.uber.com/horovod/) is a distributed training framework for TensorFlow developed by Uber. Its goal is to make distributed deep learning fast and easy to use. The project also provides [Horovod in Docker](https://github.com/uber/horovod/blob/master/docs/docker.md) to streamline installation. | ||
|
||
## Introduction | ||
|
||
This chart bootstraps Horovod, a distributed TensorFlow framework, on a Kubernetes cluster using the Helm package manager. It deploys the Horovod workers as a StatefulSet and the Horovod master as a Job, then discovers the host list automatically. | ||
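
To make the host discovery concrete, here is a minimal shell sketch of the host list that the chart's ConfigMap template renders: one `<fullname>-master` entry followed by one `<fullname>-<i>.<fullname>` entry per worker. The function name `render_hostfile` and the sample name `mnist-horovod` are hypothetical; the chart itself produces this list with Helm templating, not with a script.

```shell
# Hypothetical helper: print the host list for a release, mirroring the
# naming pattern of the chart's hostfile.config template.
render_hostfile() {
  name="$1"     # fully qualified app name, e.g. "mnist-horovod"
  workers="$2"  # value of worker.number
  echo "${name}-master"
  i=0
  while [ "$i" -lt "$workers" ]; do
    # One entry per worker, in the form <name>-<index>.<name>
    echo "${name}-${i}.${name}"
    i=$((i + 1))
  done
}

render_hostfile "mnist-horovod" 3
```

With three workers this prints four lines: the master entry first, then the workers indexed from 0 to 2.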
|
||
## Prerequisites | ||
|
||
- Kubernetes cluster v1.8+ | ||
|
||
## Build Docker Image | ||
|
||
You can download the [official Horovod Dockerfile](https://github.com/uber/horovod/blob/master/Dockerfile) and modify it to suit your requirements, e.g. to select a different CUDA, TensorFlow, or Python version. | ||
|
||
``` | ||
# mkdir horovod-docker | ||
# wget -O horovod-docker/Dockerfile https://raw.githubusercontent.com/uber/horovod/master/Dockerfile | ||
# docker build -t horovod:latest horovod-docker | ||
``` | ||
|
||
## Define the values.yaml | ||
|
||
To deploy Horovod with GPUs, create a `values.yaml` like the following: | ||
|
||
``` | ||
worker: | ||
number: 3 | ||
podManagementPolicy: Parallel | ||
image: | ||
repository: uber/horovod | ||
tag: 0.12.1-tf1.8.0-py3.5 | ||
pullPolicy: IfNotPresent | ||
resources: | ||
limits: | ||
nvidia.com/gpu: 1 | ||
requests: | ||
nvidia.com/gpu: 1 | ||
master: | ||
image: | ||
repository: uber/horovod | ||
tag: 0.12.1-tf1.8.0-py3.5 | ||
pullPolicy: IfNotPresent | ||
args: | ||
- "mpiexec -n ${WORKERS} --hostfile /kubeflow/openmpi/assets/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'" | ||
``` | ||
|
||
To deploy Horovod without GPUs, create a `values.yaml` like the following: | ||
|
||
``` | ||
worker: | ||
number: 3 | ||
podManagementPolicy: Parallel | ||
image: | ||
repository: uber/horovod | ||
tag: 0.12.1-tf1.8.0-py3.5 | ||
pullPolicy: IfNotPresent | ||
master: | ||
image: | ||
repository: uber/horovod | ||
tag: 0.12.1-tf1.8.0-py3.5 | ||
pullPolicy: IfNotPresent | ||
args: | ||
- "mpiexec -n 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs python /examples/tensorflow_mnist.py'" | ||
``` | ||
|
||
|
||
|
||
## Installing the Chart | ||
|
||
To install the chart with the release name `mnist`: | ||
|
||
```bash | ||
$ helm install --values values.yaml --name mnist stable/horovod | ||
``` | ||
|
||
## Uninstalling the Chart | ||
|
||
To uninstall/delete the `mnist` deployment: | ||
|
||
```bash | ||
$ helm delete mnist | ||
``` | ||
|
||
The command removes all the Kubernetes components associated with the chart and | ||
deletes the release. | ||
|
||
## Configuration | ||
|
||
The following table lists the configurable parameters of the Horovod | ||
chart and their default values. | ||
|
||
| Parameter | Description | Default | | ||
|-----------|-------------|---------| | ||
| `ssh.port` | SSH port | `22` | | ||
| `ssh.useSecrets` | Whether to load the SSH keys from a Secret | `false` | | ||
| `worker.number` | Number of workers | `5` | | ||
| `worker.image.repository` | Horovod worker image | `uber/horovod` | | ||
| `worker.image.pullPolicy` | `pullPolicy` for the worker image | `IfNotPresent` | | ||
| `worker.image.tag` | `tag` for the worker image | `0.12.1-tf1.8.0-py3.5` | | ||
| `worker.resources` | Resource requests and limits for the worker pods | `{}` | | ||
| `worker.env` | Environment variables for the workers | `{}` | | ||
| `master.image.repository` | Horovod master image | `uber/horovod` | | ||
| `master.image.tag` | `tag` for the master image | `0.12.1-tf1.8.0-py3.5` | | ||
| `master.image.pullPolicy` | `pullPolicy` for the master image | `IfNotPresent` | | ||
| `master.args` | Arguments passed to the master | `{}` | | ||
| `master.resources` | Resource requests and limits for the master pod | `{}` | | ||
| `master.env` | Environment variables for the master | `{}` | |
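
As a small illustration of the table above, the sketch below writes an override file that changes a few of the listed parameters. The file name `custom-values.yaml` and the chosen values are assumptions for illustration only, and the `helm install` line is left as a comment with a placeholder chart reference because it needs a live cluster.

```shell
# Write a hypothetical override file using parameters from the table above.
cat > custom-values.yaml <<'EOF'
ssh:
  port: 32222
  useSecrets: false
worker:
  number: 2
EOF

# Show what will be passed to Helm.
cat custom-values.yaml

# Then install with the override file, for example:
#   helm install --values custom-values.yaml --name mnist <chart>
```

Parameters that are not listed in the override file keep the defaults shown in the table.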
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
1. Get the application URL by running these commands: | ||
|
||
*** NOTE: It may take a few minutes for the statefulset to be available | ||
|
||
*** you can watch the status of statefulset by running 'kubectl get sts --namespace {{ .Release.Namespace }} -w {{ template "horovod.fullname" . }}' *** |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
{{/* vim: set filetype=mustache: */}} | ||
{{/* | ||
Expand the name of the chart. | ||
*/}} | ||
{{- define "horovod.name" -}} | ||
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}} | ||
{{- end -}} | ||
|
||
{{/* | ||
Create a default fully qualified app name. | ||
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec). | ||
If release name contains chart name it will be used as a full name. | ||
*/}} | ||
{{- define "horovod.fullname" -}} | ||
{{- if .Values.fullnameOverride -}} | ||
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" -}} | ||
{{- else -}} | ||
{{- $name := default .Chart.Name .Values.nameOverride -}} | ||
{{- if contains $name .Release.Name -}} | ||
{{- .Release.Name | trunc 63 | trimSuffix "-" -}} | ||
{{- else -}} | ||
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}} | ||
{{- end -}} | ||
{{- end -}} | ||
{{- end -}} | ||
|
||
{{/* | ||
Create chart name and version as used by the chart label. | ||
*/}} | ||
{{- define "horovod.chart" -}} | ||
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" -}} | ||
{{- end -}} |
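
The naming rule described in the `horovod.fullname` comment can be sketched in plain shell, which may help when predicting resource names before installing. This is an illustrative re-implementation under assumptions, not part of the chart: the function name `horovod_fullname` is hypothetical, and its second argument stands for the chart name (or `nameOverride`, if set).

```shell
# Hypothetical re-implementation of the "horovod.fullname" naming rule.
# Arguments: release name, chart name, optional fullnameOverride.
horovod_fullname() {
  release="$1"
  chart="$2"
  override="${3:-}"
  if [ -n "$override" ]; then
    full="$override"
  else
    case "$release" in
      # Release name already contains the chart name: use it as is.
      *"$chart"*) full="$release" ;;
      # Otherwise join release and chart name with a dash.
      *) full="${release}-${chart}" ;;
    esac
  fi
  # Truncate to 63 characters, then drop one trailing "-" if present.
  full="$(printf '%s' "$full" | cut -c1-63)"
  printf '%s\n' "${full%-}"
}

horovod_fullname mnist horovod
horovod_fullname my-horovod horovod
```

The first call joins the two names, while the second returns the release name unchanged because it already contains the chart name.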
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
{{- $workerNum := .Values.worker.number -}} | ||
{{- $name := include "horovod.fullname" . }} | ||
apiVersion: v1 | ||
kind: ConfigMap | ||
metadata: | ||
name: {{ template "horovod.fullname" . }} | ||
labels: | ||
heritage: {{ .Release.Service | quote }} | ||
release: {{ .Release.Name | quote }} | ||
chart: {{ template "horovod.chart" . }} | ||
app: {{ template "horovod.fullname" . }} | ||
data: | ||
hostfile.config: | | ||
{{ $name }}-master | ||
{{- range $i, $none := until (int $workerNum) }} | ||
{{ $name }}-{{ $i }}.{{ $name }} | ||
{{- end }} | ||
ssh.readiness: | | ||
#!/bin/bash | ||
set -xev | ||
ssh localhost ls | ||
master.run: | | ||
#!/bin/bash | ||
set -x | ||
sleep 5 | ||
mkdir -p /root/.ssh | ||
rm -f /root/.ssh/config | ||
touch /root/.ssh/config | ||
if [ "$USESECRETS" == "true" ];then | ||
ln -s /etc/secret-volume/id_rsa /root/.ssh/id_rsa | ||
ln -s /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys | ||
fi | ||
if [ -n "$SSHPORT" ]; then | ||
echo "Port $SSHPORT" > /root/.ssh/config | ||
sed -ri "s/#Port 22/Port ${SSHPORT}/g" /etc/ssh/sshd_config | ||
fi | ||
echo "StrictHostKeyChecking no" >> /root/.ssh/config | ||
/usr/sbin/sshd | ||
if [ $# -eq 0 ]; then | ||
sleep infinity | ||
else | ||
bash -c "$*" | ||
fi | ||
master.waitWorkerReady: | | ||
#!/bin/bash | ||
set -xev | ||
function updateSSHPort() { | ||
if [ -n "$SSH_PORT" ]; then | ||
sed -i "s/^Port.*/Port $SSH_PORT /g" /root/.ssh/config | ||
echo "StrictHostKeyChecking no" >> /root/.ssh/config | ||
fi | ||
} | ||
function runCheckSSH() { | ||
if [ "$USESECRETS" == "true" ];then | ||
ln -s /etc/secret-volume/id_rsa /root/.ssh/id_rsa | ||
ln -s /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys | ||
fi | ||
for i in `cat $1`;do | ||
if [[ "$i" != *"master" ]];then | ||
retry 30 ssh -o ConnectTimeout=2 -q $i exit | ||
fi | ||
done | ||
} | ||
function retry() | ||
{ | ||
local n=0;local try=$1 | ||
local cmd="${@: 2}" | ||
[[ $# -le 1 ]] && { | ||
echo "Usage $0 <retry_number> <Command>"; | ||
} | ||
set +e | ||
until [[ $n -ge $try ]] | ||
do | ||
$cmd && break || { | ||
echo "Command Fail.." | ||
((n++)) | ||
echo "retry $n :: [$cmd]" | ||
sleep 1; | ||
} | ||
done | ||
$cmd | ||
if [ $? -ne 0 ]; then | ||
exit 1 | ||
fi | ||
set -e | ||
} | ||
updateSSHPort | ||
runCheckSSH $1 | ||
worker.run: | | ||
#! /bin/sh | ||
set -x | ||
mkdir -p /root/.ssh | ||
rm -f /root/.ssh/config | ||
touch /root/.ssh/config | ||
if [ "$USESECRETS" = "true" ];then | ||
ln -s /etc/secret-volume/id_rsa /root/.ssh/id_rsa | ||
ln -s /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys | ||
fi | ||
if [ -n "$SSHPORT" ]; then | ||
echo "Port $SSHPORT" > /root/.ssh/config | ||
sed -ri "s/#Port 22/Port ${SSHPORT}/g" /etc/ssh/sshd_config | ||
fi | ||
echo "StrictHostKeyChecking no" >> /root/.ssh/config | ||
/usr/sbin/sshd -D |
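
The `retry` helper in `master.waitWorkerReady` re-runs an SSH probe until it succeeds or the attempt budget is spent. The following is a compact sketch of the same idea under assumptions, not the chart's code: the name `retry_cmd` is hypothetical, the one-second pause is copied from the script above, and the command is passed as separate arguments rather than as one string.

```shell
# Hypothetical retry helper: run a command up to <max> times,
# pausing one second between failed attempts.
retry_cmd() {
  max="$1"
  shift
  n=0
  while [ "$n" -lt "$max" ]; do
    # "$@" keeps each argument intact, even if it contains spaces.
    if "$@"; then
      return 0
    fi
    n=$((n + 1))
    echo "retry $n :: [$*]" >&2
    sleep 1
  done
  # All attempts failed.
  return 1
}

retry_cmd 3 true && echo "command succeeded"
```

In the worker check it could be called as `retry_cmd 30 ssh -o ConnectTimeout=2 -q "$host" exit`, where `$host` is one entry from the hostfile.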
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
apiVersion: v1 | ||
kind: Service | ||
metadata: | ||
name: {{ template "horovod.fullname" . }}-master | ||
labels: | ||
app: {{ template "horovod.name" . }} | ||
chart: {{ template "horovod.chart" . }} | ||
release: {{ .Release.Name }} | ||
heritage: {{ .Release.Service }} | ||
spec: | ||
clusterIP: None | ||
ports: | ||
- name: ssh | ||
port: {{ .Values.ssh.port }} | ||
targetPort: {{ .Values.ssh.port }} | ||
selector: | ||
app: {{ template "horovod.name" . }} | ||
release: {{ .Release.Name }} | ||
role: master |