This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

[stable/horovod] Add chart for horovod (#5415)
* [incubator/horovod] support horovod

* fix lint check

* update the readme

* Separate files per resource

* add use secrets

* add env for using secrets

* move to stable

* change to official docker image
cheyang authored and k8s-ci-robot committed May 18, 2018
1 parent 6489136 commit dd642bf
Showing 12 changed files with 623 additions and 0 deletions.
21 changes: 21 additions & 0 deletions stable/horovod/.helmignore
@@ -0,0 +1,21 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*~
# Various IDEs
.project
.idea/
*.tmproj
12 changes: 12 additions & 0 deletions stable/horovod/Chart.yaml
@@ -0,0 +1,12 @@
apiVersion: v1
description: A Helm chart for deploying Horovod
name: horovod
version: "0.1.1"
appVersion: "0.12.1"
sources:
  - https://github.com/uber/horovod
  - https://github.com/uber/horovod/blob/master/docs/docker.md
home: https://eng.uber.com/horovod/
maintainers:
  - name: cheyang
    email: cheyang@163.com
111 changes: 111 additions & 0 deletions stable/horovod/README.md
@@ -0,0 +1,111 @@
# Horovod

[Horovod](https://eng.uber.com/horovod/) is a distributed training framework for TensorFlow, developed by Uber. Its goal is to make distributed deep learning fast and easy to use. The project also provides [Horovod in Docker](https://github.com/uber/horovod/blob/master/docs/docker.md) to streamline the installation process.

## Introduction

This chart bootstraps Horovod, a distributed TensorFlow framework, on a Kubernetes cluster using the Helm package manager. It deploys the Horovod workers as a StatefulSet and the Horovod master as a Job, then discovers the host list automatically.
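
The host list is rendered by the `hostfile.config` entry in `templates/config.yaml`: one line for the master, then one line per worker pod. The sketch below reproduces that naming pattern in plain shell so you can preview what the generated file should contain. The function name `render_hostfile` is hypothetical and not part of the chart, and the example assumes a release named `mnist`, for which the `horovod.fullname` helper yields `mnist-horovod`.

```bash
#!/bin/bash
# Hypothetical helper (not part of the chart): prints the host list that
# the hostfile.config template is expected to render.
# Usage: render_hostfile <fullname> <worker_count>
render_hostfile() {
  local name="$1"
  local workers="$2"
  local i=0
  # First entry: the master, reached through the "<fullname>-master" service.
  echo "${name}-master"
  # One entry per worker pod: "<fullname>-<index>.<fullname>".
  while [ "$i" -lt "$workers" ]; do
    echo "${name}-${i}.${name}"
    i=$((i + 1))
  done
}

render_hostfile mnist-horovod 3
# prints:
# mnist-horovod-master
# mnist-horovod-0.mnist-horovod
# mnist-horovod-1.mnist-horovod
# mnist-horovod-2.mnist-horovod
```

If `mpiexec` reports unreachable hosts, comparing this expected list with the hostfile inside the master pod is a quick first check.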

## Prerequisites

- Kubernetes cluster v1.8+

## Build Docker Image

You can download the [official Horovod Dockerfile](https://github.com/uber/horovod/blob/master/Dockerfile), then modify it according to your requirements, e.g. to select a different CUDA, TensorFlow, or Python version.

```
# mkdir horovod-docker
# wget -O horovod-docker/Dockerfile https://raw.githubusercontent.com/uber/horovod/master/Dockerfile
# docker build -t horovod:latest horovod-docker
```

## Define the values.yaml

To deploy Horovod with GPUs, create a `values.yaml` like the following:

```
worker:
  number: 3
  podManagementPolicy: Parallel
  image:
    repository: uber/horovod
    tag: 0.12.1-tf1.8.0-py3.5
    pullPolicy: IfNotPresent
  resources:
    limits:
      nvidia.com/gpu: 1
    requests:
      nvidia.com/gpu: 1
master:
  image:
    repository: uber/horovod
    tag: 0.12.1-tf1.8.0-py3.5
    pullPolicy: IfNotPresent
  args:
    - "mpiexec -n ${WORKERS} --hostfile /kubeflow/openmpi/assets/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'"
```

To deploy Horovod without GPUs, create a `values.yaml` like the following:

```
worker:
  number: 3
  podManagementPolicy: Parallel
  image:
    repository: uber/horovod
    tag: 0.12.1-tf1.8.0-py3.5
    pullPolicy: IfNotPresent
master:
  image:
    repository: uber/horovod
    tag: 0.12.1-tf1.8.0-py3.5
    pullPolicy: IfNotPresent
  args:
    - "mpiexec -n 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs python /examples/tensorflow_mnist.py'"
```



## Installing the Chart

To install the chart with the release name `mnist`:

```bash
$ helm install --values values.yaml --name mnist stable/horovod
```

## Uninstalling the Chart

To uninstall/delete the `mnist` deployment:

```bash
$ helm delete mnist
```

The command removes all the Kubernetes components associated with the chart and
deletes the release.

## Configuration

The following table lists the configurable parameters of the Horovod
chart and their default values.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `ssh.port` | SSH port | `22` |
| `ssh.useSecrets` | Whether to use secrets for SSH | `false` |
| `worker.number` | Number of workers | `5` |
| `worker.image.repository` | Horovod worker image | `uber/horovod` |
| `worker.image.pullPolicy` | `pullPolicy` for the worker image | `IfNotPresent` |
| `worker.image.tag` | `tag` for the worker image | `0.12.1-tf1.8.0-py3.5` |
| `worker.resources` | Worker pod resource requests and limits | `{}` |
| `worker.env` | Worker environment variables | `{}` |
| `master.image.repository` | Horovod master image | `uber/horovod` |
| `master.image.tag` | `tag` for the master image | `0.12.1-tf1.8.0-py3.5` |
| `master.image.pullPolicy` | `pullPolicy` for the master image | `IfNotPresent` |
| `master.args` | Master args | `{}` |
| `master.resources` | Master pod resource requests and limits | `{}` |
| `master.env` | Master environment variables | `{}` |
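
As a rough illustration of how these parameters can be overridden, the sketch below writes a small values file that changes only `ssh.port` and `worker.number`. The file name `horovod-overrides.yaml` and the port `32222` are arbitrary choices made for this example, not values taken from the chart.

```bash
#!/bin/bash
# Write a minimal override file; any key left out keeps the chart default.
# The file name and the port number are illustrative assumptions.
cat > horovod-overrides.yaml <<'EOF'
ssh:
  port: 32222
worker:
  number: 2
EOF

# Pass the file to Helm when installing (shown as a comment, not run here):
#   helm install --values horovod-overrides.yaml --name mnist stable/horovod
```

The same keys can also be supplied on the command line with Helm's `--set` flag, for example `--set ssh.port=32222,worker.number=2`. Note that `master.args` would still need a matching `mpiexec -n` count.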
5 changes: 5 additions & 0 deletions stable/horovod/templates/NOTES.txt
@@ -0,0 +1,5 @@
1. Get the application URL by running these commands:

*** NOTE: It may take a few minutes for the statefulset to be available

*** you can watch the status of statefulset by running 'kubectl get sts --namespace {{ .Release.Namespace }} -w {{ template "horovod.fullname" . }}' ***
32 changes: 32 additions & 0 deletions stable/horovod/templates/_helpers.tpl
@@ -0,0 +1,32 @@
{{/* vim: set filetype=mustache: */}}
{{/*
Expand the name of the chart.
*/}}
{{- define "horovod.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}}
{{- end -}}

{{/*
Create a default fully qualified app name.
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
If release name contains chart name it will be used as a full name.
*/}}
{{- define "horovod.fullname" -}}
{{- if .Values.fullnameOverride -}}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- $name := default .Chart.Name .Values.nameOverride -}}
{{- if contains $name .Release.Name -}}
{{- .Release.Name | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}}
{{- end -}}
{{- end -}}
{{- end -}}

{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "horovod.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" -}}
{{- end -}}
115 changes: 115 additions & 0 deletions stable/horovod/templates/config.yaml
@@ -0,0 +1,115 @@
{{- $workerNum := .Values.worker.number -}}
{{- $name := include "horovod.fullname" . }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ template "horovod.fullname" . }}
  labels:
    heritage: {{ .Release.Service | quote }}
    release: {{ .Release.Name | quote }}
    chart: {{ template "horovod.chart" . }}
    app: {{ template "horovod.fullname" . }}
data:
  hostfile.config: |
    {{ $name }}-master
    {{- range $i, $none := until (int $workerNum) }}
    {{ $name }}-{{ $i }}.{{ $name }}
    {{- end }}
  ssh.readiness: |
    #!/bin/bash
    set -xev
    ssh localhost ls
  master.run: |
    #!/bin/bash
    set -x
    sleep 5
    # Start from an empty per-user SSH client config.
    mkdir -p /root/.ssh
    rm -f /root/.ssh/config
    touch /root/.ssh/config
    if [ "$USESECRETS" == "true" ];then
      ln -s /etc/secret-volume/id_rsa /root/.ssh/id_rsa
      ln -s /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys
    fi
    if [ -n "$SSHPORT" ]; then
      echo "Port $SSHPORT" > /root/.ssh/config
      # Use the same variable as the guard above (was SSH_PORT).
      sed -ri "s/#Port 22/Port ${SSHPORT}/g" /etc/ssh/sshd_config
    fi
    echo "StrictHostKeyChecking no" >> /root/.ssh/config
    /usr/sbin/sshd
    # With no arguments, keep the container alive; otherwise run the command.
    if [ $# -eq 0 ]; then
      sleep infinity
    else
      bash -c "$*"
    fi
  master.waitWorkerReady: |
    #!/bin/bash
    set -xev
    # NOTE: this function reads SSH_PORT, whereas master.run and worker.run read SSHPORT.
    function updateSSHPort() {
      if [ -n "$SSH_PORT" ]; then
        sed -i "s/^Port.*/Port $SSH_PORT /g" /root/.ssh/config
        echo "StrictHostKeyChecking no" >> /root/.ssh/config
      fi
    }
    # Try to reach every non-master host listed in the hostfile ($1) over SSH.
    function runCheckSSH() {
      if [ "$USESECRETS" == "true" ];then
        ln -s /etc/secret-volume/id_rsa /root/.ssh/id_rsa
        ln -s /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys
      fi
      for i in `cat $1`;do
        if [[ "$i" != *"master" ]];then
          retry 30 ssh -o ConnectTimeout=2 -q $i exit
        fi
      done
    }
    # retry <retry_number> <command>: rerun the command until it succeeds
    # or the retry count is exhausted.
    function retry()
    {
      local n=0;local try=$1
      local cmd="${@: 2}"
      [[ $# -le 1 ]] && {
        echo "Usage $0 <retry_number> <Command>";
      }
      set +e
      until [[ $n -ge $try ]]
      do
        $cmd && break || {
          echo "Command Fail.."
          ((n++))
          echo "retry $n :: [$cmd]"
          sleep 1;
        }
      done
      $cmd
      if [ $? -ne 0 ]; then
        exit 1
      fi
      set -e
    }
    updateSSHPort
    runCheckSSH $1
  worker.run: |
    #! /bin/sh
    set -x
    mkdir -p /root/.ssh
    rm -f /root/.ssh/config
    touch /root/.ssh/config
    if [ "$USESECRETS" == "true" ];then
      ln -s /etc/secret-volume/id_rsa /root/.ssh/id_rsa
      ln -s /etc/secret-volume/authorized_keys /root/.ssh/authorized_keys
    fi
    if [ -n "$SSHPORT" ]; then
      echo "Port $SSHPORT" > /root/.ssh/config
      # Use the same variable as the guard above (was SSH_PORT).
      sed -ri "s/#Port 22/Port ${SSHPORT}/g" /etc/ssh/sshd_config
    fi
    echo "StrictHostKeyChecking no" >> /root/.ssh/config
    # Run sshd in the foreground so the worker container stays up.
    /usr/sbin/sshd -D
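
Review note on `master.waitWorkerReady`: the `retry` helper stores the command in a single string and expands it unquoted, so arguments containing spaces would be re-split, and it runs the command one extra time after the loop. A minimal sketch of an alternative is shown below, assuming the same `retry <count> <command...>` calling convention. It is not part of this commit; the `RETRY_DELAY` variable is a hypothetical addition so the delay can be shortened when testing.

```bash
#!/bin/bash
# Sketch of a retry helper (not the chart's implementation).
# Usage: retry <max_attempts> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 once max_attempts have failed.
retry() {
  local max="$1"
  shift
  local n=1
  # "$@" preserves each argument exactly as passed by the caller.
  until "$@"; do
    if [ "$n" -ge "$max" ]; then
      echo "command failed after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    # RETRY_DELAY is a hypothetical knob; the delay defaults to 1 second.
    sleep "${RETRY_DELAY:-1}"
  done
}

# Example call, mirroring the worker check in runCheckSSH (not run here):
#   retry 30 ssh -o ConnectTimeout=2 -q "$host" exit
```

Returning a status instead of calling `exit` leaves the decision to abort with the caller, which also makes the helper usable outside a `set -e` script.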
19 changes: 19 additions & 0 deletions stable/horovod/templates/job-service.yaml
@@ -0,0 +1,19 @@
apiVersion: v1
kind: Service
metadata:
  name: {{ template "horovod.fullname" . }}-master
  labels:
    app: {{ template "horovod.name" . }}
    chart: {{ template "horovod.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  clusterIP: None
  ports:
  - name: ssh
    port: {{ .Values.ssh.port }}
    targetPort: {{ .Values.ssh.port }}
  selector:
    app: {{ template "horovod.name" . }}
    release: {{ .Release.Name }}
    role: master