Remove chainer docs (#1516)
* Update chainer.md
* Update chainer.md
* Update chainer.md
* Added brevity and links to Kubeflow 0.6 docs
* Added specific link to Chainer page

Co-authored-by: Sarah Maddox <sarahmaddox@users.noreply.github.com>
1 parent c7c3b01
commit 99c27f6d15abf9305e96ae9f9ba6fea490b0a075
Showing 1 changed file with 2 additions and 142 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,148 +1,8 @@ | ||
+++ | ||
title = "Chainer Training" | ||
description = "Instructions for using Chainer for training" | ||
description = "See Kubeflow [v0.6 docs](https://v0-6.kubeflow.org/docs/components/training/chainer/) for instructions on using Chainer for training" | ||
weight = 4 | ||
toc = true | ||
+++ | ||
|
||
This guide walks you through using Chainer for training your model. | ||
|
||
## What is Chainer? | ||
|
||
[Chainer](https://chainer.org/) is a powerful, flexible and intuitive deep learning framework. | ||
|
||
- Chainer supports CUDA computation. It only requires a few lines of code to leverage a GPU. It also runs on multiple GPUs with little effort. | ||
- Chainer supports various network architectures including feed-forward nets, convnets, recurrent nets and recursive nets. It also supports per-batch architectures. | ||
- Forward computation can include any Python control flow statements without losing the ability to backpropagate. This makes code intuitive and easy to debug. | ||
|
||
[ChainerMN](https://github.com/chainer/chainermn) is an additional package for Chainer, a flexible deep learning framework. ChainerMN enables multi-node distributed deep learning with the following features: | ||
|
||
- Scalable --- it makes full use of the latest technologies such as NVIDIA NCCL and CUDA-Aware MPI, | ||
- Flexible --- even dynamic neural networks can be trained in parallel thanks to Chainer's flexibility, and | ||
- Easy --- minimal changes to existing user code are required. | ||
|
||
[This blog post](https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html) provides benchmark results using up to 128 GPUs. | ||
|
||
## Installing Chainer Operator | ||
|
||
If you haven't already done so please follow the [Getting Started Guide](/docs/started/getting-started/) to deploy Kubeflow. | ||
|
||
An **alpha** version of [Chainer](https://chainer.org/) support was introduced with Kubeflow 0.3.0. You must be using a version of Kubeflow newer than 0.3.0. | ||
|
||
## Verify that Chainer support is included in your Kubeflow deployment | ||
|
||
_This section has not yet been converted to kustomize, please refer to [kubeflow/manifests/issues/232](https://github.com/kubeflow/manifests/issues/232)._ | ||
|
||
Check that the Chainer Job custom resource is installed | ||
|
||
```shell | ||
kubectl get crd | ||
``` | ||
|
||
The output should include `chainerjobs.kubeflow.org` | ||
|
||
``` | ||
NAME AGE | ||
... | ||
chainerjobs.kubeflow.org 4d | ||
... | ||
``` | ||
|
||
If it is not included you can add it as follows | ||
|
||
```shell | ||
cd ${KSONNET_APP} | ||
ks pkg install kubeflow/chainer-job | ||
ks generate chainer-operator chainer-operator | ||
ks apply ${ENVIRONMENT} -c chainer-operator | ||
``` | ||
|
||
## Creating a Chainer Job | ||
|
||
You can create a Chainer Job by defining a ChainerJob config file. First, create a file `example-job-mn.yaml` like the one below: | ||
|
||
```yaml | ||
apiVersion: kubeflow.org/v1alpha1 | ||
kind: ChainerJob | ||
metadata: | ||
name: example-job-mn | ||
spec: | ||
backend: mpi | ||
master: | ||
mpiConfig: | ||
slots: 1 | ||
activeDeadlineSeconds: 6000 | ||
backoffLimit: 60 | ||
template: | ||
spec: | ||
containers: | ||
- name: chainer | ||
image: everpeace/chainermn:1.3.0 | ||
command: | ||
- sh | ||
- -c | ||
- | | ||
mpiexec -n 3 -N 1 --allow-run-as-root --display-map --mca mpi_cuda_support 0 \ | ||
python3 /train_mnist.py -e 2 -b 1000 -u 100 | ||
workerSets: | ||
ws0: | ||
replicas: 2 | ||
mpiConfig: | ||
slots: 1 | ||
template: | ||
spec: | ||
containers: | ||
- name: chainer | ||
image: everpeace/chainermn:1.3.0 | ||
command: | ||
- sh | ||
- -c | ||
- | | ||
while true; do sleep 1 & wait; done | ||
``` | ||
|
||
See [examples/chainerjob-reference.yaml](https://github.com/kubeflow/chainer-operator/blob/master/examples/chainerjob-reference.yaml) for the definition of each attribute. You may change the config file based on your requirements. By default, the example job runs distributed training on 3 nodes (1 master, 2 workers). | ||
|
||
Deploy the ChainerJob resource to start training: | ||
|
||
```shell | ||
kubectl create -f example-job-mn.yaml | ||
``` | ||
|
||
You should now be able to see the created pods that make up the Chainer job. | ||
|
||
```shell | ||
kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn | ||
``` | ||
|
||
The training runs for only 2 epochs and should finish within a few minutes, even on a CPU-only cluster. You can inspect the logs to follow the training progress. | ||
|
||
```shell | ||
PODNAME=$(kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn,chainerjob.kubeflow.org/role=master -o name) | ||
kubectl logs -f ${PODNAME} | ||
``` | ||
|
||
## Monitoring a Chainer Job | ||
|
||
```shell | ||
kubectl get -o yaml chainerjobs example-job-mn | ||
``` | ||
|
||
See the status section to monitor the job status. Here is sample output when the job is successfully completed. | ||
|
||
```yaml | ||
apiVersion: kubeflow.org/v1alpha1 | ||
kind: ChainerJob | ||
metadata: | ||
name: example-job-mn | ||
... | ||
status: | ||
completionTime: 2018-09-01T16:42:35Z | ||
conditions: | ||
- lastProbeTime: 2018-09-01T16:42:35Z | ||
lastTransitionTime: 2018-09-01T16:42:35Z | ||
status: "True" | ||
type: Complete | ||
startTime: 2018-09-01T16:34:04Z | ||
succeeded: 1 | ||
``` | ||
[Chainer](https://github.com/kubeflow/chainer-operator) is not supported in Kubeflow versions greater than 0.6. |
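
The `status` block shown in the removed monitoring section can also be checked programmatically. The following is a minimal sketch, assuming the ChainerJob resource has already been loaded into a Python dict (for example by parsing the output of `kubectl get -o json chainerjobs example-job-mn`). The helper name `is_job_complete` is hypothetical and is not part of Kubeflow or the Chainer operator; the field names simply mirror the sample output in the diff.

```python
from typing import Any, Dict


def is_job_complete(job: Dict[str, Any]) -> bool:
    """Return True if the job reports a Complete condition with status "True"."""
    # Assumption: the resource follows the shape of the sample output,
    # i.e. status.conditions is a list of dicts with "type" and "status".
    # A missing status or conditions list is treated as "not complete".
    status = job.get("status") or {}
    conditions = status.get("conditions") or []
    return any(
        c.get("type") == "Complete" and c.get("status") == "True"
        for c in conditions
    )


# Sample data copied from the status output in the removed section.
sample_job = {
    "apiVersion": "kubeflow.org/v1alpha1",
    "kind": "ChainerJob",
    "metadata": {"name": "example-job-mn"},
    "status": {
        "completionTime": "2018-09-01T16:42:35Z",
        "conditions": [
            {
                "lastProbeTime": "2018-09-01T16:42:35Z",
                "lastTransitionTime": "2018-09-01T16:42:35Z",
                "status": "True",
                "type": "Complete",
            }
        ],
        "startTime": "2018-09-01T16:34:04Z",
        "succeeded": 1,
    },
}

print(is_job_complete(sample_job))  # prints True
```

Treating a missing `status` as "not complete" keeps the check safe to call on a job that has only just been created, before the operator has written any conditions.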