
Remove chainer docs (#1516)

* Update chainer.md

* Update chainer.md

* Update chainer.md

* Added brevity and links to Kubeflow 0.6 docs

* Added specific link to Chainer page

Co-authored-by: Sarah Maddox <sarahmaddox@users.noreply.github.com>
2 people authored and k8s-ci-robot committed Jan 10, 2020
1 parent c7c3b01 commit 99c27f6d15abf9305e96ae9f9ba6fea490b0a075
Showing with 2 additions and 142 deletions.
  1. +2 −142 content/docs/components/training/chainer.md
@@ -1,148 +1,8 @@
+++
title = "Chainer Training"
description = "Instructions for using Chainer for training"
description = "See Kubeflow [v0.6 docs](https://v0-6.kubeflow.org/docs/components/training/chainer/) for instructions on using Chainer for training"
weight = 4
toc = true
+++

This guide walks you through using Chainer for training your model.

## What is Chainer?

[Chainer](https://chainer.org/) is a powerful, flexible and intuitive deep learning framework.

- Chainer supports CUDA computation. It only requires a few lines of code to leverage a GPU. It also runs on multiple GPUs with little effort.
- Chainer supports various network architectures including feed-forward nets, convnets, recurrent nets and recursive nets. It also supports per-batch architectures.
- Forward computation can include any Python control flow statements without sacrificing the ability to backpropagate. This makes code intuitive and easy to debug.

[ChainerMN](https://github.com/chainer/chainermn) is an additional package for Chainer that enables multi-node distributed deep learning with the following features:

- Scalable --- it makes full use of the latest technologies such as NVIDIA NCCL and CUDA-Aware MPI,
- Flexible --- even dynamic neural networks can be trained in parallel thanks to Chainer's flexibility, and
- Easy --- minimal changes to existing user code are required.

[This blog post](https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html) provides benchmark results using up to 128 GPUs.

## Installing Chainer Operator

If you haven't already done so, please follow the [Getting Started Guide](/docs/started/getting-started/) to deploy Kubeflow.

An **alpha** version of [Chainer](https://chainer.org/) support was introduced with Kubeflow 0.3.0. You must be using Kubeflow 0.3.0 or later.

## Verify that Chainer support is included in your Kubeflow deployment

_This section has not yet been converted to kustomize, please refer to [kubeflow/manifests/issues/232](https://github.com/kubeflow/manifests/issues/232)._

Check that the ChainerJob custom resource is installed:

```shell
kubectl get crd
```

The output should include `chainerjobs.kubeflow.org`:

```
NAME                       AGE
...
chainerjobs.kubeflow.org   4d
...
```
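For scripting, the same check can be made non-interactive. Here is a minimal sketch, using a stand-in string in place of the live `kubectl get crd` output:

```shell
# Stand-in for the CRD names column of `kubectl get crd`; on a real cluster
# you would pipe kubectl output here instead of this variable.
crds="chainerjobs.kubeflow.org
tfjobs.kubeflow.org"

# grep -q exits 0 only if the ChainerJob CRD name is present.
if echo "$crds" | grep -q '^chainerjobs.kubeflow.org$'; then
  echo "ChainerJob CRD present"
fi
```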

If it is not included, you can add it as follows:

```shell
cd ${KSONNET_APP}
ks pkg install kubeflow/chainer-job
ks generate chainer-operator chainer-operator
ks apply ${ENVIRONMENT} -c chainer-operator
```
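After applying, the operator pod should reach the `Running` state. Here is a minimal sketch of a scripted check; the pod name and listing below are hypothetical stand-ins for live `kubectl -n kubeflow get pods` output:

```shell
# Hypothetical pod listing; on a real cluster replace this variable with:
#   pods=$(kubectl -n kubeflow get pods)
pods="chainer-operator-7d4c9bd6b-abcde   1/1   Running   0   1m"

# Succeeds only if a chainer-operator pod is listed as Running.
if echo "$pods" | grep -q 'chainer-operator.*Running'; then
  echo "chainer-operator is running"
fi
```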

## Creating a Chainer Job

You can create a Chainer job by defining a ChainerJob config file. First, create a file named `example-job-mn.yaml` like the one below:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: example-job-mn
spec:
  backend: mpi
  master:
    mpiConfig:
      slots: 1
    activeDeadlineSeconds: 6000
    backoffLimit: 60
    template:
      spec:
        containers:
        - name: chainer
          image: everpeace/chainermn:1.3.0
          command:
          - sh
          - -c
          - |
            mpiexec -n 3 -N 1 --allow-run-as-root --display-map --mca mpi_cuda_support 0 \
            python3 /train_mnist.py -e 2 -b 1000 -u 100
  workerSets:
    ws0:
      replicas: 2
      mpiConfig:
        slots: 1
      template:
        spec:
          containers:
          - name: chainer
            image: everpeace/chainermn:1.3.0
            command:
            - sh
            - -c
            - |
              while true; do sleep 1 & wait; done
```

See [examples/chainerjob-reference.yaml](https://github.com/kubeflow/chainer-operator/blob/master/examples/chainerjob-reference.yaml) for a definition of each attribute. You can change the config file to suit your requirements. By default, the example runs distributed training with 3 nodes (1 master, 2 workers).

Deploy the ChainerJob resource to start training:

```shell
kubectl create -f example-job-mn.yaml
```

You should now be able to see the pods that make up the Chainer job.

```shell
kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn
```

The training runs for only 2 epochs and completes within a few minutes, even on a CPU-only cluster. You can inspect the logs to follow the training progress.

```shell
PODNAME=$(kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn,chainerjob.kubeflow.org/role=master -o name)
kubectl logs -f ${PODNAME}
```
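Note that `-o name` returns a prefixed resource name such as `pod/<pod-name>`. If a bare pod name is needed (for example for `kubectl exec`), the prefix can be stripped with plain shell parameter expansion; the pod name below is an illustrative stand-in:

```shell
# Illustrative value; in practice this comes from the kubectl command above.
PODNAME="pod/example-job-mn-master-0"

# ${var#pattern} removes the shortest matching prefix.
BARENAME="${PODNAME#pod/}"
echo "$BARENAME"   # example-job-mn-master-0
```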

## Monitoring a Chainer Job

```shell
kubectl get -o yaml chainerjobs example-job-mn
```

See the status section to monitor the job status. Here is sample output for a successfully completed job:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: example-job-mn
  ...
status:
  completionTime: 2018-09-01T16:42:35Z
  conditions:
  - lastProbeTime: 2018-09-01T16:42:35Z
    lastTransitionTime: 2018-09-01T16:42:35Z
    status: "True"
    type: Complete
  startTime: 2018-09-01T16:34:04Z
  succeeded: 1
```
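For scripting, completion can be detected by checking the `conditions` list for a `Complete` entry. Here is a minimal sketch, using a stand-in string in place of the live `kubectl get -o yaml` output:

```shell
# Stand-in for: kubectl get -o yaml chainerjobs example-job-mn
status_yaml='conditions:
- status: "True"
  type: Complete'

# Succeeds only if a Complete condition appears in the status.
if echo "$status_yaml" | grep -q 'type: Complete'; then
  echo "example-job-mn finished"
fi
```

A more robust live check could use `kubectl get -o jsonpath=...` to extract the condition fields directly instead of grepping YAML text.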
[Chainer](https://github.com/kubeflow/chainer-operator) is not supported in Kubeflow versions greater than 0.6.
