
Commit 99c27f6

jspenc72 authored and sarahmaddox committed
Remove chainer docs (#1516)
* Update chainer.md
* Update chainer.md
* Update chainer.md
* Added brevity and links to Kubeflow 0.6 docs
* Added specific link to Chainer page

Co-authored-by: Sarah Maddox <sarahmaddox@users.noreply.github.com>
1 parent c7c3b01 commit 99c27f6

1 file changed

Lines changed: 2 additions & 142 deletions

@@ -1,148 +1,8 @@
 +++
 title = "Chainer Training"
-description = "Instructions for using Chainer for training"
+description = "See Kubeflow [v0.6 docs](https://v0-6.kubeflow.org/docs/components/training/chainer/) for instructions on using Chainer for training"
 weight = 4
 toc = true
 +++
 
-This guide walks you through using Chainer for training your model.
-
-## What is Chainer?
-
-[Chainer](https://chainer.org/) is a powerful, flexible and intuitive deep learning framework.
-
-- Chainer supports CUDA computation. It only requires a few lines of code to leverage a GPU. It also runs on multiple GPUs with little effort.
-- Chainer supports various network architectures including feed-forward nets, convnets, recurrent nets and recursive nets. It also supports per-batch architectures.
-- Forward computation can include any control flow statements of Python without lacking the ability of backpropagation. It makes code intuitive and easy to debug.
-
-[ChainerMN](https://github.com/chainer/chainermn) is an additional package for Chainer, a flexible deep learning framework. ChainerMN enables multi-node distributed deep learning with the following features:
-
-- Scalable --- it makes full use of the latest technologies such as NVIDIA NCCL and CUDA-Aware MPI,
-- Flexible --- even dynamic neural networks can be trained in parallel thanks to Chainer's flexibility, and
-- Easy --- minimal changes to existing user code are required.
-
-[This blog post](https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html) provides a benchmark results using up to 128 GPUs.
-
-## Installing Chainer Operator
-
-If you haven't already done so please follow the [Getting Started Guide](/docs/started/getting-started/) to deploy Kubeflow.
-
-An **alpha** version of [Chainer](https://chainer.org/) support was introduced with Kubeflow 0.3.0. You must be using a version of Kubeflow newer than 0.3.0.
-
-## Verify that Chainer support is included in your Kubeflow deployment
-
-_This section has not yet been converted to kustomize, please refer to [kubeflow/manifests/issues/232](https://github.com/kubeflow/manifests/issues/232)._
-
-Check that the Chainer Job custom resource is installed
-
-```shell
-kubectl get crd
-```
-
-The output should include `chainerjobs.kubeflow.org`
-
-```
-NAME AGE
-...
-chainerjobs.kubeflow.org 4d
-...
-```
-
-If it is not included you can add it as follows
-
-```shells
-cd ${KSONNET_APP}
-ks pkg install kubeflow/chainer-job
-ks generate chainer-operator chainer-operator
-ks apply ${ENVIRONMENT} -c chainer-operator
-```
-
-## Creating a Chainer Job
-
-You can create an Chainer Job by defining an ChainerJob config file. First, please create a file `example-job-mn.yaml` like below:
-
-```yaml
-apiVersion: kubeflow.org/v1alpha1
-kind: ChainerJob
-metadata:
-  name: example-job-mn
-spec:
-  backend: mpi
-  master:
-    mpiConfig:
-      slots: 1
-    activeDeadlineSeconds: 6000
-    backoffLimit: 60
-    template:
-      spec:
-        containers:
-        - name: chainer
-          image: everpeace/chainermn:1.3.0
-          command:
-          - sh
-          - -c
-          - |
-            mpiexec -n 3 -N 1 --allow-run-as-root --display-map --mca mpi_cuda_support 0 \
-            python3 /train_mnist.py -e 2 -b 1000 -u 100
-  workerSets:
-    ws0:
-      replicas: 2
-      mpiConfig:
-        slots: 1
-      template:
-        spec:
-          containers:
-          - name: chainer
-            image: everpeace/chainermn:1.3.0
-            command:
-            - sh
-            - -c
-            - |
-              while true; do sleep 1 & wait; done
-```
-
-See [examples/chainerjob-reference.yaml](https://github.com/kubeflow/chainer-operator/blob/master/examples/chainerjob-reference.yaml) for definitions of each attributes. You may change the config file based on your requirements. By default, the example job is distributed learning with 3 nodes (1 master, 2 workers).
-
-Deploy the ChainerJob resource to start training:
-
-```shell
-kubectl create -f example-job-mn.yaml
-```
-
-You should now be able to see the created pods which consist of the chainer job.
-
-```
-kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn
-```
-
-The training should run only for 2 epochs and takes within a few minutes even on cpu only cluster. Logs can be inspected to see its training progress.
-
-```
-PODNAME=$(kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn,chainerjob.kubeflow.org/role=master -o name)
-kubectl logs -f ${PODNAME}
-```
-
-## Monitoring an Chainer Job
-
-```shell
-kubectl get -o yaml chainerjobs example-job-mn
-```
-
-See the status section to monitor the job status. Here is sample output when the job is successfully completed.
-
-```yaml
-apiVersion: kubeflow.org/v1alpha1
-kind: ChainerJob
-metadata:
-  name: example-job-mn
-...
-status:
-  completionTime: 2018-09-01T16:42:35Z
-  conditions:
-  - lastProbeTime: 2018-09-01T16:42:35Z
-    lastTransitionTime: 2018-09-01T16:42:35Z
-    status: "True"
-    type: Complete
-  startTime: 2018-09-01T16:34:04Z
-  succeeded: 1
-```
+[Chainer](https://github.com/kubeflow/chainer-operator) is not supported in Kubeflow versions greater than 0.6.
