1 | 1 | +++ |
2 | 2 | title = "Chainer Training" |
3 | | -description = "Instructions for using Chainer for training" |
| 3 | +description = "See Kubeflow [v0.6 docs](https://v0-6.kubeflow.org/docs/components/training/chainer/) for instructions on using Chainer for training" |
4 | 4 | weight = 4 |
5 | 5 | toc = true |
6 | 6 | +++ |
7 | 7 | |
8 | | -This guide walks you through using Chainer for training your model. |
9 | | - |
10 | | -## What is Chainer? |
11 | | - |
12 | | -[Chainer](https://chainer.org/) is a powerful, flexible and intuitive deep learning framework. |
13 | | - |
14 | | -- Chainer supports CUDA computation. It only requires a few lines of code to leverage a GPU. It also runs on multiple GPUs with little effort. |
15 | | -- Chainer supports various network architectures including feed-forward nets, convnets, recurrent nets and recursive nets. It also supports per-batch architectures. |
16 | | -- Forward computation can include any control flow statements of Python without lacking the ability of backpropagation. It makes code intuitive and easy to debug. |
17 | | - |
18 | | -[ChainerMN](https://github.com/chainer/chainermn) is an additional package for Chainer, a flexible deep learning framework. ChainerMN enables multi-node distributed deep learning with the following features: |
19 | | - |
20 | | -- Scalable --- it makes full use of the latest technologies such as NVIDIA NCCL and CUDA-Aware MPI, |
21 | | -- Flexible --- even dynamic neural networks can be trained in parallel thanks to Chainer's flexibility, and |
22 | | -- Easy --- minimal changes to existing user code are required. |
23 | | - |
24 | | -[This blog post](https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html) provides a benchmark results using up to 128 GPUs. |
25 | | - |
26 | | -## Installing Chainer Operator |
27 | | - |
28 | | -If you haven't already done so please follow the [Getting Started Guide](/docs/started/getting-started/) to deploy Kubeflow. |
29 | | - |
30 | | -An **alpha** version of [Chainer](https://chainer.org/) support was introduced with Kubeflow 0.3.0. You must be using a version of Kubeflow newer than 0.3.0. |
31 | | - |
32 | | -## Verify that Chainer support is included in your Kubeflow deployment |
33 | | - |
34 | | -_This section has not yet been converted to kustomize, please refer to [kubeflow/manifests/issues/232](https://github.com/kubeflow/manifests/issues/232)._ |
35 | | - |
36 | | -Check that the Chainer Job custom resource is installed |
37 | | - |
38 | | -```shell |
39 | | -kubectl get crd |
40 | | -``` |
41 | | - |
42 | | -The output should include `chainerjobs.kubeflow.org` |
43 | | - |
44 | | -``` |
45 | | -NAME AGE |
46 | | -... |
47 | | -chainerjobs.kubeflow.org 4d |
48 | | -... |
49 | | -``` |
50 | | - |
51 | | -If it is not included you can add it as follows |
52 | | - |
53 | | -```shells |
54 | | -cd ${KSONNET_APP} |
55 | | -ks pkg install kubeflow/chainer-job |
56 | | -ks generate chainer-operator chainer-operator |
57 | | -ks apply ${ENVIRONMENT} -c chainer-operator |
58 | | -``` |
59 | | - |
60 | | -## Creating a Chainer Job |
61 | | - |
62 | | -You can create an Chainer Job by defining an ChainerJob config file. First, please create a file `example-job-mn.yaml` like below: |
63 | | - |
64 | | -```yaml |
65 | | -apiVersion: kubeflow.org/v1alpha1 |
66 | | -kind: ChainerJob |
67 | | -metadata: |
68 | | -  name: example-job-mn |
69 | | -spec: |
70 | | -  backend: mpi |
71 | | -  master: |
72 | | -    mpiConfig: |
73 | | -      slots: 1 |
74 | | -    activeDeadlineSeconds: 6000 |
75 | | -    backoffLimit: 60 |
76 | | -    template: |
77 | | -      spec: |
78 | | -        containers: |
79 | | -        - name: chainer |
80 | | -          image: everpeace/chainermn:1.3.0 |
81 | | -          command: |
82 | | -          - sh |
83 | | -          - -c |
84 | | -          - | |
85 | | -            mpiexec -n 3 -N 1 --allow-run-as-root --display-map --mca mpi_cuda_support 0 \ |
86 | | -            python3 /train_mnist.py -e 2 -b 1000 -u 100 |
87 | | -  workerSets: |
88 | | -    ws0: |
89 | | -      replicas: 2 |
90 | | -      mpiConfig: |
91 | | -        slots: 1 |
92 | | -      template: |
93 | | -        spec: |
94 | | -          containers: |
95 | | -          - name: chainer |
96 | | -            image: everpeace/chainermn:1.3.0 |
97 | | -            command: |
98 | | -            - sh |
99 | | -            - -c |
100 | | -            - | |
101 | | -              while true; do sleep 1 & wait; done |
102 | | -``` |
103 | | - |
104 | | -See [examples/chainerjob-reference.yaml](https://github.com/kubeflow/chainer-operator/blob/master/examples/chainerjob-reference.yaml) for definitions of each attributes. You may change the config file based on your requirements. By default, the example job is distributed learning with 3 nodes (1 master, 2 workers). |
105 | | - |
106 | | -Deploy the ChainerJob resource to start training: |
107 | | - |
108 | | -```shell |
109 | | -kubectl create -f example-job-mn.yaml |
110 | | -``` |
111 | | - |
112 | | -You should now be able to see the created pods which consist of the chainer job. |
113 | | - |
114 | | -``` |
115 | | -kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn |
116 | | -``` |
117 | | - |
118 | | -The training should run only for 2 epochs and takes within a few minutes even on cpu only cluster. Logs can be inspected to see its training progress. |
119 | | - |
120 | | -``` |
121 | | -PODNAME=$(kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn,chainerjob.kubeflow.org/role=master -o name) |
122 | | -kubectl logs -f ${PODNAME} |
123 | | -``` |
124 | | - |
125 | | -## Monitoring an Chainer Job |
126 | | - |
127 | | -```shell |
128 | | -kubectl get -o yaml chainerjobs example-job-mn |
129 | | -``` |
130 | | - |
131 | | -See the status section to monitor the job status. Here is sample output when the job is successfully completed. |
132 | | - |
133 | | -```yaml |
134 | | -apiVersion: kubeflow.org/v1alpha1 |
135 | | -kind: ChainerJob |
136 | | -metadata: |
137 | | -  name: example-job-mn |
138 | | -... |
139 | | -status: |
140 | | -  completionTime: 2018-09-01T16:42:35Z |
141 | | -  conditions: |
142 | | -  - lastProbeTime: 2018-09-01T16:42:35Z |
143 | | -    lastTransitionTime: 2018-09-01T16:42:35Z |
144 | | -    status: "True" |
145 | | -    type: Complete |
146 | | -  startTime: 2018-09-01T16:34:04Z |
147 | | -  succeeded: 1 |
148 | | -``` |
| 8 | +[Chainer](https://github.com/kubeflow/chainer-operator) is not supported in Kubeflow versions greater than 0.6. |