K8s Custom Resource and Operator For Chainer/ChainerMN jobs

Experimental repo notice: This repository is experimental and currently only serves as a proof of concept for running distributed training with Chainer/ChainerMN on Kubernetes.

Overview

ChainerJob provides a Kubernetes custom resource that makes it easy to run distributed or non-distributed Chainer jobs on Kubernetes.

Using a Custom Resource Definition (CRD) gives users the ability to create and manage ChainerJobs just like built-in K8s resources. For example, to create a job:

$ kubectl create -f examples/chainerjob.yaml
chainerjob.kubeflow.org "example-job" created

To list ChainerJobs:

$ kubectl get chainerjobs
NAME          AGE
example-job   12s
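Because ChainerJob behaves like a built-in resource, the other standard kubectl verbs work as well, for example:

```shell
# Show detailed state and events for the job
kubectl describe chainerjob example-job

# Delete the job together with the resources it owns
kubectl delete chainerjob example-job
```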

Installing the ChainerJob CRD and its operator on your k8s cluster

kubectl create -f deploy/

This will create:

  • ChainerJob Custom Resource Definition (CRD)
  • chainer-operator namespace
  • RBAC related resources
    • ServiceAccount
    • ClusterRole
      • please see 2-rbac.yaml for the detailed set of authorized operations
    • ClusterRoleBinding
  • Deployment for the chainer-operator
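A quick way to verify the installation is to check for the CRD and the operator pod. Note the CRD name chainerjobs.kubeflow.org is an assumption inferred from the API group used in the examples:

```shell
# The CRD should be registered
kubectl get crd chainerjobs.kubeflow.org

# The operator pod should be running in its namespace
kubectl get pods -n chainer-operator
```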

Creating a ChainerJob

Once the ChainerJob CRD is defined and the operator is up, you can create a job by defining a ChainerJob custom resource.

kubectl create -f examples/chainerjob-mn.yaml

In this case the job spec looks like this:

apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: example-job-mn
spec:
  backend: mpi
  master:
    template:
      spec:
        containers:
        - name: chainer
          image: everpeace/chainermn:1.3.0
          command:
          - sh
          - -c
          - |
            mpiexec -n 3 -N 1 --allow-run-as-root --display-map  --mca mpi_cuda_support 0 \
            python3 /train_mnist.py -e 2 -b 1000 -u 100
  workerSets:
    ws0:
      replicas: 2
      template:
        spec:
          containers:
          - name: chainer
            image: everpeace/chainermn:1.3.0
            command:
            - sh
            - -c
            - |
              while true; do sleep 1 & wait; done

A ChainerJob consists of a master and workers.

master

  • A ChainerJob must have exactly one master.
  • master is a pod (technically a Job) that bootstraps your entire distributed job.
  • The pod must contain a container named chainer.
  • master is restarted automatically when it fails. You can customize the retry behavior with activeDeadlineSeconds/backoffLimit. Please see examples/chainerjob-reference.yaml for details.
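For illustration, here is a sketch of how those retry knobs might appear in a spec. The exact field placement is defined by examples/chainerjob-reference.yaml, so treat this fragment as an assumption rather than the authoritative schema:

```yaml
spec:
  backend: mpi
  master:
    activeDeadlineSeconds: 600  # assumed placement: fail the master after 10 minutes
    backoffLimit: 3             # assumed placement: retry the master at most 3 times
    template:
      ...
```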

workerSets

  • A WorkerSet is a group of pods (technically a StatefulSet) with a homogeneous configuration.
  • You can define multiple WorkerSets to create heterogeneous workers.
  • A ChainerJob can have 0 or more WorkerSets.
  • Each WorkerSet can have 1 or more workers.
  • Workers are restarted automatically if they exit.

backend

  • backend defines how to initiate process groups and exchange tensor data among the processes.
  • The only currently supported backend is mpi.

backend: mpi

  • the operator automatically sets up the MPI environment across the master and workerSets
  • a hostfile and the required configuration are generated automatically
  • the slots= clause in the hostfile is configurable. Please see examples/chainerjob-reference.yaml for details.
    • The default value is the number of GPUs requested in the container named chainer, or 1 if no GPUs are requested.
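Given that defaulting rule, the generated hostfile for the example-job-mn job above (one master, two CPU-only ws0 workers) would look roughly like the following. The exact hostnames are managed by the operator, so this is only a sketch:

```
example-job-mn-workerset-ws0-0 slots=1
example-job-mn-workerset-ws0-1 slots=1
example-job-mn-master slots=1
```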

Using GPUs

Kubernetes supports scheduling GPUs (see the instructions for GKE).

Once you have a GPU-equipped cluster, you can attach the nvidia.com/gpu resource to your ChainerJob definition like this:

apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: example-job-mn
spec:
  backend: mpi
  master:
    template:
      spec:
        containers:
        - name: chainer
          image: everpeace/chainermn:1.3.0
          resources:
            limits:
              nvidia.com/gpu: 1
      ...

Follow Chainer's official instructions for using GPUs in Chainer.
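Before submitting a GPU job, you can confirm that your cluster actually exposes GPUs by checking the node capacity:

```shell
# Nodes with GPUs advertise the nvidia.com/gpu resource
kubectl describe nodes | grep -i 'nvidia.com/gpu'
```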

Monitoring your Job

To get the status of your ChainerJob:

$ kubectl get chainerjobs $JOB_NAME -o yaml

apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
...
status:
  completionTime: 2018-06-13T02:13:47Z
  conditions:
  - lastProbeTime: 2018-06-13T02:13:47Z
    lastTransitionTime: 2018-06-13T02:13:47Z
    status: "True"
    type: Complete
  startTime: 2018-06-13T02:04:47Z
  succeeded: 1
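If you want to poll just the completion state from a script, a jsonpath query over the conditions shown above works:

```shell
kubectl get chainerjobs example-job-mn \
  -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}'
# prints "True" once the job has completed
```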

You can also list all the resources belonging to a ChainerJob by using the label chainerjob.kubeflow.org/name.

$ kubectl get all -l chainerjob.kubeflow.org/name=example-job-mn

NAME                                 READY     STATUS    RESTARTS   AGE
pod/example-job-mn-master-jm9qw      1/1       Running   0          1m
pod/example-job-mn-workerset-ws0-0   1/1       Running   0          1m
pod/example-job-mn-workerset-ws0-1   1/1       Running   0          1m

NAME                                            DESIRED   CURRENT   AGE
statefulset.apps/example-job-mn-workerset-ws0   2         2         1m

NAME                              DESIRED   SUCCESSFUL   AGE
job.batch/example-job-mn-master   1         0            1m

Accessing the logs

Once you have the names of the pods belonging to a ChainerJob, you can inspect their logs in the standard ways.

$ kubectl logs example-job-mn-master-jm9qw
Data for JOB [41689,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: example-job-mn-master-8qvk2     Num slots: 1    Max slots: 0    Num procs: 1
        Process OMPI jobid: [41689,1] App: 0 Process rank: 0 Bound: UNBOUND

 Data for node: example-job-mn-workerset-ws0-0  Num slots: 1    Max slots: 0    Num procs: 1
        Process OMPI jobid: [41689,1] App: 0 Process rank: 1 Bound: UNBOUND

 Data for node: example-job-mn-workerset-ws0-1  Num slots: 1    Max slots: 0    Num procs: 1
        Process OMPI jobid: [41689,1] App: 0 Process rank: 2 Bound: UNBOUND

 =============================================================
Warning: using naive communicator because only naive supports CPU-only execution
Warning: using naive communicator because only naive supports CPU-only execution
Warning: using naive communicator because only naive supports CPU-only execution
==========================================
Num process (COMM_WORLD): 3
Using hierarchical communicator
Num unit: 100
Num Minibatch-size: 1000
Num epoch: 2
==========================================
epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time
1           1.68413     0.87129               0.5325         0.807938                  10.3654
2           0.58754     0.403208              0.8483         0.884564                  16.4705
...
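The chainerjob.kubeflow.org/name label also makes it easy to tail logs while the job is running; both forms below use only standard kubectl features:

```shell
# Follow the master's logs (kubectl resolves the job's pod)
kubectl logs -f job/example-job-mn-master

# Fetch recent logs from every pod belonging to the job
kubectl logs -l chainerjob.kubeflow.org/name=example-job-mn --tail=20
```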