This repository has been archived by the owner on Aug 5, 2022. It is now read-only.

Multinode Training with PyCaffe

Feng Zou edited this page Apr 16, 2018 · 4 revisions

Intel Caffe (release 1.1.1) supports multinode training through the PyCaffe interface. To speed up training across multiple nodes on Intel CPUs, you only need to add two lines to your Python code. Taking LeNet as an example, the single-node Python code is:

    ...
    solver = caffe.SGDSolver('examples/mnist/lenet_auto_solver.prototxt')
    ...
    for it in range(niter):
        solver.step(1)  # SGD by Caffe
        ...

To enable multinode training, add two lines that create and initialize a MultiSync object:

    ...
    solver = caffe.SGDSolver('examples/mnist/lenet_auto_solver.prototxt')
    sync = caffe.MultiSync(solver)  # added for multinode
    sync.init()                     # added for multinode
    ...
    for it in range(niter):
        solver.step(1)  # SGD by Caffe
        ...
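
The two added lines can be wrapped in a small helper so that the rest of a training script stays unchanged. The following is only a sketch: the helper name `init_multinode` is hypothetical, and the `caffe` module is passed in as an argument (an illustrative choice, so the helper can be exercised without Intel Caffe installed). The calls `SGDSolver`, `MultiSync`, and `init` are the ones shown on this page.

```python
def init_multinode(caffe_module, solver_path):
    """Create an SGD solver and attach a MultiSync object to it.

    caffe_module: the imported `caffe` package from Intel Caffe
                  (passed in explicitly; this is an assumption of the sketch).
    solver_path:  path to the solver prototxt.
    Returns the (solver, sync) pair.
    """
    solver = caffe_module.SGDSolver(solver_path)
    sync = caffe_module.MultiSync(solver)  # first line added for multinode
    sync.init()                            # second line added for multinode
    return solver, sync


# Possible usage, assuming Intel Caffe is installed:
#   import caffe
#   solver, sync = init_multinode(caffe, 'examples/mnist/lenet_auto_solver.prototxt')
```

Keeping the returned `sync` object alongside `solver` makes it available later for the overlapped training loop.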

For better performance, we recommend calling the update_and_forward, clear_param_diffs, and backward functions instead of the step function alone, so that gradient synchronization and the weight update overlap with the forward pass:

    ...
    # call step once first, because the test net shares weights with the train net
    solver.step(1)
    for it in range(niter):
        sync.solver.update_and_forward()
        sync.solver.clear_param_diffs()
        sync.solver.backward()
        solver.increment_iter()
        ...

Sample code for a single node, calling the individual functions instead of step:

    ...
    solver = caffe.SGDSolver('examples/mnist/lenet_auto_solver.prototxt')
    ...
    # call step once first, because the test net shares weights with the train net
    solver.step(1)
    for it in range(niter):
        solver.net.clear_param_diffs()
        solver.net.forward()
        solver.net.backward()
        solver.apply_update()
        solver.increment_iter()
        ...
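
The single-node and multinode loops above differ only in the calls made inside each iteration, so one script can serve both cases. The sketch below is one way to do that, under stated assumptions: the function name `train` and the optional `sync` argument are hypothetical, and the method calls are copied from the samples on this page without being checked against other Intel Caffe releases.

```python
def train(solver, niter, sync=None):
    """Run niter training iterations.

    solver: a PyCaffe solver, as created in the samples above.
    sync:   an initialized MultiSync object, or None for single node
            (this optional argument is an assumption of the sketch).
    """
    # call step once first, because the test net shares weights with the train net
    solver.step(1)
    for _ in range(niter):
        if sync is None:
            # single node: plain forward/backward, then apply the update
            solver.net.clear_param_diffs()
            solver.net.forward()
            solver.net.backward()
            solver.apply_update()
        else:
            # multinode: the update is combined with the forward pass
            sync.solver.update_and_forward()
            sync.solver.clear_param_diffs()
            sync.solver.backward()
        solver.increment_iter()
```

Calling `train(solver, niter)` runs the single-node loop, and `train(solver, niter, sync)` runs the overlapped multinode loop.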