
Problem with convergence during distributed training #34

Closed
hjchen2 opened this issue Jan 9, 2017 · 3 comments

Comments

@hjchen2

hjchen2 commented Jan 9, 2017

Is there a known problem with distributed training? I tested ResNet-50 on 1 node and on 10 nodes, and the 10-node run fails to converge while the 1-node run is normal. Below are some logs from the 10-node run.

```
Mon Jan  9 09:02:16 2017[1,9]<stderr>:I0109 09:02:16.037472 25409 solver.cpp:288] [9] Iteration 9000, loss = 0.0017091
Mon Jan  9 09:02:16 2017[1,9]<stderr>:I0109 09:02:16.037565 25409 solver.cpp:309]     Train net output #0: loss = 0.00170914 (* 1 = 0.00170914 loss)
Mon Jan  9 09:02:16 2017[1,3]<stderr>:I0109 09:02:16.311161 20303 solver.cpp:288] [3] Iteration 9000, loss = 0.00197356
Mon Jan  9 09:02:16 2017[1,3]<stderr>:I0109 09:02:16.311249 20303 solver.cpp:309]     Train net output #0: loss = 0.00197353 (* 1 = 0.00197353 loss)
Mon Jan  9 09:02:19 2017[1,4]<stderr>:I0109 09:02:19.016288 23385 solver.cpp:288] [4] Iteration 9000, loss = 0.00176254
Mon Jan  9 09:02:19 2017[1,4]<stderr>:I0109 09:02:19.016391 23385 solver.cpp:309]     Train net output #0: loss = 0.00176255 (* 1 = 0.00176255 loss)
Mon Jan  9 09:02:19 2017[1,1]<stderr>:I0109 09:02:19.080947  9519 solver.cpp:288] [1] Iteration 9000, loss = 0.00233511
Mon Jan  9 09:02:19 2017[1,1]<stderr>:I0109 09:02:19.081048  9519 solver.cpp:309]     Train net output #0: loss = 0.00233503 (* 1 = 0.00233503 loss)
Mon Jan  9 09:02:19 2017[1,8]<stderr>:I0109 09:02:19.323590 17418 solver.cpp:288] [8] Iteration 9000, loss = 0.00130848
Mon Jan  9 09:02:19 2017[1,8]<stderr>:I0109 09:02:19.323689 17418 solver.cpp:309]     Train net output #0: loss = 0.00130845 (* 1 = 0.00130845 loss)
Mon Jan  9 09:02:19 2017[1,7]<stderr>:I0109 09:02:19.499922 29108 solver.cpp:288] [7] Iteration 9000, loss = 0.00123265
Mon Jan  9 09:02:19 2017[1,7]<stderr>:I0109 09:02:19.500016 29108 solver.cpp:309]     Train net output #0: loss = 0.00123263 (* 1 = 0.00123263 loss)
Mon Jan  9 09:02:19 2017[1,2]<stderr>:I0109 09:02:19.722164  4260 solver.cpp:288] [2] Iteration 9000, loss = 0.00176037
Mon Jan  9 09:02:19 2017[1,2]<stderr>:I0109 09:02:19.722316  4260 solver.cpp:309]     Train net output #0: loss = 0.00176033 (* 1 = 0.00176033 loss)
Mon Jan  9 09:10:37 2017[1,0]<stderr>:I0109 09:10:37.803467 19982 solver.cpp:479]     Test net output #0: top-1 = 0.016
Mon Jan  9 09:10:37 2017[1,0]<stderr>:I0109 09:10:37.803689 19982 solver.cpp:479]     Test net output #1: top-5 = 0.0832
Mon Jan  9 09:10:37 2017[1,0]<stderr>:I0109 09:10:37.803788 19982 solver.cpp:529] Snapshotting to binary proto file ./output/resnet50_step1_iter_9000.caffemodel
Mon Jan  9 09:10:38 2017[1,0]<stderr>:I0109 09:10:38.504742 19982 sgd_solver.cpp:344] Snapshotting solver state to binary proto file ./output/resnet50_step1_iter_9000.solverstate
```
@pnoga
Contributor

pnoga commented Jan 9, 2017

Hi,
Which configuration did you use in the prototxt files? Did you try the same configuration on 1 and 10 nodes, or did you modify the learning rate or batch size?
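
For context, the settings being asked about live in the Caffe definition files: the learning rate in solver.prototxt (base_lr) and the per-node batch size in the data layer of the network prototxt. A minimal sketch with purely illustrative values, not the reporter's actual configuration:

```
# solver.prototxt -- illustrative values only, not from this issue
net: "train_val.prototxt"
base_lr: 0.1            # learning rate asked about above
lr_policy: "multistep"
momentum: 0.9
weight_decay: 0.0001

# train_val.prototxt, data layer -- where the per-node batch size is set
# layer {
#   name: "data"
#   type: "Data"
#   data_param { batch_size: 32 backend: LMDB }
# }
```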

@hjchen2
Author

hjchen2 commented Jan 10, 2017

@pnoga
Hi, thank you for your reply. I didn't modify the learning rate or batch size. Everything is the same except the number of nodes.

@hjchen2
Author

hjchen2 commented Jan 23, 2017

@pnoga It converges now: I reduced the learning rate by a factor of 10 for distributed training on 10 nodes. Thanks.
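
For anyone hitting the same issue, a rough sketch of the change described above in solver.prototxt, assuming a hypothetical single-node base_lr of 0.1 (the actual values are not given in this thread):

```
# solver.prototxt -- hypothetical values for illustration
# base_lr: 0.1        # assumed single-node learning rate
base_lr: 0.01         # divided by 10 for the 10-node run, as described above
```

Because the effective global batch size typically grows with the number of data-parallel nodes, the learning rate often has to be retuned when the node count changes.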

@hjchen2 hjchen2 closed this as completed Jan 23, 2017