
Problem with convergence during distributed training #34

Closed
hjchen2 opened this issue Jan 9, 2017 · 3 comments

Comments

@hjchen2

hjchen2 commented Jan 9, 2017

Is there a known problem with distributed training? I tested ResNet-50 on 1 node and on 10 nodes, and the 10-node run fails to converge while the 1-node run is normal. Below are some logs from the 10-node run.

```
Mon Jan  9 09:02:16 2017[1,9]<stderr>:I0109 09:02:16.037472 25409 solver.cpp:288] [9] Iteration 9000, loss = 0.0017091
Mon Jan  9 09:02:16 2017[1,9]<stderr>:I0109 09:02:16.037565 25409 solver.cpp:309]     Train net output #0: loss = 0.00170914 (* 1 = 0.00170914 loss)
Mon Jan  9 09:02:16 2017[1,3]<stderr>:I0109 09:02:16.311161 20303 solver.cpp:288] [3] Iteration 9000, loss = 0.00197356
Mon Jan  9 09:02:16 2017[1,3]<stderr>:I0109 09:02:16.311249 20303 solver.cpp:309]     Train net output #0: loss = 0.00197353 (* 1 = 0.00197353 loss)
Mon Jan  9 09:02:19 2017[1,4]<stderr>:I0109 09:02:19.016288 23385 solver.cpp:288] [4] Iteration 9000, loss = 0.00176254
Mon Jan  9 09:02:19 2017[1,4]<stderr>:I0109 09:02:19.016391 23385 solver.cpp:309]     Train net output #0: loss = 0.00176255 (* 1 = 0.00176255 loss)
Mon Jan  9 09:02:19 2017[1,1]<stderr>:I0109 09:02:19.080947  9519 solver.cpp:288] [1] Iteration 9000, loss = 0.00233511
Mon Jan  9 09:02:19 2017[1,1]<stderr>:I0109 09:02:19.081048  9519 solver.cpp:309]     Train net output #0: loss = 0.00233503 (* 1 = 0.00233503 loss)
Mon Jan  9 09:02:19 2017[1,8]<stderr>:I0109 09:02:19.323590 17418 solver.cpp:288] [8] Iteration 9000, loss = 0.00130848
Mon Jan  9 09:02:19 2017[1,8]<stderr>:I0109 09:02:19.323689 17418 solver.cpp:309]     Train net output #0: loss = 0.00130845 (* 1 = 0.00130845 loss)
Mon Jan  9 09:02:19 2017[1,7]<stderr>:I0109 09:02:19.499922 29108 solver.cpp:288] [7] Iteration 9000, loss = 0.00123265
Mon Jan  9 09:02:19 2017[1,7]<stderr>:I0109 09:02:19.500016 29108 solver.cpp:309]     Train net output #0: loss = 0.00123263 (* 1 = 0.00123263 loss)
Mon Jan  9 09:02:19 2017[1,2]<stderr>:I0109 09:02:19.722164  4260 solver.cpp:288] [2] Iteration 9000, loss = 0.00176037
Mon Jan  9 09:02:19 2017[1,2]<stderr>:I0109 09:02:19.722316  4260 solver.cpp:309]     Train net output #0: loss = 0.00176033 (* 1 = 0.00176033 loss)
Mon Jan  9 09:10:37 2017[1,0]<stderr>:I0109 09:10:37.803467 19982 solver.cpp:479]     Test net output #0: top-1 = 0.016
Mon Jan  9 09:10:37 2017[1,0]<stderr>:I0109 09:10:37.803689 19982 solver.cpp:479]     Test net output #1: top-5 = 0.0832
Mon Jan  9 09:10:37 2017[1,0]<stderr>:I0109 09:10:37.803788 19982 solver.cpp:529] Snapshotting to binary proto file ./output/resnet50_step1_iter_9000.caffemodel
Mon Jan  9 09:10:38 2017[1,0]<stderr>:I0109 09:10:38.504742 19982 sgd_solver.cpp:344] Snapshotting solver state to binary proto file ./output/resnet50_step1_iter_9000.solverstate
```
@pnoga
Contributor

pnoga commented Jan 9, 2017

Hi,
Which configuration did you use in the prototxt files? Did you try the same configuration on 1 and 10 nodes, or did you modify the learning rate or batch size?
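
For context, the settings being asked about live in the Caffe definition files: the learning rate in solver.prototxt (base_lr) and the per-node batch size in the data layer of the network prototxt. A minimal sketch with purely illustrative values, not the reporter's actual configuration:

```
# solver.prototxt -- illustrative values only, not from this issue
net: "train_val.prototxt"
base_lr: 0.1            # learning rate asked about above
lr_policy: "multistep"
momentum: 0.9
weight_decay: 0.0001

# train_val.prototxt, data layer -- where the per-node batch size is set
# layer {
#   name: "data"
#   type: "Data"
#   data_param { batch_size: 32 backend: LMDB }
# }
```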

@hjchen2
Author

hjchen2 commented Jan 10, 2017

@pnoga
Hi, thank you for your reply. I didn't modify the learning rate or batch size. Everything is the same except the number of nodes.

@hjchen2
Author

hjchen2 commented Jan 23, 2017

@pnoga It converges now: I reduced the learning rate by a factor of 10 for distributed training on 10 nodes. Thanks.
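
For anyone hitting the same issue, a rough sketch of the change described above in solver.prototxt, assuming a hypothetical single-node base_lr of 0.1 (the actual values are not given in this thread):

```
# solver.prototxt -- hypothetical values for illustration
# base_lr: 0.1        # assumed single-node learning rate
base_lr: 0.01         # divided by 10 for the 10-node run, as described above
```

Because the effective global batch size typically grows with the number of data-parallel nodes, the learning rate often has to be retuned when the node count changes.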

@hjchen2 hjchen2 closed this as completed Jan 23, 2017