Is there any known problem with distributed training? I have tested ResNet-50 on 1 node and on 10 nodes, and I observed that the 10-node run does not converge, while the 1-node run is normal. Below are some logs from the 10-node case:
```
Mon Jan 9 09:02:16 2017[1,9]<stderr>:I0109 09:02:16.037472 25409 solver.cpp:288] [9] Iteration 9000, loss = 0.0017091
Mon Jan 9 09:02:16 2017[1,9]<stderr>:I0109 09:02:16.037565 25409 solver.cpp:309] Train net output #0: loss = 0.00170914 (* 1 = 0.00170914 loss)
Mon Jan 9 09:02:16 2017[1,3]<stderr>:I0109 09:02:16.311161 20303 solver.cpp:288] [3] Iteration 9000, loss = 0.00197356
Mon Jan 9 09:02:16 2017[1,3]<stderr>:I0109 09:02:16.311249 20303 solver.cpp:309] Train net output #0: loss = 0.00197353 (* 1 = 0.00197353 loss)
Mon Jan 9 09:02:19 2017[1,4]<stderr>:I0109 09:02:19.016288 23385 solver.cpp:288] [4] Iteration 9000, loss = 0.00176254
Mon Jan 9 09:02:19 2017[1,4]<stderr>:I0109 09:02:19.016391 23385 solver.cpp:309] Train net output #0: loss = 0.00176255 (* 1 = 0.00176255 loss)
Mon Jan 9 09:02:19 2017[1,1]<stderr>:I0109 09:02:19.080947  9519 solver.cpp:288] [1] Iteration 9000, loss = 0.00233511
Mon Jan 9 09:02:19 2017[1,1]<stderr>:I0109 09:02:19.081048  9519 solver.cpp:309] Train net output #0: loss = 0.00233503 (* 1 = 0.00233503 loss)
Mon Jan 9 09:02:19 2017[1,8]<stderr>:I0109 09:02:19.323590 17418 solver.cpp:288] [8] Iteration 9000, loss = 0.00130848
Mon Jan 9 09:02:19 2017[1,8]<stderr>:I0109 09:02:19.323689 17418 solver.cpp:309] Train net output #0: loss = 0.00130845 (* 1 = 0.00130845 loss)
Mon Jan 9 09:02:19 2017[1,7]<stderr>:I0109 09:02:19.499922 29108 solver.cpp:288] [7] Iteration 9000, loss = 0.00123265
Mon Jan 9 09:02:19 2017[1,7]<stderr>:I0109 09:02:19.500016 29108 solver.cpp:309] Train net output #0: loss = 0.00123263 (* 1 = 0.00123263 loss)
Mon Jan 9 09:02:19 2017[1,2]<stderr>:I0109 09:02:19.722164  4260 solver.cpp:288] [2] Iteration 9000, loss = 0.00176037
Mon Jan 9 09:02:19 2017[1,2]<stderr>:I0109 09:02:19.722316  4260 solver.cpp:309] Train net output #0: loss = 0.00176033 (* 1 = 0.00176033 loss)
Mon Jan 9 09:10:37 2017[1,0]<stderr>:I0109 09:10:37.803467 19982 solver.cpp:479] Test net output #0: top-1 = 0.016
Mon Jan 9 09:10:37 2017[1,0]<stderr>:I0109 09:10:37.803689 19982 solver.cpp:479] Test net output #1: top-5 = 0.0832
Mon Jan 9 09:10:37 2017[1,0]<stderr>:I0109 09:10:37.803788 19982 solver.cpp:529] Snapshotting to binary proto file ./output/resnet50_step1_iter_9000.caffemodel
Mon Jan 9 09:10:38 2017[1,0]<stderr>:I0109 09:10:38.504742 19982 sgd_solver.cpp:344] Snapshotting solver state to binary proto file ./output/resnet50_step1_iter_9000.solverstate
```
Hi,
What configuration did you use in the prototxt file? Did you try the same configuration on both 1 and 10 nodes, or did you modify the learning rate or batch size?
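For context, below is a minimal single-node solver.prototxt sketch of the kind being asked about. All values are illustrative assumptions (only the snapshot prefix is taken from the logs above); the point is that with 10 data-parallel nodes the effective global batch size is roughly 10x the single-node one, so fields such as base_lr, max_iter, and stepvalue usually need to be rescaled rather than reused unchanged.

```
# Hypothetical solver.prototxt -- values are illustrative, not from this issue
net: "models/resnet50/train_val.prototxt"
test_iter: 1000
test_interval: 3000
base_lr: 0.1          # with 10 data-parallel nodes the effective batch is ~10x larger;
                      # the linear-scaling rule would suggest raising this accordingly
lr_policy: "multistep"
stepvalue: 150000
stepvalue: 300000
gamma: 0.1
momentum: 0.9
weight_decay: 0.0001
max_iter: 450000      # each iteration now consumes ~10x more images, so the iteration
                      # budget and step boundaries may also need rescaling
snapshot: 3000
snapshot_prefix: "./output/resnet50_step1"
```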