Some error about communication #43

Closed
jglicat opened this issue Feb 12, 2020 · 4 comments

Comments

@jglicat

jglicat commented Feb 12, 2020

When I use 8 GPUs to train ImageNet with VGG16, an error occurs at pipedream/runtime/communication.py, line 235: "assert forward_num_iterations % self.num_ranks_in_next_stage == 0".
Here "forward_num_iterations" is 10009 and "num_ranks_in_next_stage" is 2, where forward_num_iterations = (total size of the training set) / (batch size) / (number of stages).
When I comment out this line and line 242, "assert backward_num_iterations % self.num_ranks_in_previous_stage == 0", the program runs successfully for one epoch, and then the following error occurs:
Traceback (most recent call last):
  File "main_with_runtime.py", line 617, in <module>
    main()
  File "main_with_runtime.py", line 321, in main
    train(train_loader, r, optimizer, epoch)
  File "main_with_runtime.py", line 442, in train
    r.run_backward()
  File "../runtime.py", line 650, in run_backward
    for output_name in outputs]))
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 479, in distributed_data_parallel_hook
    self._sync_reduction_works()
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 501, in _sync_reduction_works
    self.buckets_coalesced[bucket_idx])
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:72] Timed out waiting 1800000ms for recv operation to complete
What should I do to solve this problem? Any answer will be helpful. Thank you very much!
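
For reference, here is a minimal sketch (not PipeDream's actual code; the dataset size, batch size, and stage count are assumptions that happen to reproduce 10009) of why the assert fails, plus one possible workaround of trimming the iteration count instead of deleting the assert:

```python
# Minimal sketch of the divisibility check. The numbers below are assumptions
# chosen to reproduce forward_num_iterations == 10009; they are not taken
# from PipeDream's configuration.

def per_stage_iterations(dataset_size, batch_size, num_stages):
    # forward_num_iterations = (training set size) / (batch size) / (number of stages)
    return dataset_size // batch_size // num_stages

forward_num_iterations = per_stage_iterations(1_281_167, 16, 8)  # -> 10009
num_ranks_in_next_stage = 2

print(forward_num_iterations % num_ranks_in_next_stage)  # 1, so the assert fails

# Instead of removing the assert, one option is to round the iteration count
# down to a multiple of the next stage's rank count (dropping a few samples
# per epoch) so every rank issues the same number of sends/receives.
forward_num_iterations -= forward_num_iterations % num_ranks_in_next_stage
assert forward_num_iterations % num_ranks_in_next_stage == 0  # now passes
```

Dropping the last few iterations loses a tiny amount of data per epoch, but it keeps the number of sends and receives balanced across ranks, which is presumably what the assert protects; simply commenting it out can leave a rank waiting on a receive that never arrives, which would be consistent with the Gloo recv timeout above.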

@HumanHighlight-IC

I have the same issue. Have you solved this problem?
Thank you.

@Q1Shane

Q1Shane commented Apr 21, 2021

I have the same issue.
Have you solved this problem?

@Hyaloid

Hyaloid commented Jan 2, 2024

I have the same issue, too.
Have you solved this problem?

@jglicat
Author

jglicat commented Jan 2, 2024

I have the same issue. Have you solved this problem?

Sorry, I don't remember it. 😶

jglicat closed this as not planned on Jan 2, 2024