Some error about communication #43
Comments
I have the same issue, have you solved this problem?
I have the same issue.
I have the same issue, too.
Sorry, I don't remember it. 😶
When I use 8 GPUs to train ImageNet with VGG16, an error occurs in pipedream/runtime/communication.py at line 235: "assert forward_num_iterations % self.num_ranks_in_next_stage == 0".
Here "forward_num_iterations" is 10009 and "num_ranks_in_next_stage" is 2, where forward_num_iterations = (total number of training samples) / (batch size) / (number of stages).
When I comment out this line and line 242 ("assert backward_num_iterations % self.num_ranks_in_previous_stage == 0"), the program runs successfully for one epoch, but then fails with:
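The failing check and the numbers above can be sketched as follows. This is only an illustration of the arithmetic, not PipeDream's own fix; the truncation workaround at the end is an assumption on my part (dropping the last few iterations so every rank in the next stage receives the same number of tensors):

```python
# Values taken from the report: 10009 per-stage iterations, 2 ranks
# in the next stage.
forward_num_iterations = 10009
num_ranks_in_next_stage = 2

# This is the condition asserted in communication.py; 10009 is odd,
# so it fails.
print(forward_num_iterations % num_ranks_in_next_stage == 0)  # False

# Hypothetical workaround (an assumption, not the project's own fix):
# truncate the iteration count to the nearest multiple, so the tensors
# sent to the next stage divide evenly across its ranks.
truncated = forward_num_iterations - (
    forward_num_iterations % num_ranks_in_next_stage)
print(truncated)  # 10008
```

Equivalently, choosing a batch size (or dropping the last partial batch) so that the per-stage iteration count is divisible by the fan-out of each stage would satisfy the assert without editing the code.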
Traceback (most recent call last):
  File "main_with_runtime.py", line 617, in <module>
    main()
  File "main_with_runtime.py", line 321, in main
    train(train_loader, r, optimizer, epoch)
  File "main_with_runtime.py", line 442, in train
    r.run_backward()
  File "../runtime.py", line 650, in run_backward
    for output_name in outputs]))
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 479, in distributed_data_parallel_hook
    self._sync_reduction_works()
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 501, in _sync_reduction_works
    self.buckets_coalesced[bucket_idx])
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:72] Timed out waiting 1800000ms for recv operation to complete
What should I do to solve this problem? Any answer will be helpful. Thank you very much!
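One observation, hedged since I can't reproduce the setup: commenting out the two asserts likely leaves adjacent stages disagreeing on how many tensors to send and receive, so one rank eventually blocks on a recv that never arrives and Gloo's 1800000 ms (30-minute) default timeout fires. Raising the timeout only delays the hang; the divisibility condition itself needs to hold. Still, for genuinely slow collectives, PyTorch lets you pass a longer timeout when the process group is initialized. A minimal sketch, with the `init_process_group` call commented out because its backend and rendezvous arguments depend on your launch script:

```python
# The traceback reports Gloo timing out after 1800000 ms; that value is
# exactly the default 30-minute collective timeout.
import datetime

default_timeout = datetime.timedelta(milliseconds=1800000)
print(default_timeout == datetime.timedelta(minutes=30))  # True

# A longer timeout can be supplied at process-group creation
# (call site is illustrative; fill in your own backend/init args):
# torch.distributed.init_process_group(
#     backend="gloo",
#     timeout=datetime.timedelta(hours=2),
# )
```

If the hang persists even with the asserts restored and an evenly divisible iteration count, checking that every rank performs the same number of forward/backward passes per epoch would be the next step.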