cudaMemcpyAsync failed: invalid argument during training #404
Comments
I hit the same issue during training; the job stops when it occurs.
@ppwwyyxx, @winwinJJiang, anything in dmesg?
No, nothing was printed in dmesg on the day the error happened.
I'd recommend running the job with NCCL_DEBUG=INFO to see whether NCCL reports anything.
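(For reference, a minimal sketch of enabling this from inside the training script; NCCL reads the variable when it initializes, so it must be set before the first collective. The alternative is exporting NCCL_DEBUG=INFO in the launch environment.)

import os
# Must run before Horovod triggers NCCL initialization (the first allreduce)
os.environ.setdefault("NCCL_DEBUG", "INFO")
import horovod.torch as hvd
hvd.init()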
The job was run with NCCL_DEBUG=INFO. It only prints some normal output at the very beginning of training and nothing afterwards.
I'm afraid I don't know how to reproduce it; so far I've only seen it once. Today I saw an issue (tensorflow/tensorflow#21338) which basically says that an unchecked CUDA error in a buggy TensorFlow op may leak into other ops, making the other op appear to fail. I guess this is a possible explanation.
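(A general debugging aid for this class of problem, not something suggested in the thread itself: forcing synchronous kernel launches makes an asynchronous CUDA error surface at the op that actually caused it, rather than leaking into a later one.)

import os
# Must be set before CUDA initializes, i.e. before any torch.cuda use;
# equivalently, set it in the launch environment: CUDA_LAUNCH_BLOCKING=1 python train.py
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
import torch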
I'm also receiving this error in PyTorch, but I'm unable to find a reproduction scenario; it occurs randomly some time after starting a new epoch.
I have published a branch with debug code to narrow down the issue. @andfoy, @ppwwyyxx, @abidmalikwaterloo, could you try running it and report whether you observe any issues? The primary difference in the debug branch is that it checks CUDA errors both before and after calls to cudaMemcpyAsync.
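(The debug branch itself lives in Horovod's C++ core, but the before/after idea can be sketched in PyTorch terms; run_checked is a hypothetical helper for illustration, not part of the branch.)

import torch

def run_checked(fn):
    # An error raised by this synchronize was caused by an *earlier* async op
    torch.cuda.synchronize()
    result = fn()
    # An error raised by this synchronize was caused by fn itself
    torch.cuda.synchronize()
    return result

# Usage: loss = run_checked(lambda: model(inputs))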
The following message is interesting:

In file included from /home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/ffi/../../lib/include/THC/THC.h:4:0

I installed PyTorch following the instructions on the site.
Also got the same error.
@abidmalikwaterloo, could you try specifying
@mrfox321, could you try running from the debug branch, as described in #404 (comment), to help narrow down this issue?
@alsrgv I tried to build from scratch. I installed PyTorch with conda install pytorch torchvision -c pytorch, and I am getting the following message when running:

(/home/amalik/PyTorchHorovod) [amalik@node04 PyTorchHorovod]$ pip install --user -v --no-cache-dir git+https://github.com/uber/horovod@debug_before_memcpy

It is complaining about MPI??
@abidmalikwaterloo, do you have HOROVOD_MPICXX_SHOW set? It appears that it's set to something incorrect.
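(For context: Horovod's setup.py runs the command in HOROVOD_MPICXX_SHOW, by default mpicxx -show, to discover MPI compile and link flags, so a wrong value produces MPI-related build errors. A typical override, assuming Open MPI's compiler wrapper is on the PATH, looks like:)

HOROVOD_MPICXX_SHOW="mpicxx -show" pip install --user -v --no-cache-dir git+https://github.com/uber/horovod@debug_before_memcpy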
@alsrgv It seems that I haven't gotten any errors yet with this new setting. I also changed the virtual environment. Currently, I am testing it extensively with different runtime variables to check whether the failure is nondeterministic.
@alsrgv Finally
@alsrgv I managed to replicate the error once again using your debugging build. Here is the error traceback (printed identically by both worker processes):

Traceback (most recent call last):
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/media/SSD1/score-textseg/ref_score_net/train.py", line 495, in <module>
    train_loss = train(epoch)
  File "/media/SSD1/score-textseg/ref_score_net/train.py", line 363, in train
    optimizer.step()
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/site-packages/horovod/torch/__init__.py", line 88, in step
    self.synchronize()
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/site-packages/horovod/torch/__init__.py", line 84, in synchronize
    synchronize(handle)
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 417, in synchronize
    mpi_lib.horovod_torch_wait_and_clear(handle)
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/site-packages/torch/utils/ffi/__init__.py", line 197, in safe_call
    result = torch._C._safe_call(*args, **kwargs)
torch.FatalError: cudaMemcpyAsync1 failed: invalid argument
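(For context, a minimal Horovod/PyTorch loop that produces the call chain in the traceback above, where hvd.DistributedOptimizer.step waits on the asynchronous allreduce via synchronize; the model, data, and hyperparameters here are placeholders, not the reporters' actual code.)

import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
# Wrap the optimizer so gradients are allreduced across ranks on every step
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
# Start all ranks from identical weights
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for step in range(100):
    inputs = torch.randn(32, 10).cuda()
    targets = torch.randn(32, 1).cuda()
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()  # waits on the async allreduce; the cudaMemcpyAsync
                      # error in this thread surfaces inside this call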
@abidmalikwaterloo, @andfoy, thanks for reproducing this issue. It certainly narrows it down to a single cudaMemcpyAsync call.
@alsrgv FYI, I'm running the experiments but have been unable to get resources because of the long queue on the cluster. I will update as soon as I see the crash.
I haven't seen such errors since, so I'm closing this issue.
Software:
horovod 0.13.10
TF v1.9.0-0-g25c197e023 (1.9.0)
CUDA 9.0
Open MPI 2.1.1
NCCL 2.2.13
I ran a job on 6 nodes (48 GPUs). It's a very normal Horovod job with an allreduce every step and a broadcast once in a while. It ran well for 17 hours until Horovod threw this error on rank 3:
Posting it in case someone sees similar issues. I understand this is probably not a reproducible error, and perhaps it's not a Horovod issue at all.
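(For context, a minimal sketch of the kind of TF 1.x + Horovod job described above: an allreduce every step via hvd.DistributedOptimizer, with the broadcast represented here by the startup broadcast hook. The model and step count are placeholders, not the actual job.)

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each rank to one GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

x = tf.random_normal([32, 10])
w = tf.get_variable("w", [10, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# Gradients are allreduced across all 48 GPUs on every step
opt = hvd.DistributedOptimizer(tf.train.GradientDescentOptimizer(0.01))
train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())

hooks = [hvd.BroadcastGlobalVariablesHook(0),       # broadcast initial state from rank 0
         tf.train.StopAtStepHook(last_step=1000)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)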