
cudaMemcpyAsync failed: invalid argument during training #404

Closed

ppwwyyxx opened this issue Jul 26, 2018 · 22 comments

@ppwwyyxx (Contributor)
Software:
horovod 0.13.10
TF: v1.9.0-0-g25c197e023 1.9.0
cuda 9.0
openmpi 2.1.1
NCCL 2.2.13

I ran a job on 6 nodes (48 GPUs). It's a very normal Horovod job with an allreduce every step and a broadcast once in a while. It ran well for 17 hours until Horovod threw this error on rank 3:

Caused by op 'HVDAllReduce/HorovodAllreduce_gradients_maskrcnn_conv_BiasAdd_grad_BiasAddGrad_0', defined at:
  File "./train.py", line 607, in <module>
    launch_train_with_config(traincfg, trainer)
  File "/HOME/tensorpack/tensorpack/train/interface.py", line 83, in launch_train_with_config
    model._build_graph_get_cost, model.get_optimizer)
  File "/HOME/tensorpack/tensorpack/utils/argtools.py", line 181, in wrapper
    return func(*args, **kwargs)
  File "/HOME/tensorpack/tensorpack/train/tower.py", line 172, in setup_graph
    train_callbacks = self._setup_graph(input, get_cost_fn, get_opt_fn)
  File "/HOME/tensorpack/tensorpack/train/trainers.py", line 378, in _setup_graph
  File "/HOME/tensorpack/tensorpack/train/trainers.py", line 369, in allreduce
    avg_grad = hvd.allreduce(grad, average=self._average)
  File "/HOME/.local/lib/python3.6/site-packages/horovod/tensorflow/__init__.py", line 82, in allreduce
    summed_tensor = _allreduce(tensor)
  File "/HOME/.local/lib/python3.6/site-packages/horovod/tensorflow/mpi_ops.py", line 78, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name)
  File "<string>", line 50, in horovod_allreduce
  File "/HOME/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/HOME/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
    op_def=op_def)
  File "/HOME/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

UnknownError (see above for traceback): cudaMemcpyAsync failed: invalid argument
   [[Node: HVDAllReduce/HorovodAllreduce_gradients_maskrcnn_conv_BiasAdd_grad_BiasAddGrad_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/maskrcnn/conv/BiasAdd_grad/BiasAddGrad)]]

Posting it in case someone sees similar issues. I understand this is probably not a reproducible error, and perhaps it's not a Horovod issue at all.
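For context, the per-step communication is just Horovod's standard gradient averaging. A minimal pure-Python model of what the averaging allreduce computes across ranks (no Horovod or GPUs involved; names are illustrative):

```python
def allreduce_average(rank_gradients):
    """Model of an averaging allreduce: every rank contributes its local
    gradient vector and every rank receives the element-wise mean."""
    num_ranks = len(rank_gradients)
    length = len(rank_gradients[0])
    summed = [sum(g[i] for g in rank_gradients) for i in range(length)]
    return [s / num_ranks for s in summed]

# 4 ranks, each holding a 3-element gradient
grads = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [0.0, 0.0, 0.0], [4.0, 4.0, 4.0]]
print(allreduce_average(grads))  # every rank ends up with [2.0, 2.0, 2.0]
```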

@winwinJJiang

I hit the same issue: training runs for a while and then stops.

@alsrgv (Member) commented Jul 28, 2018

@ppwwyyxx, @winwinJJiang, anything in dmesg?

@ppwwyyxx (Contributor, Author) commented Aug 2, 2018

No, nothing was printed in dmesg on the day the error happened.

@alsrgv (Member) commented Aug 3, 2018

I'd recommend running the job with NCCL_DEBUG=INFO. That may shed some light next time it happens.

@ppwwyyxx (Contributor, Author) commented Aug 3, 2018

The job was run with NCCL_DEBUG=INFO. It only printed some normal output at the very beginning of training and nothing afterwards:

INFO NET : Using interface enp1s0f0:xx.xx.xx.xx<0>
INFO NET/IB : Using interface enp1s0f0 for sideband communication
INFO NET/IB: [0] mlx5_3:1/IB
INFO NET/IB: [1] mlx5_2:1/IB
INFO NET/IB: [2] mlx5_1:1/IB
INFO NET/IB: [3] mlx5_0:1/IB
INFO Using internal Network IB
INFO Using NCCL Low-latency algorithm for sizes below 16384
INFO comm 0x7f9faeec2b80 rank 3 nranks 48
INFO NET : Using interface enp1s0f0:xx.xx.xx.xx<0>
INFO NET/Socket : 1 interfaces found
INFO CUDA Dev 3, IB Ports : mlx5_3/1(SOC) mlx5_2/1(SOC) mlx5_1/1(PIX) mlx5_0/1(PHB)
INFO Ring 00 : 3[3] -> 1[1] via P2P/IPC
INFO Ring 01 : 3[3] -> 7[7] via P2P/IPC
INFO NET/IB: Dev 2 Port 1 qpn 5285 mtu 5 LID 194
INFO Ring 03 : 3[3] -> 2[2] via P2P/IPC

@alsrgv (Member) commented Aug 3, 2018

Gotcha. Actually, since it's failing at cudaMemcpyAsync, I'm guessing it's either this or this operation. It may be a race condition or a CUDA issue. Do you have any sort of repro for it?

@ppwwyyxx (Contributor, Author) commented Aug 3, 2018

I'm afraid I don't know how to reproduce it. Right now I've only seen it once.

Today I saw an issue (tensorflow/tensorflow#21338) which basically says that an unchecked CUDA error from a buggy TensorFlow op may leak into other ops, making those ops appear to fail. I guess this is a possible explanation.
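The leak mechanism described there can be mimicked in plain Python: CUDA keeps a sticky error state, so an op that launches something and never checks the status leaves the error to be reported by whichever op queries the status next. A toy model (no CUDA involved; all names are made up for illustration):

```python
class FakeCudaContext:
    """Toy model of CUDA's sticky error state shared between ops."""

    def __init__(self):
        self.last_error = None

    def launch(self, ok):
        # A failing async launch records an error but returns immediately.
        if not ok:
            self.last_error = "invalid argument"

    def check_last_error(self):
        # Like cudaGetLastError: returns the sticky error and clears it.
        err, self.last_error = self.last_error, None
        return err


ctx = FakeCudaContext()
ctx.launch(ok=False)           # buggy op fails but never checks the status
ctx.launch(ok=True)            # innocent op runs fine...
print(ctx.check_last_error())  # ...yet it is the one that observes "invalid argument"
```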

@andfoy (Contributor) commented Sep 21, 2018

I'm seeing this error in PyTorch as well, but I'm unable to find a reproduction scenario; it occurs randomly some time after a new epoch starts.

@alsrgv (Member) commented Sep 28, 2018

I have published a branch with debug code to narrow down the invalid argument issue. To install it, use [your normal flags] pip install -v --no-cache-dir git+https://github.com/uber/horovod@debug_before_memcpy.

@andfoy, @ppwwyyxx, @abidmalikwaterloo, could you try running it and report if you observe any issues?

The primary difference in the debug branch is that it checks CUDA errors both before and after calls to cudaMemcpyAsync, and it also assigns a unique ID to each call to help narrow things down.
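Conceptually, the added instrumentation wraps each copy like the following sketch (a Python stand-in for the C++ checks; the class and function names are illustrative, not Horovod's actual code):

```python
class ErrorState:
    """Stand-in for CUDA's sticky error state."""

    def __init__(self):
        self.err = None

    def get_and_clear(self):
        # Like cudaGetLastError: return and clear the pending error.
        e, self.err = self.err, None
        return e


def checked_memcpy_async(state, copy_fn, call_id):
    """Check for a pre-existing error before the copy and for a new error
    after it, tagging any failure with a unique per-call ID so the report
    distinguishes 'leaked from an earlier op' from 'this copy failed'."""
    e = state.get_and_clear()
    if e is not None:
        raise RuntimeError(f"error before cudaMemcpyAsync{call_id}: {e}")
    copy_fn()
    e = state.get_and_clear()
    if e is not None:
        raise RuntimeError(f"cudaMemcpyAsync{call_id} failed: {e}")


state = ErrorState()

def bad_copy():                # a copy whose launch records an error
    state.err = "invalid argument"

try:
    checked_memcpy_async(state, bad_copy, call_id=1)
except RuntimeError as exc:
    print(exc)                 # cudaMemcpyAsync1 failed: invalid argument
```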

@abidmalikwaterloo

@alsrgv Thanks for this. However, I am not able to install Horovod with it.

Here is the complete trace:

(/home/amalik/Pytorch_virtual_enviornment) [amalik@hpc1 hpc1_runs]$ pip uninstall -y horovod
Uninstalling horovod-0.14.1:
Successfully uninstalled horovod-0.14.1
(/home/amalik/Pytorch_virtual_enviornment) [amalik@hpc1 hpc1_runs]$ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_NCCL_HOME=/home/amalik/nccl_2.1.4-1+cuda8.0_x86_64 pip install --user -v --no-cache-dir git+https://github.com/uber/horovod@debug_before_memcpy
Created temporary directory: /tmp/pip-ephem-wheel-cache-cw8AP3
Created temporary directory: /tmp/pip-install-OzRcAf
Collecting git+https://github.com/uber/horovod@debug_before_memcpy
Created temporary directory: /tmp/pip-req-build-04qBat
Cloning https://github.com/uber/horovod (to revision debug_before_memcpy) to /tmp/pip-req-build-04qBat
Running command git clone -q https://github.com/uber/horovod /tmp/pip-req-build-04qBat
Running command git show-ref debug_before_memcpy
2d42310 refs/remotes/origin/debug_before_memcpy
Running command git rev-parse HEAD
c9435dc
Running command git checkout -q 2d42310
Running setup.py (path:/tmp/pip-req-build-04qBat/setup.py) egg_info for package from git+https://github.com/uber/horovod@debug_before_memcpy
Running command python setup.py egg_info
running egg_info
creating pip-egg-info/horovod.egg-info
writing requirements to pip-egg-info/horovod.egg-info/requires.txt
writing pip-egg-info/horovod.egg-info/PKG-INFO
writing top-level names to pip-egg-info/horovod.egg-info/top_level.txt
writing dependency_links to pip-egg-info/horovod.egg-info/dependency_links.txt
writing manifest file 'pip-egg-info/horovod.egg-info/SOURCES.txt'
reading manifest file 'pip-egg-info/horovod.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
no previously-included directories found matching '.eggs'
writing manifest file 'pip-egg-info/horovod.egg-info/SOURCES.txt'
Source in /tmp/pip-req-build-04qBat has version 0.14.1, which satisfies requirement horovod==0.14.1 from git+https://github.com/uber/horovod@debug_before_memcpy
Requirement already satisfied: cffi>=1.4.0 in /home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages (from horovod==0.14.1) (1.11.5)
Requirement already satisfied: pycparser in /home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages (from cffi>=1.4.0->horovod==0.14.1) (2.18)
mkl-fft 1.0.4 requires cython, which is not installed.
mkl-random 1.0.1 requires cython, which is not installed.
mxnet-cu90 1.1.0 requires requests==2.18.4, which is not installed.
grpcio 1.14.1 requires enum34>=1.0.4, which is not installed.
tensorflow 1.10.0 requires enum34>=1.1.6, which is not installed.
tensorflow 1.10.0 requires mock>=2.0.0, which is not installed.
tensorflow 1.10.0 has requirement numpy<=1.14.5,>=1.13.3, but you'll have numpy 1.15.1 which is incompatible.
mxnet-cu90 1.1.0 has requirement numpy<=1.13.3, but you'll have numpy 1.15.1 which is incompatible.
Installing collected packages: horovod
Created temporary directory: /tmp/pip-record-lkU3vO
Running setup.py install for horovod ... Running command /home/amalik/Pytorch_virtual_enviornment/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-req-build-04qBat/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-record-lkU3vO/install-record.txt --single-version-externally-managed --compile --user --prefix=
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/horovod
copying horovod/init.py -> build/lib.linux-x86_64-2.7/horovod
creating build/lib.linux-x86_64-2.7/horovod/common
copying horovod/common/init.py -> build/lib.linux-x86_64-2.7/horovod/common
creating build/lib.linux-x86_64-2.7/horovod/keras
copying horovod/keras/init.py -> build/lib.linux-x86_64-2.7/horovod/keras
copying horovod/keras/callbacks.py -> build/lib.linux-x86_64-2.7/horovod/keras
copying horovod/keras/callbacks_impl.py -> build/lib.linux-x86_64-2.7/horovod/keras
copying horovod/keras/impl.py -> build/lib.linux-x86_64-2.7/horovod/keras
creating build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/init.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
creating build/lib.linux-x86_64-2.7/horovod/torch
copying horovod/torch/init.py -> build/lib.linux-x86_64-2.7/horovod/torch
copying horovod/torch/mpi_ops.py -> build/lib.linux-x86_64-2.7/horovod/torch
creating build/lib.linux-x86_64-2.7/horovod/tensorflow/keras
copying horovod/tensorflow/keras/init.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow/keras
copying horovod/tensorflow/keras/callbacks.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow/keras
creating build/lib.linux-x86_64-2.7/horovod/torch/mpi_lib
copying horovod/torch/mpi_lib/init.py -> build/lib.linux-x86_64-2.7/horovod/torch/mpi_lib
creating build/lib.linux-x86_64-2.7/horovod/torch/mpi_lib_impl
copying horovod/torch/mpi_lib_impl/init.py -> build/lib.linux-x86_64-2.7/horovod/torch/mpi_lib_impl
running build_ext
mpicc -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -fPIC -O2 -I/home/amalik/Pytorch_virtual_enviornment/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_cpp_flags.cc -o build/temp.linux-x86_64-2.7/test_compile/test_cpp_flags.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -shared -B /home/amalik/Pytorch_virtual_enviornment/compiler_compat -L/home/amalik/Pytorch_virtual_enviornment/lib -Wl,-rpath=/home/amalik/Pytorch_virtual_enviornment/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-2.7/test_compile/test_cpp_flags.o -L/home/amalik/Pytorch_virtual_enviornment/lib -o build/temp.linux-x86_64-2.7/test_compile/test_cpp_flags.so
mpicc -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/amalik/Pytorch_virtual_enviornment/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_link_flags.cc -o build/temp.linux-x86_64-2.7/test_compile/test_link_flags.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -shared -B /home/amalik/Pytorch_virtual_enviornment/compiler_compat -L/home/amalik/Pytorch_virtual_enviornment/lib -Wl,-rpath=/home/amalik/Pytorch_virtual_enviornment/lib -Wl,--no-as-needed -Wl,--sysroot=/ -Wl,--version-script=horovod.lds build/temp.linux-x86_64-2.7/test_compile/test_link_flags.o -L/home/amalik/Pytorch_virtual_enviornment/lib -o build/temp.linux-x86_64-2.7/test_compile/test_link_flags.so
mpicc -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -fPIC -O2 -I/usr/local/cuda/include -I/home/amalik/Pytorch_virtual_enviornment/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_cuda.cc -o build/temp.linux-x86_64-2.7/test_compile/test_cuda.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
build/temp.linux-x86_64-2.7/test_compile/test_cuda.cc:1:10: fatal error: cuda_runtime.h: No such file or directory
#include <cuda_runtime.h>
^~~~~~~~~~~~~~~~
compilation terminated.
error: CUDA library was not found (see error above).
Please specify correct CUDA location with the HOROVOD_CUDA_HOME environment variable or combination of HOROVOD_CUDA_INCLUDE and HOROVOD_CUDA_LIB environment variables.

HOROVOD_CUDA_HOME - path where CUDA include and lib directories can be found
HOROVOD_CUDA_INCLUDE - path to CUDA include directory
HOROVOD_CUDA_LIB - path to CUDA lib directory

error
Cleaning up...
Removing source in /tmp/pip-req-build-04qBat
Command "/home/amalik/Pytorch_virtual_enviornment/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-req-build-04qBat/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-record-lkU3vO/install-record.txt --single-version-externally-managed --compile --user --prefix=" failed with error code 1 in /tmp/pip-req-build-04qBat/
Exception information:
Traceback (most recent call last):
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/pip/_internal/basecommand.py", line 228, in main
status = self.run(options, args)
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/pip/_internal/commands/install.py", line 335, in run
use_user_site=options.use_user_site,
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/pip/_internal/req/init.py", line 49, in install_given_reqs
**kwargs
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/pip/_internal/req/req_install.py", line 779, in install
spinner=spinner,
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/pip/_internal/utils/misc.py", line 698, in call_subprocess
% (command_desc, proc.returncode, cwd))
InstallationError: Command "/home/amalik/Pytorch_virtual_enviornment/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-req-build-04qBat/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-record-lkU3vO/install-record.txt --single-version-externally-managed --compile --user --prefix=" failed with error code 1 in /tmp/pip-req-build-04qBat/
1 location(s) to search for versions of pip:

import horovod as hvd
Traceback (most recent call last):
File "", line 1, in
ImportError: No module named horovod

@abidmalikwaterloo

The following message is interesting:

In file included from /home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/ffi/../../lib/include/THC/THC.h:4:0,
from _test_cuda.c:493:
/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/ffi/../../lib/include/THC/THCGeneral.h:12:10: fatal error: cuda.h: No such file or directory
#include "cuda.h"
^~~~~~~~
compilation terminated.
INFO: Above error indicates that this PyTorch installation does not support CUDA.
INFO: Unable to build PyTorch plugin, will skip it.

Traceback (most recent call last):
  File "/tmp/pip-req-build-QFl6Ac/setup.py", line 720, in build_extensions
    build_torch_extension(self, options, torch_version)
  File "/tmp/pip-req-build-QFl6Ac/setup.py", line 587, in build_torch_extension
    'Horovod build with GPU support was requested, but this PyTorch '
DistutilsPlatformError: Horovod build with GPU support was requested, but this PyTorch installation does not support CUDA.

error: Neither TensorFlow nor PyTorch plugins were built. See errors above.

error

I installed PyTorch following the instructions at https://pytorch.org/, using:

pip install --user torch torchvision

@mrfox321

Also got the same error:

Caused by op u'DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_cudnn_rnn_stack_cudnn_gru_CudnnRNN_grad_tuple_control_dependency_3_0', defined at:
...
UnknownError (see above for traceback): cudaMemcpyAsync failed: invalid argument [[Node: DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_cudnn_rnn_stack_cudnn_gru_CudnnRNN_grad_tuple_control_dependency_3_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/cudnn_rnn_stack/cudnn_gru/CudnnRNN_grad/CudnnRNNBackprop:3)]]

@alsrgv (Member) commented Sep 28, 2018

@abidmalikwaterloo, could you try specifying HOROVOD_CUDA_HOME=/path/to/your/cuda in your installation flags?

@alsrgv (Member) commented Sep 28, 2018

@mrfox321, could you try running from debug branch, as described in #404 (comment), to help narrow down this issue?

@abidmalikwaterloo

@alsrgv I tried to build from scratch:


conda install pytorch torchvision -c pytorch
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/amalik/nccl_2.1.15-1+cuda9.0_x86_64/lib
export HOROVOD_NCCL_HOME=/home/amalik/nccl_2.1.15-1+cuda9.0_x86_64/
export HOROVOD_GPU_ALLREDUCE=NCCL
export HOROVOD_CUDA_HOME= /software/cuda/
export PATH=$PATH:/software/openmpi/3.0.0-gnu/bin/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/software/openmpi/3.0.0-gnu/lib/
pip install --user -v --no-cache-dir git+https://github.com/uber/horovod@debug_before_memcpy

I am getting the following message:

(/home/amalik/PyTorchHorovod) [amalik@node04 PyTorchHorovod]$ pip install --user -v --no-cache-dir git+https://github.com/uber/horovod@debug_before_memcpy
Created temporary directory: /tmp/pip-ephem-wheel-cache-v94wl1es
Created temporary directory: /tmp/pip-install-6s89svrr
Collecting git+https://github.com/uber/horovod@debug_before_memcpy
Created temporary directory: /tmp/pip-req-build-vlcqfrb8
Cloning https://github.com/uber/horovod (to revision debug_before_memcpy) to /tmp/pip-req-build-vlcqfrb8
Running command git clone -q https://github.com/uber/horovod /tmp/pip-req-build-vlcqfrb8
Running command git show-ref debug_before_memcpy
2d42310 refs/remotes/origin/debug_before_memcpy
Running command git rev-parse HEAD
8d72d66
Running command git checkout -q 2d42310
Running setup.py (path:/tmp/pip-req-build-vlcqfrb8/setup.py) egg_info for package from git+https://github.com/uber/horovod@debug_before_memcpy
Running command python setup.py egg_info
running egg_info
creating pip-egg-info/horovod.egg-info
writing pip-egg-info/horovod.egg-info/PKG-INFO
writing dependency_links to pip-egg-info/horovod.egg-info/dependency_links.txt
writing requirements to pip-egg-info/horovod.egg-info/requires.txt
writing top-level names to pip-egg-info/horovod.egg-info/top_level.txt
writing manifest file 'pip-egg-info/horovod.egg-info/SOURCES.txt'
reading manifest file 'pip-egg-info/horovod.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
no previously-included directories found matching '.eggs'
writing manifest file 'pip-egg-info/horovod.egg-info/SOURCES.txt'
Source in /tmp/pip-req-build-vlcqfrb8 has version 0.14.1, which satisfies requirement horovod==0.14.1 from git+https://github.com/uber/horovod@debug_before_memcpy
Requirement already satisfied: cffi>=1.4.0 in ./lib/python3.6/site-packages (from horovod==0.14.1) (1.11.5)
Requirement already satisfied: pycparser in ./lib/python3.6/site-packages (from cffi>=1.4.0->horovod==0.14.1) (2.18)
Installing collected packages: horovod
Created temporary directory: /tmp/pip-record-db5v6nte
Running setup.py install for horovod ... Running command /home/amalik/PyTorchHorovod/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-req-build-vlcqfrb8/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-record-db5v6nte/install-record.txt --single-version-externally-managed --compile --user --prefix=
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/horovod
copying horovod/init.py -> build/lib.linux-x86_64-3.6/horovod
creating build/lib.linux-x86_64-3.6/horovod/common
copying horovod/common/init.py -> build/lib.linux-x86_64-3.6/horovod/common
creating build/lib.linux-x86_64-3.6/horovod/keras
copying horovod/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/keras
copying horovod/keras/callbacks_impl.py -> build/lib.linux-x86_64-3.6/horovod/keras
copying horovod/keras/init.py -> build/lib.linux-x86_64-3.6/horovod/keras
copying horovod/keras/impl.py -> build/lib.linux-x86_64-3.6/horovod/keras
creating build/lib.linux-x86_64-3.6/horovod/tensorflow
copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
copying horovod/tensorflow/init.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
creating build/lib.linux-x86_64-3.6/horovod/torch
copying horovod/torch/init.py -> build/lib.linux-x86_64-3.6/horovod/torch
copying horovod/torch/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/torch
creating build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
copying horovod/tensorflow/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
copying horovod/tensorflow/keras/init.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
copying horovod/torch/mpi_lib/init.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
copying horovod/torch/mpi_lib_impl/init.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
running build_ext
gcc -pthread -B /home/amalik/PyTorchHorovod/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -fPIC -O2 -I/home/amalik/PyTorchHorovod/include/python3.6m -c build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.cc -o build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -shared -B /home/amalik/PyTorchHorovod/compiler_compat -L/home/amalik/PyTorchHorovod/lib -Wl,-rpath=/home/amalik/PyTorchHorovod/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.o -o build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.so
gcc -pthread -B /home/amalik/PyTorchHorovod/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/amalik/PyTorchHorovod/include/python3.6m -c build/temp.linux-x86_64-3.6/test_compile/test_link_flags.cc -o build/temp.linux-x86_64-3.6/test_compile/test_link_flags.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -shared -B /home/amalik/PyTorchHorovod/compiler_compat -L/home/amalik/PyTorchHorovod/lib -Wl,-rpath=/home/amalik/PyTorchHorovod/lib -Wl,--no-as-needed -Wl,--sysroot=/ -Wl,--version-script=horovod.lds build/temp.linux-x86_64-3.6/test_compile/test_link_flags.o -o build/temp.linux-x86_64-3.6/test_compile/test_link_flags.so
error: /software/openmpi/3.0.0-gnu/ failed (see error below), is MPI in $PATH?
Note: If your version of MPI has a custom command to show compilation flags, please specify it with the HOROVOD_MPICXX_SHOW environment variable.

Traceback (most recent call last):
  File "/tmp/pip-req-build-vlcqfrb8/setup.py", line 221, in get_mpi_flags
    shlex.split(show_command), universal_newlines=True).strip()
  File "/home/amalik/PyTorchHorovod/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/home/amalik/PyTorchHorovod/lib/python3.6/subprocess.py", line 403, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/home/amalik/PyTorchHorovod/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/home/amalik/PyTorchHorovod/lib/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: '/software/openmpi/3.0.0-gnu/'

error
Cleaning up...
Removing source in /tmp/pip-req-build-vlcqfrb8
Command "/home/amalik/PyTorchHorovod/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-req-build-vlcqfrb8/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-record-db5v6nte/install-record.txt --single-version-externally-managed --compile --user --prefix=" failed with error code 1 in /tmp/pip-req-build-vlcqfrb8/
Exception information:
Traceback (most recent call last):
File "/home/amalik/PyTorchHorovod/lib/python3.6/site-packages/pip/_internal/basecommand.py", line 228, in main
status = self.run(options, args)
File "/home/amalik/PyTorchHorovod/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 335, in run
use_user_site=options.use_user_site,
File "/home/amalik/PyTorchHorovod/lib/python3.6/site-packages/pip/_internal/req/init.py", line 49, in install_given_reqs
**kwargs
File "/home/amalik/PyTorchHorovod/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 779, in install
spinner=spinner,
File "/home/amalik/PyTorchHorovod/lib/python3.6/site-packages/pip/_internal/utils/misc.py", line 698, in call_subprocess
% (command_desc, proc.returncode, cwd))
pip._internal.exceptions.InstallationError: Command "/home/amalik/PyTorchHorovod/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-req-build-vlcqfrb8/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-record-db5v6nte/install-record.txt --single-version-externally-managed --compile --user --prefix=" failed with error code 1 in /tmp/pip-req-build-vlcqfrb8/
You are using pip version 10.0.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

It is complaining about MPI?

@alsrgv (Member) commented Oct 2, 2018

@abidmalikwaterloo, do you have HOROVOD_MPICXX_SHOW set? It appears that it's set to /software/openmpi/3.0.0-gnu/. Can you un-set it, and if that does not help, set it to export HOROVOD_MPICXX_SHOW="/software/openmpi/3.0.0-gnu/bin/mpicxx -show"?

@abidmalikwaterloo

@alsrgv So far I haven't seen any error with this new setting. I also changed the virtual environment. Currently, I am testing it extensively with different runtime variables to check whether the failure is nondeterministic.

@abidmalikwaterloo commented Oct 4, 2018

@alsrgv I finally got the same error. FYI, I ran a successful training with 5 GPUs for 10 epochs; then I tried with 6 GPUs and got the error. I think this shows nondeterministic behavior. I am attaching the error log for your convenience.

err_4559.log.pdf

@andfoy (Contributor) commented Oct 6, 2018

@alsrgv I managed to replicate the error once again using your debugging build; here is the error traceback:

Traceback (most recent call last):
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/runpy.py", line 193, in _run_module_as_main
Traceback (most recent call last):
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/runpy.py", line 85, in _run_code
    "__main__", mod_spec)
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
    exec(code, run_globals)
  File "/media/SSD1/score-textseg/ref_score_net/train.py", line 495, in <module>
  File "/media/SSD1/score-textseg/ref_score_net/train.py", line 495, in <module>
    train_loss = train(epoch)
  File "/media/SSD1/score-textseg/ref_score_net/train.py", line 363, in train
    train_loss = train(epoch)
  File "/media/SSD1/score-textseg/ref_score_net/train.py", line 363, in train
    optimizer.step()
    optimizer.step()
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/site-packages/horovod/torch/__init__.py", line 88, in step
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/site-packages/horovod/torch/__init__.py", line 88, in step
    self.synchronize()
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/site-packages/horovod/torch/__init__.py", line 84, in synchronize
    self.synchronize()
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/site-packages/horovod/torch/__init__.py", line 84, in synchronize
    synchronize(handle)
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 417, in synchronize
    synchronize(handle)
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 417, in synchronize
    mpi_lib.horovod_torch_wait_and_clear(handle)
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/site-packages/torch/utils/ffi/__init__.py", line 197, in safe_call
    mpi_lib.horovod_torch_wait_and_clear(handle)
  File "/home/eamargffoy/anaconda3/envs/parallel/lib/python3.6/site-packages/torch/utils/ffi/__init__.py", line 197, in safe_call
    result = torch._C._safe_call(*args, **kwargs)
    result = torch._C._safe_call(*args, **kwargs)
torch.FatalError: cudaMemcpyAsync1 failed: invalid argument
torch.FatalError: cudaMemcpyAsync1 failed: invalid argument

@alsrgv (Member) commented Oct 7, 2018

@abidmalikwaterloo, @andfoy, thanks for reproducing this issue. It certainly narrows it down to a single cudaMemcpyAsync statement. I've updated the debug_before_memcpy branch with additional debug information. Could you re-install and reproduce again?

@abidmalikwaterloo

@alsrgv FYI, I'm running the experiments, but am unable to get resources because of the long queue on the cluster. Will update as soon as I see the crash.

@ppwwyyxx (Contributor, Author) commented Sep 1, 2019

Haven't seen such errors since, so closing.

@ppwwyyxx ppwwyyxx closed this as completed Sep 1, 2019