Hey @anweshpanda, what version of PyTorch are you using? Does this show up when you use fp16 compression with one of our example scripts, or only using a custom training script?
It is the MNIST example script, and the PyTorch version is 1.3.1.
I only changed the line
compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none
to
compression = hvd.Compression.fp16
Looks like this is a known incompatibility in PyTorch when using half-precision on CPU: pytorch/pytorch#36318.
However, it sounds like they did recently add support for CPU ops with half precision, so you may want to try upgrading to the latest version of PyTorch to use fp16 compression. But in practice, fp16 will not usually give you much benefit without GPUs, as you're unlikely to be bound by network as opposed to forward/backward passes on CPU.
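One way to avoid the crash on CPU-only runs is to request fp16 compression only when a GPU is actually in use. The following is a minimal sketch, not code from the Horovod example: `select_compression` is a hypothetical helper, and it assumes only that the object passed in exposes `fp16` and `none` attributes, as `hvd.Compression` does in the line quoted above.

```python
def select_compression(compression_cls, fp16_allreduce, cuda_available):
    """Pick a gradient compression scheme for allreduce.

    compression_cls: object exposing `fp16` and `none` attributes
                     (assumed to be hvd.Compression in a real script).
    fp16_allreduce:  True if the user asked for fp16 compression.
    cuda_available:  True if training runs on GPU.
    """
    # Fall back to no compression on CPU, where half-precision ops
    # such as division may not be implemented in older PyTorch builds.
    if fp16_allreduce and cuda_available:
        return compression_cls.fp16
    return compression_cls.none


# Hypothetical usage inside a training script (requires horovod and torch):
#   compression = select_compression(
#       hvd.Compression, args.fp16_allreduce, torch.cuda.is_available())
```

With this guard, passing the fp16 flag on a CPU-only cluster silently uses uncompressed allreduce instead of aborting inside the Horovod background thread.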
Hi,
I am using Horovod with PyTorch. With the provided MNIST example, when I use fp16 compression instead of none, I get the following error:
[1,0]:terminate called after throwing an instance of 'c10::Error'
[1,0]: what(): "div_cpu" not implemented for 'Half' (operator() at /opt/anaconda/conda-bld/pytorch-base_1588647739240/work/build/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp.AVX.cpp:95)
[1,0]:frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6d (0x7fa324a9bd2d in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libc10.so)
[1,0]:frame #1: + 0x20ba455 (0x7fa2e7882455 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #2: + 0x10e9633 (0x7fa2e68b1633 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #3: at::native::div_out(at::Tensor&, at::Tensor const&, at::Tensor const&) + 0x5f (0x7fa2e68a9d1f in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #4: + 0x152c260 (0x7fa2e6cf4260 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #5: at::Tensor::div(at::Tensor const&) const + 0x110 (0x7fa2e68b4620 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #6: at::native::div(at::Tensor&, c10::Scalar) + 0x46 (0x7fa2e68ab7c6 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #7: + 0x16bc55c (0x7fa2e6e8455c in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #8: + 0x32d0fd1 (0x7fa2e8a98fd1 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #9: + 0xbacfa (0x7fa283289cfa in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/horovod/torch/mpi_lib_v2.cpython-37m-x86_64-linux-gnu.so)
[1,0]:frame #10: + 0xaf07b (0x7fa28327e07b in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/horovod/torch/mpi_lib_v2.cpython-37m-x86_64-linux-gnu.so)
[1,0]:frame #11: + 0x5d35b (0x7fa28322c35b in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/horovod/torch/mpi_lib_v2.cpython-37m-x86_64-linux-gnu.so)
[1,0]:frame #12: + 0xc819d (0x7fa32418019d in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6)
[1,0]:frame #13: + 0x84f9 (0x7fa32bad04f9 in /lib64/libpthread.so.0)
[1,0]:frame #14: clone + 0x3f (0x7fa32b808f2f in /lib64/libc.so.6)
[1,0]:
[1,0]:[login1:32123] *** Process received signal ***
[1,0]:[login1:32123] Signal: Aborted (6)
[1,0]:[login1:32123] Signal code: (-6)
[1,0]:[login1:32123] [ 0] /lib64/libpthread.so.0(+0x132d0)[0x7fa32badb2d0]
[1,0]:[login1:32123] [ 1] /lib64/libc.so.6(gsignal+0x110)[0x7fa32b746520]
[1,0]:[login1:32123] [ 2] /lib64/libc.so.6(abort+0x151)[0x7fa32b747b01]
[1,0]:[login1:32123] [ 3] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xbc)[0x7fa32416584a]
[1,0]:[login1:32123] [ 4] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6(+0xabf47)[0x7fa324163f47]
[1,0]:[login1:32123] [ 5] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6(+0xabf7d)[0x7fa324163f7d]
[1,0]:[login1:32123] [ 6] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6(__cxa_rethrow+0x0)[0x7fa32416415a]
[1,0]:[login1:32123] [ 7] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so(+0x20ba4d0)[0x7fa2e78824d0]
[1,0]:[login1:32123] [ 8] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so(+0x10e9633)[0x7fa2e68b1633]
I am running my code on multiple CPUs. Any suggestions on how to get rid of this error would be a great help.