Error while trying to use gradient compression #2108

anweshpanda · 2020-07-15T05:26:01Z

Hi,
I am using Horovod with PyTorch. With the provided MNIST example, if I use fp16 compression instead of none, I get the following error:

[1,0]:terminate called after throwing an instance of 'c10::Error'
[1,0]: what(): "div_cpu" not implemented for 'Half' (operator() at /opt/anaconda/conda-bld/pytorch-base_1588647739240/work/build/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp.AVX.cpp:95)
[1,0]:frame #0: c10::Error::Error(c10::SourceLocation, std::cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0x6d (0x7fa324a9bd2d in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libc10.so)
[1,0]:frame #1: + 0x20ba455 (0x7fa2e7882455 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #2: + 0x10e9633 (0x7fa2e68b1633 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #3: at::native::div_out(at::Tensor&, at::Tensor const&, at::Tensor const&) + 0x5f (0x7fa2e68a9d1f in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #4: + 0x152c260 (0x7fa2e6cf4260 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #5: at::Tensor::div(at::Tensor const&) const + 0x110 (0x7fa2e68b4620 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #6: at::native::div(at::Tensor&, c10::Scalar) + 0x46 (0x7fa2e68ab7c6 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #7: + 0x16bc55c (0x7fa2e6e8455c in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #8: + 0x32d0fd1 (0x7fa2e8a98fd1 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #9: + 0xbacfa (0x7fa283289cfa in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/horovod/torch/mpi_lib_v2.cpython-37m-x86_64-linux-gnu.so)
[1,0]:frame #10: + 0xaf07b (0x7fa28327e07b in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/horovod/torch/mpi_lib_v2.cpython-37m-x86_64-linux-gnu.so)
[1,0]:frame #11: + 0x5d35b (0x7fa28322c35b in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/horovod/torch/mpi_lib_v2.cpython-37m-x86_64-linux-gnu.so)
[1,0]:frame #12: + 0xc819d (0x7fa32418019d in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6)
[1,0]:frame #13: + 0x84f9 (0x7fa32bad04f9 in /lib64/libpthread.so.0)
[1,0]:frame #14: clone + 0x3f (0x7fa32b808f2f in /lib64/libc.so.6)
[1,0]:
[1,0]:[login1:32123] * Process received signal *
[1,0]:[login1:32123] Signal: Aborted (6)
[1,0]:[login1:32123] Signal code: (-6)
[1,0]:[login1:32123] [ 0] /lib64/libpthread.so.0(+0x132d0)[0x7fa32badb2d0]
[1,0]:[login1:32123] [ 1] /lib64/libc.so.6(gsignal+0x110)[0x7fa32b746520]
[1,0]:[login1:32123] [ 2] /lib64/libc.so.6(abort+0x151)[0x7fa32b747b01]
[1,0]:[login1:32123] [ 3] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xbc)[0x7fa32416584a]
[1,0]:[login1:32123] [ 4] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6(+0xabf47)[0x7fa324163f47]
[1,0]:[login1:32123] [ 5] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6(+0xabf7d)[0x7fa324163f7d]
[1,0]:[login1:32123] [ 6] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6(__cxa_rethrow+0x0)[0x7fa32416415a]
[1,0]:[login1:32123] [ 7] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so(+0x20ba4d0)[0x7fa2e78824d0]
[1,0]:[login1:32123] [ 8] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so(+0x10e9633)[0x7fa2e68b1633]

I am running my code on multiple CPUs. Any suggestions on how to get rid of this error would be a great help.

tgaddair · 2020-07-15T20:14:37Z

Hey @anweshpanda, what version of PyTorch are you using? Does this show up when you use fp16 compression with one of our example scripts, or only using a custom training script?

anweshpanda · 2020-07-16T06:08:50Z

It is the example script for MNIST. The PyTorch version is 1.3.1.
I just changed the line
compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none
to
compression = hvd.Compression.fp16

tgaddair · 2020-07-16T20:09:57Z

Looks like this is a known incompatibility in PyTorch when using half-precision on CPU: pytorch/pytorch#36318.
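
If it helps to check whether a particular PyTorch install is affected, here is a minimal probe. It is only a sketch: the function name `cpu_half_div_supported` is made up for illustration, and it assumes the failure surfaces in Python as a `RuntimeError` when dividing a half-precision CPU tensor by a scalar (the same `div` path that appears in the trace above).

```python
def cpu_half_div_supported():
    """Return True/False if CPU half-precision division works, None if torch is absent.

    Hypothetical helper for illustration; not part of Horovod or PyTorch.
    """
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed in this environment

    # Build the tensor in float32 first, then cast, to avoid relying on
    # half-precision fill kernels on CPU.
    x = torch.zeros(4).half()
    try:
        x.div(2)  # tensor / scalar division on a CPU Half tensor
    except RuntimeError:
        # Assumption: affected builds raise an error such as
        # '"div_cpu" not implemented for 'Half''.
        return False
    return True


if __name__ == "__main__":
    print("CPU half-precision div supported:", cpu_half_div_supported())
```

If this prints `False`, fp16 compression on CPU is likely to hit the same error with that install.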

However, it sounds like they recently added support for half-precision ops on CPU, so you may want to try upgrading to the latest version of PyTorch to use fp16 compression. In practice, though, fp16 compression will not usually give you much benefit without GPUs, as you're more likely to be bound by the forward/backward passes on CPU than by the network.
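
One way to avoid the crash without upgrading is to fall back to no compression whenever training runs on CPU. The following is a rough sketch under that assumption; `choose_compression` is a hypothetical helper, not a Horovod API, and it only relies on the `hvd.Compression.fp16` and `hvd.Compression.none` attributes quoted earlier in this thread.

```python
def choose_compression(compression_cls, fp16_requested, cuda_available):
    """Pick fp16 compression only when it was requested and a GPU is in use.

    compression_cls is expected to expose `fp16` and `none` attributes,
    as hvd.Compression does. Hypothetical helper for illustration.
    """
    if fp16_requested and cuda_available:
        return compression_cls.fp16
    # On CPU, skip fp16: it can fail on older PyTorch builds and is
    # unlikely to help when compute, not the network, is the bottleneck.
    return compression_cls.none


# Intended use in a training script (assumes horovod.torch is imported
# as hvd and that the script defines the flag args.fp16_allreduce):
#
#   import torch
#   compression = choose_compression(
#       hvd.Compression, args.fp16_allreduce, torch.cuda.is_available()
#   )
```

Passing the compression class in as an argument keeps the helper independent of Horovod, so the selection logic can be checked without a distributed setup.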

anweshpanda · 2020-07-18T16:39:28Z

Thanks, @tgaddair.

anweshpanda added the question label Jul 15, 2020

anweshpanda closed this as completed Jul 18, 2020
