Error while trying to use gradient compression #2108

Closed
anweshpanda opened this issue Jul 15, 2020 · 4 comments

@anweshpanda

Hi,
I am using Horovod with PyTorch. With the provided MNIST example, if I use fp16 compression instead of none, I get the following error:

[1,0]:terminate called after throwing an instance of 'c10::Error'
[1,0]: what(): "div_cpu" not implemented for 'Half' (operator() at /opt/anaconda/conda-bld/pytorch-base_1588647739240/work/build/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp.AVX.cpp:95)
[1,0]:frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6d (0x7fa324a9bd2d in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libc10.so)
[1,0]:frame #1: + 0x20ba455 (0x7fa2e7882455 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #2: + 0x10e9633 (0x7fa2e68b1633 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #3: at::native::div_out(at::Tensor&, at::Tensor const&, at::Tensor const&) + 0x5f (0x7fa2e68a9d1f in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #4: + 0x152c260 (0x7fa2e6cf4260 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #5: at::Tensor::div_(at::Tensor const&) const + 0x110 (0x7fa2e68b4620 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #6: at::native::div_(at::Tensor&, c10::Scalar) + 0x46 (0x7fa2e68ab7c6 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #7: + 0x16bc55c (0x7fa2e6e8455c in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #8: + 0x32d0fd1 (0x7fa2e8a98fd1 in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so)
[1,0]:frame #9: + 0xbacfa (0x7fa283289cfa in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/horovod/torch/mpi_lib_v2.cpython-37m-x86_64-linux-gnu.so)
[1,0]:frame #10: + 0xaf07b (0x7fa28327e07b in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/horovod/torch/mpi_lib_v2.cpython-37m-x86_64-linux-gnu.so)
[1,0]:frame #11: + 0x5d35b (0x7fa28322c35b in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/horovod/torch/mpi_lib_v2.cpython-37m-x86_64-linux-gnu.so)
[1,0]:frame #12: + 0xc819d (0x7fa32418019d in /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6)
[1,0]:frame #13: + 0x84f9 (0x7fa32bad04f9 in /lib64/libpthread.so.0)
[1,0]:frame #14: clone + 0x3f (0x7fa32b808f2f in /lib64/libc.so.6)
[1,0]:
[1,0]:[login1:32123] *** Process received signal ***
[1,0]:[login1:32123] Signal: Aborted (6)
[1,0]:[login1:32123] Signal code: (-6)
[1,0]:[login1:32123] [ 0] /lib64/libpthread.so.0(+0x132d0)[0x7fa32badb2d0]
[1,0]:[login1:32123] [ 1] /lib64/libc.so.6(gsignal+0x110)[0x7fa32b746520]
[1,0]:[login1:32123] [ 2] /lib64/libc.so.6(abort+0x151)[0x7fa32b747b01]
[1,0]:[login1:32123] [ 3] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xbc)[0x7fa32416584a]
[1,0]:[login1:32123] [ 4] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6(+0xabf47)[0x7fa324163f47]
[1,0]:[login1:32123] [ 5] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6(+0xabf7d)[0x7fa324163f7d]
[1,0]:[login1:32123] [ 6] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/../../../../libstdc++.so.6(__cxa_rethrow+0x0)[0x7fa32416415a]
[1,0]:[login1:32123] [ 7] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so(+0x20ba4d0)[0x7fa2e78824d0]
[1,0]:[login1:32123] [ 8] /home/mas/20/cdsanwesk/miniconda3/envs/fresh_env/lib/python3.7/site-packages/torch/lib/libtorch.so(+0x10e9633)[0x7fa2e68b1633]

I am running my code on multiple CPUs. Any suggestions on how to get rid of this error would be a great help.

@tgaddair
Collaborator

Hey @anweshpanda, what version of PyTorch are you using? Does this show up when you use fp16 compression with one of our example scripts, or only using a custom training script?

@anweshpanda
Author

It is the MNIST example script, and the PyTorch version is 1.3.1.
I only changed the line
compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none
to
compression = hvd.Compression.fp16

@tgaddair
Collaborator

Looks like this is a known incompatibility in PyTorch when using half-precision on CPU: pytorch/pytorch#36318.
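
For anyone who wants to check whether their installed PyTorch is affected, here is a minimal probe. It is only a sketch: `cpu_half_div_supported` is a hypothetical helper name, and it assumes the C++ `c10::Error` from the trace above surfaces as a Python `RuntimeError` when the same in-place division is called directly from Python.

```python
from typing import Optional


def cpu_half_div_supported() -> Optional[bool]:
    """Probe whether in-place division works on a CPU float16 tensor.

    Returns True or False depending on the probe, or None when PyTorch
    is not installed in the current environment.
    """
    try:
        import torch
    except ImportError:
        return None

    try:
        # Same kind of op as in the trace: div_ on a Half tensor living on CPU.
        torch.ones(4, dtype=torch.float16).div_(2)
        return True
    except RuntimeError:
        # Affected builds are expected to raise here, e.g. with
        # '"div_cpu" not implemented for 'Half''.
        return False


if __name__ == "__main__":
    print("CPU half-precision div_ supported:", cpu_half_div_supported())
```

A result of `False` suggests the installed build would hit the same error with fp16 compression on CPU; `None` only means PyTorch could not be imported.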

However, it sounds like they recently added support for half-precision CPU ops, so you may want to try upgrading to the latest version of PyTorch if you want to use fp16 compression. In practice, though, fp16 usually won't give you much benefit without GPUs, since on CPU you're more likely to be bound by the forward/backward passes than by the network.
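
If upgrading is not an option, one possible workaround is to request fp16 compression only when a GPU is actually in use. The snippet below is a rough sketch, not part of the example script: `pick_compression_name` is a hypothetical helper, and the commented wiring assumes `horovod.torch` is imported as `hvd` and that `args.fp16_allreduce` is the flag from the MNIST example.

```python
def pick_compression_name(fp16_requested: bool, cuda_available: bool) -> str:
    """Return the name of the hvd.Compression attribute to use.

    fp16 is chosen only when it was requested AND a GPU is available;
    otherwise the run falls back to no compression.
    """
    if fp16_requested and cuda_available:
        return "fp16"
    return "none"


if __name__ == "__main__":
    # On a CPU-only run, the fp16 request is ignored.
    print(pick_compression_name(fp16_requested=True, cuda_available=False))

    # Possible wiring in the training script (assumption, not tested here):
    #   name = pick_compression_name(args.fp16_allreduce,
    #                                torch.cuda.is_available())
    #   compression = getattr(hvd.Compression, name)
```

Keeping the decision in a plain function makes it easy to check without Horovod installed, and the CPU path never touches half-precision tensors.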

@anweshpanda
Author

Thanks, @tgaddair.
