Segmentation fault: running MXNet with Horovod on Ubuntu Linux 16.04 #884

Open
apeforest opened this Issue Mar 5, 2019 · 1 comment

apeforest commented Mar 5, 2019

Environment:

  1. Framework: MXNet
  2. Framework version: 1.4.0
  3. Horovod version: 0.16.0
  4. MPI version: 3.1.1
  5. CUDA version: 9.2
  6. NCCL version: 2.2.13
  7. Python version: 3.5
  8. OS and version: Linux Ubuntu 16.04

Checklist:

  1. Did you search issues to find if somebody asked this question before?
    Yes
  2. If your question is about hang, did you read this doc?
  3. If your question is about docker, did you read this doc?

Bug report:
Running Horovod with MXNet on Ubuntu Linux 16.04 results in a segmentation fault.

@apeforest apeforest added the bug label Mar 5, 2019

apeforest commented Mar 5, 2019

The root cause of this issue is that the pip release of MXNet is built with GCC 4.8.4, while the Horovod pip release is built with GCC 5.x. Horovod calls into MXNet through a std::function, which has a different function signature under GCC 4.x than under GCC 5.x.

A similar issue was discovered in TensorFlow: tensorflow/tensorflow#13308 (comment)

We are working to resolve this incompatibility. In the meantime, please build MXNet from source following this guide to work around the segmentation fault. Sorry for the inconvenience!
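
For anyone trying to tell whether their own setup is affected, here is a minimal sketch of the comparison described above. It assumes you already know the GCC version string each binary was built with (for example from the package's build notes). The helper names `gcc_major` and `gcc_major_mismatch` are hypothetical and are not part of MXNet or Horovod. A mismatch is only a hint to investigate, not proof of an incompatibility.

```python
import re


def gcc_major(version):
    """Return the major number of a GCC version string such as '4.8.4' or '5.x'."""
    match = re.match(r"\s*(\d+)", version)
    if match is None:
        raise ValueError("unrecognized GCC version: {!r}".format(version))
    return int(match.group(1))


def gcc_major_mismatch(version_a, version_b):
    """True when the two builds used different GCC major versions."""
    return gcc_major(version_a) != gcc_major(version_b)


# Versions reported in this issue: MXNet pip release vs. Horovod pip release.
print(gcc_major_mismatch("4.8.4", "5.x"))
```

With the versions reported here the check flags a mismatch, which is consistent with the advice to build MXNet from source so that both libraries are built with the same compiler.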

@alsrgv alsrgv pinned this issue Mar 6, 2019
