
FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c #1040

Closed
ddkang opened this issue May 1, 2019 · 7 comments

ddkang commented May 1, 2019

Environment:

  1. Framework: PyTorch
  2. Framework version: 1.0.1.post2
  3. Horovod version: 0.16.1
  4. MPI version: 4.0.0
  5. CUDA version: Cuda compilation tools, release 9.0, V9.0.176
  6. NCCL version: N/A
  7. Python version: 3.5.2
  8. OS and version: Ubuntu 16.04

Checklist:

  1. Did you search issues to find if somebody asked this question before?

Yes, but there was no resolution: #368

Bug report:

I tried running the pytorch_imagenet_resnet50.py script and got the following error:

--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[25215,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)

This is something that should be reported to the developers.
--------------------------------------------------------------------------
[future5.stanford.edu:12508] [[25215,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355

I made sure to configure and install Open MPI as in the Dockerfile.
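
For reference, the launch was along these lines; the process count, slot mapping, and -x flags below are a generic Horovod-style mpirun invocation rather than the exact command I used:

# Illustrative multi-node launch; adjust -np, host slots, and -x flags to your setup.
mpirun -np 8 \
    -H future4:4,future5:4 \
    -bind-to none -map-by slot \
    -x LD_LIBRARY_PATH -x PATH \
    python pytorch_imagenet_resnet50.py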

ddkang added the bug label May 1, 2019
abditag2 (Collaborator) commented May 2, 2019

It does look like there is an issue with your MPI configuration. Take a look at this thread and let me know if it helps: open-mpi/ompi#4437
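
If it helps, here are two quick checks (suggested as generic diagnostics, not commands taken from that thread) to see whether your Open MPI build picked up the system hwloc:

# List the hwloc component that Open MPI reports
ompi_info | grep -i hwloc
# See whether mpirun is dynamically linked against the distro libhwloc
ldd "$(which mpirun)" | grep -i hwloc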

ddkang (Author) commented May 2, 2019

I am not using hwloc explicitly, and I didn't see any other solutions in that thread. Is there anything else I should try?

alsrgv (Member) commented May 7, 2019

@ddkang, can you share the output of dpkg -l | grep hwloc?

ddkang (Author) commented May 7, 2019

@alsrgv

future4:

ii  hwloc-nox                              1.11.2-3                                   amd64        Hierarchical view of the machine - non-X version of utilities
ii  libhwloc-dev:amd64                     1.11.2-3                                   amd64        Hierarchical view of the machine - static libs and headers
ii  libhwloc-plugins                       1.11.2-3                                   amd64        Hierarchical view of the machine - plugins
ii  libhwloc5:amd64                        1.11.2-3                                   amd64        Hierarchical view of the machine - shared libs

future5:

ii  hwloc-nox                             1.11.2-3                                   amd64        Hierarchical view of the machine - non-X version of utilities
ii  libhwloc-dev:amd64                    1.11.2-3                                   amd64        Hierarchical view of the machine - static libs and headers
ii  libhwloc-plugins                      1.11.2-3                                   amd64        Hierarchical view of the machine - plugins
ii  libhwloc5:amd64                       1.11.2-3                                   amd64        Hierarchical view of the machine - shared libs

Please let me know if you need anything else!

alsrgv (Member) commented May 16, 2019

@ddkang, sorry for the delayed response. It looks like this is the same version that was flagged as buggy in open-mpi/ompi#4437. Could you try uninstalling hwloc (apt purge hwloc-nox libhwloc-dev libhwloc-plugins libhwloc5) and reinstalling Open MPI?
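
Roughly, the steps would look like the following; the download URL and configure flag are the standard Open MPI 4.0.0 build procedure and may differ slightly from the Dockerfile:

# Remove the distro hwloc packages that Open MPI may have linked against
sudo apt purge hwloc-nox libhwloc-dev libhwloc-plugins libhwloc5

# Rebuild and reinstall Open MPI so it falls back to its bundled hwloc
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.0.tar.gz
tar zxf openmpi-4.0.0.tar.gz
cd openmpi-4.0.0
./configure --enable-orterun-prefix-by-default
make -j"$(nproc)" all
sudo make install
sudo ldconfig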

ddkang (Author) commented May 17, 2019

That worked, thank you! It would be nice to have this in the documentation.

alsrgv (Member) commented May 17, 2019

Good idea, added to #1084. Will close this issue.
