Segmentation fault in HCOLL/UCX #7391
Comments
@benmenadue For now, we'll continue working on it through Mellanox support as an HCOLL issue.
@benmenadue Is it possible to get a reproducer? How many nodes are used in the run?
You should be able to reproduce this by running https://github.com/BenWibking/quokka/blob/development/scripts/shell-64nodes.pbs (without the flag to disable hcoll multicast, of course). This was a 64-node run with 4x GPUs per node.
Is it possible to run the app on CPU? Is it only reproduced when running on GPU?
It can be run on CPU as well, but I haven't tried running at that scale on CPU. The crashes also appear to be somewhat nondeterministic.
Describe the bug
Some applications are failing with a segfault in the hcoll callback mcast_ucx_recv_completion_cb. Reported to Mellanox Support (since that's part of hcoll; case 00956842), and they suggested opening this here as well. Traceback is:

Steps to Reproduce
For the above traceback:
mpirun -np 256 --map-by numa:SPAN --bind-to numa --mca pml ucx ...
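As a workaround sketch while the bug is investigated, one would typically either disable HCOLL multicast or disable HCOLL entirely on the same command line. This assumes the standard Open MPI `coll_hcoll_enable` MCA parameter and the HCOLL `HCOLL_ENABLE_MCAST_ALL` environment variable; the application binary name (`./app`) is a placeholder.

```shell
# Option 1: keep HCOLL but turn off its UCX multicast path
# (HCOLL_ENABLE_MCAST_ALL is read by the hcoll library itself)
mpirun -np 256 --map-by numa:SPAN --bind-to numa \
    --mca pml ucx -x HCOLL_ENABLE_MCAST_ALL=0 ./app

# Option 2: disable the HCOLL collective component entirely,
# falling back to Open MPI's built-in collectives
mpirun -np 256 --map-by numa:SPAN --bind-to numa \
    --mca pml ucx --mca coll_hcoll_enable 0 ./app
```

Option 1 is the narrower change and is the flag the reproducer script above toggles; Option 2 is useful to confirm whether the crash is isolated to HCOLL.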
Setup and versions
CentOS Linux release 8.4.2105
Kernel 4.18.0-305.7.1.el8.nci.x86_64
Mellanox OFED 5.1-2.5.8.0
CUDA 11.2 / 460.73.01
OpenMPI 4.1.1
GPUs: 4 x V100-SMX2 32GB
IB: 1 x HDR100 (ibv_devinfo -vv output attached: ibv_devinfo.txt)