
Segmentation fault in HCOLL/UCX #7391

Open
benmenadue opened this issue Sep 13, 2021 · 5 comments

@benmenadue

Describe the bug

Some applications are failing with a segmentation fault in the hcoll callback mcast_ucx_recv_completion_cb. This was reported to Mellanox Support (case 00956842, since that callback is part of hcoll), and they suggested opening the issue here as well. The fault address (0x39) is just past NULL, which suggests a field dereference through a NULL struct pointer. The traceback is:

[gadi-gpu-v100-0036:2629319:0:2629319] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x39)
==== backtrace (tid:2629319) ====
 0 0x0000000000012b20 .annobin_sigaction.c()  sigaction.c:0
 1 0x000000000001de09 mcast_ucx_recv_completion_cb()  ???:0
 2 0x000000000005ec69 ucp_eager_only_handler()  ???:0
 3 0x000000000004de0c uct_dc_mlx5_iface_progress_ll()  :0
 4 0x0000000000038cda ucp_worker_progress()  ???:0
 5 0x0000000000015c76 hmca_bcol_ucx_p2p_progress_fast()  bcol_ucx_p2p_component.c:0
 6 0x0000000000063223 hcoll_ml_progress_impl()  ???:0
 7 0x00000000001f2ec3 opal_progress()  /jobfs/26573579.gadi-pbs/0/openmpi/4.1.1/gcc-opt/opal/../../source/openmpi-4.1.1/opal/runtime/opal_progress.c:231
 8 0x0000000000008c10 wait_callback()  vmc.c:0
 9 0x000000000001f0d4 mcast_p2p_recv()  bcol_ucx_p2p_module.c:0
10 0x000000000000c44d do_bcast()  vmc.c:0
11 0x000000000000d6c1 vmc_bcast_multiroot()  ???:0
12 0x00000000000030c0 hmca_mcast_vmc_bcast_multiroot()  mcast_vmc.c:0
13 0x0000000000013710 hmca_bcol_ucx_p2p_bcast_mcast_multiroot()  ???:0
14 0x00000000000156cf hmca_bcol_ucx_p2p_barrier_selector_init()  bcol_ucx_p2p_barrier.c:0
15 0x0000000000049a05 hmca_coll_ml_barrier_intra()  ???:0
16 0x00000000001b4f1a mca_coll_hcoll_barrier()  /jobfs/26573579.gadi-pbs/0/openmpi/4.1.1/gcc-opt/ompi/../../../../../source/openmpi-4.1.1/ompi/mca/coll/hcoll/coll_hcoll_ops.c:29
17 0x00000000002254b8 PMPI_Barrier()  /jobfs/26573579.gadi-pbs/0/openmpi/4.1.1/gcc-opt/ompi/pbarrier.c:74
18 0x00000000002254b8 PMPI_Barrier()  /jobfs/26573579.gadi-pbs/0/openmpi/4.1.1/gcc-opt/ompi/pbarrier.c:40
19 0x00000000004b5710 amrex::ParallelDescriptor::Barrier()  ???:0
20 0x000000000046071e AMRSimulation<ShellProblem>::evolve()  ???:0
21 0x0000000000423c98 problem_main()  ???:0
22 0x000000000041b110 main()  ???:0
23 0x0000000000023493 __libc_start_main()  ???:0
24 0x000000000041f66e _start()  ???:0
=================================

Steps to Reproduce

For the above traceback:

  • Command line:
    mpirun -np 256 --map-by numa:SPAN --bind-to numa --mca pml ucx ...
  • UCX version used:
# UCT version=1.11.0 revision fa84605
# configured with: --prefix=/apps/ucx/1.11.0 --disable-dependency-tracking --enable-shared --disable-static --enable-ucg --enable-compiler-opt=3 --enable-optimizations --disable-assertions --enable-cma --disable-params-check --enable-mt --enable-experimental-api --without-fuse3 --without-java --with-cuda --without-rocm --without-gdrcopy --with-verbs --with-rc --with-ud --with-dc --with-mlx5-dv --with-ib-hw-tm --with-dm --with-rdmacm --with-knem --without-xpmem --without-ugni
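
Since the crash surfaces under MPI_Barrier via hcoll's multicast path, a minimal barrier stress loop along the lines below might serve as a starting point for a standalone reproducer. This is a hypothetical sketch, not the failing application; it assumes the same mpirun options as above and a job large enough for hcoll to enable multicast.

    /* barrier_stress.c -- hypothetical minimal reproducer sketch,
     * not the original application.
     * Build: mpicc barrier_stress.c -o barrier_stress */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Hammer MPI_Barrier; the reported segfault occurs in the hcoll
         * multicast completion callback invoked from this code path. */
        for (int i = 0; i < 100000; ++i)
            MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0)
            printf("completed all barriers\n");

        MPI_Finalize();
        return 0;
    }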

Setup and versions

  • Software environment
    CentOS Linux release 8.4.2105
    Kernel 4.18.0-305.7.1.el8.nci.x86_64
    Mellanox OFED 5.1-2.5.8.0
    CUDA 11.2 / 460.73.01
    OpenMPI 4.1.1
  • Hardware environment:
    GPUs: 4 x V100-SMX2 32GB
    IB: 1 x HDR100 (ibv_devinfo -vv attached)

ibv_devinfo.txt

benmenadue added the Bug label Sep 13, 2021
@yosefe (Contributor) commented Sep 14, 2021

@benmenadue For now, we'll continue working on it through Mellanox support as an HCOLL issue

@vspetrov

@benmenadue Is it possible to get a reproducer? How many nodes are used in the run?

@BenWibking

You should be able to reproduce this by running https://github.com/BenWibking/quokka/blob/development/scripts/shell-64nodes.pbs (without the flag to disable hcoll multicast, of course). This was a 64-node run with 4 GPUs per node.
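
For reference, the flag in question is presumably the standard HCOLL multicast knob (an assumption on my part, not something read from the linked script):

    # disables hcoll multicast; omit this to reproduce the crash
    mpirun -x HCOLL_ENABLE_MCAST_ALL=0 ...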

@vspetrov

Is it possible to run the app on CPU? Is it only reproduced when running on GPU?

@BenWibking

It can be run on CPU as well, but I haven't tried running at that scale on CPU. The crashes also appear to be somewhat nondeterministic.
