
Segmentation fault in HCOLL/UCX #7391

Open
benmenadue opened this issue Sep 13, 2021 · 5 comments

@benmenadue

Describe the bug

Some applications are failing with a segmentation fault in the hcoll callback mcast_ucx_recv_completion_cb. This was reported to Mellanox Support (case 00956842, since that callback is part of hcoll), and they suggested opening the issue here as well. The fault address (0x39) is just past NULL, which suggests a field dereference through a NULL struct pointer. The traceback is:

[gadi-gpu-v100-0036:2629319:0:2629319] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x39)
==== backtrace (tid:2629319) ====
 0 0x0000000000012b20 .annobin_sigaction.c()  sigaction.c:0
 1 0x000000000001de09 mcast_ucx_recv_completion_cb()  ???:0
 2 0x000000000005ec69 ucp_eager_only_handler()  ???:0
 3 0x000000000004de0c uct_dc_mlx5_iface_progress_ll()  :0
 4 0x0000000000038cda ucp_worker_progress()  ???:0
 5 0x0000000000015c76 hmca_bcol_ucx_p2p_progress_fast()  bcol_ucx_p2p_component.c:0
 6 0x0000000000063223 hcoll_ml_progress_impl()  ???:0
 7 0x00000000001f2ec3 opal_progress()  /jobfs/26573579.gadi-pbs/0/openmpi/4.1.1/gcc-opt/opal/../../source/openmpi-4.1.1/opal/runtime/opal_progress.c:231
 8 0x0000000000008c10 wait_callback()  vmc.c:0
 9 0x000000000001f0d4 mcast_p2p_recv()  bcol_ucx_p2p_module.c:0
10 0x000000000000c44d do_bcast()  vmc.c:0
11 0x000000000000d6c1 vmc_bcast_multiroot()  ???:0
12 0x00000000000030c0 hmca_mcast_vmc_bcast_multiroot()  mcast_vmc.c:0
13 0x0000000000013710 hmca_bcol_ucx_p2p_bcast_mcast_multiroot()  ???:0
14 0x00000000000156cf hmca_bcol_ucx_p2p_barrier_selector_init()  bcol_ucx_p2p_barrier.c:0
15 0x0000000000049a05 hmca_coll_ml_barrier_intra()  ???:0
16 0x00000000001b4f1a mca_coll_hcoll_barrier()  /jobfs/26573579.gadi-pbs/0/openmpi/4.1.1/gcc-opt/ompi/../../../../../source/openmpi-4.1.1/ompi/mca/coll/hcoll/coll_hcoll_ops.c:29
17 0x00000000002254b8 PMPI_Barrier()  /jobfs/26573579.gadi-pbs/0/openmpi/4.1.1/gcc-opt/ompi/pbarrier.c:74
18 0x00000000002254b8 PMPI_Barrier()  /jobfs/26573579.gadi-pbs/0/openmpi/4.1.1/gcc-opt/ompi/pbarrier.c:40
19 0x00000000004b5710 amrex::ParallelDescriptor::Barrier()  ???:0
20 0x000000000046071e AMRSimulation<ShellProblem>::evolve()  ???:0
21 0x0000000000423c98 problem_main()  ???:0
22 0x000000000041b110 main()  ???:0
23 0x0000000000023493 __libc_start_main()  ???:0
24 0x000000000041f66e _start()  ???:0
=================================

Steps to Reproduce

For the above traceback:

  • Command line:
    mpirun -np 256 --map-by numa:SPAN --bind-to numa --mca pml ucx ...
  • UCX version used:
# UCT version=1.11.0 revision fa84605
# configured with: --prefix=/apps/ucx/1.11.0 --disable-dependency-tracking --enable-shared --disable-static --enable-ucg --enable-compiler-opt=3 --enable-optimizations --disable-assertions --enable-cma --disable-params-check --enable-mt --enable-experimental-api --without-fuse3 --without-java --with-cuda --without-rocm --without-gdrcopy --with-verbs --with-rc --with-ud --with-dc --with-mlx5-dv --with-ib-hw-tm --with-dm --with-rdmacm --with-knem --without-xpmem --without-ugni
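
Since the crash surfaces under MPI_Barrier via hcoll's multicast path, a minimal barrier stress loop along the lines below might serve as a starting point for a standalone reproducer. This is a hypothetical sketch, not the failing application; it assumes the same mpirun options as above and a job large enough for hcoll to enable multicast.

    /* barrier_stress.c -- hypothetical minimal reproducer sketch,
     * not the original application.
     * Build: mpicc barrier_stress.c -o barrier_stress */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Hammer MPI_Barrier; the reported segfault occurs in the hcoll
         * multicast completion callback invoked from this code path. */
        for (int i = 0; i < 100000; ++i)
            MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0)
            printf("completed all barriers\n");

        MPI_Finalize();
        return 0;
    }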

Setup and versions

  • Software environment
    CentOS Linux release 8.4.2105
    Kernel 4.18.0-305.7.1.el8.nci.x86_64
    Mellanox OFED 5.1-2.5.8.0
    CUDA 11.2 / 460.73.01
    OpenMPI 4.1.1
  • Hardware environment:
    GPUs: 4 x V100-SMX2 32GB
    IB: 1 x HDR100 (ibv_devinfo -vv attached)

ibv_devinfo.txt

benmenadue added the Bug label Sep 13, 2021
@yosefe (Contributor) commented Sep 14, 2021

@benmenadue For now, we'll continue working on it through Mellanox support as an HCOLL issue

@vspetrov

@benmenadue Is it possible to get a reproducer? How many nodes are used in the run?

@BenWibking

You should be able to reproduce this by running https://github.com/BenWibking/quokka/blob/development/scripts/shell-64nodes.pbs (without the flag to disable hcoll multicast, of course). This was a 64-node run with 4 GPUs per node.
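
For reference, the flag in question is presumably the standard HCOLL multicast knob (an assumption on my part, not something read from the linked script):

    # disables hcoll multicast; omit this to reproduce the crash
    mpirun -x HCOLL_ENABLE_MCAST_ALL=0 ...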

@vspetrov

Is it possible to run the app on CPU? Is it only reproduced when running on GPU?

@BenWibking

It can be run on CPU as well, but I haven't tried running at that scale on CPU. The crashes also appear to be somewhat nondeterministic.
