configury: UCX should use CPPFLAGS (instead of CFLAGS) #1021
(back-ported from commit open-mpi/ompi@a93b849)
|
|
Test FAILed. |
|
@alex-mikheev - could you please check this failure: http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1431/console
07:25:27 + taskset -c 10,11 timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/bin/oshrun -np 8 --bind-to core -x SHMEM_SYMMETRIC_HEAP_SIZE=1024M --mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -x UCX_DEVICES=mlx4_0:1 --mca rmaps_base_dist_hca mlx4_0:1 --mca sshmem_verbs_hca_name mlx4_0:1 --mca spml ucx -mca pml ucx /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/examples/hello_oshmem
07:25:28 [jenkins01:08758] Error spml_ucx.c:336 - mca_spml_ucx_register() failed to unpack rkey
07:25:28 [jenkins01:08758] Error base/memheap_base_register.c:131 - _reg_segment() Failed to register segment
07:25:28 [jenkins01:08758] Error: pshmem_init.c:71 - _shmem_init() SHMEM failed to initialize - aborting |
|
:bot:assign: @miked-mellanox |
|
@miked-mellanox Currently only the rc UCX transport can be used for shmem. Setting UCX_TLS=rc,cm will fix it. |
|
thanks, fixed jenkins w/ this flag |
|
bot:retest |
|
|
Test FAILed. |
|
bot:retest |
|
EDIT: I updated the issue title (it was backwards: using CPPFLAGS in this case is good). |
|
it was not an issue title, but a commit title ... |
|
|
@miked-mellanox Can we trim these Coverity notices on PRs? Since Open MPI is not Coverity clean, these numbers unfortunately just end up being noise. |
|
Test FAILed. |
|
yep, done. |
|
bot:retest |
|
|
Test FAILed. |
|
@miked-mellanox Can you check what the UCX failures are on your jenkins? |
|
yep, fixing it now |
|
bot:retest |
|
|
Test FAILed. |
|
bot:retest
I'm doing one more retest. Other PRs are doing okay. @ggouaillardet What's the goal here? Do you actually have an MLNX system in house where you are testing this, or is this just a theoretical problem? |
|
Test FAILed. |
|
Now it is Valgrind issues: 2 in pmix, 1 in ucx. @alex-mikheev - please take a look at the UCX Valgrind report.
04:47:29 + /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/oshrun -mca coll '^hcoll' -np 1 -mca spml ucx -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -x LD_PRELOAD=/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libucp.so:/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libucm.so:/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libucs.so:/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libuct.so valgrind --suppressions=/var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/share/openmpi/openmpi-valgrind.supp --suppressions=/scrap/jenkins/jenkins/jobs/gh-ompi-release-pr/workspace/jenkins_scripts/jenkins/ompi/vg.supp --error-exitcode=3 --track-origins=yes -q /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/examples/oshmem_shmalloc
04:47:34 ==1635== Warning: client syscall shmdt tried to modify addresses 0xffffffffffffffff-0xffffffffffffffff
04:47:36 ==1635== Thread 2:
04:47:36 ==1635== Syscall param write(buf) points to uninitialised byte(s)
04:47:36 ==1635== at 0x3D6980E6FD: ??? (in /lib64/libpthread-2.12.so)
04:47:36 ==1635== by 0x7B811C2: send_bytes (usock_sendrecv.c:52)
04:47:36 ==1635== by 0x7B81837: opal_pmix_pmix112_pmix_usock_send_handler (usock_sendrecv.c:195)
04:47:36 ==1635== by 0x64861F4: opal_libevent2022_event_base_loop (event.c:1321)
04:47:36 ==1635== by 0x7B7D1E4: progress_engine (progress_threads.c:49)
04:47:36 ==1635== by 0x3D698079D0: start_thread (in /lib64/libpthread-2.12.so)
04:47:36 ==1635== by 0x87C86FF: ???
04:47:36 ==1635== Address 0xb864834 is 356 bytes inside a block of size 1,024 alloc'd
04:47:36 ==1635== at 0x4A06C9C: realloc (vg_replace_malloc.c:687)
04:47:36 ==1635== by 0x6471B2F: opal_realloc (malloc.c:165)
04:47:36 ==1635== by 0x7B6B3D5: opal_pmix_pmix112_pmix_bfrop_buffer_extend (internal_functions.c:65)
04:47:36 ==1635== by 0x7B6E509: opal_pmix_pmix112_pmix_bfrop_pack_byte (pack.c:190)
04:47:36 ==1635== by 0x7B6F587: opal_pmix_pmix112_pmix_bfrop_pack_buf (pack.c:626)
04:47:36 ==1635== by 0x7B6E2C0: opal_pmix_pmix112_pmix_bfrop_pack_buffer (pack.c:86)
04:47:36 ==1635== by 0x7B6E1F0: opal_pmix_pmix112_pmix_bfrop_pack (pack.c:60)
04:47:36 ==1635== by 0x7B86827: _commitfn (pmix_client.c:664)
04:47:36 ==1635== by 0x6485F2B: opal_libevent2022_event_base_loop (event.c:1370)
04:47:36 ==1635== by 0x7B7D1E4: progress_engine (progress_threads.c:49)
04:47:36 ==1635== by 0x3D698079D0: start_thread (in /lib64/libpthread-2.12.so)
04:47:36 ==1635== by 0x87C86FF: ???
04:47:36 ==1635==
04:47:39 ==1635== Thread 1:
04:47:39 ==1635== Uninitialised byte(s) found during client check request
04:47:39 ==1635== at 0x64CC811: valgrind_module_isdefined (memchecker_valgrind_module.c:111)
04:47:39 ==1635== by 0x64CC1DF: opal_memchecker_base_isdefined (memchecker_base_wrappers.c:33)
04:47:39 ==1635== by 0x59A71E4: memchecker_call (memchecker.h:110)
04:47:39 ==1635== by 0x59A741B: PMPI_Allgatherv (pallgatherv.c:52)
04:47:39 ==1635== by 0x55C4040: oshmem_shmem_allgatherv (oshmem_shmem_exchange.c:40)
04:47:39 ==1635== by 0x12CCF1FC: oshmem_shmem_xchng (spml_ucx.c:143)
04:47:39 ==1635== by 0x12CCF38B: mca_spml_ucx_add_procs (spml_ucx.c:200)
04:47:39 ==1635== by 0x55C3693: _shmem_init (oshmem_shmem_init.c:321)
04:47:39 ==1635== by 0x55C3244: oshmem_shmem_init (oshmem_shmem_init.c:159)
04:47:39 ==1635== by 0x55C6428: _shmem_init (pshmem_init.c:68)
04:47:39 ==1635== by 0x55C63CE: start_pes (pshmem_init.c:47)
04:47:39 ==1635== by 0x4006C5: main (oshmem_shmalloc.c:24)
04:47:39 ==1635== Address 0x132cf064 is 52 bytes inside a block of size 122 alloc'd
04:47:39 ==1635== at 0x4A06C9C: realloc (vg_replace_malloc.c:687)
04:47:39 ==1635== by 0x4C18758: ucp_worker_get_address (ucp_worker.c:493)
04:47:39 ==1635== by 0x12CCF33A: mca_spml_ucx_add_procs (spml_ucx.c:194)
04:47:39 ==1635== by 0x55C3693: _shmem_init (oshmem_shmem_init.c:321)
04:47:39 ==1635== by 0x55C3244: oshmem_shmem_init (oshmem_shmem_init.c:159)
04:47:39 ==1635== by 0x55C6428: _shmem_init (pshmem_init.c:68)
04:47:39 ==1635== by 0x55C63CE: start_pes (pshmem_init.c:47)
04:47:39 ==1635== by 0x4006C5: main (oshmem_shmalloc.c:24)
04:47:39 ==1635== |
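For triaging reports like the one above, here is a minimal Python sketch that splits a Memcheck log into individual errors and guesses the owning component from the allocation-site frames. The helper names `parse_errors` and `suspect` are hypothetical, the prefix handling assumes the "timestamp ==PID==" layout seen in this Jenkins log, and the component guess is only a heuristic based on function-name prefixes.

```python
import re

# Prefix used in the log above: optional "HH:MM:SS " timestamp, then "==PID==".
PREFIX = re.compile(r"^(?:\d\d:\d\d:\d\d\s+)?==\d+==\s*(.*)$")
# A stack frame: "at 0x...: func (file.c:NN)" or "by 0x...: func (...)".
FRAME = re.compile(r"^(?:at|by) 0x[0-9A-Fa-f]+: (\S+)")


def parse_errors(log):
    """Split a Memcheck log into dicts with the headline, the stack and the allocation stack."""
    errors, current, section = [], None, "stack"
    for raw in log.splitlines():
        m = PREFIX.match(raw.strip())
        if not m:
            continue  # not a Valgrind line
        body = m.group(1).strip()
        frame = FRAME.match(body)
        if not body:
            # An empty "==PID==" line closes the current error.
            if current:
                errors.append(current)
            current = None
        elif body.startswith(("Thread ", "Warning:")):
            continue  # thread banners and warnings are not errors
        elif frame:
            if current:
                current[section].append(frame.group(1))
        elif body.startswith("Address "):
            section = "alloc"  # frames that follow describe the allocation site
        else:
            if current:
                errors.append(current)
            current = {"what": body, "stack": [], "alloc": []}
            section = "stack"
    if current:
        errors.append(current)
    return errors


def suspect(error):
    """Heuristic (an assumption, not a rule): name the component that allocated the block."""
    for func in error["alloc"]:
        if "pmix" in func:
            return "pmix"
        if func.startswith(("ucp_", "uct_", "ucs_", "ucm_")):
            return "ucx"
    return "unknown"


# Excerpt of the report above, shortened to a few frames per stack.
SAMPLE = """\
04:47:36 ==1635== Thread 2:
04:47:36 ==1635== Syscall param write(buf) points to uninitialised byte(s)
04:47:36 ==1635== at 0x3D6980E6FD: ??? (in /lib64/libpthread-2.12.so)
04:47:36 ==1635== by 0x7B811C2: send_bytes (usock_sendrecv.c:52)
04:47:36 ==1635== Address 0xb864834 is 356 bytes inside a block of size 1,024 alloc'd
04:47:36 ==1635== at 0x4A06C9C: realloc (vg_replace_malloc.c:687)
04:47:36 ==1635== by 0x7B6B3D5: opal_pmix_pmix112_pmix_bfrop_buffer_extend (internal_functions.c:65)
04:47:36 ==1635==
04:47:39 ==1635== Thread 1:
04:47:39 ==1635== Uninitialised byte(s) found during client check request
04:47:39 ==1635== at 0x64CC811: valgrind_module_isdefined (memchecker_valgrind_module.c:111)
04:47:39 ==1635== Address 0x132cf064 is 52 bytes inside a block of size 122 alloc'd
04:47:39 ==1635== at 0x4A06C9C: realloc (vg_replace_malloc.c:687)
04:47:39 ==1635== by 0x4C18758: ucp_worker_get_address (ucp_worker.c:493)
"""

if __name__ == "__main__":
    for err in parse_errors(SAMPLE):
        print(suspect(err), "<-", err["what"])
```

The allocation site is only a starting point: later in this thread, the bytes flagged inside the PMIx buffer are traced back to UCX.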
|
@hppritcha I first saw an MLNX Jenkins failure (a build failure on master). The root cause is that their Jenkins has UCX in a non-standard place, and ompi is configure'd with --with-ucx=... FWIW, @hjelmn's comment is at open-mpi/ompi@a93b849#commitcomment-16687208 |
|
bot:retest |
|
Test FAILed. |
|
The LANL Cray XC error can be ignored. |
|
Test PASSed. |
|
okay let's try lanl bot again. |
|
Test FAILed. |
|
Test PASSed. |
|
:bot:retest |
|
Test PASSed. |
|
@miked-mellanox I've investigated the PMIx portion of the Valgrind error and I think that there is no error there.
See http://valgrind.org/docs/manual/mc-manual.html#mc-manual.bad-syscall-args for the typical Valgrind output in the case of an uninitialized syscall argument.
Why this address is considered uninitialized becomes clear from the launch command:
-x LD_PRELOAD=/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libucp.so:/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libucm.so:/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libucs.so:/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libuct.so
Since those libraries were preloaded, they probably call malloc directly, bypassing Valgrind.
As a proof of concept: if I attach to the Valgrind debug server and overwrite 108 bytes of the message starting from the 316th byte, I get a message stating that the 712th byte is now wrong (which I think also comes from UCX but was inserted into the remote_cache). |
|
@artpol84 Do you want to sprinkle [...]
Regardless, these Valgrind errors have nothing to do with the CPPFLAGS changes in this PR. 👍 |
|
@hppritcha Good to go. |
|
@jsquyres The problem was slightly different. Subsequent digging showed that only 82 out of the 108 bytes have to be overwritten to cheat Valgrind, and inspecting Valgrind's view of the address region pointed to UCX. So it was a problem in UCX, and @yosefe fixed it: openucx/ucx#690.
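To illustrate why overwriting one range made the report move to a later byte, here is a toy Python model of per-byte definedness tracking. It is a sketch under stated assumptions, not Memcheck's implementation: the offsets 316 and 712 and the sizes 82 and 108 come from this thread, while the buffer length, the position of the 82 bytes inside the 108-byte window, the size of the second hole, and the helper names are all hypothetical.

```python
def first_undefined(defined):
    """Return the offset of the first undefined byte, or None if every byte is defined."""
    for offset, ok in enumerate(defined):
        if not ok:
            return offset
    return None


def make_defined(defined, start, length):
    """Mimic overwriting bytes from a debugger: the written range becomes defined."""
    for offset in range(start, start + length):
        defined[offset] = True


def build_shadow():
    """Hypothetical 1,024-byte message with two undefined holes (layout is an assumption)."""
    shadow = [True] * 1024
    for start, length in ((316, 82), (712, 8)):
        for offset in range(start, start + length):
            shadow[offset] = False
    return shadow


if __name__ == "__main__":
    shadow = build_shadow()
    print(first_undefined(shadow))   # first report points into the first hole
    make_defined(shadow, 316, 108)   # overwrite 108 bytes, as in the experiment
    print(first_undefined(shadow))   # the report moves to the next hole
    make_defined(shadow, 712, 8)
    print(first_undefined(shadow))   # nothing left to report
```

The point of the model is that a checker which reports only the first offending byte will surface a second, unrelated hole once the first one is patched, which matches the behavior described above.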