This repository was archived by the owner on Sep 30, 2022. It is now read-only.

Conversation

@ggouaillardet
Contributor

(back-ported from commit open-mpi/ompi@a93b849)

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1431/ for details.

@mike-dubman
Member

@alex-mikheev - could you please check this failure:

http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1431/console

07:25:27 + taskset -c 10,11 timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/bin/oshrun -np 8 --bind-to core -x SHMEM_SYMMETRIC_HEAP_SIZE=1024M --mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -x UCX_DEVICES=mlx4_0:1 --mca rmaps_base_dist_hca mlx4_0:1 --mca sshmem_verbs_hca_name mlx4_0:1 --mca spml ucx -mca pml ucx /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/examples/hello_oshmem
07:25:28 [jenkins01:08758] Error spml_ucx.c:336 - mca_spml_ucx_register() failed to unpack rkey
07:25:28 [jenkins01:08758] Error base/memheap_base_register.c:131 - _reg_segment() Failed to register segment
07:25:28 [jenkins01:08758] Error: pshmem_init.c:71 - _shmem_init() SHMEM failed to initialize - aborting

@ggouaillardet
Contributor Author

:bot:assign: @miked-mellanox
:bot:milestone:v2.0.0
:bot:label:bug

@alex-mikheev
Contributor

@miked-mellanox Currently only the rc UCX transport can be used for SHMEM. Using UCX_TLS=rc,cm will fix it.

@mike-dubman
Member

thanks, fixed jenkins w/ this flag

@mike-dubman
Member

bot:retest

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1433/ for details.

@mike-dubman
Member

bot:retest

@jsquyres changed the title from "configury: UCX uses CPPFLAGS (instead of CFLAGS)" to "configury: UCX should use CPPFLAGS (instead of CFLAGS)" on Mar 14, 2016
@jsquyres
Member

EDIT: I updated the issue title (it was backwards: using CPPFLAGS in this case is good).

@ggouaillardet
Contributor Author

It was not an issue title but a commit title...
I'll try to do better next time.

@mellanox-github

@jsquyres
Member

@miked-mellanox Can we trim these Coverity notices on PRs? Since Open MPI is not Coverity clean, these numbers unfortunately just end up being noise.

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1434/ for details.

@mike-dubman
Member

yep, done.

@jsquyres
Member

bot:retest

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1435/ for details.

@jsquyres
Member

@miked-mellanox Can you check what the UCX failures are on your jenkins?

@mike-dubman
Member

yep, fixing it now

@mike-dubman
Member

bot:retest

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1436/ for details.

@hppritcha
Member

bot:retest

I'm doing one more retest. Other PRs are doing okay. @ggouaillardet what's the goal here? Do you actually have an MLNX system in house where you are testing this, or is it just a theoretical problem?

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1437/ for details.

@mike-dubman
Member

Now it is Valgrind issues: 2 in PMIx, 1 in UCX.

@alex-mikheev - please take a look at the UCX Valgrind issue
@artpol84 - please take a look at the PMIx Valgrind issue

04:47:29 + /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/oshrun -mca coll '^hcoll' -np 1 -mca spml ucx -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -x LD_PRELOAD=/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libucp.so:/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libucm.so:/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libucs.so:/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libuct.so valgrind --suppressions=/var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/share/openmpi/openmpi-valgrind.supp --suppressions=/scrap/jenkins/jenkins/jobs/gh-ompi-release-pr/workspace/jenkins_scripts/jenkins/ompi/vg.supp --error-exitcode=3 --track-origins=yes -q /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/examples/oshmem_shmalloc
04:47:34 ==1635== Warning: client syscall shmdt tried to modify addresses 0xffffffffffffffff-0xffffffffffffffff
04:47:36 ==1635== Thread 2:
04:47:36 ==1635== Syscall param write(buf) points to uninitialised byte(s)
04:47:36 ==1635==    at 0x3D6980E6FD: ??? (in /lib64/libpthread-2.12.so)
04:47:36 ==1635==    by 0x7B811C2: send_bytes (usock_sendrecv.c:52)
04:47:36 ==1635==    by 0x7B81837: opal_pmix_pmix112_pmix_usock_send_handler (usock_sendrecv.c:195)
04:47:36 ==1635==    by 0x64861F4: opal_libevent2022_event_base_loop (event.c:1321)
04:47:36 ==1635==    by 0x7B7D1E4: progress_engine (progress_threads.c:49)
04:47:36 ==1635==    by 0x3D698079D0: start_thread (in /lib64/libpthread-2.12.so)
04:47:36 ==1635==    by 0x87C86FF: ???
04:47:36 ==1635==  Address 0xb864834 is 356 bytes inside a block of size 1,024 alloc'd
04:47:36 ==1635==    at 0x4A06C9C: realloc (vg_replace_malloc.c:687)
04:47:36 ==1635==    by 0x6471B2F: opal_realloc (malloc.c:165)
04:47:36 ==1635==    by 0x7B6B3D5: opal_pmix_pmix112_pmix_bfrop_buffer_extend (internal_functions.c:65)
04:47:36 ==1635==    by 0x7B6E509: opal_pmix_pmix112_pmix_bfrop_pack_byte (pack.c:190)
04:47:36 ==1635==    by 0x7B6F587: opal_pmix_pmix112_pmix_bfrop_pack_buf (pack.c:626)
04:47:36 ==1635==    by 0x7B6E2C0: opal_pmix_pmix112_pmix_bfrop_pack_buffer (pack.c:86)
04:47:36 ==1635==    by 0x7B6E1F0: opal_pmix_pmix112_pmix_bfrop_pack (pack.c:60)
04:47:36 ==1635==    by 0x7B86827: _commitfn (pmix_client.c:664)
04:47:36 ==1635==    by 0x6485F2B: opal_libevent2022_event_base_loop (event.c:1370)
04:47:36 ==1635==    by 0x7B7D1E4: progress_engine (progress_threads.c:49)
04:47:36 ==1635==    by 0x3D698079D0: start_thread (in /lib64/libpthread-2.12.so)
04:47:36 ==1635==    by 0x87C86FF: ???
04:47:36 ==1635== 
04:47:39 ==1635== Thread 1:
04:47:39 ==1635== Uninitialised byte(s) found during client check request
04:47:39 ==1635==    at 0x64CC811: valgrind_module_isdefined (memchecker_valgrind_module.c:111)
04:47:39 ==1635==    by 0x64CC1DF: opal_memchecker_base_isdefined (memchecker_base_wrappers.c:33)
04:47:39 ==1635==    by 0x59A71E4: memchecker_call (memchecker.h:110)
04:47:39 ==1635==    by 0x59A741B: PMPI_Allgatherv (pallgatherv.c:52)
04:47:39 ==1635==    by 0x55C4040: oshmem_shmem_allgatherv (oshmem_shmem_exchange.c:40)
04:47:39 ==1635==    by 0x12CCF1FC: oshmem_shmem_xchng (spml_ucx.c:143)
04:47:39 ==1635==    by 0x12CCF38B: mca_spml_ucx_add_procs (spml_ucx.c:200)
04:47:39 ==1635==    by 0x55C3693: _shmem_init (oshmem_shmem_init.c:321)
04:47:39 ==1635==    by 0x55C3244: oshmem_shmem_init (oshmem_shmem_init.c:159)
04:47:39 ==1635==    by 0x55C6428: _shmem_init (pshmem_init.c:68)
04:47:39 ==1635==    by 0x55C63CE: start_pes (pshmem_init.c:47)
04:47:39 ==1635==    by 0x4006C5: main (oshmem_shmalloc.c:24)
04:47:39 ==1635==  Address 0x132cf064 is 52 bytes inside a block of size 122 alloc'd
04:47:39 ==1635==    at 0x4A06C9C: realloc (vg_replace_malloc.c:687)
04:47:39 ==1635==    by 0x4C18758: ucp_worker_get_address (ucp_worker.c:493)
04:47:39 ==1635==    by 0x12CCF33A: mca_spml_ucx_add_procs (spml_ucx.c:194)
04:47:39 ==1635==    by 0x55C3693: _shmem_init (oshmem_shmem_init.c:321)
04:47:39 ==1635==    by 0x55C3244: oshmem_shmem_init (oshmem_shmem_init.c:159)
04:47:39 ==1635==    by 0x55C6428: _shmem_init (pshmem_init.c:68)
04:47:39 ==1635==    by 0x55C63CE: start_pes (pshmem_init.c:47)
04:47:39 ==1635==    by 0x4006C5: main (oshmem_shmalloc.c:24)
04:47:39 ==1635== 

@ggouaillardet
Contributor Author

@hppritcha At first, I saw an MLNX Jenkins failure (a build failure on master). The root cause is that their Jenkins has UCX in a non-standard place, and ompi is configured with --with-ucx=...
When I see a real problem I can fix, I fix it, regardless of whether it affects me.

FWIW, @hjelmn's comment is at open-mpi/ompi@a93b849#commitcomment-16687208

@mike-dubman
Member

bot:retest

@lanl-ompi
Contributor

Test FAILed.

@hppritcha
Member

LANL cray xc error can be ignored.

@mellanox-github

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1441/ for details.

@hppritcha
Member

okay let's try lanl bot again.
bot:retest

@lanl-ompi
Contributor

Test FAILed.

@mellanox-github

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1442/ for details.

@artpol84
Contributor

:bot:retest

@mellanox-github

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1448/ for details.

@artpol84
Contributor

@miked-mellanox

I've investigated the PMIx portion of the Valgrind error and I think that there is no error there.
Here is the reason:

  • First of all this is just one error (not 2):
Syscall param write(buf) points to uninitialised byte(s)
04:47:36 ==1635==    at 0x3D6980E6FD: ??? (in /lib64/libpthread-2.12.so)
04:47:36 ==1635==    by 0x7B811C2: send_bytes (usock_sendrecv.c:52)
04:47:36 ==1635==    by 0x7B81837: opal_pmix_pmix112_pmix_usock_send_handler (usock_sendrecv.c:195)
04:47:36 ==1635==    by 0x64861F4: opal_libevent2022_event_base_loop (event.c:1321)
04:47:36 ==1635==    by 0x7B7D1E4: progress_engine (progress_threads.c:49)
04:47:36 ==1635==    by 0x3D698079D0: start_thread (in /lib64/libpthread-2.12.so)
04:47:36 ==1635==    by 0x87C86FF: ???
04:47:36 ==1635==  Address 0xb864834 is 356 bytes inside a block of size 1,024 alloc'd
04:47:36 ==1635==    at 0x4A06C9C: realloc (vg_replace_malloc.c:687)
04:47:36 ==1635==    by 0x6471B2F: opal_realloc (malloc.c:165)
04:47:36 ==1635==    by 0x7B6B3D5: opal_pmix_pmix112_pmix_bfrop_buffer_extend (internal_functions.c:65)
04:47:36 ==1635==    by 0x7B6E509: opal_pmix_pmix112_pmix_bfrop_pack_byte (pack.c:190)
04:47:36 ==1635==    by 0x7B6F587: opal_pmix_pmix112_pmix_bfrop_pack_buf (pack.c:626)
04:47:36 ==1635==    by 0x7B6E2C0: opal_pmix_pmix112_pmix_bfrop_pack_buffer (pack.c:86)
04:47:36 ==1635==    by 0x7B6E1F0: opal_pmix_pmix112_pmix_bfrop_pack (pack.c:60)
04:47:36 ==1635==    by 0x7B86827: _commitfn (pmix_client.c:664)
04:47:36 ==1635==    by 0x6485F2B: opal_libevent2022_event_base_loop (event.c:1370)
04:47:36 ==1635==    by 0x7B7D1E4: progress_engine (progress_threads.c:49)
04:47:36 ==1635==    by 0x3D698079D0: start_thread (in /lib64/libpthread-2.12.so)
04:47:36 ==1635==    by 0x87C86FF: ???

See http://valgrind.org/docs/manual/mc-manual.html#mc-manual.bad-syscall-args for the typical Valgrind output when a syscall is passed uninitialized data.

  • Regarding this error: according to Valgrind, the uninitialized data arises in _commitfn. In this function we pack all the keys we have submitted and send them to the local-node server.
    But there is no error in _commitfn. The thing is that (it seems to me) Valgrind tracks data through its whole lifecycle, i.e.:
    if you have a buffer A with uninitialized bytes and you copy its content to a buffer B, then even though you have overwritten all of B's bytes, Valgrind still knows that B holds uninitialized data because it originated from A.
    I carefully counted the bytes that form the _commitfn message, and it turns out that the uninitialized region falls precisely on the 108-byte UCX address value that we get and submit to PMIx here:
    https://github.com/open-mpi/ompi-release/blob/v1.10/ompi/mca/pml/ucx/pml_ucx.c#L81:L91.
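
The copy behaviour described in this list can be illustrated with a toy model. This is only a sketch of the idea, not Valgrind's implementation: `ShadowBuffer`, the 16-byte "address", its 2-byte gap and the 8-byte header are all hypothetical, and Memcheck tracks definedness per bit rather than per byte.

```python
class ShadowBuffer:
    """Byte buffer plus one 'defined' flag per byte (toy stand-in for V bits)."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.defined = [False] * size  # fresh heap memory starts out undefined

    def store(self, offset, payload):
        """The program writes known values: those bytes become defined."""
        end = offset + len(payload)
        self.data[offset:end] = payload
        self.defined[offset:end] = [True] * len(payload)

    def copy_from(self, offset, src, src_offset, length):
        """memcpy-like copy: the flags travel together with the data."""
        self.data[offset:offset + length] = src.data[src_offset:src_offset + length]
        self.defined[offset:offset + length] = src.defined[src_offset:src_offset + length]

    def first_undefined(self, offset, length):
        """Offset of the first undefined byte in the range, or None."""
        for i in range(offset, offset + length):
            if not self.defined[i]:
                return i
        return None


# Hypothetical 16-byte "address" in which bytes 4..5 are never written.
address = ShadowBuffer(16)
address.store(0, b"\x01\x02\x03\x04")
address.store(6, bytes(10))

# Hypothetical message: an 8-byte header followed by a copy of the address.
message = ShadowBuffer(32)
message.store(0, b"HEADER00")
message.copy_from(8, address, 0, 16)

# A write()-style check over the 24 bytes that were "sent".
reported = message.first_undefined(0, 24)
print(reported)  # 12: all 24 bytes were written, yet the gap is still flagged
```

In this model the flagged offset points into the message buffer, while the cause lies in whoever produced the copied-in data, which is the situation the stack traces above are being read against.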

Why this address is considered uninitialized becomes clear from the launch command:

-x LD_PRELOAD=/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libucp.so:/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libucm.so:/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libucs.so:/hpc/local/benchmarks/hpc-stack-gcc/install/ucx/debug/lib/libuct.so

Since those libraries were preloaded, they probably use malloc directly, bypassing Valgrind.

As a proof of concept: if I attach to the Valgrind debug server and overwrite 108 bytes of the message starting from the 316th byte, I get the following message, which says that the 712th byte is now wrong (I think this one also comes from UCX but was inserted into the remote_cache).

 Thread 2:
==11628== Syscall param write(buf) points to uninitialised byte(s)
==11628==    at 0x3D6980E6FD: ??? (in /lib64/libpthread-2.12.so)
==11628==    by 0x6DC31B6: send_bytes (usock_sendrecv.c:52)
==11628==    by 0x6DC382B: opal_pmix_pmix112_pmix_usock_send_handler (usock_sendrecv.c:195)
==11628==    by 0x56C8B74: opal_libevent2022_event_base_loop (event.c:1321)
==11628==    by 0x6DBF1D8: progress_engine (progress_threads.c:49)
==11628==    by 0x3D698079D0: start_thread (in /lib64/libpthread-2.12.so)
==11628==    by 0x7A0A6FF: ???
==11628==  Address 0xb5f4f28 is 712 bytes inside a block of size 1,024 alloc'd
==11628==    at 0x4A0577B: calloc (vg_replace_malloc.c:593)
==11628==    by 0x56B43B7: opal_calloc (malloc.c:131)
==11628==    by 0x6DAD3B6: opal_pmix_pmix112_pmix_bfrop_buffer_extend (internal_functions.c:66)
==11628==    by 0x6DB04FD: opal_pmix_pmix112_pmix_bfrop_pack_byte (pack.c:190)
==11628==    by 0x6DB157B: opal_pmix_pmix112_pmix_bfrop_pack_buf (pack.c:626)
==11628==    by 0x6DB02B4: opal_pmix_pmix112_pmix_bfrop_pack_buffer (pack.c:86)
==11628==    by 0x6DB01E4: opal_pmix_pmix112_pmix_bfrop_pack (pack.c:60)
==11628==    by 0x6DC883A: _commitfn (pmix_client.c:671)
==11628==    by 0x56C88AB: opal_libevent2022_event_base_loop (event.c:1370)
==11628==    by 0x6DBF1D8: progress_engine (progress_threads.c:49)
==11628==    by 0x3D698079D0: start_thread (in /lib64/libpthread-2.12.so)
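
The offsets quoted in this comment can be cross-checked with a little arithmetic. The sketch below assumes "316th byte" means a zero-based offset of 316 into the 1,024-byte block (an assumption; the conclusion is the same if it is one-based). `covers` is a hypothetical helper.

```python
def covers(start, length, offset):
    """True if offset lies inside the half-open range [start, start + length)."""
    return start <= offset < start + length


UCX_ADDR_START = 316  # where the overwritten region begins, per the comment
UCX_ADDR_LEN = 108    # size of the UCX address value, per the comment

first_report = 356    # "356 bytes inside a block of size 1,024" (first trace)
second_report = 712   # "712 bytes inside a block of size 1,024" (second trace)

print(covers(UCX_ADDR_START, UCX_ADDR_LEN, first_report))   # True
print(covers(UCX_ADDR_START, UCX_ADDR_LEN, second_report))  # False
```

So the first report falls inside the overwritten 108-byte region, and the second report lies beyond it, which is consistent with a separate piece of data further along in the same message.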

@jsquyres
Member

@artpol84 Do you want to sprinkle opal_memchecker_base_mem_defined() calls in relevant places to avoid such issues?

Regardless, these valgrind errors have nothing to do with the CPP flags changes in this PR. 👍

@jsquyres
Member

@hppritcha Good to go.

@artpol84
Contributor

@jsquyres The problem was slightly different. Further digging showed the following: only 82 of the 108 bytes have to be overwritten to cheat Valgrind. Valgrind's view of the address region was as follows:

(gdb) bt
#0  _putfn (sd=-1, args=4, cbdata=0xb5f37b0) at src/client/pmix_client.c:554
#1  0x00000000056c88ac in event_process_active_single_queue (base=0x5dd2330, flags=1) at event.c:1370
#2  event_process_active (base=0x5dd2330, flags=1) at event.c:1440
#3  opal_libevent2022_event_base_loop (base=0x5dd2330, flags=1) at event.c:1644
#4  0x0000000006dbf1d9 in progress_engine (obj=0x5dd2330) at src/util/progress_threads.c:49
#5  0x0000003d698079d1 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003d690e8b6d in clone () from /lib64/libc.so.6
(gdb) frame 0
#0  _putfn (sd=-1, args=4, cbdata=0xb5f37b0) at src/client/pmix_client.c:554
554    rc = pmix_value_xfer(kv->value, cb->value);
(gdb) p kv->key
$13 = 0xb5f3b30 "pml.ucx.2.0"
(gdb) monitor get_vbits cb->value.data.bo.bytes 108
missing or malformed address
(gdb) p cb->value.data.bo.bytes
$14 = 0xb5f3700 "`ܚ\361\365\200\313P"
(gdb) monitor get_vbits 0xb5f3700 108
00000000 00000000 00000000 ffff0000 00000000 00000000 00000000 00000000
00000000 0000ffff ffff0000 00000000 0000ffff ffffffff 00000000 00000000
00000000 00000000 00000000 ffffffff 00000000 00000000 ffffffff ffff0000
00000000 00000000 00000000 <--- EXACTLY 82 BYTES!!!

So it was a problem in UCX; @yosefe fixed it in openucx/ucx#690.
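
The `get_vbits` dump above can be checked mechanically. The following is a minimal sketch, assuming each byte is printed as two hex digits in address order and that any non-zero value means at least one undefined bit; `parse_vbits` is a hypothetical helper, and the dump is copied from the gdb session without the trailing annotation.

```python
VBITS_DUMP = """
00000000 00000000 00000000 ffff0000 00000000 00000000 00000000 00000000
00000000 0000ffff ffff0000 00000000 0000ffff ffffffff 00000000 00000000
00000000 00000000 00000000 ffffffff 00000000 00000000 ffffffff ffff0000
00000000 00000000 00000000
"""


def parse_vbits(dump):
    """Return one integer per byte of the dumped region."""
    hex_digits = "".join(dump.split())
    return [int(hex_digits[i:i + 2], 16) for i in range(0, len(hex_digits), 2)]


vbits = parse_vbits(VBITS_DUMP)
undefined = [offset for offset, v in enumerate(vbits) if v != 0]

total = len(vbits)                # bytes covered by the dump
count = len(undefined)            # bytes with undefined bits
first, last = undefined[0], undefined[-1]
span = last - first + 1           # distance from first to last undefined byte

print(total, count, first, last, span)  # 108 22 12 93 82
```

Under these assumptions, the dump covers 108 bytes, 22 of them are flagged, and they are scattered between offsets 12 and 93. That 82-byte span is one reading of the "82 bytes" figure: a single contiguous overwrite has to cover the whole span even though only the scattered gaps are undefined.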

hppritcha added a commit that referenced this pull request Mar 17, 2016
configury: UCX should use CPPFLAGS (instead of CFLAGS)
@hppritcha merged commit 3c4f09c into open-mpi:v2.x on Mar 17, 2016

9 participants