
OSHMEM MTT failures #2499

@jsquyres

Last night (1 Dec 2016), Cisco's MTT saw 3210 OpenSHMEM failures on master: https://mtt.open-mpi.org/index.php?do_redir=2380

@jladd-mlnx Per discussion this past Tuesday, can you investigate?

Here are some of the failure signatures -- this one is a segv:

Element size: 4 bytes
Addresses: 1st element 0xff0000d8
           2nd element 0xff0000dc
Iterations: 1000000   target PE: 0   other active PE: 31
[mpi019:19645] *** Process received signal ***
[mpi019:19645] Signal: Segmentation fault (11)
[mpi019:19645] Signal code: Address not mapped (1)
[mpi019:19645] Failing at address: 0x18
[mpi019:19645] [ 0] /lib64/libpthread.so.0[0x36e920f710]
[mpi019:19645] [ 1] /home/mpiteam/scratches/community/2016-12-01cron/cEW8/installs/cRor/install/lib/openmpi/mca_spml_yoda.so(mca_spml_yoda_get+0x65e)[0x2aaac5d92be2]
[mpi019:19645] [ 2] /home/mpiteam/scratches/community/2016-12-01cron/cEW8/installs/cRor/install/lib/liboshmem.so.0(shmem_int_g+0x251)[0x2aaaaaaf4e17]
[mpi019:19645] [ 3] examples/adjacent_32bit_amo.x[0x400b83]
[mpi019:19645] [ 4] /lib64/libc.so.6(__libc_start_main+0xfd)[0x36e8e1ed1d]
[mpi019:19645] [ 5] examples/adjacent_32bit_amo.x[0x400929]
[mpi019:19645] *** End of error message ***

This one is a little odd -- something chose to abort:

[mpi030:01565] *** Process received signal ***
[mpi030:01565] Signal: Aborted (6)
[mpi030:01565] Signal code:  (-6)
[mpi030:01565] Error spml_yoda_getreq.c:121 - mca_spml_yoda_get_response_completion() FATAL get completion error
[mpi030:01565] [ 0] /lib64/libpthread.so.0[0x3b7460f710]
[mpi030:01565] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3b74232925]
[mpi030:01565] [ 2] /lib64/libc.so.6(abort+0x175)[0x3b74234105]
[mpi030:01565] [ 3] /home/mpiteam/scratches/community/2016-12-01cron/L7d3/installs/S11W/install/lib/liboshmem.so.0(mca_spml_yoda_get_response_completion+0x88)[0x2aaaaaba4027]
[mpi030:01565] [ 4] /home/mpiteam/scratches/community/2016-12-01cron/L7d3/installs/S11W/install/lib/libopen-pal.so.0(mca_btl_tcp_endpoint_close+0x2c1)[0x2aaaac00fc74]
[mpi030:01565] [ 5] /home/mpiteam/scratches/community/2016-12-01cron/L7d3/installs/S11W/install/lib/libopen-pal.so.0(mca_btl_tcp_frag_recv+0x310)[0x2aaaac013f0d]

When running tests like test_shmem_shpalloc_03_double.x, I see errors like:

[mpi006:31636] Error: pshpalloc_f.c:49 - shpalloc_f() could not allocate -198967296 bytes in symmetric heap
[mpi006:31636] Error: pshpalloc_f.c:52 - shpalloc_f() nonzero abort value, aborting..
[mpi006:31640] Error: pshpalloc_f.c:49 - shpalloc_f() could not allocate -198967296 bytes in symmetric heap
[mpi006:31640] Error: pshpalloc_f.c:52 - shpalloc_f() nonzero abort value, aborting..
[mpi006:31654] Error: pshpalloc_f.c:49 - shpalloc_f() could not allocate -198967296 bytes in symmetric heap
[mpi006:31654] Error: pshpalloc_f.c:52 - shpalloc_f() nonzero abort value, aborting..
[mpi006:31662] Error: pshpalloc_f.c:49 - shpalloc_f() could not allocate -198967296 bytes in symmetric heap
[mpi006:31662] Error: pshpalloc_f.c:52 - shpalloc_f() nonzero abort value, aborting..
[mpi006:31634] Error: pshpalloc_f.c:49 - shpalloc_f() could not allocate -198967296 bytes in symmetric heap
[mpi006:31634] Error: pshpalloc_f.c:52 - shpalloc_f() nonzero abort value, aborting..
...

When running tests like test_shmem_finc_05_int4.x, I see errors like:

[mpi021:24233] Error spml_yoda.c:876 - mca_spml_yoda_put_internal() shmem error
[mpi021:24233] Error spml_yoda.c:880 - mca_spml_yoda_put_internal() shmem error: ret = -12, send_pe = 31, dest_pe = 0
[mpi021:24233] Error spml_yoda.c:1207 - mca_spml_yoda_get() oshmem_get: error -12
[mpi021:24233] Error spml_yoda.c:876 - mca_spml_yoda_put_internal() shmem error
[mpi021:24233] Error spml_yoda.c:880 - mca_spml_yoda_put_internal() shmem error: ret = -12, send_pe = 31, dest_pe = 0
[mpi021:24233] Error spml_yoda.c:1207 - mca_spml_yoda_get() oshmem_get: error -12
[mpi021:24233] Error spml_yoda.c:876 - mca_spml_yoda_put_internal() shmem error
[mpi021:24233] Error spml_yoda.c:880 - mca_spml_yoda_put_internal() shmem error: ret = -12, send_pe = 31, dest_pe = 0
[mpi021:24233] Error spml_yoda.c:1207 - mca_spml_yoda_get() oshmem_get: error -12
[mpi021:24233] Error spml_yoda.c:876 - mca_spml_yoda_put_internal() shmem error
[mpi021:24233] Error spml_yoda.c:880 - mca_spml_yoda_put_internal() shmem error: ret = -12, send_pe = 31, dest_pe = 0
[mpi021:24233] Error spml_yoda.c:1207 - mca_spml_yoda_get() oshmem_get: error -12
...

This is only a snapshot of the errors; I did not attempt to classify every error in that MTT report in this issue.
