Closed
Last night (1 Dec 2016), Cisco's MTT saw 3210 OpenSHMEM failures on master: https://mtt.open-mpi.org/index.php?do_redir=2380
@jladd-mlnx Per discussion this past Tuesday, can you investigate?
Here are some of the signatures -- this one is a segv:
Element size: 4 bytes
Addresses: 1st element 0xff0000d8
2nd element 0xff0000dc
Iterations: 1000000 target PE: 0 other active PE: 31
[mpi019:19645] *** Process received signal ***
[mpi019:19645] Signal: Segmentation fault (11)
[mpi019:19645] Signal code: Address not mapped (1)
[mpi019:19645] Failing at address: 0x18
[mpi019:19645] [ 0] /lib64/libpthread.so.0[0x36e920f710]
[mpi019:19645] [ 1] /home/mpiteam/scratches/community/2016-12-01cron/cEW8/installs/cRor/install/lib/openmpi/mca_spml_yoda.so(mca_spml_yoda_get+0x65e)[0x2aaac5d92be2]
[mpi019:19645] [ 2] /home/mpiteam/scratches/community/2016-12-01cron/cEW8/installs/cRor/install/lib/liboshmem.so.0(shmem_int_g+0x251)[0x2aaaaaaf4e17]
[mpi019:19645] [ 3] examples/adjacent_32bit_amo.x[0x400b83]
[mpi019:19645] [ 4] /lib64/libc.so.6(__libc_start_main+0xfd)[0x36e8e1ed1d]
[mpi019:19645] [ 5] examples/adjacent_32bit_amo.x[0x400929]
[mpi019:19645] *** End of error message ***
This one is a little odd -- something chose to abort:
[mpi030:01565] *** Process received signal ***
[mpi030:01565] Signal: Aborted (6)
[mpi030:01565] Signal code: (-6)
[mpi030:01565] Error spml_yoda_getreq.c:121 - mca_spml_yoda_get_response_completion() FATAL get completion error
[mpi030:01565] [ 0] /lib64/libpthread.so.0[0x3b7460f710]
[mpi030:01565] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3b74232925]
[mpi030:01565] [ 2] /lib64/libc.so.6(abort+0x175)[0x3b74234105]
[mpi030:01565] [ 3] /home/mpiteam/scratches/community/2016-12-01cron/L7d3/installs/S11W/install/lib/liboshmem.so.0(mca_spml_yoda_get_response_completion+0x88)[0x2aaaaaba4027]
[mpi030:01565] [ 4] /home/mpiteam/scratches/community/2016-12-01cron/L7d3/installs/S11W/install/lib/libopen-pal.so.0(mca_btl_tcp_endpoint_close+0x2c1)[0x2aaaac00fc74]
[mpi030:01565] [ 5] /home/mpiteam/scratches/community/2016-12-01cron/L7d3/installs/S11W/install/lib/libopen-pal.so.0(mca_btl_tcp_frag_recv+0x310)[0x2aaaac013f0d]
When running tests like test_shmem_shpalloc_03_double.x, I see errors like:
[mpi006:31636] Error: pshpalloc_f.c:49 - shpalloc_f() could not allocate -198967296 bytes in symmetric heap
[mpi006:31636] Error: pshpalloc_f.c:52 - shpalloc_f() nonzero abort value, aborting..
[mpi006:31640] Error: pshpalloc_f.c:49 - shpalloc_f() could not allocate -198967296 bytes in symmetric heap
[mpi006:31640] Error: pshpalloc_f.c:52 - shpalloc_f() nonzero abort value, aborting..
[mpi006:31654] Error: pshpalloc_f.c:49 - shpalloc_f() could not allocate -198967296 bytes in symmetric heap
[mpi006:31654] Error: pshpalloc_f.c:52 - shpalloc_f() nonzero abort value, aborting..
[mpi006:31662] Error: pshpalloc_f.c:49 - shpalloc_f() could not allocate -198967296 bytes in symmetric heap
[mpi006:31662] Error: pshpalloc_f.c:52 - shpalloc_f() nonzero abort value, aborting..
[mpi006:31634] Error: pshpalloc_f.c:49 - shpalloc_f() could not allocate -198967296 bytes in symmetric heap
[mpi006:31634] Error: pshpalloc_f.c:52 - shpalloc_f() nonzero abort value, aborting..
...
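A negative byte count such as -198967296 is what a size larger than 2^31 - 1 looks like after being squeezed into a signed 32-bit integer. The report does not say what size the test actually requested, so the following is only a sketch of that arithmetic, not a diagnosis; `wrap_int32` is a hypothetical helper written for this illustration.

```python
INT32_MAX = 2**31 - 1

def wrap_int32(n: int) -> int:
    """Reinterpret an integer as a two's-complement signed 32-bit value."""
    return (n + 2**31) % 2**32 - 2**31

observed = -198967296  # byte count printed by shpalloc_f() in the log above

# Smallest non-negative size that would wrap to the observed value.
# Assumption: exactly one 32-bit wraparound happened; the log cannot confirm this.
smallest_unwrapped = observed + 2**32

print(smallest_unwrapped)              # 4096000000
print(smallest_unwrapped > INT32_MAX)  # True
print(wrap_int32(smallest_unwrapped))  # -198967296
```

If this reading is right, the request was for roughly 4 GB and the value passed through a 32-bit signed type somewhere before being printed or allocated.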
When running tests like test_shmem_finc_05_int4.x, I see errors like:
[mpi021:24233] Error spml_yoda.c:876 - mca_spml_yoda_put_internal() shmem error
[mpi021:24233] Error spml_yoda.c:880 - mca_spml_yoda_put_internal() shmem error: ret = -12, send_pe = 31, dest_pe = 0
[mpi021:24233] Error spml_yoda.c:1207 - mca_spml_yoda_get() oshmem_get: error -12
[mpi021:24233] Error spml_yoda.c:876 - mca_spml_yoda_put_internal() shmem error
[mpi021:24233] Error spml_yoda.c:880 - mca_spml_yoda_put_internal() shmem error: ret = -12, send_pe = 31, dest_pe = 0
[mpi021:24233] Error spml_yoda.c:1207 - mca_spml_yoda_get() oshmem_get: error -12
[mpi021:24233] Error spml_yoda.c:876 - mca_spml_yoda_put_internal() shmem error
[mpi021:24233] Error spml_yoda.c:880 - mca_spml_yoda_put_internal() shmem error: ret = -12, send_pe = 31, dest_pe = 0
[mpi021:24233] Error spml_yoda.c:1207 - mca_spml_yoda_get() oshmem_get: error -12
[mpi021:24233] Error spml_yoda.c:876 - mca_spml_yoda_put_internal() shmem error
[mpi021:24233] Error spml_yoda.c:880 - mca_spml_yoda_put_internal() shmem error: ret = -12, send_pe = 31, dest_pe = 0
[mpi021:24233] Error spml_yoda.c:1207 - mca_spml_yoda_get() oshmem_get: error -12
...
This is just a snapshot of some of the errors; I did not attempt to classify every error in that MTT report in this issue.
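For anyone who does want a rough classification, one option is to group the `Error` lines by source file, line number, and function. The sketch below assumes the line format seen in the excerpts above (`[host:pid] Error file.c:NNN - func() ...`, with an optional colon after `Error`); the regular expression and `tally_signatures` are hypothetical names for illustration, not part of MTT or Open MPI.

```python
import re
from collections import Counter

# Matches e.g. "[mpi021:24233] Error spml_yoda.c:876 - mca_spml_yoda_put_internal()".
# finditer is used so that two messages fused onto one line are both counted.
ERROR_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\] Error:? "
    r"(?P<file>[\w.]+\.c):(?P<line>\d+) - (?P<func>\w+)\(\)"
)

def tally_signatures(log_text: str) -> Counter:
    """Count error messages keyed by (source file, line number, function)."""
    counts = Counter()
    for m in ERROR_RE.finditer(log_text):
        counts[(m["file"], int(m["line"]), m["func"])] += 1
    return counts

# Sample lines copied from the excerpts in this issue.
sample = """\
[mpi006:31636] Error: pshpalloc_f.c:49 - shpalloc_f() could not allocate -198967296 bytes in symmetric heap
[mpi006:31636] Error: pshpalloc_f.c:52 - shpalloc_f() nonzero abort value, aborting..
[mpi021:24233] Error spml_yoda.c:876 - mca_spml_yoda_put_internal() shmem error
[mpi021:24233] Error spml_yoda.c:1207 - mca_spml_yoda_get() oshmem_get: error -12
[mpi021:24233] Error spml_yoda.c:876 - mca_spml_yoda_put_internal() shmem error
"""

for (src, line, func), n in tally_signatures(sample).most_common():
    print(f"{n:4d}  {src}:{line}  {func}()")
```

This only covers the `Error file.c:line` style of message; signal backtraces such as the segv and abort shown earlier would need separate handling.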