ompi/request: fix performance regression #1240
This commit fixes a performance regression introduced by the request rework. We were always using the multi-thread path because OPAL_ENABLE_MULTI_THREADS is either not defined or always defined to 1, depending on the Open MPI version. To fix this I removed the conditional and added a conditional on opal_using_threads(). This path will be optimized out in 2.0.0 in a non-thread-multiple build, as opal_using_threads is #defined to false in that case.

Fixes open-mpi/ompi#1806

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit open-mpi/ompi@544adb9)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
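In code terms, the change described above boils down to the following pattern (an illustrative sketch under the assumption that the hot path decrements a pending-request counter; it is not the verbatim patch, and the header paths are best guesses):

```c
#include <stdint.h>
#include "opal/threads/thread_usage.h"   /* opal_using_threads() */
#include "opal/sys/atomic.h"             /* opal_atomic_add_32() */

/* Before (sketch): a compile-time guard chose the path.  Since
 * OPAL_ENABLE_MULTI_THREADS is undefined or always 1 (version dependent),
 * every build ended up on the multi-thread/atomic path. */
static inline void complete_before_sketch (volatile int32_t *pending)
{
#if OPAL_ENABLE_MULTI_THREADS
    opal_atomic_add_32 (pending, -1);    /* always taken in practice */
#else
    *pending -= 1;                       /* effectively dead code */
#endif
}

/* After (sketch): decide at run time.  In a 2.0.0 build without
 * MPI_THREAD_MULTIPLE support, opal_using_threads() is #defined to false,
 * so the compiler still drops the atomic branch entirely. */
static inline void complete_after_sketch (volatile int32_t *pending)
{
    if (opal_using_threads ()) {
        opal_atomic_add_32 (pending, -1);
    } else {
        *pending -= 1;
    }
}
```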
:bot🏷️bug
Something has gone wrong (error 422). @jsquyres Please have a look at it!
Very strange -- GitHub won't allow assigning this issue to @nysal, even though he's got all the right permissions...
Test PASSed.
@hjelmn This does improve things a bit for me. I had a question about opal/threads/wait_sync.h: shouldn't those guards be OMPI_ENABLE_THREAD_MULTIPLE instead of OPAL_ENABLE_MULTI_THREADS? As it stands, the single-threaded versions of SYNC_WAIT etc. will never be called. SYNC_WAIT is used in the other variants of the MPI_Wait* calls. I'm still seeing a little bit of overhead in opal_progress(), but that's likely a different issue.
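For context, the guard being questioned has roughly this shape (reconstructed from memory of opal/threads/wait_sync.h; treat the exact macro and function names as assumptions rather than a verbatim quote):

```c
/* Approximate shape of the guard in opal/threads/wait_sync.h.  Because
 * OPAL_ENABLE_MULTI_THREADS is always 1, the single-threaded definition in
 * the #else branch is never selected; the point above is that the guard
 * should key off MPI_THREAD_MULTIPLE support (or simply rely on
 * opal_using_threads()) so the cheap path can actually be used. */
#if OPAL_ENABLE_MULTI_THREADS
#define SYNC_WAIT(sync)  (opal_using_threads () ? sync_wait_mt (sync) : sync_wait_st (sync))
#else
#define SYNC_WAIT(sync)  sync_wait_st (sync)
#endif
```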
@nysal Yeah, I have a fix for that as well. It is slightly lower priority, and I will open a PR for it once it is tested. I'm willing to leave that for 2.0.1 if I don't finish testing today.
@hjelmn Adding that to v2.0.1 is fine with me. By the way, my opal_progress() overhead seems to stem from the low-priority callback path. Should that code be guarded by an "if (callbacks_lp_len > 0)" check? I think only the BTLs register a low-priority callback at the moment. The atomic increment (or store if single-threaded) can be avoided if we are running with PMLs that don't have any low-priority callbacks. I understand this is a separate issue, but I thought I'd bring it up here. This PR looks good to me 👍
@nysal Hmm, yeah. I will add the extra check to the conditional. We will only have low-priority callbacks in a very small number of situations; they are meant to allow progress for connections.
The OPAL_ENABLE_MULTI_THREADS macro is always defined as 1. This was causing us to always use the multi-thread path for synchronization objects. The code has been updated to use the opal_using_threads() function. When MPI_THREAD_MULTIPLE support is disabled at build time (2.x only) this function is a macro evaluating to false, so the compiler will optimize out the MT path in this case. The OPAL_ATOMIC_ADD_32 macro has been removed and replaced by the existing OPAL_THREAD_ADD32 macro.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from open-mpi/ompi@143a93f)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
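At a call site, the macro swap named in this commit looks roughly like the sketch below (`num_active_reqs` is a hypothetical counter introduced only for illustration, and the summary of what OPAL_THREAD_ADD32 expands to is a paraphrase, not a quote):

```c
#include <stdint.h>
#include "opal/threads/thread_usage.h"   /* OPAL_THREAD_ADD32 -- header path is a best guess */

static volatile int32_t num_active_reqs = 0;   /* hypothetical counter */

static inline void track_request_sketch (void)
{
    /* Old call site (always an atomic read-modify-write, even when running
     * single-threaded):
     *     OPAL_ATOMIC_ADD_32(&num_active_reqs, 1);
     * New call site: OPAL_THREAD_ADD32 only uses the atomic when
     * opal_using_threads() is true, otherwise it is a plain add. */
    OPAL_THREAD_ADD32 (&num_active_reqs, 1);
}
```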
@nysal Fixed a couple more regressions, and this should be good to go now. Please re-review and +1 this.
This commit adds another check to the low-priority callback conditional that short-circuits the atomic add if there are no low-priority callbacks. This should improve performance in the common case.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from open-mpi/ompi@e4f920f)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
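Concretely, the short-circuit described here looks something like the sketch below (the bookkeeping variables and the run-every-Nth-call cadence are stand-ins reconstructed from the discussion above, not quotes from opal_progress()):

```c
#include <stddef.h>
#include <stdint.h>
#include "opal/threads/thread_usage.h"   /* OPAL_THREAD_ADD32 */

typedef int (*progress_cb_t) (void);

/* Hypothetical stand-ins for the real opal_progress() bookkeeping. */
static progress_cb_t    callbacks_lp[8];
static size_t           callbacks_lp_len = 0;
static volatile int32_t num_calls = 0;

static int progress_lp_sketch (void)
{
    int events = 0;

    /* The added check: skip the counter update (an atomic when threads are
     * on) and the callback loop entirely when no low-priority callbacks are
     * registered -- the common case, since only some BTLs register one to
     * progress connections. */
    if (callbacks_lp_len > 0) {
        /* Run the low-priority callbacks only every so often; the exact
         * cadence here is an assumption, not the real heuristic. */
        if (0 == (OPAL_THREAD_ADD32 (&num_calls, 1) & 0x7)) {
            for (size_t i = 0; i < callbacks_lp_len; ++i) {
                events += callbacks_lp[i] ();
            }
        }
    }
    return events;
}
```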
Force-pushed from 45706ca to 6711e0f.
Test FAILed.
Build Failed with GNU compiler! Please review the log, and get in touch if you have questions.
Test FAILed.
Build Failed with GNU compiler! Please review the log, and get in touch if you have questions.
@hjelmn The code looks fine to me. However, I see some compiler warnings introduced in the Jenkins build logs.
Test PASSed.
@nysal Things look as clean as they normally are on the Mellanox Jenkins. Can you give some examples?
The IBM-CI (GNU Compiler) failure was a network problem (it timed out; pulling the repo from GitLab was very slow for some reason). Feel free to ask it to retest, but if the XL compiler passed, the GNU build will likely pass cleanly too.
:bot:retest:
Test FAILed.
False failure from the Mellanox Jenkins.
not false, real:
21:39:25 ++ /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/mpirun -np 2 -tune /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/test_tune.conf -mca mca_base_env_list 'XXX_A=7;XXX_B=8' /scrap/jenkins/jenkins/jobs/gh-ompi-release-pr/workspace/jenkins_scripts/jenkins/ompi/env_mpi
21:39:25 ++ sed -e ':a;N;$!ba;s/\n/+/g'
21:39:25 ++ bc
21:39:25 [jenkins01:14395] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
21:39:25 [jenkins01:14395] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
21:39:25 + val=59
21:39:25 + '[' 59 -ne 54 ']'
21:39:25 + exit 1
@miked-mellanox It is false. A CPC could not be found. Please look into it.
@hjelmn Travis is taking forever on its Mac builds, but it already failed the Linux compiles -- you might want to check them out.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit open-mpi/ompi@55d1933)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Test FAILed.
bot:retest
Test FAILed.
timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-4/ompi_install1/bin/mpirun -np 4 -bind-to core -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -mca pml yalla /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-4/ompi_install1/thread_tests/thread-tests-1.1/overlap 8
BTW, is the following flag still relevant for MT? --enable-opal-multi-threads
Test FAILed.
This PR addresses ompi request logic, and I see hangs in overlap with the following backtraces. In the overlap test we have 2 threads; the second thread is waiting in the blocking MPI_Recv. So it seems to me that the request changes might directly affect this hang.
I guess that
Some more info: in the main thread we are posting a non-blocking receive from rank = 0; in the second thread we are processing a receive from rank = 2 (out-of-order). It seems that
I think the ompi_wait_sync_t object is only initialized (and thus so is the pthread condition variable) if the blocking recv waits on the request via ompi_request_wait_completion(). Yalla seems to bypass this? I see a PML_YALLA_WAIT_MXM_REQ which seems to indicate this. Since the condition variable is not initialized, you hang while trying to take a lock on the condition variable's internal mutex.
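For anyone following along, the completion path being described looks roughly like this (a sketch of the ompi_request_wait_completion() pattern reconstructed from the wait_sync API; macro names, fields, and header paths are from memory and should be checked against the tree):

```c
#include "ompi/request/request.h"        /* ompi_request_t, REQUEST_COMPLETE, REQUEST_PENDING */
#include "opal/threads/wait_sync.h"      /* ompi_wait_sync_t, WAIT_SYNC_*, SYNC_WAIT */
#include "opal/runtime/opal_progress.h"  /* opal_progress() */
#include "opal/sys/atomic.h"             /* OPAL_ATOMIC_CMPSET_PTR */

/* Sketch of the blocking-wait path: the ompi_wait_sync_t (and its pthread
 * condition variable) only gets initialized here.  A PML that spins on its
 * own completion flag (e.g. yalla's PML_YALLA_WAIT_MXM_REQ path) never
 * creates a sync object, so nothing should ever try to lock its mutex. */
static inline void wait_completion_sketch (ompi_request_t *req)
{
    if (opal_using_threads () && !REQUEST_COMPLETE (req)) {
        ompi_wait_sync_t sync;
        WAIT_SYNC_INIT (&sync, 1);                 /* mutex/condvar initialized here */

        /* Publish the sync object on the request so the completion path can
         * signal it; the real code does this with an atomic compare-and-swap
         * on req->req_complete. */
        if (OPAL_ATOMIC_CMPSET_PTR (&req->req_complete, REQUEST_PENDING, &sync)) {
            SYNC_WAIT (&sync);                     /* sleep on the condvar */
        } else {
            /* The request completed before we could publish the sync object;
             * nothing to wait for. */
        }
        WAIT_SYNC_RELEASE (&sync);
    } else {
        while (!REQUEST_COMPLETE (req)) {
            opal_progress ();                      /* single-threaded: just poll */
        }
    }
}
```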
@artpol84 If there is a wait object then either 1) someone is waiting on it, or 2) the req_complete field was not correctly updated when the request was allocated.
Hmm, or I am wrong. There is no ompi_request_t in the blocking send path... The request being completed is from an irecv.
@artpol84 Thanks for digging into this. Whatever the problem is, I have to assume it affects master too. If not, then there might be a commit missing. I haven't been able to find one, but I am not looking at the entire code base.
@jsquyres Since this PR has nothing to do with the pml/yalla problem, we should go ahead and merge.
I was unable to test v2.x without this PR, and the problem is with requests, which is related to this PR.
@artpol84 All PRs are failing on 2.x, including one that had nothing to do with requests.
@hppritcha I vote for merging this PR. @jladd-mlnx @artpol84 Do you guys want to wait for v2.0.1 for the yalla threading fixes? Or can you get a fix in the very near term?
Please continue the discussion about the overlap test on open-mpi/ompi#1813. Discussion on this PR should now be about the performance regression fix. Thanks.
I disagree that "All PRs are failing on 2.x" in the same way. For example, issue #1237 segfaults right at the start: http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1796/console http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1793/console And for this issue overlap is hanging (see the timestamps): FAIL #1: FAIL #2:
@artpol84 All failures involve yalla from what I can tell, so they are all probably from the same bug. Note the one that should be using openib has
Yup, yalla's priority is 50, so it wins. All failures are pml/yalla.
@hjelmn @artpol84 Please move your discussion to open-mpi/ompi#1813.
I'll give mlnx jenkins another chance, then merge unless @jsquyres objects.
Discussion of the mlnx jenkins failure indicates it's very unlikely to be related to the code changes from this PR. Merging.