This repository was archived by the owner on Sep 30, 2022. It is now read-only.

Conversation

@hjelmn
Member

@hjelmn hjelmn commented Jun 21, 2016

This commit fixes a performance regression introduced by the request
rework. We were always using the multi-thread path because
OPAL_ENABLE_MULTI_THREADS is either not defined or always defined to 1
depending on the Open MPI version. To fix this I removed the
conditional and added a conditional on opal_using_threads(). This path
will be optimized out in 2.0.0 in a non-thread-multiple build as
opal_using_threads is #defined to false in that case.

Fixes open-mpi/ompi#1806

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from commit open-mpi/ompi@544adb9)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
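
For illustration, here is a minimal sketch of the mechanism the description above relies on (hypothetical example_* names, not the actual Open MPI sources): when MPI_THREAD_MULTIPLE support is disabled at build time, the opal_using_threads()-style predicate collapses to a compile-time constant, so any branch guarded by it is dead code the compiler can eliminate.

    #include <stdbool.h>

    /* EXAMPLE_THREAD_MULTIPLE stands in for the real configure-time option. */
    #if EXAMPLE_THREAD_MULTIPLE
    extern bool example_uses_threads;   /* set once MPI_Init_thread grants THREAD_MULTIPLE */
    #define example_using_threads() (example_uses_threads)
    #else
    /* non-thread-multiple build: constant false, so every
     * "if (example_using_threads()) { ... }" block is removed by the compiler */
    #define example_using_threads() (false)
    #endif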

@hjelmn
Member Author

hjelmn commented Jun 21, 2016

:bot:label:bug
:bot:assign: @nysal
:bot:milestone:v2.0.0

@ompiteam-bot

Something has gone wrong (error 422).

@jsquyres Please have a look at it!

@jsquyres jsquyres added the bug label Jun 21, 2016
@jsquyres jsquyres added this to the v2.0.0 milestone Jun 21, 2016
@jsquyres
Member

Very strange -- GitHub won't allow assigning this issue to @nysal, even though he's got all the right permissions...

@mellanox-github

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1799/ for details.

@nysal
Member

nysal commented Jun 22, 2016

@hjelmn This does improve things a bit for me. I had a question about opal/threads/wait_sync.h: shouldn't those guards be OMPI_ENABLE_THREAD_MULTIPLE instead of OPAL_ENABLE_MULTI_THREADS? As it stands, the single-threaded versions of SYNC_WAIT etc. will never be called; SYNC_WAIT is used in the other variants of the MPI_Wait* calls. I'm still seeing a little bit of overhead in opal_progress(), but that's likely a different issue.
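
For context, a minimal sketch of the dispatch being discussed (hypothetical example_* names; the real code lives in opal/threads/wait_sync.h): the single-threaded variant of the wait only has to spin on progress, and the guard that selects between the two variants is exactly what the macro question above is about.

    #include <stdbool.h>

    typedef struct example_sync {
        volatile int count;    /* completions this waiter still needs */
        int          status;   /* propagated error status */
    } example_sync_t;

    extern bool example_using_threads(void);      /* stand-in for opal_using_threads() */
    extern int  example_progress(void);           /* stand-in for opal_progress() */
    int example_sync_wait_mt(example_sync_t *);   /* condition-variable wait; definition elided */

    /* Single-threaded variant: no locks, no condition variable, just drive progress. */
    static inline int example_sync_wait_st(example_sync_t *sync)
    {
        while (sync->count > 0) {
            example_progress();
        }
        return sync->status;
    }

    #define EXAMPLE_SYNC_WAIT(sync) \
        (example_using_threads() ? example_sync_wait_mt(sync) : example_sync_wait_st(sync))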

@hjelmn
Member Author

hjelmn commented Jun 22, 2016

@nysal Yeah. I have a fix for that as well. It is slightly lower priority and I will open a PR for it once it is tested. Willing to leave that for 2.0.1 if I don't finish testing today.

@nysal
Member

nysal commented Jun 22, 2016

@hjelmn adding that to v2.0.1 is fine with me. By the way, my opal_progress() overhead seems to stem from:

    if ((OPAL_THREAD_ADD64((volatile int64_t *) &num_calls, 1) & callbacks_lp_mask) == 0) {
        /* run low priority callbacks once every 8 calls to opal_progress() */
        for (i = 0 ; i < callbacks_lp_len ; ++i) {
            events += (callbacks_lp[i])();
        }
    }

Should the above code be guarded by an "if (callbacks_lp_len > 0)" check? I think only the BTLs register a low-priority callback at the moment. The atomic increment (or store, if single-threaded) can be avoided if we are running with PMLs that don't have any low-priority callbacks. I understand this is a separate issue, but I thought I'd bring it up here. This PR looks good to me 👍
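
A sketch of the guarded version being suggested (not the actual commit that later landed; it reuses only the identifiers already shown in the snippet above): the length check short-circuits, so the atomic increment is skipped entirely when no low-priority callbacks are registered.

    if (callbacks_lp_len > 0 &&
        (OPAL_THREAD_ADD64((volatile int64_t *) &num_calls, 1) & callbacks_lp_mask) == 0) {
        /* run low priority callbacks once every 8 calls to opal_progress() */
        for (i = 0 ; i < callbacks_lp_len ; ++i) {
            events += (callbacks_lp[i])();
        }
    }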

@hjelmn
Member Author

hjelmn commented Jun 22, 2016

@nysal Hmm, yeah. I will add the extra item to the conditional. We will only have low-priority callbacks in a very small number of situations. They are meant to allow progress for connections.

The OPAL_ENABLE_MULTI_THREADS macro is always defined as 1. This was
causing us to always use the multi-thread path for synchronization
objects. The code has been updated to use the opal_using_threads()
function. When MPI_THREAD_MULTIPLE support is disabled at build time
(2.x only) this function is a macro evaluating to false so the
compiler will optimize out the MT-path in this case. The
OPAL_ATOMIC_ADD_32 macro has been removed and replaced by the existing
OPAL_THREAD_ADD32 macro.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from open-mpi/ompi@143a93f)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
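
For illustration, a minimal sketch of the OPAL_THREAD_ADD32-style pattern the commit message above refers to (hypothetical example_* names and a GCC/Clang builtin rather than the real OPAL atomics): atomic only when threads are actually in use, a plain add otherwise.

    #include <stdbool.h>
    #include <stdint.h>

    extern bool example_using_threads(void);   /* stand-in for opal_using_threads() */

    static inline int32_t example_thread_add32(volatile int32_t *addr, int32_t delta)
    {
        if (example_using_threads()) {
            /* multi-thread path: atomic read-modify-write */
            return __atomic_add_fetch(addr, delta, __ATOMIC_RELAXED);
        }
        /* single-thread fast path: plain arithmetic, no atomic instruction */
        *addr += delta;
        return *addr;
    }
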
@hjelmn
Member Author

hjelmn commented Jun 22, 2016

@nysal Fixed a couple more regressions and this should be good to go now. Please review and +1 this.

This commit adds another check to the low-priority callback
conditional that short-circuits the atomic-add if there are no
low-priority callbacks. This should improve performance in the common
case.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from open-mpi/ompi@e4f920f)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hjelmn hjelmn force-pushed the v2.x_request_performance branch from 45706ca to 6711e0f on June 22, 2016 16:51
@lanl-ompi
Contributor

Test FAILed.

1 similar comment
@lanl-ompi
Contributor

Test FAILed.

@ibm-ompi

Build Failed with GNU compiler! Please review the log, and get in touch if you have questions.

@lanl-ompi
Contributor

Test FAILed.

@ibm-ompi

Build Failed with GNU compiler! Please review the log, and get in touch if you have questions.

@nysal
Member

nysal commented Jun 22, 2016

@hjelmn the code looks fine to me. However, I see some compiler warnings introduced in the Jenkins build logs.

@mellanox-github

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1803/ for details.

@hjelmn
Member Author

hjelmn commented Jun 22, 2016

@nysal Things look as clean as they normally are on the Mellanox jenkins. Can you give some examples?

@jjhursey
Member

The IBM-CI (GNU Compiler) failure was a network problem (pulling the repo from GitLab was very slow for some reason and timed out). Feel free to ask it to retest, but if the XL compiler passed, the GNU build will likely pass cleanly too.

@hjelmn
Member Author

hjelmn commented Jun 22, 2016

:bot:retest:

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1804/ for details.

@hjelmn
Member Author

hjelmn commented Jun 22, 2016

False failure from Mellanox jenkins.

@mike-dubman
Member

Not false -- it's real:

21:39:25 ++ /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/mpirun -np 2 -tune /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/test_tune.conf -mca mca_base_env_list 'XXX_A=7;XXX_B=8' /scrap/jenkins/jenkins/jobs/gh-ompi-release-pr/workspace/jenkins_scripts/jenkins/ompi/env_mpi
21:39:25 ++ sed -e ':a;N;$!ba;s/\n/+/g'
21:39:25 ++ bc
21:39:25 [jenkins01:14395] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
21:39:25 [jenkins01:14395] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
21:39:25 + val=59
21:39:25 + '[' 59 -ne 54 ']'
21:39:25 + exit 1

@hjelmn
Member Author

hjelmn commented Jun 22, 2016

@miked-mellanox It is false. A CPC could not be found. Please look into it.

@jsquyres
Member

@hjelmn Travis is taking forever on its Mac builds, but it already failed the Linux compiles -- you might want to check them out.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

(cherry picked from commit open-mpi/ompi@55d1933)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1809/ for details.

@hjelmn
Member Author

hjelmn commented Jun 22, 2016

bot:retest

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1810/ for details.

@mike-dubman
Member

  • When this command was executed for 1000 iterations, it failed 5 times with a segfault:
timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-4/ompi_install1/bin/mpirun -np 4 -bind-to core -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -mca pml yalla /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-4/ompi_install1/thread_tests/thread-tests-1.1/overlap 8
  • Unfortunately, mpirun was compiled without debug symbols; I will fix Jenkins to run a debug build.

@mike-dubman
Member

By the way, is the following flag still relevant for multi-threaded builds? --enable-opal-multi-threads

00:58:45 + ./configure --with-platform=contrib/platform/mellanox/optimized --with-ompi-param-check --enable-picky --enable-mpi-thread-multiple --enable-opal-multi-threads --prefix=/var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1
00:58:45 configure: WARNING: unrecognized options: --with-ompi-param-check, --enable-opal-multi-threads
00:58:45 Loaded platform arguments for contrib/platform/mellanox/optimized

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1811/ for details.

@artpol84
Contributor

This PR touches the ompi request logic, and I see hangs in the overlap test with the following backtraces.

In the overlap test we have 2 threads:
The main thread issues a set of non-blocking MPI_Irecv/MPI_Isend calls and afterwards does an MPI_Send:

(gdb) bt
#0  0x0000003d6980c392 in ?? () from /lib64/libpthread.so.0
#1  0x00007fffedbbc247 in __mxm_spin_lock (req=0xb2c780) at ./mxm/util/sys/spinlock.h:69
#2  __mxm_async_thread_lock (req=0xb2c780) at ./mxm/core/async.h:135
#3  __mxm_async_block (req=0xb2c780) at ./mxm/core/async.h:167
#4  mxm_req_recv (req=0xb2c780) at mxm/proto/proto_match.c:264
#5  0x00007fffedec3969 in mca_pml_yalla_irecv (buf=0x7fff931b6010, count=4194304, datatype=0x601c40, src=0, tag=100, comm=0x601640, request=0x7fffffffd1b0) at pml_yalla.c:356
#6  0x00007ffff7cee1c5 in PMPI_Irecv (buf=0x7fff931b6010, count=4194304, type=0x601c40, source=0, tag=100, comm=0x601640, request=0x7fffffffd1b0) at pirecv.c:78
#7  0x0000000000400ef0 in main ()

The second thread is waiting in a blocking MPI_Recv:

(gdb) bt
#0  0x0000003d6980e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003d6980beb1 in pthread_cond_signal@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#2  0x00007fffedec5de0 in wait_sync_update (sync=0x7fffffffd030, updates=1, status=0) at ../../../../opal/threads/wait_sync.h:91
#3  0x00007fffedec5ed3 in ompi_request_complete (request=0xb2c830, with_signal=true) at ../../../../ompi/request/request.h:423
#4  0x00007fffedec6ba0 in mca_pml_yalla_recv_completion_cb (context=0xb2c830) at pml_yalla_request.c:208
#5  0x00007fffedbb64e6 in mxm_proto_progress (context=0x782fc0) at mxm/core/mxm.c:89
#6  mxm_progress (context=0x782fc0) at mxm/core/mxm.c:347
#7  0x00007fffedec31a6 in mca_pml_yalla_progress () at pml_yalla.c:290
#8  0x00007ffff765b8c0 in opal_progress () at runtime/opal_progress.c:225
#9  0x00007fffedec3bc8 in mca_pml_yalla_recv (buf=0x0, count=0, datatype=0x601c40, src=1, tag=34532, comm=0x601640, status=0x0) at pml_yalla.c:383
#10 0x00007ffff7cfdd90 in PMPI_Recv (buf=0x0, count=0, type=0x601c40, source=1, tag=34532, comm=0x601640, status=0x0) at precv.c:77
#11 0x0000000000400c63 in threadfunc ()
#12 0x0000003d698079d1 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003d690e8b6d in clone () from /lib64/libc.so.6

So it seems to me that the request changes might be directly related to this hang.

@artpol84
Contributor

I guess wait_sync_update() shouldn't block forever. Probably MXM is holding the lock in the second thread, not letting the main thread proceed.

@artpol84
Contributor

artpol84 commented Jun 23, 2016

Some more info:

In the main thread we are posting a non-blocking receive from rank 0:

(gdb) thr 1
[Switching to thread 1 (Thread 0x7ffff741c700 (LWP 9257))]#0  0x0000003d6980c392 in ?? () from /lib64/libpthread.so.0
(gdb) frame 7
#7  0x0000000000400f30 in main (argc=2, argv=0x7fffffffd2e8) at overlap.c:100
100                 MPI_Irecv(buf2[i], SIZE, MPI_BYTE, src[i], 100, MPI_COMM_WORLD, &req[cnt]);
(gdb) p *src@4
$16 = {0, 3, 2, 1}
(gdb) p i
$17 = 0

In the second thread we are processing a receive from rank 2 (out of order):

(gdb) thr 2
[Switching to thread 2 (Thread 0x7fff94fbb700 (LWP 9302))]#0  0x0000003d6980e264 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000003d6980e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003d6980beb1 in pthread_cond_signal@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#2  0x00007fffedec5de0 in wait_sync_update (sync=0x7fffffffce90, updates=1, status=0) at ../../../../opal/threads/wait_sync.h:91
#3  0x00007fffedec5ed3 in ompi_request_complete (request=0xb2da68, with_signal=true) at ../../../../ompi/request/request.h:423
#4  0x00007fffedec6ba0 in mca_pml_yalla_recv_completion_cb (context=0xb2da68) at pml_yalla_request.c:208
#5  0x00007fffedbb64e6 in mxm_proto_progress (context=0x783b50) at mxm/core/mxm.c:89
#6  mxm_progress (context=0x783b50) at mxm/core/mxm.c:347
#7  0x00007fffedec31a6 in mca_pml_yalla_progress () at pml_yalla.c:290
#8  0x00007ffff765b8c0 in opal_progress () at runtime/opal_progress.c:225
#9  0x00007fffedec3bc8 in mca_pml_yalla_recv (buf=0x0, count=0, datatype=0x601dc0, src=1, tag=34532, comm=0x6017c0, status=0x0) at pml_yalla.c:383
#10 0x00007ffff7cfdd90 in PMPI_Recv (buf=0x0, count=0, type=0x601dc0, source=1, tag=34532, comm=0x6017c0, status=0x0) at precv.c:77
#11 0x0000000000401270 in threadfunc (foo=0x0) at overlap.c:162
#12 0x0000003d698079d1 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003d690e8b6d in clone () from /lib64/libc.so.6
(gdb) frame 3
#3  0x00007fffedec5ed3 in ompi_request_complete (request=0xb2da68, with_signal=true) at ../../../../ompi/request/request.h:423
423                     wait_sync_update(tmp_sync, 1, request->req_status.MPI_ERROR);
(gdb) p *request
$18 = {super = {super = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x7fffee0cc340, obj_reference_count = 1, cls_init_file_name = 0x7ffff7710e0c "class/opal_free_list.c", cls_init_lineno =
    232}, opal_list_next = 0xb2d8f0, opal_list_prev = 0x0, item_free = 0, opal_list_item_refcount = 0, opal_list_item_belong_to = 0x0}, registration = 0x0, ptr = 0x0}, req_type = OMPI_REQUEST_PML,
  req_status = {MPI_SOURCE = 2, MPI_TAG = 100, MPI_ERROR = 0, _cancelled = 0, _ucount = 4194304}, req_complete = 0x1, req_state = OMPI_REQUEST_INVALID, req_persistent = false, req_f_to_c_index = -32766,
  req_free = 0x7fffedec64f0 <mca_pml_yalla_recv_request_free>, req_cancel = 0x7fffedec65a5 <mca_pml_yalla_recv_request_cancel>, req_complete_cb = 0, req_complete_cb_data = 0x0, req_mpi_object = {comm =
    0x6017c0, file = 0x6017c0, win = 0x6017c0}}

It seems that the condition field of the sync object wasn't initialized:

(gdb) frame 2
#2  0x00007fffedec5de0 in wait_sync_update (sync=0x7fffffffce90, updates=1, status=0) at ../../../../opal/threads/wait_sync.h:91
91          WAIT_SYNC_SIGNAL(sync);
(gdb) p sync->condition
$19 = {__data = {__lock = 6299072, __futex = 0, __total_seq = 4194304, __wakeup_seq = 140735661432848, __woken_seq = 140735661432848, __mutex = 0x0, __nwaiters = 1, __broadcast_seq = 0}, __size =
    "\300\035`\000\000\000\000\000\000\000@\000\000\000\000\000\020`\033\223\377\177\000\000\020`\033\223\377\177\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000", __align = 6299072}

__nwaiters == 1, while obviously no thread is waiting on this condition -- this is an out-of-order receive for which no matching recv has even been posted.

@nysal
Member

nysal commented Jun 23, 2016

I think the ompi_wait_sync_t object (and thus also the pthread condition variable) is only initialized if the blocking recv waits on the request via ompi_request_wait_completion(). Yalla seems to bypass this -- I see a PML_YALLA_WAIT_MXM_REQ macro, which seems to indicate as much. Since the condition variable is not initialized, you hang while trying to take the lock on the condition variable's internal mutex.
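
To make the failure mode concrete, here is a simplified, self-contained sketch (hypothetical example_* names, plain pthreads; roughly speaking, the real code also attaches the sync object to the request atomically only after initializing it, so a completer should never see a half-initialized one): the waiter owns initialization of the sync object, and the completer only locks and signals it. If a completion callback reaches the signal path for a sync object that no waiter ever set up -- which is what the backtraces above suggest -- the mutex and condition variable contain garbage and the signal can block forever.

    #include <pthread.h>

    typedef struct example_wait_sync {
        pthread_mutex_t lock;
        pthread_cond_t  condition;
        int             count;    /* completions this waiter still needs */
        int             status;
    } example_wait_sync_t;

    /* Waiter side -- roughly the job of ompi_request_wait_completion():
     * initialize the sync object, then sleep until the count reaches zero. */
    static int example_wait(example_wait_sync_t *sync, int count)
    {
        pthread_mutex_init(&sync->lock, NULL);
        pthread_cond_init(&sync->condition, NULL);
        sync->count  = count;
        sync->status = 0;

        pthread_mutex_lock(&sync->lock);
        while (sync->count > 0) {
            pthread_cond_wait(&sync->condition, &sync->lock);
        }
        pthread_mutex_unlock(&sync->lock);
        return sync->status;
    }

    /* Completer side -- roughly what wait_sync_update() does from a progress
     * callback.  It assumes the waiter already initialized the sync object;
     * if that never happened, the locking/signaling below operates on
     * uninitialized memory and can hang (compare the __lll_lock_wait frames
     * inside pthread_cond_signal in the backtraces above). */
    static void example_update(example_wait_sync_t *sync, int updates, int status)
    {
        pthread_mutex_lock(&sync->lock);
        sync->count -= updates;
        sync->status = status;
        pthread_cond_signal(&sync->condition);
        pthread_mutex_unlock(&sync->lock);
    }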

@hjelmn
Member Author

hjelmn commented Jun 23, 2016

@artpol84 If there is a wait object then either 1) someone is waiting on it, or 2) the req_complete field was not correctly updated when the request was allocated.

@hjelmn
Member Author

hjelmn commented Jun 23, 2016

Hmm, or I am wrong. There is no ompi_request_t in the blocking send path... The request being completed is from an irecv.

@hjelmn
Member Author

hjelmn commented Jun 23, 2016

@artpol84 Thanks for digging into this. Whatever the problem is, I have to assume it affects master too. If not, then there might be a commit missing. I haven't been able to find one, but I am not looking at the entire code base.

@hjelmn
Member Author

hjelmn commented Jun 23, 2016

@jsquyres Since this PR has nothing to do with the pml/yalla problem we should go ahead and merge.

@artpol84
Contributor

I was unable to test v2.x without this PR, and the problem is with requests, which is what this PR touches.
I plan to do that tomorrow -- can we wait until then?

@hjelmn
Member Author

hjelmn commented Jun 23, 2016

@artpol84 All PRs are failing on 2.x, including one that had nothing to do with requests.

@jsquyres
Member

@hppritcha I vote for merging this PR.

@jladd-mlnx @artpol84 Do you guys want to wait for v2.0.1 for yalla threading fixes? Or can you get a fix in the very near term?

@jsquyres
Member

Please continue the discussion about the overlap test on open-mpi/ompi#1813. Discussion on this PR should now be about the performance regression fix. Thanks.

@artpol84
Contributor

artpol84 commented Jun 23, 2016

@hjelmn @jsquyres

I disagree that "All PRs are failing on 2.x" in the same way. For example, issue #1237 segfaults right at the start.

http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1796/console

18:54:44 + taskset -c 0,1 timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/bin/mpirun -np 4 -bind-to core -mca btl_openib_receive_queues P,65536,256,192,128:S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64 -mca btl_openib_cpc_include rdmacm -mca pml '^ucx' -mca btl self,openib -mca btl_if_include mlx4_0:2 /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/thread_tests/thread-tests-1.1/overlap 8
18:54:45 Time per iteration on each process (ms)
18:54:45 Time    Compute time    Comm time
18:54:47 [jenkins01:10915:0] Caught signal 11 (Segmentation fault)
18:54:47 ==== backtrace ====
18:54:47  2 0x000000000005a42c mxm_handle_error()  /hpc/local/benchmarks/hpc-stack-gcc-Monday/src/install/mxm-master/src/mxm/util/debug/debug.c:641
18:54:47  3 0x000000000005a59c mxm_error_signal_handler()  /hpc/local/benchmarks/hpc-stack-gcc-Monday/src/install/mxm-master/src/mxm/util/debug/debug.c:616
18:54:47  4 0x0000003d690329a0 killpg()  ??:0
18:54:47  5 0x0000000000040589 hmca_coll_ml_barrier_launch()  coll_ml_barrier.c:0
18:54:47  6 0x00000000000410a9 hmca_coll_ml_barrier_intra()  ??:0
18:54:47  7 0x0000000000008a2a mca_coll_hcoll_barrier()  /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi/mca/coll/hcoll/coll_hcoll_ops.c:28
18:54:47  8 0x0000000000078ca6 PMPI_Barrier()  /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi/mpi/c/profile/pbarrier.c:61
18:54:47  9 0x0000000000401009 main()  ??:0
18:54:47 10 0x0000003d6901ed1d __libc_start_main()  ??:0
18:54:47 11 0x0000000000400b69 _start()  ??:0
18:54:47 ===================
18:54:47 --------------------------------------------------------------------------
18:54:47 mpirun noticed that process rank 3 with PID 0 on node jenkins01 exited on signal 11 (Segmentation fault).
18:54:47 --------------------------------------------------------------------------

http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1793/console

01:03:32 + taskset -c 6,7 timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/mpirun -np 4 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -x UCX_NET_DEVICES=mlx4_0:1 -x UCX_TLS=rc,cm -mca pml yalla /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/thread_tests/thread-tests-1.1/overlap 8
01:03:32 Time per iteration on each process (ms)
01:03:32 Time    Compute time    Comm time
01:03:34 [jenkins01:12443:0] Caught signal 11 (Segmentation fault)
01:03:34 --------------------------------------------------------------------------
01:03:34 mpirun noticed that process rank 1 with PID 0 on node jenkins01 exited on signal 11 (Segmentation fault).
01:03:34 --------------------------------------------------------------------------

And for this issue, overlap is hanging (see the timestamps):

FAIL #1:

01:29:27 + taskset -c 8,9 timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/mpirun -np 4 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -x UCX_NET_DEVICES=mlx4_0:1 -x UCX_TLS=rc,cm -mca pml yalla /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/thread_tests/thread-tests-1.1/overlap 8
01:29:28 Time per iteration on each process (ms)
01:29:28 Time    Compute time    Comm time
01:39:27 [jenkins01:18732] *** Process received signal ***
01:39:27 mpirun: Forwarding signal 18 to job
01:39:27 [jenkins01:18732] Signal: Segmentation fault (11)
01:39:27 [jenkins01:18732] Signal code:  (0)

FAIL #2:

09:45:38 + taskset -c 0,1 timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/mpirun -np 4 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -x UCX_NET_DEVICES=mlx4_0:1 -x UCX_TLS=rc,cm -mca pml yalla /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/thread_tests/thread-tests-1.1/overlap 8
09:45:38 Time per iteration on each process (ms)
09:45:38 Time    Compute time    Comm time
09:55:38 [jenkins01:10210] *** Process received signal ***
09:55:38 [jenkins01:10210] Signal: Segmentation fault (11)
09:55:38 [jenkins01:10210] Signal code:  (0)

@hjelmn
Member Author

hjelmn commented Jun 23, 2016

@artpol84 All failures involve yalla from what I can tell, so they are probably all from the same bug. Note that the one that should be using openib has -mca pml ^ucx, so if yalla has a higher priority than ob1 it is still using yalla.

@hjelmn
Member Author

hjelmn commented Jun 23, 2016

Yup, yalla priority is 50. So it wins. All failures are pml/yalla.

@jsquyres
Member

@hjelmn @artpol84 Please move your discussion to open-mpi/ompi#1813.

@hppritcha
Member

I'll give the Mellanox Jenkins another chance, then merge unless @jsquyres objects.
bot:retest

@hppritcha
Member

Discussion of the Mellanox Jenkins failure indicates it's very unlikely to be related to the code changes from this PR. Merging.

@hppritcha hppritcha merged commit 9ea5bf7 into open-mpi:v2.x Jun 23, 2016
