ompi/request: fix performance regression #1240
This commit fixes a performance regression introduced by the request rework. We were always using the multi-thread path because OPAL_ENABLE_MULTI_THREADS is either not defined or always defined to 1, depending on the Open MPI version. To fix this I removed the conditional and added a conditional on opal_using_threads(). This path will be optimized out in 2.0.0 in a non-thread-multiple build, as opal_using_threads is #defined to false in that case.

Fixes open-mpi/ompi#1806

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit open-mpi/ompi@544adb9)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
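In code terms, the change described above boils down to the following pattern (an illustrative sketch under the assumption that the hot path decrements a pending-request counter; it is not the verbatim patch, and the header paths are best guesses):

```c
#include <stdint.h>
#include "opal/threads/thread_usage.h"   /* opal_using_threads() */
#include "opal/sys/atomic.h"             /* opal_atomic_add_32() */

/* Before (sketch): a compile-time guard chose the path.  Since
 * OPAL_ENABLE_MULTI_THREADS is undefined or always 1 (version dependent),
 * every build ended up on the multi-thread/atomic path. */
static inline void complete_before_sketch (volatile int32_t *pending)
{
#if OPAL_ENABLE_MULTI_THREADS
    opal_atomic_add_32 (pending, -1);    /* always taken in practice */
#else
    *pending -= 1;                       /* effectively dead code */
#endif
}

/* After (sketch): decide at run time.  In a 2.0.0 build without
 * MPI_THREAD_MULTIPLE support, opal_using_threads() is #defined to false,
 * so the compiler still drops the atomic branch entirely. */
static inline void complete_after_sketch (volatile int32_t *pending)
{
    if (opal_using_threads ()) {
        opal_atomic_add_32 (pending, -1);
    } else {
        *pending -= 1;
    }
}
```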
:bot🏷️bug
Something has gone wrong (error 422). @jsquyres Please have a look at it!
Very strange -- GitHub won't allow assigning this issue to @nysal, even though he's got all the right permissions...
Test PASSed.
@hjelmn This does improve things a bit for me. I had a question about opal/threads/wait_sync.h: shouldn't those guards be OMPI_ENABLE_THREAD_MULTIPLE instead of OPAL_ENABLE_MULTI_THREADS? As it stands, the single-threaded versions of SYNC_WAIT etc. will never be called. SYNC_WAIT is used in the other variants of the MPI_Wait* calls. I'm still seeing a little bit of overhead in opal_progress(), but that's likely a different issue.
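For context, the guard being questioned has roughly this shape (reconstructed from memory of opal/threads/wait_sync.h; treat the exact macro and function names as assumptions rather than a verbatim quote):

```c
/* Approximate shape of the guard in opal/threads/wait_sync.h.  Because
 * OPAL_ENABLE_MULTI_THREADS is always 1, the single-threaded definition in
 * the #else branch is never selected; the point above is that the guard
 * should key off MPI_THREAD_MULTIPLE support (or simply rely on
 * opal_using_threads()) so the cheap path can actually be used. */
#if OPAL_ENABLE_MULTI_THREADS
#define SYNC_WAIT(sync)  (opal_using_threads () ? sync_wait_mt (sync) : sync_wait_st (sync))
#else
#define SYNC_WAIT(sync)  sync_wait_st (sync)
#endif
```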
@nysal Yeah, I have a fix for that as well. It is slightly lower priority, and I will open a PR for it once it is tested. I'm willing to leave that for 2.0.1 if I don't finish testing today.
@hjelmn Adding that to v2.0.1 is fine with me. By the way, my opal_progress() overhead seems to stem from the low-priority callback path. Should that code be guarded by an "if (callbacks_lp_len > 0)" check? I think only the BTLs register a low-priority callback at the moment. The atomic increment (or store if single-threaded) can be avoided if we are running with PMLs that don't have any low-priority callbacks. I understand this is a separate issue, but I thought I'd bring it up here. This PR looks good to me 👍
@nysal Hmm, yeah. I will add the extra check to the conditional. We will only have low-priority callbacks in a very small number of situations; they are meant to allow progress for connections.
The OPAL_ENABLE_MULTI_THREADS macro is always defined as 1. This was causing us to always use the multi-thread path for synchronization objects. The code has been updated to use the opal_using_threads() function. When MPI_THREAD_MULTIPLE support is disabled at build time (2.x only) this function is a macro evaluating to false, so the compiler will optimize out the MT path in this case. The OPAL_ATOMIC_ADD_32 macro has been removed and replaced by the existing OPAL_THREAD_ADD32 macro.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from open-mpi/ompi@143a93f)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
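At a call site, the macro swap named in this commit looks roughly like the sketch below (`num_active_reqs` is a hypothetical counter introduced only for illustration, and the summary of what OPAL_THREAD_ADD32 expands to is a paraphrase, not a quote):

```c
#include <stdint.h>
#include "opal/threads/thread_usage.h"   /* OPAL_THREAD_ADD32 -- header path is a best guess */

static volatile int32_t num_active_reqs = 0;   /* hypothetical counter */

static inline void track_request_sketch (void)
{
    /* Old call site (always an atomic read-modify-write, even when running
     * single-threaded):
     *     OPAL_ATOMIC_ADD_32(&num_active_reqs, 1);
     * New call site: OPAL_THREAD_ADD32 only uses the atomic when
     * opal_using_threads() is true, otherwise it is a plain add. */
    OPAL_THREAD_ADD32 (&num_active_reqs, 1);
}
```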
@nysal Fixed a couple more regressions, and this should be good to go now. Please re-review and +1 this.
This commit adds another check to the low-priority callback conditional that short-circuits the atomic add if there are no low-priority callbacks. This should improve performance in the common case.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from open-mpi/ompi@e4f920f)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
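Concretely, the short-circuit described here looks something like the sketch below (the bookkeeping variables and the run-every-Nth-call cadence are stand-ins reconstructed from the discussion above, not quotes from opal_progress()):

```c
#include <stddef.h>
#include <stdint.h>
#include "opal/threads/thread_usage.h"   /* OPAL_THREAD_ADD32 */

typedef int (*progress_cb_t) (void);

/* Hypothetical stand-ins for the real opal_progress() bookkeeping. */
static progress_cb_t    callbacks_lp[8];
static size_t           callbacks_lp_len = 0;
static volatile int32_t num_calls = 0;

static int progress_lp_sketch (void)
{
    int events = 0;

    /* The added check: skip the counter update (an atomic when threads are
     * on) and the callback loop entirely when no low-priority callbacks are
     * registered -- the common case, since only some BTLs register one to
     * progress connections. */
    if (callbacks_lp_len > 0) {
        /* Run the low-priority callbacks only every so often; the exact
         * cadence here is an assumption, not the real heuristic. */
        if (0 == (OPAL_THREAD_ADD32 (&num_calls, 1) & 0x7)) {
            for (size_t i = 0; i < callbacks_lp_len; ++i) {
                events += callbacks_lp[i] ();
            }
        }
    }
    return events;
}
```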
Force-pushed from 45706ca to 6711e0f.
Test FAILed.
Build Failed with GNU compiler! Please review the log, and get in touch if you have questions.
Test FAILed.
Build Failed with GNU compiler! Please review the log, and get in touch if you have questions.
@hjelmn The code looks fine to me. However, I see some compiler warnings introduced in the Jenkins build logs.
Test PASSed.
@nysal Things look as clean as they normally are on the Mellanox Jenkins. Can you give some examples?
The IBM-CI (GNU Compiler) failure was a network problem (it timed out; pulling the repo from GitLab was very slow for some reason). Feel free to ask it to retest, but if the XL compiler passed, the GNU build will likely pass cleanly too.
:bot:retest:
Test FAILed.
False failure from the Mellanox Jenkins.
not false, real:
21:39:25 ++ /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/mpirun -np 2 -tune /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/test_tune.conf -mca mca_base_env_list 'XXX_A=7;XXX_B=8' /scrap/jenkins/jenkins/jobs/gh-ompi-release-pr/workspace/jenkins_scripts/jenkins/ompi/env_mpi
21:39:25 ++ sed -e ':a;N;$!ba;s/\n/+/g'
21:39:25 ++ bc
21:39:25 [jenkins01:14395] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
21:39:25 [jenkins01:14395] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
21:39:25 + val=59
21:39:25 + '[' 59 -ne 54 ']'
21:39:25 + exit 1
@miked-mellanox It is false. A CPC could not be found. Please look into it.
@hjelmn Travis is taking forever on its Mac builds, but it already failed the Linux compiles -- you might want to check them out.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit open-mpi/ompi@55d1933)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Test FAILed.
bot:retest
Test FAILed.
timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-4/ompi_install1/bin/mpirun -np 4 -bind-to core -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -mca pml yalla /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-4/ompi_install1/thread_tests/thread-tests-1.1/overlap 8
BTW, is the following flag still relevant for MT? --enable-opal-multi-threads
Test FAILed.
This PR addresses ompi request logic, and I see hangs in overlap with the following backtraces. In the overlap test we have 2 threads; the second thread is waiting in the blocking MPI_Recv. So it seems to me that the request changes might directly affect this hang.
I guess that
Some more info: in the main thread we are posting a non-blocking receive from rank = 0; in the second thread we are processing a receive from rank = 2 (out-of-order). It seems that
I think the ompi_wait_sync_t object is only initialized (and thus so is the pthread condition variable) if the blocking recv waits on the request via ompi_request_wait_completion(). Yalla seems to bypass this? I see a PML_YALLA_WAIT_MXM_REQ which seems to indicate this. Since the condition variable is not initialized, you hang while trying to take a lock on the condition variable's internal mutex.
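For anyone following along, the completion path being described looks roughly like this (a sketch of the ompi_request_wait_completion() pattern reconstructed from the wait_sync API; macro names, fields, and header paths are from memory and should be checked against the tree):

```c
#include "ompi/request/request.h"        /* ompi_request_t, REQUEST_COMPLETE, REQUEST_PENDING */
#include "opal/threads/wait_sync.h"      /* ompi_wait_sync_t, WAIT_SYNC_*, SYNC_WAIT */
#include "opal/runtime/opal_progress.h"  /* opal_progress() */
#include "opal/sys/atomic.h"             /* OPAL_ATOMIC_CMPSET_PTR */

/* Sketch of the blocking-wait path: the ompi_wait_sync_t (and its pthread
 * condition variable) only gets initialized here.  A PML that spins on its
 * own completion flag (e.g. yalla's PML_YALLA_WAIT_MXM_REQ path) never
 * creates a sync object, so nothing should ever try to lock its mutex. */
static inline void wait_completion_sketch (ompi_request_t *req)
{
    if (opal_using_threads () && !REQUEST_COMPLETE (req)) {
        ompi_wait_sync_t sync;
        WAIT_SYNC_INIT (&sync, 1);                 /* mutex/condvar initialized here */

        /* Publish the sync object on the request so the completion path can
         * signal it; the real code does this with an atomic compare-and-swap
         * on req->req_complete. */
        if (OPAL_ATOMIC_CMPSET_PTR (&req->req_complete, REQUEST_PENDING, &sync)) {
            SYNC_WAIT (&sync);                     /* sleep on the condvar */
        } else {
            /* The request completed before we could publish the sync object;
             * nothing to wait for. */
        }
        WAIT_SYNC_RELEASE (&sync);
    } else {
        while (!REQUEST_COMPLETE (req)) {
            opal_progress ();                      /* single-threaded: just poll */
        }
    }
}
```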
@artpol84 If there is a wait object then either 1) someone is waiting on it, or 2) the req_complete field was not correctly updated when the request was allocated.
Hmm, or I am wrong. There is no ompi_request_t in the blocking send path... The request being completed is from an irecv.
@artpol84 Thanks for digging into this. Whatever the problem is, I have to assume it affects master too. If not, then there might be a commit missing. I haven't been able to find one, but I am not looking at the entire code base.
@jsquyres Since this PR has nothing to do with the pml/yalla problem, we should go ahead and merge.
I was unable to test v2.x without this PR, and the problem is with requests, which is related to this PR.
@artpol84 All PRs are failing on 2.x, including one that had nothing to do with requests.
@hppritcha I vote for merging this PR. @jladd-mlnx @artpol84 Do you guys want to wait for v2.0.1 for the yalla threading fixes? Or can you get a fix in the very near term?
Please continue the discussion about the overlap test on open-mpi/ompi#1813. Discussion on this PR should now be about the performance regression fix. Thanks.
I disagree that "All PRs are failing on 2.x" in the same way. For example, issue #1237 segfaults right at the start: http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1796/console http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/1793/console And for this issue overlap is hanging (see the timestamps): FAIL #1: FAIL #2:
@artpol84 All failures involve yalla from what I can tell, so they are all probably from the same bug. Note the one that should be using openib has
Yup, yalla's priority is 50, so it wins. All failures are pml/yalla.
@hjelmn @artpol84 Please move your discussion to open-mpi/ompi#1813.
I'll give mlnx jenkins another chance, then merge unless @jsquyres objects.
Discussion of the mlnx jenkins failure indicates it's very unlikely to be related to the code changes from this PR. Merging.