server: Fix PMIx_Server_Finalize hang #1246
Closing #1244
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/02c8b7101461f37c6993db6555457602
The IBM CI (Cross-version) build failed! Please review the log, linked below. Gist: https://gist.github.com/291e4f15bbed4c7d0024ee8374eaf060
The IBM CI (GNU Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/3b405f52a58603976344888da7fe3bba
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/40aacd251cd11254413016bdf8f72a03
The IBM CI (Cross-version) build failed! Please review the log, linked below. Gist: https://gist.github.com/5f6fa344e3df212ab3510463ccd2be1c
I'm wondering if we might not be using libevent correctly here. This is from their manual:
So as I read this, it appears to me that we can do one of the following:
Of the two, I'm leaning towards the first as they explicitly make the point about adding events from other threads. Thoughts?
FWIW: the reference manual is here: http://www.wangafu.net/~nickm/libevent-book/Ref3_eventloop.html
Needs to be ported into v3.1, v3.0, v2.2, v2.1
I think the issue here is that the loop may not be active at the time loopbreak is called (as depicted in my picture). If the loop is not active, there is nothing to exit/break.
Ok, I see now about the first one.
kewl - thanks, both for chasing this down (which must have been a @!$#$!#@ to do) and for the fix.
Agreed - really surprising that this hasn't bit us in 15 years of OMPI (though we have had some strange times when |
I have updated the PR. |
one sec, will need to fix something |
The IBM CI (Cross-version) build failed! Please review the log, linked below. Gist: https://gist.github.com/c3307c42d77e4fa8fad1535e4dd494d7
The IBM CI (GNU Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/cd9a5d8488b954e38bef8fd0cf562fb2
The IBM CI (XL Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/3557b2edb70fb93a927672e8675b69fe
The IBM CI (GNU Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/76cb30da203a75804a954f879d792e97
Seems like "EVLOOP_NO_EXIT_ON_EMPTY" was only introduced in the libevent 2.1.x series.
Checking how loopexit will work |
a7fdfac to 2a4bbbf
loopexit seems to help as well.
I've got 1000 iters of make check with this + #1240 |
The hang was quite rare and appears to be the result of a race condition between the PMIx progress thread and the main thread calling PMIx_Server_finalize.

The following sequence is possible:

| main thread   | Progress thread   |
|               | while(ev_active){ |
| ev_active=0   |                   |
| ev_break_loop |                   |
|               | ev_loop()         |

According to the libevent manual, in this situation libevent ignores ev_break_loop because the loop was not running at the time ev_break_loop() was called (see (b) in the libevent excerpt below). The progress thread then enters the loop and hangs.

To fix this, use event_base_loopexit, which has the desired behavior (see (a) in the excerpt below).

**excerpt from the libevent manual**:
```
...
Note also that event_base_loopexit(base,NULL) and event_base_loopbreak(base)
act differently when no event loop is running:
(a) loopexit schedules the next instance of the event loop to stop right
    after the next round of callbacks are run (as if it had been invoked
    with EVLOOP_ONCE)
(b) whereas loopbreak only stops a currently running loop, and has no
    effect if the event loop isn’t running.
...
```

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
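For illustration only, the interleaving described above can be replayed deterministically with a small toy model. This is a sketch, not libevent and not the PMIx code: `toy_base_t`, `toy_loopbreak`, `toy_loopexit`, `toy_loop_returns` and `finalize_race_hangs` are hypothetical names, and the model assumes only the two behaviors quoted from the manual, namely that a break request is dropped when no loop is running, while an exit request is remembered for the next loop instance.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, NOT libevent: it mirrors only the (a)/(b) semantics quoted
 * from the manual. All names here are hypothetical. */
typedef struct {
    bool running;          /* true while the toy loop is executing */
    bool break_requested;  /* set by toy_loopbreak() on a running loop */
    bool exit_scheduled;   /* set by toy_loopexit(), kept for the next loop */
} toy_base_t;

/* Models (b): has no effect if the loop is not currently running. */
static void toy_loopbreak(toy_base_t *base)
{
    if (base->running) {
        base->break_requested = true;
    }
}

/* Models (a): the request is remembered for the next loop instance. */
static void toy_loopexit(toy_base_t *base)
{
    base->exit_scheduled = true;
}

/* Returns true if the loop returns, false if it would block forever.
 * No event ever fires in this model, so only a stop request ends it. */
static bool toy_loop_returns(toy_base_t *base)
{
    bool returns;

    base->running = true;
    returns = base->break_requested || base->exit_scheduled;
    base->break_requested = false;
    base->exit_scheduled = false;
    base->running = false;
    return returns;
}

/* Replays the racy interleaving in a single thread.
 * Returns true if the progress thread would hang. */
bool finalize_race_hangs(bool use_loopexit)
{
    toy_base_t base = { false, false, false };
    bool ev_active = true;

    /* Progress thread: evaluates while(ev_active) and sees true. */
    bool progress_enters_loop = ev_active;
    assert(progress_enters_loop);

    /* Main thread: clears the flag, then asks the loop to stop. */
    ev_active = false;
    if (use_loopexit) {
        toy_loopexit(&base);
    } else {
        toy_loopbreak(&base);
    }

    /* Progress thread: only now enters the loop. */
    return !toy_loop_returns(&base);
}
```

In this model, `finalize_race_hangs(false)` reports a hang because the break request arrives while no loop is running, and `finalize_race_hangs(true)` does not, because the exit request is still pending when the loop starts.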
@karasevb, please cherry-pick to all release branches starting from v2.0