Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: Fix PMIx_Server_Finalize hang #1246

Merged
merged 1 commit into from
May 14, 2019
Merged

Conversation

artpol84
Copy link
Contributor

@artpol84 artpol84 commented May 10, 2019

The hang was quite rare and appears as the result of the race condition
between PMIx progress thread and main thread calling
PMIx_Server_finalize.

The following sequence is possible:

| main thread  | Progress thread   |
|              | while(ev_active){ |
|ev_active=0   |                   |
|ev_break_loop |                   |
|              | ev_loop()         |

According to libevent manual, in this situation, libevent will
ignore ev_break_loop as it wasn't in the loop at the time
ev_break_loop() was called (see (b) in the libevent excerpt below)
So the progress thread will enter the loop and hang.

To fix this use event_base_loopexit that have desired behavior
(See section (a) of the excerpt below)

excerpt from the libevent manual:

...
Note also that event_base_loopexit(base,NULL) and event_base_loopbreak(base)
act differently when no event loop is running:

(a) loopexit schedules the next instance of the event loop to stop right
after the next round of callbacks are run (as if it had been invoked with
EVLOOP_ONCE)

(b) whereas loopbreak only stops a currently running loop, and has no
effect if the event loop isn’t running.
...

@artpol84 artpol84 requested review from rhc54 and jjhursey May 10, 2019 23:43
@artpol84 artpol84 added the bug label May 10, 2019
@artpol84
Copy link
Contributor Author

Closing #1244

@ibm-ompi
Copy link

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/02c8b7101461f37c6993db6555457602

@ibm-ompi
Copy link

The IBM CI (Cross-version) build failed! Please review the log, linked below.

Gist: https://gist.github.com/291e4f15bbed4c7d0024ee8374eaf060

@ibm-ompi
Copy link

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/3b405f52a58603976344888da7fe3bba

@ibm-ompi
Copy link

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/40aacd251cd11254413016bdf8f72a03

@ibm-ompi
Copy link

The IBM CI (Cross-version) build failed! Please review the log, linked below.

Gist: https://gist.github.com/5f6fa344e3df212ab3510463ccd2be1c

@rhc54
Copy link
Contributor

rhc54 commented May 11, 2019

I'm wondering if we might not be using libevent correctly here. This is from their manual:

Ordinarily, the loop will exit as soon as it has no pending or active events. You can override this
behavior by passing the EVLOOP_NO_EXIT_ON_EMPTY flag---for example, if you’re going to be
adding events from some other thread. If you do set EVLOOP_NO_EXIT_ON_EMPTY, the loop
will keep running until somebody calls event_base_loopbreak(), or calls event_base_loopexit(),
or an error occurs.

Note also that event_base_loopexit(base,NULL) and event_base_loopbreak(base) act differently
when no event loop is running: loopexit schedules the next instance of the event loop to stop
right after the next round of callbacks are run (as if it had been invoked with EVLOOP_ONCE)
whereas loopbreak only stops a currently running loop, and has no effect if the event loop isn’t
running.

So as I read this, it appears to me that we can do one of the following:

  • pass the EVLOOP_NO_EXIT_ON_EMPTY flag so the event loop remains active as we definitely add events from other threads, and continue to use event_base_loopbreak to exit it since the loop will always be active; or

  • leave the current libevent flag but switch to using event_base_loopexit to stop the loop by ensuring it becomes active if it currently isn'g

Of the two, I'm leaning towards the first as they explicitly make the point about adding events from other threads. Thoughts?

@rhc54
Copy link
Contributor

rhc54 commented May 11, 2019

FWIW: the reference manual is here: http://www.wangafu.net/~nickm/libevent-book/Ref3_eventloop.html

@artpol84
Copy link
Contributor Author

Needs to be ported into v3.1, v3.0, v2.2, v2,1

@artpol84
Copy link
Contributor Author

I think the issue here is that loop may not be active at the time loop break is being called (as I depicted on my picture).

If loop is not active - there is nothing to exit/break.

@artpol84
Copy link
Contributor Author

Ok, I see now about the first one.
Yes, we can do that instead.

@rhc54
Copy link
Contributor

rhc54 commented May 11, 2019

kewl - thanks, both for chasing this down (which must have been a @!$#$!#@ to do) and for the fix.

@artpol84
Copy link
Contributor Author

loopbreak only stops a currently running loop, and has no effect if the event loop isn’t running.
Looks like this is the reason. this confirms my assumption in the description.

@rhc54
Copy link
Contributor

rhc54 commented May 11, 2019

Agreed - really surprising that this hasn't bit us in 15 years of OMPI (though we have had some strange times when mpirun failed to exit...hmmm...). Will have to port this to them.

@artpol84
Copy link
Contributor Author

I have updated the PR.
Will test it and post the results

@artpol84
Copy link
Contributor Author

one sec, will need to fix something

@ibm-ompi
Copy link

The IBM CI (Cross-version) build failed! Please review the log, linked below.

Gist: https://gist.github.com/c3307c42d77e4fa8fad1535e4dd494d7

@ibm-ompi
Copy link

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/cd9a5d8488b954e38bef8fd0cf562fb2

@ibm-ompi
Copy link

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/3557b2edb70fb93a927672e8675b69fe

@ibm-ompi
Copy link

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/76cb30da203a75804a954f879d792e97

@artpol84
Copy link
Contributor Author

Seems like "EVLOOP_NO_EXIT_ON_EMPTY" is only introduced in libevent 2.1.x series.
I don't see it in 2.0.x

@artpol84
Copy link
Contributor Author

Checking how loopexit will work

@artpol84 artpol84 force-pushed the fix/srv_fini branch 2 times, most recently from a7fdfac to 2a4bbbf Compare May 11, 2019 02:45
@artpol84
Copy link
Contributor Author

exitloop seems to help as well.
I suggest keeping this solution as it is portable to previous

@artpol84
Copy link
Contributor Author

I've got 1000 iters of make check with this + #1240

The hang was quite rare and appears as the result of the race condition
between PMIx progress thread and main thread calling
PMIx_Server_finalize.

The following sequence is possible:

| main thread   | Progress thread   |
|               | while(ev_active){ |
| ev_active=0   |                   |
| ev_break_loop |                   |
|               | ev_loop()         |

According to libevent manual, in this situation, libevent will
ignore ev_break_loop as it wasn't in the loop at the time
ev_break_loop() was called (see (b) in the libevent excerpt below)
So the progress thread will enter the loop and hang.

To fix this use event_base_loopexit that have desired behavior
(See section (a) of the excerpt below)

**excerpt from the libevent manual**:
```
...
Note also that event_base_loopexit(base,NULL) and event_base_loopbreak(base)
act differently when no event loop is running:

(a) loopexit schedules the next instance of the event loop to stop right
after the next round of callbacks are run (as if it had been invoked with
EVLOOP_ONCE)

(b) whereas loopbreak only stops a currently running loop, and has no
effect if the event loop isn’t running.
...
```

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
@rhc54 rhc54 merged commit 7f8d375 into openpmix:master May 14, 2019
@artpol84
Copy link
Contributor Author

@karasevb , please cherrypick to all release branches starting from v2.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants