Use eventfd_read to close EpollEventLoop shutdown/wakeup race #9476
Conversation
Can one of the admins verify this patch?
    // Cap number of iterations to avoid indefinite spin in the case of a failed write.
    for (int i = 0; eventFdWriteCount != 0L && i < 10000; i++) {
        eventFdWriteCount -= Native.eventFdRead(eventFd.intValue());
    }
I thought spinning on read here would be simpler/safer than using epoll; it should rarely ever actually happen. Polling is still an option if others think that's preferable, though we would probably want to use timerfd as an equivalent safeguard.
Btw by "polling" in the comment above I meant a blocking wait using epoll_wait... sorry if that was misleading.
I was considering calling epollWait here (with a timeout, sure; and we'd need to do epollFd.close() after this). That by itself would be broken, because epollWaitNow() is called from closeAll(). But I can't figure out why that call to epoll is even there. The best idea I have is that it became vestigial via d63c9f2; previously the ready return value was used.
Oh yeah I'd wondered about the purpose of that epollWaitNow in closeAll, hard to imagine it has one now... hopefully @normanmaurer could confirm your suspicion that it's obsolete.
You seemed to imply that epollWait by itself would be sufficient (apologies if misinterpreted). But I think the write/read accounting is still needed, i.e. we would still be calling eventFdRead in a loop here, just that it would be doing a blocking wait each iteration rather than spinning.
This is because wakenUp by itself can't tell us whether a write is still outstanding. Even if we kept track of whether eventfd was in the event array populated by the most recent epollWait call, it's possible that there could be more than one outstanding write (from multiple prior EL iterations), and so receiving one last edge wouldn't be conclusive. Do you agree?
But while trying to think about what you may have been proposing, I realized that if careful we could in fact do this just by tracking the edges, and so avoid having to read the eventfd after all. Maybe that's what you had in mind and again apologies if so! The key thing is to make sure we don't reset wakenUp before waiting during normal loop operation in the case we know a wakeup write is pending... that way we know there can't be more than one outstanding at a time.
I have made the corresponding changes here e3305bdcfb5a5673c8a0a32a6d754e11b912d955, and actually think it's a superior option. Let me know what you think and I can update this PR accordingly or open another.
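To make the edge-tracking idea concrete, here is a minimal runnable sketch. It is not Netty's actual code: the class and method names are hypothetical, and an AtomicLong stands in for the kernel eventfd counter so the invariant can be exercised without JNI. The point it demonstrates is that keeping wakenUp at 1 while a wakeup write is still pending bounds the number of in-flight eventfd writes to at most one.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative model of the edge-tracking scheme (names are hypothetical,
// not Netty's); an AtomicLong stands in for the kernel eventfd counter.
class EdgeTrackingLoop {
    final AtomicInteger wakenUp = new AtomicInteger(); // shared with waker threads
    final AtomicLong eventFd = new AtomicLong();       // models the kernel eventfd counter
    boolean pendingWakeup;                             // event-loop-local state

    /** Waker side: only the CAS winner writes, so writes map 1-1 to CAS wins. */
    void wakeup() {
        if (wakenUp.compareAndSet(0, 1)) {
            eventFd.incrementAndGet();                 // models eventfd_write(fd, 1)
        }
    }

    /**
     * One run-loop iteration. whileBlocked stands in for wakeups racing with
     * epoll_wait; eventFdInReadyList says whether epoll_wait reported the
     * eventfd among the ready events.
     */
    void loopIteration(Runnable whileBlocked, boolean eventFdInReadyList) {
        if (!pendingWakeup) {
            wakenUp.set(0);                            // safe: no write is in flight
        }
        whileBlocked.run();                            // "blocked in epoll_wait" here
        if (wakenUp.getAndSet(1) == 1 && !eventFdInReadyList) {
            pendingWakeup = true;                      // a write happened but its edge wasn't seen yet
        }
        if (eventFdInReadyList) {
            eventFd.decrementAndGet();                 // models consuming the edge
            pendingWakeup = false;
        }
    }
}
```

While pendingWakeup is true, wakenUp is never reset to 0, so every subsequent wakeup() loses the CAS and cannot start a second write; the flag is only cleared once the one outstanding edge has been observed.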
> You seemed to imply that epollWait by itself would be sufficient (apologies if misinterpreted). But I think the write/read accounting is still needed

Yes, that's what I was thinking. And yes, just you saying this made me realize that wakeups could overlap in a pathological case. Crap.

> Even if we kept track of whether eventfd was in the event array populated by the most recent epollWait call, it's possible that there could be more than one outstanding write (from multiple prior EL iterations) and so receiving one last edge wouldn't be conclusive. Do you agree?

Yes, I agree.

> But while trying to think about what you may have been proposing I realized that if careful we could in fact do this just by tracking the edges, and so avoid having to read the eventfd after all.

Tell me more!

> The key thing is to make sure we don't reset wakenUp before waiting during normal loop operation in the case we know a wakeup write is pending... that way we know there can't be more than 1 outstanding at a time.

Beautiful! I will say I was looking hard at wakenUp, but I hadn't considered anything like this. Honestly, that may make the code easier for us overall, since that sounds less likely to have hidden wakeup races.
Thanks @ejona86... agree that we're likely hosed one way or another in this situation, but I was trying to guard against a thread getting stuck in epoll_wait indefinitely. Apart from a write itself failing, I wasn't sure if there could be a failure somehow during processReady such that we miss the eventfd and don't reset the pendingWakeup flag. I guess that could also be dealt with by having a try/catch-all around each iteration.
You're right that it's also a small optimization in that we can skip a hasTasks() check and the timerfd adjustment in this case (the latter of which might involve a syscall).
@njhill I think the epollWaitNow() in closeAll can just be removed... Most likely it was just ported as we also do a selectNow() in the NIO transport.
@normanmaurer let me know if you agree with opening another PR based on e3305bd to replace this one per the above discussion
@njhill sounds good.
@ejona86 @normanmaurer have now opened #9535 which is just e3305bd rebased.
@netty-bot test this please
@netty-bot test this please
Force-pushed from ff03ee7 to eaea12b
Force-pushed from ff73521 to 8527908
@netty-bot test this please
Motivation

@carl-mastrangelo discovered a non-hypothetical race condition during EpollEventLoop shutdown where wakeup writes can complete after the eventFd has been closed and subsequently reassigned by the kernel. This fix is an alternative to netty#9388 which uses eventfd_read to hopefully close the gap completely, and doesn't involve an additional CAS during wakeup.

Modification

After waking from epollWait, CAS the wakenUp atomic from 0 to 1. The times that a value of 1 is encountered here (CAS fail) correspond 1-1 with prior CAS wins by other threads in the wakeup(...) method, which correspond 1-1 with eventfd_write(1) calls (even if the most recent write is yet to happen). Thus we can locally maintain a precise total count of those writes (eventFdWriteCount) which will be constant while the EL is awake - no further writes can happen until we reset wakenUp back to 0.

Since eventfd is a counter, when shutting down we just need to read from it until the sum of read values equals the known total write count. At this point all the writes must have completed and no more can happen.

Result

Race condition eliminated. Fixes netty#9362
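The write-counting scheme in this commit message can be modeled in plain Java. This is a sketch, not Netty's code: the class and method names are illustrative, and an AtomicLong stands in for the kernel eventfd counter so the shutdown drain can be exercised without JNI.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative model of the write-counting scheme (names are hypothetical).
class CountingLoop {
    final AtomicInteger wakenUp = new AtomicInteger();
    final AtomicLong eventFd = new AtomicLong();   // models the kernel eventfd counter
    long eventFdWriteCount;                        // event-loop-local total of wakeup writes

    /** Waker side: only the CAS winner performs the write. */
    void wakeup() {
        if (wakenUp.compareAndSet(0, 1)) {
            eventFd.incrementAndGet();             // models eventfd_write(fd, 1)
        }
    }

    /** After epollWait returns: a CAS fail here corresponds 1-1 to a prior write. */
    void afterWake() {
        if (!wakenUp.compareAndSet(0, 1)) {
            eventFdWriteCount++;
        }
    }

    /** Shutdown: read until the sum of read values equals the known write total. */
    void drainOnShutdown() {
        // Cap iterations to avoid an indefinite spin if some write failed.
        for (int i = 0; eventFdWriteCount != 0L && i < 10000; i++) {
            eventFdWriteCount -= eventFd.getAndSet(0L); // models eventfd_read
        }
    }
}
```

Because eventfd is a counter rather than an edge, the drain loop doesn't care how the writes were batched: once the running sum of read values equals eventFdWriteCount, every write has demonstrably landed and the fd is safe to close.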
Force-pushed from 8527908 to 8f15508
@ejona86 can you have a look as well?
@@ -546,6 +554,11 @@ protected void cleanup() {
        logger.warn("Failed to close the epoll fd.", e);
    }
    try {
        // Ensure any inflight wakeup writes have been performed prior to closing eventFd.
It's unclear how this works when eventFdWrite will call eventfd_read:
if (eventfd_read(fd, &val) == 0 || errno == EAGAIN) {
@ejona86 thank you, I had forgotten about the existence of this other eventfd_read. But I think that's essentially dead code given we only ever write a value of 1 at a time into 8 bytes.
I'd propose to replace this check with a comment to make things clearer and just treat it as an error case.
Then I think we should explicitly remove it. Norman put it there recently and explicitly, and I don't think anything has changed since then that would invalidate it (f6cf681). I agree that overflowing a 64 bit integer via increments is not going to happen. I don't know if there are other concerns.
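As a back-of-envelope check on the overflow point: eventfd(2) caps its counter at 0xFFFFFFFFFFFFFFFE, just below 2^64, so even at an implausibly fast rate of one increment-by-1 write per nanosecond, saturating it would take centuries. A tiny calculation (the rate is an assumed upper bound, not a measurement):

```java
// Why overflowing the eventfd counter via writes of 1 is not a practical
// concern: the counter's range is ~2^64 (eventfd(2) caps it at
// 0xFFFFFFFFFFFFFFFE), and wakeup writes can't outpace syscall latency.
public class EventfdOverflow {
    public static void main(String[] args) {
        double maxCounter = Math.scalb(1.0, 64);   // ~2^64, the counter's range
        double writesPerSecond = 1e9;              // assumed absurd rate: one write per ns
        double years = maxCounter / writesPerSecond / (365.25 * 24 * 3600);
        System.out.printf("~%.0f years of continuous writes to overflow%n", years);
    }
}
```

This works out to roughly 585 years of uninterrupted writes, several orders of magnitude beyond any real workload.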
@@ -546,6 +554,11 @@ protected void cleanup() {
        logger.warn("Failed to close the epoll fd.", e);
    }
    try {
        // Ensure any inflight wakeup writes have been performed prior to closing eventFd.
        // Cap number of iterations to avoid indefinite spin in the case of a failed write.
This doesn't end up waiting for the write at all. Is the looping just basically having a sleep() and hoping for the best?
@ejona86 do you mean in the case the "safeguard" limit is reached? Given syscall latency is on the order of microseconds, we would have to have waited tens of milliseconds for concurrent writes to complete, which I thought would be long enough to assume they really did complete and something else unexpected must have happened instead (i.e. some prior write failed for whatever reason).
Oh also maybe I should add some comments to make this part clearer too: in almost all cases only a single iteration (single read) will be performed, i.e. the first value read from the eventfd will already == eventFdWriteCount. The only times it won't will be in a race situation of the kind that we're trying to avoid.
Isn't that assuming that the other thread is running on a different core? If the wakeup-triggering code is running on the same core as this code but was context switched out before doing the write, then this code could run with a full timeslice before the wakeup code resumes.
Fair enough, maybe that's a good reason to go for the epoll_wait option.
I think I expected more to poll on the eventfd once since it is in edge-triggered mode (IIRC). I honestly don't know how painful that is.
@ejona86 thanks a lot for the review/comments.
Could you elaborate a little on what you're proposing here? I can't see how that could avoid the race in a robust way but I may be missing or misunderstanding something.
Motivation

This is another iteration of netty#9476.

Modifications

Instead of maintaining a count of all writes performed and then using reads during shutdown to ensure all are accounted for, just set a flag after each write and don't reset it until the corresponding event has been returned from epoll_wait. This requires that while a write is still pending we don't reset wakenUp, i.e. continue to block writes from the wakeup() method.

Result

Race condition eliminated. Fixes netty#9362