
Use eventfd_read to close EpollEventLoop shutdown/wakeup race #9476

Closed
njhill wants to merge 5 commits from the eventfd-close-race branch

Conversation

@njhill (Member) commented Aug 17, 2019:

Motivation

@carl-mastrangelo discovered a non-hypothetical race condition during EpollEventLoop shutdown where wakeup writes can complete after the eventfd has been closed and subsequently reassigned by the kernel.

This fix is an alternative to #9388, using eventfd_read to hopefully close the gap completely, and without involving an additional CAS during wakeup.

Modification

After waking from epollWait, CAS the wakenUp atomic from 0 to 1. The times that a value of 1 is encountered here (CAS fail) correspond 1-1 with prior CAS wins by other threads in the wakeup(...) method, which correspond 1-1 with eventfd_write(1) calls (even if the most recent write is yet to happen).

Thus we can locally maintain a precise total count of those writes (eventFdWriteCount) which will be constant while the EL is awake - no further writes can happen until we reset wakenUp back to 0.

Since eventfd is a counter, when shutting down we just need to read from it until the sum of read values equals the known total write count. At this point all the writes must have completed and no more can happen.
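For illustration, here is a simplified model of this accounting. It is not the actual patch: the class and field names are invented, an AtomicLong stands in for the kernel eventfd counter, and the epoll machinery is omitted.

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicLong;

    // Model only: 'eventfd' is an AtomicLong standing in for the kernel counter,
    // and the class/field names do not correspond to the real EpollEventLoop fields.
    class WakeupWriteCountSketch {
        private final AtomicInteger wakenUp = new AtomicInteger();  // 0 = sleeping/armed, 1 = awake
        private final AtomicLong eventfd = new AtomicLong();        // stand-in for the kernel eventfd counter
        private long eventFdWriteCount;                              // touched only by the event loop thread

        // Any thread: only the CAS winner performs the write, as in wakeup(...).
        void wakeup() {
            if (wakenUp.compareAndSet(0, 1)) {
                eventfd.incrementAndGet();                           // stands in for eventfd_write(fd, 1)
            }
        }

        // Event loop thread, after returning from epollWait.
        void afterEpollWait() {
            if (!wakenUp.compareAndSet(0, 1)) {
                // CAS failed: exactly one wakeup() won the race since the last reset,
                // so exactly one eventfd_write(1) was (or will shortly be) issued for it.
                eventFdWriteCount++;
            }
            // ... process ready events, run tasks ...
            wakenUp.set(0);                                          // re-arm; writes may resume
        }

        // Shutdown, entered with wakenUp still at 1 so no new writes can start:
        // read the counter until every counted write has been observed.
        void drainBeforeClosingEventFd() {
            // Cap the iterations, as the patch does, to avoid an indefinite spin after a failed write.
            for (int i = 0; eventFdWriteCount != 0L && i < 10000; i++) {
                eventFdWriteCount -= eventfd.getAndSet(0);           // stands in for eventfd_read(fd)
            }
            // eventFdWriteCount == 0 here means all counted writes have landed;
            // closing the fd can no longer race with an in-flight write.
        }
    }

In the common case the first shutdown read already returns the full count; the loop only iterates again if some counted write has not yet landed.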

Result

Race condition eliminated. Fixes #9362

@netty-bot commented:

Can one of the admins verify this patch?

    // Cap number of iterations to avoid indefinite spin in the case of a failed write.
    for (int i = 0; eventFdWriteCount != 0L && i < 10000; i++) {
        eventFdWriteCount -= Native.eventFdRead(eventFd.intValue());
    }
njhill (Member Author):
I thought spinning on read here would be simpler/safer than using epoll; it should rarely, if ever, actually happen. Polling is still an option if others think that's preferable, though we would probably want to use timerfd as an equivalent safeguard.

njhill (Member Author):

Btw, by "polling" in the comment above I meant a blocking wait using epoll_wait... sorry if that was misleading.

Member:

I was considering calling epollWait here (with a timeout, sure; and we'd need to do epollFd.close() after this). That by itself would be broken, because epollWaitNow() is called from closeAll(). But I can't figure out why that call to epoll is even there. The best idea I have is that it became vestigial via d63c9f2; previously the ready return value was used.

njhill (Member Author):

Oh yeah, I'd wondered about the purpose of that epollWaitNow in closeAll; it's hard to imagine it has one now... hopefully @normanmaurer could confirm your suspicion that it's obsolete.

You seemed to imply that epollWait by itself would be sufficient (apologies if misinterpreted). But I think the write/read accounting is still needed, i.e. we would still be calling eventFdRead in a loop here, just that it would be doing a blocking-wait each iteration rather than spinning.

This is because wakenUp by itself can't tell us whether a write is still outstanding. Even if we kept track of whether eventfd was in the event array populated by the most recent epollWait call, it's possible that there could be more than one outstanding write (from multiple prior EL iterations), so receiving one last edge wouldn't be conclusive. Do you agree?

But while trying to think about what you may have been proposing, I realized that if we're careful we could in fact do this just by tracking the edges, and so avoid having to read the eventfd after all. Maybe that's what you had in mind, and again apologies if so! The key thing is to make sure we don't reset wakenUp before waiting during normal loop operation when we know a wakeup write is pending... that way we know there can't be more than one outstanding at a time.

I have made the corresponding changes in e3305bdcfb5a5673c8a0a32a6d754e11b912d955, and actually think it's a superior option. Let me know what you think and I can update this PR accordingly or open another.
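A rough sketch of the idea (illustrative only, not the actual e3305bd change: names are invented and the real syscalls appear only as comments):

    import java.util.concurrent.atomic.AtomicInteger;

    // Model only: names are invented and the epoll/eventfd syscalls appear as comments.
    class PendingWakeupSketch {
        // 1 = loop awake (writes suppressed), 0 = loop armed for a blocking wait.
        private final AtomicInteger wakenUp = new AtomicInteger(1);
        private boolean pendingWakeup;   // event loop thread only: write issued, eventfd event not yet seen

        // Any thread: at most one write can be issued per armed period, and the loop
        // never re-arms while pendingWakeup is true, so at most one write is ever in flight.
        void wakeup() {
            if (wakenUp.getAndSet(1) == 0) {
                // Native.eventFdWrite(eventFd, 1);
            }
        }

        // Event loop thread: one iteration, reduced to the wakeup bookkeeping.
        void runIteration() {
            if (!pendingWakeup) {
                wakenUp.set(0);                  // arm: a wakeup now requires a write
            }
            // epoll_wait(...);  if pendingWakeup, the outstanding write ends the wait promptly
            if (wakenUp.getAndSet(1) == 1) {
                // A wakeup() call set it back to 1 first, so exactly one eventfd_write(1)
                // was (or will shortly be) issued; remember that it is still outstanding.
                pendingWakeup = true;
            }
            // In processReady, when the eventfd is among the ready fds:
            //     Native.eventFdRead(eventFd);
            //     pendingWakeup = false;        // the outstanding write has landed
        }

        // Shutdown: once pendingWakeup is false, no write can still be in flight,
        // so the eventfd can be closed without racing a late eventfd_write.
        boolean safeToCloseEventFd() {
            return !pendingWakeup;
        }
    }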

Member:

> You seemed to imply that epollWait by itself would be sufficient (apologies if misinterpreted). But I think the write/read accounting is still needed

Yes, that's what I was thinking. And yes, just you saying this made me realize that wakeups could overlap in a pathological case. Crap.

> Even if we kept track of whether eventfd was in the event array populated by the most recent epollWait call, it's possible that there could be more than one outstanding write (from multiple prior EL iterations) and so receiving one last edge wouldn't be conclusive. Do you agree?

Yes, I agree.

> But while trying to think about what you may have been proposing I realized that if careful we could in fact do this just by tracking the edges, and so avoid having to read the eventfd after all.

Tell me more!

> The key thing is to make sure we don't reset wakenUp before waiting during normal loop operation in the case we know a wakeup write is pending... that way we know there can't be more than 1 outstanding at a time.

Beautiful! I will say I was looking hard at wakenUp, but I hadn't considered anything like this. Honestly, that may make the code easier for us overall, since that sounds less likely to have hidden wakeup races.

njhill (Member Author):

Thanks @ejona86... agree that we're likely hosed one way or another in this situation, but I was trying to guard against a thread getting stuck in epoll_wait indefinitely. Apart from a write itself failing, I wasn't sure if there could be a failure somehow during processReady such that we miss the eventfd and don't reset the pendingWakeup flag. I guess that could also be dealt with by having a try/catch-all around each iteration.

You're right that it's also a small optimization in that we can skip a hasTasks() check and the timerfd adjustment in this case (the latter of which might involve a syscall).

Member:

@njhill I think the epollWaitNow() in closeAll can just be removed... Most likely it was just ported as we also do a selectNow() in the NIO transport.

njhill (Member Author):

@normanmaurer let me know if you agree with opening another PR based on e3305bd to replace this one per the above discussion

Member:

@njhill sounds good.

njhill (Member Author):

@ejona86 @normanmaurer I have now opened #9535, which is just e3305bd rebased.

@normanmaurer (Member):

@netty-bot test this please

@normanmaurer (Member):

@netty-bot test this please

njhill force-pushed the eventfd-close-race branch 2 times, most recently from ff73521 to 8527908 on August 19, 2019.
@normanmaurer (Member):

@netty-bot test this please

(The pushed commits carry the same Motivation/Modification/Result message as the PR description above.)
@normanmaurer (Member):

@ejona86 can you have a look as well?

@@ -546,6 +554,11 @@ protected void cleanup() {
logger.warn("Failed to close the epoll fd.", e);
}
try {
// Ensure any inflight wakeup writes have been performed prior to closing eventFd.
Member:

It's unclear how this works when eventFdWrite will call eventfd_read:

if (eventfd_read(fd, &val) == 0 || errno == EAGAIN) {

njhill (Member Author) commented Aug 23, 2019:

@ejona86 thank you, I had forgotten about the existence of this other eventfd_read. But I think that's essentially dead code, given that we only ever write a value of 1 at a time into the 8-byte counter.

I'd propose to replace this check with a comment to make things clearer and just treat it as an error case.

Member:

Then I think we should explicitly remove it. Norman put it there recently and deliberately (f6cf681); I don't think anything has changed since then that would invalidate it. I agree that overflowing a 64-bit integer via increments is not going to happen. I don't know if there are other concerns.

@@ -546,6 +554,11 @@ protected void cleanup() {
logger.warn("Failed to close the epoll fd.", e);
}
try {
// Ensure any inflight wakeup writes have been performed prior to closing eventFd.
// Cap number of iterations to avoid indefinite spin in the case of a failed write.
Member:

This doesn't end up waiting for the write at all. Is the looping just basically having a sleep() and hoping for the best?

njhill (Member Author):

@ejona86 do you mean the case where the "safeguard" limit is reached? Given that syscall latency is on the order of microseconds, we would already have waited tens of milliseconds for concurrent writes to complete, which I thought would be long enough to assume they really did complete and that something else unexpected must have happened instead (i.e. some prior write failed for whatever reason).

njhill (Member Author):

Oh, also, maybe I should add some comments to make this part clearer too: in almost all cases only a single iteration (a single read) will be performed, i.e. the first value read from the eventfd will already equal eventFdWriteCount. The only time it won't is in a race situation of the kind that we're trying to avoid.

Member:

Isn't that assuming that the other thread is running on a different core? If the wakeup-triggering code is running on the same core as this code but was context switched out before doing the write, then this code could run with a full timeslice before the wakeup code resumes.

njhill (Member Author):

Fair enough, maybe that's a good reason to go for the epoll_wait option.

@ejona86 (Member) commented Aug 23, 2019:

I think I expected more to poll on the eventfd once since it is in edge-triggered mode (IIRC). I honestly don't know how painful that is.

@njhill (Member Author) commented Aug 23, 2019:

@ejona86 thanks a lot for the review/comments.

> I think I expected more to poll on the eventfd once since it is in edge-triggered mode (IIRC). I honestly don't know how painful that is.

Could you elaborate a little on what you're proposing here? I can't see how that could avoid the race in a robust way but I may be missing or misunderstanding something.

njhill added a commit to njhill/netty that referenced this pull request Sep 4, 2019
Motivation

This is another iteration of netty#9476.

Modifications

Instead of maintaining a count of all writes performed and then using
reads during shutdown to ensure all are accounted for, just set a flag
after each write and don't reset it until the corresponding event has
been returned from epoll_wait.

This requires that while a write is still pending we don't reset
wakenUp, i.e. continue to block writes from the wakeup() method.

Result

Race condition eliminated. Fixes netty#9362
@njhill (Member Author) commented Sep 4, 2019:

Closing in favour of #9535 (see comment thread above)

njhill closed this Sep 4, 2019.
normanmaurer pushed a commit that referenced this pull request Sep 5, 2019 (#9535); its message is the same as the one above.
normanmaurer added a commit that referenced this pull request Sep 20, 2019, with the same message (Co-authored-by: Norman Maurer <norman_maurer@apple.com>).
normanmaurer added a commit that referenced this pull request Sep 23, 2019 (#9586), with the same message (Co-authored-by: Norman Maurer <norman_maurer@apple.com>).
njhill added a commit to njhill/netty that referenced this pull request Sep 25, 2019 (netty#9586), with the same message (Co-authored-by: Norman Maurer <norman_maurer@apple.com>).
normanmaurer added a commit that referenced this pull request Sep 27, 2019 (#9586) (#9612), with the same message (Co-authored-by: Norman Maurer <norman_maurer@apple.com>).
Linked issue: Shutdown race in EpollEventLoop (#9362)