Use eventfd_read to close EpollEventLoop shutdown/wakeup race #9476
Conversation
Can one of the admins verify this patch?
    // Cap number of iterations to avoid indefinite spin in the case of a failed write.
    for (int i = 0; eventFdWriteCount != 0L && i < 10000; i++) {
        eventFdWriteCount -= Native.eventFdRead(eventFd.intValue());
    }
I thought spinning on read here would be simpler/safer than using epoll; it should rarely ever actually happen. Polling is still an option if others think that's preferable, though we would probably want to use timerfd as an equivalent safeguard.
Btw by "polling" in the comment above I meant a blocking wait using epoll_wait... sorry if that was misleading.
I was considering calling epollWait here (with a timeout, sure; and we'd need to do epollFd.close() after this). That by itself would be broken, because epollWaitNow() is called from closeAll(). But I can't figure out why that call to epoll is even there. The best idea I have is that it became vestigial via d63c9f2; previously the ready return value was used.
Oh yeah I'd wondered about the purpose of that epollWaitNow in closeAll, hard to imagine it has one now... hopefully @normanmaurer could confirm your suspicion that it's obsolete.
You seemed to imply that epollWait by itself would be sufficient (apologies if misinterpreted). But I think the write/read accounting is still needed, i.e. we would still be calling eventFdRead in a loop here, just that it would be doing a blocking wait each iteration rather than spinning.
This is because wakenUp by itself can't tell us whether a write is still outstanding. Even if we kept track of whether eventfd was in the event array populated by the most recent epollWait call, it's possible that there could be more than one outstanding write (from multiple prior EL iterations), and so receiving one last edge wouldn't be conclusive. Do you agree?
But while trying to think about what you may have been proposing, I realized that if careful we could in fact do this just by tracking the edges, and so avoid having to read the eventfd after all. Maybe that's what you had in mind and again apologies if so! The key thing is to make sure we don't reset wakenUp before waiting during normal loop operation in the case we know a wakeup write is pending... that way we know there can't be more than one outstanding at a time.
I have made the corresponding changes here e3305bdcfb5a5673c8a0a32a6d754e11b912d955, and actually think it's a superior option. Let me know what you think and I can update this PR accordingly or open another.
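To make the edge-tracking idea concrete, here is a minimal runnable sketch. It is not Netty's actual code: the class and method names are hypothetical, and an AtomicLong stands in for the kernel eventfd counter so the invariant can be exercised without JNI. The point it demonstrates is that keeping wakenUp at 1 while a wakeup write is still pending bounds the number of in-flight eventfd writes to at most one.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative model of the edge-tracking scheme (names are hypothetical,
// not Netty's); an AtomicLong stands in for the kernel eventfd counter.
class EdgeTrackingLoop {
    final AtomicInteger wakenUp = new AtomicInteger(); // shared with waker threads
    final AtomicLong eventFd = new AtomicLong();       // models the kernel eventfd counter
    boolean pendingWakeup;                             // event-loop-local state

    /** Waker side: only the CAS winner writes, so writes map 1-1 to CAS wins. */
    void wakeup() {
        if (wakenUp.compareAndSet(0, 1)) {
            eventFd.incrementAndGet();                 // models eventfd_write(fd, 1)
        }
    }

    /**
     * One run-loop iteration. whileBlocked stands in for wakeups racing with
     * epoll_wait; eventFdInReadyList says whether epoll_wait reported the
     * eventfd among the ready events.
     */
    void loopIteration(Runnable whileBlocked, boolean eventFdInReadyList) {
        if (!pendingWakeup) {
            wakenUp.set(0);                            // safe: no write is in flight
        }
        whileBlocked.run();                            // "blocked in epoll_wait" here
        if (wakenUp.getAndSet(1) == 1 && !eventFdInReadyList) {
            pendingWakeup = true;                      // a write happened but its edge wasn't seen yet
        }
        if (eventFdInReadyList) {
            eventFd.decrementAndGet();                 // models consuming the edge
            pendingWakeup = false;
        }
    }
}
```

While pendingWakeup is true, wakenUp is never reset to 0, so every subsequent wakeup() loses the CAS and cannot start a second write; the flag is only cleared once the one outstanding edge has been observed.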
> You seemed to imply that epollWait by itself would be sufficient (apologies if misinterpreted). But I think the write/read accounting is still needed

Yes, that's what I was thinking. And yes, just you saying this made me realize that wakeups could overlap in a pathological case. Crap.

> Even if we kept track of whether eventfd was in the event array populated by the most recent epollWait call, it's possible that there could be more than one outstanding write (from multiple prior EL iterations) and so receiving one last edge wouldn't be conclusive. Do you agree?

Yes, I agree.

> But while trying to think about what you may have been proposing I realized that if careful we could in fact do this just by tracking the edges, and so avoid having to read the eventfd after all.

Tell me more!

> The key thing is to make sure we don't reset wakenUp before waiting during normal loop operation in the case we know a wakeup write is pending... that way we know there can't be more than 1 outstanding at a time.

Beautiful! I will say I was looking hard at wakenUp, but I hadn't considered anything like this. Honestly, that may make the code easier for us overall, since that sounds less likely to have hidden wakeup races.
Thanks @ejona86... agree that we're likely hosed one way or another in this situation, but I was trying to guard against a thread getting stuck in epoll_wait indefinitely. Apart from a write itself failing, I wasn't sure if there could be a failure somehow during processReady such that we miss the eventfd and don't reset the pendingWakeup flag. I guess that could also be dealt with by having a try/catch-all around each iteration.
You're right that it's also a small optimization in that we can skip a hasTasks() check and the timerfd adjustment in this case (the latter of which might involve a syscall).
@njhill I think the epollWaitNow() in closeAll can just be removed... Most likely it was just ported as we also do a selectNow() in the NIO transport.
@normanmaurer let me know if you agree with opening another PR based on e3305bd to replace this one per the above discussion
@njhill sounds good.
@ejona86 @normanmaurer have now opened #9535 which is just e3305bd rebased.
@netty-bot test this please
@netty-bot test this please
Force-pushed from ff03ee7 to eaea12b
Force-pushed from ff73521 to 8527908
@netty-bot test this please
Motivation

@carl-mastrangelo discovered a non-hypothetical race condition during EpollEventLoop shutdown where wakeup writes can complete after the eventFd has been closed and subsequently reassigned by the kernel. This fix is an alternative to netty#9388 which uses eventfd_read to hopefully close the gap completely, and doesn't involve an additional CAS during wakeup.

Modification

After waking from epollWait, CAS the wakenUp atomic from 0 to 1. The times that a value of 1 is encountered here (CAS fail) correspond 1-1 with prior CAS wins by other threads in the wakeup(...) method, which correspond 1-1 with eventfd_write(1) calls (even if the most recent write is yet to happen). Thus we can locally maintain a precise total count of those writes (eventFdWriteCount) which will be constant while the EL is awake - no further writes can happen until we reset wakenUp back to 0.

Since eventfd is a counter, when shutting down we just need to read from it until the sum of read values equals the known total write count. At this point all the writes must have completed and no more can happen.

Result

Race condition eliminated. Fixes netty#9362
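The write-counting scheme in this commit message can be modeled in plain Java. This is a sketch, not Netty's code: the class and method names are illustrative, and an AtomicLong stands in for the kernel eventfd counter so the shutdown drain can be exercised without JNI.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative model of the write-counting scheme (names are hypothetical).
class CountingLoop {
    final AtomicInteger wakenUp = new AtomicInteger();
    final AtomicLong eventFd = new AtomicLong();   // models the kernel eventfd counter
    long eventFdWriteCount;                        // event-loop-local total of wakeup writes

    /** Waker side: only the CAS winner performs the write. */
    void wakeup() {
        if (wakenUp.compareAndSet(0, 1)) {
            eventFd.incrementAndGet();             // models eventfd_write(fd, 1)
        }
    }

    /** After epollWait returns: a CAS fail here corresponds 1-1 to a prior write. */
    void afterWake() {
        if (!wakenUp.compareAndSet(0, 1)) {
            eventFdWriteCount++;
        }
    }

    /** Shutdown: read until the sum of read values equals the known write total. */
    void drainOnShutdown() {
        // Cap iterations to avoid an indefinite spin if some write failed.
        for (int i = 0; eventFdWriteCount != 0L && i < 10000; i++) {
            eventFdWriteCount -= eventFd.getAndSet(0L); // models eventfd_read
        }
    }
}
```

Because eventfd is a counter rather than an edge, the drain loop doesn't care how the writes were batched: once the running sum of read values equals eventFdWriteCount, every write has demonstrably landed and the fd is safe to close.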
Force-pushed from 8527908 to 8f15508
@ejona86 can you have a look as well?
@@ -546,6 +554,11 @@ protected void cleanup() {
        logger.warn("Failed to close the epoll fd.", e);
    }
    try {
        // Ensure any inflight wakeup writes have been performed prior to closing eventFd.
It's unclear how this works when eventFdWrite will call eventfd_read:
if (eventfd_read(fd, &val) == 0 || errno == EAGAIN) {
@ejona86 thank you, I had forgotten about the existence of this other eventfd_read. But I think that's essentially dead code given we only ever write a value of 1 at a time into 8 bytes.
I'd propose to replace this check with a comment to make things clearer and just treat it as an error case.
Then I think we should explicitly remove it. Norman put it there recently and explicitly, and I don't think anything has changed since then that would invalidate it (f6cf681). I agree that overflowing a 64 bit integer via increments is not going to happen. I don't know if there are other concerns.
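As a back-of-envelope check on the overflow point: eventfd(2) caps its counter at 0xFFFFFFFFFFFFFFFE, just below 2^64, so even at an implausibly fast rate of one increment-by-1 write per nanosecond, saturating it would take centuries. A tiny calculation (the rate is an assumed upper bound, not a measurement):

```java
// Why overflowing the eventfd counter via writes of 1 is not a practical
// concern: the counter's range is ~2^64 (eventfd(2) caps it at
// 0xFFFFFFFFFFFFFFFE), and wakeup writes can't outpace syscall latency.
public class EventfdOverflow {
    public static void main(String[] args) {
        double maxCounter = Math.scalb(1.0, 64);   // ~2^64, the counter's range
        double writesPerSecond = 1e9;              // assumed absurd rate: one write per ns
        double years = maxCounter / writesPerSecond / (365.25 * 24 * 3600);
        System.out.printf("~%.0f years of continuous writes to overflow%n", years);
    }
}
```

This works out to roughly 585 years of uninterrupted writes, several orders of magnitude beyond any real workload.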
@@ -546,6 +554,11 @@ protected void cleanup() {
        logger.warn("Failed to close the epoll fd.", e);
    }
    try {
        // Ensure any inflight wakeup writes have been performed prior to closing eventFd.
        // Cap number of iterations to avoid indefinite spin in the case of a failed write.
This doesn't end up waiting for the write at all. Is the looping just basically having a sleep() and hoping for the best?
@ejona86 do you mean in the case the "safeguard" limit is reached? Given syscall latency is on the order of microseconds, we would have to have waited tens of milliseconds for concurrent writes to complete, which I thought would be long enough to assume they really did complete and something else unexpected must have happened instead (i.e. some prior write failed for whatever reason).
Oh also maybe I should add some comments to make this part clearer too: in almost all cases only a single iteration (single read) will be performed, i.e. the first value read from the eventfd will already == eventFdWriteCount. The only times it won't will be in a race situation of the kind that we're trying to avoid.
Isn't that assuming that the other thread is running on a different core? If the wakeup-triggering code is running on the same core as this code but was context switched out before doing the write, then this code could run with a full timeslice before the wakeup code resumes.
Fair enough, maybe that's a good reason to go for the epoll_wait option.
I think I expected more to poll on the eventfd once since it is in edge-triggered mode (IIRC). I honestly don't know how painful that is.
@ejona86 thanks a lot for the review/comments.
Could you elaborate a little on what you're proposing here? I can't see how that could avoid the race in a robust way but I may be missing or misunderstanding something.
Motivation

This is another iteration of netty#9476.

Modifications

Instead of maintaining a count of all writes performed and then using reads during shutdown to ensure all are accounted for, just set a flag after each write and don't reset it until the corresponding event has been returned from epoll_wait. This requires that while a write is still pending we don't reset wakenUp, i.e. continue to block writes from the wakeup() method.

Result

Race condition eliminated. Fixes netty#9362