-
-
Notifications
You must be signed in to change notification settings - Fork 921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
jruby-openssl spins endlessly trying to read from a non-blocking SocketChannel while no data is available. #1280
Comments
Hmmm yes. The blocking and non-blocking state of the underlying SocketChannel is probably getting mucked up or set to non-blocking and not reset. I don't suppose you have a small reproduction for us? |
Well, after some analysis, actually I do have a reproduction: xb@32fa59d |
I am seeing this as well on our production servers using Jruby 1.7.3 and 1.7.8. I've also tried a few versions of excon and all result in threads locking. I would be more than happy to test any code since this starts affecting us within hours of starting our processes. Great find @xb I had no desire to go hunting in the jruby code. at org.jruby.ext.openssl.SSLSocket.readAndUnwrap(SSLSocket.java:549) |
I am having this problem as well. I also have a jstack trace with similar lines: This is Jruby 1.7.9, 1.9 mode
I have the full trace if necessary: https://gist.github.com/amalagaura/7945662 |
Seems related to this where it happens on a write: #1156 |
Ok, we should prioritize this for 1.7.10 or 1.7.11. Looks like we have some good doco now and some reproduction cases. I will try to investigate, but as we're getting into the holiday it may not happen until after xmas. |
Ok, so the real issue here is that the partial TLS header never gets consumed, so we never go back into the blocking select. We keep seeing that there's data available and going ahead with the read, which can't proceed because there's not a complete packet, so it returns nothing, and so on. I'm going to try to fix two things:
|
Ok, so with a few modifications I have gotten it to block properly. This causes the test case from @xb to hang without spinning. Now before you think this is wrong...MRI appears to block too. I'm not sure the given test case is correct. Continuing to explore. |
After reading through the other comments, it seems like what's needed here is just to get sysread to properly block when there's no data or partial data. However, nobody other than @xb has provided something I can use to reproduce the issue, and @xb's case is flawed because it hangs on both JRuby (with my tweaks) and MRI. Does anyone have a reproduction I can use to test that my fixes do what they should? |
As far as I remember, the problem is not the blocking per se, but the spinning while blocking, meaning 100% CPU usage per thread. If the network weather is not perfect (e.g. there is packet loss and hence the TLS client thinks the TCP connection is still alive while the TLS server has already shut down the TCP connection), this may lead to all CPU cores being used up. I tried to produce a testcase, but reproducing the exact situation appeared to be quite difficult. Reproducing the situations in production is also difficult, because it apparently depends on the "network weather". |
@headius I can only speak as someone who submitted a stack trace. If sysread blocks will it result in an eventual timeout with your fix? The problem we were facing is CPU usage without a timeout. Edit: Sorry I don't know if it should block and timeout or just exit since @xb's test case closes the connection. |
@headius I believe you are correct about @xb's test case being flawed: it used the blocking read call from openssl/buffering.rb, so i'm guessing that its correct behavior that its blocking without spinning after your fixes. I've made a copy of @xb's testcase that uses read_nonblock instead, I think that is a better reproduction of the problem: it blocks and spins on jruby (without your fixes), but doesn't spin or block on mri: mohamedhafez@860f0c3 |
The new testcase linked to above (mohamedhafez@860f0c3) is a better reproduction of the issue myself, @amalagaura, and @lardcanoe have been facing: from all of our stack traces it looks like we are all spinning indefinitely on a read_nonblock, so my test above tests that directly. However, I could be mistaken but it looks like @xb's stack trace is for a blocking read call, perhaps a different test would be better to make sure things behave correctly in that situation, though it sounds like your fixes @headius are already blocking without spinning as MRI does in that case. |
@headius if you've got a branch with the fixes you referred to earlier I'd be happy to run it against the test I've written above |
Using 1.7.10, still getting this issue. Some thread dumps: https://gist.github.com/lardcanoe/8447830 No clue how to reproduce other than a large load. (Note sure if experiencing this or #1156, but leaning towards this one.) |
Would anybody be interested in going in on a bounty for this issue? I know some maintainers don't like bounties, are there any objections to that? |
@mohamedhafez You read my mind! Many of our servers have become my pets due to this one. I'm in for a good chunk of dough. |
@headius could you post those tweaks you made earlier that at least partially solved the problem? They may be a good starting point for others... |
…dable Currently in JRuby, trying to read from an SSL socket with only a partial TLS record results in spinning indefinitely until the rest of the record comes in (issue jruby/jruby#1280). This will make sure that a non blocking read correctly results in a IO::WaitReadable in that case instead.
Been running with my fix (in the pull request above) in production for a week now, not a single thread spinning forever, and not a single spike to 100% CPU all week:) |
Reading this thread it would be funny if @headius made the fix but didn't check it against the new testcases which we think do reproduce the issue. Are @headius changes the same as @mohamedhafez 's changes? |
This should be fixed with #1508 from @mohamedhafez, so I'm going to close it. We'll try to get a jruby-openssl release out that works on earlier JRuby versions too. |
Under jruby-1.7.6, I observed multiple threads spinning in this state:
Using Java debugger, it turned out that the SocketChannelImpl object is set to non-blocking mode (the "blocking" flag of the object is set to false) while the method org.jruby.ext.openssl.SSLSocket.readAndUnwrap() is has the blocking parameter set to true.
When considering the current implementation of readAndUnwrap()
it is clear that bytesRead==0 (because the SocketChannel is in non-blocking mode), however, processing does not fail in any way. Rather, readAndUnwrap() returns 0, which in turn results in rr==0 in this code (in org.jruby.ext.openssl.SSLSocket.do_sysread()):
Thus, this while loop spins forever. The comment says that sysread is a blocking read, but apparently it is not.
Thus, it apparently may be that a blocking SocketChannel becomes non-blocking due to weird circumstances, or it starts being non-blocking right away.
I'd recommend catching these bugs by checking whether the SocketChannel is actually blocking when it is expected to be blocking (and throwing an exception if the check fails). A quick workaround would be to ensure that the SocketChannel is always blocking when it is expected to be blocking (however, this hides the bug and may not solve it completely, maybe there is some interfering code which actually sets the SocketChannel to non-blocking mode).
The text was updated successfully, but these errors were encountered: