SSL failures Netty 4.1.13 -> 4.1.14 #7264
Comments
@rkapsi any exceptions or thread dumps to share?
@johnou there are no exceptions (I have
@rkapsi unfortunately there is nothing that comes to mind :(
Updated comment: completed the bisect, which required some intermediate cherry-picking. It's commit f7b3cae. (Note: you must also cherry-pick commit 74140db to avoid a big leak.)
@rkapsi silly question (just to rule something out): are you using the new naming scheme with your custom static build? See the "Native library naming" section of http://netty.io/news/2017/09/25/4-0-52-Final-4-1-16-Final.html
@johnou oh, that is way later. This issue is strictly about a bug introduced between 4.1.13 (works) and 4.1.14 (doesn't work).
@rkapsi okay, just to clarify though, have you tested 4.1.16 as well?
@johnou the issue is present in 4.1.16 as well.
cc @Scottmitch
Might also be linked to #2752 (comment)
@rkapsi and if you disable aggregation?
I did try that, without success, but that makes sense in my case since the new coalescing queue is unbounded and does not update the channel's writability state at all. The SslHandler never writes directly and always goes through the queue.
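For context, a minimal sketch (not from the thread) of how writability is normally driven. Channel writability is computed from the bytes queued in the channel's ChannelOutboundBuffer relative to the configured water marks, which is exactly why a handler-level queue can bypass it; the water mark values here are illustrative:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.WriteBufferWaterMark;

public final class WaterMarkExample {
    public static void main(String[] args) {
        // Writability flips when bytes queued in the ChannelOutboundBuffer
        // cross these water marks. Bytes held in a handler's own internal
        // queue (as described above for SslHandler) never reach the
        // ChannelOutboundBuffer, so they do not count against the high water
        // mark and never trigger channelWritabilityChanged().
        ServerBootstrap b = new ServerBootstrap();
        b.childOption(ChannelOption.WRITE_BUFFER_WATER_MARK,
                new WriteBufferWaterMark(32 * 1024, 64 * 1024)); // low, high (bytes)
    }
}
```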
@normanmaurer @Scottmitch looks like this PR from @violetagg actually fixes the issue we had with unbounded SslHandler buffering. Let us know if that looks fine to you, and whether we can get it in before 4.1.17.Final.
@johnou ran 4.1.16 with SslHandler.setWrapDataSize(0) for a couple of hours and everything was working fine. There appears to be some overlap between f7b3cae and 86e653e in terms of wrapping/unwrapping. Is the latter a superset of the former? Anyway, I want to say that this particular problem started with f7b3cae (but is possibly fixed by the next commit when using 0 for wrapDataSize). I haven't really observed issues with writability, which would quickly result in running out of memory in our use case. But our memory graphs do look slightly different from normal when this issue is happening, and I haven't waited long enough to see if the server(s) eventually OOME, because SSL is borked way earlier.
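For reference, a minimal sketch of the workaround being tested here, assuming a hypothetical channel initializer; SslHandler.setWrapDataSize(int) is the API added in 4.1.16:

```java
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslHandler;

final class SslSetup {
    // Hypothetical initializer body showing the workaround from this thread.
    static void configure(SocketChannel ch, SslContext sslCtx) {
        SslHandler ssl = sslCtx.newHandler(ch.alloc());
        // 0 disables coalescing of outbound buffers before wrap(), so each
        // write is wrapped individually rather than aggregated first.
        ssl.setWrapDataSize(0);
        ch.pipeline().addLast("ssl", ssl);
    }
}
```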
Thanks for the update, will check next week.
@rkapsi that's interesting... can you tell me how big the buffers that you write typically are?
@normanmaurer I don't have that info but I'll get it for you (the buffers that get written into SslHandler, right?). Hopefully by tomorrow. Generally speaking, we have no aggregation (like HttpObjectAggregator) in our pipeline; stuff just gets shoved up and down the pipeline. On one end we may have a client on a dial-up connection, on the other end a server with a 10Gbit link. Backpressure is managed by listening to writability events, as in the sketch below. It's quite possible to end up with a large buffer from a single
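A minimal sketch (not from the thread) of writability-driven backpressure for a single channel; in a proxy setup like the one described above, you would toggle autoRead on the peer channel instead:

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public class BackpressureHandler extends ChannelInboundHandlerAdapter {
    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) {
        // Stop pulling inbound data while the outbound side is saturated,
        // resume once the outbound buffer drains below the low water mark.
        ctx.channel().config().setAutoRead(ctx.channel().isWritable());
        ctx.fireChannelWritabilityChanged();
    }
}
```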
@rkapsi yeah... I'm just trying to get a better understanding of why the aggregation code would mess stuff up.
sorry for being MIA
@rkapsi - This implies that the aggregation code is at fault here. Would the "couple hours" time frame be sufficiently long to see the issues you are observing? Can you confirm that you never see the issue when aggregation is disabled? What type of buffers are you writing:
What I want to understand is what the control flow looks like through the aggregation code in your use case [1].
@Scottmitch: I'll add that info, hopefully tomorrow. How it usually unfolds: about 1.5 hours after a deploy, our HW LB is the first to notice that SSL health checks are occasionally failing on one server. Within minutes more Netty servers join in and it becomes more frequent. About 2 hours after a deploy, external monitoring services start to fire (i.e. they connect through the HW LB to our Netty servers). At this point it's maybe noticeable by our users, and that's when I usually stop. Non-SSL traffic handled by the same Netty servers is fine. In terms of memory, the servers seem to continue consuming more of it, while 4.1.13 or setWrapDataSize(0) does not show this. I'll repeat the test, but "couple of hours" was 4 hours IIRC.
thx
Does this mean the HW LB and users still see connectivity issues even if setWrapDataSize(0) is used?
Sorry if it was confusing: everything was OK when using setWrapDataSize(0).
@rkapsi - can you explain the graphs:
[1]: The data is coming from a simple ChannelOutboundHandler that is sitting between the SslHandler and the protocol codec.
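A hypothetical reconstruction (name and metrics sink are assumptions, not from the thread) of the kind of measuring handler described in [1]; placed between the protocol codec and the SslHandler, it records the size of every outbound buffer:

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelOutboundHandlerAdapter;
import io.netty.channel.ChannelPromise;

public class WriteSizeRecorder extends ChannelOutboundHandlerAdapter {
    @Override
    public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) {
        if (msg instanceof ByteBuf) {
            // Replace with a real histogram/metrics sink as needed.
            System.out.println("outbound write: " + ((ByteBuf) msg).readableBytes() + " bytes");
        }
        ctx.write(msg, promise);
    }
}
```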
An update regarding this issue: @Scottmitch and I worked on it directly (email, hangouts) and narrowed it down to a specific condition. #7352 fixes it, and #7354 addresses some confusion that emerged during debugging. Thanks everybody.
Thanks for your help tracking this down @rkapsi!
I have fallen a little behind on Netty releases and am in the process of upgrading from Netty 4.1.13 w/ netty-tcnative 2.0.5 (Linux, OpenSSL, custom static build) to Netty 4.1.14 with the same netty-tcnative version.
We're observing a failure where SSL slowly degrades and becomes more unreliable over the course of a couple of hours, unreliable as in handshakes failing/hanging. It's a degrading failure with no apparent external symptoms such as servers running out of memory or consuming more CPU.
I'm going to bisect the 4.1.14 commits and hopefully find the source, but if you've got ideas then let me know.
Expected behavior
Actual behavior
Steps to reproduce
Minimal yet complete reproducer code (or URL to code)
Netty version
JVM version (e.g. java -version)
OS version (e.g. uname -a)