SSL failures Netty 4.1.13 -> 4.1.14 #7264

Closed

rkapsi opened this issue Sep 29, 2017 · 29 comments

@rkapsi
Member

rkapsi commented Sep 29, 2017

I have fallen a little behind on Netty releases and am in the process of upgrading from Netty 4.1.13 w/ netty-tcnative 2.0.5 (linux, openssl, custom static build) to Netty 4.1.14 with the same netty-tcnative version.

We're observing a failure where SSL slowly becomes more unreliable over the course of a couple of hours. Unreliable as in handshakes failing/hanging. It's a degrading failure with nothing apparent in external metrics, such as servers running out of memory or more CPU being consumed.

I'm going to bisect the 4.1.14 commits and hopefully find the source, but if you've got ideas then let me know.

@johnou
Contributor

johnou commented Sep 29, 2017

@rkapsi any exceptions or thread dumps to share?

@rkapsi
Member Author

rkapsi commented Oct 2, 2017

@johnou there are no exceptions (I have INFO level enabled for the io.netty.handler.ssl and io.netty.internal.tcnative packages). I haven't bothered with thread dumps (yet). My bisect may take a few days (some non-Netty things that I need to do first).

@normanmaurer
Member

@rkapsi unfortunately nothing comes to mind :(

@rkapsi
Member Author

rkapsi commented Oct 9, 2017

Updating my comment: I completed the bisect, which required some intermediate cherry-picking.

The culprit is commit f7b3cae. You must cherry-pick commit 74140db to avoid a big leak.

$ git bisect log 
git bisect start
# bad: [8cc1071881e90b0130bdd35a0441abcd0df6ffa9] [maven-release-plugin] prepare release netty-4.1.14.Final
git bisect bad 8cc1071881e90b0130bdd35a0441abcd0df6ffa9
# good: [2a376eeb1b14b1f2e23e1c30ac2f2a213dbea25b] [maven-release-plugin] prepare for next development iteration
git bisect good 2a376eeb1b14b1f2e23e1c30ac2f2a213dbea25b
# bad: [4af47f0ced39d86a1ef6a644e7c1506d81c0ea1b] AbstractByteBuf.setCharSequence(...) must not expand buffer
git bisect bad 4af47f0ced39d86a1ef6a644e7c1506d81c0ea1b
# bad: [4c14d1198b58e9660c116ce151b077d98b9bd2a2] Add testcase to ensure NioEventLoop.rebuildSelector() works correctly.
git bisect bad 4c14d1198b58e9660c116ce151b077d98b9bd2a2
# bad: [7cfe4161823dec6192543e916b927e7de40190be] Use unbounded queues from JCTools 2.0.2
git bisect bad 7cfe4161823dec6192543e916b927e7de40190be
# good: [df568c739e2a73b4a1aea533a4fea934fdf9d0f7] Use ByteBuf#writeShort/writeMedium instead of writeBytes
git bisect good df568c739e2a73b4a1aea533a4fea934fdf9d0f7
# bad: [86e653e04fb452c92154e39cd7189615dc0ec323] SslHandler aggregation of plaintext data on write
git bisect bad 86e653e04fb452c92154e39cd7189615dc0ec323
# bad: [f7b3caeddc5bb1da75aaafa4a66dec88ed585d69] OpenSslEngine option to wrap/unwrap multiple packets per call
git bisect bad f7b3caeddc5bb1da75aaafa4a66dec88ed585d69
# first bad commit: [f7b3caeddc5bb1da75aaafa4a66dec88ed585d69] OpenSslEngine option to wrap/unwrap multiple packets per call
f7b3caeddc5bb1da75aaafa4a66dec88ed585d69 is the first bad commit
commit f7b3caeddc5bb1da75aaafa4a66dec88ed585d69
Author: Scott Mitchell <scott_mitchell@apple.com>
Date:   Fri Feb 3 17:54:13 2017 -0800

    OpenSslEngine option to wrap/unwrap multiple packets per call
    
    Motivation:
    The JDK SSLEngine documentation says that a call to wrap/unwrap "will attempt to consume one complete SSL/TLS network packet" [1]. This limitation can result in thrashing in the pipeline to decode and encode data that may be spread amongst multiple SSL/TLS network packets.
    ReferenceCountedOpenSslEngine also does not correctly account for the overhead introduced by each individual SSL_write call if there are multiple ByteBuffers passed to the wrap() method.
    
    Modifications:
    - OpenSslEngine and SslHandler supports a mode to not comply with the limitation to only deal with a single SSL/TLS network packet per call
    - ReferenceCountedOpenSslEngine correctly accounts for the overhead of each call to SSL_write
    - SslHandler shouldn't cache maxPacketBufferSize as aggressively because this value may change before/after the handshake.
    
    Result:
    OpenSslEngine and SslHandler can handle multiple SSL/TLS network packets per call.
    
    [1] https://docs.oracle.com/javase/7/docs/api/javax/net/ssl/SSLEngine.html

:040000 040000 9b028b9a007cd584c809eee96aa57bbb7a0c9a19 fd38ca1f94766d5cb29267e2a572f2965c3ab439 M	handler

@johnou
Contributor

johnou commented Oct 10, 2017

@rkapsi silly question (just to rule something out), are you using the new naming scheme with your custom static build?

http://netty.io/news/2017/09/25/4-0-52-Final-4-1-16-Final.html

Native library naming
This release changed the naming scheme of the native .so and .dynlib files that are included in the native jars to also include the architecture on which they were compiled. This was done as part of (#7163) to ensure we do not load the native transport libs when used on architectures that are not supported. If you use any rules to shade these libs, you may need to adjust them to also take the architecture into account.

@rkapsi
Member Author

rkapsi commented Oct 10, 2017

@johnou oh that is way later. This issue is strictly about some bug introduced between 4.1.13 (works) and 4.1.14 (doesn't work).

@johnou
Contributor

johnou commented Oct 10, 2017

@rkapsi okay just to clarify though, have you tested 4.1.16 as well?

@rkapsi
Member Author

rkapsi commented Oct 10, 2017

I have updated my original comment to better reflect the result of the bisect (please see above). The bug is in commit f7b3cae.

@johnou I have not, but I'll give it a spin next.

@rkapsi
Member Author

rkapsi commented Oct 11, 2017

@johnou the issue is present in 4.1.16 as well.

@johnou
Contributor

johnou commented Oct 12, 2017

cc @Scottmitch

@smaldini

smaldini commented Oct 13, 2017

Might also be linked to #2752 (comment)

@normanmaurer
Member

normanmaurer commented Oct 14, 2017 via email

@johnou
Contributor

johnou commented Oct 14, 2017

@rkapsi and if you disable aggregation? SslHandler.setWrapDataSize(0);
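For illustration, a minimal sketch of where that call could go in a pipeline; the SslContext setup and the initializer name are assumptions, only the setWrapDataSize(0) call comes from this comment:

```java
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslHandler;

// Hypothetical initializer; "sslCtx" is assumed to be built elsewhere
// (e.g. with the custom static OpenSSL netty-tcnative build).
public final class NoWrapAggregationInitializer extends ChannelInitializer<SocketChannel> {

    private final SslContext sslCtx;

    public NoWrapAggregationInitializer(SslContext sslCtx) {
        this.sslCtx = sslCtx;
    }

    @Override
    protected void initChannel(SocketChannel ch) {
        SslHandler sslHandler = sslCtx.newHandler(ch.alloc());
        // 0 disables the plaintext write aggregation, so every write is
        // wrapped into its own SSL/TLS record as in 4.1.13.
        sslHandler.setWrapDataSize(0);
        ch.pipeline().addLast("ssl", sslHandler);
        // ... remaining protocol handlers go here
    }
}
```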

@smaldini

I did try it without success, but that makes sense in my case since the new coalescing queue is unbounded and does not update the isWritable() state at all. The SslHandler never writes directly and always goes through the queue.

@smaldini

@normanmaurer @Scottmitch looks like this PR from @violetagg actually fixes the issue we had with unlimited SslHandler buffering - let us know if that looks fine to you and whether we can introduce it before 4.1.17.Final.

@rkapsi
Member Author

rkapsi commented Oct 17, 2017

@johnou ran 4.1.16 with SslHandler.setWrapDataSize(0); for a couple of hours and everything was working fine.

There appears to be some overlap between f7b3cae and 86e653e in terms of wrapping/unwrapping. Is the latter a superset of the former? Anyway, I want to say that this particular problem started with f7b3cae (but it was possibly fixed in the next commit when using 0 for wrapDataSize). I haven't really observed issues with writability, which would quickly result in running out of memory in our use case. But our memory graphs do look slightly different from normal when this issue is happening, and I haven't waited long enough to see if the server(s) eventually OOME, because SSL is borked way earlier.


@normanmaurer
Member

normanmaurer commented Oct 17, 2017 via email

@normanmaurer
Member

@rkapsi that's interesting... can you tell me how big the buffers you write typically are?

@rkapsi
Member Author

rkapsi commented Oct 23, 2017

@normanmaurer I don't have that info, but I'll get it for you. The buffers that get written into the SslHandler, right? Hopefully by tomorrow.

But generally speaking, we have no aggregation (like HttpObjectAggregator) in our pipeline. Stuff just gets shoved up and down the pipeline. On one end we may have a client with a dialup connection, on the other end a server w/ 10Gbit. Backpressure is managed by listening to writability events. It's quite possible to end up with a large buffer from a single select(). The only thing I can think of that isn't default is our SO_RCVBUF, which is 64 KB. I believe the HTTP/2 codec (old API) also does some internal aggregation (almost all SSL traffic is over H2).
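For illustration, the writability-driven backpressure described above could look roughly like this; the proxy-style peer channel and the handler name are assumptions, not code from this project:

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Hypothetical backpressure handler: stop reading from the peer channel while this
// channel's outbound buffer is above the high water mark, resume once it drains.
public final class WritabilityBackpressureHandler extends ChannelInboundHandlerAdapter {

    private final Channel peer; // the channel on the "other side" (slow client vs. 10Gbit server)

    public WritabilityBackpressureHandler(Channel peer) {
        this.peer = peer;
    }

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) {
        peer.config().setAutoRead(ctx.channel().isWritable());
        ctx.fireChannelWritabilityChanged();
    }
}
```

The non-default receive buffer mentioned above would be configured with something like bootstrap.childOption(ChannelOption.SO_RCVBUF, 64 * 1024).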

@normanmaurer
Member

@rkapsi yeah... I'm just trying to get a better understanding of why the aggregation code would mess stuff up.

@Scottmitch
Member

sorry for being MIA

ran 4.1.16 with SslHandler.setWrapDataSize(0); for a couple of hours and everything was working fine.

@rkapsi - This implies that the aggregation code is at fault here. Would the "couple hours" time frame be sufficiently long to see the issues you are observing? Can you confirm that you never see the issue when aggregation is disabled? What type of buffers are you writing?

What I want to understand is what the control flow looks like through the aggregation code in your use case [1].

[1] https://github.com/netty/netty/blob/4.1/handler/src/main/java/io/netty/handler/ssl/SslHandler.java#L1818
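For readers following the aggregation discussion, here is a heavily simplified, conceptual sketch of what coalescing pending plaintext writes up to wrapDataSize means; this is not the actual SslHandler code referenced in [1]:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import java.util.ArrayDeque;
import java.util.Queue;

// Conceptual sketch only -- not the real SslHandler/coalescing-queue implementation.
// Queued plaintext writes are merged into one buffer of at most wrapDataSize bytes
// before being handed to the engine, so several small writes become one record.
final class CoalescingSketch {

    private final Queue<ByteBuf> pending = new ArrayDeque<ByteBuf>();

    void add(ByteBuf msg) {
        pending.add(msg);
    }

    ByteBuf remove(ByteBufAllocator alloc, int wrapDataSize) {
        ByteBuf aggregated = alloc.buffer(wrapDataSize);
        while (!pending.isEmpty() && aggregated.readableBytes() < wrapDataSize) {
            ByteBuf next = pending.peek();
            int toCopy = Math.min(next.readableBytes(), wrapDataSize - aggregated.readableBytes());
            aggregated.writeBytes(next, toCopy); // advances next's readerIndex
            if (!next.isReadable()) {
                pending.remove().release();
            }
        }
        return aggregated;
    }
}
```

In the real handler, setWrapDataSize(0) effectively turns this merging off, so each write is wrapped on its own, which matches the workaround tested above.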

@Scottmitch Scottmitch self-assigned this Oct 23, 2017
@rkapsi
Member Author

rkapsi commented Oct 23, 2017

@Scottmitch: I'll add that info... Hopefully tomorrow.

How it usually unfolds: about 1.5 hours after a deploy, our HW LB is the first to notice that SSL health checks are occasionally failing on one server. Within minutes more Netty servers join in and the failures become more frequent. About 2 hours after the deploy, external monitoring services start to fire (i.e. they connect through the HW LB to our Netty servers). At this point it's maybe noticeable to our users, and that's when I usually stop. Non-SSL traffic handled by the same Netty servers is fine. In terms of memory, the servers seem to keep consuming more of it, whereas with 4.1.13 or WrapDataSize(0) it settles at some amount by this point.

I'll repeat the test but "couple of hours" was 4 hours IIRC.

@Scottmitch
Member

I'll add that info... Hopefully tomorrow.

thx

In terms of memory, the servers seem to keep consuming more of it, whereas with 4.1.13 or WrapDataSize(0) it settles at some amount by this point.

Does this mean the HW LB and users still see connectivity issues even if WrapDataSize(0) is used?

@rkapsi
Member Author

rkapsi commented Oct 24, 2017

Sorry if that was confusing; everything was OK when using WrapDataSize(0). I'm collecting the data regarding the ByteBufs that are being written to the SslHandler.

@rkapsi
Member Author

rkapsi commented Oct 24, 2017

[screenshot from 2017-10-24 10-07-02: graphs of "ByteBuf types", "ByteBuf with and without max capacity", and "Avg. ByteBuf length"]

@Scottmitch
Member

@rkapsi - can you explain the graphs:

  • what happens at around 9:12?
    • Is a previous version of Netty running with production traffic before this time, and is the new version of Netty deployed after it?
  • "ByteBuf types"
    • how many instances are contributing to these metrics, and what versions of Netty are they running?
  • "ByteBuf with and without max capacity"
    • is this from 2 different instances taking the same traffic?
    • or is this just a total count of ByteBuf objects allocated "without" (e.g. WrapDataSize(0)) and "with" (e.g. the default value)?
  • "Avg. ByteBuf length"
    • how is the metric measured? Is this from ByteBuf allocations in your application, or does this include resize/aggregation in SslHandler?
    • how many instances of your application are contributing to this metric?

@rkapsi
Member Author

rkapsi commented Oct 24, 2017

  • 9:12 was just a deploy. I had a few instances emitting these metrics overnight and then deployed it to the whole fleet.

  • ByteBuf types: Will email you.

  • With/Without max capacity: No, that is just a count of the ByteBufs written to the SslHandler[1], split by whether ByteBuf#maxCapacity() == Integer.MAX_VALUE or not.

  • Avg. ByteBuf length: Just the average of ByteBuf#readableBytes() for the ByteBufs being written to the SslHandler[1].

[1]: The data comes from a simple ChannelOutboundHandler sitting between the SslHandler and the protocol codec (see the sketch below).
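A sketch of what such a measuring handler could look like; the counter fields and the way they are reported are assumptions, only the placement between the protocol codec and the SslHandler is from this thread:

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelOutboundHandlerAdapter;
import io.netty.channel.ChannelPromise;

// Hypothetical metrics handler: counts outbound ByteBufs on their way to the SslHandler.
public final class SslWriteMetricsHandler extends ChannelOutboundHandlerAdapter {

    private long totalBuffers;
    private long totalReadableBytes;
    private long unboundedMaxCapacity; // ByteBuf#maxCapacity() == Integer.MAX_VALUE

    @Override
    public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) {
        if (msg instanceof ByteBuf) {
            ByteBuf buf = (ByteBuf) msg;
            totalBuffers++;
            totalReadableBytes += buf.readableBytes();
            if (buf.maxCapacity() == Integer.MAX_VALUE) {
                unboundedMaxCapacity++;
            }
        }
        // Pass the message through unchanged; the SslHandler sits closer to the head.
        ctx.write(msg, promise);
    }
}
```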

@Scottmitch Scottmitch added this to the 4.0.53.Final milestone Nov 2, 2017
@rkapsi
Member Author

rkapsi commented Nov 2, 2017

An update regarding this issue: @Scottmitch and I worked on this directly (email, hangouts) and narrowed it down to the combination of wrapDataSize > 0 and jdkCompatibilityMode=true triggering it.

#7352 fixes it and #7354 addresses some confusion that emerged during debugging.

Thanks everybody.

@Scottmitch
Member

thanks for your help tracking this down, @rkapsi!
