SSL failures Netty 4.1.13 -> 4.1.14 #7264

Closed

rkapsi opened this issue Sep 29, 2017 · 29 comments

@rkapsi
Member

rkapsi commented Sep 29, 2017

I have fallen a little behind on Netty releases and am in the process of upgrading from Netty 4.1.13 w/ netty-tcnative 2.0.5 (linux, openssl, custom static build) to Netty 4.1.14 with the same netty-tcnative version.

We're observing a failure where SSL slowly becomes more unreliable over the course of a couple of hours. Unreliable as in handshakes failing/hanging. It's a degrading failure with nothing apparent in external metrics, such as servers running out of memory or more CPU being consumed.

I'm going to bisect the 4.1.14 commits and hopefully find the source, but if you've got ideas then let me know.

@johnou
Contributor

johnou commented Sep 29, 2017

@rkapsi any exceptions or thread dumps to share?

@rkapsi
Member Author

rkapsi commented Oct 2, 2017

@johnou there are no exceptions (I have INFO level enabled for the io.netty.handler.ssl and io.netty.internal.tcnative packages). I haven't bothered with thread dumps (yet). My bisect may take a few days (some non-Netty things that I need to do first).

@normanmaurer
Member

@rkapsi unfortunately nothing comes to mind :(

@rkapsi
Member Author

rkapsi commented Oct 9, 2017

Updating my comment: I completed the bisect, which required some intermediate cherry-picking.

The culprit is commit f7b3cae. You must cherry-pick commit 74140db to avoid a big leak.

$ git bisect log 
git bisect start
# bad: [8cc1071881e90b0130bdd35a0441abcd0df6ffa9] [maven-release-plugin] prepare release netty-4.1.14.Final
git bisect bad 8cc1071881e90b0130bdd35a0441abcd0df6ffa9
# good: [2a376eeb1b14b1f2e23e1c30ac2f2a213dbea25b] [maven-release-plugin] prepare for next development iteration
git bisect good 2a376eeb1b14b1f2e23e1c30ac2f2a213dbea25b
# bad: [4af47f0ced39d86a1ef6a644e7c1506d81c0ea1b] AbstractByteBuf.setCharSequence(...) must not expand buffer
git bisect bad 4af47f0ced39d86a1ef6a644e7c1506d81c0ea1b
# bad: [4c14d1198b58e9660c116ce151b077d98b9bd2a2] Add testcase to ensure NioEventLoop.rebuildSelector() works correctly.
git bisect bad 4c14d1198b58e9660c116ce151b077d98b9bd2a2
# bad: [7cfe4161823dec6192543e916b927e7de40190be] Use unbounded queues from JCTools 2.0.2
git bisect bad 7cfe4161823dec6192543e916b927e7de40190be
# good: [df568c739e2a73b4a1aea533a4fea934fdf9d0f7] Use ByteBuf#writeShort/writeMedium instead of writeBytes
git bisect good df568c739e2a73b4a1aea533a4fea934fdf9d0f7
# bad: [86e653e04fb452c92154e39cd7189615dc0ec323] SslHandler aggregation of plaintext data on write
git bisect bad 86e653e04fb452c92154e39cd7189615dc0ec323
# bad: [f7b3caeddc5bb1da75aaafa4a66dec88ed585d69] OpenSslEngine option to wrap/unwrap multiple packets per call
git bisect bad f7b3caeddc5bb1da75aaafa4a66dec88ed585d69
# first bad commit: [f7b3caeddc5bb1da75aaafa4a66dec88ed585d69] OpenSslEngine option to wrap/unwrap multiple packets per call
f7b3caeddc5bb1da75aaafa4a66dec88ed585d69 is the first bad commit
commit f7b3caeddc5bb1da75aaafa4a66dec88ed585d69
Author: Scott Mitchell <scott_mitchell@apple.com>
Date:   Fri Feb 3 17:54:13 2017 -0800

    OpenSslEngine option to wrap/unwrap multiple packets per call
    
    Motivation:
    The JDK SSLEngine documentation says that a call to wrap/unwrap "will attempt to consume one complete SSL/TLS network packet" [1]. This limitation can result in thrashing in the pipeline to decode and encode data that may be spread amongst multiple SSL/TLS network packets.
    ReferenceCountedOpenSslEngine also does not correctly account for the overhead introduced by each individual SSL_write call if there are multiple ByteBuffers passed to the wrap() method.
    
    Modifications:
    - OpenSslEngine and SslHandler supports a mode to not comply with the limitation to only deal with a single SSL/TLS network packet per call
    - ReferenceCountedOpenSslEngine correctly accounts for the overhead of each call to SSL_write
    - SslHandler shouldn't cache maxPacketBufferSize as aggressively because this value may change before/after the handshake.
    
    Result:
    OpenSslEngine and SslHandler can handle multiple SSL/TLS network packets per call.
    
    [1] https://docs.oracle.com/javase/7/docs/api/javax/net/ssl/SSLEngine.html

:040000 040000 9b028b9a007cd584c809eee96aa57bbb7a0c9a19 fd38ca1f94766d5cb29267e2a572f2965c3ab439 M	handler

@johnou
Contributor

johnou commented Oct 10, 2017

@rkapsi silly question (just to rule something out), are you using the new naming scheme with your custom static build?

http://netty.io/news/2017/09/25/4-0-52-Final-4-1-16-Final.html

Native library naming
This release changed the naming scheme of the native .so and .dynlib files that are included in the native jars to also include the architecture on which they were compiled. This was done as part of (#7163) to ensure we do not load the native transport libs when used on architectures that are not supported. If you use any rules to shade these libs, you may need to adjust them to also take the architecture into account.

@rkapsi
Member Author

rkapsi commented Oct 10, 2017

@johnou oh that is way later. This issue is strictly about some bug introduced between 4.1.13 (works) and 4.1.14 (doesn't work).

@johnou
Contributor

johnou commented Oct 10, 2017

@rkapsi okay just to clarify though, have you tested 4.1.16 as well?

@rkapsi
Member Author

rkapsi commented Oct 10, 2017

I have updated my original comment to better reflect the result of the bisect (please see above). The bug is in commit f7b3cae.

@johnou I have not, but I'll give it a spin next.

@rkapsi
Member Author

rkapsi commented Oct 11, 2017

@johnou the issue is present in 4.1.16 as well.

@johnou
Contributor

johnou commented Oct 12, 2017

cc @Scottmitch

@smaldini

smaldini commented Oct 13, 2017

Might also be linked to #2752 (comment)

@normanmaurer
Member

normanmaurer commented Oct 14, 2017 via email

@johnou
Contributor

johnou commented Oct 14, 2017

@rkapsi and if you disable aggregation? SslHandler.setWrapDataSize(0);
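For illustration, a minimal sketch of where that call could go in a pipeline; the SslContext setup and the initializer name are assumptions, only the setWrapDataSize(0) call comes from this comment:

```java
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslHandler;

// Hypothetical initializer; "sslCtx" is assumed to be built elsewhere
// (e.g. with the custom static OpenSSL netty-tcnative build).
public final class NoWrapAggregationInitializer extends ChannelInitializer<SocketChannel> {

    private final SslContext sslCtx;

    public NoWrapAggregationInitializer(SslContext sslCtx) {
        this.sslCtx = sslCtx;
    }

    @Override
    protected void initChannel(SocketChannel ch) {
        SslHandler sslHandler = sslCtx.newHandler(ch.alloc());
        // 0 disables the plaintext write aggregation, so every write is
        // wrapped into its own SSL/TLS record as in 4.1.13.
        sslHandler.setWrapDataSize(0);
        ch.pipeline().addLast("ssl", sslHandler);
        // ... remaining protocol handlers go here
    }
}
```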

@smaldini

I did try it without success, but that makes sense in my case since the new coalescing queue is unbounded and does not update the isWritable() state at all. The SslHandler never writes directly and always goes through the queue.

@smaldini

@normanmaurer @Scottmitch looks like this PR from @violetagg actually fixes the issue we had with unlimited SslHandler buffering - let us know if that looks fine to you and whether we can introduce it before 4.1.17.Final.

@rkapsi
Member Author

rkapsi commented Oct 17, 2017

@johnou ran 4.1.16 with SslHandler.setWrapDataSize(0); for a couple of hours and everything was working fine.

There appears to be some overlap between f7b3cae and 86e653e in terms of wrapping/unwrapping. Is the latter a superset of the former? Anyway, I want to say that this particular problem started with f7b3cae (but it was possibly fixed in the next commit when using 0 for wrapDataSize). I haven't really observed issues with writability, which would quickly result in running out of memory in our use case. But our memory graphs do look slightly different from normal when this issue is happening, and I haven't waited long enough to see if the server(s) eventually OOME, because SSL is borked way earlier.


@normanmaurer
Member

normanmaurer commented Oct 17, 2017 via email

@normanmaurer
Member

@rkapsi that's interesting... can you tell me how big the buffers you write typically are?

@rkapsi
Member Author

rkapsi commented Oct 23, 2017

@normanmaurer I don't have that info, but I'll get it for you. The buffers that get written into the SslHandler, right? Hopefully by tomorrow.

But generally speaking, we have no aggregation (like HttpObjectAggregator) in our pipeline. Stuff just gets shoved up and down the pipeline. On one end we may have a client with a dialup connection, on the other end a server w/ 10Gbit. Backpressure is managed by listening to writability events. It's quite possible to end up with a large buffer from a single select(). The only thing I can think of that isn't default is our SO_RCVBUF, which is 64 KB. I believe the HTTP/2 codec (old API) also does some internal aggregation (almost all SSL traffic is over H2).
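For illustration, the writability-driven backpressure described above could look roughly like this; the proxy-style peer channel and the handler name are assumptions, not code from this project:

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Hypothetical backpressure handler: stop reading from the peer channel while this
// channel's outbound buffer is above the high water mark, resume once it drains.
public final class WritabilityBackpressureHandler extends ChannelInboundHandlerAdapter {

    private final Channel peer; // the channel on the "other side" (slow client vs. 10Gbit server)

    public WritabilityBackpressureHandler(Channel peer) {
        this.peer = peer;
    }

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) {
        peer.config().setAutoRead(ctx.channel().isWritable());
        ctx.fireChannelWritabilityChanged();
    }
}
```

The non-default receive buffer mentioned above would be configured with something like bootstrap.childOption(ChannelOption.SO_RCVBUF, 64 * 1024).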

@normanmaurer
Member

@rkapsi yeah... I'm just trying to get a better understanding of why the aggregation code would mess stuff up.

@Scottmitch
Member

sorry for being MIA

ran 4.1.16 with SslHandler.setWrapDataSize(0); for a couple of hours and everything was working fine.

@rkapsi - This implies that the aggregation code is at fault here. Would the "couple hours" time frame be sufficiently long to see the issues you are observing? Can you confirm that you never see the issue when aggregation is disabled? What type of buffers are you writing?

What I want to understand is what the control flow looks like through the aggregation code in your use case [1].

[1] https://github.com/netty/netty/blob/4.1/handler/src/main/java/io/netty/handler/ssl/SslHandler.java#L1818
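For readers following the aggregation discussion, here is a heavily simplified, conceptual sketch of what coalescing pending plaintext writes up to wrapDataSize means; this is not the actual SslHandler code referenced in [1]:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import java.util.ArrayDeque;
import java.util.Queue;

// Conceptual sketch only -- not the real SslHandler/coalescing-queue implementation.
// Queued plaintext writes are merged into one buffer of at most wrapDataSize bytes
// before being handed to the engine, so several small writes become one record.
final class CoalescingSketch {

    private final Queue<ByteBuf> pending = new ArrayDeque<ByteBuf>();

    void add(ByteBuf msg) {
        pending.add(msg);
    }

    ByteBuf remove(ByteBufAllocator alloc, int wrapDataSize) {
        ByteBuf aggregated = alloc.buffer(wrapDataSize);
        while (!pending.isEmpty() && aggregated.readableBytes() < wrapDataSize) {
            ByteBuf next = pending.peek();
            int toCopy = Math.min(next.readableBytes(), wrapDataSize - aggregated.readableBytes());
            aggregated.writeBytes(next, toCopy); // advances next's readerIndex
            if (!next.isReadable()) {
                pending.remove().release();
            }
        }
        return aggregated;
    }
}
```

In the real handler, setWrapDataSize(0) effectively turns this merging off, so each write is wrapped on its own, which matches the workaround tested above.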

@Scottmitch Scottmitch self-assigned this Oct 23, 2017
@rkapsi
Member Author

rkapsi commented Oct 23, 2017

@Scottmitch: I'll add that info... Hopefully tomorrow.

How it usually unfolds: about 1.5 hours after a deploy, our HW LB is the first to notice that SSL health checks are occasionally failing on one server. Within minutes more Netty servers join in and the failures become more frequent. About 2 hours after the deploy, external monitoring services start to fire (i.e. they connect through the HW LB to our Netty servers). At this point it's maybe noticeable to our users, and that's when I usually stop. Non-SSL traffic handled by the same Netty servers is fine. In terms of memory, the servers seem to keep consuming more of it, whereas with 4.1.13 or WrapDataSize(0) it settles at some amount by this point.

I'll repeat the test but "couple of hours" was 4 hours IIRC.

@Scottmitch
Member

I'll add that info... Hopefully tomorrow.

thx

In terms of memory, the servers seem to keep consuming more of it, whereas with 4.1.13 or WrapDataSize(0) it settles at some amount by this point.

Does this mean the HW LB and users still see connectivity issues even if WrapDataSize(0) is used?

@rkapsi
Member Author

rkapsi commented Oct 24, 2017

Sorry if that was confusing; everything was OK when using WrapDataSize(0). I'm collecting the data regarding the ByteBufs that are being written to the SslHandler.

@rkapsi
Member Author

rkapsi commented Oct 24, 2017

[screenshot from 2017-10-24 10-07-02: graphs of "ByteBuf types", "ByteBuf with and without max capacity", and "Avg. ByteBuf length"]

@Scottmitch
Member

@rkapsi - can you explain the graphs:

  • what happens at around 9:12?
    • Is a previous version of Netty running with production traffic before this time, and is the new version of Netty deployed after it?
  • "ByteBuf types"
    • how many instances are contributing to these metrics, and what versions of Netty are they running?
  • "ByteBuf with and without max capacity"
    • is this from 2 different instances taking the same traffic?
    • or is this just a total count of ByteBuf objects allocated "without" (e.g. WrapDataSize(0)) and "with" (e.g. the default value)?
  • "Avg. ByteBuf length"
    • how is the metric measured? Is this from ByteBuf allocations in your application, or does this include resize/aggregation in SslHandler?
    • how many instances of your application are contributing to this metric?

@rkapsi
Member Author

rkapsi commented Oct 24, 2017

  • 9:12 was just a deploy. I had a few instances emitting these metrics overnight and then deployed it to the whole fleet.

  • ByteBuf types: Will email you.

  • With/Without max capacity: No, that is just a count of the ByteBufs written to the SslHandler[1], split by whether ByteBuf#maxCapacity() == Integer.MAX_VALUE or not.

  • Avg. ByteBuf length: Just the average of ByteBuf#readableBytes() for the ByteBufs being written to the SslHandler[1].

[1]: The data comes from a simple ChannelOutboundHandler sitting between the SslHandler and the protocol codec (see the sketch below).
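A sketch of what such a measuring handler could look like; the counter fields and the way they are reported are assumptions, only the placement between the protocol codec and the SslHandler is from this thread:

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelOutboundHandlerAdapter;
import io.netty.channel.ChannelPromise;

// Hypothetical metrics handler: counts outbound ByteBufs on their way to the SslHandler.
public final class SslWriteMetricsHandler extends ChannelOutboundHandlerAdapter {

    private long totalBuffers;
    private long totalReadableBytes;
    private long unboundedMaxCapacity; // ByteBuf#maxCapacity() == Integer.MAX_VALUE

    @Override
    public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) {
        if (msg instanceof ByteBuf) {
            ByteBuf buf = (ByteBuf) msg;
            totalBuffers++;
            totalReadableBytes += buf.readableBytes();
            if (buf.maxCapacity() == Integer.MAX_VALUE) {
                unboundedMaxCapacity++;
            }
        }
        // Pass the message through unchanged; the SslHandler sits closer to the head.
        ctx.write(msg, promise);
    }
}
```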

@Scottmitch Scottmitch added this to the 4.0.53.Final milestone Nov 2, 2017
@rkapsi
Member Author

rkapsi commented Nov 2, 2017

An update regarding this issue: @Scottmitch and I worked on this directly (email, hangouts) and narrowed it down to the combination of wrapDataSize > 0 and jdkCompatibilityMode=true triggering it.

#7352 fixes it and #7354 addresses some confusion that emerged during debugging.

Thanks everybody.

@Scottmitch
Member

thanks for your help tracking this down, @rkapsi!
