Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OOM due to SSL key materials cached every time when there is new connection when using OpenSslCachingX509KeyManagerFactory #9747

Closed
lvfangmin opened this issue Nov 1, 2019 · 23 comments · Fixed by #9759

Comments

@lvfangmin
Copy link

lvfangmin commented Nov 1, 2019

Expected behavior

Recently, we found the OOM issue when switching from JDK Ssl to OpenSsl in netty.

We're using OpenSslCachingX509KeyManagerFactory explicity, so Netty will use OpenSslCachingKeyMaterialProvide to cache and reduce the overhead of parsing the chain and the key for generation of the material.

We expect to see performance optimization but shouldn't see OOM issue.

Actual behavior

But with stress test for TLS connection, we saw the memory linearly increasing and eventually OOM.

After debugging into the Netty and OpenJDK ssl code, we found the problem is that every time when there is a new connection, handshake cert selection callback OpenSslClientCertificateCallback is called and it will try to find the alias key materials from the cache, if it doesn't exist it will try to find the match alias from server cert chain, which created a new alias in format of seq_id.builderIndex.keyStoreAlias, like 924450.0.key. And it will parse the chain and key, put into the cache with the new alias, and this retained the refCnt of the key material and prevented the native memory being destroyed, that's why we eventually saw the OOM issue.

Changing to use OpenSslX509KeyManagerFactory solved this problem.

Steps to reproduce

Using OpenSslCachingX509KeyManagerFactory to set up the SSLContext, and keep issuing Issuing lots of TLS connection requests.

Minimal yet complete reproducer code (or URL to code)

Netty version

We're using 4.1.36.Final.

JVM version (e.g. java -version)

java version "10.0.1" 2018-04-17
Java(TM) SE Runtime Environment 18.3 (build 10.0.1+10)
Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.1+10, mixed mode)

@normanmaurer
Copy link
Member

normanmaurer commented Nov 1, 2019 via email

@lvfangmin
Copy link
Author

@normanmaurer thanks for the quick response, I'm not able to sharing the heap dump due to the security reason, but I can share the issue I saw in the heap and the stack and ssl log.

Let me get those, will update it here.

@lvfangmin
Copy link
Author

Here is the main thing I saw in the heap dump, which holds 67M of cached OpenSsl key materials in
OpenSslCachingKeyMaterialProvide:

Screen Shot 2019-10-31 at 2 00 59 PM

I changed the OpenJDK with more logs like adding the stack trace when chooseAlias happened:

`
ssl: Printing stack trace:

ssl: at sun.security.ssl.X509KeyManagerImpl.getAliases(X509KeyManagerImpl.java:726)
ssl: at sun.security.ssl.X509KeyManagerImpl.chooseAlias(X509KeyManagerImpl.java:388)
ssl: at sun.security.ssl.X509KeyManagerImpl.chooseEngineServerAlias(X509KeyManagerImpl.java:154)
ssl: at io.netty.handler.ssl.OpenSslKeyMaterialManager.chooseServerAlias(OpenSslKeyMaterialManager.java:123)
ssl: at io.netty.handler.ssl.OpenSslKeyMaterialManager.setKeyMaterialServerSide(OpenSslKeyMaterialManager.java:75)
ssl: at io.netty.handler.ssl.ReferenceCountedOpenSslServerContext$OpenSslServerCertificateCallback.handle(ReferenceCountedOpenSslServerContext.java:211)
ssl: at io.netty.internal.tcnative.SSL.readFromSSL(SSL.java:-2)
ssl: at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.readPlaintextData(ReferenceCountedOpenSslEngine.java:570)
ssl: at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1146)
ssl: at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1263)
ssl: at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1306)
ssl: at io.netty.handler.ssl.SslHandler$SslEngineType$1.unwrap(SslHandler.java:217)
ssl: at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1332)
ssl: at io.netty.handler.ssl.SslHandler.decodeNonJdkCompatible(SslHandler.java:1239)
ssl: at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1276)
ssl: at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502)
ssl: at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
ssl: at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
...
ssl: at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682)
ssl: at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:617)
ssl: at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:534)
ssl: at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
ssl: at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
ssl: at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
ssl: at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
ssl: at java.lang.Thread.run(Thread.java:844)
ssl: KeyMgr: check alias key
ssl: KeyMgr: check keyType EC on alias key true EC
`

@lvfangmin
Copy link
Author

One thing I'm trying to figure out is if this is a Netty issue or it's due to how our cert is being generated and we shouldn't use caching Key metarial provider.

Given we'll generate new alias id even the cert is perfect matching here here, it seems to me there is no benefit of having OpenSslCachingKeyMaterialProvide, are we using it in a wrong way?

@normanmaurer
Copy link
Member

@lvfangmin yeah you should not use the caching one then. That said I think we should most likely bound the maximum number of cached entries.

@lvfangmin
Copy link
Author

lvfangmin commented Nov 1, 2019

@normanmaurer I'm wondering if the cache helps or not in general for other users, since new alias id with unique id will be generated even the cert is perfect matching in X509KeyManagerImpl.

@normanmaurer
Copy link
Member

@lvfangmin also who is "we" here ? I know a lot of different users who use the cache with great benefit. Maybe they just not use so many certs ? Can you share some sort of reproducer ?

@normanmaurer
Copy link
Member

Also this is on the client side or server side ?

@lvfangmin
Copy link
Author

@normanmaurer when I mentioned "we" I meant the JDK X509KeyManagerImpl, internally we only have a single cert, so I don't think it's a cert number issue.

And this code is on server side.

I know a lot of different users who use the cache with great benefit

That's why I'm still wondering if it's because we're not using it in a correct way. But based on the JDK X509KeyManagerImpl code, it seems it's the expected behavior that we'll cache the same cert and key with unique id for every connection, am I understanding correctly?

From the JDK code, sees it's using a unique id to allow us to reliably cache entries between the calls to getCertificateChain() and getPrivateKey() even if tokens are inserted or removed.

@normanmaurer
Copy link
Member

@lvfangmin interesting... can you please share a simple reproducer for this ? I would like to also see how you bootstrap everything etc.

@lvfangmin
Copy link
Author

We added openssl implementation in ZooKeeper, which we haven't upstreamed yet, but you can find a general idea how we set it up here: diff.txt

The method to create SslContext is createNettyOpenSslContext.

@lvfangmin
Copy link
Author

I'll try and see if I can set up a minimal reproducer, meanwhile, @normanmaurer can you give me more context how the OpenSslCachingX509KeyManagerFactory and OpenSslCachingKeyMaterialProvide works, which will help me understand how it improves the performance during handshake.

@normanmaurer
Copy link
Member

@lvfangmin I think the fact that it works for others is that they have their own KeyManager implementations that have "stable" alias between connections. So yes I think for you it is correct to not use OpenSslCachingX509KeyManagerFactory. That said there should be no performance hit when using BoringSSL as it allows us to not parse the key / cert and just use the bytes directly internally.

I think it is still a good idea to limit the cache tho.

@johnou
Copy link
Contributor

johnou commented Nov 4, 2019

@lvfangmin how big is your keystore exactly? not some autogenerated thing with thousands of obsolete / duplicate certificates I hope?

@lvfangmin
Copy link
Author

@normanmaurer @johnou we have a single keystore, which returns a 'key' alias every time, the problem is that the alias chose from the JDK X509KeyManagerImpl is unique every time.

I spent a day to abstract our openssl code and put it into a file here:

NettySslTest.txt

It's a java test, I renamed it to txt to be able to upload here, it's the close way how we use Netty OpenSsl inside ZK. The test is create a simple server, and then issuing 1 million create/close client request. If you run the test, it will you'll see the memory keep growing, and from the dumped memory you'll see the cached key materials in OpenSslCachingX509KeyManagerFactory is keep growing.

@normanmaurer
Copy link
Member

@lvfangmin thanks will have a look

@normanmaurer
Copy link
Member

@lvfangmin can you also share the testKeyStore.jks and testTrustStore.jks files ?

@normanmaurer
Copy link
Member

@lvfangmin after some debugging I think I also know why people usually not see this problem.

By default we use "SunX509" as algorithm when creating the KeyManagerFactory. When this is used the JDK uses SunX509KeyManagerImpl. This one uses "stable" aliases and so the caching works as expected. You specify another algorithm and so it ends up using X509KeyManagerImpl which does not provide stable aliases.

So to fix this I think we should do two things:

  • If X509KeyManagerImpl is used we should not cache if not explicit told so
  • ensure the cache can not grow without bounds..

WDYT ?

normanmaurer added a commit that referenced this issue Nov 6, 2019
… is not bound. We should support an upper limit

Motivation:

At the moment te cache is not bound and so lead to huge memory consumpation. We should ensure its bound by default.

Modifications:

Ensure cache is bound

Result:

Fixes #9747.
@normanmaurer
Copy link
Member

@lvfangmin added the bound #9759...

@lvfangmin
Copy link
Author

@normanmaurer thanks for help debugging the issue here, the improvements you suggested to add protection in Netty to avoid using OpenSslCachingX509KeyManagerFactory if the key manager algorithm is using X509KeyManagerImpl sounds reasonable to me, otherwise it will add overhead instead of making it more efficient. If this is not easy to achieve, maybe we should update the doc of OpenSslCachingX509KeyManagerFactory to add this useful information.

The bound code looks good to me, thanks for quickly implementing that.

Meanwhile, we'll also consider to use the default key manager factory algorithm.

Also, is there any performance difference when using OpenSslCachingX509KeyManagerFactory and OpenSslX509KeyManagerFactory, which one is suggested?

normanmaurer added a commit that referenced this issue Nov 7, 2019
#9759)


Motivation:

At the moment te cache is not bound and so lead to huge memory consumpation. We should ensure its bound by default.

Modifications:

Ensure cache is bound

Result:

Fixes #9747.
normanmaurer added a commit that referenced this issue Nov 7, 2019
Motivation:

sun.security.ssl.X509KeyManagerImpl will not use "stable" aliases and so aliases may be changed during invocations. This means caching is useless. Because of this we should disable the cache if its used.

Modifications:

- Disable caching if sun.security.ssl.X509KeyManagerImpl is used
- Add tests

Result:

More protection against #9747.
@normanmaurer
Copy link
Member

@lvfangmin please check #9762...

yes OpenSslCachingX509KeyManagerFactory is a bit faster as it can cache the generated "native" key material while for OpenSslX509KeyManagerFactory we will need to re-generate it on each invocation.

normanmaurer added a commit that referenced this issue Nov 7, 2019
…ed (#9762)

Motivation:

sun.security.ssl.X509KeyManagerImpl will not use "stable" aliases and so aliases may be changed during invocations. This means caching is useless. Because of this we should disable the cache if its used.

Modifications:

- Disable caching if sun.security.ssl.X509KeyManagerImpl is used
- Add tests

Result:

More protection against #9747.
normanmaurer added a commit that referenced this issue Nov 7, 2019
#9759)

Motivation:

At the moment te cache is not bound and so lead to huge memory consumpation. We should ensure its bound by default.

Modifications:

Ensure cache is bound

Result:

Fixes #9747.
normanmaurer added a commit that referenced this issue Nov 7, 2019
…ed (#9762)

Motivation:

sun.security.ssl.X509KeyManagerImpl will not use "stable" aliases and so aliases may be changed during invocations. This means caching is useless. Because of this we should disable the cache if its used.

Modifications:

- Disable caching if sun.security.ssl.X509KeyManagerImpl is used
- Add tests

Result:

More protection against #9747.
@lvfangmin
Copy link
Author

Cool, thanks for improvement made in 9762.

@normanmaurer
Copy link
Member

@lvfangmin no worries... thanks for reporting :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants