-
-
Notifications
You must be signed in to change notification settings - Fork 15.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OOM due to SSL key materials cached every time when there is new connection when using OpenSslCachingX509KeyManagerFactory #9747
Comments
Which version of netty? Also can you share a heap dump ?
… Am 01.11.2019 um 19:33 schrieb Allan Lyu ***@***.***>:
Expected behavior
Recently, we found the some OOM issue when switching to OpenSsl netty implementation.
due to SSL key materials cached every time when there is new connection when using OpenSslCachingX509KeyManagerFactory.
Actual behavior
Steps to reproduce
Minimal yet complete reproducer code (or URL to code)
Netty version
JVM version (e.g. java -version)
OS version (e.g. uname -a)
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
@normanmaurer thanks for the quick response, I'm not able to sharing the heap dump due to the security reason, but I can share the issue I saw in the heap and the stack and ssl log. Let me get those, will update it here. |
Here is the main thing I saw in the heap dump, which holds 67M of cached OpenSsl key materials in I changed the OpenJDK with more logs like adding the stack trace when chooseAlias happened: ` ssl: at sun.security.ssl.X509KeyManagerImpl.getAliases(X509KeyManagerImpl.java:726) |
One thing I'm trying to figure out is if this is a Netty issue or it's due to how our cert is being generated and we shouldn't use caching Key metarial provider. Given we'll generate new alias id even the cert is perfect matching here here, it seems to me there is no benefit of having OpenSslCachingKeyMaterialProvide, are we using it in a wrong way? |
@lvfangmin yeah you should not use the caching one then. That said I think we should most likely bound the maximum number of cached entries. |
@normanmaurer I'm wondering if the cache helps or not in general for other users, since new alias id with unique id will be generated even the cert is perfect matching in X509KeyManagerImpl. |
@lvfangmin also who is "we" here ? I know a lot of different users who use the cache with great benefit. Maybe they just not use so many certs ? Can you share some sort of reproducer ? |
Also this is on the client side or server side ? |
@normanmaurer when I mentioned "we" I meant the JDK X509KeyManagerImpl, internally we only have a single cert, so I don't think it's a cert number issue. And this code is on server side.
That's why I'm still wondering if it's because we're not using it in a correct way. But based on the JDK X509KeyManagerImpl code, it seems it's the expected behavior that we'll cache the same cert and key with unique id for every connection, am I understanding correctly? From the JDK code, sees it's using a unique id to allow us to reliably cache entries between the calls to getCertificateChain() and getPrivateKey() even if tokens are inserted or removed. |
@lvfangmin interesting... can you please share a simple reproducer for this ? I would like to also see how you bootstrap everything etc. |
We added openssl implementation in ZooKeeper, which we haven't upstreamed yet, but you can find a general idea how we set it up here: diff.txt The method to create SslContext is createNettyOpenSslContext. |
I'll try and see if I can set up a minimal reproducer, meanwhile, @normanmaurer can you give me more context how the OpenSslCachingX509KeyManagerFactory and OpenSslCachingKeyMaterialProvide works, which will help me understand how it improves the performance during handshake. |
@lvfangmin I think the fact that it works for others is that they have their own KeyManager implementations that have "stable" alias between connections. So yes I think for you it is correct to not use I think it is still a good idea to limit the cache tho. |
@lvfangmin how big is your keystore exactly? not some autogenerated thing with thousands of obsolete / duplicate certificates I hope? |
@normanmaurer @johnou we have a single keystore, which returns a 'key' alias every time, the problem is that the alias chose from the JDK X509KeyManagerImpl is unique every time. I spent a day to abstract our openssl code and put it into a file here: It's a java test, I renamed it to txt to be able to upload here, it's the close way how we use Netty OpenSsl inside ZK. The test is create a simple server, and then issuing 1 million create/close client request. If you run the test, it will you'll see the memory keep growing, and from the dumped memory you'll see the cached key materials in OpenSslCachingX509KeyManagerFactory is keep growing. |
@lvfangmin thanks will have a look |
@lvfangmin can you also share the |
@lvfangmin after some debugging I think I also know why people usually not see this problem. By default we use "SunX509" as algorithm when creating the KeyManagerFactory. When this is used the JDK uses SunX509KeyManagerImpl. This one uses "stable" aliases and so the caching works as expected. You specify another algorithm and so it ends up using So to fix this I think we should do two things:
WDYT ? |
… is not bound. We should support an upper limit Motivation: At the moment te cache is not bound and so lead to huge memory consumpation. We should ensure its bound by default. Modifications: Ensure cache is bound Result: Fixes #9747.
@lvfangmin added the bound #9759... |
@normanmaurer thanks for help debugging the issue here, the improvements you suggested to add protection in Netty to avoid using OpenSslCachingX509KeyManagerFactory if the key manager algorithm is using X509KeyManagerImpl sounds reasonable to me, otherwise it will add overhead instead of making it more efficient. If this is not easy to achieve, maybe we should update the doc of OpenSslCachingX509KeyManagerFactory to add this useful information. The bound code looks good to me, thanks for quickly implementing that. Meanwhile, we'll also consider to use the default key manager factory algorithm. Also, is there any performance difference when using OpenSslCachingX509KeyManagerFactory and OpenSslX509KeyManagerFactory, which one is suggested? |
Motivation: sun.security.ssl.X509KeyManagerImpl will not use "stable" aliases and so aliases may be changed during invocations. This means caching is useless. Because of this we should disable the cache if its used. Modifications: - Disable caching if sun.security.ssl.X509KeyManagerImpl is used - Add tests Result: More protection against #9747.
@lvfangmin please check #9762... yes |
…ed (#9762) Motivation: sun.security.ssl.X509KeyManagerImpl will not use "stable" aliases and so aliases may be changed during invocations. This means caching is useless. Because of this we should disable the cache if its used. Modifications: - Disable caching if sun.security.ssl.X509KeyManagerImpl is used - Add tests Result: More protection against #9747.
…ed (#9762) Motivation: sun.security.ssl.X509KeyManagerImpl will not use "stable" aliases and so aliases may be changed during invocations. This means caching is useless. Because of this we should disable the cache if its used. Modifications: - Disable caching if sun.security.ssl.X509KeyManagerImpl is used - Add tests Result: More protection against #9747.
Cool, thanks for improvement made in 9762. |
@lvfangmin no worries... thanks for reporting :) |
Expected behavior
Recently, we found the OOM issue when switching from JDK Ssl to OpenSsl in netty.
We're using OpenSslCachingX509KeyManagerFactory explicity, so Netty will use OpenSslCachingKeyMaterialProvide to cache and reduce the overhead of parsing the chain and the key for generation of the material.
We expect to see performance optimization but shouldn't see OOM issue.
Actual behavior
But with stress test for TLS connection, we saw the memory linearly increasing and eventually OOM.
After debugging into the Netty and OpenJDK ssl code, we found the problem is that every time when there is a new connection, handshake cert selection callback OpenSslClientCertificateCallback is called and it will try to find the alias key materials from the cache, if it doesn't exist it will try to find the match alias from server cert chain, which created a new alias in format of seq_id.builderIndex.keyStoreAlias, like 924450.0.key. And it will parse the chain and key, put into the cache with the new alias, and this retained the refCnt of the key material and prevented the native memory being destroyed, that's why we eventually saw the OOM issue.
Changing to use OpenSslX509KeyManagerFactory solved this problem.
Steps to reproduce
Using OpenSslCachingX509KeyManagerFactory to set up the SSLContext, and keep issuing Issuing lots of TLS connection requests.
Minimal yet complete reproducer code (or URL to code)
Netty version
We're using 4.1.36.Final.
JVM version (e.g.
java -version
)java version "10.0.1" 2018-04-17
Java(TM) SE Runtime Environment 18.3 (build 10.0.1+10)
Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.1+10, mixed mode)
The text was updated successfully, but these errors were encountered: