Classloaders janino and eclipse.jaxb driving up WeakKey creation #7670

Closed
robododge opened this issue Jan 27, 2023 · 5 comments · Fixed by #7710
Labels
bug Something isn't working

Comments

@robododge
Contributor

Describe the bug
These classloaders are responsible for driving up the number of OpenTelemetry WeakKey instances in the AbstractWeakConcurrentMap (WeakLockFreeCache), and we have experienced large-scale outages in our older monolith applications because of this.

Problematic classloaders:

  • org.codehaus.janino.ByteArrayClassLoader
  • org.eclipse.persistence.internal.jaxb.JaxbClassLoader

Steps to reproduce
Janino: Run Java auto-instrumentation against a large monolith, or against any application with an old JBoss Drools rules engine embedded, and have the rules engine process a few thousand rules. Check the heap histogram of the running JVM: you will see the exact number of io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap$WeakKey references in the heap, and they will not be collected. (A minimal reproduction sketch is included below.)
JAXB: Adding this JAXB classloader as well; it appears to be created on a per-thread basis, so if your app has 200 threads in its Tomcat NIO handler, you will see a $WeakKey count of 200 build up as each thread is brought into service under load. Any secondary WeakKey references stored in the AbstractWeakConcurrentMap due to ByteBuddy transforms will be cleaned up by the background thread introduced with the fixes in #6240, but it becomes a tug of war: keys are created, and the background thread comes in and cleans them.
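For reference, a minimal reproduction sketch of the Janino side, assuming Janino's ExpressionEvaluator API. The loop stands in for what Drools does per rule condition element; the class name and loop count are illustrative, not the exact Drools code path:

```java
import org.codehaus.janino.ExpressionEvaluator;

public class JaninoWeakKeyRepro {
  public static void main(String[] args) throws Exception {
    // Each cook() compiles a small class that is typically loaded through a
    // fresh org.codehaus.janino.ByteArrayClassLoader, roughly what Drools
    // does for every rule condition element. With the javaagent attached,
    // each of those loaders is matched against every instrumentation matcher
    // and ends up as WeakKey entries in the agent's weak caches.
    for (int i = 0; i < 1_000; i++) {
      ExpressionEvaluator ee = new ExpressionEvaluator();
      ee.setParameters(new String[] {"a", "b"}, new Class<?>[] {int.class, int.class});
      ee.setExpressionType(int.class);
      ee.cook("a + b + " + i);
      ee.evaluate(new Object[] {1, 2});
    }
    // Now take a heap histogram and count the
    // ...AbstractWeakConcurrentMap$WeakKey instances.
    System.out.println("compiled 1000 expressions; inspect the heap now");
    Thread.sleep(Long.MAX_VALUE); // keep the JVM alive for inspection
  }
}
```

With the javaagent attached, the WeakKey count observed in the histogram should scale with the number of compiled expressions, matching the behavior described above.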

What did you expect to see?
No WeakKeys created in any AbstractWeakConcurrentMap caches, since the classes generated by these classloaders are of no use to distributed tracing.

What did you see instead?
n * m weak keys created, where n = number of rules in the rules engine and m = number of instrumentation matcher patterns. After the background thread cleans, only n WeakKeys remain in the heap.

What version are you using?
Reproduced in 1.13.1 and 1.21.0. (Note: in 1.13.1 keys are not cleaned, so memory can be leaked.)

Environment
Compiler: 1.8.0_321-b3 64bit, AdoptOpenJDK 8.0.352+8
OS: Redhat Linux 7.9, MacOs
Runtime: Oracle JRE 1.8.0_321-b3 64bit, AdoptOpenJDK 8.0.352+8

Additional context

@laurit
Contributor

laurit commented Feb 1, 2023

We are using the same kind of cache in multiple places; do you know which of these is to blame for the excessive number of keys? If you have a heap dump at hand, you should be able to list all the instances of WeakLockFreeCache and sort them by size. Tell us how large the problematic one is and where it is referenced from. Alternatively, I guess weak keys could be accumulating in the reference queue; you could check that also.

@robododge
Contributor Author

@laurit Thanks for responding; please see the attached image of the heap dump below. This was done with javaagent 1.21.0. The first two cache entries are for the reference queue and the helper injectors, respectively. The block of 15 caches below that corresponds to the Janino classloaders being tracked for each distinct matcher. There are 1,127 Drools rule engine elements defined in XML for this app, and at runtime each one creates a new Janino classloader instance. If I query the heap dump for all the AbstractWeakConcurrentMap$WeakKey objects whose referent is a Janino classloader, I get 18,032 weak keys. The math matches up exactly:

  • Matcher cache weak keys: 15 x 1,127 = 16,905
  • Reference queue weak keys: 1,127
  • Total: 16,905 + 1,127 = 18,032 weak keys

These keys stay and are never collected. If, for example, my application defined n individual Drools rule condition elements, either statically or on the fly, the weak key count would grow as a function of n: the total would be n + (n * 15).

This same exercise can be carried out for org.eclipse.persistence.internal.jaxb.JaxbClassLoader, as it has the same lifecycle, being used in the Jackson processing that creates Java classes on the fly.

[Attached screenshot: heap dump listing of the WeakLockFreeCache instances (weklockfreecache-otel-github)]

During network recovery situations, the rules engine component gets reloaded. When several reloads occur, we end up building thousands of weak keys, and the key-cleaning thread becomes busy constantly flushing the keys and values that are eligible. Prior to the 1.21 release, these weak keys would stack up with no thread to come along and clean them up; in one case we ended up with 5.7 million of them, which brought a critical app down. We would prefer to avoid this altogether and never consider these classloaders.

@laurit
Contributor

laurit commented Feb 2, 2023

These keys stay and are never collected.

They should be collected when the class loader is collected; if they are not, then there is a memory leak. We have had bugs where caches weren't cleaned properly, for example #6230, which is probably what caused trouble for you. To our best knowledge we have fixed these leaks. If you wish to verify this, I guess you could make your app reload the rules multiple times and see whether these caches behave as expected. With older versions of the agent you'll probably see that keys for collected class loaders are not removed. Let us know if you believe there is still a leak in the current version.
None of the caches in the screenshot are excessively large, so it seems fine to me.
Excluding class loaders that are known not to contain anything we might need to instrument is fine. Especially when thousands of such class loaders are created, excluding them might improve startup time.
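To illustrate the collection behavior described above, here is a plain-JDK sketch (not the agent's actual cache implementation): a weak key only becomes clearable once nothing else strongly references the class loader, after which the agent's cleaner can purge the entry.

```java
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLClassLoader;

public class WeakKeyCollectionDemo {
  public static void main(String[] args) throws Exception {
    ClassLoader loader = new URLClassLoader(new URL[0]);
    WeakReference<ClassLoader> weakKey = new WeakReference<>(loader);

    // While the loader is strongly reachable (e.g. a live rule still uses it),
    // the weak key keeps its referent.
    System.out.println("before: referent present = " + (weakKey.get() != null));

    loader = null;   // drop the last strong reference, as when rules are discarded
    System.gc();     // a hint only; usually sufficient for a demo
    Thread.sleep(100);

    // After collection the weak key is cleared and can be purged from a cache.
    System.out.println("after:  referent cleared = " + (weakKey.get() == null));
  }
}
```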

@robododge
Contributor Author

Thanks, you are correct: older versions of the javaagent prior to the fix in #6240 will leak. I have run the tests you recommend and, yes, the caches do behave as expected as long as a v1.21 or newer javaagent is used. What I'm asking for here is to exclude these two classloaders, as I see no need to loop them into instrumentation; it just wastes processing time and puts the cache-cleaner thread into overdrive. Maybe this project doesn't want to play whack-a-mole by addressing every obscure classloader that comes up. No worries then; we are excluding them in our custom extensions (see the sketch below).
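For reference, a minimal sketch of the kind of extension meant here, assuming the javaagent's IgnoredTypesConfigurer SPI (the package name, builder methods, and exact configure signature vary between agent versions, and the class also has to be registered as a Java service):

```java
package com.example.otelext; // hypothetical package for this sketch

import io.opentelemetry.javaagent.extension.ignore.IgnoredTypesBuilder;
import io.opentelemetry.javaagent.extension.ignore.IgnoredTypesConfigurer;
import io.opentelemetry.sdk.autoconfigure.spi.ConfigProperties;

// Registered via
// META-INF/services/io.opentelemetry.javaagent.extension.ignore.IgnoredTypesConfigurer
public class CodeGenClassLoaderIgnores implements IgnoredTypesConfigurer {

  @Override
  public void configure(IgnoredTypesBuilder builder, ConfigProperties config) {
    // Classes loaded by these code-generating loaders are never interesting
    // for distributed tracing, so skip matching them entirely; this also
    // avoids creating WeakKey cache entries for every such loader.
    builder.ignoreClassLoader("org.codehaus.janino.ByteArrayClassLoader");
    builder.ignoreClassLoader("org.eclipse.persistence.internal.jaxb.JaxbClassLoader");
  }
}
```

Packaged as an extension jar and passed to the agent (e.g. via the otel.javaagent.extensions property), this keeps the agent from matching, and caching, anything loaded by those two loaders.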

@robododge
Contributor Author

Thanks! I just realized the PR was approved.

mateuszrzeszutek pushed a commit that referenced this issue Feb 2, 2023
Fixes #7670

These obscure classloaders should be ignored completely; they perform in-line class creation and do not branch out to interactions that are useful in distributed tracing. These two were recently responsible for outages in many of our applications due to driving up memory usage from WeakKey caching. In some cases, Janino processing ran wild and created over 5.7 million WeakKey objects, accounting for close to 200 MB of heap.