Classloaders janino and eclipse.jaxb driving up WeakKey creation #7670

Closed
robododge opened this issue Jan 27, 2023 · 5 comments · Fixed by #7710
Labels
bug Something isn't working

Comments

@robododge
Contributor

Describe the bug
These classloaders are responsible for driving up the number of OpenTelemetry WeakKey instances in the AbstractWeakConcurrentMap (WeakLockFreeCache), and we have experienced large-scale outages in our older monolith applications because of this.

Problematic classloaders:

  • org.codehaus.janino.ByteArrayClassLoader
  • org.eclipse.persistence.internal.jaxb.JaxbClassLoader

Steps to reproduce
Janino: Run Java auto-instrumentation against a large monolith, or against any application with an old JBoss Drools rules engine embedded, and have the rules engine process a few thousand rules. Check the heap histogram of the running JVM: you will see the exact number of io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap$WeakKey references in the heap, and they will not be collected. (A minimal reproduction sketch is included below.)
JAXB: Adding this JAXB classloader as well; it appears to be created on a per-thread basis, so if your app has 200 threads in its Tomcat NIO handler, you will see a $WeakKey count of 200 build up as each thread is brought into service under load. Any secondary WeakKey references stored in the AbstractWeakConcurrentMap due to ByteBuddy transforms will be cleaned up by the background thread introduced with the fixes in #6240, but it becomes a tug of war: keys are created, and the background thread comes in and cleans them.
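For reference, a minimal reproduction sketch of the Janino side, assuming Janino's ExpressionEvaluator API. The loop stands in for what Drools does per rule condition element; the class name and loop count are illustrative, not the exact Drools code path:

```java
import org.codehaus.janino.ExpressionEvaluator;

public class JaninoWeakKeyRepro {
  public static void main(String[] args) throws Exception {
    // Each cook() compiles a small class that is typically loaded through a
    // fresh org.codehaus.janino.ByteArrayClassLoader, roughly what Drools
    // does for every rule condition element. With the javaagent attached,
    // each of those loaders is matched against every instrumentation matcher
    // and ends up as WeakKey entries in the agent's weak caches.
    for (int i = 0; i < 1_000; i++) {
      ExpressionEvaluator ee = new ExpressionEvaluator();
      ee.setParameters(new String[] {"a", "b"}, new Class<?>[] {int.class, int.class});
      ee.setExpressionType(int.class);
      ee.cook("a + b + " + i);
      ee.evaluate(new Object[] {1, 2});
    }
    // Now take a heap histogram and count the
    // ...AbstractWeakConcurrentMap$WeakKey instances.
    System.out.println("compiled 1000 expressions; inspect the heap now");
    Thread.sleep(Long.MAX_VALUE); // keep the JVM alive for inspection
  }
}
```

With the javaagent attached, the WeakKey count observed in the histogram should scale with the number of compiled expressions, matching the behavior described above.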

What did you expect to see?
No WeakKeys created in any AbstractWeakConcurrentMap caches, since the classes generated by these classloaders are of no use to distributed tracing.

What did you see instead?
n * m weak keys created, where n = number of rules in the rules engine and m = number of instrumentation matcher patterns. After the background thread cleans, only n WeakKeys remain in the heap.

What version are you using?
Reproduced in 1.13.1 and 1.21.0. (Note: in 1.13.1 keys are not cleaned, so memory can be leaked.)

Environment
Compiler: 1.8.0_321-b3 64bit, AdoptOpenJDK 8.0.352+8
OS: Redhat Linux 7.9, MacOs
Runtime: Oracle JRE 1.8.0_321-b3 64bit, AdoptOpenJDK 8.0.352+8

Additional context

@laurit
Contributor

laurit commented Feb 1, 2023

We are using the same kind of cache in multiple places; do you know which of these is to blame for the excessive number of keys? If you have a heap dump at hand, you should be able to list all the instances of WeakLockFreeCache and sort them by size. Tell us how large the problematic one is and where it is referenced from. Alternatively, I guess weak keys could be accumulating in the reference queue; you could check that also.

@robododge
Contributor Author

@laurit Thanks for responding; please see the attached image of the heap dump below. This was done with javaagent 1.21.0. The first two cache entries are for the reference queue and the helper injectors, respectively. The block of 15 caches below that corresponds to the Janino classloaders being tracked for each distinct matcher. There are 1,127 Drools rule engine elements defined in XML for this app, and at runtime each one creates a new Janino classloader instance. If I query the heap dump for all the AbstractWeakConcurrentMap$WeakKey objects whose referent is a Janino classloader, I get 18,032 weak keys. The math matches up exactly:

  • Matcher cache weak keys: 15 x 1,127 = 16,905
  • Reference queue weak keys: 1,127
  • Total: 16,905 + 1,127 = 18,032 weak keys

These keys stay and are never collected. If, for example, my application defined n individual Drools rule condition elements, either statically or on the fly, the weak key count would grow as a function of n: the total would be n + (n * 15).

This same exercise can be carried out for org.eclipse.persistence.internal.jaxb.JaxbClassLoader, as it has the same lifecycle, being used in the Jackson processing that creates Java classes on the fly.

[Attached screenshot: heap dump listing of the WeakLockFreeCache instances (weklockfreecache-otel-github)]

During network recovery situations, the rules engine component gets reloaded. When several reloads occur, we end up building thousands of weak keys, and the key-cleaning thread becomes busy constantly flushing the keys and values that are eligible. Prior to the 1.21 release, these weak keys would stack up with no thread to come along and clean them up; in one case we ended up with 5.7 million of them, which brought a critical app down. We would prefer to avoid this altogether and never consider these classloaders.

@laurit
Contributor

laurit commented Feb 2, 2023

These keys stay and are never collected.

They should be collected when the class loader is collected; if they are not, then there is a memory leak. We have had bugs where caches weren't cleaned properly, for example #6230, which is probably what caused trouble for you. To our best knowledge we have fixed these leaks. If you wish to verify this, I guess you could make your app reload the rules multiple times and see whether these caches behave as expected. With older versions of the agent you'll probably see that keys for collected class loaders are not removed. Let us know if you believe there is still a leak in the current version.
None of the caches in the screenshot are excessively large, so it seems fine to me.
Excluding class loaders that are known not to contain anything we might need to instrument is fine. Especially when thousands of such class loaders are created, excluding them might improve startup time.
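To illustrate the collection behavior described above, here is a plain-JDK sketch (not the agent's actual cache implementation): a weak key only becomes clearable once nothing else strongly references the class loader, after which the agent's cleaner can purge the entry.

```java
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLClassLoader;

public class WeakKeyCollectionDemo {
  public static void main(String[] args) throws Exception {
    ClassLoader loader = new URLClassLoader(new URL[0]);
    WeakReference<ClassLoader> weakKey = new WeakReference<>(loader);

    // While the loader is strongly reachable (e.g. a live rule still uses it),
    // the weak key keeps its referent.
    System.out.println("before: referent present = " + (weakKey.get() != null));

    loader = null;   // drop the last strong reference, as when rules are discarded
    System.gc();     // a hint only; usually sufficient for a demo
    Thread.sleep(100);

    // After collection the weak key is cleared and can be purged from a cache.
    System.out.println("after:  referent cleared = " + (weakKey.get() == null));
  }
}
```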

@robododge
Contributor Author

Thanks, you are correct: older versions of the javaagent prior to the fix in #6240 will leak. I have run the tests you recommend and, yes, the caches do behave as expected as long as a v1.21 or newer javaagent is used. What I'm asking for here is to exclude these two classloaders, as I see no need to loop them into instrumentation; it just wastes processing time and puts the cache-cleaner thread into overdrive. Maybe this project doesn't want to play whack-a-mole by addressing every obscure classloader that comes up. No worries then; we are excluding them in our custom extensions (see the sketch below).
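For reference, a minimal sketch of the kind of extension meant here, assuming the javaagent's IgnoredTypesConfigurer SPI (the package name, builder methods, and exact configure signature vary between agent versions, and the class also has to be registered as a Java service):

```java
package com.example.otelext; // hypothetical package for this sketch

import io.opentelemetry.javaagent.extension.ignore.IgnoredTypesBuilder;
import io.opentelemetry.javaagent.extension.ignore.IgnoredTypesConfigurer;
import io.opentelemetry.sdk.autoconfigure.spi.ConfigProperties;

// Registered via
// META-INF/services/io.opentelemetry.javaagent.extension.ignore.IgnoredTypesConfigurer
public class CodeGenClassLoaderIgnores implements IgnoredTypesConfigurer {

  @Override
  public void configure(IgnoredTypesBuilder builder, ConfigProperties config) {
    // Classes loaded by these code-generating loaders are never interesting
    // for distributed tracing, so skip matching them entirely; this also
    // avoids creating WeakKey cache entries for every such loader.
    builder.ignoreClassLoader("org.codehaus.janino.ByteArrayClassLoader");
    builder.ignoreClassLoader("org.eclipse.persistence.internal.jaxb.JaxbClassLoader");
  }
}
```

Packaged as an extension jar and passed to the agent (e.g. via the otel.javaagent.extensions property), this keeps the agent from matching, and caching, anything loaded by those two loaders.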

@robododge
Contributor Author

Thanks! I just realized the PR was approved.

mateuszrzeszutek pushed a commit that referenced this issue Feb 2, 2023
Fixes #7670

These obscure classloaders should be ignored completely; they perform in-line class creation and do not branch out to interactions that are useful in distributed tracing. These two were recently responsible for outages in many of our applications due to driving up memory usage from WeakKey caching. In some cases, Janino processing ran wild and created over 5.7 million WeakKey objects, accounting for close to 200 MB of heap.