Metaspace fills with java.lang.invoke.LambdaForms$ over time #4391

Open
hydrogen18 opened this Issue Dec 15, 2016 · 11 comments

Projects

None yet

2 participants

@hydrogen18

Environment

JRuby version: 9.1.60

java -version:
openjdk version "1.8.0_91" OpenJDK Runtime Environment (build 1.8.0_91-b14) OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)

OS: Linux community-app2.onr.spiceworks.com 2.6.32-573.18.1.el6.x86_64 #1 SMP Tue Feb 9 22:46:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

JRUBY_OPTS=-J-XX:+UseCompressedOops -J-XX:-TieredCompilation -J-Xms6144m -J-Xmx6144m -J-XX:MaxMetaspaceSize=768m -J-XX:ReservedCodeCacheSize=512m -J-XX:InitialCodeCacheSize=256m -J-Djava.net.preferIPv4Stack=true -J-Dsun.rmi.dgc.server.gcInterval=3600000 -J-Djava.awt.headless=true -J-XX:+PerfDisableSharedMem -J-XX:+UseConcMarkSweepGC -J-XX:+CMSParallelRemarkEnabled -J-XX:+CMSClassUnloadingEnabled -J-XX:+UseParNewGC -J-XX:+CMSConcurrentMTEnabled -J-XX:ConcGCThreads=4 -J-XX:NewSize=2000m -J-XX:MaxNewSize=2000m -J-XX:-UseAdaptiveSizePolicy -J-XX:SurvivorRatio=5

Running a rails 4.1 application.

The application starts and works correctly. Over time the metaspace in use always goes up at a constant rate. Eventually performance starts to suffer and full GC seems to start taking place, degrading performance. A heap dump at this point reveals many 1000s of classes defined with a name of java.lang.invoke.LambdaForms$ the part after the dollar sign is DMH or BMH. The rest of the heap is typical.

This same application works without issue on 1.7.22.

My understanding is that this class is somehow related to java.lang.invoke.MethodHandle, with the classes actually being an anonymous class. Looking at the code, the prevalent usage of this is under the org.jruby.ir.targets. This has been extensively refactored since 1.7.22 it appears.

Please let me know what additional troubleshooting steps I may perform to try and troubleshoot the cause of this issue. I cannot provide the heap dump but could provide further statistics from it.

@headius
Member
headius commented Dec 19, 2016

First confirm for me whether or not you are enabling invokedynamic. 9k still uses it for a few things even when off, which would narrow things down a bit.

I'm not sure if "1000s" is a lot or not. Every chain of LambdaForm used internally by method handles could be a class eventually, and given that we use it for constants, globals, and a few other things it is easy to see how thousands of such sites might exist in a Rails app.

I would suggest a few things:

  • Disable invokedynamic if you have enabled it.
  • Try most recent Java 8. Yours is pretty recent, though.
  • Take a heap dump so we can see what is holding references to those LambdaForm and figure out if it is correct.
@hydrogen18

Thanks for the feedback.

We have invokedynamic off, it is too problematic.

I will consider updating to a newer OpenJDK version. Is there a difference between OpenJDK & Oracle here?

I have the heap dump, I will do more investigation and try and determine what has references to those classes.

@hydrogen18

I can open the Heap Dump with Eclipse MAT. The issue is from what I can tell, all the heap dump tools are designed to show you the things holding references to the instances of a class. Each of these classes has zero instances. But something can still hold an instance to a class, which I suspect is happening. On JRuby 1.7.22 (same app) I can see the class unloaded count go up when the metaspace size go backs down. This is the expected behavior with CMS class unloading enabled. This never happens on 9k.

I ended up parsing the Hprof output myself and running it against a heap dump from a trivial irb session on JRuby 9.1.6.0. I observe the same thing, no shortage of class definitions starting with the name java.lang.invoke.LambdaForm. The heap dump has a class loader ID of zero for each class, which I guess makes sense if the classes are dynamically defined. There are no instances, but I can find object instances that reference the classes. Each of those instances has a class type of java.lang.invoke.MemberName.

Do I just need to produce some sort of tree structure tracing those references back to an object in the org.jruby namespace? Should I expect to trace it back to a java thread or some global value?

@headius
Member
headius commented Jan 6, 2017

OracleJDK and OpenJDK are 99% the same thing.

LambdaForm classes should still be heap objects of type java.lang.Class, so that's how you'd track their reference chain. If they are not loaded into a normal classloader (I believe this is the case; they use an "AnonymousClassloader") the problem may actually be a OpenJDK bug. This is especially likely if you're seeing large numbers of LambdaForm classes with no instances. JRuby itself never even touches LambdaForm objects directly; we would hold references to MethodHandle objects, which themselves should hold a LambdaForm instance. There's no way to call a method against a Class reference.

Is there sensitive information in your heap dump, or would be you able to share it with us and with OpenJDK engineers? It's starting to sound like it's not our bug.

@hydrogen18

Philosophically, I agree that the class object should be an instance of java.lang.Class. However, I've found no such tool that can show me anything interesting about a java.lang.Class or search for instances of java.lang.Class that define a certain class.

Inspecting any java heap dump with jvisualvm shows me only 9 instances of java.lang.Class. These correspond to the primitive types in the Java language.

So I'm not sure that is helpful.

@headius
Member
headius commented Jan 6, 2017

Ok, I tossed an email to some JVM friends, and they don't know why you'd see LF class counts this high. One of them thought perhaps there's a lot of caching happening and not enough GC pressure to force the soft links to clear. You might try passing -XX:SoftRefLRUPolicyMSPerMB=0 to the JVM to see if that helps clear these out.

You do mention that the metaspace expands enough that full GCs start happening so I wouldn't be surprised if that flag doesn't help. It would be very helpful if I could look at that heap, even if you just host a jhat instance I can poke around in for a while.

@hydrogen18

I will put that on my list of options to try next time we run the application in production.

@headius
Member
headius commented Jan 6, 2017

Thanks for your help with this. Don't worry, if there's a problem we'll find it :-)

@hydrogen18
hydrogen18 commented Jan 10, 2017 edited

We have options configured on the JVM to force hourly full GCs, additionally the JVM is literally out of metaspace so full GCs wind up happening frequently.

I've had a chance to do some more investigation into this. Since I've established there are no instances of the problematic classes, I've done the following

  1. Found all the instances that are the class definitions
  2. Found all the instances that references those class definitions
  3. Repeated step 2 for each set of objects identified
  4. Never search for references to the same instance twice
  5. If JRuby class instances are included in the iteration, only search for those in the next iteration
  6. Stop searching when all you can find are java finalizer instances.

Obviously there is more to it with various optimizations to permit this to run in minutes on a multi-gigabyte heap dump rather than days.

The output of this produces a tree that traces back to JRuby objects that are thread contexts as far as I can tell. The tree is unfortunately too large to simply look at and try and look at without additional processing. So somehow the class definitions are still referenced. Notably, I found in the middle of this tree references to a JRuby class ConstantEntry, lending support to your prior notion that Ruby constants are somehow involved in this.

I did conclude that the only problematic class is java.lang.invoke.LambdaForm$DMH with over 20000 class definitions in my application. This apparently is an anonymous inner class of LambdaForm that is a Direct Method handle. I found a bunch of documentation about what a DMH is and why you would want one here: https://wiki.openjdk.java.net/display/HotSpot/Direct+method+handles . Without spending too much time trying to understand the nature of this thing, my guess is that it permits the virtual lookup performed when calling a method on a java object instance to be performed once and the cost to be reduced or eliminated in future calls. So it seems that usage of this class is a very obvious optimization for JRuby.

I am leaning towards this being an OpenJDK issue, because for the first 3 levels of the tree I produced consists of class instances that are entirely internals of the JVM. In other words it appears that JRuby does not directly interact with this class ever. But I feel like it could still be caused by the breadth of our application. Is there a Java method inside of JRuby that can be called to flush the optimized code paths? This could be called with a long period to possibly prevent accumulation of these class definitions. I think the JVM has a similar concept in relation to its JIT engine that allows the flushing of JIT'd code.

The first JRuby class I find in the tree is org/jruby/internal/runtime/methods/CompiledIRMethod with a field of specific. I can't figure out what IR stands for in this class, do you know?

Are there further options in JRuby I could alter to troubleshoot this? If we think this is an OpenJDK problem, do we suspect the same problem exists in Oracle's JDK?

@headius
Member
headius commented Jan 10, 2017

Thank you for the very thorough investigation!

I am also leaning toward this being a bug in OpenJDK, though I won't rule out something being amiss in JRuby. I'd actually prefer the latter, since I can fix JRuby :-)

The first JRuby class I find in the tree is org/jruby/internal/runtime/methods/CompiledIRMethod with a field of specific. I can't figure out what IR stands for in this class, do you know?

CompiledIRMethod is basically what gets stored in a class's method table when we JIT a method defined in Ruby. You'll probably see a MixedModeIRMethod above that, which triggers that JIT to happen after 50 calls. Most code in a typical application will never JIT because it's never called (or not called enough to matter). Only code that JITs will produce a CompiledIRMethod (unless you're running in a mode that forces JIT to run earlier or immediately) which for a recent Rails app we investigated was only a couple thousand methods.

IR refers to JRuby 9k's compiler, which has its own Intermediate Representation.

DMH handles are basically the last stop before the target method. They're the direct function pointer. So in theory you should only ever see 2 * number of jitted methods in JRuby (2x because many methods will have a variable-arity and a specific-arity entry point; that's the "specific" you saw). So again, unless your system is huge AND all those methods are getting hit, I would not expect tens of thousands of DMH.

Are there further options in JRuby I could alter to troubleshoot this?

Well, you could try disabling JIT altogether. That would reduce our use of invokedynamic to almost nothing. But that's not really what you want, since it will run many times slower without JIT. You're also not running with invokedynamic "on" in JRuby, so we can't do any more there. JRuby will still use method handles for some types of variables, frame construction, and constants, but these should all resolve to a handful of back-end method handles that rarely change.

So I guess I'm not sure.

If we think this is an OpenJDK problem, do we suspect the same problem exists in Oracle's JDK?

My OpenJDK friends would like to know more about this. It has been suggested that this could be "method handle caching gone rogue" (unexplored) or that soft references are keeping too much alive (unlikely since you have tried full GCs). Outside that, we'd need to be able to provide them some way to reproduce, which may simply be providing them a heap dump or may require runnable code.

I suggest you join the MLVM mailing list here: http://openjdk.java.net/projects/mlvm/

Jump on my thread about this issue and share whatever you can. The folks listening are JVM guys at Oracle who work on exactly this stuff.

We'll leave this open for now.

@hydrogen18

We ran with -XX:SoftRefLRUPolicyMSPerMB=0 as you suggested and saw no change. The metaspace still fills up and stays full.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment