Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JRuby thread stuck on IntHashMap.rehash #3394

Closed
banker opened this issue Oct 13, 2015 · 26 comments · Fixed by #3395
Closed

JRuby thread stuck on IntHashMap.rehash #3394

banker opened this issue Oct 13, 2015 · 26 comments · Fixed by #3395

Comments

@banker
Copy link

banker commented Oct 13, 2015

I have a multi-threaded application running on JRuby 9.0.1.0 (Java 1.8.0_45-b14). The application runs with 8 Ruby threads. One of those threads has been stuck on the IntHashMap.rehash method for the last 13 hours. The remaining threads are all functioning normally. Here is (most of) the stack trace for the thread in question as reported by jstack:

"Ruby-0-Thread-7: /data/apps/my_app/current/vendor/bundle/jruby/2.2.0/gems/my-gem-java/lib/java_executor_simple.rb:22" #21 daemon prio=5 os_prio=0 tid=0x00007f1f585bf800 nid=0x505e runnable [0x00007f1f0db99000]
   java.lang.Thread.State: RUNNABLE
        at org.jruby.util.collections.IntHashMap.rehash(IntHashMap.java:139)
        at org.jruby.util.collections.IntHashMap.put(IntHashMap.java:170)
        at org.jruby.java.dispatch.CallableSelector.matchingCallableArityOne(CallableSelector.java:118)
        at org.jruby.java.invokers.RubyToJavaInvoker.findCallableArityOne(RubyToJavaInvoker.java:282)
        at org.jruby.java.invokers.InstanceMethodInvoker.call(InstanceMethodInvoker.java:98)
        at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:139)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:145)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my_app.application.invokeOther31:time(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/my_app/application.rb)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my_app.application.RUBY$method$md?$0(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/my_app/application.rb:149)
        at java.lang.invoke.LambdaForm$DMH/885284298.invokeStatic_L7_L(LambdaForm$DMH)
        at java.lang.invoke.LambdaForm$MH/2130192211.invokeExact_MT(LambdaForm$MH)
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:131)
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:181)
        at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:197)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:161)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my_app.application.invokeOther3:md?(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/my_app/application.rb)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my_app.application.RUBY$block$process$2(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/my_app/application.rb:182)
        at java.lang.invoke.LambdaForm$DMH/885284298.invokeStatic_L7_L(LambdaForm$DMH)
        at java.lang.invoke.LambdaForm$MH/2130192211.invokeExact_MT(LambdaForm$MH)
        at org.jruby.runtime.CompiledIRBlockBody.commonYieldPath(CompiledIRBlockBody.java:70)
        at org.jruby.runtime.IRBlockBody.yieldSpecific(IRBlockBody.java:67)
        at org.jruby.runtime.Block.yieldSpecific(Block.java:116)
        at org.jruby.ir.runtime.IRRuntimeHelpers.yieldSpecific(IRRuntimeHelpers.java:457)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my_app.my_method.RUBY$method$cache$0(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/my_app/bdcc.rb:19)
        at java.lang.invoke.LambdaForm$DMH/664223387.invokeStatic_L8_L(LambdaForm$DMH)
        at java.lang.invoke.LambdaForm$MH/684822005.invokeExact_MT(LambdaForm$MH)
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:146)
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:216)
        at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:203)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:209)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my_app.application.invokeOther8:cache(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/my_app/application.rb)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my_app.application.RUBY$block$process$1(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/my_app/application.rb:176)
        at java.lang.invoke.LambdaForm$DMH/885284298.invokeStatic_L7_L(LambdaForm$DMH)
        at java.lang.invoke.LambdaForm$MH/2130192211.invokeExact_MT(LambdaForm$MH)
        at org.jruby.runtime.CompiledIRBlockBody.commonYieldPath(CompiledIRBlockBody.java:70)
        at org.jruby.runtime.IRBlockBody.doYield(IRBlockBody.java:140)
        at org.jruby.runtime.BlockBody.yield(BlockBody.java:77)

I cannot yet reliably reproduce this, but it does reliably occur once the app has been running for a few hours.

@headius
Copy link
Member

headius commented Oct 14, 2015

That's very strange. Nothing should block in that map, so it sounds like something has put it into an unstable state that loops forever. I'm not surprised you can't reproduce it.

Can you build JRuby? Perhaps you can stop by IRC tomorrow during the day US time and we'll help you look at it? It might be easiest to attach a debugger or console to see what that thread is actually doing.

@headius
Copy link
Member

headius commented Oct 14, 2015

@kares I was just looking quickly, and I don't see any synchronization around our overload resolution logic in JI. Can you confirm?

@kares
Copy link
Member

kares commented Oct 14, 2015

@headius no sync indeed ... adding the IntHashMap did some improvement - e.g. eliminating some Integer allocations. there was a FIXME around the previous ConcurrentHashMap used as a signature cache which now seems wasn't valid ... some logic to handle concurrent cache updates needs to be done 805ccb8#diff-1284db960bb926e3532f0aa650c24916L39 ... as the map is currently used keys are never replaced in the map, so maybe tweaking or synchronizing the rehash logic will mostly do ?

@kares
Copy link
Member

kares commented Oct 14, 2015

simplest thing would be synchronized (cache) on these puts

@headius
Copy link
Member

headius commented Oct 14, 2015

Without an atomic solution like ConcurrentHashMap, we will need to synchronize both puts and gets, since gets could see an unstable state in the data structure if a put is active.

@headius headius added this to the JRuby 9.0.2.0 milestone Oct 14, 2015
@headius
Copy link
Member

headius commented Oct 14, 2015

This needs to be resolved before 9.0.2.0.

headius added a commit that referenced this issue Oct 14, 2015
The move to IntHashMap in RubyToJavaInvoker improved throughput
for multiple-arity calls, but since IntHashMap is not synchronized
this could cause corrupted data under concurrent load as seen in
CallableSelector to encapsulate the cache and synchronize access
to it. I believe this is a minimum effort to make the cache access
safe and fix #3394.

I have not measured the impact of this synchronization in a high
concurrency setting. It may indeed turn out to be more overhead
and a worse bottleneck than using ConcurrentHashMap with Integer
objects. Unfortunately there's no middle ground here unless we
build or adopt some concurrency-friendly int-based map.

Another heavy option would be to maintain two references to
IntHashMap in RubyToJavaInvoker, copying from a synchronized
writable copy to a read-only copy after every write. This would
double the size of the cache structures in memory but would provide
a fast, unsynchronized cache for read-mostly signature lookups.
This pattern is also easier to support with the refactoring in
this commit.
@headius
Copy link
Member

headius commented Oct 14, 2015

@banker If you have the ability to do so, it would be a great help if you could test the code in PR #3395 and possibly also the additional patch I provided in a comment there.

See BUILDING.md for building your own JRuby or poke us on IRC today.

headius added a commit that referenced this issue Oct 14, 2015
@headius
Copy link
Member

headius commented Oct 14, 2015

@banker The fixed IntHashMap usage has been pushed to master and will be in 9.0.2.

@banker
Copy link
Author

banker commented Oct 14, 2015

Thanks for looking into this, @headius. I'll build from master and re-run in my environment. Since the bug has taken up to 12 hours to appear, it will take some time for me to verify that the issue is resolved. I'll report back by Friday.

@banker
Copy link
Author

banker commented Oct 15, 2015

Bad news, @headius. I ran one instance of my app overnight using the new code, and it's stuck again, but differently.

All 8 of my threads are BLOCKED on a call into a Java metrics library. All of the thread stacks look like this:

"Ruby-0-Thread-4: /data/apps/my_app/current/vendor/bundle/jruby/2.2.0/gems/mygem/lib/mylib/java_executor_simple.rb:22" #1
8 daemon prio=5 os_prio=0 tid=0x00007f3fa5333000 nid=0x3fa9 waiting for monitor entry [0x00007f3f7bdf9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.jruby.java.invokers.RubyToJavaInvoker.putSignature(RubyToJavaInvoker.java:203)
        - waiting to lock <0x00000005c1ae89d0> (a org.jruby.util.collections.IntHashMap)
        at org.jruby.java.dispatch.CallableSelector.matchingCallableArityOne(CallableSelector.java:66)
        at org.jruby.java.invokers.RubyToJavaInvoker.findCallableArityOne(RubyToJavaInvoker.java:348)
        at org.jruby.java.invokers.InstanceMethodInvoker.call(InstanceMethodInvoker.java:98)
        at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:139)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:145)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.my.app.application.invokeOther31:time(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/mylib/application.rb)

Let me know if I can provide any further information, and thanks again for such a quick response for far!

@enebo enebo reopened this Oct 15, 2015
@headius
Copy link
Member

headius commented Oct 15, 2015

@banker I think this is the older code that had a lock involved, because the newest version does not have a lock (object monitor). Can you show me the -v for that JRuby?

@headius
Copy link
Member

headius commented Oct 15, 2015

Ahh actually I'm sorry...the new code does have a lock for writes, as in putSignature. But I'm very confused now why it would be hanging, since it's only a single lock, no other locks are involved, and reads should have no locking of their own.

@banker
Copy link
Author

banker commented Oct 15, 2015

Right. The only thing I can add is that, anecdotally, this new code is much more likely to trigger this deadlock.

@headius
Copy link
Member

headius commented Oct 15, 2015

@banker Harumph. Very confusing. It sounds like there may simply be a bug in our IntHashMap at this point.

@headius
Copy link
Member

headius commented Oct 15, 2015

@banker Can you gist a full stack dump of your VM in this bad state? I would expect to see another thread inside IntHashMap that has acquired the lock. My theory at the moment is that IntHashMap just has a bug that causes it to endlessly loop, and the object monitor trace is just a side effect.

@thedarkone
Copy link
Contributor

Yeah, I also don't see how all threads could be stuck waiting for that lock.

@banker
Copy link
Author

banker commented Oct 15, 2015

@headius Okay, it turns out that one thread was runnable, which I've included below. Can I email you the gist? Not completely comfortable making it public.

"Ruby-0-Thread-9: /data/apps/my_app/current/vendor/bundle/jruby/2.2.0/gems/my-gem/lib/mylibjava_executor_simple.rb:22" #23 daemon prio=5 os_prio=0 tid=0x00007
f3fa533c000 nid=0x3fae runnable [0x00007f3f7b3f4000]
   java.lang.Thread.State: RUNNABLE
        at org.jruby.util.collections.IntHashMap.put(IntHashMap.java:160)
        at org.jruby.util.collections.IntHashMap.clone(IntHashMap.java:406)
        at org.jruby.java.invokers.RubyToJavaInvoker.putSignature(RubyToJavaInvoker.java:205)
        - locked <0x00000005c1ae89d0> (a org.jruby.util.collections.IntHashMap)
        at org.jruby.java.dispatch.CallableSelector.matchingCallableArityOne(CallableSelector.java:66)
        at org.jruby.java.invokers.RubyToJavaInvoker.findCallableArityOne(RubyToJavaInvoker.java:348)
        at org.jruby.java.invokers.InstanceMethodInvoker.call(InstanceMethodInvoker.java:98)
        at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:139)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:145)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.foo.myapp.application.invokeOther31:time(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/mylibmyapp/application.rb)
        at data.apps.my_app.releases.my_app_minus_03751e0c_minus_1caa0727.lib.foo.myapp.application.RUBY$method$duplicate_content?$0(/data/apps/my_app/releases/my_app-03751e0c-1caa0727/lib/mylibmyapp/application.rb:149)

@headius
Copy link
Member

headius commented Oct 16, 2015

@banker Yeah go ahead, headius@headius.com. I'm also pushing a change to master to use a high-throughput, concurrency-friendly map implementation rather than using our own. It should be in nightly builds tomorrow morning.

@headius
Copy link
Member

headius commented Oct 16, 2015

After a brief search (and some lazy Twittering) I was pointed at https://github.com/boundary/high-scale-lib, a maintained, maven-released fork of Cliff Click's high-throughput map implementations. They are fast, memory-efficient, concurrency-friendly, and non-blocking. The best part is that there's an implementation that takes a long key, so we can basically drop it in wherever we use IntHashMap.

Here's perf results.

Before (COW IntHashMap):

                                                    user     system      total        real
Control                                         0.590000   0.000000   0.590000 (  0.341363)
Math.abs with byte-ranged Fixnum                1.630000   0.010000   1.640000 (  1.379069)
Math.abs with short-ranged Fixnum               1.730000   0.080000   1.810000 (  1.533326)
Math.abs with int-ranged Fixnum                 1.700000   0.020000   1.720000 (  1.479799)
Math.abs with long-ranged Fixnum                1.660000   0.000000   1.660000 (  1.507284)
Math.abs with float-ranged Float                1.900000   0.020000   1.920000 (  1.671365)
Math.abs with double-ranged Float               1.650000   0.010000   1.660000 (  1.609894)
Contended Math.abs with double-ranged Float     3.950000   0.030000   3.980000 (  0.539143)
Control                                         0.400000   0.000000   0.400000 (  0.389647)
Math.abs with byte-ranged Fixnum                1.570000   0.000000   1.570000 (  1.560834)
Math.abs with short-ranged Fixnum               1.650000   0.010000   1.660000 (  1.617118)
Math.abs with int-ranged Fixnum                 1.660000   0.000000   1.660000 (  1.644870)
Math.abs with long-ranged Fixnum                1.680000   0.010000   1.690000 (  1.659800)
Math.abs with float-ranged Float                1.700000   0.010000   1.710000 (  1.675863)
Math.abs with double-ranged Float               1.690000   0.000000   1.690000 (  1.662213)
Contended Math.abs with double-ranged Float     3.120000   0.020000   3.140000 (  0.444798)

After (NonBlockingHashMapLong

                                                    user     system      total        real
Control                                         0.930000   0.020000   0.950000 (  0.416645)
Math.abs with byte-ranged Fixnum                1.570000   0.010000   1.580000 (  1.350059)
Math.abs with short-ranged Fixnum               1.710000   0.080000   1.790000 (  1.483133)
Math.abs with int-ranged Fixnum                 1.640000   0.020000   1.660000 (  1.422112)
Math.abs with long-ranged Fixnum                1.620000   0.010000   1.630000 (  1.445246)
Math.abs with float-ranged Float                1.950000   0.020000   1.970000 (  1.646687)
Math.abs with double-ranged Float               1.680000   0.010000   1.690000 (  1.591855)
Contended Math.abs with double-ranged Float     3.730000   0.030000   3.760000 (  0.511322)
Control                                         0.400000   0.000000   0.400000 (  0.402811)
Math.abs with byte-ranged Fixnum                1.530000   0.000000   1.530000 (  1.531421)
Math.abs with short-ranged Fixnum               1.620000   0.010000   1.630000 (  1.608175)
Math.abs with int-ranged Fixnum                 1.590000   0.010000   1.600000 (  1.574305)
Math.abs with long-ranged Fixnum                1.540000   0.000000   1.540000 (  1.518596)
Math.abs with float-ranged Float                1.640000   0.010000   1.650000 (  1.617539)
Math.abs with double-ranged Float               1.630000   0.010000   1.640000 (  1.614268)
Contended Math.abs with double-ranged Float     2.920000   0.010000   2.930000 (  0.416824)

Performance is essentially the same. I'm going to push the change, since the library only adds about 100kB to JRuby.

@headius headius reopened this Oct 16, 2015
@headius
Copy link
Member

headius commented Oct 16, 2015

Reopening one last time. If the new collection works, then it would indicate there's a bug in IntHashMap. In that case, we should also reimpl IntHashMap in terms of NonBlockingHashMapLong in both JRuby 1.7.x and 9k.

@thedarkone
Copy link
Contributor

@headius just a note (in case you are unaware of it and end up using Cliff's hash map elsewhere in JRuby), it leaks deleted keys (only keys not values). Obviously, this might not matter depending on use case (no deletions) or, as is the case with NonBlockingHashMapLong, the keys are primitive longs/ints.

@headius
Copy link
Member

headius commented Oct 16, 2015

@thedarkone That's good information, thank you. I think we'll probably still be using CHM for our Object:Object maps but this will be a good replacement for IntHashMap, which obviously seems to have some issues.

@headius
Copy link
Member

headius commented Oct 20, 2015

@banker Hopefully this has fixed your issues. This change will go out in the release today. I'm going to optimistically close this.

@headius headius closed this as completed Oct 20, 2015
@banker
Copy link
Author

banker commented Oct 20, 2015

@headius I'll build again, run, and get back to you. I hope it's fixed, too! Currently just restarting the app on a cron to preemptively avoid :-\

@kares
Copy link
Member

kares commented Oct 30, 2015

@banker or anyone ... could you confirm its behaving properly - considering back-porting to jruby-1_7 if so

kares pushed a commit to kares/jruby that referenced this issue Oct 30, 2015
This should hopefully fix jruby#3394.

Conflicts:
	core/pom.rb
	core/src/main/java/org/jruby/java/invokers/RubyToJavaInvoker.java
kares added a commit to kares/jruby that referenced this issue Nov 2, 2015
…avaInvoker-s

... introducing an internal CallableCache interface to loose coupling in CallableSelector

aligned with main fix for jruby#3394
kares added a commit that referenced this issue Nov 2, 2015
…avaInvoker-s

... introducing an internal CallableCache interface to loose coupling in CallableSelector

aligned with main fix for #3394
kares pushed a commit to kares/jruby that referenced this issue Nov 11, 2015
This should hopefully fix jruby#3394.

Conflicts:
	core/pom.rb
	core/src/main/java/org/jruby/java/invokers/RubyToJavaInvoker.java
kares added a commit to kares/jruby that referenced this issue Nov 11, 2015
…avaInvoker-s

... introducing an internal CallableCache interface to loose coupling in CallableSelector

aligned with main fix for jruby#3394

this commit also includes pieces of da1b849 and so

Conflicts:
	core/src/main/java/org/jruby/java/dispatch/CallableSelector.java
	core/src/main/java/org/jruby/java/invokers/RubyToJavaInvoker.java
	core/src/main/java/org/jruby/javasupport/Java.java
@headius
Copy link
Member

headius commented Nov 24, 2015

Also released in 1.7.23.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants