
Mutex behavior mismatch with MRI since JRuby 9.2.8.0 #5875

Closed
walro opened this issue Sep 13, 2019 · 8 comments

@walro commented Sep 13, 2019

Environment

$ jruby -v
jruby 9.2.8.0 (2.5.3) 2019-08-12 a1ac7ff Java HotSpot(TM) 64-Bit Server VM 9.0.1+11 on 9.0.1+11 +jit [darwin-x86_64]

Expected Behavior

The following executes "forever" without errors on MRI 2.5.3 and JRuby 9.2.7.0:

mutex = Mutex.new

t1 = Thread.new do
  loop do
    mutex.synchronize do
      mutex.sleep 1 # not needed to provoke the error, but just to motivate the t1.run below
    end
  end
end

loop do
  mutex.synchronize do
  end

  t1.run
end

This is a real-world problem boiled down to its core. The real-world app (a busy web app) uses https://github.com/sharpstone/rack-timeout, where one can find similar mutex usage.
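For anyone who wants to run the reproduction without leaving it spinning, here is a bounded sketch of the same script (the iteration count and sleep interval are arbitrary); it completes cleanly on MRI and on JRuby 9.2.7.0:

```ruby
mutex = Mutex.new

t1 = Thread.new do
  loop do
    mutex.synchronize do
      # mutex.sleep releases the lock and sleeps until woken (or timeout)
      mutex.sleep 0.01
    end
  end
end

wakeups = 0
100.times do
  mutex.synchronize {} # contend for the lock
  t1.run               # wake t1 out of mutex.sleep
  wakeups += 1
end

t1.kill
t1.join
```

On an affected JRuby (9.2.8.0), the child thread dies with the ConcurrencyError below well before the 100 iterations finish.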

Actual Behavior

$ jruby mutex.rb 
warning: thread "Ruby-0-Thread-1: mutex.rb:1" terminated with exception (report_on_exception is true):
ConcurrencyError: interrupted waiting for mutex
         lock at org/jruby/ext/thread/Mutex.java:101
  synchronize at org/jruby/ext/thread/Mutex.java:158
     mutex.rb at mutex.rb:5
         loop at org/jruby/RubyKernel.java:1425
     mutex.rb at mutex.rb:4
ThreadError: killed thread
    wakeup at org/jruby/RubyThread.java:1279
       run at org/jruby/RubyThread.java:1437
  mutex.rb at mutex.rb:15
      loop at org/jruby/RubyKernel.java:1425
    <main> at mutex.rb:11

After some digging around I would assume this is connected to the change in #5683. What I find a bit curious is that the main thread must also try to synchronize to force the error to happen.

@headius (Member) commented Sep 16, 2019

I'm on it! The problem here is that for whatever reason I implemented mutex sleeping to sleep based on some arbitrary lock, rather than on the lock associated with the mutex.

@enebo added this to the JRuby 9.2.9.0 milestone Sep 25, 2019
@mrmargolis commented Oct 4, 2019

We're also seeing this happen in a Rails app that uses the message_bus gem on JRuby 9.2.8.0, with a trace very similar to what @walro posted. This is a full blocker for upgrading to 9.2.8.0, since we blow up inside that gem's thread handling.

@headius (Member) commented Oct 16, 2019

Circled back to this. Will fix for 9.2.9.

@headius (Member) commented Oct 23, 2019

Ok, finally addressing this.

The bug here seems to be that when acquiring a lock held by another thread, we raise this concurrency error if the acquisition is interrupted.

The script you provide needs the main-thread synchronize because that causes the child thread to go into a different piece of logic. With an uncontended lock, that child thread will only ever call lock.tryLock which returns immediately if the lock is not contended. When you cause the lock to be contended, we fall into different logic that attempts to acquire the lock via an interruptible call. If that lock acquisition is interrupted (as it is here by immediately forcing the thread to run) we raise the error.
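The uncontended fast path is observable from Ruby itself: Mutex#try_lock returns immediately in both cases instead of blocking, which is the behavior the tryLock call provides. A minimal sketch (timings are arbitrary and only there to sequence the two threads):

```ruby
m = Mutex.new

# Uncontended: try_lock acquires immediately, no blocking path involved.
uncontended = m.try_lock # => true
m.unlock

# Contended: another thread holds the lock, so try_lock fails fast rather
# than waiting; on JRuby, a blocking lock here would take the
# interruptible path instead.
holder = Thread.new { m.synchronize { sleep 0.2 } }
sleep 0.05 # give the holder time to acquire the lock first
contended = m.try_lock # => false
holder.join
```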

The question is which behavior we should choose. If we are attempting to acquire a lock and some other thread forces us to run, do we:

  • Raise an error (as current behavior).
  • Silently ignore it and leave the lock unlocked (seems bad, since we'd proceed without the lock).
  • Silently ignore it and attempt to acquire the lock again (also seems bad, since you can't force the thread to stop trying to acquire the lock).

What's correct?

@headius (Member) commented Oct 23, 2019

The other issue here is that in CRuby/MRI, there's no real concurrency around the lock. If the main thread acquires the lock, the other thread will go to sleep. Once the lock is released, the thread will be woken up and proceed to acquire it. There's no scenario where the thread would be waiting for the lock once run gets called.
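A small sketch of that MRI behavior: a thread blocked on a Mutex is unaffected by Thread#run and still acquires the lock once it is released (the sleep is only there to let the waiter block first):

```ruby
m = Mutex.new
m.lock

waiter = Thread.new { m.synchronize { :acquired } }
sleep 0.05   # let the waiter block on the lock
waiter.run   # on MRI this neither errors nor aborts the acquisition

m.unlock
result = waiter.value # => :acquired
```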

@headius (Member) commented Oct 23, 2019

I believe this patch might be the correct behavior. In this case, when the lock acquisition is interrupted, we will loop and try to acquire it again, but also check whether any thread events have fired (e.g. kill or raise) that would break the thread out of its normal execution path.

This means we'll try again to get the lock for this run case, but other interrupt cases like kill or raise will trigger.

diff --git a/core/src/main/java/org/jruby/ext/thread/Mutex.java b/core/src/main/java/org/jruby/ext/thread/Mutex.java
index 06278320c3..45788d4d13 100644
--- a/core/src/main/java/org/jruby/ext/thread/Mutex.java
+++ b/core/src/main/java/org/jruby/ext/thread/Mutex.java
@@ -95,10 +95,14 @@ public class Mutex extends RubyObject implements DataType {
 
         // try locking without sleep status to avoid looking like blocking
         if (!thread.tryLock(lock)) {
-            try {
-                context.getThread().lockInterruptibly(lock);
-            } catch (InterruptedException ex) {
-                throw context.runtime.newConcurrencyError("interrupted waiting for mutex");
+            for (;;) {
+                try {
+                    context.getThread().lockInterruptibly(lock);
+                    return this;
+                } catch (InterruptedException ex) {
+                    /// ignore, check thread events and try again!
+                    context.pollThreadEvents();
+                }
             }
         }
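The retry-and-poll shape of the patch can be sketched in plain Ruby (this is an illustration built on Mutex#try_lock and Thread.pass, not JRuby's actual primitives, and it spins where the real patch blocks interruptibly):

```ruby
# Stand-in for the patched lock(): keep retrying the acquisition, and
# between attempts give pending thread events (kill/raise) a chance to
# fire, mirroring context.pollThreadEvents() in the Java patch.
class RetryingLock
  attr_reader :attempts

  def initialize
    @mutex = Mutex.new
    @attempts = 0
  end

  def lock
    loop do
      @attempts += 1
      return self if @mutex.try_lock
      Thread.pass # stand-in for polling thread events between attempts
    end
  end

  def unlock
    @mutex.unlock
  end
end

l = RetryingLock.new
l.lock   # uncontended: succeeds on the first attempt
l.unlock
```

The key property is that a forced wakeup no longer surfaces as an error: the loop simply retries, while a real kill or raise still fires during the event check.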

@jboler commented Oct 25, 2019

I believe we're experiencing this bug in our Rails app. After we upgraded to JRuby 9.2.8.0, our app would stop responding completely when it was processing many concurrent requests with Puma.

Thanks for the fix!

@headius closed this in cc6bf49 Oct 26, 2019
@headius (Member) commented Oct 30, 2019

@jboler Sure! Feel free to give a nightly build a try, if you can, to confirm it's all good now.
