
Mutex gets 'broken' sporadically (since 9.2.1) #5520

Closed
kares opened this Issue Dec 15, 2018 · 9 comments


kares commented Dec 15, 2018

We've been seeing occasional Mutex failures in CI for a while now.
It turns out these may not be false alarms after all, but the implementation regressing (since 9.2.1).

The issue can be spotted under concurrent threading with some user use-cases,
e.g. a pure-Ruby CountDownLatch https://raw.githubusercontent.com/benlangfeld/countdownlatch/master/lib/countdownlatch.rb ... guessing AR's connection pool might also be affected by this.

Already bisected and found the cause to be: 66d2905

which makes no sense ... going to try to set up a PR with a concurrent test-case and a proposed fix.

@kares kares added this to the JRuby 9.2.6.0 milestone Dec 15, 2018

kares commented Dec 15, 2018

I think this might have been a false alarm; I will need to compare some changes in behavior in 9.2.


kares commented Dec 15, 2018

The CDL implementation plus a script that occasionally fails with Timeout::Error:

require 'thread'
require 'timeout'

# Pure-Ruby CountDownLatch: threads block in #wait until #countdown!
# has been invoked `count` times.
class CountDownLatch
  def initialize(count)
    raise ArgumentError if count < 0
    @count = count
    @mutex = Mutex.new
    @conditional = ConditionVariable.new
  end

  def countdown!
    @mutex.synchronize do
      @count -= 1 if @count > 0
      @conditional.broadcast if @count == 0
    end
  end

  def count
    @mutex.synchronize { @count }
  end

  def wait(timeout = nil)
    # begin
      Timeout::timeout timeout do
        @mutex.synchronize do
          @conditional.wait @mutex if @count > 0
        end
      end
      true
    # rescue Timeout::Error
    #   false
    # end
  end
end

def wait_for_latch(count = 10, timeout = 5)
  latch = CountDownLatch.new count
  count.times do
    Thread.new { latch.countdown! }
  end
  latch.wait(timeout)
end

10_000.times do |i|
  puts i

  result = wait_for_latch
end

kares added a commit that referenced this issue Dec 16, 2018


headius commented Dec 18, 2018

This seems to have manifested in a spec failure.

https://travis-ci.org/jruby/jruby/jobs/469331339


kares commented Dec 18, 2018

Yes, it has been manifesting from time to time, for a while now.


headius commented Dec 18, 2018

I'm on it.

@headius headius self-assigned this Dec 18, 2018

headius added a commit that referenced this issue Dec 18, 2018

Don't use addAll to fill one sync list with another's contents.
This relates to #5520 and #5522 but reduces the changes needed for
a fix to just this collection. The script provided in #5520 runs
to completion with this fix, even bumped up to 100-way concurrency
and 100k iterations. This does not appear to regress the handler
test that regressed due to #5522.
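The commit above avoids filling one synchronized list from another in a single call, since holding both collections' locks at once opens the door to a lock-order deadlock. A minimal Ruby sketch of the idea (the class and method names here are hypothetical, not JRuby's internals): two lists each guarded by their own Mutex, where a nested copy can deadlock against a copy running in the opposite direction, while a snapshot-based copy never holds both locks at the same time.

```ruby
require 'thread'

# Hypothetical illustration of the hazard the fix avoids: each list
# has its own lock, like Java's Collections.synchronizedList.
class SyncList
  def initialize
    @lock  = Mutex.new
    @items = []
  end

  def <<(x)
    @lock.synchronize { @items << x }
    self
  end

  # Deadlock-prone: takes our lock, then the source's lock inside it
  # (analogous to addAll iterating a synchronized source collection).
  # Two threads copying a->b and b->a concurrently can deadlock.
  def add_all_nested(other)
    @lock.synchronize do
      other.each { |x| @items << x } # other.each takes other's lock
    end
  end

  # Safer: snapshot the source first, so the two locks are never
  # held at the same time and no lock-order cycle can form.
  def add_all_snapshot(other)
    snapshot = other.to_a # takes only the source's lock
    @lock.synchronize { @items.concat(snapshot) }
  end

  def each(&blk)
    @lock.synchronize { @items.each(&blk) }
  end

  def to_a
    @lock.synchronize { @items.dup }
  end

  def size
    @lock.synchronize { @items.size }
  end
end
```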

headius commented Dec 18, 2018

@kares Should we mark this as fixed or move it to 9.2.7? My small patch addresses the base issue.

@kares kares closed this Dec 18, 2018

headius added a commit that referenced this issue Dec 20, 2018

Use tryLock first to avoid the brief appearance of blocking.
A recent mspec update changed how the Mutex#lock specs test for
blocking. The new logic loops while checking thread status for
"sleep" to indicate the thread is blocking. With our impl of lock,
a thread could briefly appear to be in "sleep" status even if the
lock acquisition was immediately successful. Using tryLock first
reduces the chance that the thread will look asleep, though the
race could still happen if another thread unlocks the mutex after
tryLock has been called.

See ruby/mspec@4c660ce

See related issue #5520.

headius commented Dec 20, 2018

@kares As I mentioned on IRC, I think the sporadic failure of that spec was due to a spec update. In ruby/mspec@4c660ce, mspec's blocking-call detector was modified to use a thread status of "sleep" to indicate that the thread was blocking, but our mutex always set "sleep" briefly around the call to lock. I modified our impl to tryLock first without setting sleep status, which should make the spec stable (even though there's still a thread-status race if someone else unlocks the lock after we call tryLock).
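The tryLock-first pattern described above can be sketched at the Ruby level (JRuby's actual change is in its Java Mutex implementation; the `acquire` helper below is purely illustrative): an uncontended acquire uses the non-blocking `try_lock`, so the thread never parks and never shows a "sleep" status; only on contention does it fall back to the blocking `lock`.

```ruby
require 'thread'

# Illustrative helper (not JRuby's API): acquire a mutex without
# appearing to block when the lock is uncontended.
def acquire(mutex)
  return true if mutex.try_lock # fast path: no parking, no "sleep" status
  mutex.lock                    # slow path: may park the thread ("sleep")
  true
end

m = Mutex.new
acquire(m)
puts Thread.current.status # the running thread reports "run", not "sleep"
m.unlock
```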


headius commented Dec 20, 2018

FWIW this issue did find the deadlock in the interrupts list, so that's a great thing to have fixed :-)


kares commented Dec 21, 2018

Interesting, an mspec-driven approach :) ... guess mspec is biased these days.

Anyway, the fix is good as is, but I still do not like iterating while the collection might get modified.
I will try to come up with a test case and address my remaining concerns on top of the existing changes at #5522.
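The concern about iterating a concurrently-modified collection is what the later "take a snapshot" commits address. A minimal Ruby illustration of that snapshot pattern (the class and method names are hypothetical, not JRuby's interrupt-handling code): copy the list under the lock, then iterate the copy outside it, so concurrent pushes can never perturb the iteration.

```ruby
require 'thread'

# Hypothetical sketch of the snapshot pattern: pending tasks are
# guarded by a lock, but never iterated while that lock's contents
# can change underneath the iterator.
class InterruptStack
  def initialize
    @lock  = Mutex.new
    @tasks = []
  end

  def push(task)
    @lock.synchronize { @tasks << task }
  end

  # Take a stable copy under the lock, then run the tasks outside it;
  # returns the number of tasks that were drained.
  def drain
    snapshot = @lock.synchronize do
      taken = @tasks.dup
      @tasks.clear
      taken
    end
    snapshot.each(&:call) # safe: no concurrent push can touch the copy
    snapshot.size
  end
end
```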

kares added a commit to kares/jruby that referenced this issue Dec 21, 2018

kares added a commit that referenced this issue Dec 21, 2018

[fix] take a snapshot for the interrupt stack
... as a way of 'double' resolving GH-5520

kares added a commit to kares/jruby that referenced this issue Dec 27, 2018

kares added a commit to kares/jruby that referenced this issue Dec 27, 2018

kares added a commit to kares/jruby that referenced this issue Dec 28, 2018

[fix] take a snapshot for the interrupt stack
... as a way of 'double' resolving jrubyGH-5520

kares added a commit that referenced this issue Dec 29, 2018

[fix] take a snapshot for the interrupt stack
... as a way of 'double' resolving GH-5520