Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutex gets 'broken' sporadically (since 9.2.1) #5520

Closed
kares opened this issue Dec 15, 2018 · 9 comments
Closed

Mutex gets 'broken' sporadically (since 9.2.1) #5520

kares opened this issue Dec 15, 2018 · 9 comments

Comments

@kares
Copy link
Member

@kares kares commented Dec 15, 2018

we're seen some mutex occasional failures in CI, for a while now.
turns out maybe not false alarms, but the implementation regressing (since 9.2.1)

the issue can be spotted under concurrent threading with some user use-cases.
e.g. a .rb CountDownLatch https://raw.githubusercontent.com/benlangfeld/countdownlatch/master/lib/countdownlatch.rb ... guessing AR's connection pool might be also affected by this.

already bisected and find the cause to be: 66d2905

which makes no sense ... going to try setup a PR with a concurrent test-case and a proposed fix.

@kares
Copy link
Member Author

@kares kares commented Dec 15, 2018

think this might have been a false alarm, I will need to compare some changes in behavior in 9.2

@kares
Copy link
Member Author

@kares kares commented Dec 15, 2018

CDL impl + occasional Timeout::Error failing script :

require 'thread'
require 'timeout'

class CountDownLatch
  def initialize(count)
    raise ArgumentError if count < 0
    @count = count
    @mutex = Mutex.new
    @conditional = ConditionVariable.new
  end

  def countdown!
    @mutex.synchronize do
      @count -= 1 if @count > 0
      @conditional.broadcast if @count == 0
    end
  end

  def count
    @mutex.synchronize { @count }
  end

  def wait(timeout = nil)
    # begin
      Timeout::timeout timeout do
        @mutex.synchronize do
          @conditional.wait @mutex if @count > 0
        end
      end
      true
    # rescue Timeout::Error
    #   false
    # end
  end
end

def wait_for_latch(count = 10, timeout = 5)
  latch = CountDownLatch.new count
  count.times do
    Thread.new { latch.countdown! }
  end
  latch.wait(timeout)
end

10_000.times do |i|
  puts i

  result = wait_for_latch
end
kares added a commit that referenced this issue Dec 16, 2018
seems to resolve GH-5520
@headius
Copy link
Member

@headius headius commented Dec 18, 2018

This seems to have manifested in a spec failure.

https://travis-ci.org/jruby/jruby/jobs/469331339

@kares
Copy link
Member Author

@kares kares commented Dec 18, 2018

yy, it has been manifesting from time-to-time, for a while now.

@headius
Copy link
Member

@headius headius commented Dec 18, 2018

I'm on it.

@headius headius self-assigned this Dec 18, 2018
headius added a commit that referenced this issue Dec 18, 2018
This relates to #5520 and #5522 but reduces the changes needed for
a fix to just this collection. The script provided in #5520 runs
to completion with this fix, even bumped up to 100-way concurrency
and 100k iterations. This does not appear to regress the handler
test that regressed due to #5522.
@headius
Copy link
Member

@headius headius commented Dec 18, 2018

@kares Should we marked this fixed or move it to 9.2.7? My small patch addresses the base issue.

@kares kares closed this Dec 18, 2018
headius added a commit that referenced this issue Dec 20, 2018
A recent mspec update changed how the Mutex#lock specs test for
blocking. The new logic loops while checking thread status for
"sleep" to indicate the thread is blocking. With our impl of lock,
a thread could briefly appear to be in "sleep" status even if the
lock acquisition was immediately succesful. Using tryLock first
reduces the chance that the thread will look asleep, though the
race could still happen if another thread unlocks the mutex after
tryLock has been called.

See ruby/mspec@4c660ce

See related issue #5520.
@headius
Copy link
Member

@headius headius commented Dec 20, 2018

@kares As I mentioned on IRC I think the sporadic failure of that spec was due to a spec update. In ruby/mspec@4c660ce mspec's blocking-call detector was modified to use a thread status of "sleep" to indicate that the thread was blocking, but our mutex always set "sleep" briefly around the call to lock. I modified our impl to tryLock first without setting sleep status, which should make the spec stable (even though there's still a thread status race if someone else unlocks the lock after we call tryLock.)

@headius
Copy link
Member

@headius headius commented Dec 20, 2018

FWIW this issue did find the deadlock in the interrupts list, so that's a great thing to have fixed :-)

@kares
Copy link
Member Author

@kares kares commented Dec 21, 2018

interesting, mspec driven approach :) ... guess mspec is biased these days.

anyway, the fix is good as is but I still do not like iterating while the collection might get modified ...
will try to come up with a case and address my remaining concerns on top of existing changes at #5522

kares added a commit to kares/jruby that referenced this issue Dec 21, 2018
kares added a commit that referenced this issue Dec 21, 2018
... as a way of 'double' resolving GH-5520
kares added a commit to kares/jruby that referenced this issue Dec 27, 2018
kares added a commit to kares/jruby that referenced this issue Dec 27, 2018
kares added a commit to kares/jruby that referenced this issue Dec 28, 2018
... as a way of 'double' resolving jrubyGH-5520
kares added a commit that referenced this issue Dec 29, 2018
... as a way of 'double' resolving GH-5520
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants