Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ActiveRecord ConnectionPool locking issue with 9.2.12.0 #6326

Closed
dgolombek opened this issue Jul 15, 2020 · 11 comments
Closed

ActiveRecord ConnectionPool locking issue with 9.2.12.0 #6326

dgolombek opened this issue Jul 15, 2020 · 11 comments

Comments

@dgolombek
Copy link

Environment Information

Provide at least:

Expected Behavior

ActiveRecord ConnectionPool locks should be acquired and released quickly, as happened with JRuby 9.1.17.0

Actual Behavior

Threads seem to block trying to acquire the ActiveRecord::ConnectionAdapters::ConnectionPool and being very slow to do so, OR something hidden is holding that lock. Our app explodes from ~115 threads to ~350 threads, with most of them either trying to acquire or release the lock on that class. The three locations we see most threads in:

  1. https://github.com/rails/rails/blob/4-2-stable/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb#L262
  2. https://github.com/rails/rails/blob/4-2-stable/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb#L280
  3. https://github.com/rails/rails/blob/4-2-stable/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb#L91

The full 15M Thread dump is at https://gist.githubusercontent.com/dgolombek/8b3ce3c68e7cd53438b758f1825061f5/raw/8aaa8524fd032772b97a5674f3667a1003921a0e/gistfile1.txt

We discussed this on #jruby:matrix.org starting at 2:31 EST on 2020-07-15.

@headius
Copy link
Member

headius commented Jul 15, 2020

From Matrix discussion, it seems like nearly all of these threads are blocked waiting for the connection pool lock in ActiveRecord. The unusual finding is that we could find no thread in the dump that has actually acquired the lock, raising the possibility that due to some logic error in either Rails or JRuby the lock has been locked and never unlocked by one of the other threads or by a thread which has exited.

A cursory inspection of Mutex#lock, Mutex#unlock, Monitor#synchronize, and related Thread code did not lead me anywhere. synchronize unlocks the acquired lock in an ensure block, and thread shutdown appears to have no way to skip unlocking all locks the thread has acquired. There's a possibility that the handle_interrupt logic in synchronize could be related.

@headius
Copy link
Member

headius commented Jul 15, 2020

@dgolombek Is this easy for you to reproduce, or does it require a long time? I am reviewing some related commits that have happened in the 2 years since 9.1.17.0 was released, and I'm hoping it might be possible for you to try reverting some of them to see if they help.

Here's the first one I see that's interesting:

cc6bf49

This change makes Mutex more aggressively try to keep acquiring the lock, even after being interrupted while waiting. The reasoning here is that interrupts for Ruby reasons will usually (always?) be accompanied by a thread event like a raise or kill, and other interrupts are most likely spurious wakeups.

Of course, if the interrupt was done by Java in an attempt to shut down the thread, it would generally be ignored here and the thread would go on trying to acquire the lock.

I will keep reviewing other locking-related changes.

@dgolombek
Copy link
Author

This unfortunately only reproduces in production -- we haven't found the right combination of load tests to run in combination to reproduce in staging environment. It happens about once every 30 minutes in production, resulting in death of the instance running it.

@headius
Copy link
Member

headius commented Jul 15, 2020

Another interesting one, this adds the handle_interrupt calls that are intended to disable thread events while locking or unlocking the mutex: c99e3f9

This came in along with an update to CRuby's standard library.

@dgolombek
Copy link
Author

We downgraded to 9.2.6.0 and thus far have been unable to replicate this problem. It's been running for ~10 hours w/out a lock up, where as previously we'd see an instance lock up and die at least once an hour.

@headius
Copy link
Member

headius commented Jul 18, 2020

@dgolombek Thank you! I understand from your Matrix messages that you are unable to update to 9.2.7.0 but this gives us a lot more information. We know it's not part of the initial 2.5 stdlib update, and it was working as of February last year.

headius added a commit to headius/jruby that referenced this issue Sep 21, 2020
There's a race doing the lock add after the executeTask, since a
thread event could interrupt the flow and bypass that logic. A
finally would not be appropriate, since an interrupted attempt to
acquire the lock would then stil add it. Instead we just move the
lock add immediately after lockInterruptibly. If we reach this
point, the lock is locked by the current thread, and must be
added.

Fixes jruby#6405 and maybe jruby#6326.
@headius
Copy link
Member

headius commented Sep 21, 2020

Good news, I think we've found the problem thanks to @stalbot's reproduction in #6405.

The issue was introduced in bea9ad4 (released in 9.2.8.0) by adding the executeTask wrapper around the lock acquisition, but failing to keep the lock add right next to it. executeTask can be interrupted before returning if a thread kill or raise event arrives, skipping the lock add in RubyThread.lockInterruptibly but leaving the lock locked.

@headius
Copy link
Member

headius commented Sep 22, 2020

I'm hoping we hit the mark in #6407 so marking this one fixed. 👍

@dgolombek
Copy link
Author

Great, thank you! As soon as 9.2.14.0 is out, I'll try upgrading and let you know. The failure mode reported in #6405 certainly matches what we saw here. We don't have any explicit calls to Thread.kill, but it's hard to vet all of our libraries. We use concurrent-ruby extensively, including ThreadPoolExecutors, but I don't see how that would trigger this failure.

@dgolombek
Copy link
Author

Unfortunately this bug (or one that looks very similar) has reproduced for us when running 9.2.14.0. See the attached stack dump. Jstack -l hung for at least 15 minutes -- I have a longer running dump going now to see if it ever completes. This dump shows 132 threads in sun.misc.Unsafe.park and 9 in Object.wait, plus a few in doing polling/reads/socket stuff (JMX and Jetty listeners).

jstack.log

@headius
Copy link
Member

headius commented Jan 12, 2021

Your assessment of the dumps seems accurate.

The 132 threads parked appear to be Jetty worker threads of some kind; no JRuby code appears in those stacks so I will assume this is just part of normal Jetty operation.

Some interesting stacks involving Ruby:

Timeout thread for Rack, seems like this is probably normal.

Thread 1672: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long, java.util.concurrent.TimeUnit) @bci=97, line=2163 (Compiled frame)
 - org.jruby.RubyThread$3.run(org.jruby.runtime.ThreadContext, java.util.concurrent.locks.Condition) @bci=26, line=2241 (Compiled frame)
 - org.jruby.RubyThread$3.run(org.jruby.runtime.ThreadContext, java.lang.Object) @bci=6, line=2235 (Compiled frame)
 - org.jruby.RubyThread.executeTask(org.jruby.runtime.ThreadContext, java.lang.Object, org.jruby.RubyThread$Status, org.jruby.RubyThread$Task) @bci=40, line=1585 (Compiled frame)
 - org.jruby.RubyThread.sleep(java.util.concurrent.locks.Lock, long) @bci=47, line=2235 (Compiled frame)
 - org.jruby.ext.thread.Mutex.sleep(org.jruby.runtime.ThreadContext, org.jruby.runtime.builtin.IRubyObject) @bci=80, line=148 (Compiled frame)

Threads blocked waiting for a mutex lock. Also 8735, 5340, 2109, 25240, 24125, 23351, 21076, 15349, 9608, 9164, 24771, 23815, 23770, and a couple dozen more. These are likely the problem threads.

Thread 15065: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=175 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=836 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(int) @bci=66, line=897 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(int) @bci=24, line=1222 (Compiled frame)
 - java.util.concurrent.locks.ReentrantLock.lockInterruptibly() @bci=5, line=335 (Compiled frame)
 - org.jruby.RubyThread$2.run(org.jruby.runtime.ThreadContext, java.util.concurrent.locks.Lock) @bci=1, line=2168 (Compiled frame)
 - org.jruby.RubyThread$2.run(org.jruby.runtime.ThreadContext, java.lang.Object) @bci=6, line=2165 (Compiled frame)
 - org.jruby.RubyThread.executeTask(org.jruby.runtime.ThreadContext, java.lang.Object, org.jruby.RubyThread$Status, org.jruby.RubyThread$Task) @bci=40, line=1585 (Compiled frame)
 - org.jruby.RubyThread.executeTask(org.jruby.runtime.ThreadContext, java.lang.Object, org.jruby.RubyThread$Task) @bci=7, line=1571 (Compiled frame)
 - org.jruby.RubyThread.lockInterruptibly(java.util.concurrent.locks.Lock) @bci=39, line=2165 (Compiled frame)
 - org.jruby.ext.thread.Mutex.lock(org.jruby.runtime.ThreadContext) @bci=29, line=100 (Compiled frame)

Given the number of threads stuck waiting for locks, it does seem similar to #6405 and may indicate that issue is not fixed or there is some additional issue.

Interestingly, however, these stacks (or at least the ones I have looked at) are now blocked waiting for a redis connection, not an ActiveRecord connection. Thread 15065, for example, is stuck at line 98 backend/services/cache.rb, with backend/rpc/app_inventory_service.rb line 52 in the start_processing method a bit lower in the stack. Those appear to be your app code since they are not inside of a gem. The cache.rb call them proceeds into the connection_pool gem which, calls back to your code, which tries to acquire a redis connection and gets stuck waiting on that lock.

In addition to a JRuby bug, this could also be a bug in either the connection_pool gem or the redis gem's handling of its own connection locking.

If you don't see any ActiveRecord connection pool threads I would say we should open a new issue, since this may be specific to these other gems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants