ActiveRecord ConnectionPool locking issue with 9.2.12.0 #6326
From Matrix discussion, it seems like nearly all of these threads are blocked waiting for the connection pool lock in ActiveRecord. The unusual finding is that we could find no thread in the dump that has actually acquired the lock, raising the possibility that, due to some logic error in either Rails or JRuby, the lock has been locked and never unlocked by one of the other threads or by a thread which has exited. A cursory inspection of …
@dgolombek Is this easy for you to reproduce, or does it require a long time? I am reviewing some related commits that have happened in the 2 years since 9.1.17.0 was released, and I'm hoping it might be possible for you to try reverting some of them to see if they help. Here's the first one I see that's interesting:

This change makes Mutex more aggressively try to keep acquiring the lock, even after being interrupted while waiting. The reasoning here is that interrupts for Ruby reasons will usually (always?) be accompanied by a thread event like a …

Of course, if the interrupt was done by Java in an attempt to shut down the thread, it would generally be ignored here and the thread would go on trying to acquire the lock. I will keep reviewing other locking-related changes.
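For reference, here is a minimal Ruby-level sketch (not the JRuby internals being discussed) of the case where an interrupt does arrive with a thread event: a Thread#raise delivered to a thread blocked in Mutex#lock aborts the wait and surfaces as an exception in that thread, so it is only a bare Java-level interrupt without such an event that would be retried.

```ruby
mutex = Mutex.new
mutex.lock # hold the lock so the second thread has to wait

waiter = Thread.new do
  begin
    mutex.lock # blocks here waiting for the lock
  rescue RuntimeError => e
    puts "interrupted while waiting: #{e.message}"
  end
end

sleep 0.1                              # give the waiter time to block
waiter.raise(RuntimeError, "wake up")  # Ruby-level interrupt with a thread event
waiter.join
mutex.unlock
```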
This unfortunately only reproduces in production -- we haven't found the right combination of load tests to reproduce it in our staging environment. It happens about once every 30 minutes in production, resulting in death of the instance running it.
Another interesting one: this adds the … This came in along with an update to CRuby's standard library.
We downgraded to 9.2.6.0 and thus far have been unable to replicate this problem. It's been running for ~10 hours without a lock-up, whereas previously we'd see an instance lock up and die at least once an hour.
@dgolombek Thank you! I understand from your Matrix messages that you are unable to update to 9.2.7.0, but this gives us a lot more information. We know it's not part of the initial 2.5 stdlib update, and it was working as of February last year.
There's a race doing the lock add after the executeTask, since a thread event could interrupt the flow and bypass that logic. A finally would not be appropriate, since an interrupted attempt to acquire the lock would then still add it. Instead we just move the lock add immediately after lockInterruptibly. If we reach this point, the lock is locked by the current thread, and must be added. Fixes jruby#6405 and maybe jruby#6326.
Good news, I think we've found the problem thanks to @stalbot's reproduction in #6405. The issue was introduced in bea9ad4 (released in 9.2.8.0) by adding the executeTask wrapper around the lock acquisition, but failing to keep the lock add right next to it. executeTask can be interrupted before returning if a thread kill or raise event arrives, skipping the lock add in RubyThread.lockInterruptibly but leaving the lock locked.
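To make the shape of the race concrete, here is a rough Ruby-flavoured sketch of the pattern described above. The real code is Java in RubyThread; `execute_task`, `lock_buggy`, `lock_fixed`, and `held_locks` are made-up names for illustration only.

```ruby
# Stand-in for JRuby's event-aware task wrapper: the real one can raise after
# the task body returns if a pending kill/raise event is delivered.
def execute_task
  yield
end

# Buggy shape (introduced in 9.2.8.0): the lock is recorded only after the
# wrapper returns, so an event delivered inside the wrapper after the lock was
# acquired skips the bookkeeping but leaves the lock locked.
def lock_buggy(lock, held_locks)
  execute_task { lock.lock }
  held_locks << lock
end

# Fixed shape: record the lock immediately after acquiring it, inside the same
# wrapped step, so the acquisition and the bookkeeping cannot be separated by
# a thread event.
def lock_fixed(lock, held_locks)
  execute_task do
    lock.lock
    held_locks << lock
  end
end

# Tiny usage example of the sketch above.
held = []
m = Mutex.new
lock_fixed(m, held)
m.unlock
```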
I'm hoping we hit the mark in #6407 so marking this one fixed. |
Great, thank you! As soon as 9.2.14.0 is out, I'll try upgrading and let you know. The failure mode reported in #6405 certainly matches what we saw here. We don't have any explicit calls to Thread.kill, but it's hard to vet all of our libraries. We use concurrent-ruby extensively, including ThreadPoolExecutors, but I don't see how that would trigger this failure.
Unfortunately this bug (or one that looks very similar) has reproduced for us when running 9.2.14.0. See the attached stack dump. jstack -l hung for at least 15 minutes -- I have a longer-running dump going now to see if it ever completes. This dump shows 132 threads in sun.misc.Unsafe.park and 9 in Object.wait, plus a few doing polling/read/socket work (JMX and Jetty listeners).
Your assessment of the dumps seems accurate. The 132 parked threads appear to be Jetty worker threads of some kind; no JRuby code appears in those stacks, so I will assume this is just part of normal Jetty operation. Some interesting stacks involving Ruby:

- Timeout thread for Rack; this seems like it is probably normal.
- Threads blocked waiting for a mutex lock: also 8735, 5340, 2109, 25240, 24125, 23351, 21076, 15349, 9608, 9164, 24771, 23815, 23770, and a couple dozen more. These are likely the problem threads.
Given the number of threads stuck waiting for locks, it does seem similar to #6405, which may indicate that issue is not fixed or that there is some additional issue. Interestingly, however, these stacks (or at least the ones I have looked at) are now blocked waiting for a Redis connection, not an ActiveRecord connection. Thread 15065, for example, is stuck at line 98 of backend/services/cache.rb, with backend/rpc/app_inventory_service.rb line 52 in the …

In addition to a JRuby bug, this could also be a bug in either the connection_pool gem or the redis gem's handling of its own connection locking. If you don't see any ActiveRecord connection pool threads, I would say we should open a new issue, since this may be specific to these other gems.
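For context on where those threads are parked, pooled Redis access with the connection_pool gem typically looks something like the sketch below. The actual code in backend/services/cache.rb is not shown in this issue; `REDIS_POOL`, `cached_fetch`, and the pool settings are hypothetical. A thread calling #with while the pool is exhausted blocks inside connection_pool waiting for a free connection, which is consistent with these stacks.

```ruby
require "connection_pool"
require "redis"

# A shared pool of Redis connections; size and timeout values are illustrative.
REDIS_POOL = ConnectionPool.new(size: 10, timeout: 5) { Redis.new }

def cached_fetch(key)
  # Blocks here, inside connection_pool, if all 10 connections are checked out;
  # raises ConnectionPool::TimeoutError if none is released within 5 seconds.
  REDIS_POOL.with do |redis|
    redis.get(key)
  end
end
```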
Environment Information
Expected Behavior
ActiveRecord ConnectionPool locks should be acquired and released quickly, as they were with JRuby 9.1.17.0.
Actual Behavior
Threads seem to block trying to acquire the ActiveRecord::ConnectionAdapters::ConnectionPool lock and are very slow to do so, OR something hidden is holding that lock. Our app explodes from ~115 threads to ~350 threads, with most of them either trying to acquire or release the lock on that class. The three locations where we see the most threads:
The full 15M Thread dump is at https://gist.githubusercontent.com/dgolombek/8b3ce3c68e7cd53438b758f1825061f5/raw/8aaa8524fd032772b97a5674f3667a1003921a0e/gistfile1.txt
We discussed this on #jruby:matrix.org starting at 2:31 EST on 2020-07-15.