Deadlock on closing connection from pool #497
I've done some further testing, and it looks like the problem occurs poolTimeout seconds after the connection is first opened, irrespective of activity in between. That is, the problem doesn't only occur when connections are idle for longer than the timeout; it occurs once the lifetime of the connection exceeds it. This means the problem will always start to occur after poolTimeout seconds if you're dealing with parallel queries. To illustrate this, change the original test code and replace the async.waterfall tests with:

async.waterfall([
function first_tests(next) {
async.each(['1', '2', '3'], function (id, cb) { test_query(id, pool, cb) }, next)
},
function pause(next) {
console.log('Pausing for '+timeout/2+' seconds')
setTimeout(next, timeout*500)
},
function second_tests(next) {
async.each(['A', 'B', 'C'], function (id, cb) { test_query(id, pool, cb) }, next)
},
function pause(next) {
console.log('Pausing for '+timeout/2+' seconds')
setTimeout(next, timeout*500)
},
function third_tests(next) {
async.each(['X', 'Y', 'Z'], function (id, cb) { test_query(id, pool, cb) }, next)
}
],
function (err) {
  // completion callback (restored here; the original snippet was cut off)
  if (err) console.error(err)
})

For this set of tests, we do 3 rounds of tests, pausing for half the timeout between rounds, so that the total time exceeds poolTimeout but the time between connection uses is always less than poolTimeout. The result is that the first 2 rounds of tests are quick, but the third hits the deadlock issue:
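The test_query helper is referenced above but not shown; a minimal sketch of what it is assumed to do (check out a pooled connection, run a trivial query, then release the connection, timing the release, since that is where the deadlock appears; node-oracledb v1 callback API):

```javascript
// Hypothetical sketch of the test_query helper used above (it is not
// shown in this copy of the issue): get a connection from the pool,
// run a trivial query, then release the connection, logging how long
// the release takes, since the release is where the deadlock appears.
function test_query(id, pool, cb) {
  var start = Date.now()
  pool.getConnection(function (err, conn) {
    if (err) return cb(err)
    console.log('[' + id + '] got connection after ' + (Date.now() - start) + 'ms')
    conn.execute('SELECT 1 FROM DUAL', [], function (err2) {
      if (err2) return conn.release(function () { cb(err2) })
      var t = Date.now()
      conn.release(function (err3) {   // the deadlock shows up in this call
        console.log('[' + id + '] release took ' + (Date.now() - t) + 'ms')
        cb(err3)
      })
    })
  })
}
```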
Just a quick ack. We will take a proper look at the testcase. Thanks for all the detail.
I'm back from work & vacation travel, so sorry this hasn't been looked at. Now I just have to catch up... Some brief comments:
- Since this is not an issue in 12.1, I can only recommend using 12 as the solution.
- The pool will be locked for some operations - I expect you are seeing a symptom of that. I know there were pool changes in 12.1 that @krismohan may or may not be able to comment on.
- I see you carefully used only 3 connections, but did you experiment with different values of UV_THREADPOOL_SIZE anyway?
- The app doesn't appear to close connections when queries fail, but this probably isn't the issue you are seeing.
- Perhaps you could sleep a bit longer than the pool timeout value, to avoid race conditions on the timeouts?
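For reference, the UV_THREADPOOL_SIZE suggestion above can be tried like this (a sketch; the variable must be set before node starts, since libuv sizes its worker pool at startup):

```shell
# Raise the libuv worker-thread count (default is 4) before starting node.
# node-oracledb executes each blocking database call on one of these
# worker threads, so the pool should have at least as many threads as
# connections used in parallel.
export UV_THREADPOOL_SIZE=8
# Any child process started from this shell now inherits the setting:
sh -c 'echo "threadpool size: $UV_THREADPOOL_SIZE"'
```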
Thanks for the response. No, I hadn't tried experimenting with UV_THREADPOOL_SIZE. The problem with the sleep and timeout is that in a real-world situation you would ideally use a bigger timeout, and will often have requests arrive more frequently than the timeout. Since the result of the "problem" is a lockup equal to the timeout, increasing the timeout makes things worse, not better, while a shorter timeout triggers the problem more often and still introduces delays. We do have a few apps in production now, and we've had to go with a workaround of having connections never time out, with some extra error checking to catch stale connections. It's not ideal, but it is working OK so far. Luckily they've been fairly low-throughput apps so far, but we're working on some others that will get much heavier use. I do understand this is possibly non-trivial to fix for 11, and I am hoping our area will upgrade to 12 sooner rather than later.
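The stale-connection check mentioned above could look roughly like this (a hypothetical sketch, not the actual production code; getLiveConnection is an invented name, using the node-oracledb v1 callback API):

```javascript
// Hypothetical stale-connection guard used together with poolTimeout = 0:
// before handing a pooled connection to application code, run a trivial
// query; if it fails (e.g. the server has silently closed the session),
// discard the connection and retry once with a fresh one.
function getLiveConnection(pool, cb) {
  pool.getConnection(function (err, conn) {
    if (err) return cb(err)
    conn.execute('SELECT 1 FROM DUAL', [], function (err2) {
      if (!err2) return cb(null, conn)   // connection is healthy, hand it out
      conn.release(function () {         // discard the stale connection
        pool.getConnection(cb)           // retry once with a new connection
      })
    })
  })
}
```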
It doesn't reproduce for me with 11.2 or 12.1 client on Linux. I tried with and without decreasing the pool timeout so it is just less than the sleep timeout.
One other comment: idle connections are terminated only when the pool is accessed.
Closing - no feedback.
I encountered this problem, too. My environment:
I have not found a solution yet. Setting
@magiclen please open a new issue and provide runnable code. |
Also it helps to have the pstack output of the hanging process.
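For anyone needing to collect that output, a sketch of capturing native stacks on Linux (pstack usually ships with gdb; a throwaway sleep process stands in for the hung node process here):

```shell
# Dump native stack traces for every thread of the hung process; this
# shows which OCI/libuv worker threads are blocked. A throwaway sleep
# process stands in for the hung node process in this sketch.
sleep 30 & PID=$!
if command -v pstack >/dev/null 2>&1; then
  pstack "$PID" || echo "could not attach (ptrace may be restricted)"
else
  # equivalent gdb invocation if pstack is not installed:
  echo "pstack not installed; try: gdb -p $PID -batch -ex 'thread apply all bt'"
fi
kill "$PID" 2>/dev/null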
I am encountering a problem which I believe is related to #398, #395 and others, where interaction with a database will freeze for a long time before recovering and continuing.
I believe this problem is specifically a deadlock when using several connections in parallel, triggered when a connection from the pool is closed after inactivity greater than poolTimeout. That is, the deadlock occurs when the pool has had several connections in use, there is then inactivity longer than poolTimeout, and then there is a burst of parallel connection use.
This doesn't give any errors, unless it indirectly triggers timeouts in other parts of the app.
The deadlock pauses the connection for a little over poolTimeout, and if more than one extra connection is involved, the delays accumulate: each additional connection adds another poolTimeout to the wait.
So, if your poolTimeout is 10 seconds and you have 3 parallel connections, the first will close quickly, the second will take 10+ seconds to close, and the third will take 20+ seconds. With the default timeout of 60 seconds, this would be 0, 60+ and 120+ seconds!
This only seems to be an issue with Oracle client 11, and not 12.
I've seen this issue in other apps, under heavy saturation load-testing, and also seen it crop up in Production occasionally, but until now haven't managed to narrow it down and come up with a test case. In a new app I'm working on, I specifically run 3 queries in parallel, and the issue has become regular.
From the other similar issues, the workarounds suggested include:
- Increasing poolMin - I suspect this "fixes" the issue by ensuring more connections are kept open to deal with a normal load of concurrent queries.
- Setting poolTimeout to something large, or to 0 - I suspect this works in most cases because it makes the timeout inactivity less likely to occur, leading to fewer situations where the deadlock can happen.
- Changing to Oracle client 12 - this does seem to be a complete fix; the issue doesn't manifest in 12. A good solution if you can update your client.
Currently only the last seems like a complete solution for high-load, high-availability apps.
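As a concrete illustration of the first two workarounds, the relevant createPool() options would look something like this (a sketch with placeholder values and credentials, not a recommended configuration):

```javascript
// Hypothetical pool options illustrating the workarounds above; the
// option names are node-oracledb createPool() options, but the values
// and credentials are placeholders.
var poolConfig = {
  user:          'app_user',
  password:      'app_password',
  connectString: 'dbhost/orcl',
  poolMin:       4,   // keep enough connections open for normal parallel load
  poolMax:       10,
  poolTimeout:   0    // 0 disables idle-connection termination entirely
}
```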
I've tested this with node 0.12.9, 0.12.15 and 4.4.3, along with Oracle Instant Client 11.2 and 12.1. Operating systems tested were Debian 8.4 and Debian 7.11 in a schroot.
All 11.2 tests failed, and all 12.1 tests succeeded.
Below is a simplified test code to demonstrate the problem. I'm using the async module to manage the parallel requests and pause between tests.
Running the script will:
A simple measure of elapsed time shows the deadlock happening when the connection is released. All statements do successfully execute, but the deadlock pauses all db activity.
Test code:
On Oracle client 12, the output from the second set of tests is:
On Oracle client 11, the output from the second set of tests is:
Setting the poolTimeout to 60 seconds, this same test gives:
All 3 connections open quickly. All 3 queries run quickly, and return correct data. When releasing the connections, the first releases quickly, but the second blocks for poolTimeout+ seconds, and the third then blocks for an additional poolTimeout+ seconds.