Deadlock on closing connection from pool #497

Closed
dortamur opened this issue Aug 25, 2016 · 9 comments
dortamur commented Aug 25, 2016

I am encountering a problem which I believe is related to #398, #395 and others, where interaction with a database will freeze for a long time before recovering and continuing.

I believe this problem is specifically a deadlock when using several connections in parallel, triggered when a connection from the pool is closed after it has been inactive for longer than poolTimeout. That is, the pool has several connections in use, there is a period of inactivity longer than poolTimeout, and then a burst of parallel connection use occurs; at that point the deadlock appears.

This doesn't produce any errors, unless it indirectly triggers timeouts in other parts of the app.
The deadlock pauses the connection for a little over the poolTimeout, and if more than one extra connection is involved, the delays stack.

So, if your poolTimeout is 10 seconds, and you have 3 parallel connections, then the first will close quickly, the second will take 10+ seconds to close, and the third will take 20+ seconds. With the default timeout of 60 seconds, this would be 0, 60+ and 120+ seconds!

This only seems to be an issue with Oracle client 11, and not 12.

I've seen this issue in other apps under heavy saturation load-testing, and have also seen it crop up in Production occasionally, but until now I hadn't managed to narrow it down and come up with a test case. In a new app I'm working on, I specifically run 3 queries in parallel, and the issue now occurs regularly.

From the other similar issues, the workarounds suggested include:

Increasing poolMin - I suspect this "fixes" the issue by ensuring there are more connections kept open to deal with a normal load of concurrent queries.

Setting poolTimeout to something large or 0 - I suspect this works in most cases because it makes the inactivity timeout less likely to be reached, so there are fewer situations where the deadlock can occur (a configuration sketch of this and the poolMin workaround follows below).

Changing to Oracle Client 12 - This does seem to be a complete fix - the issue isn't manifesting in 12. A good solution if you can update your client.

For high-load, high-availability apps, currently only the last seems like a complete solution.
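For reference, a minimal sketch of the first two workarounds as pool settings (the values shown are illustrative only, not recommendations):

// Illustrative pool settings for the first two workarounds.
var workaroundConfig = {
      user: "user",
      password: "password",
      connectString: "dbid",
      poolMin: 4,          // keep more connections open than the expected concurrent load
      poolTimeout: 0       // 0 (or a large value) so idle connections are not closed
   }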

I've tested this with node 0.12.9, 0.12.15 and 4.4.3, along with Oracle Instant Client 11.2 and 12.1. Operating systems tested were Debian 8.4 and Debian 7.11 in a schroot.
All 11.2 tests failed, and all 12.1 tests succeeded.

Below is simplified test code demonstrating the problem. I'm using the async module to manage the parallel requests and the pause between tests.
Running the script will:

  • In parallel, open 3 connections, run a select query, and release/close the connections
  • Pause for time equal to the poolTimeout (5 seconds)
  • In parallel, open 3 connections, run a select query, and release/close the connections

A simple measure of elapsed time shows the deadlock happening when the connection is released. All statements do successfully execute, but the deadlock pauses all db activity.

Test code:

var OracleDB = require('oracledb')
var async = require('async')

var timeout = 5

var config = {
      user: "user",
      password: "password",
      connectString: "dbid",
      poolAlias: 'test',
      poolTimeout: timeout
   }

OracleDB.createPool(config, function(err, pool) {
   if (err) throw err

   async.waterfall([
      function first_tests(next) {
         async.each(['1', '2', '3'], function (id, cb) { test_query(id, pool, cb) }, next)
      },
      function pause(next) {
         console.log('Pausing for '+timeout+' seconds')
         setTimeout(next, timeout*1000)
      },
      function second_tests(next) {
         async.each(['A', 'B', 'C'], function (id, cb) { test_query(id, pool, cb) }, next)
      }
   ],
      function done(err) {
         console.log('Tests complete')
      })
})

// Get connection & do a query
function test_query(id, pool, callback) {
   // Measure elapsed time since last call
   var ets = new Date()
   function elapsed() {
      var le = ets
      ets = new Date()
      return ets - le
   }

   pool.getConnection(function(err, connection) {
      console.log(id, 'getConnection', elapsed(), err || '')
      if (err) throw err

      connection.execute("select :id from dual", [id], {outFormat: OracleDB.OBJECT}, function (err, result) {
         console.log(id, 'doQuery', elapsed(), err || result && result.rows[0])
         if (err) throw err

         connection.close(function(err) {
            console.log(id, 'dbClose', elapsed(), err || '')
            if (err) throw err
            callback()
         })
      })
   })
}

On Oracle client 12, the output from the second set of tests is:

A getConnection 0
B getConnection 0
C getConnection 1
A doQuery 3 { ':ID': 'A' }
A dbClose 0
C doQuery 4 { ':ID': 'C' }
B doQuery 5 { ':ID': 'B' }
C dbClose 1
B dbClose 1

On Oracle client 11, the output from the second set of tests is:

A getConnection 1
B getConnection 1
C getConnection 1
A doQuery 2 { ':ID': 'A' }
A dbClose 0
C doQuery 3 { ':ID': 'C' }
B doQuery 10 { ':ID': 'B' }
C dbClose 5184
B dbClose 11176

Setting the poolTimeout to 60 seconds, this same test gives:

A getConnection 0
B getConnection 0
C getConnection 0
A doQuery 3 { ':ID': 'A' }
A dbClose 0
B doQuery 6 { ':ID': 'B' }
C doQuery 11 { ':ID': 'C' }
B dbClose 60811
C dbClose 121816

All 3 connections open quickly. All 3 queries run quickly, and return correct data. When releasing the connections, the first releases quickly, but the second blocks for poolTimeout+ seconds, and the third then blocks for an additional poolTimeout+ seconds.

dortamur (Author) commented

I've done some further testing, and it looks like the problem occurs poolTimeout seconds after the connection is first opened, irrespective of activity in between. That is, the problem doesn't occur only when connections are idle for longer than the timeout, but simply once a connection has been open for longer than the timeout.

This means the problem will always start to occur after poolTimeout seconds, if you're dealing with parallel queries.

To illustrate this, change the original test code, and replace the async.waterfall tests with:

   async.waterfall([
      function first_tests(next) {
         async.each(['1', '2', '3'], function (id, cb) { test_query(id, pool, cb) }, next)
      },
      function pause(next) {
         console.log('Pausing for '+timeout/2+' seconds')
         setTimeout(next, timeout*500)
      },
      function second_tests(next) {
         async.each(['A', 'B', 'C'], function (id, cb) { test_query(id, pool, cb) }, next)
      },
      function pause(next) {
         console.log('Pausing for '+timeout/2+' seconds')
         setTimeout(next, timeout*500)
      },
      function third_tests(next) {
         async.each(['X', 'Y', 'Z'], function (id, cb) { test_query(id, pool, cb) }, next)
      }
   ],
   :

For this set of tests, we run 3 rounds of queries, pausing for half the timeout between rounds, so that the total elapsed time exceeds poolTimeout but the time between connection uses is always less than poolTimeout.

The result is the first 2 rounds of tests are quick, but the third test hits the deadlock issue:

X getConnection 0
Y getConnection 0
Z getConnection 0
X doQuery 3 { ':ID': 'X' }
X dbClose 1
Z doQuery 6 { ':ID': 'Z' }
Y doQuery 6 { ':ID': 'Y' }
Z dbClose 5786
Y dbClose 11783


cjbj commented Aug 25, 2016

Just a quick ack. We will take a proper look at the testcase. Thanks for all the detail.


cjbj commented Oct 19, 2016

I'm back from work & vacation travel so sorry this hasn't been looked at. Now I just have to catch up...

Some brief comments:

Since this is not an issue in 12.1, I could only recommend using 12 as the solution. The pool will be locked for some operations - I expect you are seeing a symptom of that. I know there were pool changes in 12.1 that @krismohan may or may not be able to comment on.

I see you carefully used only 3 connections but did you experiment with different values of UV_THREADPOOL_SIZE anyway? The app doesn't appear to close connections when queries fail, but this probably isn't an issue you are seeing. Perhaps you could sleep a bit longer than the pool timeout value to avoid race conditions on the timeouts?
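For example, to try a larger thread pool (8 below is only an illustration; the libuv default is 4 worker threads), the value has to be set before the libuv thread pool is first used - either exported in the shell before starting node, or (on Linux) at the very top of the script:

// Must run before anything uses the libuv thread pool (Linux only;
// otherwise export UV_THREADPOOL_SIZE in the shell before starting node).
process.env.UV_THREADPOOL_SIZE = 8

var OracleDB = require('oracledb')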

cjbj added the question label Oct 19, 2016
dortamur (Author) commented

Thanks for the response.

No, I hadn't tried experimenting with UV_THREADPOOL_SIZE.

The problem with sleeping longer than the timeout is that, in a real-world situation, you would ideally use a larger timeout and would often have requests arriving more frequently than the timeout. Since the result of the "problem" is a lockup equal to the timeout, increasing the timeout makes things worse, not better, while a shorter timeout triggers the problem more often and still introduces delays.

We do have a few apps in Production now, and we've had to go with a workaround of having connections never time out, with some extra error checking to catch stale connections (a simplified sketch is below). It's not ideal, but it is working OK so far. Luckily they've been fairly low-throughput apps, but we're working on some others that will get much heavier use.
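Roughly, the workaround looks like this (a simplified sketch only; the real stale-connection handling is app-specific and would match particular connection-loss errors, e.g. ORA-03113/ORA-03114, and retry on a fresh connection):

// Simplified sketch: the pool is created with poolTimeout: 0 so connections
// are never idle-closed, and every query path releases its connection even
// on error so the pool can discard a stale one.
function run_query(pool, sql, binds, callback) {
   pool.getConnection(function(err, connection) {
      if (err) return callback(err)
      connection.execute(sql, binds, function(err, result) {
         // Always close/release, even when the query failed.
         connection.close(function(closeErr) {
            callback(err || closeErr, result)
         })
      })
   })
}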

I do understand this is possibly non-trivial to fix for 11, and I am hoping our area will upgrade and shift to 12 sooner rather than later.


cjbj commented Oct 19, 2016

It doesn't reproduce for me with 11.2 or 12.1 client on Linux. I tried with and without decreasing the pool timeout so it is just less than the sleep timeout.

cjones@localhost:~/n$ node issue497.js
2 getConnection 66 
2 doQuery 11 { ':ID': '2' }
3 getConnection 111 
3 doQuery 6 { ':ID': '3' }
1 getConnection 171 
2 dbClose 96 
3 dbClose 55 
1 doQuery 4 { ':ID': '1' }
1 dbClose 0 
Pausing for 50 seconds
C getConnection 2 
A getConnection 2 
B getConnection 2 
A doQuery 3 { ':ID': 'A' }
B doQuery 3 { ':ID': 'B' }
A dbClose 0 
B dbClose 0 
C doQuery 5 { ':ID': 'C' }
C dbClose 1 
Tests complete
cjones@localhost:~/n$ ldd node_modules/oracledb/build/Release/oracledb.node 
    linux-vdso.so.1 =>  (0x00007ffc9ebde000)
    libclntsh.so.11.1 => /home/cjones/instantclient/libclntsh.so.11.1 (0x00007f2b4a55b000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f2b4a252000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f2b49f50000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f2b49d3a000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f2b49b1e000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f2b4975f000)
    libnnz11.so => /home/cjones/instantclient/libnnz11.so (0x00007f2b49392000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f2b4918e000)
    libnsl.so.1 => /lib64/libnsl.so.1 (0x00007f2b48f75000)
    libaio.so.1 => /lib64/libaio.so.1 (0x00007f2b48d73000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f2b4d117000)
cjones@localhost:~/n$ node -v
v6.7.0
cjones@localhost:~/n$ 

One other comment: Idle connections are terminated only when the pool is accessed.


cjbj commented Dec 3, 2016

Closing - no feedback

cjbj closed this as completed Dec 3, 2016

magiclen commented Jan 9, 2023

I encountered this problem, too.

My environment:

  • OS: Windows Server 2019
  • Oracle Instant Client: 19.17 or 20.8
  • OracleDB: 11.2.0.4.0
  • Node: 18.13.0
  • node-oracledb: 5.5.0

I have not found a solution yet. Setting poolTimeout to 0 seems to work, but I'm worried that broken connections caused by a bad network cannot then be destroyed.
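If poolTimeout is kept at 0, one possible mitigation sketch for the broken-connection worry (assuming node-oracledb 5.x, where poolPingInterval makes getConnection() ping and discard connections that have sat idle in the pool for longer than the interval; the 60-second value below is only an illustration) would be:

const oracledb = require('oracledb')

// Sketch only: never idle-close connections, but let the pool ping
// connections that have been idle longer than poolPingInterval when they
// are handed out, so dead ones are discarded and replaced.
oracledb.createPool({
   user: 'user',
   password: 'password',
   connectString: 'dbid',
   poolTimeout: 0,
   poolPingInterval: 60
}, function (err, pool) {
   if (err) throw err
   // use pool.getConnection() as usual
})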


cjbj commented Jan 10, 2023

@magiclen please open a new issue and provide runnable code.


krismohan commented Jan 10, 2023 via email
