
Fix cluster/pool release of connection #6361

Merged
merged 1 commit into influxdata:master
Apr 25, 2016

Conversation

PierreF
Contributor

@PierreF PierreF commented Apr 13, 2016

  • Rebased/mergeable
  • Tests pass
  • CHANGELOG.md updated
  • Sign CLA (if not already signed)

This PR fixes a possible resource exhaustion on connections to other nodes.
I can reproduce this issue on v0.11.1, and this PR fixes it. On 0.12 / master the issue should no longer exist, at least in the open source version. That's why I didn't update CHANGELOG.md.

The issue happens when a load spike causes multiple threads to create new connections in the pool. This can cause total to exceed capacity: code of boundedPool.Get
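For reference, here is a minimal sketch of that pattern. The names mirror boundedPool, but the struct, the factory field, and the method bodies are simplifications I'm using for illustration, not the actual cluster/pool.go source:

```go
package pool

import (
	"net"
	"sync/atomic"
)

// Simplified, illustrative pool: a channel of idle connections plus a
// counter of connections created and not yet closed.
type boundedPool struct {
	conns    chan net.Conn            // idle connections
	capacity int                      // maximum number of connections
	total    int32                    // connections created and not yet closed
	factory  func() (net.Conn, error) // dials a new connection
}

func (p *boundedPool) Get() (net.Conn, error) {
	select {
	case conn := <-p.conns:
		// Reuse an idle connection when one is available.
		return conn, nil
	default:
		// Check-then-act race: under a load spike several goroutines can all
		// observe total < capacity before any of them increments it, so total
		// can end up greater than capacity.
		if atomic.LoadInt32(&p.total) < int32(p.capacity) {
			conn, err := p.factory()
			if err != nil {
				return nil, err
			}
			atomic.AddInt32(&p.total, 1)
			return conn, nil
		}
		// Pool exhausted: wait for a connection to be returned.
		return <-p.conns, nil
	}
}
```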

Later, when the excess connections are returned to the pool, they are discarded (since the pool is full) but total is NOT decremented: code of boundedPool.put
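Continuing the same simplified sketch (again, not the actual source), put drops the extra connection when the channel is full without touching the counter:

```go
func (p *boundedPool) put(conn net.Conn) error {
	select {
	case p.conns <- conn:
		// Room in the pool: keep the connection for reuse.
		return nil
	default:
		// Pool already full: the connection is discarded, but total is NOT
		// decremented, so the counter now over-reports live connections.
		return conn.Close()
	}
}
```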

At this point, with the default values, the pool has 3 working connections (== capacity) but total may be larger than that, let's say 6.

Now, if the target node goes down, the connections will be marked as unusable, which closes them and removes them from the pool: code of boundedPool.MarkUnusable

Here total is decremented, but since it was above capacity we end up with an empty pool (all connections fail/time out since the target node is down) while total is still above 0. It will be 3 in our example.
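In the simplified sketch this step looks roughly like the following; closeUnusable is a hypothetical stand-in for the MarkUnusable/Close path, not a real method name:

```go
// Hypothetical stand-in for the MarkUnusable/Close path in the sketch above:
// the connection is closed and the counter is decremented. Because total was
// inflated above capacity earlier, it can stay positive (3 in the example)
// even after every real connection has been closed and the pool is empty.
func (p *boundedPool) closeUnusable(conn net.Conn) error {
	atomic.AddInt32(&p.total, -1)
	return conn.Close()
}
```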

From then on, every call to boundedPool.Get will fail:

  • The pool is empty, so there is no existing connection to reuse
  • total = 3, which is NOT less than capacity, so no new connection is opened
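One way to restore the invariant, shown here as a sketch on the simplified pool above and not necessarily the exact change in this PR, is to decrement total whenever a connection is discarded instead of being returned:

```go
func (p *boundedPool) put(conn net.Conn) error {
	select {
	case p.conns <- conn:
		return nil
	default:
		// Keep the counter in sync with the number of live connections when
		// an excess connection is dropped.
		atomic.AddInt32(&p.total, -1)
		return conn.Close()
	}
}
```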

@PierreF PierreF mentioned this pull request Apr 13, 2016
@jwilder
Contributor

jwilder commented Apr 19, 2016

@PierreF Is there some way I can reproduce this locally?

@PierreF
Contributor Author

PierreF commented Apr 20, 2016

As said in #5876, I was able to reproduce it with the following steps:

  • Restart all cluster nodes (I always used a 3-node cluster)
  • Start influx_stress on node1, keep it running for about 30 seconds, then stop it. The objective of this step is to create a rush in boundedPool.Get and have total exceed the capacity. We stop the stress run to make sure it tries to put all connections back in the pool (thus discarding the excess connections without reducing the total counter)
  • Kill/stop node2
  • Restart influx_stress on node1. This should mark all connections as unusable and close them since node2 is down. Once all connections are closed, we end up with no connections in the pool but total > 0

With those steps, I was able to reproduce it every time on my laptop, using 0.11.1 with a 3-node cluster. Needless to say, with this PR I was no longer able to reproduce it with the above steps.

@jwilder
Contributor

jwilder commented Apr 25, 2016

@PierreF Can you add a changelog entry under 0.13?

@jwilder jwilder added this to the 0.13.0 milestone Apr 25, 2016
@PierreF
Contributor Author

PierreF commented Apr 25, 2016

Changelog entry added + rebased on master

@jwilder
Contributor

jwilder commented Apr 25, 2016

👍 Thanks @PierreF!

@jwilder jwilder merged commit 9a3a52a into influxdata:master Apr 25, 2016