Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

database/sql: connection pool was originally FIFO, is now random, but should be LIFO #31708

lizthegrey opened this issue Apr 27, 2019 · 4 comments


Copy link

@lizthegrey lizthegrey commented Apr 27, 2019

The previous SQL connection pool behavior used a FIFO queue implemented as a slice, but in 4f6d4bb was changed as a side effect to read the first entry in map recurse order (effectively random among requests not yet timed out when execution started).

I believe we can do much better than this -- we should go back to dequeuing the pending requests in an ordered fashion, while continuing to not leak cancelled requests. And we should be able to instrument to see whether people are getting better performance out of LIFO, FIFO, or random.

See also: #22697 talks about greater pool observability and configurability
See also: #18080 talks about observability as well (Honeycomb has implemented our version of passing contexts into wrapped SQL calls, but loses visibility once they reach the SQL layer)

What version of Go are you using (go version)?

$ go version
go version go1.11.6 linux/amd64

Does this issue reproduce with the latest release?


What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build473464966=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Saturate our connection pool with a higher rate of incoming requests than it could temporarily handle after our database got slow. Almost no requests succeeded because we picked random requests from the pool to work on, many of which didn't have enough remaining time to complete without expiring after being pulled off the pending map.

What did you expect to see?

If we have requests coming in 10% faster than they can be fulfilled, 90% of them should succeed and only 10% should time out. And we should have a separate waitDuration counter for successful requests that were dequeued and started running, vs. cancelled requests that timed out before even starting to be serviced.

What did you see instead?

~100% of connections timed out because most of the connections in the pool became dead while being serviced, and random selection wasn't trying first the requests most likely to succeed. and we couldn't tell why, because the waitDuration of the successful requests wasn't separated from the failed requests.

Copy link

@lizthegrey lizthegrey commented Apr 27, 2019

cc @kardianos @julieqiu

@odeke-em odeke-em changed the title SQL connection pool was FIFO, is now random, but should be LIFO database/sql: connection pool was originally FIFO, is now random, but should be LIFO Apr 27, 2019
@odeke-em odeke-em added the NeedsDecision label Apr 27, 2019
Copy link

@odeke-em odeke-em commented Apr 27, 2019

Hello @lizthegrey, thank you for filing this bug and welcome to the Go project!

Randomization comes to bite us here: interestingly this is a case where we can perhaps draw lessons from load balancing algorithms e.g. in -- a guide of which you @lizthegrey are an author :)

In the bucket of ideas to try, perhaps the pool could be a heap where the keys during insertion are the time left before expiry, and inherently by the definition of a heap, we'll be able to service requests by critically of time to expire.

Kindly paging @bradfitz @rsc too for examining pooling mechanisms

In regards to instrumentation of database/sql wrappers, @basvanbeek and small part from me, courtesy of OpenCensus, worked on which provides metrics and traces for SQL calls and is most prominently used in the Go cloud development kit

Copy link

@stevenh stevenh commented Jun 14, 2020

My PR for the related issues #39471 implements LIFO by swapping fast delete with slow to maintain order and reusing connections from the end not beginning.

Copy link

@thefallentree thefallentree commented May 5, 2022

My PR for the related issues #39471 implements LIFO by swapping fast delete with slow to maintain order and reusing connections from the end not beginning.

@stevenh I think your PR is not related to this issue, this issue is talking about when driver have a free connection, it actually pick a random one form the queue and give the connection to that. which is usually not what you want! The more connection pending, the more likely your oldest requests are starved waiting! Implement FIFO seems to be a good way of doing it , LIFO has even more fairness issues

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet

No branches or pull requests

4 participants