
database/sql: connection pool was originally FIFO, is now random, but should be LIFO #31708

Open
lizthegrey opened this issue Apr 27, 2019 · 2 comments

@lizthegrey lizthegrey commented Apr 27, 2019

The previous SQL connection pool behavior used a FIFO queue implemented as a slice, but 4f6d4bb changed it, as a side effect, to take the first entry in map iteration order (effectively random among the requests that had not yet timed out when execution started).

I believe we can do much better than this -- we should go back to dequeuing the pending requests in an ordered fashion while continuing not to leak cancelled requests. We should also be able to instrument the pool to see whether people get better performance out of LIFO, FIFO, or random selection.
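
To illustrate the idea (a minimal sketch only, not the actual database/sql internals; the type and field names here are made up), an ordered pending-request queue can still avoid leaking cancelled requests by skipping entries whose context is already done at dequeue time:

package pool

import "context"

// connRequest is a hypothetical stand-in for a pending request for a
// connection, carrying the caller's context.
type connRequest struct {
	ctx   context.Context
	ready chan struct{} // closed when a connection is handed to the waiter
}

// pendingQueue dequeues oldest-first (FIFO); reversing the order would give LIFO.
type pendingQueue struct {
	reqs []*connRequest
}

func (q *pendingQueue) enqueue(r *connRequest) {
	q.reqs = append(q.reqs, r)
}

// dequeue returns the oldest request whose context has not been cancelled,
// dropping cancelled entries so they don't leak.
func (q *pendingQueue) dequeue() *connRequest {
	for len(q.reqs) > 0 {
		r := q.reqs[0]
		q.reqs = q.reqs[1:]
		select {
		case <-r.ctx.Done():
			continue // already timed out or cancelled; skip it
		default:
			return r
		}
	}
	return nil
}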

See also: #22697, which discusses greater pool observability and configurability.
See also: #18080, which also touches on observability (Honeycomb has implemented our own version of passing contexts into wrapped SQL calls, but loses visibility once they reach the SQL layer).

What version of Go are you using (go version)?

$ go version
go version go1.11.6 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/lizf/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/lizf/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/lib/go-1.11"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go-1.11/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/lizf/go/src/github.com/honeycombio/hound/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build473464966=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We saturated our connection pool with a higher rate of incoming requests than it could handle after our database temporarily got slow. Almost no requests succeeded, because the pool picked random pending requests to work on, many of which no longer had enough remaining time to complete before expiring once pulled off the pending map.

What did you expect to see?

If we have requests coming in 10% faster than they can be fulfilled, 90% of them should succeed and only 10% should time out. And we should have separate waitDuration counters for successful requests that were dequeued and started running versus cancelled requests that timed out before ever being serviced.
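
For example (a sketch only; these counters do not exist in database/sql.DBStats today, and the names are hypothetical), the split might look like:

package pool

import (
	"sync/atomic"
	"time"
)

// waitStats keeps the wait time of requests that were eventually served
// separate from the wait time of requests that timed out while still queued.
type waitStats struct {
	servedWaitNanos   int64
	timedOutWaitNanos int64
}

func (s *waitStats) record(wait time.Duration, served bool) {
	if served {
		atomic.AddInt64(&s.servedWaitNanos, int64(wait))
	} else {
		atomic.AddInt64(&s.timedOutWaitNanos, int64(wait))
	}
}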

What did you see instead?

~100% of requests timed out, because most of the pending requests in the pool had already gone dead by the time they were serviced, and random selection wasn't trying the requests most likely to succeed first. And we couldn't tell why, because the waitDuration of successful requests wasn't separated from that of failed requests.

@odeke-em odeke-em changed the title SQL connection pool was FIFO, is now random, but should be LIFO database/sql: connection pool was originally FIFO, is now random, but should be LIFO Apr 27, 2019
@odeke-em odeke-em commented Apr 27, 2019

Hello @lizthegrey, thank you for filing this bug and welcome to the Go project!

Randomization comes to bite us here: interestingly, this is a case where we can perhaps draw lessons from load balancing algorithms, e.g. in https://landing.google.com/sre/sre-book/chapters/load-balancing-datacenter/ -- a guide of which you, @lizthegrey, are an author :)

In the bucket of ideas to try: perhaps the pool could be a heap whose keys at insertion time are the time left before expiry; then, inherently by the definition of a heap, we'd be able to service requests by criticality of time to expiry.
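
A rough sketch of that idea (illustrative names only, not a proposed patch to database/sql), using container/heap keyed on the caller's deadline:

package pool

import (
	"container/heap"
	"time"
)

// pendingReq is a hypothetical pending connection request with the deadline
// taken from the caller's context.
type pendingReq struct {
	deadline time.Time
	index    int // maintained by container/heap
}

// expiryHeap orders pending requests by how soon they expire.
type expiryHeap []*pendingReq

func (h expiryHeap) Len() int           { return len(h) }
func (h expiryHeap) Less(i, j int) bool { return h[i].deadline.Before(h[j].deadline) }
func (h expiryHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index, h[j].index = i, j
}
func (h *expiryHeap) Push(x interface{}) {
	r := x.(*pendingReq)
	r.index = len(*h)
	*h = append(*h, r)
}
func (h *expiryHeap) Pop() interface{} {
	old := *h
	n := len(old)
	r := old[n-1]
	*h = old[:n-1]
	return r
}

// next pops the pending request closest to expiring; inverting Less would
// instead favor the requests with the most time left.
func next(h *expiryHeap) *pendingReq {
	if h.Len() == 0 {
		return nil
	}
	return heap.Pop(h).(*pendingReq)
}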

Kindly paging @bradfitz and @rsc too for examining pooling mechanisms.

With regard to instrumentation of database/sql wrappers, @basvanbeek (and, in small part, me), courtesy of OpenCensus, worked on https://opencensus.io/integrations/sql/go_sql/, which provides metrics and traces for SQL calls and is most prominently used in the Go Cloud Development Kit, https://gocloud.dev/.
