database/sql: few usable connections in connection pool after network packet loss event #64614

Open
xshipeng opened this issue Dec 8, 2023 · 4 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Comments

xshipeng commented Dec 8, 2023

Go version

go version go1.20.6 linux/amd64

What operating system and processor architecture are you using (go env)?

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/shipeng/.cache/go-build"
GOENV="/home/shipeng/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/pay/src/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/pay/src/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.6"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/pay/tmp/go-build1095428125=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We're using pgx v4 as the driver and database/sql as the connection pool for client requests and database connections. Our server and database are in different regions, which adds 50-60 ms of network latency. We're hitting an issue where the host can't recover and complete client queries within their timeouts, even after the network has recovered from a prior packet loss event. See this link for how to reproduce the issue with a script: https://github.com/spencer-x/go-database-sql-issue#ways-to-reproduce-the-issue-in-linux-machine.
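
For reference, a minimal sketch of the kind of setup involved; the DSN, timeout, idle and lifetime settings here are illustrative, only MaxOpenConns = 200 corresponds to the pool stats shown later:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/jackc/pgx/v4/stdlib" // registers the "pgx" driver
)

func main() {
	// Hypothetical DSN for a database in another region (50-60 ms RTT).
	db, err := sql.Open("pgx", "postgres://user:pass@db.other-region:5432/app")
	if err != nil {
		log.Fatal(err)
	}
	db.SetMaxOpenConns(200) // matches MaxOpenConnections:200 in the stats below
	db.SetMaxIdleConns(50)  // illustrative
	db.SetConnMaxLifetime(30 * time.Minute)

	// Client queries run with a short timeout; when the context expires,
	// pgx cancels the query and closes the underlying connection.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	var one int
	if err := db.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
		log.Println("query failed:", err)
	}
}
```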

What did you expect to see?

The service host should be able to serve requests within the timeout limits after the network recovers.

What did you see instead?

Even after the network between the host and the database recovered, client requests continued to time out. The number of open connections reported by the connection pool stats was considerably higher (200) than the actual number of connections reported by the PostgreSQL database (20-30).

We looked into the issue and found that many connections were closed during the network packet loss event. Even after the network recovered, pgx continued to close connections when clients canceled queries due to context timeouts. The connection opener in database/sql rapidly increased db.numOpen, but new connections were opened sequentially and relatively slowly. Consequently, few usable connections were available from the connection pool.
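
For context, the opener loop in database/sql looks roughly like this (paraphrased from go1.20 src/database/sql/sql.go, not quoted verbatim): a single goroutine drains openerCh and dials one connection at a time, so each pending open pays the full cross-region dial latency before the next one starts.

```go
// Paraphrased from database/sql (go1.20): one goroutine services openerCh
// and opens new connections sequentially.
func (db *DB) connectionOpener(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-db.openerCh:
			db.openNewConnection(ctx) // blocks on connector.Connect: ~RTT + handshake
		}
	}
}
```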

@xshipeng xshipeng changed the title database/sql: host could not auto recover after network packet loss event database/sql: few usable connections in connection pool after network packet loss event Dec 8, 2023
@seankhliao
Member

Maybe I'm not understanding the issue: is it the driver not returning database/sql/driver.ErrBadConn to inform database/sql.DB to drop the connection or something else?

@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 8, 2023
xshipeng commented Dec 8, 2023

The pgx driver returned database/sql/driver.ErrBadConn in ResetSession and many other places, so database/sql.DB knew the connection was closed.
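
For anyone following along, a minimal sketch of the contract in question (a hypothetical driver, not pgx's actual code): a driver that implements driver.SessionResetter can return driver.ErrBadConn from ResetSession, and the pool then discards the connection instead of reusing it.

```go
package mydriver // hypothetical driver, for illustration only

import (
	"context"
	"database/sql/driver"
)

type conn struct {
	dead bool // set when the underlying TCP connection has been torn down
}

// ResetSession is called by database/sql before handing a pooled connection
// to the next request. Returning driver.ErrBadConn tells the pool to drop
// the connection and (eventually) open a replacement.
func (c *conn) ResetSession(ctx context.Context) error {
	if c.dead {
		return driver.ErrBadConn
	}
	return nil
}
```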

To debug further, I exposed some internal variables of the database/sql connection pool and printed them out.

We can see that NumPendingOpenConn (the number of pending opens in openerCh) stays high even after the packet loss event has ended.

time: 103.00013455s, err/req (97/108), NumConnClosed: 288, NumConnRequests: 296, NumPendingOpenConn: 195, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:8702 WaitDuration:5h48m24.850698435s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 104.000693703s, err/req (79/89), NumConnClosed: 289, NumConnRequests: 295, NumPendingOpenConn: 194, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:8792 WaitDuration:5h52m36.032147827s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 105.000788483s, err/req (104/114), NumConnClosed: 291, NumConnRequests: 296, NumPendingOpenConn: 195, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:8906 WaitDuration:5h58m0.505383481s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 106.000719511s, err/req (98/111), NumConnClosed: 291, NumConnRequests: 295, NumPendingOpenConn: 194, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9017 WaitDuration:6h3m15.139842187s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 107.000629773s, err/req (77/94), NumConnClosed: 291, NumConnRequests: 293, NumPendingOpenConn: 192, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9111 WaitDuration:6h7m35.933963032s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 108.000874808s, err/req (99/111), NumConnClosed: 294, NumConnRequests: 295, NumPendingOpenConn: 194, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9223 WaitDuration:6h12m47.116478313s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 109.00060339s, err/req (97/110), NumConnClosed: 294, NumConnRequests: 293, NumPendingOpenConn: 192, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9333 WaitDuration:6h17m54.046824665s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 110.000664355s, err/req (84/102), NumConnClosed: 295, NumConnRequests: 291, NumPendingOpenConn: 192, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9433 WaitDuration:6h22m30.295615995s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 111.000503913s, err/req (95/107), NumConnClosed: 300, NumConnRequests: 296, NumPendingOpenConn: 196, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9544 WaitDuration:6h27m24.105551002s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 112.000255612s, err/req (96/105), NumConnClosed: 301, NumConnRequests: 295, NumPendingOpenConn: 195, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9650 WaitDuration:6h32m27.629826103s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 113.0007683s, err/req (89/97), NumConnClosed: 301, NumConnRequests: 295, NumPendingOpenConn: 194, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9748 WaitDuration:6h37m5.057659169s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 114.000850286s, err/req (98/112), NumConnClosed: 304, NumConnRequests: 296, NumPendingOpenConn: 196, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9861 WaitDuration:6h42m14.895479289s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 115.000485688s, err/req (95/103), NumConnClosed: 306, NumConnRequests: 297, NumPendingOpenConn: 196, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9967 WaitDuration:6h47m12.84743957s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
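
The logging itself is nothing special. db.Stats() is exported; NumConnClosed, NumConnRequests and NumPendingOpenConn are internal counters that were only visible after patching database/sql, so only the standard stats appear in this sketch:

```go
package main

import (
	"database/sql"
	"log"
	"time"
)

// logPoolStats prints pool statistics once per second, matching the shape of
// the output above. The internal counters required a patched database/sql
// and are omitted here.
func logPoolStats(db *sql.DB) {
	start := time.Now()
	for range time.Tick(time.Second) {
		log.Printf("time: %s, DB stats: %+v", time.Since(start), db.Stats())
	}
}
```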

xshipeng commented Dec 8, 2023

One potential quick fix is to open connections in parallel, along the lines of the sketch below. With that change NumPendingOpenConn drains quickly and the pool recovers shortly after the packet loss event ends. I can help submit a PR if needed.
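
Roughly the idea (a sketch only, not the actual diff): let the opener dial concurrently instead of inline, so a burst of pending opens costs about one round trip of latency rather than one per connection. A real patch would probably want to bound the concurrency.

```go
// Sketch only, not the actual patch: dial concurrently instead of inline.
func (db *DB) connectionOpener(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-db.openerCh:
			go db.openNewConnection(ctx) // previously a synchronous call
		}
	}
}
```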

@seankhliao
Member

cc @kardianos

@seankhliao seankhliao added this to the Unplanned milestone Jul 13, 2024