Description
Go version
go version go1.20.6 linux/amd64
What operating system and processor architecture are you using (go env)?
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/shipeng/.cache/go-build"
GOENV="/home/shipeng/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/pay/src/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/pay/src/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.6"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/pay/tmp/go-build1095428125=/tmp/go-build -gno-record-gcc-switches"
What did you do?
We use pgx v4 as the driver and database/sql as the connection pool between client requests and the database. Our server and database are in different regions, with 50-60 ms of network latency between them. Our host cannot recover and complete client queries in time, even after the network has recovered from earlier packet loss events. A script that reproduces the issue is available here: https://github.com/spencer-x/go-database-sql-issue#ways-to-reproduce-the-issue-in-linux-machine.
What did you expect to see?
The service host should be able to serve requests within the timeout limits after the network recovers.
What did you see instead?
Even after the network between the host and the database recovered, client requests continued to time out. The number of open connections reported by the connection pool stats (200) was considerably higher than the number of connections reported by the PostgreSQL database (20-30).
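For readers who want to observe the pool-side numbers themselves, here is a minimal sketch of reading the counters from db.Stats(). It is not the code from our service: the "stub" driver, stubConn, and poolSnapshot are hypothetical names introduced only so the example runs without a database, and the stub stands in for pgx. The limit of 200 mirrors the open-connection count mentioned above and is an assumption about the pool configuration.

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
)

// stubDriver is a hypothetical stand-in for pgx so the sketch needs no database.
type stubDriver struct{}

func (stubDriver) Open(string) (driver.Conn, error) { return &stubConn{}, nil }

// stubConn implements only the minimum that driver.Conn requires.
type stubConn struct{}

func (*stubConn) Prepare(string) (driver.Stmt, error) { return nil, errors.New("stub: unsupported") }
func (*stubConn) Close() error                        { return nil }
func (*stubConn) Begin() (driver.Tx, error)           { return nil, errors.New("stub: unsupported") }

func init() { sql.Register("stub", stubDriver{}) }

// poolSnapshot checks out `acquire` connections, returns `release` of them
// to the pool, and reports the pool's own view of its connections.
func poolSnapshot(acquire, release int) (sql.DBStats, error) {
	db, err := sql.Open("stub", "")
	if err != nil {
		return sql.DBStats{}, err
	}
	defer db.Close()
	db.SetMaxOpenConns(200) // assumed limit, matching the 200 seen in the stats

	ctx := context.Background()
	conns := make([]*sql.Conn, 0, acquire)
	for i := 0; i < acquire; i++ {
		c, err := db.Conn(ctx) // dedicated connection, opened on demand
		if err != nil {
			return sql.DBStats{}, err
		}
		conns = append(conns, c)
	}
	for i := 0; i < release && i < len(conns); i++ {
		// Close on *sql.Conn hands the connection back to the pool.
		if err := conns[i].Close(); err != nil {
			return sql.DBStats{}, err
		}
	}
	return db.Stats(), nil
}

func main() {
	s, err := poolSnapshot(3, 1)
	if err != nil {
		panic(err)
	}
	fmt.Printf("open=%d inUse=%d idle=%d waitCount=%d\n",
		s.OpenConnections, s.InUse, s.Idle, s.WaitCount)
}
```

OpenConnections is the pool's internal count, so it can include connections the pool has only reserved and not yet established. That is one way it can diverge from what the database server itself reports.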
We looked into the issue and found that many connections were closed during the packet loss event. Even after the network recovered, pgx continued to close connections whenever a client canceled a query because its context timed out. The connection opener in database/sql rapidly incremented db.numOpen, but it opened the new connections sequentially, which was relatively slow. Consequently, few usable connections were available in the pool.
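To give a rough sense of why sequential opening hurts at this latency, the following back-of-the-envelope sketch estimates how long it takes to re-establish a pool of connections. It is a simplified model, not a measurement. The refillTime helper is hypothetical, and the figure of 4 network round trips per new connection is an assumption chosen for illustration (the real number depends on TLS and authentication settings). The 55 ms round-trip time is the midpoint of the 50-60 ms latency described above.

```go
package main

import (
	"fmt"
	"time"
)

// refillTime estimates how long it takes to open n connections when each
// one costs roundTrips network round trips of rttMillis milliseconds and at
// most `parallel` connections are being opened at the same time.
func refillTime(n, roundTrips, parallel, rttMillis int) time.Duration {
	if parallel < 1 {
		parallel = 1
	}
	perConn := time.Duration(roundTrips*rttMillis) * time.Millisecond
	batches := (n + parallel - 1) / parallel // ceiling division
	return time.Duration(batches) * perConn
}

func main() {
	const (
		poolSize   = 200 // open-connection count reported by the pool stats
		roundTrips = 4   // ASSUMPTION: round trips needed per new connection
		rttMillis  = 55  // midpoint of the 50-60 ms cross-region latency
	)
	fmt.Println("one at a time: ", refillTime(poolSize, roundTrips, 1, rttMillis))
	fmt.Println("20 in parallel:", refillTime(poolSize, roundTrips, 20, rttMillis))
}
```

Under these assumptions, opening 200 connections one at a time takes 200 x 4 x 55 ms = 44 s, while opening 20 at a time takes about 2.2 s. With query timeouts shorter than the sequential refill time, clients keep canceling and connections keep being discarded before the pool fills, which is consistent with the behavior we observed.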