database/sql: few usable connections in connection pool after network packet loss event #64614

Open
@xshipeng

Go version

go version go1.20.6 linux/amd64

What operating system and processor architecture are you using (go env)?

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/shipeng/.cache/go-build"
GOENV="/home/shipeng/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/pay/src/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/pay/src/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.6"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/pay/tmp/go-build1095428125=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We use pgx v4 as the driver and database/sql as the connection pool between client requests and the database. Our server and database are in different regions, so the network latency between them is 50-60 ms. After a packet loss event, our host fails to recover: client queries keep missing their deadlines even once the network is healthy again. Steps and a script to reproduce the issue are here: https://github.com/spencer-x/go-database-sql-issue#ways-to-reproduce-the-issue-in-linux-machine.

What did you expect to see?

The service host should serve requests within their timeout limits once the network recovers.

What did you see instead?

Even after the network between the host and the database recovered, client requests continued to time out. The number of open connections reported by the connection pool stats (200) was considerably higher than the number of connections reported by the PostgreSQL database (20-30).

We looked into the issue and found that many connections were closed during the packet loss event. Even after the network recovered, pgx kept closing connections whenever a client canceled a query because of a context timeout. The connection opener in database/sql quickly increased db.numOpen, but it opened new connections sequentially and relatively slowly. As a result, few usable connections were available in the pool.

Labels: NeedsInvestigation (someone must examine and confirm this is a valid issue and not a duplicate of an existing one)
