Does this issue reproduce with the latest release?
We've since mitigated the issue by moving away from using database/sql, but it was still reproducible in 1.16.1.
What operating system and processor architecture are you using (go env)?
A single static binary packaged into a scratch container.
What did you do?
Despite configuring the maximum idle connections and maximum connection lifetime, we continued to encounter edge cases where the database/sql package would use a connection that had exceeded the maximum connection lifetime. Since the upstream database had already closed the connection, we'd get an RST. It could happen on any code path and with any query, so adding retry logic proved difficult.
The database/sql package isn't very observable, so we don't have hard evidence to provide. However, this behavior occurred most commonly when our orchestration platform scaled the container's CPU to 0. The container was still running (and time was still passing), but it was unable to do any "work". Then, when the orchestrator increased the CPU again, a race occurred. Sometimes the connection was correctly marked as unusable because its maximum lifetime had elapsed, but sometimes it was returned from the pool as "healthy" even though it had exceeded the maximum lifetime. We attempted to trace the package to understand how this could happen, but we were unsuccessful (@jeremyfaller).
We moved away from using database/sql to using pgx (and pgxpool) directly. That library has various hooks into connections that allow us to customize observability and, most importantly, verify the connection is healthy immediately before use. Since switching libraries, we haven't experienced any RSTs.