db operation fails with "broken pipe" instead of reconnecting transparently after server restart #870

fho · 2019-05-27T13:44:31Z

I'm using lib/pq v1.1.1, go 1.12.5, Linux 4.19.45-1-lts.

I have a db handle on which I run one or more operations. I then restart the db server and wait until its startup has finished. The next query fails if all previous TCP connections from the operating system to the database were closed in the meantime (ss -na | grep 5432 shows nothing).
The operation fails with: write tcp [::1]:45676->[::1]:5432: write: broken pipe.
If another query is done after the failed one, it succeeds.

I expect the db.Exec() query to succeed once the postgresql restart has finished, with the sql package or pq driver retrying and reconnecting transparently in the background if needed.

If the postgresql-server restart happens quickly, while there are still TCP connections in FIN-WAIT-2 or another state, the db operations after the postgres restart succeed.

How to reproduce:

1. Run docker run -p 5432:5432 postgres:latest
2. In another terminal, run https://gist.github.com/fho/777ccc77971612f3659cbdf5cef27ede
   and pass postgresql://postgres@localhost?sslmode=disable as the command-line argument.
3. Press q to do a sql query.
4. Shut down the postgres server by pressing ctrl + c in its terminal.
5. Check watch -n 0.5 "sh -c 'ss -na | grep 5432'" and wait until the TCP connections have vanished.
6. Start postgres again: run docker run -p 5432:5432 postgres:latest
7. Press q in the terminal that runs the go program to trigger another query => it fails.

My first idea for a fix was to return ErrBadConn on broken pipe errors, but as discussed in #422, this has the issue that operations might be executed more than once.

The mysql driver seems to solve it by having a custom error type to indicate retryable connection errors.
If a Write() on the TCP socket failed before a whole SQL statement was sent, it's safe to retry the operation.
The caller of conn.send() could decide this and set the error to ErrBadConn.


Since the commit "Don't return ErrBadConn on a network error", a net.OpError no longer results in driver.ErrBadConn. As a consequence, in some situations the sql package does not retry an operation when it should. For example, when the postgresql server is restarted, a broken pipe error might happen for the first query that is issued after the server has finished its startup (lib#870). With this commit, driver.ErrBadConn is returned for network errors when it is ensured that the server has not already executed the operation. This is the case, for example, when a network error occurs in the call that sends the message initiating the query to the server.

In some situations the sql package does not retry a pq operation when it should. One of these situations is lib#870: when a postgresql-server is restarted and, after the restart has finished, an operation is triggered on the already established connection, it fails with a broken pipe error in some circumstances. The sql package does not retry the operation and fails instead, because the pq driver does not return driver.ErrBadConn for network errors.

The driver must not return ErrBadConn when the server might have already executed the operation. Otherwise the sql package would retry it and the postgresql server would run the operation multiple times. In some situations, however, it is safe to return ErrBadConn on network errors. This is the case when it is ensured that the server did not receive the message that triggers the operation.

This commit introduces a netErrorNoWrite error. It should be used when a network operation panics and it is safe to retry the operation. When errRecover() receives this error, it returns ErrBadConn and marks the connection as bad. A mustSendRetryable() function is introduced that wraps a net.OpError in a netErrorNoWrite when panicking. mustSendRetryable() is called in situations where the send that triggers the operation failed.


dharmjit · 2020-04-17T21:48:14Z

Hi, I am also getting random broken pipe errors. Is there any update on this issue?

fho · 2020-04-18T07:39:30Z

@dharmjit no, nothing new regarding the issue.
You can use my fix from my PR: #871

nullbio · 2020-05-27T23:50:48Z

I'm getting this error on my localhost after leaving my webserver running overnight and then attempting to log in. When it hits the database for the first time that day, it gets a write: broken pipe error. I believe this is because the connection timed out overnight but the handle wasn't gracefully dropped from the pool?

I'm seeing "broken pipe" errors when working with CRDB using sqlx. The issue seems to be that the TCP connections were disconnected while the db driver (pq) still held the stale connections. It happens more often when the DB is behind a proxy. In our case, the pods were proxied by the envoy sidecar. Others in the community have reported similar issues and used different workarounds: sending periodic dummy queries from the app to mimic keepalives, lengthening the proxy idle timeout, or shortening the lifetime of db connections. This has been reported and fixed by the lib/pq upstream in v1.9+. lib/pq#1013 lib/pq#723 lib/pq#897 lib/pq#870 grafana/grafana#29957

fho · 2022-05-31T18:26:03Z

@agis it was fixed via #1013

fho mentioned this issue May 28, 2019

allow retry on netErrors in safe situations #871

Closed

jacksontj mentioned this issue Oct 9, 2019

Hydra write to database: broken pipe ory/hydra#1599

Closed

stevendanna mentioned this issue Nov 5, 2019

bump lib/pq to 1.2.0 chef/automate#2115

Closed

easokol mentioned this issue Nov 10, 2020

Sometimes postgres write broken pipe 500 odahu/odahu-flow#328

Open

schinns mentioned this issue Nov 12, 2020

stale db connection due to "broken pipe" film42/pgreba#15

Closed

dhermes mentioned this issue Nov 19, 2020

Mark net.Conn failed writes as recoverable when 0 bytes were written. #1013

Merged

maddyblue closed this as completed Nov 23, 2020

mtrang1263 mentioned this issue Dec 11, 2020

BCDA-3710 - Resolve issues with stale db connections CMSgov/bcda-app#600

Merged

marefr mentioned this issue Feb 16, 2021

Error broken pipe on dashboard grafana/grafana#29957

Closed

roylee17 mentioned this issue Mar 21, 2021

bump lib/pq ftom 1.2 to 1.10 jmoiron/sqlx#715

Closed

cainlevy mentioned this issue Mar 30, 2021

reconnect to on-demand databases keratin/authn-server#173

Closed

rtfb added a commit to rtfb/rtfblog that referenced this issue Oct 5, 2022

Bump pq version in an attempt to fix intermittent broken pipe

2a803a2

Presumably, caused by this[1] problem, which has been fixed a couple of years ago. [1] lib/pq#870

ChiQuang98 mentioned this issue Dec 9, 2022

Postgres lib update absmach/magistrala#1667

Closed
