db operation fails with "broken pipe" instead of reconnecting transparently after server restart #870

Closed
fho opened this issue May 27, 2019 · 4 comments

Comments

@fho

fho commented May 27, 2019

I'm using lib/pq v1.1.1, go 1.12.5, Linux 4.19.45-1-lts.

I have a db handle on which I run one or more operations. I then restart the db server and wait until its startup has finished. The next query fails if all of the operating system's previous TCP connections to the database were closed in the meantime (ss -na | grep 5432 shows nothing).
The operation fails with: write tcp [::1]:45676->[::1]:5432: write: broken pipe.
If another query is done after the failed one, it succeeds.

I expect the db.Exec() query to succeed after the postgresql restart has finished, with the sql package or pq driver retrying and reconnecting transparently in the background if needed.

If the postgresql-server restart happens quickly, while there are still TCP connections in FIN-WAIT-2 or another state, the db operations after the postgres restart succeed.


How to reproduce:

  • run docker run -p 5432:5432 postgres:latest
  • run in another terminal https://gist.github.com/fho/777ccc77971612f3659cbdf5cef27ede
    pass as command-line argument postgresql://postgres@localhost?sslmode=disable
  • Press q to do a sql query
  • Shut down the postgres server by pressing ctrl + c in its terminal
  • run watch -n 0.5 "ss -na | grep 5432" and wait until the TCP connections have vanished
  • start postgres again: run docker run -p 5432:5432 postgres:latest
  • press q in the terminal that runs the go program to trigger another query => it fails

My first idea for a fix was to return ErrBadConn on broken pipe errors, but as discussed in #422 this has the problem that operations might be executed twice.

The mysql driver seems to solve it with a custom error type that indicates retryable connection errors.
If a Write() on the TCP socket fails before a whole SQL statement was sent, it's safe to retry the operation.
The caller of conn.send() could decide this and set the error to ErrBadConn.

fho added a commit to fho/pq that referenced this issue May 28, 2019
Since the commit "Don't return ErrBadConn on a network error", a
net.OpError no longer results in driver.ErrBadConn.

As a result, in some situations the sql package does not retry an
operation when it should.
E.g. when the postgresql server is restarted, a broken pipe error might
occur for a query that is run after the server has finished starting up
(lib#870).

With this commit, driver.ErrBadConn is returned for network errors when
it is certain that the server has not already executed the operation.

This is the case when, e.g., a network error occurs in the call that
sends the message initiating the query to the server.
fho added a commit to fho/pq that referenced this issue May 28, 2019
In some situations the sql package does not retry a pq operation when it
should.
One of these situations is lib#870. When a postgresql server is
restarted and, after the restart has finished, an operation is triggered
on the already established connection, the operation fails with a broken
pipe error in some circumstances.

The sql package does not retry the operation and fails instead, because
the pq driver does not return driver.ErrBadConn for network errors.
The driver must not return ErrBadConn when the server might have already
executed the operation. Doing so would make the sql package retry it,
and the postgresql server would run the operation multiple times.
In some situations it is safe to return ErrBadConn on network errors.
This is the case when it is certain that the server did not receive the
message that triggers the operation.

This commit introduces a netErrorNoWrite error. It should be used when a
network operation panics and it is safe to retry the operation.
When errRecover() receives this error, it returns ErrBadConn and marks
the connection as bad.
A mustSendRetryable() function is introduced that wraps a net.OpError in
a netErrorNoWrite when panicking.
mustSendRetryable() is used for the send that triggers the operation.
fho added a commit to fho/pq that referenced this issue May 28, 2019
In some situations the sql package does not retry a pq operation when it
should.
One of these situations is lib#870. When a postgresql server is
restarted and, after the restart has finished, an operation is triggered
on the same db handle, the operation fails with a broken pipe error in
some circumstances.

The sql package does not retry the operation and fails instead, because
the pq driver does not return driver.ErrBadConn for network errors.
The driver must not return ErrBadConn when the server might have already
executed the operation. Doing so would make the sql package retry it,
and the postgresql server would run the operation multiple times.
In some situations it is safe to return ErrBadConn on network errors.
This is the case when it is certain that the server did not receive the
message that triggers the operation.

This commit introduces a netErrorNoWrite error. It should be used when a
network operation panics and it is safe to retry the operation.
When errRecover() receives this error, it returns ErrBadConn and marks
the connection as bad.
A mustSendRetryable() function is introduced that wraps a net.OpError in
a netErrorNoWrite when panicking.
mustSendRetryable() is used for the send that triggers the operation.
@dharmjit

Hi, I am also getting random broken pipe errors. Is there any update on this issue?

@fho
Author

fho commented Apr 18, 2020

@dharmjit no, nothing new regarding the issue.
You can use my fix from my PR: #871

@nullbio

nullbio commented May 27, 2020

I'm getting this error on my localhost after leaving my webserver running overnight and then attempting to log in. When it hits the database for the first time that day, it gets a write: broken pipe error. I believe this is because the connection timed out overnight but the stale handle wasn't gracefully dropped from the pool?

roylee17 added a commit to roylee17/sqlx that referenced this issue Mar 21, 2021
    I'm seeing "broken pipe" errors when working with CRDB using sqlx.
    The issue seems to be that the TCP connections were disconnected while
    the db driver (pq) still held the stale connections.

    It happens more often when the DB is behind a proxy.
    In our case, the pods were proxied by the envoy sidecar.

    There were other reports of similar issues in the community, which
    used different workarounds: sending periodic dummy queries from the app
    to mimic keepalive, lengthening the proxy idle timeout, or shortening
    the lifetime of db conns.

    This has been reported upstream and fixed in lib/pq v1.9+

      lib/pq#1013

      lib/pq#723
      lib/pq#897
      lib/pq#870

      grafana/grafana#29957
@fho
Author

fho commented May 31, 2022

@agis it was fixed via #1013

rtfb added a commit to rtfb/rtfblog that referenced this issue Oct 5, 2022
Presumably caused by this[1] problem, which was fixed a couple of
years ago.

[1] lib/pq#870