
Database restarts are not handled gracefully #672

Closed
ellulpatrick opened this issue Jan 29, 2020 · 27 comments

Comments

@ellulpatrick
Contributor

ellulpatrick commented Jan 29, 2020

When the database is rebooted gracefully, it sends an ErrorResponse (`E`) message with FATAL: terminating connection due to administrator command (SQLSTATE 57P01).

lib/pq discards such a connection upfront, and the next execution attempts to create a new connection.

pgx returns an error at the next execution, which is a wasted error, because typically the database has already rebooted, or a hot replica is already available.

How can we make pgx realise upfront that something is wrong with the connection, so that we don't waste a usage attempt on an error when we didn't need to?

@ellulpatrick ellulpatrick changed the title Database Restart is not handled gracefully Database restarts are not handled gracefully Jan 29, 2020
@jackc
Owner

jackc commented Feb 1, 2020

The trick is that this requires a read before every query to see if an error has arrived.

Unfortunately, Go does not provide a good way to do a non-blocking read. Apparently, it can be done with sufficient fiddling with the syscall and RawConn. However, that could be a problem with TLS connections.

Another option would be to do a read with a very short read deadline. But that is impractical: a read deadline long enough to actually receive a possible error would also be long enough to add significant overhead to every query.

The last option is to continuously read in a goroutine and send the messages through a channel. But that incurred a significant amount of overhead -- at least when I looked into it several years ago. It also adds a fair amount of complexity -- the copy protocol requires something like this and the code is really messy (https://github.com/jackc/pgconn/blob/c9abb86f21f0b89b909e9d112829e21daf3c06d8/pgconn.go#L1039).

Perhaps that code could be generalized and optionally used for those who want this eager reading functionality...

Beyond that, this type of logic should go in the connection pool rather than the individual connections. Otherwise, session state could silently be lost on an automatic reconnect.

@ellulpatrick
Contributor Author

Thanks @jackc for your detailed feedback. We'll have a good think about it. Best regards.

@jackc
Owner

jackc commented Feb 1, 2020

Just for kicks I hacked together something that continuously read in a goroutine and sent the messages through a channel. Unfortunately, performance is still unacceptable: a trivial select took 22% longer with that change.

Also, after reading your original message again I realized I may have misread it before.

lib/pq discards such connection upfront and the next execution attempts to create a new connection.

You mean the lib/pq Go library, not the libpq C library. That actually makes more sense because AFAIK C libpq is connection oriented and it would be very surprising for it to transparently establish a different connection.

That also confirms my assertion that this logic should go in the connection pool. A database/sql driver can signal to the pool that the connection is bad by returning driver.ErrBadConn. database/sql will automatically retry the query on a different connection in that case.

After a quick glance at lib/pq it appears that it automatically maps any fatal error to a driver.ErrBadConn. However, this behavior appears to be error prone.

...

Yup. It's error prone alright. It can cause your queries to silently be executed twice. I just reported the bug to them in lib/pq#939.

...

So... as far as what could be done to handle this correctly...

There isn't a one size fits all solution that I can see.

The "perfect" solution is to try to read before sending any queries. But that is almost certainly unacceptable from a performance standpoint.

I suppose you could checkout one connection and dedicate it to doing nothing but reading from the database. If it ever failed that would mean that a network failure or database restart had occurred. In that case you would want to close your database connection pool and create a new one. (doing that in a non-racy manner would entail a few gymnastics...)

@ellulpatrick
Contributor Author

Hi @jackc. Yep, thanks for the detailed analysis. We had arrived at some of those conclusions already. It's really frustrating, sadly.

Automatic retry logic for immutable queries would be nice. The application could tell the library whether the query is safe to retry.

At a deeper level, it's such a pity that Go does not have a non-blocking read to check the connection status. The operating system is aware that the connection is closed, yet Go can't use that information preemptively. In fact, writes still succeed, and then the read fails, which makes it impossible to tell whether the query was executed or not.
I hope one day Go addresses this problem.
Thanks again.

@vtolstov

vtolstov commented Feb 4, 2020

So for now, what is the preferred way to handle a database restart? Ping the database in the background (https://golang.org/pkg/database/sql/#DB.PingContext in the case of database/sql), or something different?
What happens when I have a pool of 20 connections and the database goes down and comes back up?

@ellulpatrick
Contributor Author

ellulpatrick commented Feb 4, 2020

As sad as this sounds, there doesn't seem to be a perfect solution at the moment. lib/pq handles it, but risks double execution.
Your best bet is to set the max idle time for a connection so at least it won't stay stale for a long time.

@ellulpatrick
Contributor Author

The eager read that @jackc described would probably be the most bulletproof way to do this, and it would also cover network disconnections that the operating system knows about thanks to TCP keepalives etc. It is not trivial to implement, though.

@jackc
Owner

jackc commented Feb 5, 2020

Your best bet is to set the max idle time for a connection so at least it won't stay stale for a long time.

This is the simplest approach. Any busy connections are going to encounter an error -- no way around that. So all we can easily do is minimize the idle connections.

The other approach is this:

I suppose you could checkout one connection and dedicate it to doing nothing but reading from the database. If it ever failed that would mean that a network failure or database restart had occurred. In that case you would want to close your database connection pool and create a new one. (doing that in a non-racy manner would entail a few gymnastics...)

Those gymnastics would likely involve atomic.StorePointer and atomic.LoadPointer.
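Those gymnastics might be sketched like this (using the generic sync/atomic.Pointer, which wraps the same StorePointer/LoadPointer mechanics; Pool here is a stand-in for a real pool type such as *pgxpool.Pool):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Pool is a stand-in for a real connection pool (e.g. *pgxpool.Pool).
type Pool struct{ id int }

func (p *Pool) Close() { /* a real pool would release all connections here */ }

var current atomic.Pointer[Pool]

// replacePool atomically swaps in a fresh pool and closes the old one.
// Concurrent readers calling current.Load() always see either the old
// pool or the new one, never a torn value.
func replacePool(fresh *Pool) {
	if old := current.Swap(fresh); old != nil {
		old.Close()
	}
}

func main() {
	replacePool(&Pool{id: 1})
	fmt.Println("pool:", current.Load().id)

	// The watchdog connection failed: build and swap in a new pool.
	replacePool(&Pool{id: 2})
	fmt.Println("pool:", current.Load().id)
}
```

One caveat the sketch glosses over: queries already running against the old pool will still fail; the swap only protects new checkouts.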

@ellulpatrick
Contributor Author

@jackc, I did have another idea. Not sure if it is feasible, but I will run it past you anyway.

The concept is that when a connection is released back to idle, a goroutine is launched with a blocking read on that connection until it is needed again. That way, the goroutine can handle any unsolicited messages from the server, and even an EOF from the OS. Then, when the connection is needed again, the blocking read is cancelled and the connection is handed back to the application.

In my head this doesn't sound like it would be too invasive -- or is it too far outside the box?

Regards, and thanks again for your contributions.

@jackc
Owner

jackc commented Feb 5, 2020

It's a reasonable approach. There's already a method pgconn.PgConn.ReceiveMessage that would be ideal for the blocking read. If you want, experiment down this path and see where it ends up. But I wouldn't suggest spending a lot of time on it yet. I plan on taking another shot at the full time background reader goroutine approach. I thought of a technique that should reduce the overhead, but I'm not sure if it will be enough.

I also want to test some more real world performance cases. The ~20% loss was with a tiny select on a Unix socket with an uncancellable context. But the actual overhead was measured in microseconds. It may be that in a more production-like environment of connecting over TCP with TLS and a cancellable context that the overhead is insignificant.

@ellulpatrick
Contributor Author

Thanks @jackc. I'll have a look to see how trivial it is.

Regarding background reading, what you say makes sense, and we look forward to hearing about your attempt at it.

As you mentioned earlier, the application will always need to be able to gracefully handle failures when talking to the database; there's nothing the library can do to totally absolve the application developer from handling errors. However, in an HA production system we try to minimise errors and downtime as much as possible. Most failures occur because network components have issues, reroute, or restart. They are typically very temporary, and especially under low load, such extra care from the library could hide the impact in some cases.

@vtolstov

vtolstov commented Feb 5, 2020

Can we have some handler that fires when the connection is lost? So the user app can handle it via a reconnect, or get a new connection to fall back on?

@jackc
Owner

jackc commented Feb 5, 2020

Can we have some handler that fires when the connection is lost? So the user app can handle it via a reconnect, or get a new connection to fall back on?

The problem is there is no way to know the connection has been lost without reading from the underlying net.Conn.

@vtolstov

vtolstov commented Feb 6, 2020

For my use case it's OK to get an error on the next query, but without a callback about the lost connection I need to write a wrapper that checks the error code and reconnects in every place that uses this conn. If I had some callback I could swap the sql.DB pointer under a mutex and continue running other queries.

@bmon

bmon commented Feb 6, 2020

The problem is there is no way to know the connection has been lost without reading from the underlying net.Conn.

Would it be possible to use poll or epoll to identify whether the socket has been closed?
http://man7.org/linux/man-pages/man2/poll.2.html

This idea would introduce compatibility concerns in non-Unix environments, though.

@vtolstov

vtolstov commented Feb 6, 2020

We'd need to peek at data in the socket, but as far as I know Go does not provide the ability to peek at socket data.

@redbaron
Contributor

@ellulpatrick ,

IMHO your suggestion of an idle-only background goroutine is preferable to an always-on background reader, simply because value is only added when the failure is detected while idle; when the connection is actively in use, the error should be propagated to the application code anyway.

@malolans

I read this blog post from GitHub on how they solved this for MySQL. Can that approach be used here?

https://github.blog/2020-05-20-three-bugs-in-the-go-mysql-driver/

It's the second issue that was fixed.

@jackc
Owner

jackc commented May 21, 2020

It might, though it would not be a trivial change. The biggest issue is how many layers this concept has to cross. At the moment, only the connection process knows whether TCP or Unix domain sockets are being used and whether TLS is in use. Once the connection is established, everything goes through the net.Conn interface. In addition, all of pgx / pgconn / pgproto3 rely on the chunkreader abstraction to handle reading entire PostgreSQL messages at a time. chunkreader enables the higher levels to never have to deal with partial messages.

This change would potentially cross all these layers and require a lot of additional information sharing / coupling between them.

Doable, and might even be the right approach -- but definitely messy.

@GeertJohan
Contributor

GeertJohan commented May 26, 2020

After switching from lib/pq to pgx/v4/stdlib, this is the only drawback: No automatic reconnect.

In my development process, the database is often recreated. To do that, all open connections must be terminated so the database can be dropped and created again. The first query that my Go process then executes fails with FATAL: terminating connection due to administrator command (SQLSTATE 57P01). This wasn't the case with lib/pq, and it's quite annoying in the development process.
I'm not sure whether this is the same problem as a broken TCP connection, though?

@Hellysonrp

I got this error today while calling Ping().

After a quick glance at lib/pq it appears that it automatically maps any fatal error to a driver.ErrBadConn. However, this behavior appears to be error prone.

I don't see a problem if this behavior is implemented only in the Ping() function.
I'm already calling Ping() to know if my connection is alive.
As stated in the DB.Ping definition:

Ping verifies a connection to the database is still alive, establishing a connection if necessary.

In my opinion, if any kind of error is thrown on Ping(), the connection must be treated as broken. Another connection will then be created if possible.

@Hellysonrp

@jackc What do you think about my suggestion? It'd really be helpful to me if Ping() treated the connection as broken on any error.

@jackc
Owner

jackc commented Dec 3, 2020

What exactly would change? The only error I can see occurring from ping would be a network failure or context cancellation -- both of which are already fatal.

@Hellysonrp

Hellysonrp commented Dec 3, 2020

As stated in this comment on database/sql/driver/driver.go, returning driver.ErrBadConn from Ping() should remove the connection from the pool.
Ping is returning the fatal error mentioned in this issue. This fatal error is a clear example of a connection that is broken on the client side and can be reopened. The connection is only being treated as broken after another try:

  • First time trying to use the connection after a successful restart: this fatal error
  • Second time: driver: bad connection
  • Third time: a new connection is created and the request succeeds

Returning driver.ErrBadConn earlier would make the creation of a new healthy connection happen earlier.

I don't know if there is any drawback in this approach. I'm not used to the way golang treats database connections.
Maybe I'm wrong, I honestly don't know.

jackc added a commit that referenced this issue Dec 4, 2020
@jackc
Owner

jackc commented Dec 4, 2020

@Hellysonrp I would have expected the database/sql connection pool to already handle this correctly as is. But it was a simple enough change to explicitly close a connection on ping failure and report the error as driver.ErrBadConn. Changed as requested in 82bac82.

@jackc
Owner

jackc commented Jun 25, 2022

Well, I finally took another shot at solving this. v5.0.0-alpha.4 now has a *PgConn.CheckConn() method. This executes a non-blocking read to check the connection liveness. As discussed above, non-blocking I/O, especially in combination with TLS, is much more complicated and difficult than it should be. But it seems to be working, at the expense of significant internal complexity. I have walled off the nasty complexity in internal/nbconn.

pgxpool and database/sql now automatically call CheckConn() when the connection has been idle for more than 1 second.

@ellulpatrick
Contributor Author

🎩 👏

dermetfan added a commit to input-output-hk/cicero that referenced this issue Jul 20, 2022