
Database restarts are not handled gracefully #672

Closed
ellulpatrick opened this issue Jan 29, 2020 · 27 comments

Comments

@ellulpatrick
Contributor

ellulpatrick commented Jan 29, 2020

When the database is rebooted gracefully, it sends an ErrorResponse (`E`) message with FATAL: terminating connection due to administrator command (SQLSTATE 57P01).

lib/pq discards such a connection upfront, and the next execution attempts to create a new connection.

pgx returns an error at the next execution, which is a wasted error, because typically the database has already rebooted, or a hot replica is already available.

How can we make pgx realise upfront that something is wrong with the connection, so that we don't waste a usage attempt on an error when we didn't need to?

@ellulpatrick ellulpatrick changed the title Database Restart is not handled gracefully Database restarts are not handled gracefully Jan 29, 2020
@jackc
Owner

jackc commented Feb 1, 2020

The trick is that this requires a read before every query to see if an error has arrived.

Unfortunately, Go does not provide a good way to do a non-blocking read. Apparently, it can be done with sufficient fiddling with the syscall and RawConn. However, that could be a problem with TLS connections.

Another option would be to do a read with a very short read deadline. But that is impractical: a read deadline long enough to actually receive a possible error would also be long enough to add significant overhead to every query.

The last option is to continuously read in a goroutine and send the messages through a channel. But that incurred a significant amount of overhead -- at least when I looked into it several years ago. It also adds a fair amount of complexity -- the copy protocol requires something like this and the code is really messy (https://github.com/jackc/pgconn/blob/c9abb86f21f0b89b909e9d112829e21daf3c06d8/pgconn.go#L1039).

Perhaps that code could be generalized and optionally used for those who want this eager reading functionality...

Beyond that, this type of logic should go in the connection pool rather than the individual connections. Otherwise, session state could silently be lost on an automatic reconnect.

@ellulpatrick
Contributor Author

Thanks @jackc for your detailed feedback. We'll have a good think about it. Best regards.

@jackc
Owner

jackc commented Feb 1, 2020

Just for kicks I hacked together something that continuously read in a goroutine and sent the messages through a channel. Unfortunately, performance is still unacceptable: a trivial select took 22% longer with that change.

Also, after reading your original message again I realized I may have misread it before.

lib/pq discards such connection upfront and the next execution attempts to create a new connection.

You mean the lib/pq Go library, not the libpq C library. That actually makes more sense because AFAIK C libpq is connection oriented and it would be very surprising for it to transparently establish a different connection.

That also confirms my assertion that this logic should go in the connection pool. A database/sql driver can signal to the pool that the connection is bad by returning driver.ErrBadConn. database/sql will automatically retry the query on a different connection in that case.

After a quick glance at lib/pq it appears that it automatically maps any fatal error to a driver.ErrBadConn. However, this behavior appears to be error prone.

...

Yup. It's error prone alright. It can cause your queries to silently be executed twice. I just reported the bug to them in lib/pq#939.

...

So... as far as what could be done to handle this correctly...

There isn't a one size fits all solution that I can see.

The "perfect" solution is to try to read before sending any queries. But that is almost certainly unacceptable from a performance standpoint.

I suppose you could checkout one connection and dedicate it to doing nothing but reading from the database. If it ever failed that would mean that a network failure or database restart had occurred. In that case you would want to close your database connection pool and create a new one. (doing that in a non-racy manner would entail a few gymnastics...)

@ellulpatrick
Contributor Author

Hi @jackc. Yep, thanks for the detailed analysis. We had arrived at some of those conclusions already. It's really frustrating, sadly.

Automatic retry logic for immutable queries would be nice. The application could tell the library whether the query is safe to retry.

At a deeper level, it's such a pity that Go does not have a non-blocking read to check the connection status. The operating system is aware that the connection is closed, yet Go can't use that information preemptively. In fact, writes still succeed, and then the read fails, which makes it impossible to tell whether the query was executed or not.
I hope one day Go addresses this problem.
Thanks again.

@vtolstov

vtolstov commented Feb 4, 2020

So for now, what is the preferred way to handle a database restart? Ping the database in the background (https://golang.org/pkg/database/sql/#DB.PingContext in the case of database/sql), or something different?
What happens when I have a pool of 20 connections and the database goes down and comes back up?

@ellulpatrick
Contributor Author

ellulpatrick commented Feb 4, 2020

As sad as this sounds, there doesn't seem to be a perfect solution at the moment. lib/pq handles it, but risks double execution.
Your best bet is to set the max idle time for a connection so at least it won't stay stale for a long time.

@ellulpatrick
Contributor Author

The eager read that @jackc described would probably be the most bulletproof way to do this, and it would also cover network disconnections that the operating system knows about thanks to TCP keepalives etc. It is not trivial to implement, though.

@jackc
Owner

jackc commented Feb 5, 2020

Your best bet is to set the max idle time for a connection so at least it won't stay stale for a long time.

This is the simplest approach. Any busy connections are going to encounter an error -- no way around that. So all we can easily do is minimize the idle connections.

The other approach is this:

I suppose you could checkout one connection and dedicate it to doing nothing but reading from the database. If it ever failed that would mean that a network failure or database restart had occurred. In that case you would want to close your database connection pool and create a new one. (doing that in a non-racy manner would entail a few gymnastics...)

Those gymnastics would likely involve atomic.StorePointer and atomic.LoadPointer.
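Those gymnastics might be sketched like this (using the generic sync/atomic.Pointer, which wraps the same StorePointer/LoadPointer mechanics; Pool here is a stand-in for a real pool type such as *pgxpool.Pool):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Pool is a stand-in for a real connection pool (e.g. *pgxpool.Pool).
type Pool struct{ id int }

func (p *Pool) Close() { /* a real pool would release all connections here */ }

var current atomic.Pointer[Pool]

// replacePool atomically swaps in a fresh pool and closes the old one.
// Concurrent readers calling current.Load() always see either the old
// pool or the new one, never a torn value.
func replacePool(fresh *Pool) {
	if old := current.Swap(fresh); old != nil {
		old.Close()
	}
}

func main() {
	replacePool(&Pool{id: 1})
	fmt.Println("pool:", current.Load().id)

	// The watchdog connection failed: build and swap in a new pool.
	replacePool(&Pool{id: 2})
	fmt.Println("pool:", current.Load().id)
}
```

One caveat the sketch glosses over: queries already running against the old pool will still fail; the swap only protects new checkouts.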

@ellulpatrick
Contributor Author

@jackc, I did have another idea. Not sure if it is feasible, but I will run it past you anyway.

The concept is that when a connection is released back to idle, a goroutine is launched with a blocking read on that connection until it is needed again. That way, the goroutine can handle any unsolicited messages from the server, and even an EOF from the OS. Then, when the connection is needed again, the blocking read is cancelled and the connection is handed back to the application.

In my head this doesn't sound like it would be too invasive -- or is it too far outside the box?

Regards, and thanks again for your contributions.

@jackc
Owner

jackc commented Feb 5, 2020

It's a reasonable approach. There's already a method pgconn.PgConn.ReceiveMessage that would be ideal for the blocking read. If you want, experiment down this path and see where it ends up. But I wouldn't suggest spending a lot of time on it yet. I plan on taking another shot at the full time background reader goroutine approach. I thought of a technique that should reduce the overhead, but I'm not sure if it will be enough.

I also want to test some more real world performance cases. The ~20% loss was with a tiny select on a Unix socket with an uncancellable context. But the actual overhead was measured in microseconds. It may be that in a more production-like environment of connecting over TCP with TLS and a cancellable context that the overhead is insignificant.

@ellulpatrick
Contributor Author

Thanks @jackc. I'll have a look to see how trivial it is.

Regarding background reading, what you say makes sense, and we look forward to hearing about your attempt at it.

As you mentioned earlier, the application will always need to be able to gracefully handle failures when talking to the database; there's nothing the library can do to totally absolve the application developer from handling errors. However, in an HA production system we try to minimise errors and downtime as much as possible. Most failures occur because network components have issues, reroute, or restart. They are typically very temporary, and especially under low load, such extra care from the library could hide the impact in some cases.

@vtolstov

vtolstov commented Feb 5, 2020

Can we have some handler that fires when the connection is lost? So the user app can handle it via a reconnect, or get a new connection to fall back on?

@jackc
Owner

jackc commented Feb 5, 2020

Can we have some handler that fires when the connection is lost? So the user app can handle it via a reconnect, or get a new connection to fall back on?

The problem is there is no way to know the connection has been lost without reading from the underlying net.Conn.

@vtolstov

vtolstov commented Feb 6, 2020

For my use case it's OK to get an error on the next query, but without a callback about the lost connection I need to write a wrapper that checks the error code and reconnects in every place that uses this conn. If I had some callback I could swap the sql.DB pointer under a mutex and continue running other queries.

@bmon

bmon commented Feb 6, 2020

The problem is there is no way to know the connection has been lost without reading from the underlying net.Conn.

Would it be possible to use poll or epoll to identify whether the socket has been closed?
http://man7.org/linux/man-pages/man2/poll.2.html

This idea would introduce compatibility concerns in non-Unix environments, though.

@vtolstov

vtolstov commented Feb 6, 2020

We'd need to peek at data in the socket, but as far as I know Go does not provide the ability to peek at socket data.

@redbaron
Contributor

@ellulpatrick ,

IMHO your suggestion of an idle-only background goroutine is preferable to an always-on background reader, simply because value is only added when the failure is detected while idle; when the connection is actively in use, the error should be propagated to the application code anyway.

@malolans

I read this blog post from GitHub on how they solved this for MySQL. Can that approach be used here?

https://github.blog/2020-05-20-three-bugs-in-the-go-mysql-driver/

It's the second issue that was fixed.

@jackc
Owner

jackc commented May 21, 2020

It might, though it would not be a trivial change. The biggest issue is how many layers this concept has to cross. At the moment, only the connection process knows whether TCP or Unix domain sockets are being used and whether TLS is in use. Once the connection is established, everything goes through the net.Conn interface. In addition, all of pgx / pgconn / pgproto3 rely on the chunkreader abstraction to handle reading entire PostgreSQL messages at a time. chunkreader enables the higher levels to never have to deal with partial messages.

This change would potentially cross all these layers and require a lot of additional information sharing / coupling between them.

Doable, and might even be the right approach -- but definitely messy.

@GeertJohan
Contributor

GeertJohan commented May 26, 2020

After switching from lib/pq to pgx/v4/stdlib, this is the only drawback: No automatic reconnect.

In my development process, the database is often recreated. To do that, all open connections must be terminated so the database can be dropped and created again. The first query that my Go process then executes fails with FATAL: terminating connection due to administrator command (SQLSTATE 57P01). This wasn't the case with lib/pq, and it's quite annoying in the development process.
I'm not sure whether this is the same problem as a broken TCP connection, though?

@Hellysonrp

I got this error today while calling Ping().

After a quick glance at lib/pq it appears that it automatically maps any fatal error to a driver.ErrBadConn. However, this behavior appears to be error prone.

I don't see a problem if this behavior is implemented only in the Ping() function.
I'm already calling Ping() to know if my connection is alive.
As stated in the DB.Ping definition:

Ping verifies a connection to the database is still alive, establishing a connection if necessary.

In my opinion, if any kind of error is thrown on Ping(), the connection must be treated as broken. Another connection will then be created if possible.

@Hellysonrp

@jackc What do you think about my suggestion? It'd really be helpful to me if Ping() treated the connection as broken on any error.

@jackc
Owner

jackc commented Dec 3, 2020

What exactly would change? The only error I can see occurring from ping would be a network failure or context cancellation -- both of which are already fatal.

@Hellysonrp

Hellysonrp commented Dec 3, 2020

As stated in this comment on database/sql/driver/driver.go, returning driver.ErrBadConn from Ping() should remove the connection from the pool.
Ping is returning the fatal error mentioned in this issue. This fatal error is a clear example of a connection that is broken on the client side and can be reopened. The connection is only being treated as broken after another try:

  • First time trying to use the connection after a successful restart: this fatal error
  • Second time: driver: bad connection
  • Third time: a new connection is created and the request succeeds

Returning driver.ErrBadConn earlier would make the creation of a new healthy connection happen earlier.

I don't know if there is any drawback in this approach. I'm not used to the way golang treats database connections.
Maybe I'm wrong, I honestly don't know.

jackc added a commit that referenced this issue Dec 4, 2020
@jackc
Owner

jackc commented Dec 4, 2020

@Hellysonrp I would have expected the database/sql connection pool to already handle this correctly as is. But it was a simple enough change to explicitly close a connection on ping failure and report the error as driver.ErrBadConn. Changed as requested in 82bac82.

@jackc
Owner

jackc commented Jun 25, 2022

Well, I finally took another shot at solving this. v5.0.0-alpha.4 now has a *PgConn.CheckConn() method. This executes a non-blocking read to check the connection liveness. As discussed above, non-blocking I/O, especially in combination with TLS, is much more complicated and difficult than it should be. But it seems to be working, at the expense of significant internal complexity. I have walled off the nasty complexity in internal/nbconn.

pgxpool and database/sql now automatically call CheckConn() when the connection has been idle for more than 1 second.

@ellulpatrick
Contributor Author

🎩 👏

dermetfan added a commit to input-output-hk/cicero that referenced this issue Jul 20, 2022