AdminShutdown error code isn't considered transient #5073
Comments
Yes, this is by design. Npgsql has no way of knowing when the database goes down; it could do a roundtrip to check, but if we did that every time we handed out a connection from the pool, that would defeat the purpose of pooling in the first place (because of the reduced performance due to that round trip). Consider that a database can be restarted at any point, so it could very well happen 1 microsecond after the connection is taken from the pool, and you'd still get your connection. In other words, there's nothing Npgsql can really do here - it's up to you to make your application resilient to failure by retrying your commands e.g. via something like Polly (EF has its own resiliency feature). Since #3943 was implemented in 6.0, we do clear the pool when we detect a critical failure that's likely to break all connections. But the first connection (or first few) taken out of the pool still fail - there's no way around that. |
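The retry advice above can be sketched in a language-neutral way. What follows is a minimal Python sketch, not Npgsql, Polly, or EF code: `TransientError`, `run_with_retry`, and `flaky_query` are hypothetical names invented for illustration, and the simulated query fails once, the way a stale pooled connection might after a database restart.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a driver exception flagged as transient."""


def run_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on TransientError, retry with exponential backoff.

    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated usage: the first call fails as a broken pooled connection would,
# the second call succeeds on a fresh connection.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] == 1:
        raise TransientError("terminating connection due to administrator command")
    return "ok"


# sleep is replaced with a no-op here so the example runs instantly.
result = run_with_retry(flaky_query, sleep=lambda seconds: None)
```

The key design point is that only errors classified as transient are retried; anything else propagates immediately, which is why the classification of a given error code matters.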
Thanks. I do have Feel free to close, if you wish. |
@Jack-Edwards it definitely should. If you can put together a minimal repro showing EF's EnableRetryOnFailure, please open an issue with that on EFCore.PG and I'll take a look. |
@roji Adding the error code for the PostgresException works
My understanding here is PostgresExceptions will not be retried unless the policy is explicitly configured with the error code. |
Yeah, that code (Admin Shutdown) isn't in our transient code list (although it is in our "critical failure" list). There are some other related codes which seem like they could make sense, like crash_shutdown, database_dropped and idle_session_timeout. @vonzshik any recollection on why these specific ones aren't in the list, did we just miss them? |
At that point in time we used these errors to determine whether the cluster is dead, and if so, to mark it as unavailable for multiple hosts. For errors like |
Well, it seems from the above report that AdminShutdown may get triggered but the database may come up again very quickly afterwards - if that's true, shouldn't we treat it as transient for the purpose of retries? (Note that I'm not discussing the critical error list for multiple hosts - only the PostgresException.IsTransient property.) |
OK. I think it's fine to consider |
Yep, that's already true of some other error codes. I'll add these - though maybe I'll leave out database_dropped, since it probably indicates the database no longer exists
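For reference, the classification discussed above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not Npgsql's actual implementation or its full transient list: the SQLSTATE values are the ones PostgreSQL documents for these conditions, while `EXTRA_TRANSIENT` and `is_transient` are hypothetical names.

```python
# SQLSTATE values from the PostgreSQL error code table (class 57, operator intervention).
ADMIN_SHUTDOWN = "57P01"
CRASH_SHUTDOWN = "57P02"
DATABASE_DROPPED = "57P04"
IDLE_SESSION_TIMEOUT = "57P05"

# Codes proposed in this thread as retry-worthy. database_dropped is left out
# on purpose: the database probably no longer exists, so retrying cannot help.
EXTRA_TRANSIENT = frozenset({ADMIN_SHUTDOWN, CRASH_SHUTDOWN, IDLE_SESSION_TIMEOUT})


def is_transient(sqlstate, already_transient=frozenset()):
    """Return True if sqlstate should be retried.

    already_transient is a placeholder for whatever codes a driver
    already treats as transient; it is empty in this sketch.
    """
    return sqlstate in EXTRA_TRANSIENT or sqlstate in already_transient
```

With this sketch, `is_transient("57P01")` is true and `is_transient("57P04")` is false, matching the decision to treat a shutdown as recoverable but a dropped database as permanent.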
Restarting a PostgreSQL database does not appear to remove or clear connections from the connection pool. Rather, those idle (and broken) connections are still available to use even after the database has restarted and is ready to accept new connections.
Is this expected behavior for connection pooling?
The steps below are how I'm able to reproduce the issue with my project.
Steps:
docker-compose --profile dev build
docker-compose --profile dev up
https://localhost and bypass the security warning. At least a few API calls will hit the database.
database system is ready to accept connections
terminating connection due to administrator command
If you enable database logging, you will see that "terminating connection due to administrator command" is logged towards the end of a database shutdown. It's never logged after the database comes back up. In this case it seems like the connection knows it's broken, and it knows why it is broken, but it stays in the pool just to throw the exception for a subsequent request.
Example stack trace