AdminShutdown error code isn't considered transient #5073

Jack-Edwards · 2023-05-24T16:46:42Z

Restarting a PostgreSQL database does not appear to remove or clear connections from the connection pool. Rather, those idle (and broken) connections are still available to use even after the database has restarted and is ready to accept new connections.

Is this expected behavior for connection pooling?

The steps below are how I'm able to reproduce the issue with my project.

Steps:

Clone https://github.com/Crypter-File-Transfer/Crypter
docker-compose --profile dev build
docker-compose --profile dev up
Navigate to https://localhost and bypass the security warning. At least a few API calls will hit the database.
Stop the database container. (note: ignore any Hangfire exceptions logged by the API. I sort of expect these to occur since the database is down).
Start the database container. Wait for database system is ready to accept connections
Press F5 in your browser. You should see the client will crash. The API will log an exception that states: terminating connection due to administrator command

If you were to enable database logging, you will see terminating connection due to administrator command is logged towards the end of a database shutdown. It's never logged after the database comes back up. In this case it seems like the connection knows it's broken, it knows why it is broken, but it stays in the pool just to throw the exception for a subsequent request.

Example stack trace

Failed executing DbCommand (19ms) [Parameters=[@__visitorId_2='?' (DbType = Guid), @__username_0='?', @__visitorId_1='?' (DbType = Guid)], CommandType='Text', CommandTimeout='30']
      SELECT u."Username", u0."Alias", u0."About", u2."AllowKeyExchangeRequests", u."Id" = @__visitorId_2 OR (u2."ReceiveMessages" = 4 AND (u2."ReceiveMessages" IS NOT NULL)) OR (u2."Receivessages" = 3 AND (u2."ReceiveMessages" IS NOT NULL)) OR (u2."ReceiveMessages" = 2 AND (u2."ReceiveMessages" IS NOT NULL) AND EXISTS (
          SELECT 1
          FROM crypter."UserContact" AS u4
          WHERE u."Id" = u4."Owner" AND u4."Contact" = @__visitorId_2)), u."Id" = @__visitorId_2 OR (u2."ReceiveFiles" = 4 AND (u2."ReceiveFiles" IS NOT NULL)) OR (u2."ReceiveFiles" = 3 ANDu2."ReceiveFiles" IS NOT NULL)) OR (u2."ReceiveFiles" = 2 AND (u2."ReceiveFiles" IS NOT NULL) AND EXISTS (
          SELECT 1
          FROM crypter."UserContact" AS u5
          WHERE u."Id" = u5."Owner" AND u5."Contact" = @__visitorId_2)), u1."PublicKey", u."EmailVerified"
      FROM crypter."User" AS u
      LEFT JOIN crypter."UserProfile" AS u0 ON u."Id" = u0."Owner"
      LEFT JOIN crypter."UserKeyPair" AS u1 ON u."Id" = u1."Owner"
      LEFT JOIN crypter."UserPrivacySetting" AS u2 ON u."Id" = u2."Owner"
      WHERE u."Username" = @__username_0 AND (u0."Owner" IS NOT NULL) AND (u1."Owner" IS NOT NULL) AND (u2."Owner" IS NOT NULL) AND (u."Id" = @__visitorId_1 OR u2."Visibility" = 3 OR u2."Vibility" = 2 OR (u2."Visibility" = 1 AND EXISTS (
          SELECT 1
          FROM crypter."UserContact" AS u3
          WHERE u."Id" = u3."Owner" AND u3."Contact" = @__visitorId_1)))
      LIMIT 1
fail: Microsoft.EntityFrameworkCore.Query[10100]
      An exception occurred while iterating over the results of a query for context type 'Crypter.Core.DataContext'.
      Npgsql.PostgresException (0x80004005): 57P01: terminating connection due to administrator command
         at Npgsql.Internal.NpgsqlConnector.<ReadMessage>g__ReadMessageLong|233_0(NpgsqlConnector connector, Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotificatio, Boolean isReadingPrependedMessage)
         at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
         at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
         at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
         at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
         at Npgsql.NpgsqlCommand.ExecuteDbDataReaderAsync(CommandBehavior behavior, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(AsyncEnumerator enumerator, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass30_0`2.<<ExecuteAsync>b__0>d.MoveNext()
      --- End of stack trace from previous location ---
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken canclationToken)
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken canclationToken)
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()
        Exception data:
          Severity: FATAL
          SqlState: 57P01
          MessageText: terminating connection due to administrator command
          File: postgres.c
          Line: 3191
          Routine: ProcessInterrupts

The text was updated successfully, but these errors were encountered:

roji · 2023-05-24T17:22:54Z

Is this expected behavior for connection pooling?

Yes, this is by design. Npgsql has no way of knowing when the database goes down; it could do a roundtrip to check, but if we did that every time we handed out a connection from the pool, that would defeat the purpose of pooling in the first place (because of the reduced performance due to that round trip).

Consider that a database can be restarted at any point, so it could very well happen 1 microsecond after the connection is taken from the pool, and you'd still get your connection. In other words, there's nothing Npgsql can really do here - it's up to you to make your application resilient to failure by retrying your commands e.g. via something like Polly (EF has its own resiliency feature).

Since #3943 was implemented in 6.0, we do clear the pool when we detect a critical failure that's likely to break all connections. But the first connection (or first few) taken out of the pool still fail - there's no way around that.

Jack-Edwards · 2023-05-24T17:27:06Z

Thanks. I do have options.EnableRetryOnFailure() configured with the default strategy, but it doesn't appear to help in this situation. I'll look into this and other options further.

Feel free to close, if you wish.

roji · 2023-05-24T17:30:23Z

@Jack-Edwards it definitely should. If you can put together a minimal repro showing EF's EnableRetryOnFailure, please open an issue with that on EFCore.PG and I'll take a look.

Jack-Edwards · 2023-05-24T18:18:44Z

@roji Adding the error code for the PostgresException works options.EnableRetryOnFailure(5, TimeSpan.FromSeconds(1), new string[] { "57P01" });.

protected override bool ShouldRetryOn(Exception? exception)
    => exception is PostgresException postgresException &&
        _additionalErrorCodes?.Contains(postgresException.SqlState) == true
        || NpgsqlTransientExceptionDetector.ShouldRetryOn(exception);

My understanding here is PostgresExceptions will not be retried unless the policy is explicitly configured with the error code.

roji · 2023-05-24T19:58:49Z

Yeah, that code (Admin Shutdown) isn't in our transient code list (although it is in our "critical failure" list. There are some related other codes which seem like they could make sense, like crash_shutdown, database_dropped and idle_session_timeout.

@vonzshik any recollection on why these specific ones aren't in the list, did we just miss them?

vonzshik · 2023-05-24T20:29:34Z

@vonzshik any recollection on why these specific ones aren't in the list, did we just miss them?

At that point of time we used these errors to determine whether the cluster is dead, and if so, to mark it as unavailable for multiple hosts. For errors like database_dropped and idle_session_timeout that's not true as the cluster is quite alive. Also, crash_shutdown is in that list.

roji · 2023-05-24T21:07:57Z

Well it seems from the above report that AdminShutdown may get triggered but the database may come up again very quickly afterwards - if that's true, shouldn't we treat as a transient for the purpose of retries?

(note that I'm not discussing the critical error list for multiple hosts - only the PostgresException.IsTransient property)

vonzshik · 2023-05-24T22:27:11Z

OK. I think it's fine to consider database_dropped, idle_session_timeout and crash_shutdown as transient (just have to keep in mind that since the connections is dead, the state it has is also going to be gone, like temporary tables).
As for PostgresException.IsTransient, it was added with 70f9376 which is quite a bit before I joined the project 🐝. So I can't say whether it was intentional for it to not have database_dropped and the others.

roji · 2023-05-25T09:19:55Z

just have to keep in mind that since the connections is dead, the state it has is also going to be gone, like temporary tables

Yep, that's already true of some other error codes.

I'll add these - though maybe I'll leave out database_dropped, since it probably indicates the database no longer exists

doesn't sound very transient. We can always add more codes later if users encounter them etc.

Closes npgsql#5073

Closes #5073

roji mentioned this issue May 24, 2023

Detailed pooling page npgsql/doc#132

Open

roji transferred this issue from npgsql/efcore.pg May 24, 2023

roji closed this as not planned Won't fix, can't repro, duplicate, stale May 24, 2023

roji reopened this May 24, 2023

roji added a commit to roji/Npgsql that referenced this issue May 25, 2023

Add some more error codes to the transient exception list

38d5a24

Closes npgsql#5073

roji mentioned this issue May 25, 2023

Add some more error codes to the transient exception list #5076

Merged

roji self-assigned this May 25, 2023

roji added the enhancement label May 25, 2023

roji added this to the 8.0.0 milestone May 25, 2023

roji changed the title ~~'terminating connection due to administrator command' after DB is restarted~~ AdminShutdown error code isn't considered transient May 25, 2023

roji added a commit to roji/Npgsql that referenced this issue May 25, 2023

Add some more error codes to the transient exception list

4f87f83

Closes npgsql#5073

roji closed this as completed in #5076 May 25, 2023

roji added a commit that referenced this issue May 25, 2023

Add some more error codes to the transient exception list (#5076)

05b4609

Closes #5073

Jack-Edwards mentioned this issue May 28, 2023

Retry database query in case of AdminShutdown Crypter-File-Transfer/Crypter#528

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AdminShutdown error code isn't considered transient #5073

AdminShutdown error code isn't considered transient #5073

Jack-Edwards commented May 24, 2023 •

edited

Loading

roji commented May 24, 2023

Jack-Edwards commented May 24, 2023

roji commented May 24, 2023

Jack-Edwards commented May 24, 2023

roji commented May 24, 2023

vonzshik commented May 24, 2023

roji commented May 24, 2023

vonzshik commented May 24, 2023

roji commented May 25, 2023

AdminShutdown error code isn't considered transient #5073

AdminShutdown error code isn't considered transient #5073

Comments

Jack-Edwards commented May 24, 2023 • edited Loading

Example stack trace

roji commented May 24, 2023

Jack-Edwards commented May 24, 2023

roji commented May 24, 2023

Jack-Edwards commented May 24, 2023

roji commented May 24, 2023

vonzshik commented May 24, 2023

roji commented May 24, 2023

vonzshik commented May 24, 2023

roji commented May 25, 2023

Jack-Edwards commented May 24, 2023 •

edited

Loading