Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AdminShutdown error code isn't considered transient #5073

Closed
Jack-Edwards opened this issue May 24, 2023 · 9 comments · Fixed by #5076
Closed

AdminShutdown error code isn't considered transient #5073

Jack-Edwards opened this issue May 24, 2023 · 9 comments · Fixed by #5076
Assignees
Milestone

Comments

@Jack-Edwards
Copy link

Jack-Edwards commented May 24, 2023

Restarting a PostgreSQL database does not appear to remove or clear connections from the connection pool. Rather, those idle (and broken) connections are still available to use even after the database has restarted and is ready to accept new connections.

Is this expected behavior for connection pooling?

The steps below are how I'm able to reproduce the issue with my project.

Steps:

  1. Clone https://github.com/Crypter-File-Transfer/Crypter
  2. docker-compose --profile dev build
  3. docker-compose --profile dev up
  4. Navigate to https://localhost and bypass the security warning. At least a few API calls will hit the database.
  5. Stop the database container. (note: ignore any Hangfire exceptions logged by the API. I sort of expect these to occur since the database is down).
  6. Start the database container. Wait for database system is ready to accept connections
  7. Press F5 in your browser. You should see the client will crash. The API will log an exception that states: terminating connection due to administrator command

If you were to enable database logging, you will see terminating connection due to administrator command is logged towards the end of a database shutdown. It's never logged after the database comes back up. In this case it seems like the connection knows it's broken, it knows why it is broken, but it stays in the pool just to throw the exception for a subsequent request.

Example stack trace

Failed executing DbCommand (19ms) [Parameters=[@__visitorId_2='?' (DbType = Guid), @__username_0='?', @__visitorId_1='?' (DbType = Guid)], CommandType='Text', CommandTimeout='30']
      SELECT u."Username", u0."Alias", u0."About", u2."AllowKeyExchangeRequests", u."Id" = @__visitorId_2 OR (u2."ReceiveMessages" = 4 AND (u2."ReceiveMessages" IS NOT NULL)) OR (u2."Receivessages" = 3 AND (u2."ReceiveMessages" IS NOT NULL)) OR (u2."ReceiveMessages" = 2 AND (u2."ReceiveMessages" IS NOT NULL) AND EXISTS (
          SELECT 1
          FROM crypter."UserContact" AS u4
          WHERE u."Id" = u4."Owner" AND u4."Contact" = @__visitorId_2)), u."Id" = @__visitorId_2 OR (u2."ReceiveFiles" = 4 AND (u2."ReceiveFiles" IS NOT NULL)) OR (u2."ReceiveFiles" = 3 ANDu2."ReceiveFiles" IS NOT NULL)) OR (u2."ReceiveFiles" = 2 AND (u2."ReceiveFiles" IS NOT NULL) AND EXISTS (
          SELECT 1
          FROM crypter."UserContact" AS u5
          WHERE u."Id" = u5."Owner" AND u5."Contact" = @__visitorId_2)), u1."PublicKey", u."EmailVerified"
      FROM crypter."User" AS u
      LEFT JOIN crypter."UserProfile" AS u0 ON u."Id" = u0."Owner"
      LEFT JOIN crypter."UserKeyPair" AS u1 ON u."Id" = u1."Owner"
      LEFT JOIN crypter."UserPrivacySetting" AS u2 ON u."Id" = u2."Owner"
      WHERE u."Username" = @__username_0 AND (u0."Owner" IS NOT NULL) AND (u1."Owner" IS NOT NULL) AND (u2."Owner" IS NOT NULL) AND (u."Id" = @__visitorId_1 OR u2."Visibility" = 3 OR u2."Vibility" = 2 OR (u2."Visibility" = 1 AND EXISTS (
          SELECT 1
          FROM crypter."UserContact" AS u3
          WHERE u."Id" = u3."Owner" AND u3."Contact" = @__visitorId_1)))
      LIMIT 1
fail: Microsoft.EntityFrameworkCore.Query[10100]
      An exception occurred while iterating over the results of a query for context type 'Crypter.Core.DataContext'.
      Npgsql.PostgresException (0x80004005): 57P01: terminating connection due to administrator command
         at Npgsql.Internal.NpgsqlConnector.<ReadMessage>g__ReadMessageLong|233_0(NpgsqlConnector connector, Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotificatio, Boolean isReadingPrependedMessage)
         at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
         at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
         at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
         at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
         at Npgsql.NpgsqlCommand.ExecuteDbDataReaderAsync(CommandBehavior behavior, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(AsyncEnumerator enumerator, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass30_0`2.<<ExecuteAsync>b__0>d.MoveNext()
      --- End of stack trace from previous location ---
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken canclationToken)
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken canclationToken)
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()
        Exception data:
          Severity: FATAL
          SqlState: 57P01
          MessageText: terminating connection due to administrator command
          File: postgres.c
          Line: 3191
          Routine: ProcessInterrupts
@roji roji transferred this issue from npgsql/efcore.pg May 24, 2023
@roji
Copy link
Member

roji commented May 24, 2023

Is this expected behavior for connection pooling?

Yes, this is by design. Npgsql has no way of knowing when the database goes down; it could do a roundtrip to check, but if we did that every time we handed out a connection from the pool, that would defeat the purpose of pooling in the first place (because of the reduced performance due to that round trip).

Consider that a database can be restarted at any point, so it could very well happen 1 microsecond after the connection is taken from the pool, and you'd still get your connection. In other words, there's nothing Npgsql can really do here - it's up to you to make your application resilient to failure by retrying your commands e.g. via something like Polly (EF has its own resiliency feature).

Since #3943 was implemented in 6.0, we do clear the pool when we detect a critical failure that's likely to break all connections. But the first connection (or first few) taken out of the pool still fail - there's no way around that.

@Jack-Edwards
Copy link
Author

Thanks. I do have options.EnableRetryOnFailure() configured with the default strategy, but it doesn't appear to help in this situation. I'll look into this and other options further.

Feel free to close, if you wish.

@roji
Copy link
Member

roji commented May 24, 2023

@Jack-Edwards it definitely should. If you can put together a minimal repro showing EF's EnableRetryOnFailure, please open an issue with that on EFCore.PG and I'll take a look.

@roji roji closed this as not planned Won't fix, can't repro, duplicate, stale May 24, 2023
@Jack-Edwards
Copy link
Author

@roji Adding the error code for the PostgresException works options.EnableRetryOnFailure(5, TimeSpan.FromSeconds(1), new string[] { "57P01" });.

protected override bool ShouldRetryOn(Exception? exception)
    => exception is PostgresException postgresException &&
        _additionalErrorCodes?.Contains(postgresException.SqlState) == true
        || NpgsqlTransientExceptionDetector.ShouldRetryOn(exception);

My understanding here is PostgresExceptions will not be retried unless the policy is explicitly configured with the error code.

@roji
Copy link
Member

roji commented May 24, 2023

Yeah, that code (Admin Shutdown) isn't in our transient code list (although it is in our "critical failure" list. There are some related other codes which seem like they could make sense, like crash_shutdown, database_dropped and idle_session_timeout.

@vonzshik any recollection on why these specific ones aren't in the list, did we just miss them?

@roji roji reopened this May 24, 2023
@vonzshik
Copy link
Contributor

@vonzshik any recollection on why these specific ones aren't in the list, did we just miss them?

At that point of time we used these errors to determine whether the cluster is dead, and if so, to mark it as unavailable for multiple hosts. For errors like database_dropped and idle_session_timeout that's not true as the cluster is quite alive. Also, crash_shutdown is in that list.

@roji
Copy link
Member

roji commented May 24, 2023

Well it seems from the above report that AdminShutdown may get triggered but the database may come up again very quickly afterwards - if that's true, shouldn't we treat as a transient for the purpose of retries?

(note that I'm not discussing the critical error list for multiple hosts - only the PostgresException.IsTransient property)

@vonzshik
Copy link
Contributor

OK. I think it's fine to consider database_dropped, idle_session_timeout and crash_shutdown as transient (just have to keep in mind that since the connections is dead, the state it has is also going to be gone, like temporary tables).
As for PostgresException.IsTransient, it was added with 70f9376 which is quite a bit before I joined the project 🐝. So I can't say whether it was intentional for it to not have database_dropped and the others.

@roji
Copy link
Member

roji commented May 25, 2023

just have to keep in mind that since the connections is dead, the state it has is also going to be gone, like temporary tables

Yep, that's already true of some other error codes.

I'll add these - though maybe I'll leave out database_dropped, since it probably indicates the database no longer exists

  • doesn't sound very transient. We can always add more codes later if users encounter them etc.

roji added a commit to roji/Npgsql that referenced this issue May 25, 2023
@roji roji self-assigned this May 25, 2023
@roji roji added this to the 8.0.0 milestone May 25, 2023
@roji roji changed the title 'terminating connection due to administrator command' after DB is restarted AdminShutdown error code isn't considered transient May 25, 2023
roji added a commit to roji/Npgsql that referenced this issue May 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants