dangling transactions after 'failed to deallocate cached statement' errors #2100

Valodim · 2024-08-01T09:04:38Z

Hey there. First of all, thank you for all your work on this library 👍

Describe the bug

We are using pgx in a fairly typical backend application, that serves requests to clients via grpc. We use pgx as a driver via sqlx. Starting with version 5.5.4, we started to notice some problems:

Our production logs contained lines "failed to deallocate cached statement: conn closed". Roughly it was a couple hundred times a day, for a couple hundred thousand requests a day where each request is a transaction.
This happened predominantly (but not exclusively) in streaming requests. Those are ended when cancelled by the user, so we assume it's some race condition around cancelled context behavior.
The logged error happens from a call to tx.Rollback(), which is called from a defer that is set up with the transaction and run at the end of the cancelled rpc method.
We also started to see dangling transactions left in an idle in transaction state, indefinitely (hours), causing issues particularly when they held on to locks. While this is not definitely related to the log entries, it started happening at the same time and the log entries indicate an error during transaction cleanup.

It seems like something goes wrong around transaction cleanup, but we can't tell what's going on at all.

To Reproduce

Unfortunately, we were unable to reproduce this bug outside our production environment, which also isn't source available. :( Rolling back to 5.5.3 "fixed" the issue for us, so it might be caused by 046f497. We tried some things in our codebase around transaction behavior and made sure we begin/end transactions consistently throughout, to no avail. If you have any ideas, we'd be happy to try.

Version

Go: $ go version -> go version go1.21.5 linux/amd64
PostgreSQL: $ psql --no-psqlrc --tuples-only -c 'select version()' -> PostgreSQL 15.7 on x86_64-pc-linux-musl, compiled by gcc (Alpine 13.2.1_git20240309) 13.2.1 20240309, 64-bit
pgx: $ grep 'github.com/jackc/pgx/v[0-9]' go.mod -> v5.5.4

The text was updated successfully, but these errors were encountered:

jackc · 2024-08-03T14:57:29Z

I do not think that failed to deallocate cached statement: conn closed is directly connected to dangling transactions -- after all the connection is closed -- it can't be holding open a connection. I suppose it could be worth checking that the connection actually is closed on the server though.

I suppose it's possible this has something to do with the context cancellation. But because of the difficulty in reliably interrupting a query and recovering the connection, pgx simply closes the connection when a query is interrupted. So it would seem there's not much that could go wrong there.

xocasdashdash · 2024-10-03T21:19:25Z

👋
I'm seeing some similar errors when using a "child" context to query some rows for locking.

So i'm doing basically this:

tctx, cancel := context.WithTimeout(ctx, 10 * time.Second)
defer cancel()

txx, err := s.sqlClient.BeginTxx(tctx, nil)

defer func() {
   // handle recover
   fn := txx.Commit
   if err != nil {
     fn = txx.Rollback
    }
    funcErr := fn()
    if funErr != nil {
       log.Error("error applying transaction %+v", funcErr)
        // Here i see the error with `failed to deallocate cached statement: conn closed `
       err = fmt.Errorf("failed to apply transaction %w", funErr)
     }
     return err
}

childCtx, cancelChildren := context.WithTimeout(tctx, 1 * time.Secoond)
defer cancelChildren()

var ids []string
err := s.sqlClient.ExecCtx(childCtx, &ids, "select id from users where age > 99 for update"); // this takes more than 1 second
if err != nil {
 return err
}
// use the ids to run an update
// loop through users and run an update
// commit transaction

After the "lock" queries started hitting their context timeout, i started seeing this errors on the application logs. Is it the case that this "child" context being cancelled is closing down the original transaction connection ?

I'm trying to understand where in the code does this happen if there's something I can do to better handle this (I already want to refactor how the transaction is committed/rollbacked to avoid failing)

jackc · 2024-10-05T15:26:55Z

@xocasdashdash

Is it the case that this "child" context being cancelled is closing down the original transaction connection ?

Yes. Context cancellation closes the connection in most cases because it is very difficult or impossible to reliably recover from an interrupted query.

But beyond that, the fundamental means of interrupting a query in PostgreSQL causes the interrupted query to return an error. If a query returns an error the wrapping transaction is places in an error state. So even at the PostgreSQL level the only way to handle this would be to use savepoints (pseudo nested transactions) around every query you might want to cancel and rollback.

Valodim added the bug label Aug 1, 2024

Valodim changed the title ~~dangling transactions after f'failed to deallocate cached statement' errors~~ dangling transactions after 'failed to deallocate cached statement' errors Aug 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dangling transactions after 'failed to deallocate cached statement' errors #2100

dangling transactions after 'failed to deallocate cached statement' errors #2100

Valodim commented Aug 1, 2024

jackc commented Aug 3, 2024

xocasdashdash commented Oct 3, 2024

jackc commented Oct 5, 2024

dangling transactions after 'failed to deallocate cached statement' errors #2100

dangling transactions after 'failed to deallocate cached statement' errors #2100

Comments

Valodim commented Aug 1, 2024

jackc commented Aug 3, 2024

xocasdashdash commented Oct 3, 2024

jackc commented Oct 5, 2024