Multiple "pq: unexpected describe rows response" errors #254
The first thing to check would be that you're closing the previous Rows object before trying to run another query in the same transaction.
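A minimal sketch of that advice, with hypothetical table and column names: drain and close one Rows object before issuing the next query on the same transaction.

```go
package example

import "database/sql"

// Hedged sketch (hypothetical schema): close the first result set
// before running a second query on the same *sql.Tx.
func twoQueries(tx *sql.Tx) ([]int64, int, error) {
	rows, err := tx.Query(`SELECT id FROM items`)
	if err != nil {
		return nil, 0, err
	}
	var ids []int64
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			rows.Close()
			return nil, 0, err
		}
		ids = append(ids, id)
	}
	// Close releases the connection for the next query; without this,
	// the second query would contend with a connection that is still
	// streaming the first result set.
	if err := rows.Close(); err != nil {
		return nil, 0, err
	}

	var n int
	if err := tx.QueryRow(`SELECT count(*) FROM items`).Scan(&n); err != nil {
		return nil, 0, err
	}
	return ids, n, nil
}
```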
It sounds like a connection that was being used for query A is somehow getting sucked into being used for query B. I've had similar problems with mysql and db handles surviving across fork(), but I'm not sure how that would end up happening with Go. I wonder if the race detector would show anything?
@johto I've double-checked, and we actually don't have any transactions that execute more than one query. And separately, rows are closed everywhere, so I don't think we are leaking anything.
@dgryski I can certainly try and see if it turns anything up.
Which exact commit of pq are you using?
Also, is …
I should've been clearer about the mutex waits above. I've edited the issue to reflect that the same errors weren't observed, so we shouldn't assume it is the same issue.
I don't suppose the problem is reproducible enough that you could capture the network traffic of one such session? I can't see anything obvious in lib/pq at this moment. I'll dive a bit deeper later tonight.
I've added race instrumentation, so we'll see if it turns anything up. Theoretically yes, we could tcpdump the traffic. It'll take me a little time to get that set up, but I can probably get it done tomorrow. At worst it can take a few days for this issue to appear, though. Thanks for the suggestions so far.
Hmm.. I wonder if this is interesting or not. Message type 90 is 'Z', i.e. ReadyForQuery, which the frontend would never send. This might suggest that there is, indeed, concurrent access to the pq.conn happening, as Damian suggested.
We've deployed a build with "-race" enabled. It hasn't yet triggered the usual symptoms (which ultimately result in what mattcottingham describes), but we figured you'd want to know about the race reports anyway.
Both of the calling queries are pretty much identical: they are single-statement transactions that prepare a statement and use QueryRow() (in auth) or Exec() (in updateLastActive) to execute it. There are no nested transaction statements or rows.Next() issues. This is using the latest version of lib/pq at commit c808a1b. The line in question is this one: https://github.com/lib/pq/blob/master/conn.go#L637 That is the attempt to read the type and 4-byte length value from the start of the message buffer. We'll continue monitoring for the actual issue.
Uhh.. Can I get a sanity check here? What business does …
Though strictly looking at this, http://golang.org/pkg/database/sql/driver/#Rows doesn't give the same guarantees as driver.Stmt. I wonder if we're relying on undefined behaviour here, or if we should shift the blame to database/sql.
Looks like at least the mysql driver is doing the same thing, though. Will continue looking at this..
I've found another unique race in the log file:
Same two statements, but now involving one of the statements closing and tripping over the other. This brings in https://github.com/lib/pq/blob/master/conn.go#L833
On the sanity check question, I don't know sql.go well enough to clarify anything there. Will read up on it tomorrow.
Hmm, no, that seems like a red herring in this case. The fact that sql.(*Stmt).Close() calls (*driverStmt).Close() directly means that the statement was prepared via sql.(*Tx).Prepare(), and not sql.(*DB).Prepare(). Does the application use per-transaction prepared statements?
Ah-hah! I can reproduce it here by closing a Rows object in a transaction when a Stmt.Close() on a statement created in the same transaction is called concurrently. Will have to do more digging to see where to put the blame..
Yes. An example of our code (minus extraneous logging and OTT error handling) for the auth function:
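As a hedged illustration of the single-statement-transaction pattern described in this thread (tx.Prepare followed by QueryRow, with deferred cleanup; all identifiers are hypothetical, not the application's actual code):

```go
package example

import "database/sql"

// Hypothetical reconstruction of the pattern discussed. Note the
// ordering that later turns out to be the problem: the deferred
// stmt.Close() runs only after tx.Commit() has already returned
// the connection to the pool.
func auth(db *sql.DB, token string) (int64, error) {
	tx, err := db.Begin()
	if err != nil {
		return 0, err
	}
	defer tx.Rollback() // no-op after a successful Commit

	stmt, err := tx.Prepare(`SELECT user_id FROM auth($1)`)
	if err != nil {
		return 0, err
	}
	defer stmt.Close() // fires after the Commit below

	var userID int64
	if err := stmt.QueryRow(token).Scan(&userID); err != nil {
		return 0, err
	}
	return userID, tx.Commit()
}
```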
Yes, just to be clear, our application uses both (*DB).Prepare and (*Tx).Prepare (for separate things).
Right. Actually, in the case where I can reproduce it, it would probably be the application's fault, since sql.(*Tx) is not declared to be safe to use concurrently from different goroutines. Are you doing that, or is there only ever one goroutine working on a single transaction?
Yes, we checked that there is no concurrent access to any transactions after reading #81. Could you post your code that reproduces the race just so we can compare?
In the cases highlighted by the race check, there is only a single statement, in a single function, in a single goroutine. My code snippet above is pretty much it, and we only do it in a transaction because the function being SELECTed from may modify data. We use transactions for all queries that modify data, even if it is only a single statement. There is potentially one place elsewhere in our code where we pass an entirely different transaction around where that might be true. I will triple-check and confirm shortly.
Checked and confirmed... we never use a transaction outside of a single goroutine.
Here's what I did: https://gist.github.com/johto/243016fa6fda3db541b4 But that would require unsafe concurrent access to the *Tx, so there must be something else going on here..
I would trust the race detector in this case, since it never gives false positives. If it saw a race, there was definitely a race. The race may be harmless, but that doesn't change the fact that there was unprotected access to shared mutable state. Those two warnings listed above need to be looked at.
@buro9: Can I get a similar overview of what UpdateLastActive() does as well?
@dgryski: Oh definitely. The race found by the race detector would explain the symptoms of the original problem, so I think we're on to something.
Sure... it's incredibly simple:
That is called fractionally before GetPermission() is called. GetPermission may be called within another goroutine. db is global to the application and all goroutines, but transactions are only ever created and used in a single goroutine.
Ooh. Is it possible that you're calling tx.Rollback() before the deferred stmt.Close() is executed? That would explain this as well.
Our logs show that none of the rollbacks have ever happened. Every "// err handle" logs a fair bit. But if we're supposed to explicitly stmt.Close() before rollback, I'll certainly make sure we do.
Hmm.. OK.
Yeah, you have to do that. Somewhat unreliable repro here: https://gist.github.com/johto/f8ce866e44944eeb8ac5 It would be great if you could fix that first, and then if the problem still reproduces (or the race detector finds something) I'll have another look.
Oh. I just realized that calling Commit() first would do this as well, so the fact that there aren't any log messages from the error path being taken doesn't disprove the theory.
And defer stmt.Close() doesn't deal with this?
Plus... we're using QueryRow() and Exec(), both of which are supposed to Close() the statement.
The problem is as follows:
I'm not sure what you mean here. Could you expand on that a bit?
In our example:
We call … Or are you saying that Exec() may error, leaving the statement open, and that we'll exit our function with a deferred tx.Rollback happening, and the deferred stmt.Close() will silently error too late?
Hmm.. so it's purely an application-level issue then. Any way to detect this from the library? Is there an ABA problem (https://en.wikipedia.org/wiki/ABA_problem) with inUse?
No, Exec doesn't close the statement. The entire point of doing a separate Prepare() step is that the statement can be reused. If you don't want to reuse the statement, you should just call Tx.Exec() directly.
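A hedged sketch of that suggestion (hypothetical schema and function name): for a one-shot statement, skip Prepare entirely and let the transaction run the query directly.

```go
package example

import "database/sql"

// Hedged sketch, not the application's actual code: Tx.Exec runs
// the statement directly, so there is no *sql.Stmt left to close
// after Commit returns the connection to the pool.
func updateLastActive(db *sql.DB, profileID int64) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	if _, err := tx.Exec(
		`UPDATE profiles SET last_active = now() WHERE profile_id = $1`,
		profileID,
	); err != nil {
		tx.Rollback()
		return err
	}
	return tx.Commit()
}
```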
I think it would be possible from database/sql, but not from the driver. I could look into that a bit later today.
No, *Stmt.Close() never deals with inUse; it assumes that the transaction/connection pair in it hasn't gone anywhere, which in this case isn't true. Though even if we did notice this case, I'm not sure where to report the error, since nobody's going to be checking the result of a deferred *Stmt.Close()..
I'll put in stmt.Close() with error checking and logging before every tx.Commit() and tx.Rollback(). I'll do that now and will let you know tomorrow if this has prevented more instances of the race condition in the log file.
For example, this makes UpdateLastActive() now look like this:
@buro9: Looks correct to me. The code gets a lot nicer by avoiding the explicit Prepare(), but it would be great if you could run this for a few days to make sure the race goes away.
The reuse is internal to Postgres: it's the cached query plan. If the name of the server-side prepared statement is a hash of the query, then the same plan will be used for each query during the database session (connection), as asking to prepare something that is already in the plan cache should return a reference to the existing plan. Surely it would only be necessary to explicitly reuse the Go stmt if the name of the prepared statement isn't actually related to the query being prepared (a name based on a serial or random id)? This is my lack of knowledge about the inner workings of Postgres showing through. It may well be that Postgres is more primitive and overwrites an existing plan, and thus this saves no work compared to a direct Exec() call.
You are free to argue that, but:
And in this case, database/sql actually is smarter. If you create a prepared statement on the *sql.DB and then use *Tx.Stmt() and the statement has already been prepared on that connection, you get that statement. But doing a *Tx.Prepare() for every query prevents any of that from happening.
Right. And that's exactly how it works with the scheme you've implemented; see above.
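The *Tx.Stmt() path mentioned above can be sketched as follows (hypothetical names, not the application's code): prepare once on the *sql.DB, then adopt the statement into each transaction, letting database/sql reuse a statement already prepared on that connection.

```go
package example

import "database/sql"

// Prepared once, e.g. at startup, on the *sql.DB.
var getPermission *sql.Stmt

func initStatements(db *sql.DB) error {
	var err error
	getPermission, err = db.Prepare(`SELECT get_permission($1)`)
	return err
}

func checkPermission(db *sql.DB, userID int64) (bool, error) {
	tx, err := db.Begin()
	if err != nil {
		return false, err
	}
	defer tx.Rollback() // no-op after a successful Commit

	// Tx.Stmt returns a transaction-scoped statement; if the query
	// is already prepared on this connection, that copy is reused.
	stmt := tx.Stmt(getPermission)

	var ok bool
	if err := stmt.QueryRow(userID).Scan(&ok); err != nil {
		stmt.Close()
		return false, err
	}
	// Close the transaction-scoped statement before Commit returns
	// the connection to the pool.
	if err := stmt.Close(); err != nil {
		return false, err
	}
	return ok, tx.Commit()
}
```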
Awesome, thanks for that too... we'll get back with any updates on trapped races in the log tomorrow.
I'm going to assume this fixed the problem. I still haven't looked into what, if anything, could be done on the database/sql side to possibly improve matters, but I'll try and look into that during the 1.4 development cycle.
I hope leaving a comment doesn't re-open this. Yes, confirmed it's now working fine. The key was to not use Prepare() on a transaction. I'm not sure I'd call it fixed, though, as this is just a workaround; people are likely to assume all of the database/sql interfaces are safe to use without unexpected side effects. It may be worth documenting this in the project readme, or having some examples that avoid tx.Prepare() that people are likely to find and copy.
So to conclude, the following may cause a race condition (error handling omitted):

```go
tx, err := db.Begin()
stmt, err := tx.Prepare(`QUERY`)
_, err = stmt.Exec(arg)
err = tx.Commit()
// Connection has now been returned to the pool with putConn.
stmt.Close()
```

...because the connection is released by putConn before stmt.Close() runs. Looking at these two lines in database/sql, it becomes clear: http://golang.org/src/pkg/database/sql/sql.go#L1436 @johto thanks for the help on this, much appreciated.
For anyone interested, this shouldn't be a problem starting from Go 1.4 because of https://code.google.com/p/go/source/detail?r=50ce4ec65c4a. |
We've been experiencing errors with lib/pq that eventually result in a state where no database connections in the pool are free. Max connections as set by SetMaxOpenConns is not reached. The errors normally start with db.Prepare:

We also see the unexpected response 'C'. These develop into multiple occurrences of

and similar errors on statements that have previously been prepared with no errors. To reiterate, we don't query any prepared statements that returned errors from Prepare. These errors are returned from queries on successfully prepared statements. Then, we see occurrences of failures of Begin:

The only error in postgres around this time is:

which happened around 20s after the initial errors were seen in our application log. pprof indicates that all goroutines querying the database were netpolling:

In a separate instance of unresponsiveness, all the goroutines were waiting on mutexes, e.g.:

The logged errors weren't present in that case, so these could be two totally different problems. We're running Ubuntu 12.04 and pg 9.2 with go 1.2.1, and we're not assuming a trouble-free network by any means. I appreciate this is somewhat vague and we don't have a reproducible test case (yet), but any pointers for further investigation would be appreciated. Let me know if further background info would be helpful.