Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple "pq: unexpected describe rows response" errors #254

Closed
mattrco opened this issue May 1, 2014 · 51 comments
Closed

Multiple "pq: unexpected describe rows response" errors #254

mattrco opened this issue May 1, 2014 · 51 comments

Comments

@mattrco
Copy link

mattrco commented May 1, 2014

We've been experiencing errors with lib/pq that eventually results in a state where no database connections in the pool are free. Max connections as set by SetMaxOpenConns is not reached.

The errors normally start with db.Prepare:

Could not prepare statement: pq: unexpected describe rows response: '3'

We also see the unexpected response 'C'. These develop into multiple occurrences of

sql: statement expects 0 inputs; got 4

and similar errors on statements that have previously been prepared with no errors.

To reiterate, we don't query any prepared statements that returned errors from Prepare. These errors are returned from queries on successfully prepared statements.

Then, we see occurrences of failures of Begin:

Could not start a transaction: pq: unknown response for simple query: '1'
Could not start a transaction: unexpected command tag 
Could not start a transaction: pq: unexpected transaction status idle in transaction

The only error in postgres around this time is:

FATAL:  invalid frontend message type 90

Which happened around 20s after the initial errors were seen in our application log.

pprof indicates that all goroutines querying the database were netpolling:

#   0x423886    netpollblock+0xa6                       /usr/local/src/go/src/pkg/runtime/netpoll.goc:280
#   0x4231ea    net.runtime_pollWait+0x6a                   /usr/local/src/go/src/pkg/runtime/netpoll.goc:116
#   0x695534    net.(*pollDesc).Wait+0x34                   /usr/local/src/go/src/pkg/net/fd_poll_runtime.go:81
#   0x695590    net.(*pollDesc).WaitRead+0x30                   /usr/local/src/go/src/pkg/net/fd_poll_runtime.go:86
#   0x696910    net.(*netFD).Read+0x2a0                     /usr/local/src/go/src/pkg/net/fd_unix.go:204
#   0x6a5825    net.(*conn).Read+0xc5                       /usr/local/src/go/src/pkg/net/net.go:122
#   0x48a1a0    bufio.(*Reader).fill+0x110                  /usr/local/src/go/src/pkg/bufio/bufio.go:91
#   0x48a5a4    bufio.(*Reader).Read+0x1a4                  /usr/local/src/go/src/pkg/bufio/bufio.go:159
#   0x46f566    io.ReadAtLeast+0xf6                     /usr/local/src/go/src/pkg/io/io.go:288
#   0x46f6d1    io.ReadFull+0x71                        /usr/local/src/go/src/pkg/io/io.go:306
#   0x4f5f9b    github.com/lib/pq.(*conn).recvMessage+0x10b         /home/matt/dev/go/src/github.com/lib/pq/conn.go:637
#   0x4f6357    github.com/lib/pq.(*conn).recv1+0x27                /home/matt/dev/go/src/github.com/lib/pq/conn.go:690
#   0x4f5242    github.com/lib/pq.(*conn).prepareToSimpleStmt+0x822     /home/matt/dev/go/src/github.com/lib/pq/conn.go:508
#   0x4f4997    github.com/lib/pq.(*conn).prepareTo+0x87            /home/matt/dev/go/src/github.com/lib/pq/conn.go:486
#   0x4f5670    github.com/lib/pq.(*conn).Prepare+0x120             /home/matt/dev/go/src/github.com/lib/pq/conn.go:539
#   0x4e5199    database/sql.(*driverConn).prepareLocked+0x49           /usr/local/src/go/src/pkg/database/sql/sql.go:250
#   0x4e7ce1    database/sql.(*DB).prepare+0xb1                 /usr/local/src/go/src/pkg/database/sql/sql.go:828
#   0x4e7b9c    database/sql.(*DB).Prepare+0x5c                 /usr/local/src/go/src/pkg/database/sql/sql.go:808

In a separate instance of unresponsiveness, all the goroutines were waiting on mutexes, e.g.:

#   0x4241d0    sync.runtime_Semacquire+0x30                            /usr/local/src/go/src/pkg/runtime/sema.goc:199
#   0x47f136    sync.(*Mutex).Lock+0xd6                             /usr/local/src/go/src/pkg/sync/mutex.go:66
#   0x4e6dd2    database/sql.(*DB).conn+0x42                            /usr/local/src/go/src/pkg/database/sql/sql.go:616
#   0x4e7c82    database/sql.(*DB).prepare+0x32                         /usr/local/src/go/src/pkg/database/sql/sql.go:823
#   0x4e7bbc    database/sql.(*DB).Prepare+0x5c                         /usr/local/src/go/src/pkg/database/sql/sql.go:808

The logged errors weren't present in that case, so these could be two totally different problems.

We're running Ubuntu 12.04 and pg 9.2 with go 1.2.1, and we're not assuming a trouble-free network by any means.

I appreciate this is somewhat vague and we don't have a reproducible test case (yet), but any pointers for further investigation would be appreciated. Let me know if further background info would be helpful.

@johto
Copy link
Contributor

johto commented May 1, 2014

The first thing to check would be that you're closing the previous Rows object before trying to run another query in the same transaction.

@dgryski
Copy link

dgryski commented May 1, 2014

It sounds like a connection that was being used for query A is somehow getting sucked to being used for query B. I've had similar problems with mysql and db handles surviving across fork(), but I'm not sure how that would end up happening with Go. I wonder if the race detector would show anything?

@mattrco
Copy link
Author

mattrco commented May 1, 2014

@johto I've doubled-checked, we actually don't have any transactions that execute more than one query. And separately, rows are closed everywhere, so I don't think we are leaking anything.

@mattrco
Copy link
Author

mattrco commented May 1, 2014

@dgryski I can certainly try and see if it turns anything up.

@johto
Copy link
Contributor

johto commented May 1, 2014

Which exact commit of pq are you using?

@johto
Copy link
Contributor

johto commented May 1, 2014

Also, is Could not prepare statement: pq: unexpected describe rows response: '3' the very first "weird" error you're seeing?

@mattrco
Copy link
Author

mattrco commented May 1, 2014

@johto That's the first error I saw this time but I can't say with certainty it is always the case. I can take another look at our logs.

We are on f08bb02.

@mattrco
Copy link
Author

mattrco commented May 1, 2014

I should've been clearer about the mutex waits above. I've edited the issue to reflect that the same errors weren't observed so we shouldn't assume it is the same issue.

@johto
Copy link
Contributor

johto commented May 1, 2014

I don't suppose the problem is reproducible enough that you could capture the network traffic of one such session?

I can't see anything obvious in lib/pq at this moment. I'll dive a bit deeper later tonight.

@mattrco
Copy link
Author

mattrco commented May 1, 2014

I've added race instrumentation so we'll see if it turns anything up.

Theoretically yes, we could tcpdump the traffic. It'll take me a little time to get that set up, but I can probably get it done tomorrow. At worst it can take a few days for this issue to appear, though.

Thanks for the suggestions so far.

@johto
Copy link
Contributor

johto commented May 1, 2014

The only error in postgres around this time is:

FATAL: invalid frontend message type 90

Hmm.. I wonder if this is interesting or not. Message type 90 is 'Z', i.e. ReadyForQuery, which the frontend would never send. This might suggest that there is, indeed, concurrent access to the pq.conn happening as Damian suggested.

@buro9
Copy link

buro9 commented May 1, 2014

We've deployed a build with "-race" enabled.

And this hasn't triggered the usual symptoms (which ultimately result in what mattcottingham describes), but figured you'd want to know about them anyway.

==================
WARNING: DATA RACE
Write by goroutine 39:
  bufio.(*Reader).fill()
      /usr/local/src/go/src/pkg/bufio/bufio.go:86 +0x158
  bufio.(*Reader).Read()
      /usr/local/src/go/src/pkg/bufio/bufio.go:159 +0x3fb
  io.ReadAtLeast()
      /usr/local/src/go/src/pkg/io/io.go:288 +0x119
  io.ReadFull()
      /usr/local/src/go/src/pkg/io/io.go:306 +0x7e
  github.com/lib/pq.(*conn).recvMessage()
      /home/buro9/Dev/Go/src/github.com/lib/pq/conn.go:637 +0x204
  github.com/lib/pq.(*conn).recv1()
      /home/buro9/Dev/Go/src/github.com/lib/pq/conn.go:690 +0x34
  github.com/lib/pq.(*rows).Next()
      /home/buro9/Dev/Go/src/github.com/lib/pq/conn.go:1061 +0x19f
  github.com/lib/pq.(*rows).Close()
      /home/buro9/Dev/Go/src/github.com/lib/pq/conn.go:1032 +0x42
  database/sql.(*Rows).Close()
      /usr/local/src/go/src/pkg/database/sql/sql.go:1580 +0xc3
  database/sql.(*Row).Scan()
      /usr/local/src/go/src/pkg/database/sql/sql.go:1636 +0x7d
  github.com/microcosm-cc/microcosm/models.GetPermission()
      /home/buro9/Dev/Go/src/github.com/microcosm-cc/microcosm/models/authorisation.go:109 +0xa22
  github.com/microcosm-cc/microcosm/controller.(*MicrocosmsController).ReadMany()
      /home/buro9/Dev/Go/src/github.com/microcosm-cc/microcosm/controller/microcosms.go:85 +0x206
  github.com/microcosm-cc/microcosm/controller.MicrocosmsHandler()
      /home/buro9/Dev/Go/src/github.com/microcosm-cc/microcosm/controller/microcosms.go:30 +0x160
  net/http.HandlerFunc.ServeHTTP()
      /usr/local/src/go/src/pkg/net/http/server.go:1220 +0x4d
  github.com/gorilla/mux.(*Router).ServeHTTP()
      /home/buro9/Dev/Go/src/github.com/gorilla/mux/mux.go:98 +0x337
  net/http.(*ServeMux).ServeHTTP()
      /usr/local/src/go/src/pkg/net/http/server.go:1496 +0x1dd
  net/http.serverHandler.ServeHTTP()
      /usr/local/src/go/src/pkg/net/http/server.go:1597 +0x1ca
  net/http.(*conn).serve()
      /usr/local/src/go/src/pkg/net/http/server.go:1167 +0xc00

Previous read by goroutine 115:
  bufio.(*Reader).Read()
      /usr/local/src/go/src/pkg/bufio/bufio.go:145 +0x110
  io.ReadAtLeast()
      /usr/local/src/go/src/pkg/io/io.go:288 +0x119
  io.ReadFull()
      /usr/local/src/go/src/pkg/io/io.go:306 +0x7e
  github.com/lib/pq.(*conn).recvMessage()
      /home/buro9/Dev/Go/src/github.com/lib/pq/conn.go:637 +0x204
  github.com/lib/pq.(*conn).recv1()
      /home/buro9/Dev/Go/src/github.com/lib/pq/conn.go:690 +0x34
  github.com/lib/pq.(*stmt).Close()
      /home/buro9/Dev/Go/src/github.com/lib/pq/conn.go:840 +0x638
  database/sql.(*driverStmt).Close()
      /usr/local/src/go/src/pkg/database/sql/sql.go:333 +0xbd
  database/sql.(*Stmt).Close()
      /usr/local/src/go/src/pkg/database/sql/sql.go:1452 +0x1a0
  github.com/microcosm-cc/microcosm/models.UpdateLastActive()
      /home/buro9/Dev/Go/src/github.com/microcosm-cc/microcosm/models/profiles.go:408 +0x204
  github.com/microcosm-cc/microcosm/models.(*Context).authenticate()
      /home/buro9/Dev/Go/src/github.com/microcosm-cc/microcosm/models/context.go:203 +0xe4a
  github.com/microcosm-cc/microcosm/models.MakeContext()
      /home/buro9/Dev/Go/src/github.com/microcosm-cc/microcosm/models/context.go:120 +0x3ff
  github.com/microcosm-cc/microcosm/controller.WhoAmIHandler()
      /home/buro9/Dev/Go/src/github.com/microcosm-cc/microcosm/controller/whoami.go:13 +0x5c
  net/http.HandlerFunc.ServeHTTP()
      /usr/local/src/go/src/pkg/net/http/server.go:1220 +0x4d
  github.com/gorilla/mux.(*Router).ServeHTTP()
      /home/buro9/Dev/Go/src/github.com/gorilla/mux/mux.go:98 +0x337
  net/http.(*ServeMux).ServeHTTP()
      /usr/local/src/go/src/pkg/net/http/server.go:1496 +0x1dd
  net/http.serverHandler.ServeHTTP()
      /usr/local/src/go/src/pkg/net/http/server.go:1597 +0x1ca
  net/http.(*conn).serve()
      /usr/local/src/go/src/pkg/net/http/server.go:1167 +0xc00

Both of the calling queries are pretty much identical, they are single statement transactions that prepare a statement, and use QueryRow() (in auth) or Exec() (in updateLastActive) to execute it. There's no nested transaction statements or rows.Next() issues.

This is using the latest version of lib/pq at this commit: c808a1b

The line in question is this one: https://github.com/lib/pq/blob/master/conn.go#L637

Which is the attempt to read the type and 4 byte length value from the start of the message buffer.

We'll continue monitoring for the actual issue.

@johto
Copy link
Contributor

johto commented May 1, 2014

Uhh.. Can I get a sanity check here? What business does func (rs *Rows) Close() calling rs.rowsi.Close() without holding its lock?

@johto
Copy link
Contributor

johto commented May 1, 2014

Though strictly looking at this, http://golang.org/pkg/database/sql/driver/#Rows doesn't give the same guarantees as driver.Stmt. I wonder if we're relying on undefined behaviour here, or if we should shift the blame on database/sql.

@johto
Copy link
Contributor

johto commented May 1, 2014

Looks like at least the mysql driver is doing the same thing, though. Will continue looking at this..

@buro9
Copy link

buro9 commented May 1, 2014

I've found another unique race in the log file:

WARNING: DATA RACE
Write by goroutine 48:
  runtime.copy()
      /usr/local/src/go/src/pkg/runtime/slice.c:120 +0x0
  bufio.(*Reader).Read()
      /usr/local/src/go/src/pkg/bufio/bufio.go:168 +0x634
  io.ReadAtLeast()
      /usr/local/src/go/src/pkg/io/io.go:288 +0x119
  io.ReadFull()
      /usr/local/src/go/src/pkg/io/io.go:306 +0x7e
  github.com/lib/pq.(*conn).recvMessage()
      /home/buro9/Dev/Go/src/github.com/lib/pq/conn.go:637 +0x204
  github.com/lib/pq.(*conn).recv1()
      /home/buro9/Dev/Go/src/github.com/lib/pq/conn.go:690 +0x34
  github.com/lib/pq.(*rows).Next()
      /home/buro9/Dev/Go/src/github.com/lib/pq/conn.go:1061 +0x19f
  github.com/lib/pq.(*rows).Close()
      /home/buro9/Dev/Go/src/github.com/lib/pq/conn.go:1032 +0x42
  database/sql.(*Rows).Close()
      /usr/local/src/go/src/pkg/database/sql/sql.go:1580 +0xc3
  database/sql.(*Row).Scan()
      /usr/local/src/go/src/pkg/database/sql/sql.go:1636 +0x7d
  github.com/microcosm-cc/microcosm/models.GetPermission()
      /home/buro9/Dev/Go/src/github.com/microcosm-cc/microcosm/models/authorisation.go:109 +0xa22
  github.com/microcosm-cc/microcosm/controller.(*MicrocosmController).Read()
      /home/buro9/Dev/Go/src/github.com/microcosm-cc/microcosm/controller/microcosm.go:57 +0x1dc
  github.com/microcosm-cc/microcosm/controller.MicrocosmHandler()
      /home/buro9/Dev/Go/src/github.com/microcosm-cc/microcosm/controller/microcosm.go:31 +0x1bd
  net/http.HandlerFunc.ServeHTTP()
      /usr/local/src/go/src/pkg/net/http/server.go:1220 +0x4d
  github.com/gorilla/mux.(*Router).ServeHTTP()
      /home/buro9/Dev/Go/src/github.com/gorilla/mux/mux.go:98 +0x337
  net/http.(*ServeMux).ServeHTTP()
      /usr/local/src/go/src/pkg/net/http/server.go:1496 +0x1dd
  net/http.serverHandler.ServeHTTP()
      /usr/local/src/go/src/pkg/net/http/server.go:1597 +0x1ca
  net/http.(*conn).serve()
      /usr/local/src/go/src/pkg/net/http/server.go:1167 +0xc00

Previous write by goroutine 122:
  github.com/lib/pq.(*stmt).Close()
      /home/buro9/Dev/Go/src/github.com/lib/pq/conn.go:833 +0xf5
  database/sql.(*driverStmt).Close()
      /usr/local/src/go/src/pkg/database/sql/sql.go:333 +0xbd
  database/sql.(*Stmt).Close()
      /usr/local/src/go/src/pkg/database/sql/sql.go:1452 +0x1a0
  github.com/microcosm-cc/microcosm/models.UpdateLastActive()
      /home/buro9/Dev/Go/src/github.com/microcosm-cc/microcosm/models/profiles.go:408 +0x204
  github.com/microcosm-cc/microcosm/models.(*Context).authenticate()
      /home/buro9/Dev/Go/src/github.com/microcosm-cc/microcosm/models/context.go:203 +0xe4a
  github.com/microcosm-cc/microcosm/models.MakeContext()
      /home/buro9/Dev/Go/src/github.com/microcosm-cc/microcosm/models/context.go:120 +0x3ff
  github.com/microcosm-cc/microcosm/controller.WhoAmIHandler()
      /home/buro9/Dev/Go/src/github.com/microcosm-cc/microcosm/controller/whoami.go:13 +0x5c
  net/http.HandlerFunc.ServeHTTP()
      /usr/local/src/go/src/pkg/net/http/server.go:1220 +0x4d
  github.com/gorilla/mux.(*Router).ServeHTTP()
      /home/buro9/Dev/Go/src/github.com/gorilla/mux/mux.go:98 +0x337
  net/http.(*ServeMux).ServeHTTP()
      /usr/local/src/go/src/pkg/net/http/server.go:1496 +0x1dd
  net/http.serverHandler.ServeHTTP()
      /usr/local/src/go/src/pkg/net/http/server.go:1597 +0x1ca
  net/http.(*conn).serve()
      /usr/local/src/go/src/pkg/net/http/server.go:1167 +0xc00

Same two statements, but now involving one of the statements closing and tripping over the other.

This brings in https://github.com/lib/pq/blob/master/conn.go#L833

@buro9
Copy link

buro9 commented May 1, 2014

On the sanity check question, I don't know sql.go well enough to clarify anything there. Will read up on it tomorrow.

@johto
Copy link
Contributor

johto commented May 1, 2014

Hmm, no, that seems like a red herring in this case. The fact that sql.(_Stmt).Close() calls (_driverStmt).Close() directly means that the statement was prepared via sql.(_Tx).Prepare(), and not sql.(_DB).Prepare(). Does the application use per-transaction prepared statements?

@johto
Copy link
Contributor

johto commented May 1, 2014

Ah-hah! I can reproduce it here by closing a Rows object in a transaction when an Stmt.Close() on a statement created in the same transaction is called concurrently. Will have to do more digging to see where to put the blame..

@buro9
Copy link

buro9 commented May 1, 2014

Does the application use per-transaction prepared statements?

Yes.

An example of our code (minus extraneous logging and OTT error handling) for the auth function:

func GetPermission(ac AuthContext) PermissionType {
    tx, err := db.Begin()
    // err handle
    defer tx.Rollback()

    stmt, err := tx.Prepare(`SELECT 1 FROM blah($1)`)
    // err handle
    defer stmt.Close()

    m := PermissionType{}
    err = stmt.QueryRow(
        ac.SomeValue
    ).Scan(
        &m.SomeProperty,
    )
    // err handle with tx.Rollback()

    err = tx.Commit()
    // err handle

    return m
}

@mattrco
Copy link
Author

mattrco commented May 1, 2014

Yes, just to be clear, our application uses both (_DB).Prepare and (_Tx).Prepare (for separate things).

@johto
Copy link
Contributor

johto commented May 1, 2014

Right. Actually, in the case where I can reproduce it would probably be the application's fault, since sql.(*Tx) is not declared to be safe to use concurrently from different goroutines. Are you doing that, or is there only ever one goroutine working on a single transaction?

@mattrco
Copy link
Author

mattrco commented May 1, 2014

Yes, we checked that there is no concurrent access to any transactions after reading #81.

Could you post your code that reproduces the race just so we can compare?

@buro9
Copy link

buro9 commented May 1, 2014

In the cases highlighted by the race check, only a single statement, in a single function, in a single goroutine. My code snippet above is pretty much it, and we only do it in a transaction as the function being SELECTed from may modify data. We use transactions for all queries that modify data even if it is only a single statement.

There is potentially one place elsewhere in our code where we pass an entirely different transaction around where that might be true. I will triple check and confirm shortly.

@buro9
Copy link

buro9 commented May 1, 2014

Checked and confirmed... we never use a transaction outside of a single goroutine.

@johto
Copy link
Contributor

johto commented May 1, 2014

Could you post your code that reproduces the race just so we can compare?

Here's what I did: https://gist.github.com/johto/243016fa6fda3db541b4

But that would require unsafe concurrent access to the *Tx, so there must be something else going on here..

@dgryski
Copy link

dgryski commented May 1, 2014

I would trust the race detector in this case, since it never gives false positives. If it saw a race, there was definitely a race. The race may be harmless, but that doesn't change the fact that there was unprotected access to shared mutable state. Those two warnings listed above need to be looked at.

@johto
Copy link
Contributor

johto commented May 1, 2014

@buro9: Can I get a similar overview of what UpdateLastActive() does as well?

@johto
Copy link
Contributor

johto commented May 1, 2014

@dgryski: Oh definitely. The race found by the race detector would explain the symptoms of the original problem, so I think we're on to something.

@buro9
Copy link

buro9 commented May 1, 2014

Sure... it's incredibly simple:

func UpdateLastActive(profileId int64, lastActive time.Time) {
    tx, err := db.Begin()
    // err handle
    defer tx.Rollback()

    stmt, err := tx.Prepare(`UPDATE profiles SET last_active = $2 WHERE profile_id = $1;`)
    // err handle
    defer stmt.Close()

    _, err = stmt.Exec(profileId, lastActive)
    // err handle with tx.Rollback()

    err = tx.Commit()
    // err handle
}

That is called fractionally before GetPermission() is called.

GetPermission may be called within another goroutine.

db is global to the application and all goroutines, but transactions are only ever created and used in a single goroutine.

@johto
Copy link
Contributor

johto commented May 1, 2014

Ooh. Is it possible that you're calling tx.Rollback() before the deferred stmt.Close() is executed? That would explain this as well.

@buro9
Copy link

buro9 commented May 1, 2014

Our logs show that none of the rollbacks have ever happened. Every "// err handle" logs a fair bit.

But if we're supposed to explicitly stmt.Close() before rollback I'll certainly make sure we do.

@johto
Copy link
Contributor

johto commented May 1, 2014

Our logs show that none of the rollbacks have ever happened. Every "// err handle" logs a fair bit.

Hmm.. OK.

But if we're supposed to explicitly stmt.Close() before rollback I'll certainly make sure we do.

Yeah, you have to do that. Somewhat unreliable repro here: https://gist.github.com/johto/f8ce866e44944eeb8ac5

It would be great if you could fix that first and then if the problem still reproduces (or the race detector finds something) I'll have another look.

@johto
Copy link
Contributor

johto commented May 1, 2014

Oh. I just realized that calling Commit() first would do this as well, so the fact that there aren't any log messages from the error path being taken doesn't disprove the theory.

@buro9
Copy link

buro9 commented May 1, 2014

And defer stmt.Close() doesn't deal with this?

@buro9
Copy link

buro9 commented May 1, 2014

Plus... we're using QueryRow() and Exec(), both of which are supposed to Close() the statement.

@johto
Copy link
Contributor

johto commented May 1, 2014

Plus... we're using QueryRow() and Exec(), both of which are supposed to Close() the statement.

The problem is as follows:

  1. A *Stmt is created within a transaction, and the connection is stored into the *Stmt struct.
  2. The transaction is committed or rolled back, and the connection is released back into the pool.
  3. Another goroutine grabs the (now unused) connection from the pool, runs a query, gets the results back. Now it's got a *Rows object, but is not holding a lock on the connection (inside database/sql it's still inUse=true, though).
  4. The deferred statement closer is executed. It grabs a lock on the connection (so as to assume exclusive access on the connection).
  5. The goroutine from step if db.Exec encount an error, it can not continue other db.Exec #3 calls *Rows.Close(), which doesn't take a lock on the connection. It's safe for it to do that since the connection's inUse flag is set, but nobody told the statement created in the now-committed transaction.
  6. Concurrent access to the *pq.conn, and you have a race :-(

And defer stmt.Close() doesn't deal with this?

I'm not sure what you mean here. Could you expand that a bit?

@buro9
Copy link

buro9 commented May 1, 2014

And defer stmt.Close() doesn't deal with this?

I'm not sure what you mean here. Could you expand that a bit?

In our example:

func UpdateLastActive(profileId int64, lastActive time.Time) {
    tx, err := db.Begin()
    // err handle
    defer tx.Rollback()

    stmt, err := tx.Prepare(`UPDATE profiles SET last_active = $2 WHERE profile_id = $1;`)
    // err handle
    defer stmt.Close()

    _, err = stmt.Exec(profileId, lastActive)
    // err handle with tx.Rollback()

    err = tx.Commit()
    // err handle
}

We call defer stmt.Close() straight after preparing the statement. Exec should also close the statement, but wouldn't the deferred Close() have deal with it too?

Or are you saying that Exec() may error, leaving the statement open, and that the we'll exit our function with a deferred tx.Rollback happening, and the deferred stmt.Close() will silently error too late?

@dgryski
Copy link

dgryski commented May 1, 2014

Hmm.. so purely an application level issue then. Any way to detect this from the library? Is there an https://en.wikipedia.org/wiki/ABA_problem .with inUse?

@johto
Copy link
Contributor

johto commented May 1, 2014

We call defer stmt.Close() straight after preparing the statement. Exec should also close the statement, but wouldn't the deferred Close() have deal with it too?

No, Exec doesn't close the statement. The entire point about doing a separate Prepare() step is that the statement can be reused. If you don't want to reuse the statement, you should just call Tx.Exec() directly.

@johto
Copy link
Contributor

johto commented May 1, 2014

Hmm.. so purely an application level issue then. Any way to detect this from the library?

I think it would be possible from database/sql, but not from the driver. I could look into that a bit later today.

Is there an https://en.wikipedia.org/wiki/ABA_problem .with inUse?

No, *Stmt.Close() never deals with inUse; it assumes that the transaction/connection pair in it hasn't gone anywhere, which in this case isn't true. Though even if we did notice this case, I'm not sure where to report the error since nobody's going to be checking the result of a deferred *Stmt.Close()..

@buro9
Copy link

buro9 commented May 1, 2014

I'll put in stmt.Close() with error checking and logging before every tx.Commit() and tx.Rollback().

I'll do that now and will let you know tomorrow if this has prevented more instances of the race condition in the log file.

@buro9
Copy link

buro9 commented May 1, 2014

For example, this makes UpdateLastActive() now look like this:

func UpdateLastActive(profileId int64, lastActive time.Time) (int, error) {

    tx, err := db.Begin()
    // log err
    defer tx.Rollback()

    stmt, err := tx.Prepare(`UPDATE profiles SET last_active = $2 WHERE profile_id = $1;`)
    // log err
    defer stmt.Close()

    _, err = stmt.Exec(profileId, lastActive)
    if err != nil {
        // log err

        merr = stmt.Close()
        // log merr

        nerr := tx.Rollback()
        // log nerr
    }

    err = stmt.Close()
    if err != nil {
        // log err

        nerr := tx.Rollback()
        // log nerr
    }

    err = tx.Commit()
    // log error

    return
}

@johto
Copy link
Contributor

johto commented May 1, 2014

@buro9: Looks correct to me.

The code gets a lot nicer by avoiding the explicit Prepare(), but it would be great if you could run this for a few days to make sure the race goes away.

@buro9
Copy link

buro9 commented May 1, 2014

The entire point about doing a separate Prepare() step is that the statement can be reused. If you don't want to reuse the statement, you should just call Tx.Exec() directly

The re-use is internal to Postgres, it's the cached query plan. If the name of the prepared statement server-side is a hash of the query then the same plan will be used for each query during the database session (connection) as asking to prepare something that is already in the plan cache should return a reference to the existing plan.

Surely it would only be necessary to explicitly re-use the Go stmt if the name of the prepared statement isn't actually related to the query being prepared (a name based on a serial or random id)?

This is my lack of knowledge about the inner workings of Postgres shining through. It may well be that Postgres is more primitive and overwrites an existing plan and thus this saves no work compared to a direct Exec() call.

@johto
Copy link
Contributor

johto commented May 1, 2014

The re-use is internal to Postgres, it's the cached query plan. If the name of the prepared statement server-side is a hash of the query then the same plan will be used for each query during the database session (connection) as asking to prepare something that is already in the plan cache should return a reference to the existing plan.

You are free to argue that, but:

  1. The statement name is not a hash. The statement's name is created in the driver, and is definitely not a hash of the query.
  2. If a client asks postgres to discard a previous statement (as it's an error to prepare a statement with the same name twice) and then asks it to create it again, that's what postgres is going to do. It's not trying to be any smarter than that; it's up to the client to be.

And in this case, database/sql actually is smarter. If you create a prepared statement on the *sql.DB and then use *Tx.Stmt() and the statement has already been prepared on that connection, you get that statement. But doing a *Tx.Prepare() for every query prevents any of that from happening.

Surely it would only be necessary to explicitly re-use the Go stmt if the name of the prepared statement isn't actually related to the query being prepared (a name based on a serial or random id)?

Right. And that's exactly how it works with the scheme you've implemented; see above.

@buro9
Copy link

buro9 commented May 1, 2014

Awesome, thanks for that too... we'll get back with any updates on trapped race's in the log tomorrow.

@johto
Copy link
Contributor

johto commented Jun 4, 2014

I'm going to assume this fixed the problem.

I still haven't looked into what, if anything, could be done on the database/sql side to possibly improve matters, but I'll try and look into that during the 1.4 development cycle.

@johto johto closed this as completed Jun 4, 2014
@buro9
Copy link

buro9 commented Jun 4, 2014

I hope leaving a comment doesn't re-open this.

Yes, confirmed it's now working fine.

The key was to not use Prepare() on a transaction.

I'm not sure I'd call it fixed though, as this is just a workaround as people are likely to assume all of the database/sql interfaces are safe to use without unexpected side effect. It may be worth just documenting this in the project readme or having some examples that avoid tx.Prepare() that people are likely to find and copy.

@mattrco
Copy link
Author

mattrco commented Jun 4, 2014

So to conclude, the following may cause a race condition (error handling omitted):

tx, err := db.Begin()
stmt, err := tx.Prepare(`QUERY`)

_, err = stmt.Exec(arg)

err = tx.Commit()
// Connection has now been returned to pool with putConn.
stmt.Close()

...because the connection is released by tx.Commit, but then used by stmt.Close (and in the meantime, anything else that acquires it). The fix is to ensure stmt.Close is called before Commit (or, indeed, Rollback).

Looking at these two lines in database/sql, it becomes clear:

http://golang.org/src/pkg/database/sql/sql.go#L1436
http://golang.org/src/pkg/database/sql/sql.go#L1046

@johto thanks for the help on this, much appreciated.

@johto
Copy link
Contributor

johto commented Sep 28, 2014

For anyone interested, this shouldn't be a problem starting from Go 1.4 because of https://code.google.com/p/go/source/detail?r=50ce4ec65c4a.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants