Multiple "pq: unexpected describe rows response" errors #254
The first thing to check would be that you're closing the previous Rows object before trying to run another query in the same transaction.
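A minimal sketch of that advice, with hypothetical table and column names: drain and close one Rows object before issuing the next query on the same transaction.

```go
package example

import "database/sql"

// Hedged sketch (hypothetical schema): close the first result set
// before running a second query on the same *sql.Tx.
func twoQueries(tx *sql.Tx) ([]int64, int, error) {
	rows, err := tx.Query(`SELECT id FROM items`)
	if err != nil {
		return nil, 0, err
	}
	var ids []int64
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			rows.Close()
			return nil, 0, err
		}
		ids = append(ids, id)
	}
	// Close releases the connection for the next query; without this,
	// the second query would contend with a connection that is still
	// streaming the first result set.
	if err := rows.Close(); err != nil {
		return nil, 0, err
	}

	var n int
	if err := tx.QueryRow(`SELECT count(*) FROM items`).Scan(&n); err != nil {
		return nil, 0, err
	}
	return ids, n, nil
}
```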
It sounds like a connection that was being used for query A is somehow getting sucked into being used for query B. I've had similar problems with mysql and db handles surviving across fork(), but I'm not sure how that would end up happening with Go. I wonder if the race detector would show anything?
@johto I've double-checked, and we actually don't have any transactions that execute more than one query. And separately, rows are closed everywhere, so I don't think we are leaking anything.
@dgryski I can certainly try and see if it turns anything up.
Which exact commit of pq are you using?
Also, is …
I should've been clearer about the mutex waits above. I've edited the issue to reflect that the same errors weren't observed, so we shouldn't assume it is the same issue.
I don't suppose the problem is reproducible enough that you could capture the network traffic of one such session? I can't see anything obvious in lib/pq at this moment. I'll dive a bit deeper later tonight.
I've added race instrumentation, so we'll see if it turns anything up. Theoretically yes, we could tcpdump the traffic. It'll take me a little time to get that set up, but I can probably get it done tomorrow. At worst it can take a few days for this issue to appear, though. Thanks for the suggestions so far.
Hmm.. I wonder if this is interesting or not. Message type 90 is 'Z', i.e. ReadyForQuery, which the frontend would never send. This might suggest that there is, indeed, concurrent access to the pq.conn happening, as Damian suggested.
We've deployed a build with "-race" enabled. It hasn't yet triggered the usual symptoms (which ultimately result in what mattcottingham describes), but we figured you'd want to know about the race reports anyway.
Both of the calling queries are pretty much identical: they are single-statement transactions that prepare a statement and use QueryRow() (in auth) or Exec() (in updateLastActive) to execute it. There are no nested transaction statements or rows.Next() issues. This is using the latest version of lib/pq at commit c808a1b. The line in question is this one: https://github.com/lib/pq/blob/master/conn.go#L637 That is the attempt to read the type and 4-byte length value from the start of the message buffer. We'll continue monitoring for the actual issue.
Uhh.. Can I get a sanity check here? What business does …
Though strictly looking at this, http://golang.org/pkg/database/sql/driver/#Rows doesn't give the same guarantees as driver.Stmt. I wonder if we're relying on undefined behaviour here, or if we should shift the blame to database/sql.
Looks like at least the mysql driver is doing the same thing, though. Will continue looking at this..
I've found another unique race in the log file:
Same two statements, but now involving one of the statements closing and tripping over the other. This brings in https://github.com/lib/pq/blob/master/conn.go#L833
On the sanity check question, I don't know sql.go well enough to clarify anything there. Will read up on it tomorrow.
Hmm, no, that seems like a red herring in this case. The fact that sql.(*Stmt).Close() calls (*driverStmt).Close() directly means that the statement was prepared via sql.(*Tx).Prepare(), and not sql.(*DB).Prepare(). Does the application use per-transaction prepared statements?
Ah-hah! I can reproduce it here by closing a Rows object in a transaction when a Stmt.Close() on a statement created in the same transaction is called concurrently. Will have to do more digging to see where to put the blame..
Yes. An example of our code (minus extraneous logging and OTT error handling) for the auth function:
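As a hedged illustration of the single-statement-transaction pattern described in this thread (tx.Prepare followed by QueryRow, with deferred cleanup; all identifiers are hypothetical, not the application's actual code):

```go
package example

import "database/sql"

// Hypothetical reconstruction of the pattern discussed. Note the
// ordering that later turns out to be the problem: the deferred
// stmt.Close() runs only after tx.Commit() has already returned
// the connection to the pool.
func auth(db *sql.DB, token string) (int64, error) {
	tx, err := db.Begin()
	if err != nil {
		return 0, err
	}
	defer tx.Rollback() // no-op after a successful Commit

	stmt, err := tx.Prepare(`SELECT user_id FROM auth($1)`)
	if err != nil {
		return 0, err
	}
	defer stmt.Close() // fires after the Commit below

	var userID int64
	if err := stmt.QueryRow(token).Scan(&userID); err != nil {
		return 0, err
	}
	return userID, tx.Commit()
}
```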
Yes, just to be clear, our application uses both (*DB).Prepare and (*Tx).Prepare (for separate things).
Right. Actually, in the case where I can reproduce it, it would probably be the application's fault, since sql.(*Tx) is not declared to be safe to use concurrently from different goroutines. Are you doing that, or is there only ever one goroutine working on a single transaction?
Yes, we checked that there is no concurrent access to any transactions after reading #81. Could you post your code that reproduces the race just so we can compare?
In the cases highlighted by the race check, there is only a single statement, in a single function, in a single goroutine. My code snippet above is pretty much it, and we only do it in a transaction because the function being SELECTed from may modify data. We use transactions for all queries that modify data, even if it is only a single statement. There is potentially one place elsewhere in our code where we pass an entirely different transaction around where that might be true. I will triple-check and confirm shortly.
Checked and confirmed... we never use a transaction outside of a single goroutine.
Here's what I did: https://gist.github.com/johto/243016fa6fda3db541b4 But that would require unsafe concurrent access to the *Tx, so there must be something else going on here..
I would trust the race detector in this case, since it never gives false positives. If it saw a race, there was definitely a race. The race may be harmless, but that doesn't change the fact that there was unprotected access to shared mutable state. Those two warnings listed above need to be looked at.
@buro9: Can I get a similar overview of what UpdateLastActive() does as well?
@dgryski: Oh definitely. The race found by the race detector would explain the symptoms of the original problem, so I think we're on to something.
Sure... it's incredibly simple:
That is called fractionally before GetPermission() is called. GetPermission may be called within another goroutine. db is global to the application and all goroutines, but transactions are only ever created and used in a single goroutine.
Ooh. Is it possible that you're calling tx.Rollback() before the deferred stmt.Close() is executed? That would explain this as well.
Our logs show that none of the rollbacks have ever happened. Every "// err handle" logs a fair bit. But if we're supposed to explicitly stmt.Close() before rollback, I'll certainly make sure we do.
Hmm.. OK.
Yeah, you have to do that. Somewhat unreliable repro here: https://gist.github.com/johto/f8ce866e44944eeb8ac5 It would be great if you could fix that first, and then if the problem still reproduces (or the race detector finds something) I'll have another look.
Oh. I just realized that calling Commit() first would do this as well, so the fact that there aren't any log messages from the error path being taken doesn't disprove the theory.
And defer stmt.Close() doesn't deal with this?
Plus... we're using QueryRow() and Exec(), both of which are supposed to Close() the statement.
The problem is as follows:
I'm not sure what you mean here. Could you expand on that a bit?
In our example:
We call … Or are you saying that Exec() may error, leaving the statement open, and that we'll exit our function with a deferred tx.Rollback happening, and the deferred stmt.Close() will silently error too late?
Hmm.. so it's purely an application-level issue then. Any way to detect this from the library? Is there an ABA problem (https://en.wikipedia.org/wiki/ABA_problem) with inUse?
No, Exec doesn't close the statement. The entire point of doing a separate Prepare() step is that the statement can be reused. If you don't want to reuse the statement, you should just call Tx.Exec() directly.
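A hedged sketch of that suggestion (hypothetical schema and function name): for a one-shot statement, skip Prepare entirely and let the transaction run the query directly.

```go
package example

import "database/sql"

// Hedged sketch, not the application's actual code: Tx.Exec runs
// the statement directly, so there is no *sql.Stmt left to close
// after Commit returns the connection to the pool.
func updateLastActive(db *sql.DB, profileID int64) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	if _, err := tx.Exec(
		`UPDATE profiles SET last_active = now() WHERE profile_id = $1`,
		profileID,
	); err != nil {
		tx.Rollback()
		return err
	}
	return tx.Commit()
}
```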
I think it would be possible from database/sql, but not from the driver. I could look into that a bit later today.
No, *Stmt.Close() never deals with inUse; it assumes that the transaction/connection pair in it hasn't gone anywhere, which in this case isn't true. Though even if we did notice this case, I'm not sure where to report the error, since nobody's going to be checking the result of a deferred *Stmt.Close()..
I'll put in stmt.Close() with error checking and logging before every tx.Commit() and tx.Rollback(). I'll do that now and will let you know tomorrow if this has prevented more instances of the race condition in the log file.
For example, this makes UpdateLastActive() now look like this:
@buro9: Looks correct to me. The code gets a lot nicer by avoiding the explicit Prepare(), but it would be great if you could run this for a few days to make sure the race goes away.
The reuse is internal to Postgres: it's the cached query plan. If the name of the server-side prepared statement is a hash of the query, then the same plan will be used for each query during the database session (connection), as asking to prepare something that is already in the plan cache should return a reference to the existing plan. Surely it would only be necessary to explicitly reuse the Go stmt if the name of the prepared statement isn't actually related to the query being prepared (a name based on a serial or random id)? This is my lack of knowledge about the inner workings of Postgres showing through. It may well be that Postgres is more primitive and overwrites an existing plan, and thus this saves no work compared to a direct Exec() call.
You are free to argue that, but:
And in this case, database/sql actually is smarter. If you create a prepared statement on the *sql.DB and then use *Tx.Stmt() and the statement has already been prepared on that connection, you get that statement. But doing a *Tx.Prepare() for every query prevents any of that from happening.
Right. And that's exactly how it works with the scheme you've implemented; see above.
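The *Tx.Stmt() path mentioned above can be sketched as follows (hypothetical names, not the application's code): prepare once on the *sql.DB, then adopt the statement into each transaction, letting database/sql reuse a statement already prepared on that connection.

```go
package example

import "database/sql"

// Prepared once, e.g. at startup, on the *sql.DB.
var getPermission *sql.Stmt

func initStatements(db *sql.DB) error {
	var err error
	getPermission, err = db.Prepare(`SELECT get_permission($1)`)
	return err
}

func checkPermission(db *sql.DB, userID int64) (bool, error) {
	tx, err := db.Begin()
	if err != nil {
		return false, err
	}
	defer tx.Rollback() // no-op after a successful Commit

	// Tx.Stmt returns a transaction-scoped statement; if the query
	// is already prepared on this connection, that copy is reused.
	stmt := tx.Stmt(getPermission)

	var ok bool
	if err := stmt.QueryRow(userID).Scan(&ok); err != nil {
		stmt.Close()
		return false, err
	}
	// Close the transaction-scoped statement before Commit returns
	// the connection to the pool.
	if err := stmt.Close(); err != nil {
		return false, err
	}
	return ok, tx.Commit()
}
```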
Awesome, thanks for that too... we'll get back with any updates on trapped races in the log tomorrow.
I'm going to assume this fixed the problem. I still haven't looked into what, if anything, could be done on the database/sql side to possibly improve matters, but I'll try and look into that during the 1.4 development cycle.
I hope leaving a comment doesn't re-open this. Yes, confirmed it's now working fine. The key was to not use Prepare() on a transaction. I'm not sure I'd call it fixed, though, as this is just a workaround; people are likely to assume all of the database/sql interfaces are safe to use without unexpected side effects. It may be worth documenting this in the project readme, or having some examples that avoid tx.Prepare() that people are likely to find and copy.
So to conclude, the following may cause a race condition (error handling omitted):

```go
tx, err := db.Begin()
stmt, err := tx.Prepare(`QUERY`)
_, err = stmt.Exec(arg)
err = tx.Commit()
// Connection has now been returned to the pool with putConn.
stmt.Close()
```

...because the connection is released by putConn before stmt.Close() runs. Looking at these two lines in database/sql, it becomes clear: http://golang.org/src/pkg/database/sql/sql.go#L1436 @johto thanks for the help on this, much appreciated.
For anyone interested, this shouldn't be a problem starting from Go 1.4 because of https://code.google.com/p/go/source/detail?r=50ce4ec65c4a. |
We've been experiencing errors with lib/pq that eventually result in a state where no database connections in the pool are free. Max connections as set by SetMaxOpenConns is not reached. The errors normally start with db.Prepare:

We also see the unexpected response 'C'. These develop into multiple occurrences of

and similar errors on statements that have previously been prepared with no errors. To reiterate, we don't query any prepared statements that returned errors from Prepare. These errors are returned from queries on successfully prepared statements. Then, we see occurrences of failures of Begin:

The only error in postgres around this time is:

which happened around 20s after the initial errors were seen in our application log. pprof indicates that all goroutines querying the database were netpolling:

In a separate instance of unresponsiveness, all the goroutines were waiting on mutexes, e.g.:

The logged errors weren't present in that case, so these could be two totally different problems. We're running Ubuntu 12.04 and pg 9.2 with go 1.2.1, and we're not assuming a trouble-free network by any means. I appreciate this is somewhat vague and we don't have a reproducible test case (yet), but any pointers for further investigation would be appreciated. Let me know if further background info would be helpful.