Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

database/sql: retry logic is unsafe #11978

Closed
jackc opened this issue Aug 1, 2015 · 6 comments
Closed

database/sql: retry logic is unsafe #11978

jackc opened this issue Aug 1, 2015 · 6 comments
Milestone

Comments

@jackc
Copy link

@jackc jackc commented Aug 1, 2015

What version of Go are you using (go version)?

jack@edi:~$ go version
go version go1.5beta3 linux/amd64

What operating system and processor architecture are you using?

jack@edi:~$ uname -a
Linux edi 3.13.0-61-generic #100-Ubuntu SMP Wed Jul 29 11:21:34 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

What did you do?

Execute a query with database/sql over an unreliable connection. For example:

update t set n=n+1;

What did you expect to see?

An error message that means the query failed and n is unchanged, or no error meaning the query executed exactly one time and n has been incremented by 1.

What did you see instead?

The query may have actually executed multiple times and thereby n may have been incremented multiple times.

Reproducing the error

It was necessary to build some infrastructure to reproduce this error. I wrote cavein, a TCP tunnel server that purposely breaks connections to aid in reproduction. That, combined with go_database_sql_retry_bug, a test application that executes many updates over the unreliable connections provided by cavein can reliably reproduce queries being executed extra times.

Suggested solution

Automatic retry should be removed from Exec and Query. It is impossible at the library layer to know if it is safe to retry a query.

@ianlancetaylor ianlancetaylor added this to the Go1.6 milestone Aug 3, 2015
@mikioh mikioh changed the title database/sql retry logic is unsafe database/sql: retry logic is unsafe Aug 3, 2015
@gopherbot
Copy link

@gopherbot gopherbot commented Aug 7, 2015

CL https://golang.org/cl/13350 mentions this issue.

@johto
Copy link
Contributor

@johto johto commented Aug 7, 2015

Automatic retry should be removed from Exec and Query. It is impossible at the library layer to know if it is safe to retry a query.

The driver documentation sayeth the following:

ErrBadConn should be returned by a driver to signal to the sql package
that a driver.Conn is in a bad state (such as the server having earlier
closed the connection) and the sql package should retry on a new
connection.

To prevent duplicate operations, ErrBadConn should NOT be returned
if there's a possibility that the database server might have performed
the operation. Even if the server sends back an error, you shouldn't
return ErrBadConn.

If this test case results in duplicate operations, it's the driver that's at fault, not database/sql's. Not a bug.

@bradfitz
Copy link
Contributor

@bradfitz bradfitz commented Aug 7, 2015

Yes, what @johto said.

@bradfitz bradfitz closed this Aug 7, 2015
@jackc
Copy link
Author

@jackc jackc commented Aug 7, 2015

Under what conditions could an Exec or Query correctly return ErrBadConn?

Once the query command is written to the socket, the driver has to consider that the query could have been executed. The only way it can safely assert that the query did not execute is if it receives a query failure response from the database server -- but if it receives a query failure message then the connection would still be usable -- so ErrBadConn would not be appropriate.

So it really seems the only situation where ErrBadConn is the correct response is on an error prior to socket write.

I suppose the driver can be blamed, but it feels like retry on Query and Exec is a foot-gun waiting to happen -- for minimal convenience improvement. This is not a theoretical assertion either, both drivers I tested, pq and my own pgx, trigger this undesirable behaviour.

@johto
Copy link
Contributor

@johto johto commented Aug 7, 2015

So it really seems the only situation where ErrBadConn is the correct response is on an error prior to socket write.

Right; you need to return it to get a bad connection to be discarded from the pool. So the order of events for e.g. Exec should be:

  1. See if the connection is already marked bad. If so, return ErrBadConn
  2. Write the query, its parameters etc. into the socket
  3. If a network error happens before you're completely "done" with processing the resulting row set or database-level error, mark the connection bad, and return the network error directly back to database/sql
  4. If the query was processed completely (in postgres' case, that would mean you saw ReadyForQuery), return either the result or the error from the database without marking the connection bad

Note that even without this issue, you already have to have the "bad" marker on a connection implemented because of #11264, so implementing it like this should not result in any unreasonable extra work.

I suppose the driver can be blamed, but it feels like retry on Query and Exec is a foot-gun waiting to happen -- for minimal convenience improvement.

Fine, maybe, but how would you discard bad connections from the pool in whatever it is you're suggesting?

This is not a theoretical assertion either, both drivers I tested, pq and my own pgx, trigger this undesirable behaviour.

Last time this came up on pq's issue tracker, I argued that pq's current behavior is broken, but lost that argument.

@jackc
Copy link
Author

@jackc jackc commented Aug 7, 2015

Right; you need to return it to get a bad connection to be discarded from the pool.

Okay, I wish it didn't work this way, but I understand the now.

Fine, maybe, but how would you discard bad connections from the pool in whatever it is you're suggesting?

The bad connection would have been caused by a previous query. The connection is marked as bad then. I would suggest checking for connection health on releasing the connection back to the pool instead of on next attempted use. This is what pgx's native connection pool does (when not using database/sql). Though I suppose that would require changing the database/sql driver interface, so that is not possible for Go 1.x.

Last time this came up on pq's issue tracker, I argued that pq's current behavior is broken, but lost that argument.

I've re-examined my test app and pgx, and determined I was incorrect in my assertion that it was vulnerable to this issue. However, pq definitely is. Perhaps this test app could be useful in persuading pq's maintainers.

@golang golang locked and limited conversation to collaborators Aug 8, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants
You can’t perform that action at this time.