fix driver.ErrBadConn behavior, stop repeated queries #302
The "unless they are in an exported function and appear before the network is hit" part concerns me. Does this mean the driver won't trigger a retry if a connection has timed out server-side? One of the things we do at VividCortex is set There doesn't seem to be a perfect answer to "how do I manage a connection pool, lots of connections, server-side closes, and retries": some use cases always have the potential to misbehave. But it seems to me that the change as you've described it could push a lot of retry and error handling into the app, which is currently relatively insulated from that. Am I understanding this right?
Also, just for the record, the driver doc says:
It seems to me that you're currently doing what the driver package intends. And retries on killed connections are just a side effect of how this works inside database/sql.
@xaprb (and paging @julienschmidt) I probably overlooked the description for ErrBadConn initially, but rereading it after you posted #295, I see that each repeated query is unintended and a bug in the driver. Emphasis mine:
ErrBadConn should be returned by a driver to signal to the sql package that a driver.Conn is in a bad state (such as the server having earlier closed the connection) and the sql package should retry on a new connection. To prevent duplicate operations, ErrBadConn should NOT be returned if there's a possibility that the database server might have performed the operation. Even if the server sends back an error, you shouldn't return ErrBadConn.
... so ... Think of the consequences for a repeated I read performed as may have started. We can only guarantee that the server has not done anything if we did not write to the network yet. That is why - though this PR may be improvable - each instance not using My view on the implementation: I also really, really want to have this discussion, so please comment and invite others. It's bad if we did it wrong before; it's really bad if we get it wrong here.
I thought of a way to optionally keep the existing behavior:
I changed the approach. Each exported function using the network starts by sending a command to the server. I used these commands to find the affected parts:
# find all invocations of Write - only writePacket and password hashing (irrelevant here)
grep -r -n '\.Write' *.go | grep -v _test.go
# find exported function definitions - driver.ErrBadConn for Open, Begin, Prepare, Query, Exec
grep -r -n ') [A-Z]' *.go | grep -v _test.go
To allow for testing, I pushed it to my master branch. Change the import from
PTAL & review carefully.
packets.go
errLog.Print(err)
if n == 0 && pktLen == len(data)-4 {
	// first loop iteration, nothing was written
	return errNoWrite
}
Why don't you return driver.ErrBadConn directly here?
This way the driver wouldn't handle read timeouts properly anymore. I see another wave of "broken pipe" or "EOF" bug reports coming...
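The zero-bytes check in the packets.go excerpt can be modeled as a chunked write loop. This sketch assumes a plain io.Writer instead of the driver's connection and buffer handling, and reuses the names errNoWrite and errInvalidConn only for illustration. A failure is classified as errNoWrite only when the very first chunk wrote nothing; any partial write means the server may have received part of the command. It does not address the read-timeout concern raised here.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// Names borrowed from the packets.go excerpt purely for illustration.
var (
	errNoWrite     = errors.New("nothing written to the network")
	errInvalidConn = errors.New("invalid connection")
)

// writeChunks writes data in chunks of at most chunkSize (> 0) bytes, similar
// to how large MySQL packets are split. Only a failure that wrote zero bytes
// of the first chunk is errNoWrite; anything else may have reached the server.
func writeChunks(w io.Writer, data []byte, chunkSize int) error {
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		n, err := w.Write(data[off:end])
		if err == nil && n != end-off {
			err = io.ErrShortWrite // defensive: a correct io.Writer never does this
		}
		if err != nil {
			if off == 0 && n == 0 {
				return errNoWrite
			}
			return errInvalidConn
		}
	}
	return nil
}

// flakyWriter accepts up to budget bytes in total, then fails like a closed socket.
type flakyWriter struct{ budget int }

func (f *flakyWriter) Write(p []byte) (int, error) {
	n := len(p)
	if n > f.budget {
		n = f.budget
	}
	f.budget -= n
	if n < len(p) {
		return n, errors.New("broken pipe")
	}
	return n, nil
}

func main() {
	data := []byte("0123456789")
	fmt.Println(writeChunks(&flakyWriter{budget: 0}, data, 4))  // nothing written to the network
	fmt.Println(writeChunks(&flakyWriter{budget: 6}, data, 4))  // invalid connection
	fmt.Println(writeChunks(&flakyWriter{budget: 99}, data, 4)) // <nil>
}
```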
@julienschmidt Concerning the read timeouts: in what way? Otherwise, do you consider the PR redeemable in general? Can it be wrangled into shape, or do we need another approach? Btw, I haven't followed up on Travis yet.
@julienschmidt In the second-to-last commit, I somehow messed up with git (because I added a change in the merge commit).
Or just mess it up even more:
git reset
git push -f origin fixErrBadConn:fixErrBadConn
-> Everything up-to-date
That doesn't help. The damage is already done in 0cdc5aa...
(force-pushed from 14ffb5c to 74b890f)
@julienschmidt Now you want me to do input redirection? There's no file SHA-1 on my machine...
Ugh... OK, try this then, please:
Umm... that's not the SHA-1 I picked in the end :-)
I don't like how this moves the error handling logic from the internal packets stuff to the "end" functions. We now have to make sure in every one of those functions that we don't forget to handle the
I don't like it either. It's just the only sensible idea I had that does not require additional reconnection logic in calling code. Better in the driver than in the callers...
Related: golang/go#11978
@arnehormann Unfortunately, the statement "The server doesn't send data without us requesting it" isn't true when a connection is killed from the server side (i.e., by executing
FYI, I sent a post to the golang-nuts mailing list about this issue. One idea I have is to Ping before sending a query if a long time has passed since the connection was last used.
What is the latest status on this pull request? What is left to be done or checked? I can pick up anything left; the forced retry is causing errors in some of our projects that use the timeout parameters.
Status is "needs sync with master" - I can take care of that.
My suggestion is that we postpone this change until after v1.3, which should happen soon. TL;DR of this change is: Only return
Do we have even a loose ETA? I'm not sure what context changes you are tracking or when a 1.4 would be slated. We are seeing errors that I think this PR will fix.
This got more complex with the driver's support for multiple results and context. It might even help with context by providing stricter guarantees, but I'm not sure.
According to the database/sql/driver documentation, ErrBadConn should only be used when the database was not affected. The sql package then restarts the same query on a different connection. The mysql driver did not follow this advice, so queries were repeated if ErrBadConn was returned although the query had succeeded.
This is fixed by changing most ErrBadConn errors to ErrInvalidConn.
The only valid returns of ErrBadConn are at the beginning of a database interaction, when no data was sent to the database yet. Those valid cases are located in the following funcs, before attempting to write to the network or if 0 bytes were written:
* Begin
* BeginTx
* Exec
* ExecContext
* Prepare
* PrepareContext
* Query
* QueryContext
Commit and Rollback could arguably also be on that list, but are left out as some engines like MyISAM do not support transactions.
Tests in packets_test.go were changed because they simulate a read not preceded by a write to the db. This cannot happen as the client has to send the query first.
(force-pushed from 74b890f to 2b7a3e9)
This took me far too long, sorry. PTAL
@julienschmidt @methane I force-pushed a refreshed attempt. Up for a review? From the commit msg, with a little more polish:
According to the database/sql/driver documentation, ErrBadConn should only
This is fixed by changing most ErrBadConn errors to ErrInvalidConn.
The only valid returns of ErrBadConn are at the beginning of a database
Those cases are the following functions when no byte was written to the network yet:
Commit and Rollback could arguably also be on that list, but are left out as
Tests in packets_test.go were changed because they simulate a read not
LGTM
Hi! As some people on this thread feared, this PR breaks the driver whenever the MySQL server times out a connection. I've left an explanation in #657.
go-sql-driver/mysql#302 seems to have pretty much crippled the ability to use MySQL, so we need to lock to a version from before it until that issue gets fixed.
When initializing a connection, we can return driver.ErrBadConn so that the sql package can use this information to retry getting a connection. The sql package will not give us this behavior unless we return that specific error. Pursuant to go-sql-driver#302, this operation is another in the limited set of things that seem safe to retry. We are trying to move to the post-302 world and ran into this :) I've run into this when using the new drivers.Conn() interface specifically, though after digging into the sql package this seems like a similar path for any query. After changing this return, I eventually get a new conn that is valid. In my repro it appears that the db was not yet ready, so the driver returned an invalid conn and gave up; the InvalidConn err came from https://github.com/go-sql-driver/mysql/blob/4a0c3d73d8579f9fc535cf5e654a651cbd57dd6e/packets.go#L38, which was the first read on the packet (an EOF). At the very least, returning driver.ErrBadConn to let the db retry initializing the connection seems like a safe operation. Thanks for maintaining this library :)
Please don't cross-post. I already replied to it.
The docs for database/sql/driver note that driver.ErrBadConn should only be returned when a connection is in a bad state. We overused it, which has led to re-executed queries in some cases.
With this change, all instances of driver.ErrBadConn are replaced with ErrInvalidConn unless they are in an exported function and appear before the network is hit.
I also replaced it as a return value in Close, where retrying makes no sense whatsoever (EDIT: database/sql does not retry in that case anyway). I'm on the fence: maybe we should drop it altogether, as usage is optional anyway.
Inspired by / fixes #295
Probably also fixes at least parts of #185 and maybe even #257 and #281...
This could impact legacy client code - but only if it ignores errors and has no internal retry logic.