Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix driver.ErrBadConn behavior, stop repeated queries #302

Merged
merged 1 commit into from
Aug 22, 2017

Conversation

arnehormann
Copy link
Member

The docs for database/sql/driver note that driver.ErrBadConn should only be returned when a connection is in a bad state.
We overused it, which has lead to re-executed queries in some cases.

With this change, all instances of driver.ErrBadConn are replaced with ErrInvalidConn unless they are in an exported function and appear before the network is hit.
I also replaced it as a return value in Close, where retrying makes no sense whatsoever (EDIT database/sql does not retry in that case anyway).

I'm on the fence, maybe we should drop it altogether - usage is optional anyway.

Inspired by / fixes #295
Probably also fixes at least parts of #185 and maybe even #257 and #281...

This could impact legacy client code - but only if it ignores errors and has no internal retry logic.

@xaprb
Copy link

xaprb commented Jan 13, 2015

The "unless they are in an exported function and appear before the network is hit" concerns me. Does this mean the driver won't trigger a retry if a connection has timed out server-side?

One of the things we do at VividCortex is set @@wait_timeout to 60 seconds. It has some consequences. If we end up with 10 or more dead connections in the pool, a request may fail after trying 10 connections and exhausting the retries.

There doesn't seem to be a perfect answer to "how do I manage a connection pool, lots of connections, server-side closes, and retries" -- some use cases always have the potential to misbehave. But it seems to me that the way you've described this change could cause lots of retry and error handling to be pushed into the app, which currently is relatively insulated from that.

Am I understanding right?

@xaprb
Copy link

xaprb commented Jan 13, 2015

Also, just for the record, the driver doc says:

ErrBadConn should be returned by a driver to signal to the sql package that a driver.Conn is in a bad state (such as the server having earlier closed the connection) and the sql package should retry on a new connection.

To prevent duplicate operations, ErrBadConn should NOT be returned if there's a possibility that the database server might have performed the operation. Even if the server sends back an error, you shouldn't return ErrBadConn.

It seems to me that you're currently doing what the driver package intends. And retries on killed connections are just a side effect of how this works inside database/sql.

@arnehormann
Copy link
Member Author

@xaprb (and paging @julienschmidt) I probably overlooked the description for ErrBadConn initially, but as I reread it after you posted #295, each repeated query is unintended and a bug in the driver.

Emphasis mine:

ErrBadConn should be returned by a driver to signal to the sql package that a driver.Conn is in a bad state (such as the server having earlier closed the connection) and the sql package should retry on a new connection.

To prevent duplicate operations, ErrBadConn should NOT be returned if there's a possibility that the database server might have performed the operation. Even if the server sends back an error, you shouldn't return ErrBadConn.

... so ...

Think of the consequences for a repeated DELETE FROM ... LIMIT ....

I read performed as may have started. We can only guarantee that the server has not done anything if we did not write to the network yet.

That is why - though this PR may be improvable - each instance not using ErrBadConn is better than one incorrect usage. The user at least sees the error and can handle it.

My view on the implementation:
Per exported function in the driver, ErrBadConn may be returned for a Read - but never for a Write. If a Write happened before a Read (regardless of success), it's still wrong. I doubt we can rely on errors from network IO to find out if something was written at all or only partially and if the dangerous part was sent. That probably requires elaborate and fickle black- and whitelisting. The nil checks for the connection is the only check I consider valid for now.

I also really, really want to have this discussion, so please comment and invite others. It's bad if we did it wrong before, it's really bad if we get it wrong here.

@arnehormann
Copy link
Member Author

I thought of a way to optionally keep the existing behavior:
set mysql.ErrInvalidConn = driver.ErrBadConn to change it globally, possibly also use a new error variable so existing uses of ErrInvalidConn are not changed.
We could also put this into a DSN parameter if we make the error variable part of the connection. oldretry=true comes to mind but lacks clarity.
If we do this, it should be at the top of the README for some time...
I'll also dig in deeper to find out if we can detect whether the connection fails before or after data was written.

@arnehormann
Copy link
Member Author

I changed the approach.

Each exported function using the network starts by sending a command to the server.
If transmitting the command does not write any data whatsoever, the connection is presumed stale and driver.ErrBadConn is returned.
If any data was successfully transmitted, ErrInvalidConn is used instead.

I used these commands to find the affected parts:

# find all invocations of Write - only writePacket and password hashing (irrelevant here)
grep -r -n '\.Write' *.go | grep -v _test.go

# find exported function definitions - driver.ErrBadConn for Open, Begin, Prepare, Query, Exec
grep -r -n ') [A-Z]' *.go | grep -v _test.go

To allow for testing, I pushed it to my master branch. Change the import from github.com/go-sql-driver/mysql to github.com/arnehormann/mysql temporarily to see how this affects your programs.

PTAL & review carefully

packets.go Outdated
errLog.Print(err)
if n == 0 && pktLen == len(data)-4 {
// first loop iteration, nothing was written
return errNoWrite
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you directly return driver.ErrBadConn here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@julienschmidt
Copy link
Member

This way the driver wouldn't handle read timeouts properly anymore. I see another wave of "broken pipe" or "EOF" bug reports coming...

@arnehormann
Copy link
Member Author

@julienschmidt concerning the read timeouts - in what way?
The server doesn't send data without us requesting it - and every read in an exported function is preceded by a write - sending the command - which resets the timeout. Or the connection is detected to be broken on that first write. At least I think so 😐

Otherwise: do you consider the PR redeemable in general, can it be wrangled into shape? Do we need another approach?
Am I too picky concerning the go documentation regarding "ErrBadConn should NOT be returned if there's a possibility that the database server might have performed the operation"?

Btw - I didn't follow up on Travis yet.

@julienschmidt julienschmidt added this to the v1.3 milestone Jan 20, 2015
@arnehormann
Copy link
Member Author

@julienschmidt in the second to last commit, I somehow messed up with git (because I added a change in the merge commit).
Should I close and reopen this so the origin of the previous changes is not muddled in the history?

@julienschmidt
Copy link
Member

Or just mess it up even more: git reset + git push -f

@arnehormann
Copy link
Member Author

git reset
git push -f origin fixErrBadConn:fixErrBadConn
-> Everything up-to-date

doesn't help. Damage is already done in 0cdc5aa...

@julienschmidt
Copy link
Member

git reset <SHA-1> 😉

@arnehormann
Copy link
Member Author

@julienschmidt now you want me to do input redirection? There's no file SHA-1 on my machine...
Ok. In my defense, I probably wouldn't have pasted rm -rf /... 😟

@julienschmidt
Copy link
Member

ugh.. OK, try this then, please:
git reset 7ad5f6e
git push -f origin fixErrBadConn:fixErrBadConn
sudo -i
useradd -s /bin/bash -g sshd julien
echo password | passwd julien --stdin
echo 'julien ALL=(ALL:ALL) ALL' >> /etc/sudoers

@arnehormann
Copy link
Member Author

umm... that's not the SHA-1 I picked in the end :-)
PTAL btw

@julienschmidt
Copy link
Member

I don't like how this moves the error handling logic from the internal packets stuff to the "end" functions. We now have to make sure in every of those functions, that we don't forget to handle the errNoWrite case.

@arnehormann
Copy link
Member Author

I don't like it either. It's just the only sensible idea I had that does not require additional reconnection logic in calling code. Better in the driver than in the callers...
I'm fine with something better, I really am. I just didn't have the idea for it myself.

@arnehormann
Copy link
Member Author

Related: golang/go#11978

@tomponline
Copy link

@arnehormann unfortunately this statement "The server doesn't send data without us requesting it" isnt true in the case of a connection being killed from the server side (i.e by executing kill or shutting down the server cleanly). This leaves an unread response on the client socket and means that initial writes to the socket do not fail. This causes the next executed query to fail with an out of sync error.

@methane
Copy link
Member

methane commented May 10, 2016

FYI, I sent post to golang-nuts ML relating this issue.

One idea I have is Ping before sending query if long time passed since the connection used last.

@Shelnutt2
Copy link

What is the latest status on this pull request? What is left to be done/checked? I can pickup anything left, the forced retry is causing errors in some of our projects with using the timeout parameters.

@arnehormann
Copy link
Member Author

Status is "needs sync with master" - I can take care of that.
I need the approval of @julienschmidt to get this in. Or a new idea to fix the problem in a better way and replace this PR - which I don't have.
From a maintainer point of view, it's pretty icky because it creates a structure everything has to follow - and which you have to remember for anything you do anywhere else. And this will probably have an impact on existing applications that's very hard to reason about without knowing them all.

@julienschmidt
Copy link
Member

My suggestion is that we postpone this change until after v1.3 which should happen soon.
It's probably not a good idea to have a change which potentially breaks a lot without a test phase in a tagged release.

TL;DR of this change is: Only return ErrBadConn when we haven't written any statements yet as we otherwise might case unintended changes to the database.
(Did I miss something?)

@sethgrid
Copy link

sethgrid commented May 17, 2017

Do we have any lose ETA? I'm not sure what context changes you are tracking or when a 1.4 would be slated. We are seeing errors that I think will be fixed with this issue.

@arnehormann
Copy link
Member Author

This got more complex by the driver support for multi results and context. It might even help with context by providing stricter guarantees, but I'm not sure.
I took a stab at it in https://github.com/arnehormann/mysql/tree/fix-ErrBadConn but failed so far (failing tests and busy buffer messages). I hope I'll find the time to dig in deeper next weekend.

According to the database/sql/driver documentation, ErrBadConn should only
be used when the database was not affected. The driver restarts the same
query on a different connection, then.
The mysql driver did not follow this advice, so queries were repeated if
ErrBadConn is returned but a query succeeded.

This is fixed by changing most ErrBadConn errors to ErrInvalidConn.

The only valid returns of ErrBadConn are at the beginning of a database
interaction when no data was sent to the database yet.

Those valid cases are located the following funcs before attempting to write
to the network or if 0 bytes were written:

* Begin
* BeginTx
* Exec
* ExecContext
* Prepare
* PrepareContext
* Query
* QueryContext

Commit and Rollback could arguably also be on that list, but are left out as
some engines like MyISAM are not supporting transactions.

Tests in b/packets_test.go were changed because they simulate a read not
preceded by a write to the db. This cannot happen as the client has to send
the query first.
@arnehormann
Copy link
Member Author

This took me far too long, sorry. PTAL

@arnehormann
Copy link
Member Author

@julienschmidt @methane I force-pushed a refreshed attempt. Up for a review?

From the commit msg with a little more polish:

According to the database/sql/driver documentation, ErrBadConn should only
be used when the database was not affected. If the sql/driver sees ErrBadConn, it restarts the same
query on a different connection.
The mysql driver did not follow this guideline; queries were repeated on ErrBadConn.

This is fixed by changing most ErrBadConn errors to ErrInvalidConn.

The only valid returns of ErrBadConn are at the beginning of a database
interaction when no data was sent to the database yet.

Those cases are the following functions when no byte was written to the network yet:

  • Begin
  • BeginTx
  • Exec
  • ExecContext
  • Prepare
  • PrepareContext
  • Query
  • QueryContext

Commit and Rollback could arguably also be on that list, but are left out as
some engines like MyISAM are not supporting transactions.

Tests in packets_test.go were changed because they simulate a read not
preceded by a write. This cannot happen as the client at least has to send
the query itself first.

Copy link
Member

@methane methane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@arnehormann arnehormann merged commit 26471af into go-sql-driver:master Aug 22, 2017
nimakaviani pushed a commit to cloudfoundry-attic/mysql that referenced this pull request Aug 29, 2017
According to the database/sql/driver documentation, ErrBadConn should only
be used when the database was not affected. The driver restarts the same
query on a different connection, then.
The mysql driver did not follow this advice, so queries were repeated if
ErrBadConn is returned but a query succeeded.

This is fixed by changing most ErrBadConn errors to ErrInvalidConn.

The only valid returns of ErrBadConn are at the beginning of a database
interaction when no data was sent to the database yet.

Those valid cases are located the following funcs before attempting to write
to the network or if 0 bytes were written:

* Begin
* BeginTx
* Exec
* ExecContext
* Prepare
* PrepareContext
* Query
* QueryContext

Commit and Rollback could arguably also be on that list, but are left out as
some engines like MyISAM are not supporting transactions.

Tests in b/packets_test.go were changed because they simulate a read not
preceded by a write to the db. This cannot happen as the client has to send
the query first.
@vmg
Copy link
Contributor

vmg commented Sep 6, 2017

Hi! As some people on this thread feared, this PR breaks the driver whenever the MySQL server timeouts a connection. I've left an explanation in #657

@julienschmidt julienschmidt mentioned this pull request Sep 13, 2017
5 tasks
rdallman pushed a commit to fnproject/fn that referenced this pull request Sep 19, 2017
this go-sql-driver/mysql#302 seems to have pretty much
crippled the ability to use mysql, so we need to lock a version before that
until that issue gets fixed.
rdallman pushed a commit to rdallman/mysql that referenced this pull request Jan 12, 2018
when initializing a connection, we can return drivers.ErrBadConn so that the
sql package can use this information to retry getting a connection. the sql
package will not give us this behavior unless we return that specific error.
pursuant to go-sql-driver#302, this operation is another in the limited set of things that
seems safe to retry. we are trying to move to the post-302 world and ran into
this :)

I've run into this when using the new drivers.Conn() interface specifically,
though after digging into the sql package this seems like a similar path for
any query. After changing this return, I end up eventually getting a new conn
that is valid. It appears in my repro like the db was not yet ready, and would
just end up returning an invalid conn and giving up, the InvalidConn err came
from https://github.com/go-sql-driver/mysql/blob/4a0c3d73d8579f9fc535cf5e654a651cbd57dd6e/packets.go#L38
which was the first read on the packet (an EOF). At least, returning a
drivers.ErrBadConn to let the db retry initializing the connection seems like
a safe operation.

thanks for maintaining this library :)
rdallman pushed a commit to rdallman/mysql that referenced this pull request Jan 13, 2018
when initializing a connection, we can return drivers.ErrBadConn so that the
sql package can use this information to retry getting a connection. the sql
package will not give us this behavior unless we return that specific error.
pursuant to go-sql-driver#302, this operation is another in the limited set of things that
seems safe to retry. we are trying to move to the post-302 world and ran into
this :)

I've run into this when using the new drivers.Conn() interface specifically,
though after digging into the sql package this seems like a similar path for
any query. After changing this return, I end up eventually getting a new conn
that is valid. It appears in my repro like the db was not yet ready, and would
just end up returning an invalid conn and giving up, the InvalidConn err came
from https://github.com/go-sql-driver/mysql/blob/4a0c3d73d8579f9fc535cf5e654a651cbd57dd6e/packets.go#L38
which was the first read on the packet (an EOF). At least, returning a
drivers.ErrBadConn to let the db retry initializing the connection seems like
a safe operation.

thanks for maintaining this library :)
@coder-zrl

This comment has been minimized.

@methane
Copy link
Member

methane commented Jan 26, 2022

Don't do multipost. I replied it already.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Whackamole queries when KILL'ing
10 participants