fix driver.ErrBadConn behavior, stop repeated queries #302

arnehormann · 2015-01-13T17:24:55Z

The docs for database/sql/driver note that driver.ErrBadConn should only be returned when a connection is in a bad state.
We overused it, which has lead to re-executed queries in some cases.

With this change, all instances of driver.ErrBadConn are replaced with ErrInvalidConn unless they are in an exported function and appear before the network is hit.
I also replaced it as a return value in Close, where retrying makes no sense whatsoever (EDIT database/sql does not retry in that case anyway).

I'm on the fence, maybe we should drop it altogether - usage is optional anyway.

Inspired by / fixes #295
Probably also fixes at least parts of #185 and maybe even #257 and #281...

This could impact legacy client code - but only if it ignores errors and has no internal retry logic.

xaprb · 2015-01-13T17:33:03Z

The "unless they are in an exported function and appear before the network is hit" concerns me. Does this mean the driver won't trigger a retry if a connection has timed out server-side?

One of the things we do at VividCortex is set @@wait_timeout to 60 seconds. It has some consequences. If we end up with 10 or more dead connections in the pool, a request may fail after trying 10 connections and exhausting the retries.

There doesn't seem to be a perfect answer to "how do I manage a connection pool, lots of connections, server-side closes, and retries" -- some use cases always have the potential to misbehave. But it seems to me that the way you've described this change could cause lots of retry and error handling to be pushed into the app, which currently is relatively insulated from that.

Am I understanding right?

xaprb · 2015-01-13T17:35:07Z

Also, just for the record, the driver doc says:

ErrBadConn should be returned by a driver to signal to the sql package that a driver.Conn is in a bad state (such as the server having earlier closed the connection) and the sql package should retry on a new connection.

To prevent duplicate operations, ErrBadConn should NOT be returned if there's a possibility that the database server might have performed the operation. Even if the server sends back an error, you shouldn't return ErrBadConn.

It seems to me that you're currently doing what the driver package intends. And retries on killed connections are just a side effect of how this works inside database/sql.

arnehormann · 2015-01-13T22:36:00Z

@xaprb (and paging @julienschmidt) I probably overlooked the description for ErrBadConn initially, but as I reread it after you posted #295, each repeated query is unintended and a bug in the driver.

Emphasis mine:

ErrBadConn should be returned by a driver to signal to the sql package that a driver.Conn is in a bad state (such as the server having earlier closed the connection) and the sql package should retry on a new connection.

To prevent duplicate operations, ErrBadConn should NOT be returned if there's a possibility that the database server might have performed the operation. Even if the server sends back an error, you shouldn't return ErrBadConn.

... so ...

Think of the consequences for a repeated DELETE FROM ... LIMIT ....

I read performed as may have started. We can only guarantee that the server has not done anything if we did not write to the network yet.

That is why - though this PR may be improvable - each instance not using ErrBadConn is better than one incorrect usage. The user at least sees the error and can handle it.

My view on the implementation:
Per exported function in the driver, ErrBadConn may be returned for a Read - but never for a Write. If a Write happened before a Read (regardless of success), it's still wrong. I doubt we can rely on errors from network IO to find out if something was written at all or only partially and if the dangerous part was sent. That probably requires elaborate and fickle black- and whitelisting. The nil checks for the connection is the only check I consider valid for now.

I also really, really want to have this discussion, so please comment and invite others. It's bad if we did it wrong before, it's really bad if we get it wrong here.

arnehormann · 2015-01-14T06:29:55Z

I thought of a way to optionally keep the existing behavior:
set mysql.ErrInvalidConn = driver.ErrBadConn to change it globally, possibly also use a new error variable so existing uses of ErrInvalidConn are not changed.
We could also put this into a DSN parameter if we make the error variable part of the connection. oldretry=true comes to mind but lacks clarity.
If we do this, it should be at the top of the README for some time...
I'll also dig in deeper to find out if we can detect whether the connection fails before or after data was written.

arnehormann · 2015-01-19T11:32:18Z

I changed the approach.

Each exported function using the network starts by sending a command to the server.
If transmitting the command does not write any data whatsoever, the connection is presumed stale and driver.ErrBadConn is returned.
If any data was successfully transmitted, ErrInvalidConn is used instead.

I used these commands to find the affected parts:

# find all invocations of Write - only writePacket and password hashing (irrelevant here)
grep -r -n '\.Write' *.go | grep -v _test.go

# find exported function definitions - driver.ErrBadConn for Open, Begin, Prepare, Query, Exec
grep -r -n ') [A-Z]' *.go | grep -v _test.go

To allow for testing, I pushed it to my master branch. Change the import from github.com/go-sql-driver/mysql to github.com/arnehormann/mysql temporarily to see how this affects your programs.

PTAL & review carefully

julienschmidt · 2015-01-19T17:48:08Z

packets.go

+		errLog.Print(err)
+		if n == 0 && pktLen == len(data)-4 {
+			// first loop iteration, nothing was written
+			return errNoWrite


Why don't you directly return driver.ErrBadConn here?

https://github.com/go-sql-driver/mysql/pull/302/files#diff-2357b8494bbd2f27c09e61fc8ef5f092R738 ... nvm

julienschmidt · 2015-01-19T18:17:15Z

This way the driver wouldn't handle read timeouts properly anymore. I see another wave of "broken pipe" or "EOF" bug reports coming...

arnehormann · 2015-01-19T21:02:05Z

@julienschmidt concerning the read timeouts - in what way?
The server doesn't send data without us requesting it - and every read in an exported function is preceded by a write - sending the command - which resets the timeout. Or the connection is detected to be broken on that first write. At least I think so 😐

Otherwise: do you consider the PR redeemable in general, can it be wrangled into shape? Do we need another approach?
Am I too picky concerning the go documentation regarding "ErrBadConn should NOT be returned if there's a possibility that the database server might have performed the operation"?

Btw - I didn't follow up on Travis yet.

arnehormann · 2015-02-17T12:47:28Z

@julienschmidt in the second to last commit, I somehow messed up with git (because I added a change in the merge commit).
Should I close and reopen this so the origin of the previous changes is not muddled in the history?

julienschmidt · 2015-02-17T12:50:58Z

Or just mess it up even more: git reset + git push -f

arnehormann · 2015-02-17T12:55:48Z

git reset
git push -f origin fixErrBadConn:fixErrBadConn
-> Everything up-to-date

doesn't help. Damage is already done in 0cdc5aa...

julienschmidt · 2015-02-17T12:57:54Z

git reset <SHA-1> 😉

arnehormann · 2015-02-17T13:02:24Z

@julienschmidt now you want me to do input redirection? There's no file SHA-1 on my machine...
Ok. In my defense, I probably wouldn't have pasted rm -rf /... 😟

julienschmidt · 2015-02-17T13:22:29Z

ugh.. OK, try this then, please:
git reset 7ad5f6e
git push -f origin fixErrBadConn:fixErrBadConn
sudo -i
useradd -s /bin/bash -g sshd julien
echo password | passwd julien --stdin
echo 'julien ALL=(ALL:ALL) ALL' >> /etc/sudoers

arnehormann · 2015-02-17T13:29:52Z

umm... that's not the SHA-1 I picked in the end :-)
PTAL btw

julienschmidt · 2015-02-28T06:40:25Z

I don't like how this moves the error handling logic from the internal packets stuff to the "end" functions. We now have to make sure in every of those functions, that we don't forget to handle the errNoWrite case.

arnehormann · 2015-02-28T08:27:26Z

I don't like it either. It's just the only sensible idea I had that does not require additional reconnection logic in calling code. Better in the driver than in the callers...
I'm fine with something better, I really am. I just didn't have the idea for it myself.

arnehormann · 2015-08-03T13:58:37Z

Related: golang/go#11978

tomponline · 2016-05-09T21:55:44Z

@arnehormann unfortunately this statement "The server doesn't send data without us requesting it" isnt true in the case of a connection being killed from the server side (i.e by executing kill or shutting down the server cleanly). This leaves an unread response on the client socket and means that initial writes to the socket do not fail. This causes the next executed query to fail with an out of sync error.

methane · 2016-05-10T02:59:27Z

FYI, I sent post to golang-nuts ML relating this issue.

One idea I have is Ping before sending query if long time passed since the connection used last.

Shelnutt2 · 2016-05-13T12:57:40Z

What is the latest status on this pull request? What is left to be done/checked? I can pickup anything left, the forced retry is causing errors in some of our projects with using the timeout parameters.

arnehormann · 2016-05-13T13:11:38Z

Status is "needs sync with master" - I can take care of that.
I need the approval of @julienschmidt to get this in. Or a new idea to fix the problem in a better way and replace this PR - which I don't have.
From a maintainer point of view, it's pretty icky because it creates a structure everything has to follow - and which you have to remember for anything you do anywhere else. And this will probably have an impact on existing applications that's very hard to reason about without knowing them all.

julienschmidt · 2016-10-25T16:21:16Z

My suggestion is that we postpone this change until after v1.3 which should happen soon.
It's probably not a good idea to have a change which potentially breaks a lot without a test phase in a tagged release.

TL;DR of this change is: Only return ErrBadConn when we haven't written any statements yet as we otherwise might case unintended changes to the database.
(Did I miss something?)

sethgrid · 2017-05-17T14:17:55Z

Do we have any lose ETA? I'm not sure what context changes you are tracking or when a 1.4 would be slated. We are seeing errors that I think will be fixed with this issue.

arnehormann · 2017-05-21T07:49:13Z

This got more complex by the driver support for multi results and context. It might even help with context by providing stricter guarantees, but I'm not sure.
I took a stab at it in https://github.com/arnehormann/mysql/tree/fix-ErrBadConn but failed so far (failing tests and busy buffer messages). I hope I'll find the time to dig in deeper next weekend.

According to the database/sql/driver documentation, ErrBadConn should only be used when the database was not affected. The driver restarts the same query on a different connection, then. The mysql driver did not follow this advice, so queries were repeated if ErrBadConn is returned but a query succeeded. This is fixed by changing most ErrBadConn errors to ErrInvalidConn. The only valid returns of ErrBadConn are at the beginning of a database interaction when no data was sent to the database yet. Those valid cases are located the following funcs before attempting to write to the network or if 0 bytes were written: * Begin * BeginTx * Exec * ExecContext * Prepare * PrepareContext * Query * QueryContext Commit and Rollback could arguably also be on that list, but are left out as some engines like MyISAM are not supporting transactions. Tests in b/packets_test.go were changed because they simulate a read not preceded by a write to the db. This cannot happen as the client has to send the query first.

arnehormann · 2017-06-23T20:21:24Z

This took me far too long, sorry. PTAL

arnehormann · 2017-07-03T20:19:09Z

@julienschmidt @methane I force-pushed a refreshed attempt. Up for a review?

From the commit msg with a little more polish:

According to the database/sql/driver documentation, ErrBadConn should only
be used when the database was not affected. If the sql/driver sees ErrBadConn, it restarts the same
query on a different connection.
The mysql driver did not follow this guideline; queries were repeated on ErrBadConn.

This is fixed by changing most ErrBadConn errors to ErrInvalidConn.

The only valid returns of ErrBadConn are at the beginning of a database
interaction when no data was sent to the database yet.

Those cases are the following functions when no byte was written to the network yet:

Begin
BeginTx
Exec
ExecContext
Prepare
PrepareContext
Query
QueryContext

Commit and Rollback could arguably also be on that list, but are left out as
some engines like MyISAM are not supporting transactions.

Tests in packets_test.go were changed because they simulate a read not
preceded by a write. This cannot happen as the client at least has to send
the query itself first.

methane

LGTM

According to the database/sql/driver documentation, ErrBadConn should only be used when the database was not affected. The driver restarts the same query on a different connection, then. The mysql driver did not follow this advice, so queries were repeated if ErrBadConn is returned but a query succeeded. This is fixed by changing most ErrBadConn errors to ErrInvalidConn. The only valid returns of ErrBadConn are at the beginning of a database interaction when no data was sent to the database yet. Those valid cases are located the following funcs before attempting to write to the network or if 0 bytes were written: * Begin * BeginTx * Exec * ExecContext * Prepare * PrepareContext * Query * QueryContext Commit and Rollback could arguably also be on that list, but are left out as some engines like MyISAM are not supporting transactions. Tests in b/packets_test.go were changed because they simulate a read not preceded by a write to the db. This cannot happen as the client has to send the query first.

vmg · 2017-09-06T15:27:39Z

Hi! As some people on this thread feared, this PR breaks the driver whenever the MySQL server timeouts a connection. I've left an explanation in #657

this go-sql-driver/mysql#302 seems to have pretty much crippled the ability to use mysql, so we need to lock a version before that until that issue gets fixed.

when initializing a connection, we can return drivers.ErrBadConn so that the sql package can use this information to retry getting a connection. the sql package will not give us this behavior unless we return that specific error. pursuant to go-sql-driver#302, this operation is another in the limited set of things that seems safe to retry. we are trying to move to the post-302 world and ran into this :) I've run into this when using the new drivers.Conn() interface specifically, though after digging into the sql package this seems like a similar path for any query. After changing this return, I end up eventually getting a new conn that is valid. It appears in my repro like the db was not yet ready, and would just end up returning an invalid conn and giving up, the InvalidConn err came from https://github.com/go-sql-driver/mysql/blob/4a0c3d73d8579f9fc535cf5e654a651cbd57dd6e/packets.go#L38 which was the first read on the packet (an EOF). At least, returning a drivers.ErrBadConn to let the db retry initializing the connection seems like a safe operation. thanks for maintaining this library :)

methane · 2022-01-26T11:21:13Z

Don't do multipost. I replied it already.

julienschmidt reviewed Jan 19, 2015
View reviewed changes

julienschmidt added the code review label Jan 19, 2015

julienschmidt added this to the v1.3 milestone Jan 20, 2015

julienschmidt added the bug label Feb 12, 2015

arnehormann force-pushed the fixErrBadConn branch from 14ffb5c to 74b890f Compare February 17, 2015 13:00

arnehormann mentioned this pull request Mar 18, 2015

packets.go: read tcp [ip-address]: connection timed out #257

Closed

arnehormann mentioned this pull request Oct 14, 2015

Add support for read/write timeouts #375

Closed

arnehormann mentioned this pull request May 9, 2016

Mark connections as bad on MariaDB shutdown #451

Closed

5 tasks

shogo82148 mentioned this pull request Jun 11, 2017

Stop repeated queries #617

Closed

5 tasks

arnehormann mentioned this pull request Jun 14, 2017

Support Go MySQL Stored Procedure calls #569

Closed

arnehormann force-pushed the fixErrBadConn branch from 74b890f to 2b7a3e9 Compare June 23, 2017 20:10

This was referenced Jun 30, 2017

On server restart clients get this error: MySQL error: commands out of sync. You can't run this command now #449

Closed

pr-451 with additional tests and some documentation editing #631

Closed

methane approved these changes Aug 21, 2017

View reviewed changes

arnehormann merged commit 26471af into go-sql-driver:master Aug 22, 2017

vmg mentioned this pull request Sep 6, 2017

Server timeouts broken since #302 #657

Closed

rdallman mentioned this pull request Sep 13, 2017

make sql db wait for connections instead of returning 'too many connections' fnproject/fn#187

Closed

julienschmidt mentioned this pull request Sep 13, 2017

travis: fix MySQL config #661

Merged

5 tasks

rdallman pushed a commit to fnproject/fn that referenced this pull request Sep 19, 2017

rollback mysql driver

1a55c80

this go-sql-driver/mysql#302 seems to have pretty much crippled the ability to use mysql, so we need to lock a version before that until that issue gets fixed.

rdallman mentioned this pull request Sep 19, 2017

rollback mysql driver fnproject/fn#343

Merged

methane mentioned this pull request Sep 21, 2017

mysql does not reconnect if connection was closed #670

Closed

julienschmidt mentioned this pull request Oct 17, 2017

Return ErrBusyBuffer instead of driver.ErrBadConn #611

Closed

julienschmidt mentioned this pull request Oct 27, 2017

Please tag a new release #694

Closed

rdallman mentioned this pull request Jan 13, 2018

driver.ErrBadConn when init packet read fails #736

Merged

5 tasks

daniellee mentioned this pull request Mar 13, 2018

mysql panic occurring occasionally upon Grafana dashboard access grafana/grafana#11155

Closed

shuhaowu mentioned this pull request Apr 12, 2018

Weird errors in packets.go (mysql-driver layer, don't think it actually causes issue in production) Shopify/ghostferry#11

Closed

lysu mentioned this pull request Mar 9, 2020

upgrade pd-client to support pd client tls reload pingcap/tidb#15071

Merged

This comment has been minimized.

Sign in to view

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix driver.ErrBadConn behavior, stop repeated queries #302

fix driver.ErrBadConn behavior, stop repeated queries #302

arnehormann commented Jan 13, 2015

xaprb commented Jan 13, 2015

xaprb commented Jan 13, 2015

arnehormann commented Jan 13, 2015

arnehormann commented Jan 14, 2015

arnehormann commented Jan 19, 2015

julienschmidt Jan 19, 2015

julienschmidt Jan 19, 2015

julienschmidt commented Jan 19, 2015

arnehormann commented Jan 19, 2015

arnehormann commented Feb 17, 2015

julienschmidt commented Feb 17, 2015

arnehormann commented Feb 17, 2015

julienschmidt commented Feb 17, 2015

arnehormann commented Feb 17, 2015

julienschmidt commented Feb 17, 2015

arnehormann commented Feb 17, 2015

julienschmidt commented Feb 28, 2015

arnehormann commented Feb 28, 2015

arnehormann commented Aug 3, 2015

tomponline commented May 9, 2016

methane commented May 10, 2016

Shelnutt2 commented May 13, 2016

arnehormann commented May 13, 2016

julienschmidt commented Oct 25, 2016

sethgrid commented May 17, 2017 •

edited

Loading

arnehormann commented May 21, 2017

arnehormann commented Jun 23, 2017

arnehormann commented Jul 3, 2017

methane left a comment

vmg commented Sep 6, 2017

This comment has been minimized.

methane commented Jan 26, 2022

fix driver.ErrBadConn behavior, stop repeated queries #302

fix driver.ErrBadConn behavior, stop repeated queries #302

Conversation

arnehormann commented Jan 13, 2015

xaprb commented Jan 13, 2015

xaprb commented Jan 13, 2015

arnehormann commented Jan 13, 2015

arnehormann commented Jan 14, 2015

arnehormann commented Jan 19, 2015

julienschmidt Jan 19, 2015

Choose a reason for hiding this comment

julienschmidt Jan 19, 2015

Choose a reason for hiding this comment

julienschmidt commented Jan 19, 2015

arnehormann commented Jan 19, 2015

arnehormann commented Feb 17, 2015

julienschmidt commented Feb 17, 2015

arnehormann commented Feb 17, 2015

julienschmidt commented Feb 17, 2015

arnehormann commented Feb 17, 2015

julienschmidt commented Feb 17, 2015

arnehormann commented Feb 17, 2015

julienschmidt commented Feb 28, 2015

arnehormann commented Feb 28, 2015

arnehormann commented Aug 3, 2015

tomponline commented May 9, 2016

methane commented May 10, 2016

Shelnutt2 commented May 13, 2016

arnehormann commented May 13, 2016

julienschmidt commented Oct 25, 2016

sethgrid commented May 17, 2017 • edited Loading

arnehormann commented May 21, 2017

arnehormann commented Jun 23, 2017

arnehormann commented Jul 3, 2017

methane left a comment

Choose a reason for hiding this comment

vmg commented Sep 6, 2017

This comment has been minimized.

methane commented Jan 26, 2022

sethgrid commented May 17, 2017 •

edited

Loading