Server timeouts fixed since #302, then people notice their bad configuration. #657
After more than 2 years being blocked, PR #302 was merged just two weeks ago. It purportedly fixes issues where queries were being wrongly "retried" (i.e. run twice by the server, because `database/sql` re-sent them after receiving an `ErrBadConn`).
In practice, however, this breaks the driver in any situation where the MySQL server can kill connections on the server side. Let's analyse why and try to find a workaround for this issue.
We're gonna make the driver crash with a pretty simple reproduction recipe.
First, we need a MySQL server configured to time out idle connections. We're gonna use MySQL 5.7 and simply set its `wait_timeout` system variable to 3 seconds (e.g. `SET GLOBAL wait_timeout = 3`).
With this MySQL server in place, we can reproduce the issue with a few lines of Go code, using the standard `database/sql` package:
```go
db, err := sql.Open("mysql", dsn)
assert.NoError(t, err)
defer db.Close()

assert.NoError(t, db.Ping())

// Wait for 5 seconds. This should be enough to timeout the conn,
// since `wait_timeout` is 3s
time.Sleep(5 * time.Second)

// Simply attempt to begin a transaction
tx, err := db.Begin()
assert.NoError(t, err)
```
The result of pinging the DB, sleeping for 5 seconds, then attempting to begin a transaction, is a failed assertion: `db.Begin()` returns an `invalid connection` error instead of a usable transaction.
It is clear that this is not the right behaviour. The `database/sql` pool is designed to cope with connections that die while idle: `Begin` should transparently retry on a fresh connection instead of surfacing an error.

So, what's the underlying issue here, and why is #302 causing it?
The changes in #302 are sensible and appear correct when reviewed. The "most important" modification happens in the driver's packet-writing code.

The PR adds line 140, where the driver stops returning `driver.ErrBadConn` once it has started writing a packet to the wire, so that `database/sql` will not retry a query that may already have reached the server.
This is perfectly sensible, but the previous reproduction recipe shows it doesn't work in practice. In our code, the following sequence of actions happens:

1. We open a connection to the MySQL server.
2. We ping the server and receive a reply.
3. We go to sleep for 5 seconds.
4. 3 seconds into our sleep, the server enacts `wait_timeout` and closes the connection on its side.
5. We wake up, unaware that the connection is dead, and call `db.Begin()`.
6. The driver writes the corresponding packet to the socket, and the write succeeds.
7. The server, whose side of the connection has long been closed, rejects the packet.
8. The driver only finds out on its next read -- after the query has been written, so as of #302 it no longer returns `ErrBadConn`.
When written like this, the problem becomes really clear. #302 made an assumption about the TCP protocol that doesn't hold in practice: that after MySQL server has killed our connection, trying to write a packet from the driver to the server will return an error. That's not how TCP works: let us prove it by looking at some packet captures.
These are steps 1 through 8 seen through Wireshark:
The beginning is straightforward. We open the connection with a TCP handshake, we send a ping packet and we get a reply back (you can actually see two pings + two replies in that capture). We then go to sleep.
...And just 3 seconds later, in frame 14, the MySQL server enacts the timeout we've configured. Our TCP connection gets closed on the server side and we receive a `FIN` packet.
Seven seconds later, at T+10 (frame 18), we wake up from our sleep and attempt to exec a query on the dead connection: the driver writes the packet for our `db.Begin()` call, and as far as our kernel is concerned, the write succeeds.
Of course, the MySQL server immediately replies -- with a `RST` packet, because its side of the connection has long been closed.
And meanwhile, the kernel has received the `RST`, so the next operation on our socket will fail. But by that point the query packet has already been written, and as of #302 the driver refuses to return `ErrBadConn` for it -- the error surfaces to the caller instead of triggering a retry.
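None of this is specific to MySQL, by the way. As a quick illustration (this snippet is mine, not from the driver), the same write-succeeds-then-fails behaviour can be reproduced with nothing but the standard library:

```go
package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	go func() {
		c, err := ln.Accept()
		if err == nil {
			c.Close() // the "server" closes right away, sending us a FIN
		}
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	time.Sleep(100 * time.Millisecond) // let the FIN arrive

	// The kernel happily buffers this write even though the peer is gone;
	// the peer's kernel answers the stray data with a RST.
	_, err = conn.Write([]byte("hello?"))
	fmt.Println("first write: ", err) // usually <nil>

	time.Sleep(100 * time.Millisecond) // let the RST arrive

	// Only now does the failure surface on our side.
	_, err = conn.Write([]byte("hello?"))
	fmt.Println("second write:", err) // broken pipe / connection reset
}
```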
What are the potential fixes?
There is, in theory, a "right way" to handle this general case of the server closing a connection on us: performing a zero-size read from the socket right before we attempt to write our query packet should let us know whether the peer has already closed the connection, without consuming any data -- letting the driver return `ErrBadConn` before anything has been written.
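For the record, here's a rough sketch of what such a check could look like on Unix-y platforms, using a non-blocking one-byte read via Go 1.9's `syscall.RawConn` instead of a true zero-size read (all names here are mine; this is not driver code):

```go
package connchecksketch

import (
	"io"
	"net"
	"syscall"
)

// connIsAlive peeks at a TCP connection without blocking. A healthy idle
// connection has nothing to read and reports EAGAIN; a connection the peer
// has closed reports EOF (after a FIN) or a reset error (after a RST).
func connIsAlive(conn *net.TCPConn) (bool, error) {
	raw, err := conn.SyscallConn()
	if err != nil {
		return false, err
	}

	var n int
	var readErr error
	var buf [1]byte
	err = raw.Read(func(fd uintptr) bool {
		// The runtime keeps the fd in non-blocking mode, so this
		// returns immediately either way.
		n, readErr = syscall.Read(int(fd), buf[:])
		return true // don't wait for readability
	})
	if err != nil {
		return false, err
	}

	switch {
	case n == 0 && readErr == nil:
		return false, io.EOF // peer closed cleanly
	case readErr == syscall.EAGAIN || readErr == syscall.EWOULDBLOCK:
		return true, nil // nothing pending: still alive
	case readErr != nil:
		return false, readErr // e.g. ECONNRESET
	default:
		return false, nil // unexpected data on an idle connection
	}
}
```

The caveat is that if the server did send data, this check consumes a byte of it, so a real implementation would have to treat that case as an error or stash the byte.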
Here's the issue: the "zero read" semantics from Golang are a bit fuzzy right now. A commit to the Go standard library touches on this directly:
This closed issue in the golang/go repository covers the same ground:
Ian Lance Taylor comments:
The "final write system call" that Ian talks about here is the same as our "trying to write
Brad Fitzpatrick wraps up the issue with a comment on zero-size reads from Golang:
Another still-open issue in golang/go tracks the underlying limitation:
I hope this in-depth analysis wasn't too boring, and I also hope it serves as definitive proof that the concerns that some people (@xaprb, @julienschmidt) had on the original PR are indeed very real: we tried rolling out the new version of the driver, and it breaks reliably in any environment where the server prunes idle connections.
As for a conclusive fix: I honestly have no idea. I would start by reverting the PR; I think the risk of duplicated queries is worrisome, but the driver as it is right now is unusable for large MySQL clusters that run tight on connections and do pruning.
I would love to hear from the maintainers on approaches to fix this. I have time and obviously an interest in using the upstream driver, so I'm happy to help.
First of all, you should use `SetConnMaxLifetime`.
Yes, I reached the same conclusion, and that's why I strongly recommend using `SetConnMaxLifetime`.
If your MySQL cluster's `wait_timeout` is 3 seconds, set the connection max lifetime to something shorter, so the client always closes the connection before the server can.
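To make the suggestion concrete, here's a minimal sketch (the DSN and the 1-second value are placeholders; the only requirement is staying below the server's `wait_timeout`, 3 seconds in the repro above):

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; substitute your own credentials and host.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Retire pooled connections well before the server's wait_timeout,
	// so the client side always closes first and never writes to a
	// connection the server has already killed.
	db.SetConnMaxLifetime(1 * time.Second)
}
```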
Thank you for the suggestion! That seems like a good workaround. I'm afraid it's not up to me how our MySQL clusters handle idle connections. We have many different clusters with different capacities and many different languages/clients connecting to them, so connections are gonna get killed -- either "statically" through `wait_timeout`, or dynamically when an operator prunes them.
I think using an aggressive timeout on `SetConnMaxLifetime` will work around the issue for us in the meantime.
Would you file an issue on golang/go? Adding a new method to TCP (and maybe Unix socket) connections to check whether the connection has been killed by the peer seems like the right long-term fix.
I'm in a hurry right now, but from memory:
In the end, https://golang.org/src/database/sql/sql.go?s=34900:34969#L630 is called.
Robust client code has to include retry logic to prevent this failure mode. This is very, very annoying, but I still see an improvement in #302 - we reduce the number of spec violations and increase the visibility of a potential problem, the one I mentioned above.
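For what it's worth, a sketch of what such client-side retry logic might look like (the helper and its policy are illustrative, not from this thread, and only safe for queries the caller knows to be idempotent):

```go
package retrysketch

import (
	"database/sql"
	"time"
)

// queryWithRetry re-runs an idempotent, read-only query a bounded number of
// times. Only the caller can know that re-running a statement is safe;
// database/sql itself no longer retries these failures after #302.
func queryWithRetry(db *sql.DB, query string, args ...interface{}) (*sql.Rows, error) {
	const maxAttempts = 3
	var rows *sql.Rows
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		rows, err = db.Query(query, args...)
		if err == nil {
			return rows, nil
		}
		// Brief linear backoff before trying again on a fresh connection.
		time.Sleep(time.Duration(attempt) * 50 * time.Millisecond)
	}
	return nil, err
}
```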
That was from memory and I might be wrong in the details.
You're welcome! I'm glad it's of some use!
> What if the connection pool contains `maxBadConnRetries` connections which were all invisibly closed by the server and your query is retried on all of those? The query will fail and you'd get the same error.

No, it will not. The `database/sql` package will force a new connection if, after `maxBadConnRetries`, all attempts return `ErrBadConn`. The logic is just 3 lines above the link you posted: https://github.com/golang/go/blob/ab40107708042ded6bdc1fb841c7cf2c2ab002ab/src/database/sql/sql.go#L1286-L1288

This new connection will never fail. But, of course, to reach that logic the driver must be returning `ErrBadConn` -- which is not the case as of #302.

I think you may indeed be wrong about the way the driver interacts with the standard library. To me, it seems like `database/sql` is designed to be resilient around the case of broken connections -- but for the retry logic to work, the driver needs to play along and return `ErrBadConn`.

Again, I'm aware that duplicate queries are a real issue, but the fix in #302 breaks the semantics in `database/sql` and causes queries that were previously succeeding (i.e. being retried) *in all circumstances* to be reliably broken *in all circumstances*.
It's a big regression from my point of view. Please reconsider! Cheers!
I'll clear the egg from my face, dig deeper into it again, take some time to digest it and get back to you. Probably tomorrow. Thanks again!
The retry was dangerous.
Checking whether the connection has been closed by the peer before sending a query may increase the "successful retry" rate.
So the best approach is to shorten the connection lifetime enough: close the connection from the client side, before the server closes it.
One can't have automatic 'magic' retries without potentially executing queries twice.
Imagine a situation where the server wants to notify the client that an auto-committed query was successful, but the connection dies before that reply arrives: the client can't know whether the query ran, so blindly retrying it may execute it twice.
Now one could argue that a good database design can prevent this. Unfortunately, not every user is a database expert, and the `database/sql` documentation is clear that drivers should rather be "safe than sorry".
We could probably do what we always do - add a DSN flag to let users opt back into the old behaviour.
#302 fixes real bugs and improves the spec conformity. Still, as diagnosed above, it comes with its own set of problems.
If we look at the situations where the driver returns ErrBadConn, we traded false positives for false negatives and gained spec conformity (see the first comments in #302). Sadly, the false negatives were pretty rare (judging from report frequency); the false positives not so much.
I don't see a sensible way to fix both false positives and false negatives. Still, as methane mentioned, there is a workaround for the problems introduced by #302. That and the spec conformity make me want to keep the current state - with the PR merged.
I do see that this causes problems for existing clients. I dislike the introduction of a new DSN parameter as it fragments the user base and complicates triaging bug reports in the future, but we have to deal with it anyway until everyone uses the current version (never, probably). That's why I'd reluctantly agree to it - in combination with additional documentation.
Also, I'm sorry for all the confusion - esp. concerning my unfounded response earlier.
I just had an idea I prefer to a DSN flag. DSN is per connection, but the required changes are per package, as they might require different code. So they should be changed globally.

Let's introduce a new exported error variable, only for the #302 changes (placeholder: `ErrXYZ`). Then clients can get the old behavior with `mysql.ErrXYZ = driver.ErrBadConn`. This can preserve legacy behavior and still nudges developers in the right direction. And it suggests reduced accountability. And it will probably not be used by new clients.

Thoughts?
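If I'm reading the proposal right, the opt-in to legacy behavior would look roughly like this on the client side (`ErrXYZ` is the placeholder from above and does not exist in the released driver):

```go
package main

import (
	"database/sql/driver"

	"github.com/go-sql-driver/mysql"
)

func init() {
	// Hypothetical: ErrXYZ is the placeholder variable proposed above.
	// Pointing it at driver.ErrBadConn would make the driver hand
	// database/sql its retryable sentinel error again, restoring the
	// pre-#302 retry behavior process-wide.
	mysql.ErrXYZ = driver.ErrBadConn
}
```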
I prefer a DSN flag. Global state should be avoided when possible. Imagine for example two 3rd party packages that depend on a different value of the global state to work properly.
And I don't see any changes in #302 that require global state. Simple if-else checking of the value of an attribute on the connection, populated from the DSN, would be enough.
Just to add a humble opinion on the subject: #302 broke the behaviour on our systems too.

After deploying it, the daemons that use one shared `*sql.DB` object very quickly began to fail with invalid connections.

We are currently rolling back. IMHO the `sql.DB` object is a cluster of connections that should handle the complexity of broken connections, and should only return an invalid connection after x failed attempts to reconnect.

In addition, can I suggest tagging a release before merging heavy changes like that? Thank you very much ;)
Just upgraded a service to take advantage of the recent driver changes and ran into this issue as well.
Any update on this? Would a PR on a new DSN flag be accepted? :)
Why don't you shorten `ConnMaxLifetime`?
I don't think adding complexity into this lib only for special use cases is a good idea.
Unless we're reconnecting per query, no value passed to `SetConnMaxLifetime` can guarantee the server never closes a pooled connection first -- a failover, for instance, kills connections regardless of their age.
Because this was a major breaking change: `database/sql` previously handled this behavior with its `ErrBadConn` retry logic.
I don't think having connections to a "read only" DB is a special use case. Primary-replica topologies are very common. All queries targeting replicas are read-only and idempotent.
I don't know your specific case. But most people who see this error log have a wrong setting: a connection lifetime longer than the server's `wait_timeout`.
If that is the only reason, you can use the old version.
If it's not so special, it should be fixed in `database/sql` or the standard library, not worked around in this driver.
I'm trying to set up graceful automatic failover of the primary DB, and recovery from reader/replica failures, without seeing a bunch of `invalid connection` errors bubble up to the application.
That's fair. :) Appreciate the reply!
But the problem may happen even with a very short `ConnMaxLifetime`: the server can still close a connection at any moment, for example on restart or failover.
I'm closing this issue because the discussion is hard to follow. I don't want to continue the discussion here.
My conclusion is:

* Checking whether the peer has closed the connection before sending a query (as discussed above) is worth pursuing.
* Shortening the connection lifetime via `SetConnMaxLifetime` should be documented as the recommended way to avoid writing to connections the server has already closed.

I'll create new issues for them.