
Server timeouts fixed since #302, then people notice their bad configuration. #657

Closed
vmg opened this Issue Sep 6, 2017 · 24 comments

@vmg

vmg commented Sep 6, 2017

Hey everyone!

After more than two years of being blocked, PR #302 was merged just two weeks ago. It purportedly fixes issues where queries were wrongly "retried" (i.e. run twice by the database/sql client even though the original query went through). To accomplish this, it reduces the number of places where the driver returns a driver.ErrBadConn error: it only returns this value when we're 100% sure that the database has not received the original query, because returning ErrBadConn forces the database/sql client to retry the query altogether on a different connection, hence leading to redundant/repeated queries.

In practice, however, this breaks the driver in any situation where the MySQL server can kill connections on the server side. Let's analyse why and try to find a workaround for this issue.

Reproduction

We're gonna make the driver crash with a pretty simple reproduction recipe.

First, we need a MySQL server configured to time out idle connections. We're gonna use MySQL 5.7 and simply set global wait_timeout=3;. This will make mysqld kill any connection that hasn't been active (i.e. sending queries) for 3 seconds. This whole example also works with any maintained MySQL release, including 5.6, and all MariaDB releases. The 3 second timeout is a bit aggressive, but it's similar to what we run in our main MySQL cluster here at GitHub.

With this MySQL server in place, we can reproduce the issue with a few lines of Go code, using the master branch of go-sql-driver/mysql:

// Inside a Go test; assumes `dsn` points at the server configured above,
// and that the driver is blank-imported for registration:
//   import _ "github.com/go-sql-driver/mysql"
db, err := sql.Open("mysql", dsn)
assert.NoError(t, err)
defer db.Close()

assert.NoError(t, db.Ping())

// Wait for 5 seconds. This should be enough to time out the conn, since `wait_timeout` is 3s.
time.Sleep(5 * time.Second)

// Simply attempt to begin a transaction.
tx, err := db.Begin()
assert.NoError(t, err)
_ = tx

The result of pinging the DB, sleeping for 5 seconds, and then attempting to begin a transaction is a crash:

Error:          Received unexpected error: invalid connection

It is clear that this is not the right behaviour. The mysqld instance is still up throughout the test, so even if the initial connection got killed by the server, calling db.Begin() should create a new connection instead of returning an error. This was the behaviour before #302 was merged.

So, what's the underlying issue here, and why is database/sql not retrying the connection after the changes in #302?

Deep Dive

The changes in #302 are sensible and appear correct when reviewed. The "most important" modification happens in packets.go, in the code to handle writes:

mysql/packets.go

Lines 121 to 147 in 26471af

n, err := mc.netConn.Write(data[:4+size])
if err == nil && n == 4+size {
	mc.sequence++
	if size != maxPacketSize {
		return nil
	}
	pktLen -= size
	data = data[size:]
	continue
}

// Handle error
if err == nil { // n != len(data)
	mc.cleanup()
	errLog.Print(ErrMalformPkt)
} else {
	if cerr := mc.canceled.Value(); cerr != nil {
		return cerr
	}
	if n == 0 && pktLen == len(data)-4 {
		// only for the first loop iteration when nothing was written yet
		return errBadConnNoWrite
	}
	mc.cleanup()
	errLog.Print(err)
}
return ErrInvalidConn

The PR adds line 140, where the condition n == 0 && pktLen == len(data)-4 is checked. The rationale is as follows:

  1. Any interaction (i.e. query, exec) from the driver to the MySQL server happens in two phases: First, we write the packet with the query/exec command to the socket connected to the server. If this write succeeds, the server must write back an "answer" packet, so we immediately read the response from the socket.

  2. If this write doesn't succeed, it means the connection has gone bad (maybe the server killed it?), but we know for sure the server has not received our query, so we can return errBadConnNoWrite, which will be translated further down the driver into driver.ErrBadConn and force database/sql to retry the query on another connection.

  3. However: If the write succeeds but the follow-up read from the socket fails (i.e. we never receive the "answer" packet from MySQL), we cannot return ErrBadConn, because the server has received our query and may have executed it even though we received no answer. We cannot let database/sql retry this query.

This is perfectly sensible, but the previous reproduction recipe shows it doesn't work in practice. In our code, the following sequence of actions happens:

  1. We connect to the MySQL server (sql.Open)
  2. We ping the MySQL server by sending a "ping" packet
  3. We receive a "pong" packet response from MySQL -- db.Ping returns successfully
  4. We sleep for 5 seconds
  5. After just 3 seconds, the MySQL server is going to kill our driver's only open connection
  6. We wake up from sleep and attempt to begin a transaction, by execing a START TRANSACTION.
  7. We write an exec packet to the server with START TRANSACTION. The write succeeds
  8. We then read the response packet from MySQL: This read fails, because MySQL killed our connection -- but now the driver cannot return ErrBadConn to retry this operation, because our initial write succeeded! The server may have executed START TRANSACTION! So the driver's only option is to hard crash.

When written like this, the problem becomes really clear. #302 made an assumption about the TCP protocol that doesn't hold in practice: that after the MySQL server has killed our connection, trying to write a packet from the driver to the server will return an error. That's not how TCP works: let us prove it by looking at some packet captures. 😄

These are steps 1 through 8 as seen in Wireshark:

[Wireshark capture: screen shot 2017-09-06 at 16 43 42]

The beginning is straightforward. We open the connection with a TCP handshake, we send a ping packet and we get a reply back (you can actually see two pings + two replies in that capture). We then go to sleep.

...And just 3 seconds later, in frame 14, the MySQL server enacts the timeout we've configured. Our TCP connection gets closed on the server side and we receive a [FIN, ACK] pair... To which our Golang client simply responds with an ACK! There's no FIN response from the driver because we haven't closed our side of the connection, as we're sleeping.

Seven seconds later, at T+10 (frame 18), we wake up from our sleep and attempt to exec a START TRANSACTION. And (here's what throws us off), the Linux kernel accepts the write just fine. It returns no error to Golang, because this is how TCP works. TCP is a bidirectional protocol, and our side of the connection is still open, so the write goes through.

Of course, the MySQL server immediately replies -- with a RST. "Dude, this connection is over". But it's too late for our driver: the write of the exec packet already succeeded, and with this new logic, we assume that the query may have been executed by the server (even though the MySQL server was not even listening on our socket anymore).

And meanwhile, the kernel has received the RST packet from the MySQL server. The next time we attempt to read from our connection, we will definitely crash. But with this new logic, a retry won't be attempted! This is bad news, because our original exec was not even close to making it to the server -- let alone being executed.
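
The write-succeeds-then-read-fails sequence described above can be reproduced without MySQL at all. Here is a minimal stdlib-only sketch (the helper name halfCloseDemo is mine, not the driver's): a plain TCP server stands in for mysqld and closes its side of the connection, yet the client's next write still succeeds at the kernel level, and only the follow-up read surfaces the dead connection.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// halfCloseDemo (hypothetical helper) mirrors steps 5-8 above with a plain
// TCP server standing in for mysqld: the peer closes, we sleep, we write,
// and only the read reports the failure.
func halfCloseDemo() (writeErr, readErr error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	go func() {
		conn, _ := ln.Accept()
		conn.Close() // server-side FIN, like mysqld enforcing wait_timeout
	}()

	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	defer client.Close()

	time.Sleep(100 * time.Millisecond) // "sleep past the timeout": let the FIN arrive

	// The kernel accepts this write even though the peer is gone...
	_, writeErr = client.Write([]byte("START TRANSACTION"))

	// ...and only the follow-up read surfaces the dead connection.
	var buf [16]byte
	_, readErr = client.Read(buf[:])
	return writeErr, readErr
}

func main() {
	w, r := halfCloseDemo()
	fmt.Println("write error:", w) // write error: <nil>
	fmt.Println("read error:", r)  // non-nil: EOF or connection reset by peer
}
```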

What are the potential fixes?

There is, in theory, a "right way" to handle this general case of the server closing a connection on us: performing a zero-size read from the socket right before we attempt to write our query packet should let us know about that FIN packet we've received: "MySQL has closed this connection, buddy. Your side is open, so you can still write, and the kernel will swallow it... But it's pointless because MySQL isn't reading anymore and -- even if it were --, the FIN means it wouldn't be able to reply".

Here's the issue: the "zero read" semantics from Golang are a bit fuzzy right now. This commit from bradfitz special-cases this situation so the read never makes it to the kernel; before that, the read would always return io.EOF.

This closed issue in golang/go has a very similar analysis to this one (at the TCP level -- not really related to MySQL), with several Golang core devs chiming in.

Ian Lance Taylor comments:

I don't see what Go can do here. Try using "go test -c" to build your test into an executable and then run "strace -f EXECUTABLE". You will see that the final write system call succeeds and the kernel reports it as succeeding. Go is just returning that success to the caller. The same thing would happen for a C program.

At the TCP level, you have closed one side of the TCP connection but not the other. The kernel will accept writes until both sides are closed. This is not as simple a matter as you seem to be expecting. I found a decent description of the issue at http://stackoverflow.com/questions/11436013/writing-to-a-closed-local-tcp-socket-not-failing .

The "final write system call" that Ian talks about here is the same as our attempt to write START TRANSACTION after our sleep, and indeed, it succeeds at the kernel level.

Brad Fitzpatrick wraps up the issue with a comment on zero-size reads from Golang:

FWIW, the 0-byte-Read-always-returns-EOF was changed in Go 1.7 with 5bcdd63. But as of Go 1.7 it now always returns nil, which is at least a bit more Go-like, but still doesn't tell you whether FIN was ever seen arriving. There aren't great APIs anywhere (especially not portably) to know that without reading data.

Another still-open issue in golang/go explains why there is -- still -- no way to tell whether the socket is still good for reading (and hence whether writing to it makes sense).

In Conclusion

I hope this in-depth analysis wasn't too boring, and I also hope it serves as definitive proof that the concerns that some people (@xaprb, @julienschmidt) had on the original PR are indeed very real: we tried rolling out the master branch of this driver in one of our Golang services and it's unusable with any of our MySQL clusters here at GitHub.

As for a conclusive fix: I honestly have no idea. I would start by reverting the PR; I think the risk of duplicated queries is worrisome, but the driver as it is right now is unusable for large MySQL clusters that run tight on connections and do pruning.

Doing a Ping before attempting new queries on connections that could be "stale" seems like a feasible workaround. But how do we mark such connections as stale? What's the timeframe? Should we let the users configure it? What's the overhead going to be like?

I would love to hear from the maintainers on approaches to fix this. I have the time and, obviously, an interest in using the master branch of this driver in production, and I'll gladly write & PR any fix for this issue that you deem acceptable.

Cheers!

@methane


Contributor

methane commented Sep 6, 2017

First, we need a MySQL server configured to time out idle connections. We're gonna use MySQL 5.7 and simply set global wait_timeout=3;.

First of all, you should use DB.SetConnMaxLifetime() instead of wait_timeout.
Closing a connection from the client is always better than closing it from the server, because
the client may send a query just as the server starts closing the connection. In that case,
the client can't know whether the sent query was received or not.

Here's the issue: the "zero read" semantics from Golang are a bit fuzzy right now. This commit from bradfitz special-cases this situation so the read never makes it to the kernel; before that, the read would always return io.EOF.

Yes, I reached the same conclusion. And that's why I always strongly recommend using SetConnMaxLifetime
with a short lifetime.

If your MySQL cluster's wait_timeout can't be longer than 3 seconds, I think you should use
a lifetime of 1 second or less. In a real-world application processing thousands of requests per second,
1 second is long enough if the servers are on the same LAN.

@vmg


vmg commented Sep 6, 2017

First of all, you should use DB.SetConnMaxLifetime() instead of wait_timeout.

Thank you for the suggestion! That seems like a good workaround. I'm afraid it's not up to me how our MySQL clusters handle idle connections. We have many different clusters with different capacities and many different languages/clients connecting to them, so connections are gonna get killed -- either "statically" through wait_timeout or dynamically by monitoring processes. I would definitely appreciate it if the Golang driver could handle killed connections gracefully by retrying them. 😅

I think using an aggressive timeout on SetConnMaxLifetime is gonna work around the issue for now, but I'd love to discuss a proper long term fix so the driver always retries connections. I think we can all agree that the current behaviour as of #302 is not "correct" even if we can work around it by having the driver timeout connections before the server does.

@methane


Contributor

methane commented Sep 6, 2017

Would you file an issue on golang/go?
I think the proper solution is adding a new error which means BadConn, but isn't safe to retry.

Adding a new method to TCP (and maybe Unix sockets) to check whether the connection was killed by the peer
would also be nice. But all that discussion should happen in golang's forum or issue tracker.

@arnehormann


Member

arnehormann commented Sep 8, 2017

@vmg thank you for the report, it's amongst the best ones I've seen so far.
I still disagree that #302 is incorrect, but it definitely changes the driver's behaviour.

I'm in a hurry right now, but from memory:

In the end, https://golang.org/src/database/sql/sql.go?s=34900:34969#L630 is called.
What if the connection pool contains maxBadConnRetries connections which were all invisibly closed by the server and your query is retried on all of those? The query will fail and you'd get the same error.

Robust client code has to include retry logic to prevent this failure mode. This is very, very annoying, but I still see an improvement in #302 - we reduce the number of spec violations and increase the visibility of a potential problem, the one I mentioned above.

That was from memory and I might be wrong about the sql - driver - mysql interactions. If I am, I am very eager to roll back the change. If not, we should discuss how best to change the documentation or (I really don't want to) even introduce a DSN parameter for the old behaviour.

@vmg


vmg commented Sep 8, 2017

thank you for the report, it's amongst the best ones I've seen so far.

You're welcome! I'm glad it's of some use!

What if the connection pool contains maxBadConnRetries connections which were all invisibly closed by the server and your query is retried on all of those? The query will fail and you'd get the same error.

No, it will not: database/sql will force a new connection if, after maxBadConnRetries, all attempts return ErrBadConn. The logic is just 3 lines above the link you posted:

https://github.com/golang/go/blob/ab40107708042ded6bdc1fb841c7cf2c2ab002ab/src/database/sql/sql.go#L1286-L1288

This new connection will never fail. But, of course, to reach that logic the driver must return ErrBadConn -- which it no longer does as of #302.

I think you may indeed be wrong about the way the driver interacts with the standard library. To me, it seems like database/sql is designed to be resilient around the case of broken connections -- but for the retry logic to work, the driver needs to play along and return ErrBadConn.

Again, I'm aware that duplicate queries are a real issue, but the fix in #302 breaks the semantics in database/sql and causes queries that were previously succeeding (i.e. being retried) in all circumstances to be reliably broken in all circumstances.

It's a big regression from my point of view. Please reconsider! Cheers!

@arnehormann


Member

arnehormann commented Sep 8, 2017

@arnehormann


Member

arnehormann commented Sep 9, 2017

@methane


Contributor

methane commented Sep 10, 2017

I think you may indeed be wrong about the way the driver interacts with the standard library. To me, it seems like database/sql is designed to be resilient around the case of broken connections -- but for the retry logic to work, the driver needs to play along and return ErrBadConn.

No. We understand the database/sql design. Before #302 we violated that design, and #302 fixes it.
See #654 and golang/go#11978 (comment)

Again, I'm aware that duplicate queries are a real issue, but the fix in #302 breaks the semantics in database/sql and causes queries that were previously succeeding (i.e. being retried) in all circumstances to be reliably broken in all circumstances.

The retry was dangerous.

@methane


Contributor

methane commented Sep 10, 2017

Checking whether the connection was closed by the peer before sending a query may increase the "successful retry" rate.
But there are still timings where we can't retry safely: the server may close the connection while we are sending the query.

So the best approach is to shorten the connection lifetime enough. Close the connection from the client side, before
it is closed by the server side.

@julienschmidt


Member

julienschmidt commented Sep 10, 2017

One can't have automatic 'magic' retries without potentially executing queries twice.

Imagine a situation where the server wants to notify the client that an auto-committed query was successful.
We have a problem if the server assumes the query is done but the driver never received the notification that it was successful. The client does not send back any kind of ACK, so it can happen that the server successfully writes the notification to the socket, but it gets lost on the path or in an OS buffer.

Now one could argue that a good database design can prevent this. Unfortunately, not every user is a database expert, and the database/sql docs are clear that drivers should rather be "safe than sorry".

We could probably do what we always do - add a DSN flag 😆

@arnehormann


Member

arnehormann commented Sep 11, 2017

#302 fixes real bugs and improves the spec conformity. Still, as diagnosed above, it comes with its own set of problems.

If we look at the situations where the driver returns ErrBadConn, we traded false positives for false negatives and gained spec conformity (see the first comments in #302). Sadly, the false negatives were pretty rare (judging from report frequency); the false positives not so much.

I don't see a sensible way to fix both false positives and false negatives. Still, as methane mentioned, there is a workaround for the problems introduced by #302. That and the spec conformity makes me want to keep the current state - with the PR merged.

I do see that this causes problems for existing clients. I dislike the introduction of a new DSN parameter as it fragments the user base and complicates triaging bug reports in the future, but we have to deal with it anyway until everyone uses the current version (never, probably). That's why I'd reluctantly agree to it - in combination with additional documentation.

Also, I'm sorry for all the confusion - esp. concerning my unfounded response earlier.

@arnehormann


Member

arnehormann commented Sep 12, 2017

@vmg


vmg commented Sep 12, 2017

That sounds like a very good idea! Allowing users to customise the default BadConn error seems like the perfect trade-off.

@julienschmidt


Member

julienschmidt commented Sep 12, 2017

I prefer a DSN flag. Global state should be avoided when possible. Imagine for example two 3rd party packages that depend on a different value of the global state to work properly.

And I don't see any changes in #302 that require a global state. Simple if-else checking of the value of an attribute of mysqlConn should also work.

@mrsinham


Contributor

mrsinham commented Oct 16, 2017

just to add a humble little opinion on the subject: #302 broke the behaviour on our systems too.

After deploying it, the daemons that use one shared *sql.DB object began failing completely very quickly with invalid connections.

We are currently rolling it back. IMHO the sql.DB object is a cluster of connections that should handle the complexity of broken connections, and should only return invalid connection after x retries to reconnect.

In addition, can I suggest tagging a release before you merge heavy changes like that? Thank you very much ;)

@julienschmidt julienschmidt added the bug label Oct 27, 2017

@julienschmidt julienschmidt added this to the v1.4 milestone Oct 27, 2017

@julienschmidt julienschmidt added help wanted bug and removed bug labels Oct 27, 2017

@danielstjules


danielstjules commented Nov 22, 2017

Just upgraded a service to take advantage of rejectReadOnly for easy aurora failover, and was sad to see a breaking change requiring client code to handle reconnects. The logic behind the change makes sense, but in our case, we only perform reads and writes that are idempotent, so the concerns about exactly-once vs at-least-once delivery don't affect us.

Any update on this? Would a PR on a new DSN flag be accepted? :)

@methane


Contributor

methane commented Nov 22, 2017

Why don't you shorten ConnMaxLifetime?
Why don't you retry by yourself?

I don't think adding complexity to this lib only for special use cases
(e.g. all queries being idempotent) is a good idea.
Unless no other solution (or workaround) can solve this, I'm -1 on adding an option.

@danielstjules


danielstjules commented Nov 22, 2017

@methane

Why don't you shorten ConnMaxLifetime?

Unless we're reconnecting per query, no value passed to SetConnMaxLifetime will guarantee a good connection.

Why don't you retry by yourself?

Because this was a major breaking change. database/sql previously handled this behavior with alwaysNewConn and maxBadConnRetries. Wrapping database/sql with retry logic for Begin, BeginTx, Exec, ExecContext, Prepare, PrepareContext, Query, QueryContext, in our code, and more importantly for all dependencies that used to assume retry behavior, is a lot of work.

I don't think adding complexity into this lib only for special use cases (e.g. all queries are idempotant) is good idea.

I don't think having connections to a "read only" DB is a special use case. Primary-replica topologies are very common. All queries targeting replicas are read-only and idempotent.

@methane


Contributor

methane commented Nov 22, 2017

Unless we're reconnecting per query, no value passed to SetConnMaxLifetime will guarantee a good connection.

I don't know your specific case. But most people who see this error log have a broken setting:
the client-side timeout is longer than the server (or network) timeout.

Because this was a major breaking change.

If that is the only reason, you can use the old version.

I don't think having connections to a "read only" DB is a special use case. Primary-replica topologies are very common. All queries targeting replicas are read-only and idempotent.

If it's not so special, it should be fixed in the database/sql design.

@danielstjules


danielstjules commented Nov 22, 2017

I don't know your specific case. But most people who see this error log have a broken setting:
the client-side timeout is longer than the server (or network) timeout.

I'm trying to setup graceful automatic failover of the primary DB, and recovery from reader/replica failures without seeing a bunch of connection.go:372: invalid connection when each connection in the pool subsequently fails after #302. It's not a common occurrence, but it's an important enough failure case.

If it's not so special, it should be fixed by database/sql design.

That's fair. :) Appreciate the reply!

@methane


Contributor

methane commented Nov 29, 2017

but it's an important enough failure case.

Is seeing some error logs so important?
I don't think showing some errors is worse than silently breaking some data.

@julienschmidt julienschmidt modified the milestones: v1.4.0, v1.5.0 May 15, 2018

evanelias added a commit to skeema/tengo that referenced this issue Aug 20, 2018

Update driver, and limit connection lifetime
This commit updates dep github.com/go-sql-driver/mysql to its current HEAD of
master, bringing in the ability to use some of MySQL 8.0's newest auth plugins.

This commit also makes Instance.Connect() set a max conn lifetime on all
returned connection pools, to avoid problems with a change to driver behavior.
For background, see go-sql-driver/mysql#657.
@elaPa


elaPa commented Sep 14, 2018

Any updates on this?

The connection can be terminated by the MySQL server for many other reasons, not only because of the timeout, so setting ConnMaxLifetime to a value smaller than MySQL's wait_timeout doesn't really solve this problem.

@vmg / @danielstjules did you find any other solution?

@methane


Contributor

methane commented Sep 18, 2018

The connection can be terminated by the MySQL server for many other reasons, not only because of the timeout, so setting ConnMaxLifetime to a value smaller than MySQL's wait_timeout doesn't really solve this problem.

Yes, ConnMaxLifetime should be much shorter than wait_timeout. I recommend 10 seconds to 10 minutes.
That avoids most of the problems.

But the problem may happen even with a very short ConnMaxLifetime.
In that case, there is nothing we can do other than report the error.

@methane methane changed the title from Server timeouts broken since #302 to Server timeouts fixed since #302, then people notice their bad configuration. Oct 24, 2018

@methane


Contributor

methane commented Oct 24, 2018

I'm closing this issue because the discussion is hard to follow. I don't want to continue the discussion here.

My conclusion is:

  • A connection closed by a server / router / middleware is always dangerous. Users must avoid it by using DB.SetConnMaxLifetime().
  • On the other hand, when it does happen, it could be handled more safely.

I'll create new issues for them.
