bad connection errors are incomplete and do not provide context of the instance being connected to #1385

sjmudd · 2023-01-18T08:27:48Z

Issue description

I manage an application called orchestrator which talks to thousands of MySQL servers and is used to detect MySQL failures and if a master or intermediate master fails to re-arrange the replication topology of the cluster to ensure it can continue to be used. This system has a reasonably short connect timeout of 1 second.

The problem I see is that we get quite a lot of errors when connecting to "the database" prior to doing operations such as reconfiguring the server or reconfiguring from which other MySQL server it should replicate. The failed connect errors are the concern. See: https://jira.percona.com/browse/DISTMYSQL-261 for some context of the issue from the application side.

Orchestrator is currently using the v1.6.0 version of the MySQL driver. I'm aware there's a newer version v1.7.0 but as far as I can see there's no change in logic around the topic being discussed.

The error message itself is "extremely vague": driver: bad connection does not indicate the actual problem and the logging does not indicate to which host:port this is actually happening and given the fact I'm continually polling a large number of MySQL servers identifying the source of the problem and the exact cause is actually quite important as orchestrator is intended to determine if the MySQL host is healthy or not, so identifying the reason for a connection failure is important.

The bad connection errors are not that frequent but do add up. The problem is it's hard with current code to identify the source of the issue OR the specific issue that is being caused in a single log line.

I see that currently the driver logging combines multiple different conditions under the same umbrella term errBadConnNoWrite and in some cases it logs the error independently of the error returned to the caller.

It would seem better, given recent changes in go error handling, to extend the errBadConnNoWrite errors returned to the caller into separate errors for each condition triggered so that applications can still detect this error with errors.Is(err, errBadConnNoWrite) but by wrapping the specific error with the error picked up earlier in the code (within the driver) the full error can be returned to the caller and identified more completely.

Ideally I'd like the driver to report in the error something like the mc.cfg.Addr value of the host being talked to. If that is not considered acceptable then it would be necessary for the caller to be adapted to record this information for all connections so it can be logged with the error received from the driver when it happens.

Summarising: the exact cause of driver: bad connection errors is not clearly identified in the error returned to the caller.
I believe it would be good to return a more detailed error to clarify each of the different cases where it is returned more explicitly and suggest that users use errors.Is() to check for errBadConnNoWrite if existing functionality is required. If possible include the address of the host where this happens.

If there is a better way to identify the specific errors and the host to which they correspond then please share your thoughts.

Error log

See: https://jira.percona.com/browse/DISTMYSQL-261 for some details/examples

Configuration

Driver version (or git SHA):

v1.6.0

*Go version:

Not entirely sure. This is seen on CentOS 8 where I see the version pulled in by yum to be 1.18.4
current quay.io/centos/centos:stream8 docker images now show 1.19.2.
packages were built in Nov '22 so the version may be a bit older. I can find out if needed.

*Server version:

MySQL 8.0, versions currently from 8.0.28-8.0.31
MySQL 5.7.40 or so

*Server OS:

CentOS 8, x86_64

The text was updated successfully, but these errors were encountered:

methane · 2023-01-19T09:40:22Z

"talk to thousands of MySQL" is not typical use case of most user of this driver.
Most users know the server because they configure the sql.DB.

I won't break backward compatibility only for such rare use case.

By the way, we print actual low level error behind errBadConnNoWrite into stderr.
Can't you see it?

sjmudd · 2023-01-19T18:45:22Z

"talking to thousands of MySQL servers" may not be a typical use case, but you provide the MySQL golang driver, so while I do not expect special handling of my specific use case I do believe the current driver's error handling in this case could be improved.
Yes, you print the low level error to stderr sometimes. However, given I'm polling a large number of servers concurrently these log lines may overlap so the logging to stderr is not I think an appropriate solution. The caller can not collect the error and report it appropriately within the application.
In many cases you do not report the error so I can't be sure of exactly which of the actual return calls was being triggered and therefore which of the exact specific cases might be happening here.
Again this application's primary concern is to connect to a server, identify if it's healthy and do that periodically. The response returned back to the driver, especially since you may be doing lazy connections may appear anywhere a SQL command is called and so the caller can not easily determine what's going on.

Which "backward compatibility" changes do you think this makes?

A colleague said:

It seems other drivers were affected by a similar issue: denisenkom/go-mssqldb#586

This lead to golang upstream changes, merged in golang 1.18: https://go-review.googlesource.com/c/go/+/333949

When drivers return driver.ErrBadConn, no meaningful information about what the cause of the problem is returned.
Ideally the driver.ErrBadConn would be always caught with the retry loop, but this is not always the case. Drivers today must choose between returning a useful error and use the rety logic.This allows supporting both.

So my concern right now is still the same:

I can not identify the actual "driver: bad connection" error. This is happening but I don't know why and the stderr logging is not seen by the upstream caller, and in some cases nothing is logged.
I do see errors. They happen a lot. Clearly they're a small percentage of errors but they are important. These errors basically break sql calls which are required for our application to manage MySQL instances and specifically during error situations. The odd thing I see this happening against healthy servers, or servers which a few seconds ago had appeared to be healthy.

e.g. A master dies (this happens and is expected) and so we need to choose the most up to date replacement, and then relocate the other replicas under the new master. Often we see the relocation request to a server we'd seen as being healthy < 5 seconds previously fails. A new connection is opened and we get a "driver: bad connection" error. That's all I see and I don't know why the connection failed. That's a real concern. Of course if I log into the server it's working fine.

I can certainly try v1.7.0 of your driver together built with the latest version of golang to see if this helps and report here. I was told that it looks like orchestrator was built on golang 1.16 even though that's a bit old, so the update in the golang version the MySQL driver and database/sql might trigger a change in behaviour. I can verify that if you think it might make a difference, but it does not entirely help with logging some things to stderr become effectively invisible to the caller. Also the current logging is not easily identifiable against a specific database connection. If it were returned to the user, wrapped in some way, then it would not be your problem: the user would have the information and could "track" which database connection this referred to.

methane · 2023-01-20T04:21:34Z

However, given I'm polling a large number of servers concurrently these log lines may overlap so the logging to stderr is not I think an appropriate solution.

If stderr is not good for you, you can use SetLogger.

We can not support 100% use case. In general, it's sysadmin job that reading logs and investigate errors. Applications don't need precise information.

We do not know about orchestrator. We can not fix your errors. We do not know what erorr/log is needed to fix your errors.

Please vendor & patch (or fork) and add anything you need in it.
After fixing your problem, please create a pull request to add error/log that is useful for 99% users, not only for you.

methane · 2023-01-20T06:48:44Z

Which "backward compatibility" changes do you think this makes?

A colleague said:

It seems other drivers were affected by a similar issue: denisenkom/go-mssqldb#586
This lead to golang upstream changes, merged in golang 1.18: https://go-review.googlesource.com/c/go/+/333949
When drivers return driver.ErrBadConn, no meaningful information about what the cause of the problem is returned.
Ideally the driver.ErrBadConn would be always caught with the retry loop, but this is not always the case. Drivers today must choose between returning a useful error and use the rety logic.This allows supporting both.

Thank you for teaching this. We support Go 1.13 currently. But don't wait we drop Go 1.13-1.17 support.

Note that most ErrBadConn errors are consumed by database/sql retry loop. They are not returned to caller unless it happened in the last retry.

So checking log is much better way than trying to retrieve all debug information from error.
That's how we generally use in developing this driver. Adding much debug logs that is only needed during debugging.

sjmudd · 2023-01-20T17:34:09Z

Hi @methane,

I will look to see at least if for now forking the driver to get a better insight into what is actually happen will give us hints on the problems we are seeing. Yet partially I find it somewhat disconcerting that the error: driver: bad connection returned to the application is extremely generic. There are likely to be 2 types of error:

internal driver issues
external connection issues which the operating system will report to the driver

In both cases from what I can see there may be multiple "flavours" of this. MySQL natively returns an indication of the MySQL error but often part of the error response provides something that comes from the OS:

$ mysql -h 10.1.2.3
ERROR 2003 (HY000): Can't connect to MySQL server on '10.1.2.3:3306' (60)
$ mysql -S /path/to/nowhere
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/path/to/nowhere' (2)

So you get some information from the MySQL perspective but it might be that the MySQL server gives you more details which come from the OS itself. Perhaps that additional information might be useful when the driver.ErrBadConn is returned by wrapping it and providing the detail?

I see that the code returns driver.ErrBadConn if it internally generates errBadConnNoWrite. I'm guessing this may be the contract that you need to fulfil with database/sql. Is that the case?

Either way the error in packet.go comes from writePacket, writeHandshakeResponsePacket, writeAuthSwitchPacket, writeCommandPacket, writeCommandPacketStr, writeCommandPacketUint32, writeExecutePacket at least as far as I could tell and the it might perhaps be useful to know which of these specific cases is actually the cause of the generic errBadConnNoWrite error you return now. Given I see the current error I can not see this detail which would be useful to me. I would imagine that same information might be useful to other users of your driver if they want to better understand a connection failure when talking to their backend MySQL server.

sjmudd · 2023-01-20T18:35:59Z

A further thought here is that if there is an error of this type the upper layer retries the statement and the application may be completely unaware there was a problem. This is partly good as the app sees no problem, but it is also unaware the connection may be flaky.

If an error is reported by database/sql it will be on the retried attempt so the original error will be lost. Is that right? Even worse.

It might be nice to have such metrics/statistics available somewhere so the application can detect or determine such issues. If you have any sort of long running process or daemon which talks to one or more databases I can see this being interesting, but I would expect this to be perhaps exposed via database/sql or through database/sql. Maybe that is not feasible now. Exposing some metrics from the MySQL driver would be fine.

If the statitics provided were only error statics at least the happy path would not be affected performancewise and the overhead of updating error counters should be negligible anyway.

Do you have any thoughts on these points?

methane · 2023-01-21T00:07:05Z

If an error is reported by database/sql it will be on the retried attempt so the original error will be lost. Is that right? Even worse.

Yes. database/sql catches most ErrBadConn and doesn't return it to application.
Most errors are lost.
That's why I strongly recommend to use debug log repeatedly.

Do you have any thoughts on these points?

"print debug" is the best way.

sjmudd · 2023-01-25T18:50:17Z

FWIW: I am not sure this really works. I also checked about golang/go#48309 and this is a feature request to make the retry setting optional (or configurable) which I think is what I would want for normal processing in my case of "testing health of servers I manage/monitor". That still requires you to modify your driver to return different, wrapped errors depending on the specific issues currently seen that generate driver.ErrBadConn. As things stand pouring over logs as mentioned previously is not helpful as the identity of the connection that fails is not provided. Returning the real error to the caller makes it my problem to correlate the error with the connection not yours.

Note the comment from golang/go#48309 (comment):

It is encouraged for database drivers to wrap BadConn.

Related I also notice that setting timeout=1s may not work as expected in the DSN as the retry logic if you return a driver.ErrBadConn with trigger a new connection with the same timeout, thus potentially making the connect timeout take 2 seconds in some scenarios. That's something I'm looking to fix in my code by using the context option and add a timeout at the database/sql layer.

So it may be good to mention this option to people as the configured behaviour might not match what's expected.

abscondment · 2023-01-27T16:48:41Z

We support Go 1.13 currently. But don't wait we drop Go 1.13-1.17 support.

What is this project's philosophy around supporting different versions of go? Go itself no longer supports any of the versions you mentioned, even with security patches, because their commitment is "Each major Go release is supported until there are two newer major releases" (ref).

When could we expect changes that require the 1.18 interface to be incorporated? Our use case for application awareness of driver.ErrBadConn isn't debugging, but allowing our services to make decisions.

Edit: I see you have this stated in the README already! "We aim to support the 3 latest versions of Go." So, since it's likely go1.20 will be released in 1-2 months that is the earliest it would be reasonable to request any breaking change requiring go1.18.

sjmudd · 2023-04-10T12:41:39Z

I'm concerned there is no progress on this. I think we can probably provide a PR which we're trying internally to resolve the issue and it seems to work fine for us. Reporting of the issue is provided in a better way and also the error is wrapped to ensure that the upstream database/sql layer gets the expected error, but they already handle that.

So if you'd like us to provide a PR to inspect and that would make progress I see that as being very valuable. I just need to confirm with colleagues prior to doing that.

methane · 2023-05-22T03:01:09Z

Now logger can be configurable per DB. See #1408
You can add information about which server cause the error log.

sjmudd mentioned this issue Jan 25, 2023

database/sql: make maxBadConnRetries configurable golang/go#48309

Open

methane closed this as completed May 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bad connection errors are incomplete and do not provide context of the instance being connected to #1385

bad connection errors are incomplete and do not provide context of the instance being connected to #1385

sjmudd commented Jan 18, 2023 •

edited

methane commented Jan 19, 2023

sjmudd commented Jan 19, 2023 •

edited

methane commented Jan 20, 2023

methane commented Jan 20, 2023

sjmudd commented Jan 20, 2023 •

edited

sjmudd commented Jan 20, 2023

methane commented Jan 21, 2023

sjmudd commented Jan 25, 2023 •

edited

abscondment commented Jan 27, 2023 •

edited

sjmudd commented Apr 10, 2023

methane commented May 22, 2023

bad connection errors are incomplete and do not provide context of the instance being connected to #1385

bad connection errors are incomplete and do not provide context of the instance being connected to #1385

Comments

sjmudd commented Jan 18, 2023 • edited

Issue description

Error log

Configuration

methane commented Jan 19, 2023

sjmudd commented Jan 19, 2023 • edited

methane commented Jan 20, 2023

methane commented Jan 20, 2023

sjmudd commented Jan 20, 2023 • edited

sjmudd commented Jan 20, 2023

methane commented Jan 21, 2023

sjmudd commented Jan 25, 2023 • edited

abscondment commented Jan 27, 2023 • edited

sjmudd commented Apr 10, 2023

methane commented May 22, 2023

sjmudd commented Jan 18, 2023 •

edited

sjmudd commented Jan 19, 2023 •

edited

sjmudd commented Jan 20, 2023 •

edited

sjmudd commented Jan 25, 2023 •

edited

abscondment commented Jan 27, 2023 •

edited