Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Likely race that can prevent to recycle closed connections #1542

Closed
pfreixes opened this issue Jan 30, 2024 · 5 comments
Closed

Likely race that can prevent to recycle closed connections #1542

pfreixes opened this issue Jan 30, 2024 · 5 comments

Comments

@pfreixes
Copy link

pfreixes commented Jan 30, 2024

We have upgraded our mysql client from 1.2.0 to 1.7.1 recently, and after a while we have observed that one of our instances that was running the client started to fail for all of the operations with an error that was saying invalid connection.

This error seemed to come from here

Our initial guess was that an unhealthy TCP connection was used all the time, and for some reason the TCP connection was not automatically recycled due to its supposed closing state.

We have found out that from the 1.2.0 version to the 1.7.1 there was this PR which changed drastically the way that the mysql driver was informing to the SQL Golang connection pool that connections were considered in bad state, and they should be recycled from the pool.

While the original changes seemed legit, with the propose of preventing retries for operations that might already touched the server, we found out that this was also changed. So returning the internal ErrInvalidConn in case the buffer could not be taken.

We could not yet reproduce the issue but we believe that some scenarios could lead to a situation where the length of the buffer is not 0, and the interpolation can not be done. For example consider the following scenario:

  • A new query is being done
  • data sent by the client is being processed by the Mysql server
  • Mysql returns a response
  • The client reads N bytes, which involves the header + some data from the response which is already in the buffer. But the overall packet len is not yet in the buffer. So the buffer will contain the header + some extra data
  • A RST packet is sent by the Mysql server, or an Proxy that could be placed in the middle
  • A subsequent read will get as a result an error since the RST package would take precedence over the reading of the remaining bytes of the packet len. At that point we would be leaving some orphan data into the buffer.
  • the ErrInvalidConn will be returned in that case
  • A new query is being done and reuses the same connection
  • When the interpolation function is being called, it fails since there are still some remaing bytes in the buffer from the previous operation.

As I said we could not yet probe that this was the root cause of the original issue, but seemed a plausible scenario that could have lead to the situation that we experimented.

Also you might want to consider if its semantically correct to return the ErrInvalidConn when the sever was not yet touched, considering maybe on just returning the driver.ErrBadConn which would automatically resolve any issue with unhealthy TCP connections.

Let me know what you think, we are more than happy to create a PR with this change if you believe that its worth it.

@pfreixes
Copy link
Author

pfreixes commented Feb 2, 2024

I went ahead and Ive created a PR #1545 let me know what you think 🙇

@methane
Copy link
Member

methane commented Feb 4, 2024

Did you see busy buffer error log?

@methane
Copy link
Member

methane commented Feb 5, 2024

#1513 may be same issue. But I don't know because no reproduce step yet.

@pfreixes
Copy link
Author

pfreixes commented Feb 5, 2024

Did you see busy buffer error log?

Unfortunately we did not have the logger configured and we never sent the stderr to our central logs, we will make sure that we configure it in a way that allow us to gather these kind of logs later.

#1513 may be same issue. But I don't know because no reproduce step yet.

Yes, I would say that #1513 suffered from a similar issue which could have been triggered from the same scenario that I was describing

@pfreixes
Copy link
Author

pfreixes commented Feb 5, 2024

Found the issue thanks to @owbone, seems that we suffered from same issue #1394 which was fixed (but not yet realeased! here #1473

Seems that Ive missed this mc.Close() here, which makes impossible the race that I was speaking about.

Ill close the issue and the PR.

@pfreixes pfreixes closed this as completed Feb 5, 2024
This was referenced Mar 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants