packets.go: read tcp [ip-address]: connection timed out #257
UPDATE: My resolution was to remove all "Idle" connections from the pool. See this comment:
I am currently experiencing a stalling or broken web app after an idle period of between 15 and 48 minutes. The most critical issue is described below:
A typical request is logged like this:
After a long period of time (ranging from 15m to 48m), the system suddenly logs the lines below with no interaction - the web app has been idle this entire time:
Notice the "TOTAL TIME" is 31 minutes and 19 seconds? Also, notice the MySql driver error that is logged at the same time?
There was no activity / no web request made. The web app was simply idle.
The most critical issue is what comes next after these log messages: _the very next web request stalls completely, never returning a response_:
And it sits idle, no response, for 15 minutes until wget times out.
Now, if I make a 2nd or 3rd request immediately after that one stalls, and anytime while it is stalled, the go web app responds and returns a full page for those other requests. No issues. And then the cycle starts over from the last request I make and let sit idle.
After this 15m, you can guess exactly what is logged next:
Another 15m wait time.
I eliminated Windows Azure, the cluster VIP, and the firewall/Linux VM running the go web app as causes because I ran
There are a number of factors in my web app so I will try to outline them accordingly.
Do note that the Linux box running MySql is a different box from the Linux box running the cluster of GoLang apps - and they are in separate dedicated Cloud Services. The MySql vm is a single VM, no clustering.
Here is some related code:
5 more DB queries, per request
In addition to this query, my "Context" you see being passed into the handler runs 4 to 6 additional SQL queries. Therefore, each "article" handler that loads runs about 5 to 7 SQL queries at a minimum, using the exact same pattern and
Timeouts / errors are always on the same DB query
Here's one of the "context" queries as a comparison:
Nothing special there.
I do call
Why am I getting request timeouts logged in excess of 15 to 30 minutes, from an idle site? That seems like a bug with the mysql driver I am using, possibly holding a connection open. But, the last http request was successful and returned a complete page + template.
I even have the Timeout set in the connection string, which is 5 seconds. Even if it is a problem with the mysql server, why the 15 minute timeout/request logged? Where did that request come from?
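For reference, a DSN with the driver's `timeout` parameter (the dial timeout) set to 5 seconds looks roughly like this; user, password, and host here are placeholders, not my real values:

```
user:password@tcp(mysql-host:3306)/dbname?timeout=5s
```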
It still could be a MySql driver issue, blocking the request from completing - maybe being blocked by the MySql dedicated VM and an issue there. If that is the case, then how come nothing is logged? What is this random timeout of 15m to 48m? It is usually only 15m or 31m, but sometimes 48m is logged.
The "15m" multiples in the timeouts (15m, 31m and 48m) are very interesting, allowing for some padding in seconds.
Thanks in advance.
Regarding your biggest question: read this part of our README and pay special attention to the link at the end.
Answer 1: No, you should only
Answer 2: I don't know what problems you have with Prepare, but I guess it is connected with the opening and closing of the connection pool. Prepared statements become invalid when the connection pool they were prepared in is closed.
Please get back to us here if that didn't help you.
Thanks Arne. I have updated the SO with more debugging. It is an issue with the go app and/or the mysql driver explicitly. Please take another look at the SO as I completely re-worded it just now.
Also, I do set
With the same timeout of 5s.
But as I stated in the updated SO, there is an interval of 15m timeouts (15m, 31m and 48m) that is happening, stalling the entire web request.
I don't think this issue should be closed. It still may be the MySql VM issue in Azure; but even then, why is there a 15 minute timeout happening in the mysql driver/packets?
Notice the logged datetime below from the MySql driver, and then the 31 minute request logged:
I've updated the original post with the exact details now after more debugging.
I've confirmed the MySql driver is the one blocking for 15 minutes.
Once I make 5 to 8 requests to the homepage, I then let the wwwgo app sit idle for at least 30m. I then attempt to make a wget request and it gets blocked on db.Query().
And that's it. It sits idle until a wget error of a timeout response for 15 minutes. Do note that my other 15 minutes came from web browser requests, not wget.
I added additional logging and when the following line executes, it blocks the handler (web request) for 15 minutes. See my previous comment for those errors.
When that line executes, it blocks the method for 15 minutes.
Then, after waiting with no activity, the following is all of a sudden logged. It happens at one of the 15m intervals previously mentioned (15m, 31m or 48m):
This is obviously a timeout in the MySql driver, perhaps in the
Again, this may be an issue with my MySql VM I have running causing a tcp issue. But even so, there is a 15 minute timeout happening in the Go application using this MySql driver. That should not be happening.
Ps: note the datetimes logged above. There was no activity (no web requests) on the go app during this period. It just sat and waited, in idle.
Not that I know of. Not using the "net" package, other than HTTP.
And the app/main is pretty boilerplate, standard website setup stuff.
So I had a theory... Since the MySql query stalls on execution after the application has been "idle" for a long period of time, perhaps Windows Azure does something to that TCP connection. Was there a persistent connection in the golang mysql driver that maybe Azure doesn't like across cloud services? (a
So I set out to investigate these... Also, the SetMaxIdleConns() had me curious...
Sure enough, this showed 1 single persistent connection from my wwwgo app to my mysql IP address and 3306 port - even after letting it sit for several hours without any interaction, it was still open and persistent.
This got me thinking... Maybe Azure didn't like long running tcp connections with no activity. So I set these:
So the issue seems to be with a long running persistent connection that never closes.
I know, I know... The next words out of anyone's mouth are, "Are you sure you are calling rows.Close(), and that you are iterating over the entire collection? That will hold the connection open if you don't!"
The answer is that I either do
Note that I call defer
Shouldn't a long-running idle connection be dropped?
Entity Framework, for Microsoft SQL Server, doesn't do this "long running" idle connection pooling. It closes all queries.
If it is to remain open, shouldn't there be a timed "ping" across the wire to keep it active?
Similar to how Android tcp connections are handled, where you constantly send a ping to keep the connection alive.
It's great that you found the solution yourself. Congrats!
Still, this is out of scope for drivers.
The driver itself does not provide any pooling functionality, it is based on simple connections.
Sending a ping-like query at regular intervals would require an additional goroutine per connection and synchronisation with any other query running at the same time. All of this would complicate maintenance of the driver and degrade its performance.
I also have a hunch Azure is misbehaving here.
IMO, we should not make any changes in the driver based on this issue.
@julienschmidt close if you agree, take over if you don't
When I ran netstat on the mysql VM, it did not show the connection - only the go app showed a connection
What I suspect is that the packets.go code, or the mysql driver code that is calling it, isn't detecting that the connection has been broken by the network - nor is it timing out. It continues to think there is a connection open, and attempts to use it an hour later.
This theory matches the exact symptoms I originally had: after a long idle period, attempting to make a SQL query blocks for 15m. But during that block, even 5 seconds after it was initially blocked, I could make another query and it was OK. I suspect a 2nd connection was created in the pool, which worked fine.
Therefore two issues need to be addressed here:
If neither of these are part of the driver but instead the underlying Go packages, then I am more than happy to close the issue and take these two issues to the go team.
But I believe we can agree that there are two known issues here that do need to be addressed.
@eduncan911 I think your theory is right. This is what's happening:
The mystery of the second log line (
What is broken?
Can you please try what happens if you remove the timeout from the dsn?
Interesting. I'll remove the timeout; but, on average my pages load in 12ms overall, including all 7 SQL queries ran synchronously (I was going to move to channels later). It is lightning fast.
Therefore, I don't think the 5-second theory is valid - unless it accounts for dozens of page requests back to back, so the connection stays open for longer than 5 seconds as it is used over and over again. But even that doesn't hold water, as I can 1) start the go app, 2) make 1 single request (which ends in 38ms, for the first request at the start of the go app), and 3) wait an hour. The pipe is broken, so after 1 hour when I attempt to make a 2nd request, the first initial SQL query blocks in this state.
When you said:
Let me clarify it below...
And if that
Then yes, that all sounds right.
No progress on our side.
If you also get this error, please edit your driver version, add
Sorry I haven't been able to "crash production" yet for an update to this issue. It's been fine since setting the SetMaxIdleConns() to zero, even under load tests across 3 go instances on 3 VMs and 1 MySql backend.
Quite surprised actually that MySql can take a beating like that on a single core VM with a limit of 300 IOPS for the VHD - it got about 700 RPS with a heavy 5/6 queries per request, across 3 Go instances on 3 VMs. I attribute it to the 1.5 GB of memory the VM has, since they are all READ queries. That's before I add indexes. I always maximize code performance and queries first, before moving to caching.
Actually, you can reproduce this by setting mysql's wait_timeout = 2; then, after 2 seconds, issue some sql commands.
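For example (a config sketch; run in a MySQL client with sufficient privileges):

```sql
-- make the server drop idle connections after 2 seconds
SET GLOBAL wait_timeout = 2;
```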
the stack log is:
I found it to be the state of the connection pool, where the pool thinks there still is a connection open. But in fact, it has been closed remotely by the mysqld.
Scroll up for my entire story and debugging: I verified this with netstat where I saw the Go application having a connection to port 3306 on the remote server. But yet, the remote server running MySql no longer had any open pipes (after the timeout I noted above). I did verify that upon making the application active again, opening many mysql connections, that I saw both open pipes on both servers. But again, after some time, the remote mysqld server would drop the open "idle" connection (still not sure if it is mysqld, or Windows Azure doing it).
The root problem is that the net package, and/or this mysql driver, does not detect the dropped connection and thinks that there is still an open idle connection - there isn't. So upon the next attempt by the mysql driver to make a query, with this stale idle connection that doesn't exist, it creates a block and eventually times out after a long period of time. Any additional hits to the mysql driver for queries works fine, as the connection pool simply creates new connections going forward - after that 1 idle connection (e.g. pooling the connections, use the idle open one for the first 1, then create new connections). New connections are fine; it is that idle one that is the problem because it simply isn't connected.
I resolved my issues by getting rid of all idle connections. See my posts above for the history and resolutions. But in short:
It's mysqld, as I'm only using Ubuntu/CentOS and MySQL all in the same machine (both vagrant and production).
Getting rid of all idle connections is obviously a very good option (there isn't any appreciable performance hit in my response times). Anyway, I think we should see this through to the end and find out whether it's a problem in this library or an issue in the standard one that should be reported and fixed.
At the very least if there's no way to solve this it should be documented where appropriate that a pool of permanent connections cannot be used with this library.
Anyway thanks for writing the solution in this issue explicitly.
Anyone in this thread:
UPDATE if it only happens on a VPS or in a specific environment I'd like to have a look there. If so, I can send you my public key for access.
@arnehormann Put a delay in for 15 or 20 minutes after the first query.
Then attempt 2 SQL queries after that 20 minutes in two go routines (GOMAXPROCS = 1, so you can somewhat control the order they run in). You want to wait for the amount of time that it takes for mysql to drop that idle connection.
The first query after 20 minutes will eventually timeout/error after several minutes of waiting - the reason is that it attempts to use what the driver/netpipes thinks is an "idle" live-still-connected connection in the pool. But mysql has already dropped that connection remotely, and it doesn't exist.
The second query will execute almost instantly and will return without error. You'll still be waiting for the first query to timeout over several minutes. The second query works fine because it did NOT use an idle connection; instead, the pool created a new tcp connection for the query. This is why it works.
^- as long as you have set idle connections to 1, that is.
It may also require mysql to be installed on a remote machine. Perhaps mysql breaks the long-idle connection only for remote connections after a period of time (where perhaps long-idle local connections are fine).
@eduncan911 thanks, but please don't describe it - change my program linked above so it reliably hangs on your own machine when you start it there.
I have an idea how to tackle this issue, but it's a little too brittle for my taste. I still don't see another way after a lot of hard thinking.
As I see it, the cause is the server cutting off the connection in a way that's not easily detectable by the client (same problem as ripping out the network cable). To detect this faster and to let the server know we are still there, we need a MySQL protocol based keepalive on the same connection; the TCP keepalive with its overly long intervals doesn't help.
Ironically, @xaprb recently published a blog post praising our driver for not using timeouts... but what I propose is different to what other connectors do.
Steps to do this:
From #257 (comment)
So what we have to avoid is writing to a connection which we think is still alive, but which the server has in reality already closed.
We could also do something like golang/go#9851 driver side by returning
And regarding Arne's keepalive feature... Déjà-vu? 9d66799
I was having a problem that intuitively seems like what you are describing. I am writing a login/authentication micro-service, and frankly just the initial prototype that generically checks against a user/pass fetching from the database was resulting in read time out, then a broken pipe, then a write time out on different lines.
I have since tried your solution of:
...and that seemed to do the trick. I checked it 1-2 hours later, and it wasn't hanging after the initial request. But I left the go server running overnight, and when I checked this morning, the first request failed. However, I don't get the errors logged in the console; the behavior is identical, just a MUCH longer interval between the breaks.
The long period of time seems to be a trait from the post above me, but the response isn't long, it just breaks or blocks.
I figure I can probably work around this since the micro-service is for private consumption: I can just issue a single retry from JS on the front-end, since I can expect 1 failure on requests that make a trip to the database and the subsequent request will succeed. But clearly that is tacky, and I would prefer to avoid it.
Do you have any more experience with this problem over a longer running period, after an initial request on a connection?
Had the same problem:
From what I understood, the
However, I still get the error. Any insight here?