Pool connection doesn't reconnect well. #821

miningpoolhub · 2014-05-18T14:21:33Z

I use amazon cloud server, with rds multi-AZ.

I'm using pooled connection, but it doesn't reconnect as expected.

I restarted RDS machine, and it started fail-over to other multi-AZ.
But at that point, connection was lost and not recovered well.

I'm just using pool.query() which don't have connection release leak. It must work as expected isn't it?

dougwilson · 2014-05-18T14:24:16Z

It must work as expected isn't it?

I'm not really sure what you are asking. I also don't know anything about how Amazon's services work, so all that information about Amazon-specific terms is over my head. an you give me a more detailed description about what is not working, exactly? Also, are you getting any error from the pool?

miningpoolhub · 2014-05-18T14:43:17Z

Oh, the connection count graph increased at certain point is not done automatically. The connection count didn't increase but increased after I restarted node process.

Amazon EC2 (just virtual machine computer) connects to RDS which is DB server.

Two RDS works, one is primary (master) and the other is slave. Each other changes when fail-over happens.
EC2 doesn't connect to RDS by IP, but it connects with dns name. dns endpoint is provided by Amazon and it redirects to primary RDS server at that time.

This RDS fail-over works well with other program written in python, reconnects very smoothly when RDS reboots or some problem occurs.
But node-mysql doesn't reconnect as expected.

Sorry that I couldn't get any info from pool. I didn't log about node-mysql that much.
But it's reproducible to this kind of problem.

I used cluster to run many nodejs process and each process has 10 connections max.

Maybe it's dns resolve problem, which previous pool connection trying to connect to same IP again and again. I'm not sure but just guessing because RDS works different, needs DNS info refreshed.

Curious thing is that it worked a bit too. I'm not sure it worked because it reconnected or some idle connection was established but some data was sent correctly.

Any idea?

dougwilson · 2014-05-18T14:48:24Z

Any idea?

I don't really have any. So for the host -> DNS, that is just using the internal node.js core mechanisms, so I don't have any comment there about how that is working. Also, if the connection gets lost in this pool, a new connection will only be reestablished the next time to try to make a query, so if you are not making queries, the connections won't come back automatically.

Also, what version of this module are you using. Can you try the newest 2.3.0 if you are not using it and see if anything changes (there are various improvements/fixes that may be impacting you)?

miningpoolhub · 2014-05-18T15:09:05Z

Queries are inserted frequently like 10~50 per second. Maybe I had to wait more time to see if things were going well, but I couldn't wait long because it is real production service.

I'm using 2.2.0
I will try new version.

dougwilson · 2014-05-18T15:17:23Z

I'm using 2.2.0

OK, 2.2.0 should be the newest I really wanted you to try, but you are using it.

But it's reproducible to this kind of problem.

I don't have access to any Amazon services to try and reproduce the issue to determine what is happening. It would be really helpful if you could figure out what the issue is so I would know how to make a patch for it, otherwise I'm just in the dark about what the issue you are experiencing is.

dougwilson · 2014-05-18T15:24:08Z

/cc @sidorares

miningpoolhub · 2014-05-19T00:12:19Z

What information do you need?
How can I enable debug info or things with node-mysql especially for pooled query?
I just use

pool = mysql.createPool()
pool.query()

Sorry that I'm new to nodejs, not sure how to examine it.

sidorares · 2014-05-19T01:25:44Z

It's not very clear to me as well what problem are you solving. Number of connections in AWS console is expected to drop after restart - pool opens new connections only if there is no idle connection available and you requesting new one.

Are you actually getting errors on mysql client side because it's using old IP after failover? If not, could you add some connection identifying SQL and log results ( server id / thread id )?

miningpoolhub · 2014-05-20T02:14:21Z

I will test more and write it soon. Thanks all.

miningpoolhub · 2014-06-09T09:31:00Z

I analyzed deep into it today.

When Amazon tries Multi-AZ fail-over, two DB instance changes their position by redirecting endpoint url.
Amazon provides endpoint address like "testtesttest.cqofkxwqldcj.us-east-1.rds.amazonaws.com" not an IP and it redirects to master DB instance at that moment.
So connection IP may change time to time when fail-over occurs.

I couldn't get any error when doing fail-over with node-mysql. It works as it didn't disconnected at all.
DB records inserted successfully (not showing any error), but DB record data is not shown on DB. I think these newly inserted data have gone away to Backup DB instance.

Pool connection works perfectly when I tried just restarting DB instance not using Multi-AZ fail over. It says PROTOCOL_CONNECTION_LOST, ECONNREFUSED properly and works well when DB instance restarted some minutes later.

One more thing, I connected HeidiSQL to Master DB from my client PC and it disconnected immediately when I fail-over the DB instance. So HeidiSQL detects disconnection at right moment and tried to reconnect to endpoint url again.

I think node-mysql doesn't detect disconnection well or Amazon doesn't disconnect backup DB connection from which is from their server farm. Since other python based mysql projects works well with Amazon MySQL Multi-AZ, it maybe a specific issue with node-mysql only.

I'm using latest node-mysql version. I used today's latest souce.

dougwilson · 2014-06-09T13:27:30Z

Yea, it does sound like there must be some disconnection indicator if other libs are seeing the disconnect. Just as a curiosity, would it be possible to try the mysql2 library to see if it has the same problem?

miningpoolhub · 2014-06-09T18:46:04Z

I tried mysql2 too but no luck.
I found that setting "queueLimit: 10" makes the pool query to output error. (It didn't even show error before) Since queries were stacked on queue unlimitedly, it didn't return error code. I thought it was inserted to DB correctly but it didn't at all.

When error occurs, I tried to reconnect something like this code.

pool.query(querystring, ..., ...,
function(err, result) {
if (err) {
pool.end();
pool = mysql.createPool(......);
}

It worked as expected but had some timing issue that createPool() was called constantly by other query's fail.
I increased queueLimit to 100 and the error did not happen frequently but happened time to time.

I want to know for clean and robust code to reset pool connection and reinitialization.
One thing the main problem is that Amazon Multi-AZ doesn't disconnect previous connection, pool also doesn't get disconnect event too.
node-mysql catches disconnection event when Amazon RDS just reboots, so this is some bug about Multi-AZ fail over on Amazon side.

Any idea for workaround or clean code?

dougwilson · 2014-06-09T18:49:52Z

Can you see what the behavior is if you set your connectTimeout value low? For example, add " connectTimeout: 10" to your configuration.

miningpoolhub · 2014-06-10T02:24:57Z

When I set connectTimeout: 10
it seems to work!
But I don't think it's stable.

I'm curious that connectTimeout is millisecond but 'ETIMEDOUT error' log didn't appear immediately. It took about 30sec ~ 10 minutes to reconnect. This is weird. Is it normal that connectTimeout value is not affected much?
And when query is not inputted much after failover, new connections are made but some previous connections are still connected to backup server. I checked it by netstat and it was ESTABLISHED state. It's not a big problem, but it doesn't use pool connection in round-robin style.

It is stated as
"Connections are also cycled round-robin style, with connections being taken from the top of the pool and returning to the bottom."

I am worried about it that some potential problem lies in it. Like round robin is working as expected but timeout didn't happen to previous connection and something stuck. Well DB records are inserted successfully though. Hope it's just round robin problem or false documented.

Anyway, Amazon RDS is the core problem that didn't disconnect client when failover occurs. I submitted this issue to Amazon RDS forum.

dougwilson · 2014-06-10T02:31:38Z

Oops, I was thinking more like 10s, not 10ms :) How about "connectTimeout: 10000"? I think setting it to 10ms will cause weird behavior for sure, though, because that is way too quick.

mseddon · 2014-06-12T17:06:23Z

Anyway, Amazon RDS is the core problem that didn't disconnect client when failover occurs.
I submitted this issue to Amazon RDS forum.

Unfortunately, if for example, the AZ's datacentre goes completely offline for some reason, Amazon 'fixing' this problem this way will not help, you are merely treating the symptoms and failover will still not happen. It is perfectly feasible in many failure conditions that the connection could just die without ever sending you a TCP RST, and that all users are susceptible to this problem, however the behaviour of RDS has drawn attention to the problem.

I haven't had time to look into this just yet, but I can confirm it happening here on our RDS cluster. The fix must be applied here (or upstream if that is where the problem actually is).

I'll check back with more information once I've had a chance to look at it.

dougwilson · 2014-06-12T17:17:01Z

It is perfectly feasible in many failure conditions that the connection could just die without ever sending you a TCP RST

Yes, that is definitely an issue. Without a RST packet, there is no way to know the TCP connection has closed by the other side. We'll need to set a timeout when doing operations and let that timeout expire to know that the remote host has disappeared. I can definitely implement that.

mseddon · 2014-06-12T19:00:25Z

Yeah, that should hopefully do it (assuming obviously you can dial in the timeout and not prematurely close horribly loaded servers ;))

I'm hoping to allocate some time next week to look at this issue properly and will let you know what tcpdump sees and what's going on inside the code as it happens. Happy to test anything you need, but my time is a bit limited at the moment, so I can't be sure of when it will be done.

dougwilson · 2014-06-12T19:01:57Z

assuming obviously you can dial in the timeout and not close horribly loaded servers

If this means "is the timeout configurable"... yes, yes, it will definitely be :)

I'm hoping to allocate some time next week to look at this issue properly and will let you know what tcpdump sees and what's going on inside the code as it happens.

Awesome!

mseddon · 2014-06-12T21:45:30Z

Necessity is the mother of pull requests ;)

mseddon · 2014-06-20T15:28:31Z

Hi again, finally had a chance to briefly look at the problem today in more detail. It's very much to do with the failure to receive that final RST, so this is a very critical bug. When we simulate the AWS failover amazon abruptly kill the server instance and the connection dies. A simple setInterval query selecting CURRENT_TIMESTAMP from the server does the following as examined by tcpdump:

14:36:32.093640 IP mysqlA > client.34221: Flags [.], ack 562, win 114, options                                   [nop,nop,TS val 389305 ecr 1065078], length 0
14:36:34.095415 IP mysqlA > client.34221: Flags [P.], seq 1287:1404, ack 562,                                   win 114, options [nop,nop,TS val 389805 ecr 1065078], length 117
14:36:34.096335 IP client.34221 > mysqlA: Flags [P.], seq 562:613, ack 1404, w                                  in 229, options [nop,nop,TS val 1065579 ecr 389805], length 51
14:36:34.096651 IP mysqlA > client.34221: Flags [.], ack 613, win 114, options                                   [nop,nop,TS val 389806 ecr 1065579], length 0
14:36:36.097251 IP mysqlA > client.34221: Flags [P.], seq 1404:1521, ack 613,                                   win 114, options [nop,nop,TS val 390306 ecr 1065579], length 117
14:36:36.098168 IP client.34221 > mysqlA: Flags [P.], seq 613:664, ack 1521, w                                  in 229, options [nop,nop,TS val 1066080 ecr 390306], length 51
14:36:36.098457 IP mysqlA > client.34221: Flags [.], ack 664, win 114, options                                   [nop,nop,TS val 390306 ecr 1066080], length 0
14:36:38.098862 IP mysqlA > client.34221: Flags [P.], seq 1521:1638, ack 664,                                   win 114, options [nop,nop,TS val 390806 ecr 1066080], length 117
14:36:38.099752 IP client.34221 > mysqlA: Flags [P.], seq 664:715, ack 1638, w                                  in 229, options [nop,nop,TS val 1066580 ecr 390806], length 51
14:36:38.100041 IP mysqlA > client.34221: Flags [.], ack 715, win 114, options                                   [nop,nop,TS val 390807 ecr 1066580], length 0
--- FAILOVER HERE --- HANG  -- NO RST FROM MYSQL

An easy way to reproduce under a linux box would be just to block traffic from port 3306:

iptables -A INPUT -p tcp --source-port 3306 -j DROP

dougwilson · 2014-06-20T15:30:08Z

@mseddon thanks for the verification of the missing RST to be the source :) I'll plan to make a new release tonight with changes that should solve the issue.

mseddon · 2014-06-20T15:32:28Z

Excellent, thanks! I'll have a play with that over the weekend and let you know how it works out.

dougwilson · 2014-06-21T16:51:10Z

@mseddon sorry, I did not end up having much time yesterday, but I got most of it done. Will be finishing it up today :)

mseddon · 2014-06-23T15:47:30Z

@dougwilson Any updates on this issue?

dougwilson · 2014-06-23T15:52:33Z

I have been iterating though different takes in a server farm. They weren't quite working, so I'm just trying to get something that will actually work :)

mseddon · 2014-06-23T15:53:53Z

Ok, no problem! Thanks :)

dougwilson · 2014-07-06T05:02:58Z

If you like, please try out the patch with npm install felixge/node-mysql#945b92e and let me know if it corrects your issue.

mseddon · 2014-07-07T12:14:41Z

Printing the result of doing SELECT CURRENT_TIMESTAMP every 2 seconds and then triggering a failover still doesn't cause any reconnect after ten seconds, or indeed five minutes.

SUCCESS:  Mon Jul 07 2014 12:11:34 GMT+0000 (UTC)
SUCCESS:  Mon Jul 07 2014 12:11:36 GMT+0000 (UTC)
SUCCESS:  Mon Jul 07 2014 12:11:38 GMT+0000 (UTC)
SUCCESS:  Mon Jul 07 2014 12:11:40 GMT+0000 (UTC)
[timeout occurs here]

Same problem as before when the RST packet is not received.

/home/ubuntu/node-mysql-test
└─┬ mysql@2.3.2 (git://github.com/felixge/node-mysql.git#945b92e15e080944c79a45ca68e2c6d9f7c3059f)
  ├── bignumber.js@1.4.0
  ├─┬ readable-stream@1.1.13-1
  │ ├── core-util-is@1.0.1
  │ ├── inherits@2.0.1
  │ ├── isarray@0.0.1
  │ └── string_decoder@0.10.25-1
  └── require-all@0.0.8

@miningpoolhub does this patch work for you?

dougwilson · 2014-07-07T13:02:04Z

@mseddon can you paste me the code you are using to test? It is important to note that the patch only affects pool.getConnection right now.

mseddon · 2014-07-07T13:12:49Z

ah, that may be it, one moment while I try again using that. Cheers.

mseddon · 2014-07-07T13:17:08Z

Hi- switching the test to use pool.getConnection seems to work better, I'll have a play but this might fix it after all, sorry about that. Thanks :)

dougwilson · 2014-07-07T13:45:45Z

@mseddon no problem :) I know there is still work to be done--but from now, it's actually down deep into Node.js internals. When you send data over TCP, you get ACKs for each packet; we should be able to timeout if when sending off the sequence doesn't get ACKs back, but that is not straight-forward in Node.js since the details have been extracted away from us.

mseddon · 2014-07-07T16:26:56Z

So, one observation (though probably not surprising)- if I set the connectionLimit to 1 to simulate a full connection pool, and failover within a query, reconnects still don't happen. I would imagine that this is because we're hung waiting for a free connection, and your timeout check occurs once we have a free connection?

dougwilson · 2014-07-07T16:33:34Z

I would imagine that this is because we're hung waiting for a free connection, and your timeout check occurs once we have a free connection?

Correct. The query is hung because of the severed connection. You can always do conn.query({sql: 'SELECT 1', timeout: 30000}) to give the query a 30s timeout, but that's not very practical. AFAIK, Node.js abstracts too much away to know that the TCP connection is gone in the middle of the query without a FIN or RST packet to tell us.

mseddon · 2014-07-07T16:35:19Z

Hmm, does timeout start when the conn.query() command is issued, or when the request is finally submitted to mysql? It's not pretty but it could work for my case.

dougwilson · 2014-07-07T16:37:50Z

does timeout start when the conn.query() command is issued, or when the request is finally submitted to mysql?

The timeout is actually based on sequence packet activity, so it would start when the query packet is sent to MySQL server, rather than at the time you called .query().

mseddon · 2014-07-07T16:38:54Z

Ok, thanks, that should be good enough!

dougwilson · 2014-07-07T16:43:00Z

Ok, thanks, that should be good enough!

Cool :) I'm still going to be working on adding more protections, though. It may be interesting to know that every command now accepts a timeout option, so you could ping with a timeout conn.ping({timeout: 1000}), etc.

closes mysqljs#821 closes mysqljs#854

dougwilson added the question label May 18, 2014

dougwilson added the needs investigation label Jun 9, 2014

dougwilson added bug and removed question labels Jun 20, 2014

dougwilson self-assigned this Jun 20, 2014

dougwilson added this to the 2.4 milestone Jun 20, 2014

dougwilson mentioned this issue Jun 24, 2014

getConnection()'s callback never called when I got disconnect from the NET (using pools) #854

Closed

dougwilson closed this as completed in 945b92e Jul 6, 2014

mysqljs locked and limited conversation to collaborators Oct 3, 2014

seangarner pushed a commit to seangarner/node-mysql that referenced this issue May 11, 2015

Add acquireTimeout pool option

853bb4b

closes mysqljs#821 closes mysqljs#854

Pool connection doesn't reconnect well. #821

Pool connection doesn't reconnect well. #821

Comments

miningpoolhub commented May 18, 2014

dougwilson commented May 18, 2014

miningpoolhub commented May 18, 2014

dougwilson commented May 18, 2014

miningpoolhub commented May 18, 2014

dougwilson commented May 18, 2014

dougwilson commented May 18, 2014

miningpoolhub commented May 19, 2014

sidorares commented May 19, 2014

miningpoolhub commented May 20, 2014

miningpoolhub commented Jun 9, 2014

dougwilson commented Jun 9, 2014

miningpoolhub commented Jun 9, 2014

dougwilson commented Jun 9, 2014

miningpoolhub commented Jun 10, 2014

dougwilson commented Jun 10, 2014

mseddon commented Jun 12, 2014

dougwilson commented Jun 12, 2014

mseddon commented Jun 12, 2014

dougwilson commented Jun 12, 2014

mseddon commented Jun 12, 2014

mseddon commented Jun 20, 2014

dougwilson commented Jun 20, 2014

mseddon commented Jun 20, 2014

dougwilson commented Jun 21, 2014

mseddon commented Jun 23, 2014

dougwilson commented Jun 23, 2014

mseddon commented Jun 23, 2014

dougwilson commented Jul 6, 2014

mseddon commented Jul 7, 2014

dougwilson commented Jul 7, 2014

mseddon commented Jul 7, 2014

mseddon commented Jul 7, 2014

dougwilson commented Jul 7, 2014

mseddon commented Jul 7, 2014

dougwilson commented Jul 7, 2014

mseddon commented Jul 7, 2014

dougwilson commented Jul 7, 2014

mseddon commented Jul 7, 2014

dougwilson commented Jul 7, 2014