Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pool connection doesn't reconnect well. #821

Closed
miningpoolhub opened this issue May 18, 2014 · 39 comments
Closed

Pool connection doesn't reconnect well. #821

miningpoolhub opened this issue May 18, 2014 · 39 comments
Assignees
Labels
Milestone

Comments

@miningpoolhub
Copy link

I use amazon cloud server, with rds multi-AZ.

I'm using pooled connection, but it doesn't reconnect as expected.

mysqlreconnection

I restarted RDS machine, and it started fail-over to other multi-AZ.
But at that point, connection was lost and not recovered well.

I'm just using pool.query() which don't have connection release leak. It must work as expected isn't it?

@dougwilson
Copy link
Member

It must work as expected isn't it?

I'm not really sure what you are asking. I also don't know anything about how Amazon's services work, so all that information about Amazon-specific terms is over my head. an you give me a more detailed description about what is not working, exactly? Also, are you getting any error from the pool?

@miningpoolhub
Copy link
Author

Oh, the connection count graph increased at certain point is not done automatically. The connection count didn't increase but increased after I restarted node process.

Amazon EC2 (just virtual machine computer) connects to RDS which is DB server.

Two RDS works, one is primary (master) and the other is slave. Each other changes when fail-over happens.
EC2 doesn't connect to RDS by IP, but it connects with dns name. dns endpoint is provided by Amazon and it redirects to primary RDS server at that time.

This RDS fail-over works well with other program written in python, reconnects very smoothly when RDS reboots or some problem occurs.
But node-mysql doesn't reconnect as expected.

Sorry that I couldn't get any info from pool. I didn't log about node-mysql that much.
But it's reproducible to this kind of problem.

I used cluster to run many nodejs process and each process has 10 connections max.

Maybe it's dns resolve problem, which previous pool connection trying to connect to same IP again and again. I'm not sure but just guessing because RDS works different, needs DNS info refreshed.

Curious thing is that it worked a bit too. I'm not sure it worked because it reconnected or some idle connection was established but some data was sent correctly.

Any idea?

@dougwilson
Copy link
Member

Any idea?

I don't really have any. So for the host -> DNS, that is just using the internal node.js core mechanisms, so I don't have any comment there about how that is working. Also, if the connection gets lost in this pool, a new connection will only be reestablished the next time to try to make a query, so if you are not making queries, the connections won't come back automatically.

Also, what version of this module are you using. Can you try the newest 2.3.0 if you are not using it and see if anything changes (there are various improvements/fixes that may be impacting you)?

@miningpoolhub
Copy link
Author

Queries are inserted frequently like 10~50 per second. Maybe I had to wait more time to see if things were going well, but I couldn't wait long because it is real production service.

I'm using 2.2.0
I will try new version.

@dougwilson
Copy link
Member

I'm using 2.2.0

OK, 2.2.0 should be the newest I really wanted you to try, but you are using it.

But it's reproducible to this kind of problem.

I don't have access to any Amazon services to try and reproduce the issue to determine what is happening. It would be really helpful if you could figure out what the issue is so I would know how to make a patch for it, otherwise I'm just in the dark about what the issue you are experiencing is.

@dougwilson
Copy link
Member

/cc @sidorares

@miningpoolhub
Copy link
Author

What information do you need?
How can I enable debug info or things with node-mysql especially for pooled query?
I just use

pool = mysql.createPool()
pool.query()

Sorry that I'm new to nodejs, not sure how to examine it.

@sidorares
Copy link
Member

It's not very clear to me as well what problem are you solving. Number of connections in AWS console is expected to drop after restart - pool opens new connections only if there is no idle connection available and you requesting new one.

Are you actually getting errors on mysql client side because it's using old IP after failover? If not, could you add some connection identifying SQL and log results ( server id / thread id )?

@miningpoolhub
Copy link
Author

I will test more and write it soon. Thanks all.

@miningpoolhub
Copy link
Author

I analyzed deep into it today.

When Amazon tries Multi-AZ fail-over, two DB instance changes their position by redirecting endpoint url.
Amazon provides endpoint address like "testtesttest.cqofkxwqldcj.us-east-1.rds.amazonaws.com" not an IP and it redirects to master DB instance at that moment.
So connection IP may change time to time when fail-over occurs.

I couldn't get any error when doing fail-over with node-mysql. It works as it didn't disconnected at all.
DB records inserted successfully (not showing any error), but DB record data is not shown on DB. I think these newly inserted data have gone away to Backup DB instance.

Pool connection works perfectly when I tried just restarting DB instance not using Multi-AZ fail over. It says PROTOCOL_CONNECTION_LOST, ECONNREFUSED properly and works well when DB instance restarted some minutes later.

One more thing, I connected HeidiSQL to Master DB from my client PC and it disconnected immediately when I fail-over the DB instance. So HeidiSQL detects disconnection at right moment and tried to reconnect to endpoint url again.

I think node-mysql doesn't detect disconnection well or Amazon doesn't disconnect backup DB connection from which is from their server farm. Since other python based mysql projects works well with Amazon MySQL Multi-AZ, it maybe a specific issue with node-mysql only.

I'm using latest node-mysql version. I used today's latest souce.

@dougwilson
Copy link
Member

Yea, it does sound like there must be some disconnection indicator if other libs are seeing the disconnect. Just as a curiosity, would it be possible to try the mysql2 library to see if it has the same problem?

@miningpoolhub
Copy link
Author

I tried mysql2 too but no luck.
I found that setting "queueLimit: 10" makes the pool query to output error. (It didn't even show error before) Since queries were stacked on queue unlimitedly, it didn't return error code. I thought it was inserted to DB correctly but it didn't at all.

When error occurs, I tried to reconnect something like this code.

pool.query(querystring, ..., ...,
function(err, result) {
if (err) {
pool.end();
pool = mysql.createPool(......);
}

It worked as expected but had some timing issue that createPool() was called constantly by other query's fail.
I increased queueLimit to 100 and the error did not happen frequently but happened time to time.

I want to know for clean and robust code to reset pool connection and reinitialization.
One thing the main problem is that Amazon Multi-AZ doesn't disconnect previous connection, pool also doesn't get disconnect event too.
node-mysql catches disconnection event when Amazon RDS just reboots, so this is some bug about Multi-AZ fail over on Amazon side.

Any idea for workaround or clean code?

@dougwilson
Copy link
Member

Can you see what the behavior is if you set your connectTimeout value low? For example, add " connectTimeout: 10" to your configuration.

@miningpoolhub
Copy link
Author

When I set connectTimeout: 10
it seems to work!
But I don't think it's stable.

  1. I'm curious that connectTimeout is millisecond but 'ETIMEDOUT error' log didn't appear immediately. It took about 30sec ~ 10 minutes to reconnect. This is weird. Is it normal that connectTimeout value is not affected much?
  2. And when query is not inputted much after failover, new connections are made but some previous connections are still connected to backup server. I checked it by netstat and it was ESTABLISHED state. It's not a big problem, but it doesn't use pool connection in round-robin style.

It is stated as
"Connections are also cycled round-robin style, with connections being taken from the top of the pool and returning to the bottom."

I am worried about it that some potential problem lies in it. Like round robin is working as expected but timeout didn't happen to previous connection and something stuck. Well DB records are inserted successfully though. Hope it's just round robin problem or false documented.

Anyway, Amazon RDS is the core problem that didn't disconnect client when failover occurs. I submitted this issue to Amazon RDS forum.

@dougwilson
Copy link
Member

Oops, I was thinking more like 10s, not 10ms :) How about "connectTimeout: 10000"? I think setting it to 10ms will cause weird behavior for sure, though, because that is way too quick.

@mseddon
Copy link

mseddon commented Jun 12, 2014

Anyway, Amazon RDS is the core problem that didn't disconnect client when failover occurs.
I submitted this issue to Amazon RDS forum.

Unfortunately, if for example, the AZ's datacentre goes completely offline for some reason, Amazon 'fixing' this problem this way will not help, you are merely treating the symptoms and failover will still not happen. It is perfectly feasible in many failure conditions that the connection could just die without ever sending you a TCP RST, and that all users are susceptible to this problem, however the behaviour of RDS has drawn attention to the problem.

I haven't had time to look into this just yet, but I can confirm it happening here on our RDS cluster. The fix must be applied here (or upstream if that is where the problem actually is).

I'll check back with more information once I've had a chance to look at it.

@dougwilson
Copy link
Member

It is perfectly feasible in many failure conditions that the connection could just die without ever sending you a TCP RST

Yes, that is definitely an issue. Without a RST packet, there is no way to know the TCP connection has closed by the other side. We'll need to set a timeout when doing operations and let that timeout expire to know that the remote host has disappeared. I can definitely implement that.

@mseddon
Copy link

mseddon commented Jun 12, 2014

Yeah, that should hopefully do it (assuming obviously you can dial in the timeout and not prematurely close horribly loaded servers ;))

I'm hoping to allocate some time next week to look at this issue properly and will let you know what tcpdump sees and what's going on inside the code as it happens. Happy to test anything you need, but my time is a bit limited at the moment, so I can't be sure of when it will be done.

@dougwilson
Copy link
Member

assuming obviously you can dial in the timeout and not close horribly loaded servers

If this means "is the timeout configurable"... yes, yes, it will definitely be :)

I'm hoping to allocate some time next week to look at this issue properly and will let you know what tcpdump sees and what's going on inside the code as it happens.

Awesome!

@mseddon
Copy link

mseddon commented Jun 12, 2014

Necessity is the mother of pull requests ;)

@mseddon
Copy link

mseddon commented Jun 20, 2014

Hi again, finally had a chance to briefly look at the problem today in more detail. It's very much to do with the failure to receive that final RST, so this is a very critical bug. When we simulate the AWS failover amazon abruptly kill the server instance and the connection dies. A simple setInterval query selecting CURRENT_TIMESTAMP from the server does the following as examined by tcpdump:

14:36:32.093640 IP mysqlA > client.34221: Flags [.], ack 562, win 114, options                                   [nop,nop,TS val 389305 ecr 1065078], length 0
14:36:34.095415 IP mysqlA > client.34221: Flags [P.], seq 1287:1404, ack 562,                                   win 114, options [nop,nop,TS val 389805 ecr 1065078], length 117
14:36:34.096335 IP client.34221 > mysqlA: Flags [P.], seq 562:613, ack 1404, w                                  in 229, options [nop,nop,TS val 1065579 ecr 389805], length 51
14:36:34.096651 IP mysqlA > client.34221: Flags [.], ack 613, win 114, options                                   [nop,nop,TS val 389806 ecr 1065579], length 0
14:36:36.097251 IP mysqlA > client.34221: Flags [P.], seq 1404:1521, ack 613,                                   win 114, options [nop,nop,TS val 390306 ecr 1065579], length 117
14:36:36.098168 IP client.34221 > mysqlA: Flags [P.], seq 613:664, ack 1521, w                                  in 229, options [nop,nop,TS val 1066080 ecr 390306], length 51
14:36:36.098457 IP mysqlA > client.34221: Flags [.], ack 664, win 114, options                                   [nop,nop,TS val 390306 ecr 1066080], length 0
14:36:38.098862 IP mysqlA > client.34221: Flags [P.], seq 1521:1638, ack 664,                                   win 114, options [nop,nop,TS val 390806 ecr 1066080], length 117
14:36:38.099752 IP client.34221 > mysqlA: Flags [P.], seq 664:715, ack 1638, w                                  in 229, options [nop,nop,TS val 1066580 ecr 390806], length 51
14:36:38.100041 IP mysqlA > client.34221: Flags [.], ack 715, win 114, options                                   [nop,nop,TS val 390807 ecr 1066580], length 0
--- FAILOVER HERE --- HANG  -- NO RST FROM MYSQL

An easy way to reproduce under a linux box would be just to block traffic from port 3306:

iptables -A INPUT -p tcp --source-port 3306 -j DROP

@dougwilson
Copy link
Member

@mseddon thanks for the verification of the missing RST to be the source :) I'll plan to make a new release tonight with changes that should solve the issue.

@dougwilson dougwilson added bug and removed question labels Jun 20, 2014
@dougwilson dougwilson self-assigned this Jun 20, 2014
@dougwilson dougwilson added this to the 2.4 milestone Jun 20, 2014
@mseddon
Copy link

mseddon commented Jun 20, 2014

Excellent, thanks! I'll have a play with that over the weekend and let you know how it works out.

@dougwilson
Copy link
Member

@mseddon sorry, I did not end up having much time yesterday, but I got most of it done. Will be finishing it up today :)

@mseddon
Copy link

mseddon commented Jun 23, 2014

@dougwilson Any updates on this issue?

@dougwilson
Copy link
Member

I have been iterating though different takes in a server farm. They weren't quite working, so I'm just trying to get something that will actually work :)

@mseddon
Copy link

mseddon commented Jun 23, 2014

Ok, no problem! Thanks :)

@dougwilson
Copy link
Member

If you like, please try out the patch with npm install felixge/node-mysql#945b92e and let me know if it corrects your issue.

@mseddon
Copy link

mseddon commented Jul 7, 2014

Printing the result of doing SELECT CURRENT_TIMESTAMP every 2 seconds and then triggering a failover still doesn't cause any reconnect after ten seconds, or indeed five minutes.

SUCCESS:  Mon Jul 07 2014 12:11:34 GMT+0000 (UTC)
SUCCESS:  Mon Jul 07 2014 12:11:36 GMT+0000 (UTC)
SUCCESS:  Mon Jul 07 2014 12:11:38 GMT+0000 (UTC)
SUCCESS:  Mon Jul 07 2014 12:11:40 GMT+0000 (UTC)
[timeout occurs here]

Same problem as before when the RST packet is not received.

/home/ubuntu/node-mysql-test
└─┬ mysql@2.3.2 (git://github.com/felixge/node-mysql.git#945b92e15e080944c79a45ca68e2c6d9f7c3059f)
  ├── bignumber.js@1.4.0
  ├─┬ readable-stream@1.1.13-1
  │ ├── core-util-is@1.0.1
  │ ├── inherits@2.0.1
  │ ├── isarray@0.0.1
  │ └── string_decoder@0.10.25-1
  └── require-all@0.0.8

@miningpoolhub does this patch work for you?

@dougwilson
Copy link
Member

@mseddon can you paste me the code you are using to test? It is important to note that the patch only affects pool.getConnection right now.

@mseddon
Copy link

mseddon commented Jul 7, 2014

ah, that may be it, one moment while I try again using that. Cheers.

@mseddon
Copy link

mseddon commented Jul 7, 2014

Hi- switching the test to use pool.getConnection seems to work better, I'll have a play but this might fix it after all, sorry about that. Thanks :)

@dougwilson
Copy link
Member

@mseddon no problem :) I know there is still work to be done--but from now, it's actually down deep into Node.js internals. When you send data over TCP, you get ACKs for each packet; we should be able to timeout if when sending off the sequence doesn't get ACKs back, but that is not straight-forward in Node.js since the details have been extracted away from us.

@mseddon
Copy link

mseddon commented Jul 7, 2014

So, one observation (though probably not surprising)- if I set the connectionLimit to 1 to simulate a full connection pool, and failover within a query, reconnects still don't happen. I would imagine that this is because we're hung waiting for a free connection, and your timeout check occurs once we have a free connection?

@dougwilson
Copy link
Member

I would imagine that this is because we're hung waiting for a free connection, and your timeout check occurs once we have a free connection?

Correct. The query is hung because of the severed connection. You can always do conn.query({sql: 'SELECT 1', timeout: 30000}) to give the query a 30s timeout, but that's not very practical. AFAIK, Node.js abstracts too much away to know that the TCP connection is gone in the middle of the query without a FIN or RST packet to tell us.

@mseddon
Copy link

mseddon commented Jul 7, 2014

Hmm, does timeout start when the conn.query() command is issued, or when the request is finally submitted to mysql? It's not pretty but it could work for my case.

@dougwilson
Copy link
Member

does timeout start when the conn.query() command is issued, or when the request is finally submitted to mysql?

The timeout is actually based on sequence packet activity, so it would start when the query packet is sent to MySQL server, rather than at the time you called .query().

@mseddon
Copy link

mseddon commented Jul 7, 2014

Ok, thanks, that should be good enough!

@dougwilson
Copy link
Member

Ok, thanks, that should be good enough!

Cool :) I'm still going to be working on adding more protections, though. It may be interesting to know that every command now accepts a timeout option, so you could ping with a timeout conn.ping({timeout: 1000}), etc.

@mysqljs mysqljs locked and limited conversation to collaborators Oct 3, 2014
seangarner pushed a commit to seangarner/node-mysql that referenced this issue May 11, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Development

No branches or pull requests

4 participants