Pool does not reconnect on db failover #108
Comments
|
So, I spent some more time with this problem in stage. If I simply reboot the RDS instance with no failover, then after a short interruption, normal operation comes back with no intervention. With traces of DNS and port 3306 traffic in place, I can see that FIN packets arrive from the RDS instance, followed by a period in which further pushes get a RST in response. Then there is a series of SYN attempts to the RDS instance at the same IP as before (each answered with RST), and finally the handshake is accepted and normal connections are re-established. For DNS, at some point around the disconnect, lookups to resolve dbwrite are made, returning the same IP as before. That's all normal. If I do a reboot with failover, I do not see FIN packets. No DNS requests are made for dbwrite. In fact, I see nothing odd, except that I should be seeing something happen. Packets continue to be pushed and ACKed, as if the original master db has not shut down. Two days ago, I think I saw some network activity, and sockets winding up in non-ESTABLISHED states, so I'll have to try this some more, but I'm out of time for this week and next (although I think we are having a chat with some of the AWS engineering teams, so I'll ask about exactly what is supposed to happen on failover). |
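One way to make the "no DNS requests are made for dbwrite" observation testable from the app side is to re-resolve the hostname on a timer and notice when the answer changes. The following is only a sketch under assumptions: `createDnsWatcher`, `sameAddressSet`, and the `onChange` callback are hypothetical names, not anything in fxa-auth-db-mysql or the mysql client lib, and whether recycling the pool on a DNS change would actually fix the failover case is untested here.

```javascript
// Sketch only: createDnsWatcher, sameAddressSet and onChange are
// hypothetical names, not part of fxa-auth-db-mysql or the mysql package.
const dns = require('dns');

// Compare two address lists while ignoring the order of the answers.
function sameAddressSet(a, b) {
  return a.slice().sort().join(',') === b.slice().sort().join(',');
}

// `resolve` is injectable so the logic can be exercised without real DNS.
// The default issues a real A-record query through Node's dns module.
function createDnsWatcher(hostname, onChange, resolve) {
  const lookup = resolve || ((name) => dns.promises.resolve4(name));
  let lastSeen = null; // answer from the previous check, if any

  // Resolve once; call onChange if the answer differs from the last one.
  async function check() {
    const addresses = await lookup(hostname);
    const changed = lastSeen !== null && !sameAddressSet(lastSeen, addresses);
    lastSeen = addresses;
    if (changed) {
      onChange(addresses);
    }
    return changed;
  }

  return { check };
}
```

A caller could run `check()` from a `setInterval` and, in `onChange`, log the new addresses or tear down and rebuild the connection pool. The first call only records a baseline and never reports a change.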
|
moved to now... |
|
@jrgm from your comment above, it sounds like this may be due to weirdness within the AWS networking environment itself, is that accurate? Is there anything we should be digging into on the dev side at this point? One small concrete thing we can do: update to the latest version of the mysql client lib, which I think has changes to the way it does pooling. |
|
@jrgm I'm going to put "update mysql dependencies" in a separate bug and move this out of "now". Let me know if there's more you want us to follow up on from a dev perspective, or whether this is only something you can dig into in a deployed environment. |
|
@rfk and @dannycoates any thoughts? |
|
Backlogging this for now. We're going to try updating the mysql package, and then we'll need @jrgm to check whether the problem is resolved in a deployed AWS environment. |
|
needs more investigation, backlogging
Maybe, we need to get that landed and then test it again. |
|
@jrgm unless you have plans to re-test this sometime soon, I think we should just close out the bug and open a fresh one if it bites us again. |
|
I thought I updated on this, but maybe in a different bug. It seemed to work the last time I went through this for realz. Closing. |
|
|
From email conversation with @jrgm:
"""
I was testing RDS Multi-AZ failover in stage under load on Friday.
The TL;DR is that fxa-auth does not reconnect on failover. Ugh.
fxa-oauth and fxa-profile did reconnect, and on a restart fxa-auth
reconnects. But on 3 of 4 occasions, fxa-auth wound up in a state where it
kept putting queries in the queue and returning 500 on 'Error: Queue limit
reached'.
"""
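The failure mode in that email (queries piling up until 'Queue limit reached') can be reproduced with a toy model of a bounded pool whose connections are never handed back. This is an illustration only: `ToyPool`, `connectionLimit`, and `queueLimit` are names chosen for the sketch, and it does not claim to mirror how the mysql client lib implements its pool internally.

```javascript
// Toy model, not the mysql library: ToyPool and its option names are
// illustrative and exist only in this sketch.
class ToyPool {
  constructor({ connectionLimit, queueLimit }) {
    this.connectionLimit = connectionLimit;
    this.queueLimit = queueLimit;
    this.busy = 0;   // connections currently checked out
    this.queue = []; // callers waiting for a free connection
  }

  // Ask for a connection. Resolves with a release function, or rejects
  // once the wait queue is already full.
  acquire() {
    if (this.busy < this.connectionLimit) {
      this.busy += 1;
      return Promise.resolve(() => this.release());
    }
    if (this.queue.length >= this.queueLimit) {
      return Promise.reject(new Error('Queue limit reached'));
    }
    return new Promise((resolve) => this.queue.push(resolve));
  }

  // Hand the connection straight to the next waiter, or mark it idle.
  release() {
    const next = this.queue.shift();
    if (next) {
      next(() => this.release());
    } else {
      this.busy -= 1;
    }
  }
}
```

If every checked-out connection is stuck on a socket that never errors and never completes, `release()` is never called, the queue fills to `queueLimit`, and each later `acquire()` rejects immediately. That would look like the steady stream of 500s described above, and a process restart clears it because the pool starts empty again.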
We do some mucking around with the error-recovery options, which may be affecting this.
From https://github.com/mozilla/fxa-auth-db-mysql/blob/master/lib/db/mysql.js#L34: