This repository has been archived by the owner. It is now read-only.

Pool does not reconnect on db failover #108

Closed
rfk opened this issue Sep 29, 2015 · 13 comments

Comments

@rfk
Member

@rfk rfk commented Sep 29, 2015

From email conversation with @jrgm:

"""
I was testing RDS Multi-AZ failover in stage under load on Friday.

The TL;DR is that fxa-auth does not reconnect on failover. Ugh.

fxa-oauth and fxa-profile did reconnect, and on a restart fxa-auth
reconnects. But on 3 of 4 occasions, fxa-auth wound up in a state where it
kept putting queries in the queue and returning 500 on 'Error: Queue limit
reached'.
"""

We do some mucking around with the error-recovery options, which may be affecting this.
From https://github.com/mozilla/fxa-auth-db-mysql/blob/master/lib/db/mysql.js#L34:

    // poolCluster will remove the pool after `removeNodeErrorCount` errors.
    // We don't ever want to remove a pool because we only have one pool
    // for writing and reading each. Connection errors are mostly out of our
    // control for automatic recovery so monitoring of 503s is critical.
    // Since `removeNodeErrorCount` is Infinity, `canRetry` must be false
    // to prevent infinite retry attempts.
    this.poolCluster = mysql.createPoolCluster(
      {
        removeNodeErrorCount: Infinity,
        canRetry: false
      }
    )
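
To illustrate how the 'Queue limit reached' errors from the email could arise, here is a minimal sketch, assuming the pooled connections hang after failover and are never released. `TinyPool` and `simulateHungFailover` are hypothetical names invented for this illustration; this is a toy model, not the mysql library's pool implementation, and the limits used are arbitrary.

```javascript
// Hypothetical toy pool for illustration only; NOT the mysql library's code.
class TinyPool {
  constructor({ connectionLimit, queueLimit }) {
    this.free = connectionLimit; // idle connections available
    this.queueLimit = queueLimit; // max callers allowed to wait
    this.queue = []; // callers waiting for a connection
  }

  // Start a query: returns 'running' or 'queued', throws when the queue is full.
  acquire(label) {
    if (this.free > 0) {
      this.free -= 1;
      return 'running';
    }
    if (this.queue.length >= this.queueLimit) {
      throw new Error('Queue limit reached');
    }
    this.queue.push(label);
    return 'queued';
  }

  // A query finished: hand its connection to the next waiter, if any.
  release() {
    if (this.queue.length > 0) {
      this.queue.shift(); // the waiter takes over the connection
    } else {
      this.free += 1;
    }
  }
}

// Assumption: after failover no query ever completes, so release() is never called.
function simulateHungFailover(requests) {
  const pool = new TinyPool({ connectionLimit: 2, queueLimit: 3 });
  const outcomes = [];
  for (let i = 0; i < requests; i++) {
    try {
      outcomes.push(pool.acquire(`q${i}`));
    } catch (err) {
      outcomes.push(err.message); // the caller would turn this into a 500
    }
  }
  return outcomes;
}
```

With two connections and a queue of three, the sixth and later requests fail immediately, which matches the symptom of every request returning 500 until the process is restarted.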
@jrgm
Contributor

@jrgm jrgm commented Oct 1, 2015

So, I spent some more time with this problem in stage.

If I simply reboot the RDS instance with no failover, then after a short interruption, normal operation comes back with no intervention. With traces of DNS and port 3306 traffic in place, I can see that FIN packets arrive from the RDS instance, followed by a period in which further pushes get a RST packet in response. Then comes a series of SYN attempts to the RDS instance at the same IP as before (which are answered with RST), and finally the SYN handshake is accepted and normal connections are re-established. For DNS, at some point around the disconnect, lookups to resolve dbwrite are made, returning the same IP as before.

That's all normal.

If I do a reboot with failover, I do not see FIN packets. No DNS requests are made for dbwrite. In fact, I see nothing odd, except for the fact that I should be seeing something happen. Packets continue to be pushed and ack-ed, as if the original master db has not shut down.
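
Since the failover case above produces no FIN or RST, a client cannot rely on socket close events to notice the problem. One possible mitigation, which nothing in this thread says was adopted, is an application-level deadline on in-flight queries. The sketch below is a minimal illustration under that assumption; `StallDetector`, its methods, and the injected clock are hypothetical names, not part of fxa-auth or the mysql library.

```javascript
// Hypothetical helper: flags queries that have been in flight longer than a deadline.
class StallDetector {
  // `now` is injectable so the logic can be exercised without real timers.
  constructor(deadlineMs, now = Date.now) {
    this.deadlineMs = deadlineMs;
    this.now = now;
    this.inFlight = new Map(); // query id -> start time in ms
  }

  // Record that a query was sent.
  started(id) {
    this.inFlight.set(id, this.now());
  }

  // Record that a query got a response (or an error).
  finished(id) {
    this.inFlight.delete(id);
  }

  // Return the ids of queries that have exceeded the deadline.
  stalled() {
    const t = this.now();
    const ids = [];
    for (const [id, startedAt] of this.inFlight) {
      if (t - startedAt > this.deadlineMs) ids.push(id);
    }
    return ids;
  }
}
```

A caller might poll `stalled()` periodically and, when it returns anything, destroy the affected connection so that the next query opens a fresh one and re-resolves the hostname.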

Two days ago, I think that I saw some network activity, and sockets winding up in non-ESTABLISHED states, so I'll have to try this some more, but I'm out of time for this week and next (although I think we are having a chat with some of the AWS engineering teams, so I'll ask about exactly what is supposed to happen on failover).

@vladikoff
Contributor

@vladikoff vladikoff commented Oct 5, 2015

moved to now...

@rfk
Member Author

@rfk rfk commented Oct 26, 2015

@jrgm from your comment above, it sounds like this may be due to weirdness within the AWS networking environment itself. Is that accurate? Is there anything we should be digging into on the dev side at this point?

One small concrete thing we can do: update to the latest version of the mysql client lib, which I think has changes to the way it does pooling.

@rfk
Member Author

@rfk rfk commented Oct 30, 2015

@jrgm I'm going to put "update mysql dependencies" in a separate bug and move this out of "now". Let me know if there's more you want us to follow up on from a dev perspective, or whether this is only something you can dig into in a deployed environment.

@rfk rfk removed their assignment Oct 30, 2015
@vladikoff
Contributor

@vladikoff vladikoff commented Nov 2, 2015

@rfk and @dannycoates any thoughts?

@rfk rfk removed their assignment Nov 16, 2015
@rfk
Member Author

@rfk rfk commented Nov 16, 2015

Backlogging this for now. We're going to try updating the mysql package, and then we'll need @jrgm to check whether the problem is resolved in a deployed AWS environment.

@vladikoff
Contributor

@vladikoff vladikoff commented Nov 16, 2015

Needs more investigation, backlogging.

@vladikoff
Contributor

@vladikoff vladikoff commented Dec 15, 2015

@rfk would this be fixed by #112 ?

@rfk
Member Author

@rfk rfk commented Dec 15, 2015

> would this be fixed by #112 ?

Maybe. We need to get that landed and then test it again.

@rfk rfk removed their assignment Dec 15, 2015
@vladikoff
Contributor

@vladikoff vladikoff commented Jun 7, 2016

Does this still happen @rfk @jrgm ?

@rfk
Member Author

@rfk rfk commented Jun 8, 2016

@jrgm unless you have plans to re-test this sometime soon, I think we should just close out the bug and open a fresh one if it bites us again.

@jrgm
Contributor

@jrgm jrgm commented Jun 9, 2016

I thought I updated on this, but maybe in a different bug. Seems to work last time I went through this for realz. Closing.

@jrgm jrgm closed this Jun 9, 2016
@vladikoff
Contributor

@vladikoff vladikoff commented Jun 9, 2016

👍
