Elasticache Cluster Mode - Throws Connection Errors During Update/Maintenance #703

Closed
mattjamesaus opened this issue Jul 8, 2021 · 2 comments

@mattjamesaus

Describe the bug
We're currently running an AWS ElastiCache Redis cluster with multiple shards and nodes, which works great until a scheduled update event. When one of the nodes in a shard gets updated or patched, we start receiving a lot of Predis\Connection\ConnectionException: Connection timed out [tls://clustercfg.##############.eh01hp.use1.cache.amazonaws.com:6379] errors.

It appears that even though there are other nodes in the shard and the rest of the cluster is healthy, requests fail until the node is updated and returned to rotation. Our current Predis setup passes the AWS ElastiCache configuration endpoint as the host value when initialising Predis. The configuration endpoint is basically a CNAME that returns the A records of all the nodes in the cluster (including the node that's been taken out of service).

I suspect what's happening is that Predis (due to the order of the A records rotating) attempts to query a single node when making the request to the cluster and is unable to autodiscover the other nodes. If that's the case, I suspect it may be as simple as changing the way we initialise Predis so that it is given all the nodes returned by the configuration endpoint.
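A minimal sketch of that idea, assuming the configuration endpoint really does return one A record per node. The function name `enumerateSeedNodes` and the injectable `$resolver` argument are hypothetical and only there for illustration; the only real call used is PHP's built-in `dns_get_record()`.

```php
<?php
/**
 * Resolve a hostname to one seed URI per A record.
 *
 * Hypothetical helper, not part of Predis. The optional $resolver callable
 * takes a hostname and returns a list of IP strings, so the function can be
 * exercised without real DNS.
 */
function enumerateSeedNodes(string $host, int $port = 6379, string $scheme = 'tls', ?callable $resolver = null): array
{
    if ($resolver === null) {
        $resolver = static function (string $h): array {
            // dns_get_record() returns false on failure, otherwise a list of
            // records; A records carry the address under the 'ip' key.
            $records = dns_get_record($h, DNS_A);
            return $records === false ? [] : array_column($records, 'ip');
        };
    }

    // De-duplicate and sort so the result does not depend on DNS rotation order.
    $ips = array_values(array_unique($resolver($host)));
    sort($ips);

    return array_map(
        static function (string $ip) use ($scheme, $port): string {
            return sprintf('%s://%s:%d', $scheme, $ip, $port);
        },
        $ips
    );
}
```

The resulting list could then be passed as the first argument to `new Predis\Client($seeds, ['cluster' => 'redis'])` in place of the single endpoint hostname. Whether that actually avoids the timeouts during maintenance is untested here.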

On that note, does Predis cache the autodiscovered nodes from the cluster for a period of time like phpredis does? I wasn't able to find anything in the documentation.

Lastly, if I'm on the right track about DNS being the problem, would you be open to having the library enumerate the nodes from a DNS name that returns multiple records, either via a flag at initialisation or by default?

I've had a good hunt for documentation on this specific issue and have come up with basically nothing.

To Reproduce
Steps to reproduce the behavior:
Run an ElastiCache cluster using the cluster configuration endpoint as the host value for Predis, then perform patching / maintenance on a node.

Expected behavior
Predis continues serving traffic from the other nodes in the cluster (once they're promoted), or at the very least doesn't return a timeout error.

Versions (please complete the following information):
v1.1.7

@tillkruss
Member

Thanks for reporting this, @mattjamesaus. Can you submit a PR with a solution?

@mattjamesaus
Author

Well, it doesn't necessarily need to be implemented in Predis; it could be up to the wrapping code to enumerate these values. I guess what I'm trying to determine is whether the behaviour described above is the expected result. If so, that's OK.

I think changing this behaviour could be dangerous if others are relying on it to behave consistently.

It could be a case of just updating the docs to call this out, with an example of how to enumerate the records prior to hand-off.
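Something along these lines might serve as that doc example for the hand-off step. It is only a sketch: `buildClusterParameters` is a hypothetical helper, the IPs are placeholders from the documentation range, and setting `peer_name` to the endpoint hostname assumes the node certificates validate against that name, which has not been checked against ElastiCache.

```php
<?php
/**
 * Turn already-resolved node IPs into Predis connection parameters.
 * Hypothetical helper for illustration; not part of Predis.
 */
function buildClusterParameters(array $ips, string $endpointHost, int $port = 6379): array
{
    $parameters = [];
    foreach ($ips as $ip) {
        $parameters[] = [
            'scheme' => 'tls',
            'host'   => $ip,
            'port'   => $port,
            // Assumption: connecting by IP needs the TLS peer name set
            // explicitly so certificate verification uses a hostname.
            'ssl'    => ['peer_name' => $endpointHost],
        ];
    }
    return $parameters;
}

// Hand-off: only attempted when Predis is actually installed and autoloadable.
if (class_exists('Predis\Client')) {
    $nodes  = buildClusterParameters(['192.0.2.10', '192.0.2.11'], 'clustercfg.example.invalid');
    $client = new Predis\Client($nodes, ['cluster' => 'redis']);
}
```

Keeping the enumeration in the calling code leaves Predis' existing behaviour untouched for everyone else, which sidesteps the consistency concern above.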
