Elasticache Cluster Mode - Throws Connection Errors During Update/Maintenance #703

mattjamesaus · 2021-07-08T05:57:34Z

Describe the bug
We're currently running an AWS Elasticache Redis Cluster with multiple shards and nodes, which works great until a scheduled update event. At that point we start receiving a lot of Predis\Connection\ConnectionException: Connection timed out [tls://clustercfg.##############.eh01hp.use1.cache.amazonaws.com:6379] errors when one of the nodes in the shard gets updated/patched.

It appears that even though there are other nodes in the shard and the rest of the cluster is healthy, requests fail until the server is updated and returned to rotation. Our current predis setup passes the AWS Elasticache configuration endpoint as the host value when initialising predis. The Elasticache cluster configuration endpoint is basically a CNAME that returns the A records of all the nodes in the cluster (including the node that's been taken out of service).

I suspect what's happening here is that predis (due to the order of the A records rotating) attempts to query a single node when making the request to the cluster, and is unable to do autodiscovery of the other nodes when that node is the one out of service. If that's the case, I suspect it may be as simple as changing the way we initialize predis so that it contains all the nodes returned by the configuration endpoint.
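A minimal sketch of what enumerating the nodes could look like, assuming the configuration endpoint resolves to one A record per node as described above. The function name enumerateNodeUris and the injectable $resolver parameter are hypothetical, introduced only for illustration; gethostbynamel() is PHP's standard resolver call. Whether seeding predis with every node actually avoids the timeouts is the untested hypothesis of this report.

```php
<?php
declare(strict_types=1);

/**
 * Resolve a hostname to one connection URI per A record.
 *
 * $resolver is injectable so the function can be exercised without real DNS;
 * by default PHP's gethostbynamel() is used.
 *
 * @return string[] e.g. ["tls://10.0.0.1:6379", "tls://10.0.0.2:6379"]
 */
function enumerateNodeUris(
    string $host,
    int $port = 6379,
    string $scheme = 'tls',
    ?callable $resolver = null
): array {
    $resolver = $resolver ?? 'gethostbynamel';
    $ips = $resolver($host);

    if ($ips === false || $ips === []) {
        // Resolution failed: fall back to the original hostname.
        return [sprintf('%s://%s:%d', $scheme, $host, $port)];
    }

    // De-duplicate and sort so the result does not depend on DNS rotation order.
    $ips = array_values(array_unique($ips));
    sort($ips);

    return array_map(
        static function (string $ip) use ($scheme, $port): string {
            return sprintf('%s://%s:%d', $scheme, $ip, $port);
        },
        $ips
    );
}
```

One caveat with this approach: when connecting over TLS by IP address, certificate name verification may fail unless the expected peer name is supplied separately.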

On that note, does predis cache the autodiscovered nodes from the cluster for a period of time, like phpredis does? I wasn't able to find anything about this in the documentation.

Lastly, if I'm on the right track about DNS being the problem, would you be open to having the library enumerate the nodes from a DNS name that returns multiple records, either behind a flag at initialization or by default?

I've had a good hunt to find any real documentation regarding this specific issue and have come up with basically nothing.

To Reproduce
Steps to reproduce the behavior:
Run an elasticache cluster using the cluster configuration endpoint as the host value for predis. Then perform patching / maintenance on a node.

Expected behavior
Predis should continue serving traffic from the other nodes in the cluster (once they're promoted), or at the very least not return a timeout error.

Versions (please complete the following information):
v1.1.7



tillkruss · 2021-07-09T18:30:37Z

Thanks for reporting this, @mattjamesaus. Can you submit a PR with a solution?

mattjamesaus · 2021-07-11T01:21:01Z

Well, it doesn't necessarily need to be implemented in predis; it could be left to the calling code to enumerate these values. I guess what I'm trying to determine is whether the behaviour described above is the expected result, and if so, that's OK.

I think changing this behaviour could be dangerous if others rely on it staying as it is.

It could be a case of just updating the docs to call this out, with an example of how to enumerate the records prior to hand-off.
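Such a doc example might look roughly like the sketch below, which covers only the hand-off step. The helper buildClusterArguments, the IP addresses, and the hostname are hypothetical placeholders. The 'cluster' => 'redis' client option and the 'ssl' connection parameter are documented predis options; passing 'peer_name' through 'ssl' relies on it being forwarded as a PHP SSL stream context option, which is an assumption worth verifying against the installed predis version.

```php
<?php
declare(strict_types=1);

/**
 * Build the two constructor arguments for Predis\Client from a list of
 * already-enumerated node addresses. Returns [$parameters, $options].
 */
function buildClusterArguments(array $ips, string $peerName, int $port = 6379): array
{
    $parameters = [];
    foreach ($ips as $ip) {
        $parameters[] = [
            'scheme' => 'tls',
            'host'   => $ip,
            'port'   => $port,
            // Assumption: the certificate is issued for the hostname, so
            // verify against that name rather than the IP address.
            'ssl'    => ['peer_name' => $peerName],
        ];
    }

    // Ask predis to treat the node list as a Redis Cluster.
    $options = ['cluster' => 'redis'];

    return [$parameters, $options];
}

// Hypothetical addresses, standing in for the records returned by the
// configuration endpoint.
[$parameters, $options] = buildClusterArguments(
    ['192.0.2.10', '192.0.2.11'],
    'clustercfg.example.invalid'
);

// Only construct the client when predis is actually installed and autoloaded.
if (class_exists('Predis\Client')) {
    $client = new Predis\Client($parameters, $options);
}
```

Keeping the enumeration outside the library, as suggested above, leaves the existing predis behaviour unchanged for anyone who depends on it.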

mattjamesaus added the bug label Jul 8, 2021

tillkruss closed this as completed May 26, 2022
