ClusterAllFailedError on version 4.24.1 #1330

Closed

sundeqvist opened this issue Apr 8, 2021 · 11 comments

@sundeqvist
sundeqvist commented Apr 8, 2021

Hey! We're using ioredis with our AWS ElastiCache cluster, which runs 3 shards on version 5.0.5. We make the Redis calls from a Lambda with quite high traffic, meaning multiple Lambdas run concurrently.

After some version bumping I'm seeing the error ClusterAllFailedError: Failed to refresh slots cache intermittently. Through some debugging I've narrowed the culprit down to version 4.24.1 - any version before that works fine. With DEBUG=ioredis:* set in the Lambda environment, the ClusterAllFailedError: Failed to refresh slots cache is in most cases followed by these logs:

ioredis:cluster:connectionPool Reset with []
ioredis:cluster:connectionPool Disconnect <ip>:<port> because the node does not hold any slot
ioredis:cluster:connectionPool Remove <ip>:<port> from the pool

Looking at the 4.24.1 commit 8524eea, I can tell that code related to this error was touched - could that fix have introduced unintended issues? Any pointers would be appreciated 👍

@luin
Collaborator

luin commented Apr 8, 2021

Hi @sundeqvist, thanks for raising this issue!

Before 4.24.1, ioredis asked cluster nodes for cluster slot information when connecting, and periodically after it was connected. If all cluster nodes failed to provide the information (e.g. all nodes were down), ioredis would raise the "Failed to refresh slots cache" error and reconnect to the cluster (printing the debug log Reset with []) if it hadn't connected yet; otherwise (during a periodic refresh) it would just ignore the failure.

However, since 4.24.1, ioredis raises the error and reconnects to the cluster even if the cluster has already connected. This change was introduced to make failover detection faster.

For your case, I'd suggest listening for the "node error" event (cluster.on('node error', err => console.error(err))) and seeing which errors cause the issue. This event is emitted every time a cluster node fails to provide slot information.
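A minimal sketch of that suggestion (the endpoint below is a placeholder; per the docs the second listener argument identifies the failing node):

const Redis = require('ioredis');

const cluster = new Redis.Cluster([{ host: '127.0.0.1', port: 6379 }]);

// Emitted each time an individual cluster node fails to provide slot information.
cluster.on('node error', (err, address) => {
  console.error(`Slot refresh failed on node ${address}:`, err);
});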

@lvkmahesh

lvkmahesh commented Apr 14, 2021

Hello,
We are also facing the same issue after 4.24.1. We are using an AWS ElastiCache cluster. This happens intermittently, and when it does, our clusterRetryStrategy function gets called only once, after which we keep getting the "ClusterAllFailedError: Failed to refresh slots cache" error in the error handler. It also doesn't happen in all of the service's containers, only in some of them. For us, once broken, the retry does not work even if the cluster is healthy. Please advise.

this._client = new Redis.Cluster(
	[{ host: redisConfig.host, port: redisConfig.port }],
	{
		clusterRetryStrategy: function (retryCount) {
			console.log('retrying for redis connection')
			return Math.min(100 + retryCount * 2, 5000)
		},
		enableReadyCheck: true
	}
)

this._client.on('ready', () => {
	// able to connect
	console.log('successfully connected to redis, cache')
})

this._client.on('error', (error) => {
	// error while connecting to redis
	console.error('error received from redis, cache', { err: error.message, errStack: error.stack })
})

@luin
Collaborator

luin commented Apr 14, 2021

Hi @lvkmahesh , I just had a talk with @leibale about the issue. Actually, can you add a listener for the "node error" event (just like the reply I posted) and post the errors here? We'd like to better understand what caused the issue.

@sundeqvist
Author

Thanks for looking at this so quickly. The trace that is emitted in the node error event looks as follows:

Error: timeout
at Object.timeout (/var/task/node_modules/ioredis/built/utils/index.js:159:38)
at Cluster.getInfoFromNode (/var/task/node_modules/ioredis/built/cluster/index.js:660:55)
at tryNode (/var/task/node_modules/ioredis/built/cluster/index.js:395:19)
at Cluster.refreshSlotsCache (/var/task/node_modules/ioredis/built/cluster/index.js:414:9)
at Timeout._onTimeout (/var/task/node_modules/ioredis/built/cluster/index.js:108:22)
at listOnTimeout (internal/timers.js:554:17)
at processTimers (internal/timers.js:497:7)

@artur-ma

artur-ma commented May 2, 2021

Same here after upgrading 4.19.2 => 4.27.1.

We see Redis - error: ClusterAllFailedError: Failed to refresh slots cache. when intensive writes against the Redis cluster start,
and then a lot of Error: Cluster isn't ready and enableOfflineQueue options is false
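For reference, a hedged sketch of what that second error means: with the offline queue disabled, commands issued while the cluster is reconnecting are rejected immediately, whereas the default setting buffers them until the cluster is ready again (endpoint below is a placeholder):

const Redis = require('ioredis');

const cluster = new Redis.Cluster([{ host: '127.0.0.1', port: 6379 }], {
  // true is the default: commands issued while the cluster is not ready are
  // queued and sent once it reconnects, instead of being rejected with
  // "Cluster isn't ready and enableOfflineQueue options is false".
  enableOfflineQueue: true,
});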

leibale added a commit to leibale/ioredis that referenced this issue May 3, 2021
@luin luin closed this as completed in aa9c5b1 May 3, 2021
@alexandrugheorghe

I believe the commit might have failed to build and deploy. @luin Could you have another look?

ioredis-robot pushed a commit that referenced this issue May 4, 2021
## [4.27.2](v4.27.1...v4.27.2) (2021-05-04)

### Bug Fixes

* **cluster:** avoid ClusterAllFailedError in certain cases ([aa9c5b1](aa9c5b1)), closes [#1330](#1330)
@ioredis-robot
Collaborator

🎉 This issue has been resolved in version 4.27.2 🎉

The release is available on:

Your semantic-release bot 📦🚀

@trademark18

Hi all, I'm on 4.27.6 and I'm having a similar issue where I see this error only when under heavy load:

ERROR [ioredis] Unhandled error event: ClusterAllFailedError: Failed to refresh slots cache.
at tryNode (/var/task/node_modules/ioredis/built/cluster/index.js:396:31)
at /var/task/node_modules/ioredis/built/cluster/index.js:413:21
at Timeout.<anonymous> (/var/task/node_modules/ioredis/built/cluster/index.js:671:24)
at Timeout.run (/var/task/node_modules/ioredis/built/utils/index.js:156:22)
at listOnTimeout (internal/timers.js:557:17)
at processTimers (internal/timers.js:498:7)

And then the Lambda ends with this error message:

{
  "errorType": "Error",
  "errorMessage": "None of startup nodes is available",
  "stack": [
    "Error: None of startup nodes is available",
    " at Cluster.closeListener (/var/task/node_modules/ioredis/built/cluster/index.js:184:35)",
    " at Object.onceWrapper (events.js:482:28)",
    " at Cluster.emit (events.js:388:22)",
    " at /var/task/node_modules/ioredis/built/cluster/index.js:367:18",
    " at processTicksAndRejections (internal/process/task_queues.js:77:11)"
  ]
}

Here's my Redis.Cluster() code:

cluster = new Redis.Cluster(
[
	{ 'host': redisSecret.host }
], 
{
	dnsLookup: (address, callback) => callback(null, address),
	redisOptions: {
		tls: true,
	},
	clusterRetryStrategy: (times) => {
		const ms = Math.min(100 * times, 2000);
		console.log(`Cluster retry #${times}: Will wait ${ms} ms`);
		return ms;
	}
}
);

Questions:

  1. Can I avoid it entirely?
  2. Since the Lambda errors out, it's going to trigger alarms and such. I'd rather have it just silently attempt to reconnect to the cluster. Is there some way to catch this and then just proceed silently to the retry?

Sorry to comment on a closed issue but this is the only place I've found anyone talking about seeing this error only under heavy load as opposed to just incorrect network configuration or similar.
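For question 2, a hedged sketch of one way to keep the process alive: the log shows an *unhandled* 'error' event, and a Node EventEmitter with no 'error' listener throws, so attaching one lets the Lambda log the failure while the retry strategy keeps running in the background:

// 'cluster' is the Redis.Cluster instance created above
cluster.on('error', (err) => {
  // Log and carry on; ioredis keeps reconnecting per clusterRetryStrategy
  console.warn('Redis cluster error (will retry):', err.message);
});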

@bkvaiude

bkvaiude commented Jun 16, 2021

I need help drilling down into the following issue.
Please see the following information for more context:

ClusterAllFailedError: Failed to refresh slots cache.
at tryNode (/var/task/node_modules/ioredis/built/cluster/index.js:396:31)
at /var/task/node_modules/ioredis/built/cluster/index.js:413:21
at Timeout.<anonymous> (/var/task/node_modules/ioredis/built/cluster/index.js:671:24)
at Timeout.run (/var/task/node_modules/ioredis/built/utils/index.js:156:22)
at listOnTimeout (internal/timers.js:556:17)
at processTimers (internal/timers.js:497:7) {
lastNodeError: Error: timeout
at Object.timeout (/var/task/node_modules/ioredis/built/utils/index.js:159:38)
at Cluster.getInfoFromNode (/var/task/node_modules/ioredis/built/cluster/index.js:668:55)
at tryNode (/var/task/node_modules/ioredis/built/cluster/index.js:402:19)
at Cluster.refreshSlotsCache (/var/task/node_modules/ioredis/built/cluster/index.js:421:9)
at /var/task/node_modules/ioredis/built/cluster/index.js:192:22
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
at runNextTicks (internal/process/task_queues.js:66:3)
at listOnTimeout (internal/timers.js:523:9)
at processTimers (internal/timers.js:497:7)
}

I'm facing the same issue: a new Lambda version release started throwing a massive number of these errors.

ioredis version 4.27.6
AWS Lambda nodejs12.x
Redis 6.0.5

The AWS Redis cluster metrics look good and healthy, which has been confirmed by AWS tech support too.
I haven't been able to root-cause the issue.

Redis Cluster initialization:

this.client = new redis.Cluster([redisConfig], {
	dnsLookup: (address, callback) => callback(null, address),
	slotsRefreshTimeout: 5000,
	slotsRefreshInterval: 1 * 60 * 1000,
});

@vaughandroid

@bkvaiude I looked into this a bit. The "ClusterAllFailedError: Failed to refresh slots cache." error is thrown before ioredis attempts to reconnect to the cluster. In other words, it's a recoverable error. On my team we're tracking a metric for these errors but not treating them as something that needs to be addressed.

Ideally I would like to get a clear signal when it can't manage to reconnect. I thought to do that using clusterRetryStrategy and failing after n retries, but it appears there are [some issues with that right now](https://github.com/redis/ioredis/issues/1062). For now, we're OK living with it as it is.
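For what it's worth, a hedged sketch of that "fail after n retries" idea (MAX_RETRIES is illustrative; returning a non-number from clusterRetryStrategy tells ioredis to stop reconnecting, and the 'end' event should fire once it gives up, though the linked issue suggests this may not behave reliably on every 4.x release):

const Redis = require('ioredis');

const MAX_RETRIES = 10; // illustrative threshold, not a recommendation

const cluster = new Redis.Cluster([{ host: '127.0.0.1', port: 6379 }], {
  clusterRetryStrategy(times) {
    if (times > MAX_RETRIES) {
      return null; // a non-number stops reconnecting, so the failure surfaces clearly
    }
    return Math.min(100 * times, 2000); // back off up to 2s between attempts
  },
});

cluster.on('end', () => {
  console.error('Gave up reconnecting to the Redis cluster');
});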

@trademark18 Again, I think the "Failed to refresh slots cache" errors can be treated as warnings. It looks like something is closing the connection (you can see it originates from Cluster.closeListener) and that's what is terminating your lambda.

@bkvaiude

bkvaiude commented Jun 22, 2021

Thanks, @vaughandroid for sharing your experience, it is really insightful and helpful.

The problem I faced was pretty much my own mistake: because of a wrong tls configuration, we were running into a connection issue with AWS Redis.

Our setup: AWS ElastiCache, cluster mode, without TLS and AUTH.

The problematic part for the developer:

ioredis doesn't provide the right information in the generated error.

When we tried to replicate the issue locally with a trial-and-error approach, we received the same error in the following cases:

  • Passing a wrong connection URL
  • Passing invalid credentials (username and password)
  • Passing a wrong tls configuration

As you can see, it is difficult to tell these cases apart from the error details and act on them.
If the right error details were provided, the developer would easily get the context of what went wrong and could take the necessary steps to fix the issue.
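A hedged sketch of the distinction (the endpoint is a placeholder): for an ElastiCache cluster created without in-transit encryption, the tls option must be left out entirely; including it makes every node connection fail during the handshake, and the only visible symptom is the generic "Failed to refresh slots cache" error.

const Redis = require('ioredis');

const cluster = new Redis.Cluster(
  [{ host: 'my-cluster.xxxxxx.clustercfg.usw2.cache.amazonaws.com', port: 6379 }],
  {
    dnsLookup: (address, callback) => callback(null, address),
    // redisOptions: { tls: {} },  // only for clusters WITH in-transit encryption enabled
  }
);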

In our case, we doubted Redis itself and started looking into the performance metrics.

@trademark18 I hope ioredis can take this feedback into account and improve the error handling documentation.

Also, handling these errors in the client's 'error' callback means they no longer impact the AWS Lambda function.

Thank you very much!
