How to avoid CLUSTERDOWN and UNBLOCKED? #28

Closed
thelinuxlich opened this issue May 6, 2015 · 30 comments

Comments

@thelinuxlich

Revisiting this problem, I asked about it on the Redis Google group, and here is what @antirez said:

Hello,

CLUSTERDOWN is a transient error that happens when at least one master
node is down. If you want the partial portion of the cluster which is
still up to run regardless of a set of hash slots not covered, there
is an option inside the example "redis.conf" file, doing exactly this.

UNBLOCKED is unavoidable since it is delivered to clients that are
blocked on lists that are moved to a different master because of
resharding. We don't want them to wait forever for something that will
never happen, since those lists are now on a different master.

So CLUSTERDOWN should be handled by the application and/or at the client
level directly by retrying. UNBLOCKED should be handled by rescanning the
config with CLUSTER SLOTS and connecting to the right node.

Cheers,
Salvatore

Configuration-wise, I've set "cluster-require-full-coverage" to "no" on my cluster, so I just need to know how to handle this on the application side.
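
For the UNBLOCKED case, "rescanning the config" roughly means reading CLUSTER SLOTS again and rebuilding the slot-to-node map. A minimal sketch, assuming the reply has already been decoded into nested arrays of the form [start, end, [host, port, ...], ...replicas]. buildSlotMap and the sample addresses are hypothetical, not part of any client library:

```javascript
// Hypothetical helper: turn a decoded CLUSTER SLOTS reply into a lookup table.
// Each entry is [startSlot, endSlot, [masterHost, masterPort, ...], ...replicas].
function buildSlotMap(clusterSlotsReply) {
  const slotToMaster = new Array(16384).fill(null);
  for (const [start, end, master] of clusterSlotsReply) {
    const address = `${master[0]}:${master[1]}`;
    for (let slot = start; slot <= end; slot++) {
      slotToMaster[slot] = address;
    }
  }
  return slotToMaster;
}

// Sample reply with made-up addresses; slots 10923-16383 are left uncovered on purpose.
const sampleReply = [
  [0, 5460, ['192.168.0.1', 7000], ['192.168.0.2', 7003]],
  [5461, 10922, ['192.168.0.1', 7001], ['192.168.0.2', 7004]],
];
const slotMap = buildSlotMap(sampleReply);
console.log(slotMap[1684]);  // 192.168.0.1:7000
console.log(slotMap[12000]); // null, because no node covers this slot
```

A blocked client that receives UNBLOCKED could rebuild this map and reissue its blocking command against the node that now owns the key's slot.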

@luin
Collaborator

luin commented May 7, 2015

It seems that we can just resend the command when a CLUSTERDOWN error is received. Thank you for the information.

If you want to handle CLUSTERDOWN on the application side, you have to catch all errors returned from Redis and resend the command if the error is CLUSTERDOWN.
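
Catching and resending on the application side could look something like the sketch below. retryOnClusterDown and isClusterDown are hypothetical helpers (not ioredis APIs), and the retry count and delay are arbitrary values chosen for illustration:

```javascript
// Hypothetical helper: true when the error message starts with CLUSTERDOWN.
function isClusterDown(err) {
  return Boolean(err && typeof err.message === 'string' &&
    err.message.startsWith('CLUSTERDOWN'));
}

// Run `send` (any function returning a promise) and retry it only when the
// rejection is a CLUSTERDOWN error. Other errors are rethrown immediately.
async function retryOnClusterDown(send, { retries = 5, delayMs = 100 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await send();
    } catch (err) {
      // Give up on non-CLUSTERDOWN errors or when the retry budget is spent.
      if (!isClusterDown(err) || attempt >= retries) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

With a connected client this would be called as retryOnClusterDown(() => redis.get('foo')), where redis is your own client instance.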

@thelinuxlich
Author

I'm also getting a lot of these:

ReplyError: EXECABORT Transaction discarded because of previous errors.

@luin
Collaborator

luin commented May 7, 2015

You can inspect the earlier errors through error.previousErrors to see what they are. For instance:

redis.multi().set('foo').get('foo').exec().catch(function (err) {
  console.log(err.previousErrors);
});

@thelinuxlich
Author

And by the way, if the library has the autoResendUnfulfilledCommands option enabled by default, shouldn't it resend automatically after recovering from CLUSTERDOWN?

@luin
Collaborator

luin commented May 7, 2015

autoResendUnfulfilledCommands is used to resend unfulfilled commands after a reconnection. Since CLUSTERDOWN is caused by a master being offline, we may use the clusterRetryStrategy option to retry the node.
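
A retry strategy along those lines might look like this. It is only a sketch, assuming clusterRetryStrategy is called with the attempt count and returns a delay in milliseconds, and that returning a non-number stops retrying. The limits are illustrative values, and the wiring is left commented out because it needs ioredis and a running cluster:

```javascript
// Sketch of a retry strategy: back off linearly, cap the delay, and give up
// after a fixed number of attempts. The limits are illustrative values.
function clusterRetryStrategy(times) {
  const maxAttempts = 20;
  if (times > maxAttempts) {
    return null; // assumption: a non-number tells the client to stop retrying
  }
  return Math.min(100 + times * 50, 2000);
}

// Assumed wiring (not executed here):
// const Redis = require('ioredis');
// const cluster = new Redis.Cluster(
//   [{ host: '192.168.0.1', port: 7000 }],
//   { clusterRetryStrategy }
// );

console.log(clusterRetryStrategy(1));   // 150
console.log(clusterRetryStrategy(100)); // null
```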

@thelinuxlich
Author

Now I have 3 masters and 9 slaves. Let's keep this open for a few days so I can see if those errors persist. I've also configured the cluster with "cluster-slave-validity-factor 0" and "cluster-migration-barrier 1".

@thelinuxlich
Author

good reference: http://redis.io/presentation/Redis_Cluster.pdf

@thelinuxlich
Author

Even after adding slaves and changing the config to be more available, I'm occasionally receiving EXECABORT with these previousErrors:

previousErrors:
 [ { [ReplyError: MOVED 1684 192.168.0.1:7000]
     name: 'ReplyError',
     message: 'MOVED 1684 192.168.0.1:7000',
     command: [Object] },
   { [ReplyError: MOVED 1684 192.168.0.1:7000]
     name: 'ReplyError',
     message: 'MOVED 1684 192.168.0.1:7000',
     command: [Object] } ] }
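
To see at a glance which errors are behind an EXECABORT like the one above, something like this can tally the previousErrors by error code. countErrorCodes is a hypothetical helper, and the error object is a hand-built stand-in that only assumes a previousErrors array of objects with a message field:

```javascript
// Hypothetical helper: count the error codes found in err.previousErrors.
// The code is taken to be the first word of each message (MOVED, ASK, ...).
function countErrorCodes(err) {
  const counts = {};
  for (const previous of err.previousErrors || []) {
    const code = String(previous.message).split(' ')[0];
    counts[code] = (counts[code] || 0) + 1;
  }
  return counts;
}

// Stand-in for the EXECABORT error shown above.
const execAbort = new Error('EXECABORT Transaction discarded because of previous errors.');
execAbort.previousErrors = [
  { name: 'ReplyError', message: 'MOVED 1684 192.168.0.1:7000' },
  { name: 'ReplyError', message: 'MOVED 1684 192.168.0.1:7000' },
];
console.log(countErrorCodes(execAbort)); // { MOVED: 2 }
```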

@AVVS
Collaborator

AVVS commented May 7, 2015

@thelinuxlich I believe you are trying to perform multi-key operations on the cluster, which is why you are getting these errors.

If you need to do that, make sure the keys share the same hash tag, i.e. {somehash}your-key-name1, {somehash}your-key-name2, etc.
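
To check that two keys really land in the same slot, the slot can be computed locally. This is a sketch of the Redis Cluster rule (CRC16 of the key modulo 16384, hashing only the hash tag when one is present). crc16 and keySlot are written here for illustration and are not taken from ioredis:

```javascript
// CRC16 (XMODEM variant: polynomial 0x1021, initial value 0), as used by Redis Cluster.
function crc16(bytes) {
  let crc = 0;
  for (const byte of bytes) {
    crc ^= byte << 8;
    for (let bit = 0; bit < 8; bit++) {
      crc = (crc & 0x8000) ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
    }
  }
  return crc;
}

// Hash only the text between the first "{" and the next "}" when it is non-empty;
// otherwise hash the whole key.
function keySlot(key) {
  const open = key.indexOf('{');
  if (open !== -1) {
    const close = key.indexOf('}', open + 1);
    if (close > open + 1) {
      key = key.slice(open + 1, close);
    }
  }
  return crc16(Buffer.from(key)) % 16384;
}

console.log(keySlot('{somehash}your-key-name1') === keySlot('{somehash}your-key-name2')); // true
console.log(keySlot('{somehash}your-key-name1') === keySlot('somehash')); // true
```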

@thelinuxlich
Author

You mean, using multi(), right?

@thelinuxlich
Author

My code has one transaction:

redis.multi().setnx(new_key, possible_new_session_id).expire(new_key, 1800).exec()

@luin
Collaborator

luin commented May 7, 2015

Did a resharding or failover happen after the cluster was initialized?

@thelinuxlich
Author

No, the multi() is probably not going to the right node.

@AVVS
Collaborator

AVVS commented May 7, 2015

In that case multi() is fine (same key -> new_key), but the error says that the hash slot new_key resolves to is now on another machine in the cluster. These errors should really be handled by the library. What the error is saying is: hey, your hash slot cache is wrong, it needs to be updated and the operation needs to be retried.

@luin, please take a look at this, as I believe it needs to be improved, i.e. the MOVED reply must be handled.
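
Roughly, handling a redirection means parsing the reply, fixing the cached slot owner on MOVED, and then retrying against the new node. A minimal sketch of the first two steps follows. parseRedirection, applyRedirection, and the Map-based cache are hypothetical and say nothing about how ioredis does it internally:

```javascript
// Hypothetical helper: parse "MOVED <slot> <host>:<port>" or "ASK <slot> <host>:<port>".
function parseRedirection(message) {
  const match = /^(MOVED|ASK) (\d+) (.+):(\d+)$/.exec(message);
  if (!match) return null;
  return { type: match[1], slot: Number(match[2]), host: match[3], port: Number(match[4]) };
}

// On MOVED, the slot has a new permanent owner, so update the cached mapping.
// ASK is a one-off redirect during migration and must not change the cache.
function applyRedirection(slotCache, message) {
  const redirect = parseRedirection(message);
  if (redirect && redirect.type === 'MOVED') {
    slotCache.set(redirect.slot, `${redirect.host}:${redirect.port}`);
  }
  return redirect;
}

const slotCache = new Map([[1684, '192.168.0.1:7001']]); // stale entry (made up)
applyRedirection(slotCache, 'MOVED 1684 192.168.0.1:7000');
console.log(slotCache.get(1684)); // 192.168.0.1:7000
```

The caller would then resend the command (or the whole transaction) to the host and port returned by applyRedirection.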

@thelinuxlich
Author

By the way, I didn't have this problem with a similar library: https://github.com/thunks/thunk-redis

Their API forces you to set the key explicitly on the multi() and exec() methods

@luin
Collaborator

luin commented May 7, 2015

Yes, you are right. ioredis should be able to handle these MOVED errors. I'll try to fix these errors tonight.

@AVVS
Collaborator

AVVS commented May 7, 2015

@thelinuxlich thunk-redis takes the hash slot from the first key and applies it to all the operations in the multi, but it doesn't make use of pipelining. Here, the first key in the pipeline is taken and its hash slot is applied to the whole pipeline.

Besides that, it's pretty much the same, except that ioredis seems cleaner (and has a bug: it doesn't handle MOVED responses 💨)

@luin
Collaborator

luin commented May 7, 2015

@thelinuxlich As stated in the README, ioredis uses the first key in the pipeline queue to calculate the slot. So the problem here isn't that ioredis uses the wrong key; it's that ioredis doesn't handle MOVED errors properly in a transaction.

@thelinuxlich
Author

ok, maybe refreshAfterFails = 1 can solve the issue?

@luin
Collaborator

luin commented May 7, 2015

@thelinuxlich No, it doesn't help, since it's a bug in ioredis. I'll fix it soon :-)

@thelinuxlich
Author

Don't know if it's luck, but since I set refreshAfterFails to 1, no errors have occurred.

@luin
Collaborator

luin commented May 7, 2015

@thelinuxlich That's interesting. However I'm implementing a more stable transaction strategy in cluster mode.

luin added a commit that referenced this issue May 8, 2015

@thelinuxlich
Author

let's test it ;)

@thelinuxlich
Author

still getting MOVED errors :(

@luin
Collaborator

luin commented May 9, 2015

@thelinuxlich Yes, this commit only fixed the CLUSTERDOWN errors; I'm still working on handling MOVED errors :-)

@luin
Collaborator

luin commented May 14, 2015

There is a lot of work to do to implement stable transactions in cluster mode. However, the job is getting done :-)
Pull request #33 should handle MOVED, ASK and CLUSTERDOWN errors properly. I'm writing more tests for it and will ship it in the next version. You're welcome to run some tests if you have time.

@thelinuxlich
Author

Great, I always test every new release :)

@luin
Collaborator

luin commented May 15, 2015

Released in 1.3.0

@thelinuxlich
Author

No errors so far...

@thelinuxlich
Author

Seems fixed!
