Question/Request: Alternate cluster nodes on Reconnect Strategy #392

Closed
HaloFour opened this Issue Oct 26, 2016 · 11 comments

HaloFour commented Oct 26, 2016

I am using the RedisClusterClient to work with a Redis cluster. Periodically that cluster is updated/maintained: a set of new instances/nodes is brought online and all of the old instances/nodes are terminated. When this happens the RedisClusterClient, as expected, loses connectivity with the cluster.

My question is: what is a good strategy for handling this kind of scenario? I don't see a way to provide a set of new RedisURIs during a reconnect attempt, the way Node.js's ioredis allows. Would it be up to me to detect the disconnected connections, rediscover the new cluster nodes, and create a new RedisClusterClient? What's the best method for detecting the dropped connection(s)?

mp911de (Member) commented Oct 26, 2016

Thanks for your question. In general, lettuce takes care of this out of the box as soon as you enable topology refresh. Topology refresh can run periodically (e.g. refresh every hour), adaptively (refresh when running into a MOVED redirection), or both.

You might also want to look into static vs. dynamic refresh sources (similar to what you mention for ioredis):

  1. You either have a stable set of known Redis nodes (static refresh sources). The topology is discovered by asking them for CLUSTER NODES, and lettuce discovers all your nodes.
  2. Or, if you can't have a stable, always-on set of nodes, you might want to look into dynamic topology refresh sources. Lettuce discovers your nodes from an initial set and then works like a crawler. If some nodes are removed and others remain active, that's not an issue at runtime because lettuce crawls all the nodes. There's one caveat: lettuce might get stuck with nodes that were removed from the cluster (orphaning). #355 solves that issue by sticking with the majority of previously known nodes.

See also: https://github.com/mp911de/lettuce/wiki/Client-options#dynamic-topology-refresh-sources
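
For illustration, here is a minimal configuration sketch, assuming lettuce 4.x and the option names documented on the wiki page linked above; the URI, refresh period, and triggers are placeholders to adjust to your environment:

    import java.util.concurrent.TimeUnit;

    import com.lambdaworks.redis.RedisURI;
    import com.lambdaworks.redis.cluster.ClusterClientOptions;
    import com.lambdaworks.redis.cluster.ClusterTopologyRefreshOptions;
    import com.lambdaworks.redis.cluster.ClusterTopologyRefreshOptions.RefreshTrigger;
    import com.lambdaworks.redis.cluster.RedisClusterClient;

    public class TopologyRefreshSetup {

        public static void main(String[] args) {

            RedisClusterClient clusterClient = RedisClusterClient.create(
                    RedisURI.create("redis://node1.example.com:6379"));

            // Periodic refresh re-reads the topology on a fixed interval; the adaptive
            // trigger refreshes it when a MOVED redirection is encountered.
            ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                    .enablePeriodicRefresh(true)
                    .refreshPeriod(1, TimeUnit.HOURS)
                    .enableAdaptiveRefreshTrigger(RefreshTrigger.MOVED_REDIRECT)
                    // true = dynamic sources: crawl all currently known nodes;
                    // false = static sources: query only the initially supplied RedisURIs.
                    .dynamicRefreshSources(true)
                    .build();

            clusterClient.setOptions(ClusterClientOptions.builder()
                    .topologyRefreshOptions(refreshOptions)
                    .build());
        }
    }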

HaloFour commented Oct 26, 2016

@mp911de

Thanks for the response. I'm definitely looking at topology updates. My concern is that the switch-over occurs quite rapidly during a deploy. As soon as the new nodes are online and healthy, all of the original nodes are disconnected from the cluster and terminated. I imagine that if a topology update doesn't occur within that window, the RedisClusterClient would not be able to discover any of the new nodes. Would there be any way to compensate for that, short of creating a new RedisClusterClient?

I have been poking at the protected members of RedisClusterClient and it does seem that I can override the list of RedisURIs used for a topology update. Do you know what would happen if I were to return a completely different list of URIs than the initial URIs?

mp911de (Member) commented Nov 6, 2016

If your cluster nodes disappear faster than you're able to deploy an application, then lettuce can't fix that issue. In that case you should aim for a more stable environment.

HaloFour commented Nov 6, 2016

The Redis cluster and the applications are deployed/maintained entirely separately from each other, although both follow the same principle: the old instances are terminated immediately after the new instances are launched and pass a health check (in the Redis cluster's case, that health check includes having successfully joined the cluster). This seems to be a pretty standard way to handle a rolling update. I haven't seen any literature that requires keeping nodes around beyond the maintenance window so that clients can eventually discover the new nodes, but if you know of any documentation of that nature I'd love to show it to the architect and devops teams who designed, and require, the deploy tasks we execute.

Either way, I'm less concerned with what Lettuce can't do than with what Lettuce can do. I've had a bit of success using the method described above: subclassing RedisClusterClient and overriding getTopologyRefreshSource to return an arbitrary set of RedisURIs.

mp911de (Member) commented Dec 6, 2016

Subclassing RedisClusterClient and overriding getTopologyRefreshSource() is the preferred approach. Alternatively, you would have to reconnect the cluster using a different RedisClusterClient instance, which might be more complicated.

I will add some JavaDoc to these methods so it becomes clearer why they are protected.
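
To make that concrete, here is a rough sketch of such a subclass, assuming lettuce 4.x, its protected RedisClusterClient(ClientResources, Iterable<RedisURI>) constructor, and a getTopologyRefreshSource() method returning Iterable<RedisURI>; the class name and the setSeedNodes helper are invented for this example:

    import java.util.ArrayList;
    import java.util.List;

    import com.lambdaworks.redis.RedisURI;
    import com.lambdaworks.redis.cluster.RedisClusterClient;
    import com.lambdaworks.redis.resource.ClientResources;

    // Cluster client whose topology refresh seed nodes can be replaced at runtime,
    // e.g. after the whole cluster has been swapped out during a deployment.
    public class SwappableSeedClusterClient extends RedisClusterClient {

        private volatile List<RedisURI> seedNodes;

        public SwappableSeedClusterClient(ClientResources clientResources, Iterable<RedisURI> redisURIs) {
            super(clientResources, redisURIs);
            this.seedNodes = copyOf(redisURIs);
        }

        // Replace the seed nodes that the next topology refresh will query.
        public void setSeedNodes(Iterable<RedisURI> redisURIs) {
            this.seedNodes = copyOf(redisURIs);
        }

        @Override
        protected Iterable<RedisURI> getTopologyRefreshSource() {
            return seedNodes;
        }

        private static List<RedisURI> copyOf(Iterable<RedisURI> uris) {
            List<RedisURI> copy = new ArrayList<>();
            uris.forEach(copy::add);
            return copy;
        }
    }

Constructing it requires passing ClientResources explicitly (e.g. DefaultClientResources.create()), since the static create(...) factories return the base class.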

mp911de (Member) commented Dec 6, 2016

Documentation added. May I close this ticket or is there anything else I can assist you with?

mp911de added this to the Lettuce 4.4.0 milestone Dec 6, 2016

HaloFour commented Dec 6, 2016

Switching out the nodes seems to work all right. I've had some issues finding the appropriate place to trigger the topology update that would pick up the swapped nodes. I was wondering if you might have some advice.

What I'm currently doing is attaching a RedisConnectionStateListener, and in onRedisDisconnected I check the type of the instance; if it is a StatefulRedisClusterConnection I start a timer which will call reloadPartitions. This feels kind of hacky, but if I call reloadPartitions directly in the callback it blocks the NIO worker threads, which eventually causes the client to lock up.

Here's my implementation so far:

https://gist.github.com/HaloFour/2a0e18246b1ab6bb82b199f1d6eb4aab

This seems to work somewhat reliably during testing. However I do see a small flurry of exceptions that are logged during the switchover:

13:24:05.537 [lettuce-nioEventLoop-4-7] WARN  i.n.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.NullPointerException: null
	at com.lambdaworks.redis.protocol.ConnectionWatchdog.userEventTriggered(ConnectionWatchdog.java:106) ~[lettuce-4.2.2.Final.jar:na]
	at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:334) [netty-transport-4.0.40.Final.jar:4.0.40.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:320) [netty-transport-4.0.40.Final.jar:4.0.40.Final]
        ...
13:24:05.537 [lettuce-cluster-rediscovery] WARN  io.netty.channel.AbstractChannel - Force-closing a channel whose registration task was not accepted by an event loop: [id: 0x3a9e8bed]
java.util.concurrent.RejectedExecutionException: event executor terminated
	at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:800) ~[netty-common-4.0.40.Final.jar:4.0.40.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:345) ~[netty-common-4.0.40.Final.jar:4.0.40.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:338) ~[netty-common-4.0.40.Final.jar:4.0.40.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:743) ~[netty-common-4.0.40.Final.jar:4.0.40.Final]
	at io.netty.channel.AbstractChannel$AbstractUnsafe.register(AbstractChannel.java:422) ~[netty-transport-4.0.40.Final.jar:4.0.40.Final]
        ...
13:24:05.538 [lettuce-cluster-rediscovery] ERROR i.n.u.c.D.rejectedExecution - Failed to submit a listener notification task. Event loop shut down?
java.util.concurrent.RejectedExecutionException: event executor terminated
	at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:800) ~[netty-common-4.0.40.Final.jar:4.0.40.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:345) ~[netty-common-4.0.40.Final.jar:4.0.40.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:338) ~[netty-common-4.0.40.Final.jar:4.0.40.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:743) ~[netty-common-4.0.40.Final.jar:4.0.40.Final]
	at io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:767) [netty-common-4.0.40.Final.jar:4.0.40.Final]
        ...

They ultimately seem harmless.
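
For context, here is a rough, simplified sketch of that listener approach, assuming the lettuce 4.x RedisConnectionStateListener callbacks (onRedisConnected, onRedisDisconnected, onRedisExceptionCaught); the class name and the 5-second debounce delay are arbitrary, and the gist linked above does more (for example filtering by connection type):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;

    import com.lambdaworks.redis.RedisChannelHandler;
    import com.lambdaworks.redis.RedisConnectionStateListener;
    import com.lambdaworks.redis.cluster.RedisClusterClient;

    // Schedules a debounced topology refresh off the netty I/O threads whenever
    // a connection drops, instead of calling reloadPartitions() in the callback.
    public class RefreshOnDisconnectListener implements RedisConnectionStateListener {

        private final RedisClusterClient clusterClient;
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        private volatile ScheduledFuture<?> pendingRefresh;

        public RefreshOnDisconnectListener(RedisClusterClient clusterClient) {
            this.clusterClient = clusterClient;
        }

        @Override
        public void onRedisConnected(RedisChannelHandler<?, ?> connection) {
            // connects could be logged here
        }

        @Override
        public void onRedisDisconnected(RedisChannelHandler<?, ?> connection) {
            // reloadPartitions() blocks on I/O, so never call it directly on this
            // callback thread; debounce it and run it on a separate scheduler instead.
            ScheduledFuture<?> pending = pendingRefresh;
            if (pending != null) {
                pending.cancel(false);
            }
            pendingRefresh = scheduler.schedule(clusterClient::reloadPartitions, 5, TimeUnit.SECONDS);
        }

        @Override
        public void onRedisExceptionCaught(RedisChannelHandler<?, ?> connection, Throwable cause) {
            // no-op
        }
    }

It would be registered via clusterClient.addListener(new RefreshOnDisconnectListener(clusterClient)), and the scheduler should be shut down together with the client.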

mp911de (Member) commented Dec 6, 2016

The approach indeed relies on some internals. I took a look at the exceptions, and maybe there's a better approach using existing features:

  • RedisConnectionStateListener is notified on the I/O threads (NioEventLoopGroup). RedisClusterClient.reloadPartitions() is a blocking operation that connects to Redis cluster nodes using the I/O threads, and that would cause a deadlock. Either use ClientResources.eventExecutorGroup() or your own thread pool.
  • NullPointerException at ConnectionWatchdog: that's a known issue/behavior that was fixed in 4.3.0 with #358.
  • The other two RejectedExecutionExceptions: the event loop is terminated, so all components trying to schedule jobs will run into RejectedExecutionException. You might want to guard your code so that no further jobs are submitted once AbstractRedisClient.shutdown is triggered.

You could maybe use adaptive refresh with the PERSISTENT_RECONNECTS trigger. PERSISTENT_RECONNECTS means that multiple reconnect attempts didn't succeed, and RedisClusterClient then triggers a topology refresh. The built-in functionality doesn't require you to implement your own RedisConnectionStateListener that schedules a refresh (see ClusterTopologyRefreshScheduler).

Looking at your code, it would make sense to extend RedisConnectionStateListener.onRedisConnected with a SocketAddress parameter so you're not required to obtain the SocketAddress via reflection. (Hint: the connection structure is going to change with lettuce 5.0.)
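
A minimal sketch of that built-in option, again assuming lettuce 4.x; only the PERSISTENT_RECONNECTS trigger is enabled here, but it can be combined with the other adaptive triggers and with periodic refresh:

    import com.lambdaworks.redis.cluster.ClusterClientOptions;
    import com.lambdaworks.redis.cluster.ClusterTopologyRefreshOptions;
    import com.lambdaworks.redis.cluster.ClusterTopologyRefreshOptions.RefreshTrigger;
    import com.lambdaworks.redis.cluster.RedisClusterClient;

    public class AdaptiveRefreshSetup {

        // After repeated failed reconnect attempts the client triggers a topology
        // refresh on its own, so no custom RedisConnectionStateListener is needed.
        static void enablePersistentReconnectRefresh(RedisClusterClient clusterClient) {

            ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                    .enableAdaptiveRefreshTrigger(RefreshTrigger.PERSISTENT_RECONNECTS)
                    .build();

            clusterClient.setOptions(ClusterClientOptions.builder()
                    .topologyRefreshOptions(refreshOptions)
                    .build());
        }
    }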

HaloFour commented Dec 6, 2016

@mp911de

Thanks for the tip about the persistent reconnects option for adaptive refresh. That looks like it would serve as a perfect trigger for the topology refresh and eliminate much of the additional code, particularly that timer.

The adaptive refresh option for persistent reconnects looks promising. I commented out my refresh timer trigger and enabled that option and my test succeeded. That makes it a lot less complicated, although I'll likely leave the RedisConnectionStateListener for logging purposes.

I'd love to get rid of the reflection used to get the remote socket address. I'd like to know the cluster node IP/port for logging purposes, and I'd much prefer a supported public API for obtaining that information.

mp911de (Member) commented Dec 6, 2016

ReconnectHandler already provides some logging. I created #419 to add another method that includes the SocketAddress.

mp911de (Member) commented Dec 21, 2016

#419 is implemented. Closing this ticket since there's nothing left to do here.

mp911de closed this Dec 21, 2016
