Introduce nodeFilter Predicate to filter Partitions #1942

Closed
jhmartin opened this issue Dec 22, 2021 · 11 comments
Labels
type: enhancement A general enhancement
Milestone
6.1.6

Comments

@jhmartin

jhmartin commented Dec 22, 2021

Bug Report

Current Behavior

When Lettuce is connected to a 2-node Redis Cluster (1 shard, 1 replica), is configured for REPLICA_PREFERRED, and the replica ceases responding to TCP (such as an ungraceful hardware failure), Lettuce does not recover until the TCP retry counter expires for that connection (~926 seconds).

Input Code

(Assumes the hostname `redis` resolves to all nodes in the group.)
import io.lettuce.core.cluster.*;
import io.lettuce.core.cluster.api.sync.*;
import io.lettuce.core.cluster.api.*;
import io.lettuce.core.*;
import java.util.concurrent.TimeUnit;
import java.time.Duration;
 
public class RedisExample {
    public static void main(String[] args) {
        RedisURI redisUri = RedisURI.Builder.redis("redis").build();
 
        RedisClusterClient clusterClient = RedisClusterClient.create(redisUri);
 
        ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions
            .builder()
            .enablePeriodicRefresh(60, TimeUnit.SECONDS)
            .enableAllAdaptiveRefreshTriggers()
            .dynamicRefreshSources(true)
            .closeStaleConnections(true)
            .build();

        TimeoutOptions timeoutOptions = TimeoutOptions
            .builder()
            .timeoutCommands()
            .fixedTimeout(Duration.ofMillis(400))
            .build();

        SocketOptions socketOptions = SocketOptions
            .builder()
            .connectTimeout(500, TimeUnit.MILLISECONDS)
            .build();
 
        clusterClient.setOptions(ClusterClientOptions
            .builder()
            .autoReconnect(true)
            .socketOptions(socketOptions)
            .cancelCommandsOnReconnectFailure(true)
            .timeoutOptions(timeoutOptions)
            .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
            .topologyRefreshOptions(topologyRefreshOptions)
            .validateClusterNodeMembership(true)
            .suspendReconnectOnProtocolFailure(true)
            .build());
 
        StatefulRedisClusterConnection<String, String> connection = clusterClient.connect();
        RedisAdvancedClusterCommands<String, String> syncCommands = connection.sync();
        connection.setReadFrom(ReadFrom.REPLICA_PREFERRED);
 
        String value2 = "bar";
        syncCommands.set("foo", value2);

        while (true) {
            try {
                Thread.sleep(1000);
                String value = syncCommands.get("foo");
                if (!value2.equals(value)) {
                    System.out.println("bad response:" + value + ":" + value2 + ":");
                } else {
                    System.out.println("Good response: " + value);
                }
            } catch (Exception e) {
                System.out.println("Error response: " + e);
            }
        }
    }
}

Expected behavior/code

When the timeout is reached and a dynamic topology refresh is triggered, connections to the node in "fail?" state should be considered stale and closed / abandoned.

Environment

  • Lettuce version(s): 6.1.5.RELEASE
  • Redis server v=6.2.6 sha=00000000:0 malloc=jemalloc-5.1.0 bits=64 build=3f28004270edf9dc
  • OpenJDK Runtime Environment (build 1.8.0_312-b07)
  • Amazon Linux 2, 5.10.75-79.358.amzn2.x86_64, t3a.small

Possible Solution

A workaround, undesirable because it is global to the node rather than specific to the client, is to shorten the TCP retry counter:

echo 5 >/proc/sys/net/ipv4/tcp_retries2

At first glance it looks like adding a filter for failed/eventual_fail nodes at https://github.com/lettuce-io/lettuce-core/blob/cda3be6b9477da790365ad098c6e39c8687f5002/src/main/java/io/lettuce/core/cluster/topology/DefaultClusterTopologyRefresh.java#L292-L296 would cap the duration of the failure scenario at the periodic-topology-refresh interval.

Additional context

tcpdump clearly shows the client receiving a topology refresh, but the client does not recover until the existing TCP connection is torn down.

17:58:54.396570 IP 10.0.0.41.6379 > 10.0.0.218.46894: Flags [P.], seq 150:428, ack 117, win 490, options [nop,nop,TS val 3343612451 ecr 121987982], length 278: RESP "=270" "txt:e03b0b3b56ca33dc759fb6a122a903c7ac47d8f7 10.0.0.41:6379@16379 myself,master - 0 0 0 connected 0-16383" "215c649d39c0182c82aec8fc7e533cd57c052b9a 10.0.0.101:6379@16379 slave,fail? e03b0b3b56ca33dc759fb6a122a903c7ac47d8f7 1640109483742 1640109482738 0 connected"

Failure of the replica node was simulated by dropping all Redis packets on the replica:

$ for x in INPUT OUTPUT; do for y in 6379 16379; do iptables -I $x -p tcp --dport $y -j DROP; done; done

Redis.conf contains:

port 6379
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
appendonly yes
maxmemory 1gb

TCP keepalives do not help here, as the connection is not idle.

@mp911de
Collaborator

mp911de commented Jan 4, 2022

You can override RedisClusterClient.determinePartitions(…) to post-process which nodes are available in the topology view. The client obtains topology information from all cluster nodes and uses the topology view that makes the most sense. Since each Redis server can have a different perspective on the other nodes, there is no single source of truth; instead, a consensus is required to select the most appropriate view. And the view of the Redis servers may differ from the client's perspective.
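For illustration, a minimal sketch of such a subclass, assuming the protected determinePartitions(Partitions, Map<RedisURI, Partitions>) hook and the protected RedisClusterClient(ClientResources, Iterable<RedisURI>) constructor (verify both signatures against your Lettuce version):

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.models.partitions.Partitions;
import io.lettuce.core.cluster.models.partitions.RedisClusterNode;
import io.lettuce.core.resource.ClientResources;
import java.util.Map;

// Sketch only: post-process the consensus topology view and drop nodes that the
// cluster reports as FAIL or "fail?" (EVENTUAL_FAIL).
class FilteringClusterClient extends RedisClusterClient {

    FilteringClusterClient(ClientResources resources, Iterable<RedisURI> uris) {
        super(resources, uris);
    }

    @Override
    protected Partitions determinePartitions(Partitions current, Map<RedisURI, Partitions> topologyViews) {
        Partitions consensus = super.determinePartitions(current, topologyViews);
        Partitions filtered = new Partitions();
        for (RedisClusterNode node : consensus) {
            if (!node.is(RedisClusterNode.NodeFlag.FAIL) && !node.is(RedisClusterNode.NodeFlag.EVENTUAL_FAIL)) {
                filtered.addPartition(node);
            }
        }
        filtered.updateCache();
        return filtered;
    }
}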

I'm not sure whether TCP keep-alive is helping in this case.

@jhmartin
Author

jhmartin commented Jan 4, 2022

@mp911de Yeah I don't think TCP keep-alive helps either, but I wanted to call out that I had looked at it as a possibility.

@jhmartin
Author

jhmartin commented Jan 4, 2022

I tested reworking https://github.com/lettuce-io/lettuce-core/blob/4110f2820766c4967951639aa2b6bdd9d50466be/src/main/java/io/lettuce/core/cluster/RedisClusterClient.java#L1025-L1032 to filter out FAIL and EVENTUAL_FAIL nodes and achieved the same recovery after the periodic-topology-refresh interval.

@mp911de
Collaborator

mp911de commented Jan 5, 2022

I wonder whether it generally makes sense to introduce a node filter (Predicate<RedisClusterNode>) that is used to filter Partitions. It's easier to provide a config value than to subclass the client.

@jhmartin
Author

jhmartin commented Jan 5, 2022

I like the sound of that over a subclass; the behavior will look more baked-in. I did try calling removeIf against the Partitions collection with a predicate and it threw an unimplemented-operation error, so I guess that'd have to be added as well.

@mp911de mp911de added type: enhancement A general enhancement and removed status: waiting-for-triage labels Jan 7, 2022
@mp911de mp911de changed the title Cluster failover stalls until system TCP retry exhausted Introduce nodeFilter Predicate to filter Partitions Jan 7, 2022
@mp911de mp911de added this to the 6.1.6 milestone Jan 7, 2022
@mp911de mp911de closed this as completed Jan 7, 2022
@mp911de
Collaborator

mp911de commented Jan 7, 2022

That's in place now.

@jhmartin
Author

jhmartin commented Jan 7, 2022

Adding
.nodeFilter(it -> !(it.is(RedisClusterNode.NodeFlag.FAIL) || it.is(RedisClusterNode.NodeFlag.EVENTUAL_FAIL)))
gave me the desired behavior.
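For anyone landing here later, a minimal sketch of that configuration, reusing clusterClient and topologyRefreshOptions from the original report (assumes Lettuce 6.1.6 or newer, where ClusterClientOptions.Builder#nodeFilter is available):

// Sketch: drop nodes flagged FAIL or "fail?" (EVENTUAL_FAIL) from the topology view.
// Needs io.lettuce.core.cluster.ClusterClientOptions and
// io.lettuce.core.cluster.models.partitions.RedisClusterNode imported.
clusterClient.setOptions(ClusterClientOptions.builder()
        .nodeFilter(it -> !(it.is(RedisClusterNode.NodeFlag.FAIL)
                || it.is(RedisClusterNode.NodeFlag.EVENTUAL_FAIL)))
        .topologyRefreshOptions(topologyRefreshOptions)
        .build());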

@srgsanky

Is this the default behavior in the latest Lettuce without explicitly specifying the node filter?

If the answer is no, what is the reason for not making this the default behavior?

@mp911de
Collaborator

mp911de commented Jul 12, 2023

You can specify a Predicate now. This new functionality allows additional filtering so that we do not break existing applications.

@srgsanky

Thanks for the explanation!

Can this be made the default in the next major version? This seems like a good default for most use cases. Folks run into this problem frequently enough that there is even a guide to setting the predicate in AWS's documentation: https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/BestPractices.Clients-lettuce.html#:~:text=Set%20nodeFilter,retrying%20is%20exhausted.

@mp911de
Collaborator

mp911de commented Jul 13, 2023

Thanks @srgsanky for the background. Can you file a new ticket to flip the default?
