
Question about detecting dead connection #1572

Closed
wangkekekexili opened this issue Jan 7, 2021 · 7 comments
Labels
status: invalid An issue that we don't feel is valid

Comments

@wangkekekexili

I'm not sure whether this is a feature request, an issue on my end, or just a simple question, so please forgive me for not completely following the template.

Current Behavior

The issue we are encountering is that during the scale-up process of AWS Redis, we are seeing io.lettuce.core.RedisCommandTimeoutException errors.

We are using non-cluster-mode Redis and connecting to the reader endpoint. When scaling up AWS Redis, the DNS name remains the same but the IP changes; that's when the client starts to show errors. After some time, ConnectionWatchdog seems to notice the channel is inactive. Lettuce reconnects and picks up the updated IP address.

I think the timeout issue is caused by the client side still holding the existing connection when the peer disappears: it doesn't know the peer is gone and keeps sending requests over the existing connection. I'm wondering what I can do to detect the dead connection. Could ConnectionWatchdog be updated to catch dead connections and try to re-connect?

Input Code

I'm using this simple code for testing the behavior:

import io.lettuce.core.ClientOptions
import io.lettuce.core.RedisClient
import io.lettuce.core.RedisURI
import io.lettuce.core.SocketOptions
import io.lettuce.core.TimeoutOptions
import io.lettuce.core.resource.DefaultClientResources
import io.lettuce.core.resource.DirContextDnsResolver
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import java.time.Duration

fun main(args: Array<String>) = runBlocking<Unit> {
    val ro = "<reader-endpoint-here>" // placeholder for the reader endpoint
    launch(Dispatchers.IO) {
        val clientResources = DefaultClientResources.builder()
            .dnsResolver(DirContextDnsResolver()) // re-resolve DNS instead of relying on the JVM cache
            .build()
        val redisClient = RedisClient.create(clientResources, RedisURI.create(ro)).apply {
            options = ClientOptions
                .builder()
                .socketOptions(
                    SocketOptions
                        .builder()
                        .connectTimeout(Duration.ofMillis(500L))
                        .keepAlive(true)
                        .build()
                )
                .timeoutOptions(
                    TimeoutOptions
                        .builder()
                        .fixedTimeout(Duration.ofMillis(500L))
                        .build()
                )
                .build()
        }
        val statefulRedisConnection = redisClient.connect()
        val redisCommands = statefulRedisConnection.sync()
        while (true) {
            try {
                redisCommands.get("hello")
            } catch (ex: Exception) {
                println(ex)
            }
            delay(500L)
        }
    }
}

Environment

  • Lettuce version: 5.2.2.RELEASE
  • Redis version: 5.0.6

Any suggestions would be greatly appreciated!

@KowalczykBartek
Contributor

KowalczykBartek commented Jan 7, 2021

@wangkekekexili On what OS are you testing this code? As far as I know, only Linux/epoll handles keep-alive properly (netty/netty#9780). Did you enable keepAlive in your app that is facing this problem?

edit:
I tried to reproduce the behaviour (I started an ElastiCache cluster and added a read replica) and I don't see any errors. Additionally, regarding:

 When scaling up AWS Redis, the DNS name remains the same but the IP changes

Are you sure that's true? When a new instance is added to the cluster, a new IP can appear, but why would AWS change an existing instance's IP? Can you paste the stack trace you see in your logs?

@wangkekekexili
Author

@KowalczykBartek

Thanks for your response.

I tried to reproduce the behaviour (I started an ElastiCache cluster and added a read replica) and I don't see any errors

Sorry, I may not have made it very clear, but by "scaling up" I mean modifying the node type (say, changing from cache.m5.large to cache.m5.2xlarge) to give the instance more memory (https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Scaling.RedisStandalone.ScaleUp.html). This action doesn't change the number of read replicas.

(I can also confirm that just adding a replica doesn't cause errors, as we have done that in production.)

On what OS are you testing this code?

Image "adoptopenjdk/openjdk8:jdk8u252-b09" (https://hub.docker.com/layers/adoptopenjdk/openjdk8/jdk8u252-b09/images/sha256-daf9b6b24d0a0d2099900e6eeef15b37360edd1c1933673173729773741e53a9?context=explore) is used.

> cat /etc/os-release

NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.4 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

Did you enable keepAlive in your app that is facing this problem?

Yes, I have enabled keepAlive. I set it in the socket options:

.socketOptions(
    SocketOptions
        .builder()
        .connectTimeout(Duration.ofMillis(500L))
        .keepAlive(true)
        .build()
)

I also noticed that I need to tweak some socket options to override the default values:

.nettyCustomizer(object : NettyCustomizer {
    override fun afterBootstrapInitialized(bootstrap: Bootstrap?) {
        bootstrap!!.option(EpollChannelOption.TCP_KEEPIDLE, 2)
        bootstrap.option(EpollChannelOption.TCP_KEEPCNT, 1)
        bootstrap.option(EpollChannelOption.TCP_KEEPINTVL, 1)
    }
})

But the keep-alive feature doesn't really work in my case, so I didn't include this part in my question snippet.
TCP_KEEPIDLE is "The time (in seconds) the connection needs to remain idle before TCP starts sending keepalive probes, if the socket option SO_KEEPALIVE has been set on this socket." In my case the connection is never idle, since I'm calling redisCommands.get("hello") periodically (in production we have a large QPS, so the connection is definitely never idle).
I verified this by removing the redisCommands.get("hello") call, and I do see a TCP probe packet every 2 seconds; adding it back, I can no longer find any probe packets.

@mp911de
Collaborator

mp911de commented Jan 8, 2021

The general motivation to use Lettuce is its built-in resiliency through auto-reconnect. That said, you should not see dead connections; rather, think of a connection as temporarily unavailable because of a failover. Using an HA deployment where the endpoint (DNS name) gets updated with the active master or replica node is the right way to approach high availability.

client side still holding the existing connection when the peer disappears

I assume you're talking about AWS removing the node and reconfiguring the cluster. As long as the infrastructure puts back a node and updates the DNS name, everything is fine. If the DNS name itself changes (cluster-a.aws.com to cluster-b.aws.com), then Lettuce can't do anything about that because it doesn't know that you're performing such a change.

Moreover, if a peer goes away and stops responding (firewall change, server node gets killed), then keep-alive is a good choice to detect dead peers. With #1437, we will apply Keep-Alive customizations, basically what you've outlined in your comment #1572 (comment).

Note that extended keep-alive requires either using NIO sockets with Java 11 or newer, epoll sockets (native transport), or io_uring sockets (native transport).
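
For illustration, a minimal sketch of what such a keep-alive customization could look like; it assumes the SocketOptions.KeepAliveOptions builder that #1437 targets (Lettuce 6.1+), and it only takes effect with the epoll/io_uring native transports or NIO sockets on Java 11+. On 5.x, the NettyCustomizer approach shown in the earlier comment remains the workaround.

import io.lettuce.core.ClientOptions
import io.lettuce.core.SocketOptions
import java.time.Duration

// Sketch only: assumes the extended keep-alive API from #1437 (Lettuce 6.1+).
fun keepAliveClientOptions(): ClientOptions {
    val keepAlive = SocketOptions.KeepAliveOptions.builder()
        .enable()                        // turn TCP keep-alive on
        .idle(Duration.ofSeconds(2))     // TCP_KEEPIDLE: idle time before the first probe
        .interval(Duration.ofSeconds(1)) // TCP_KEEPINTVL: time between probes
        .count(3)                        // TCP_KEEPCNT: failed probes before the peer counts as dead
        .build()

    return ClientOptions.builder()
        .socketOptions(SocketOptions.builder().keepAlive(keepAlive).build())
        .build()
}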

@wangkekekexili
Author

Thank you @mp911de for your response.

I assume you're talking about AWS removing the node and reconfiguring the cluster.

Yes, during the AWS Redis scale-up process, AWS sets up a new server and updates the DNS record to switch to the new IP without changing the name. Here is how AWS support describes the process:

"
When scaling Redis, the DNS remains the same but the IP changes, and on the back end the Elasticache service tries to do the transition as seamlessly as possible. Meaning that the underlying Redis servers are prepared and updated before the fail over on the DNS side happens and points to the new IP addresses for Primary and Secondary Nodes.
"

As long as the infrastructure puts back a node and updates the DNS name, everything is fine.

In my case, it causes timeout errors for some time during the process. Let me show a concrete example below.

During one scale-up test, I connected to a REDACTED.cache.amazonaws.com reader endpoint; it had IP 172.16.51.76 in the beginning and later switched to 172.16.51.138 during the scale-up.

At one point, the client starts to show errors. It's around this time that the DNS record is updated to the new IP address.

2021-01-04T10:41:14.024Z 
io.lettuce.core.RedisCommandTimeoutException: Command timed out after 500 millisecond(s)

Lettuce logs show that it is still trying to talk to the old IP address.

2021-01-04T10:41:14.524Z
[channel=0xfea5b586, /172.16.103.197:38798 -> REDACTED.cache.amazonaws.com/172.16.51.76:6379, epid=0x1] write() writeAndFlush command AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]

Some time later, Lettuce notices the channel is inactive and re-connects. It successfully re-connects to the new IP address.

2021-01-04T10:41:48.023Z
[channel=0xfea5b586, /172.16.103.197:38798 -> REDACTED.cache.amazonaws.com/172.16.51.76:6379, chid=0x1] channelInactive()

2021-01-04T10:41:48.204Z
Resolved SocketAddress REDACTED.cache.amazonaws.com/172.16.51.138:6379 using RedisURI [host='REDACTED.cache.amazonaws.com', port=6379]

It looks to me that if Lettuce could notice the connection is unavailable at "2021-01-04T10:41:14.024Z" and try to re-connect at that point, then it could recover sooner, hence I'm wondering whether that is possible.

@mp911de
Collaborator

mp911de commented Jan 8, 2021

Lettuce doesn't monitor DNS. If, during scaling, a new host gets put in place first, the DNS gets updated, and then the old host goes away, then the reconnect at that time is the only trigger we have. Of course, you can handle scaling events in your application by issuing a QUIT command; Lettuce then tries to reconnect, as sketched below.
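
A rough sketch of the QUIT approach, assuming the application receives some external signal (for example a scaling hook) that a scale-up is in progress; QUIT asks the server to close the connection, and the ConnectionWatchdog then reconnects and re-resolves the DNS name, picking up the new IP.

import io.lettuce.core.api.StatefulRedisConnection

// Sketch only: force a reconnect by asking the server to close the connection.
fun forceReconnect(connection: StatefulRedisConnection<String, String>) {
    try {
        connection.sync().quit()
    } catch (ex: Exception) {
        // If the old node is already gone, QUIT itself may time out; the channel
        // teardown that follows still triggers the reconnect.
        println("QUIT failed: $ex")
    }
}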

Since there isn't anything beyond that we could do, I'd like to close this ticket.

@mp911de mp911de added status: invalid An issue that we don't feel is valid and removed status: waiting-for-triage labels Jan 8, 2021
@wangkekekexili
Author

@mp911de Thank you. I'm wondering whether Lettuce can try to re-connect before the server responds to the QUIT command (since the server may not be available to answer), or whether we can manually tell Lettuce to re-connect?

@mp911de
Collaborator

mp911de commented Jan 9, 2021

No, that doesn't work. Another alternative could be reflectively obtaining the channel and closing it. Since the connection doesn't expect the channel to be closed, it will try to reconnect. However, reflection is tricky.
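
For completeness, a fragile sketch of that reflection idea. It assumes the connection is the default implementation (a RedisChannelHandler) and that the channel writer, or a writer it delegates to, holds a private netty Channel field; these are Lettuce internals and may change between versions.

import io.lettuce.core.RedisChannelHandler
import io.lettuce.core.RedisChannelWriter
import io.lettuce.core.api.StatefulRedisConnection
import io.netty.channel.Channel

// Sketch only: closes the underlying netty channel via reflection so that Lettuce
// treats it as an unexpected connection loss and reconnects.
fun closeUnderlyingChannel(connection: StatefulRedisConnection<String, String>) {
    val writer = (connection as RedisChannelHandler<*, *>).channelWriter
    closeChannelIn(writer)
}

private fun closeChannelIn(target: Any): Boolean {
    var type: Class<*>? = target.javaClass
    while (type != null) {
        for (field in type.declaredFields) {
            field.isAccessible = true
            val value = field.get(target) ?: continue
            if (value is Channel) {
                value.close()              // unexpected close -> ConnectionWatchdog reconnects
                return true
            }
            if (value is RedisChannelWriter && closeChannelIn(value)) {
                return true                // unwrap delegating writers (e.g. command expiry)
            }
        }
        type = type.superclass
    }
    return false
}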

@mp911de mp911de closed this as completed Feb 3, 2021
3 participants