
Redis command keep buffered(Questions) #466

Closed
kojilin opened this issue Feb 8, 2017 · 17 comments
Labels
for: stackoverflow A question that is better suited to stackoverflow.com

Comments

@kojilin
Contributor

kojilin commented Feb 8, 2017

My environment is 4.3.0.Final, using Redis Cluster. Recently one of our servers hit an OOM caused by Lettuce. Below is an image of the dumped heap.
(heap dump screenshot: 2017-02-08 17 01 44)

It looks like a lot of commands are buffered. I'm not sure, but I suspect something is wrong with part of the connections while the nodes are still alive in the cluster. I suspect this is why the commands stay buffered (https://github.com/mp911de/lettuce/blob/master/src/main/java/com/lambdaworks/redis/protocol/CommandHandler.java#L307).
But I saw there is a ConnectionWatchdog that helps with reconnecting...

Below is my configuration (roughly wired up as sketched below):

enablePeriodicRefresh()
enableAllAdaptiveRefreshTriggers()

Everything else is left at the defaults.
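
For context, this is roughly how those options end up wired into the cluster client (a minimal sketch; the clusterClient variable and the builder wiring are my assumptions, not the exact code):

    ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
            .enablePeriodicRefresh()                // periodic topology refresh with the default period
            .enableAllAdaptiveRefreshTriggers()     // adaptive refresh on redirect/reconnect triggers
            .build();

    clusterClient.setOptions(ClusterClientOptions.builder()
            .topologyRefreshOptions(refreshOptions)
            .build());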

@kojilin kojilin changed the title Redis command keep buffered Redis command keep buffered(Questions) Feb 8, 2017
@mp911de
Collaborator

mp911de commented Feb 8, 2017

By default, commands get buffered while a connection is disconnected and are replayed once the connection is restored. You can tweak that behavior with ClientOptions to set the disconnected behavior and limit the request queue size. It looks like the connection got disconnected and commands piled up in the queue.
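
For reference, a minimal sketch of capping the buffer via the options (the REJECT_COMMANDS behavior and the concrete queue size are just examples, assuming lettuce 4.x ClusterClientOptions and an existing clusterClient):

    ClusterClientOptions options = ClusterClientOptions.builder()
            .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS) // fail fast while disconnected instead of buffering
            .requestQueueSize(10000)                                                   // cap the number of queued commands
            .build();
    clusterClient.setOptions(options);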

See also:

@kojilin
Contributor Author

kojilin commented Feb 8, 2017

Thanks for the wiki link.

I think it should auto-reconnect (the default is true); after restarting the application, there is no problem connecting to all nodes. The queue kept growing for days, and the other servers had no problems.

I'm curious whether it's possible that auto-reconnect doesn't work (or maybe it tries reconnecting but always fails; step 15 at reconnect)....

I will change ClientOptions to avoid the OOM first, thanks.

@mp911de
Collaborator

mp911de commented Feb 8, 2017

You might want to inspect the connection target (host/port) and the topology view (Partitions) associated with the connection in the heap dump. That should give you a hint as to which connection/node was affected.

@mp911de mp911de added the for: stackoverflow A question that is better suited to stackoverflow.com label Feb 8, 2017
@kojilin
Contributor Author

kojilin commented Feb 9, 2017

Just adding some investigation results. I saw that one CommandHandler's channel is null (all the other CommandHandlers are healthy).
(heap dump screenshot: 2017-02-09 16 32 34)

So the ConnectionWatchdog has a reconnectScheduleTimeout (not null), but I don't know why its state is ST_EXPIRED(2) while attempts is 0.
(ConnectionWatchdog screenshot: 2017-02-09 16 32 02)
I think this state means it won't try to reconnect anymore?

@mp911de
Collaborator

mp911de commented Feb 9, 2017

Zero attempts is unusual. Take a look into remoteAddress of ConnectionWatchdog to see the connection endpoint of the last successful connection. Do you have a log to see whether a previous error caused the ConnectionWatchdog to stop reconnecting? reconnectScheduleTimeout is set to null when ConnectionWatchdog.run(…) is called. This means that either reconnectWorkers (the event-loop) was terminated or the execution failed before ConnectionWatchdog.run(…) was called.

Reconnections are performed by ReconnectionHandler. During reconnection, ReconnectionHandler.currentFuture is set to synchronize connection attempts.

@kojilin
Contributor Author

kojilin commented Feb 10, 2017

remoteAddress of ConnectionWatchdog has a value. And lastReconnectionLogging has a timestamp from just before things got stuck (we wrap Lettuce with a timeout scheduler, so our first timeout occurred 15 seconds after lastReconnectionLogging). And ReconnectionHandler.currentFuture is null.

In my log, I didn't get the warning at https://github.com/mp911de/lettuce/blob/master/src/main/java/com/lambdaworks/redis/protocol/ConnectionWatchdog.java#L277

(screenshot: 2017-02-10 10 16 09_new)

Hope this helps (I'm still investigating too).

@kojilin
Contributor Author

kojilin commented Feb 22, 2017

I can't reproduce it locally, but I'm curious about https://github.com/mp911de/lettuce/blob/master/src/main/java/com/lambdaworks/redis/protocol/ConnectionWatchdog.java#L215: is it possible that the timeout's run and the reconnectWorkers.submit execute before the assignment to reconnectScheduleTimeout?

@mp911de
Collaborator

mp911de commented Feb 22, 2017

It is possible if the timeout is small enough and a lot of things come together (the right time, a GC pause) although it's very unlikely. For now, I'd like to close this ticket because there's effectively nothing we can do about it right now. At a later time, if we find some clues, we can still reopen the ticket. Does this make sense?

@kojilin
Contributor Author

kojilin commented Feb 23, 2017

OK, I just suspect that situation may cause the state I found in the heap dump.
If I intentionally delay the assignment (using another local variable and a sleep) and restart Redis, I can reproduce a similar situation, with the Cannot connect: .... log at https://github.com/mp911de/lettuce/blob/master/src/main/java/com/lambdaworks/redis/protocol/ConnectionWatchdog.java#L277.
But I didn't find that log entry in my production logs, so maybe it's not the same.

Let me close this; if I find something new, I will reopen it.

@kojilin
Contributor Author

kojilin commented Jun 4, 2017

We hit this problem again, so let me reopen this to see whether my thoughts can provide a clue.

I'm not sure whether this can be treated as reproducible, e.g.:

    Timeout localTimeout = timer.newTimeout(it -> {

        if (!isEventLoopGroupActive()) {
            logger.debug("isEventLoopGroupActive() == false");
            return;
        }

        reconnectWorkers.submit(() -> {
            ConnectionWatchdog.this.run(it);
            return null;
        });
    }, timeout, TimeUnit.MILLISECONDS);

    // force sleep here.
    Thread.sleep(50L);
    this.reconnectScheduleTimeout = localTimeout;

at https://github.com/lettuce-io/lettuce-core/blob/master/src/main/java/com/lambdaworks/redis/protocol/ConnectionWatchdog.java#L217. Then it would hit the "reconnectScheduleTimeout is not null" problem.

And in the current code base, maybe it would be OK to remove the reconnectScheduleTimeout check, because after the async reconnect change the scheduleReconnect calls all come from the Netty thread? Or move reconnectScheduleTimeout = null; to https://github.com/lettuce-io/lettuce-core/blob/master/src/main/java/com/lambdaworks/redis/protocol/ConnectionWatchdog.java#L278 and to the previous early-return branches? (One way to read that suggestion is sketched below.)
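
Sketched as a rough illustration only (not the actual lettuce code; the rest of run(Timeout) is abbreviated, and the exact placement is my interpretation):

    // Hypothetical reordering inside ConnectionWatchdog.run(Timeout): clear the stored
    // timeout before the early-return checks, so a late assignment in scheduleReconnect(...)
    // cannot leave a stale, already-expired Timeout behind.
    public void run(Timeout timeout) throws Exception {

        reconnectScheduleTimeout = null;   // moved ahead of the early returns

        if (!isEventLoopGroupActive()) {
            logger.debug("isEventLoopGroupActive() == false");
            return;
        }

        // ... remaining reconnect logic unchanged ...
    }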

@kojilin kojilin reopened this Jun 4, 2017
@mp911de
Collaborator

mp911de commented Jun 6, 2017

I think it could make sense to improve state handling to prevent multiple attempts of reconnect initialization. reconnectScheduleTimeout could be additionally cancelled on prepareClose().

Have you tried enabling debug logging for ConnectionWatchdog only?

@kojilin
Contributor Author

kojilin commented Jun 7, 2017

Hm, let me try. But it's hard to reproduce; the last time was in February and this time is June.

@Spikhalskiy

Spikhalskiy commented Nov 7, 2017

I think I face the same problem pretty often in production. I have a limited command queue. Often, after some network work with possible blips, I end up in this state:

Caused by: com.lambdaworks.redis.RedisException: Request queue size exceeded: 10000. Commands are not accepted until the queue size drops.
	at com.lambdaworks.redis.protocol.CommandHandler.validateWrite(CommandHandler.java:384)
	at com.lambdaworks.redis.protocol.CommandHandler.write(CommandHandler.java:347)
	at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.writeCommand(ClusterDistributionChannelWriter.java:176)
	at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.writeCommand(ClusterDistributionChannelWriter.java:167)
	at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:124)
	at com.lambdaworks.redis.RedisChannelHandler.dispatch(RedisChannelHandler.java:133)
	at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:232)
	at com.lambdaworks.redis.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:1979)
	at com.lambdaworks.redis.AbstractRedisAsyncCommands.get(AbstractRedisAsyncCommands.java:400)

The number of such exceptions makes me think that Lettuce likely can't run commands against just one node in the cluster. At the same time, the nodes are definitely accessible, and a restart resolves this state.

My configuration code for the client:

        ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
                .dynamicRefreshSources(false)
                .enableAllAdaptiveRefreshTriggers()
                .enablePeriodicRefresh(10, TimeUnit.MINUTES)
                .build();
        ClusterClientOptions options = ClusterClientOptions.builder()
                .maxRedirects(5)
                .topologyRefreshOptions(topologyRefreshOptions)
                .requestQueueSize(10000)
                .build();
        clusterClient.setOptions(options);
        clusterClient.setDefaultTimeout(50, TimeUnit.MILLISECONDS);

Lettuce 4.4.1.Final + netty 4.0.51.Final

Watchdog logging:

[07 Nov 2017 14:29:56,628][INFO ][ConnectionWatchdog,lettuce-eventExecutorLoop-3-2] - Reconnecting, last destination was /10.201.12.214:9007
[07 Nov 2017 14:29:56,728][WARN ][ConnectionWatchdog,lettuce-epollEventLoop-6-3] - Cannot reconnect: java.util.concurrent.TimeoutException: Reconnection attempt exceeded timeout of 50 MILLISECONDS

Which doesn't make much sense: the node is accessible, and a connection opens from the same host via the command line in 5 ms. And after an application restart, the connection is established successfully with the same parameters.

@mp911de
Collaborator

mp911de commented Nov 7, 2017

@Spikhalskiy interesting to hear that Lettuce can't reconnect to the node (timeout) but redis-cli can. I'm not sure whether this is related to #615, where replaying buffered commands keeps Lettuce busy for quite a while. Could you check what the threads are doing in such a case (thread dump)?

@Spikhalskiy

@mp911de I will take a thread dump next time I get into this state. I get it pretty often, once every week or two. Currently I cure it by shutting down and recreating the Lettuce client.

@Spikhalskiy

@mp911de Why I posted here and think it can be related: if I remove the queue limit from the client configuration, in my case the issue and its outcome look very similar to the original one.

@mp911de
Collaborator

mp911de commented Jan 22, 2018

Closing this one as per #679.

@mp911de mp911de closed this as completed Jan 22, 2018