
Redis command keep buffered(Questions) #466

Closed
kojilin opened this issue Feb 8, 2017 · 17 comments
Labels
for: stackoverflow A question that is better suited to stackoverflow.com

Comments

@kojilin
Contributor

kojilin commented Feb 8, 2017

My environment is 4.3.0.Final, using Redis Cluster. Recently one of our servers hit an OOM caused by Lettuce. Below is an image of the dumped heap.
(heap dump screenshot: 2017-02-08 17 01 44)

It looks like a lot of commands are buffered. I'm not sure, but I suspect something is wrong with part of the connections while the nodes are still alive in the cluster. I suspect this is why the commands stay buffered (https://github.com/mp911de/lettuce/blob/master/src/main/java/com/lambdaworks/redis/protocol/CommandHandler.java#L307).
But I saw there is a ConnectionWatchdog that helps with reconnecting...

Below is my configuration (roughly wired up as sketched below):

enablePeriodicRefresh()
enableAllAdaptiveRefreshTriggers()

Everything else is left at the defaults.
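
For context, this is roughly how those options end up wired into the cluster client (a minimal sketch; the clusterClient variable and the builder wiring are my assumptions, not the exact code):

    ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
            .enablePeriodicRefresh()                // periodic topology refresh with the default period
            .enableAllAdaptiveRefreshTriggers()     // adaptive refresh on redirect/reconnect triggers
            .build();

    clusterClient.setOptions(ClusterClientOptions.builder()
            .topologyRefreshOptions(refreshOptions)
            .build());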

@kojilin kojilin changed the title Redis command keep buffered Redis command keep buffered(Questions) Feb 8, 2017
@mp911de
Collaborator

mp911de commented Feb 8, 2017

By default, commands get buffered while a connection is disconnected and are replayed once the connection is restored. You can tweak that behavior with ClientOptions to set the disconnected behavior and limit the request queue size. It looks like the connection got disconnected and commands piled up in the queue.
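
For reference, a minimal sketch of capping the buffer via the options (the REJECT_COMMANDS behavior and the concrete queue size are just examples, assuming lettuce 4.x ClusterClientOptions and an existing clusterClient):

    ClusterClientOptions options = ClusterClientOptions.builder()
            .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS) // fail fast while disconnected instead of buffering
            .requestQueueSize(10000)                                                   // cap the number of queued commands
            .build();
    clusterClient.setOptions(options);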

See also:

@kojilin
Contributor Author

kojilin commented Feb 8, 2017

Thanks for the wiki link.

I think it should auto-reconnect (the default is true); after restarting the application, there is no problem connecting to all nodes. The queue kept growing for days, and the other servers had no problems.

I'm curious whether it's possible that auto-reconnect doesn't work (or maybe it tries reconnecting but always fails; step 15 at reconnect)....

I will change ClientOptions to avoid the OOM first, thanks.

@mp911de
Collaborator

mp911de commented Feb 8, 2017

You might want to inspect the connection target (host/port) and the topology view (Partitions) associated with the connection in the heap dump. That should give you a hint as to which connection/node was affected.

@mp911de mp911de added the for: stackoverflow A question that is better suited to stackoverflow.com label Feb 8, 2017
@kojilin
Contributor Author

kojilin commented Feb 9, 2017

Just adding some investigation results. I saw that one CommandHandler's channel is null (all the other CommandHandlers are healthy).
(heap dump screenshot: 2017-02-09 16 32 34)

So the ConnectionWatchdog has a reconnectScheduleTimeout (not null), but I don't know why its state is ST_EXPIRED(2) while attempts is 0.
(ConnectionWatchdog screenshot: 2017-02-09 16 32 02)
I think this state means it won't try to reconnect anymore?

@mp911de
Collaborator

mp911de commented Feb 9, 2017

Zero attempts is unusual. Take a look into remoteAddress of ConnectionWatchdog to see the connection endpoint of the last successful connection. Do you have a log to see whether a previous error caused the ConnectionWatchdog to stop reconnecting? reconnectScheduleTimeout is set to null when ConnectionWatchdog.run(…) is called. This means that either reconnectWorkers (the event-loop) was terminated or the execution failed before ConnectionWatchdog.run(…) was called.

Reconnections are performed by ReconnectionHandler. During reconnection, ReconnectionHandler.currentFuture is set to synchronize connection attempts.

@kojilin
Contributor Author

kojilin commented Feb 10, 2017

remoteAddress of ConnectionWatchdog has a value. And lastReconnectionLogging has a timestamp from just before things got stuck (we wrap Lettuce with a timeout scheduler, so our first timeout occurred 15 seconds after lastReconnectionLogging). And ReconnectionHandler.currentFuture is null.

In my log, I didn't get the warning at https://github.com/mp911de/lettuce/blob/master/src/main/java/com/lambdaworks/redis/protocol/ConnectionWatchdog.java#L277

(screenshot: 2017-02-10 10 16 09_new)

Hope this helps (I'm still investigating too).

@kojilin
Contributor Author

kojilin commented Feb 22, 2017

I can't reproduce it locally, but I'm curious about https://github.com/mp911de/lettuce/blob/master/src/main/java/com/lambdaworks/redis/protocol/ConnectionWatchdog.java#L215: is it possible that the timeout's run and the reconnectWorkers.submit execute before the assignment to reconnectScheduleTimeout?

@mp911de
Collaborator

mp911de commented Feb 22, 2017

It is possible if the timeout is small enough and a lot of things come together (the right time, a GC pause) although it's very unlikely. For now, I'd like to close this ticket because there's effectively nothing we can do about it right now. At a later time, if we find some clues, we can still reopen the ticket. Does this make sense?

@kojilin
Contributor Author

kojilin commented Feb 23, 2017

OK, I just suspect that situation may cause the state I found in the heap dump.
If I intentionally delay the assignment (using another local variable and a sleep) and restart Redis, I can reproduce a similar situation, with the Cannot connect: .... log at https://github.com/mp911de/lettuce/blob/master/src/main/java/com/lambdaworks/redis/protocol/ConnectionWatchdog.java#L277.
But I didn't find that log entry in my production logs, so maybe it's not the same.

Let me close this; if I find something new, I will reopen it.

@kojilin
Contributor Author

kojilin commented Jun 4, 2017

We hit this problem again, so let me reopen this to see whether my thoughts can provide a clue.

I'm not sure whether this can be treated as reproducible, e.g.:

    Timeout localTimeout = timer.newTimeout(it -> {

        if (!isEventLoopGroupActive()) {
            logger.debug("isEventLoopGroupActive() == false");
            return;
        }

        reconnectWorkers.submit(() -> {
            ConnectionWatchdog.this.run(it);
            return null;
        });
    }, timeout, TimeUnit.MILLISECONDS);

    // force sleep here.
    Thread.sleep(50L);
    this.reconnectScheduleTimeout = localTimeout;

at https://github.com/lettuce-io/lettuce-core/blob/master/src/main/java/com/lambdaworks/redis/protocol/ConnectionWatchdog.java#L217. Then it would hit the "reconnectScheduleTimeout is not null" problem.

And in the current code base, maybe it would be OK to remove the reconnectScheduleTimeout check, because after the async reconnect change the scheduleReconnect calls all come from the Netty thread? Or move reconnectScheduleTimeout = null; to https://github.com/lettuce-io/lettuce-core/blob/master/src/main/java/com/lambdaworks/redis/protocol/ConnectionWatchdog.java#L278 and to the previous early-return branches? (One way to read that suggestion is sketched below.)
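
Sketched as a rough illustration only (not the actual lettuce code; the rest of run(Timeout) is abbreviated, and the exact placement is my interpretation):

    // Hypothetical reordering inside ConnectionWatchdog.run(Timeout): clear the stored
    // timeout before the early-return checks, so a late assignment in scheduleReconnect(...)
    // cannot leave a stale, already-expired Timeout behind.
    public void run(Timeout timeout) throws Exception {

        reconnectScheduleTimeout = null;   // moved ahead of the early returns

        if (!isEventLoopGroupActive()) {
            logger.debug("isEventLoopGroupActive() == false");
            return;
        }

        // ... remaining reconnect logic unchanged ...
    }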

@kojilin kojilin reopened this Jun 4, 2017
@mp911de
Collaborator

mp911de commented Jun 6, 2017

I think it could make sense to improve state handling to prevent multiple attempts of reconnect initialization. reconnectScheduleTimeout could be additionally cancelled on prepareClose().

Have you tried enabling debug logging for ConnectionWatchdog only?

@kojilin
Contributor Author

kojilin commented Jun 7, 2017

Hm, let me try. But it's hard to reproduce; the last time was in February and this time is June.

@Spikhalskiy

Spikhalskiy commented Nov 7, 2017

I think I face the same problem pretty often in production. I have a limited command queue. Often, after some network work with possible blips, I end up in this state:

Caused by: com.lambdaworks.redis.RedisException: Request queue size exceeded: 10000. Commands are not accepted until the queue size drops.
	at com.lambdaworks.redis.protocol.CommandHandler.validateWrite(CommandHandler.java:384)
	at com.lambdaworks.redis.protocol.CommandHandler.write(CommandHandler.java:347)
	at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.writeCommand(ClusterDistributionChannelWriter.java:176)
	at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.writeCommand(ClusterDistributionChannelWriter.java:167)
	at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:124)
	at com.lambdaworks.redis.RedisChannelHandler.dispatch(RedisChannelHandler.java:133)
	at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:232)
	at com.lambdaworks.redis.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:1979)
	at com.lambdaworks.redis.AbstractRedisAsyncCommands.get(AbstractRedisAsyncCommands.java:400)

The number of such exceptions makes me think that Lettuce likely can't run commands against just one node in the cluster. At the same time, the nodes are definitely accessible, and a restart resolves this state.

My configuration code for the client:

        ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
                .dynamicRefreshSources(false)
                .enableAllAdaptiveRefreshTriggers()
                .enablePeriodicRefresh(10, TimeUnit.MINUTES)
                .build();
        ClusterClientOptions options = ClusterClientOptions.builder()
                .maxRedirects(5)
                .topologyRefreshOptions(topologyRefreshOptions)
                .requestQueueSize(10000)
                .build();
        clusterClient.setOptions(options);
        clusterClient.setDefaultTimeout(50, TimeUnit.MILLISECONDS);

Lettuce 4.4.1.Final + netty 4.0.51.Final

Watchdog logging:

[07 Nov 2017 14:29:56,628][INFO ][ConnectionWatchdog,lettuce-eventExecutorLoop-3-2] - Reconnecting, last destination was /10.201.12.214:9007
[07 Nov 2017 14:29:56,728][WARN ][ConnectionWatchdog,lettuce-epollEventLoop-6-3] - Cannot reconnect: java.util.concurrent.TimeoutException: Reconnection attempt exceeded timeout of 50 MILLISECONDS

Which doesn't make much sense: the node is accessible, and a connection opens from the same host via the command line in 5 ms. And after an application restart, the connection is established successfully with the same parameters.

@mp911de
Collaborator

mp911de commented Nov 7, 2017

@Spikhalskiy interesting to hear that Lettuce can't reconnect to the node (timeout) but redis-cli can. I'm not sure whether this is related to #615, where replaying buffered commands keeps Lettuce busy for quite a while. Could you check what the threads are doing in such a case (thread dump)?

@Spikhalskiy

@mp911de I will take a thread dump next time I get into this state. I get it pretty often, once every week or two. Currently I cure it by shutting down and recreating the Lettuce client.

@Spikhalskiy

@mp911de Why I posted here and think it can be related: if I remove the queue limit from the client configuration, in my case the issue and its outcome look very similar to the original one.

@mp911de
Collaborator

mp911de commented Jan 22, 2018

Closing this one as per #679.

@mp911de mp911de closed this as completed Jan 22, 2018