Not able to read the updated Master when connected Master/Slave through sentinel #1293
Comments
Lettuce listens continuously to the Sentinel Pub/Sub channels for topology updates. This approach is the most elaborate one, as Sentinels actively publish changes in master and replica configuration. Can you provide a simple, reproducible test case, or the logs from the time of the failover until the command failure?
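For reference, the mechanism described above is what Lettuce's Master/Replica API wires up when it is handed a Sentinel `RedisURI`. A minimal connection sketch follows; the sentinel hostnames and the `mymaster` id are placeholders for your own deployment, and this is a configuration sketch rather than the reporters' actual setup:

```java
import java.time.Duration;

import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterreplica.MasterReplica;
import io.lettuce.core.masterreplica.StatefulRedisMasterReplicaConnection;

public class SentinelConnectSketch {
    public static void main(String[] args) {
        // Placeholder Sentinel endpoints and master id -- adjust to your deployment.
        RedisURI sentinelUri = RedisURI.Builder
                .sentinel("sentinel-1", 26379)
                .withSentinel("sentinel-2", 26379)
                .withSentinelMasterId("mymaster")
                .withTimeout(Duration.ofSeconds(10))
                .build();

        RedisClient client = RedisClient.create();

        // MasterReplica.connect subscribes to the Sentinels' Pub/Sub channels
        // and re-routes writes to the new master after a failover.
        StatefulRedisMasterReplicaConnection<String, String> connection =
                MasterReplica.connect(client, StringCodec.UTF8, sentinelUri);
        connection.setReadFrom(ReadFrom.REPLICA_PREFERRED);

        System.out.println(connection.sync().ping());

        connection.close();
        client.shutdown();
    }
}
```

Running this requires a live Sentinel deployment, so it is shown here only to make the Pub/Sub-based topology-update path concrete.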
@mp911de I face this issue intermittently. I tried it a few times, and here are the logs that I see. 1. I failed the Redis master and it recovered fine. I see the errors below after repeating the above step a few times (logs attached). I'm actually unsure whether I'm missing something in the configuration or whether it is a bug.
Hello guys! This is the driver configuration we're using:
We're using version 6.0.1, but we've been facing this issue for a long time across other versions as well. In the application's log we see this kind of message:
Is there some kind of debugging we can do to get more information from the driver itself?
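Lettuce logs through SLF4J, and its loggers are named after its packages, so raising `io.lettuce.core` to `DEBUG` surfaces connection and topology activity. A minimal logback fragment, assuming a Logback-based setup (adapt the logger names to your logging backend):

```xml
<!-- logback.xml: raise Lettuce's internals to DEBUG to see topology lookups -->
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder><pattern>%d %-5level %logger{36} - %msg%n</pattern></encoder>
  </appender>
  <logger name="io.lettuce.core" level="DEBUG"/>
  <root level="INFO"><appender-ref ref="STDOUT"/></root>
</configuration>
```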
I don't know if it's completely related, but we faced a somewhat similar issue some time ago, where newly started applications would sometimes complain about not finding a master and stay in a broken state. The only way to 'fix' them was by sending a SENTINEL RESET to one of the Sentinel servers, so all clients would get the info sent by the Sentinels. We couldn't quite figure out what was going wrong through the debug logging. After more digging, we did stumble upon a change from some time ago where we had reduced the Redis timeout from 10s to 1s. After reverting this change, the problem stopped occurring.
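The timeout the comment above refers to can be set on the `RedisURI` itself. A small sketch of reverting it to 10 seconds (the sentinel host and master id are placeholders, not taken from the thread):

```java
import java.time.Duration;

import io.lettuce.core.RedisURI;

public class TimeoutSketch {
    public static void main(String[] args) {
        // Placeholder endpoint and master id; the relevant part is withTimeout.
        RedisURI uri = RedisURI.Builder
                .sentinel("sentinel-1", 26379)
                .withSentinelMasterId("mymaster")
                .withTimeout(Duration.ofSeconds(10)) // reverted from 1s back to 10s
                .build();

        System.out.println(uri.getTimeout()); // PT10S
    }
}
```

An aggressive 1s timeout can cause commands (including the client's own Sentinel lookups) to fail during a failover window, which is consistent with the symptom described above.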
I am seeing the same (or a similar) issue. I think my scenario was roughly the following (this was on my local dev setup, luckily):
With this in place, Lettuce continuously fails to reconnect to the master node if it was down at the time of the initial connection. Here is the stack trace (up to the first application-level line):
The proper Redis master node is on 192.168.97.12 in this case, but Lettuce never realizes it. @mp911de Is my assumption anywhere near correct that, because the 192.168.97.12 node was unavailable on startup, Lettuce removed it from its list of "potential master nodes", or is this a false assumption on my behalf?

The Sentinel nodes were also down when the connection was made, so... could it be that it never connected to the Sentinel nodes in this case (only to the Redis nodes)? That would be one plausible theory for why it never received the Pub/Sub topology update here. Note, I'm only guessing; I haven't looked at the Lettuce internals in this case. On the other hand, issuing the Sentinel reset manually did indeed make it work, so I guess it must have been connected to Sentinel at that point at least...

Could it be like this: Sentinel was down when Lettuce first tried to connect => Lettuce never got the initial Pub/Sub topology updates. Once Lettuce had managed to connect, all was fine in terms of subsequent topology updates, but the actual updates from when 192.168.97.12 was made the master were gone => it never managed to recover. Distributed systems are indeed hard... 😄
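The manual reset mentioned in several comments is an ordinary Sentinel command issued via `redis-cli`. A sketch, with a placeholder Sentinel host and the `mymaster` name standing in for your configured master id:

```shell
# Forces this Sentinel to discard its state for masters matching the pattern
# and rediscover replicas and other Sentinels from scratch.
redis-cli -h sentinel-1 -p 26379 SENTINEL RESET mymaster
```

The commenters above observed that this nudged stuck clients back into a consistent view of the topology, though it is a workaround rather than a fix.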
Another theory: it could actually be that 192.168.97.12 was already the master. While it was firewalled, no other master could be elected. Once it came back, the Redis slaves would reconnect to it nicely, but no topology update was published, since the topology didn't actually change. (It was more a matter of "the master came back"; the problem was just that Lettuce had never managed to connect to this node.)
We are also facing the same issue with the master-slave sentinel setup. Using lettuce 5.2.2.RELEASE. |
We are also facing the same issue using lettuce 5.1.8.RELEASE. |
I am also facing the same issue, currently using version 6.1.9. We recently introduced a Master/Slave setup; before, we were just connecting to the master node. The exception and some logs are attached. I was not able to identify the cause, but as a temporary fix I had to restart the service, and it went well after that.
Folks, to make any progress on this issue we would need a minimal reproducible example that could help pinpoint the problem.
Bug Report
I'm using a Master/Slave connection through Sentinel, with the configuration below to connect to Redis Sentinel.
Current Behavior
We host our app in Kubernetes. For the first few times the master node fails, Lettuce responds by updating the Sentinel configuration; after that, it returns an error saying "Cannot find the master node".
Here are the logs:
Expected behavior/code
The client should keep refreshing the topology continuously, even after receiving updated master details from Sentinel, if there is some issue with the current details.
Environment