Document MasterSlave connection behavior on partial node failures #894

Closed
jocull opened this issue Oct 18, 2018 · 8 comments
Labels: type: documentation (A documentation update)


jocull commented Oct 18, 2018

Bug Report

Current Behavior

Connecting to master/slave setups using autodiscovery can half-fail and leave the list of nodes empty, with no chance to recover. The problem is that creating the StatefulRedisMasterSlaveConnection will succeed, but every attempt to use it will fail. The error produced is at io.lettuce.core.masterslave.MasterSlaveConnectionProvider#getMaster.

I believe this happens because the initial topology refresh can fail. The first connection to the master may be fine -- so no exception is thrown at that point. But the refresh seems to happen out-of-band and appears to ping each discovered Redis node at io.lettuce.core.masterslave.MasterSlaveTopologyRefresh#getNodes and, more specifically, io.lettuce.core.masterslave.Connections#requestPing.

To reproduce this you can set a breakpoint (or a pause of your choice) at the call to requestPing and send a DEBUG SLEEP [some interval] to Redis - preferably the master node. After this you can see the knownNodes list refresh, and if you're in a single master environment it will set itself to an empty list. If you're in a master/slave environment, any node that fails to ping at this exact point in time will be removed from the node list and not contacted again. I think this is probably by design, but it unfortunately produces at best unexpected results and at worst connections that can't ever work.
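
For illustration, a rough sketch of the "manual chaos" part, assuming a separate standalone connection to the master (debugSleep on the sync API is used to issue DEBUG SLEEP); host and port are placeholders:

import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.api.StatefulRedisConnection;

// While MasterSlave.connect(...) is paused at the requestPing breakpoint, run this from a
// second process (or simply `redis-cli DEBUG SLEEP 30`) so the ping cycle times out and the
// node is dropped from knownNodes.
RedisClient debugClient = RedisClient.create(RedisURI.create("redis-master-host", 6379));
try (StatefulRedisConnection<String, String> debugConn = debugClient.connect()) {
    debugConn.sync().debugSleep(30); // blocks the Redis server for 30 seconds
} finally {
    debugClient.shutdown();
}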

From here I don't think it can recover. The instance of the StatefulRedisMasterSlaveConnection stays alive, but it will throw every time you try to use it because there are no nodes in the list and the topology will never refresh. This makes it hard to detect.

Input Code

  • Java; this is just a basic connection - you must interrupt it as described above and introduce some manual chaos.
import java.util.Properties;

import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.Utf8StringCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;
import io.lettuce.core.resource.DirContextDnsResolver;

RedisURI redisUri = buildRedisUri(host, port, useSsl, password); // buildRedisUri is our own helper
ClientResources redisClientResources = DefaultClientResources.builder()
        .dnsResolver(new DirContextDnsResolver(true, false, new Properties()))
        .build();

RedisClient redisClient = RedisClient.create(redisClientResources, redisUri);

StatefulRedisMasterSlaveConnection<String, String> conn = MasterSlave.connect(redisClient, new Utf8StringCodec(), redisUri);
conn.setReadFrom(ReadFrom.MASTER_PREFERRED);

// !!! Throws at `io.lettuce.core.masterslave.MasterSlaveConnectionProvider.getMaster`
//               `throw new RedisException(String.format("Master is currently unknown: %s", knownNodes));`
String result = conn.sync().get("MyKey");

Expected behavior/code

I expected that if the nodes were unreachable, the connection would either:

  1. Fail to connect and throw an exception (we can catch this on our end and retry) or...
  2. Attempt to heal itself in the future, via a mechanism like periodic refresh (such as clustered mode does)

Environment

  • Lettuce version(s): 5.1.1.RELEASE
  • Redis version: 4.0.10

Possible Solution

I would suggest that the list of nodes initially discovered be cached somewhere, and retries be attempted later. Clustered clients do this with periodic and adaptive refreshes. I think reusing the initial discovery this way in non-clustered setups would produce more reliable long-term results and give the connection a chance to recover from this particular failure state.

Another solution would be to simply throw an error if no nodes are discovered, so we can catch it before the StatefulRedisMasterSlaveConnection instance is kept (just as a normal connection failure would). It may also be worthwhile to expose the current list of nodes from the connection so it can be inspected for issues (e.g. discovering only slave nodes, which may not be suitable for all cases).
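
As a caller-side illustration of pinning a known node list (rather than the internal caching suggested above): MasterSlave.connect also accepts a fixed collection of node URIs, so the node set comes from the list instead of being discovered from a single seed endpoint. The URIs below are placeholders; imports as in the reproduction snippet above, plus java.util.Arrays and java.util.List. Note that with ElastiCache the individual node addresses are typically not reachable, as discussed later in this thread.

// Hypothetical sketch: supply the node list statically instead of relying on autodiscovery.
List<RedisURI> staticNodes = Arrays.asList(
        RedisURI.create("redis-master-host", 6379),
        RedisURI.create("redis-replica-host", 6379));
StatefulRedisMasterSlaveConnection<String, String> staticConn =
        MasterSlave.connect(redisClient, new Utf8StringCodec(), staticNodes);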

Additional context

I hope this helps! Please let me know if I have left out any important details or if there is a workaround or expected behavior I'm clearly missing. I may be able to take suggestions and make them into a PR.


mp911de commented Oct 19, 2018

Thanks for this extensive bug report. Static configuration (without Sentinel) is meant to remain static, so we won't provide any means of an automated refresh. For Redis Cluster and Sentinel, we have at least a notion of a topology source that we can query regularly; Sentinel and Cluster give us some reliability about where to find that topology source.

For Master/Replica, that's different as changes to the static setup are not tied to any rules.

Regarding your suggestions:

  1. Fail to connect and throw an exception (we can catch this on our end and retry) or...
  2. Attempt to heal itself in the future, via a mechanism like periodic refresh (such as clustered mode does)

The assumption here is that StatefulRedisMasterSlaveConnection requires a master node to be functional. What MasterSlave currently guarantees is that it's able to discover at least one working node. Changing this guarantee would require a running master node. That can be a good thing or a bad thing; it pretty much depends on what you're trying to do.

We quite often face the requirement to allow connections to a partially-down Redis setup. Changing our behavior would raise the bar in terms of required Redis availability.

I would suggest that the list of nodes initially discovered be cached somewhere, and retries be attempted later.

This approach could work for a few users, but not for all. Some users might not want this behavior.


jocull commented Oct 19, 2018

A couple extra details - in case they help 🙂

  1. We're using ElastiCache, which only provides one endpoint URL for the entire replication group, regardless of how many masters or replicas there are.
  2. We would need a way to fail if a master node wasn't discovered. We really need a master node, so a sole replica won't work (it would be better to fail until the replica is promoted). But I'm not sure how to do this, since the list of discovered nodes is private as far as I can tell.

What MasterSlave currently guarantees is that it's able to discover at least one working node.

  1. Part of the issue, I think, is that this guarantee is failing: if timeouts happen at exactly the wrong time, no working node is discovered. Just wanted to make sure that was clear 😄

mp911de added the type: bug label and removed the status: waiting-for-triage label on Oct 19, 2018

mp911de commented Oct 19, 2018

Thanks for the update. ElastiCache indeed has no publicly discoverable endpoints. INFO REPLICATION returns the internal addresses, which are typically not reachable. There is also no reconfiguration signal that we could consume from Redis or a topology source.

IIRC there is a way to listen for changes via AWS SNS, but I'm not very involved with AWS.


jocull commented Oct 19, 2018

For now, I'm going to test the connection state before marking it as active. I'm not sure if this is exactly the right strategy, but essentially it's this:

  1. Make the StatefulRedisMasterSlaveConnection instance
  2. Set the connection to ReadFrom.MASTER
  3. Call ECHO to ensure that we can reach a master node
  4. If that works, we go back to ReadFrom.MASTER_PREFERRED and mark the connection instance as valid (see the sketch below)
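
A minimal sketch of that validation workflow, assuming the imports from the reproduction snippet above plus io.lettuce.core.RedisException; ECHO is just a cheap command that must reach the master:

// Steps 1-4: connect, force reads to the master, probe it, then relax the read preference.
StatefulRedisMasterSlaveConnection<String, String> conn =
        MasterSlave.connect(redisClient, new Utf8StringCodec(), redisUri);   // step 1
conn.setReadFrom(ReadFrom.MASTER);                                           // step 2
try {
    conn.sync().echo("health-check");                                        // step 3: throws RedisException if no master is known
    conn.setReadFrom(ReadFrom.MASTER_PREFERRED);                             // step 4: mark as valid
} catch (RedisException e) {
    conn.close();   // don't keep a connection that can never reach a master
    throw e;
}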

The second thing I am attempting is creating another StatefulRedisMasterSlaveConnection instance at a regular interval and swapping them out (paying close attention to thread safety and all that) so that we get a topology refresh from time to time like the clustered connections do.

Again, I'm not sure if that's the right way to approach this, but that is the best way I can see to do it on our end 😄
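
A rough sketch of that periodic-swap idea, assuming connectValidated(...) is the hypothetical helper performing the four-step check above and that callers always obtain the current connection via active.get(); imports from java.util.concurrent and java.util.concurrent.atomic:

// Rebuild the master/slave connection on a schedule so the topology is re-discovered,
// loosely emulating the periodic refresh of the clustered client.
AtomicReference<StatefulRedisMasterSlaveConnection<String, String>> active =
        new AtomicReference<>(connectValidated(redisClient, redisUri));

ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(() -> {
    try {
        StatefulRedisMasterSlaveConnection<String, String> fresh = connectValidated(redisClient, redisUri);
        StatefulRedisMasterSlaveConnection<String, String> old = active.getAndSet(fresh);
        old.close();   // closing may interrupt in-flight commands; real code would need a grace period
    } catch (RedisException e) {
        // keep the previous connection if the refresh fails and try again next interval
    }
}, 5, 5, TimeUnit.MINUTES);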


mp911de commented Nov 7, 2018

The above-mentioned workflow (testing the connection) seems about right.

I tried to reproduce a half-failed state. When the specified endpoint does not respond (DEBUG SLEEP), connection creation fails. When the master node does not respond in requestPing(…), requests to the master node fail with io.lettuce.core.RedisException: Master is currently unknown.

Both work as they should. MasterSlave Javadoc now explains the actual behavior.

mp911de added the type: documentation label and removed the type: bug label on Nov 7, 2018

mp911de commented Nov 7, 2018

Closing this ticket as there's nothing left to do here.

mp911de closed this as completed on Nov 7, 2018
mp911de added this to the 5.1.3 milestone on Nov 7, 2018
mp911de changed the title from "MasterSlaveTopologyRefresh can fail to discover any valid nodes and be left in a state that can't ever connect" to "Document MasterSlave connection behavior on partial node failures" on Nov 7, 2018

jocull commented Nov 7, 2018

Just to be clear, did you start the DEBUG SLEEP before attempting the first call to the connection? To trigger the error properly you must stall the connections in the middle of the connection creation process. That is to say:

  1. Set a breakpoint, thread sleep, or something right before requestPing(...)
  2. When the pause is hit, run DEBUG SLEEP on Redis so the nodes will fail to respond during the ping cycle only
  3. Continue execution - allow the requestPing to fail, leaving the persisted list of nodes completely empty.
  4. Attempt to use the master/slave connection instance -- it should always fail from this point forward

I really just want to make sure we're testing this the same way. I can possibly stitch a sample case together if it is too difficult to reproduce by hand.


mp911de commented Nov 7, 2018

Yes, I tried it with various scenarios (timeout 1 second):

  • DEBUG SLEEP 100 before connect
  • Breakpoint in requestPing, then DEBUG SLEEP 100, then continue (tried for both the master and the slave node)
