Document MasterSlave connection behavior on partial node failures #894

Closed
jocull opened this issue Oct 18, 2018 · 8 comments
Labels: type: documentation (A documentation update)


jocull commented Oct 18, 2018

Bug Report

Current Behavior

Connecting to master/slave setups using autodiscovery can half-fail and leave the list of nodes empty, with no chance to recover. The problem is that creating the StatefulRedisMasterSlaveConnection will succeed, but every attempt to use it will fail. The error produced is at io.lettuce.core.masterslave.MasterSlaveConnectionProvider#getMaster.

I believe this happens because the initial topology refresh can fail. The first connection to the master may be fine -- so no exception is thrown at that point. But the refresh seems to happen out-of-band and appears to ping each discovered Redis node at io.lettuce.core.masterslave.MasterSlaveTopologyRefresh#getNodes and, more specifically, io.lettuce.core.masterslave.Connections#requestPing.

To reproduce this you can set a breakpoint (or a pause of your choice) at the call to requestPing and send a DEBUG SLEEP [some interval] to Redis - preferably the master node. After this you can see the knownNodes list refresh, and if you're in a single master environment it will set itself to an empty list. If you're in a master/slave environment, any node that fails to ping at this exact point in time will be removed from the node list and not contacted again. I think this is probably by design, but it unfortunately produces at best unexpected results and at worst connections that can't ever work.
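
For illustration, a rough sketch of the "manual chaos" part, assuming a separate standalone connection to the master (debugSleep on the sync API is used to issue DEBUG SLEEP); host and port are placeholders:

import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.api.StatefulRedisConnection;

// While MasterSlave.connect(...) is paused at the requestPing breakpoint, run this from a
// second process (or simply `redis-cli DEBUG SLEEP 30`) so the ping cycle times out and the
// node is dropped from knownNodes.
RedisClient debugClient = RedisClient.create(RedisURI.create("redis-master-host", 6379));
try (StatefulRedisConnection<String, String> debugConn = debugClient.connect()) {
    debugConn.sync().debugSleep(30); // blocks the Redis server for 30 seconds
} finally {
    debugClient.shutdown();
}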

From here I don't think it can recover. The instance of the StatefulRedisMasterSlaveConnection stays alive, but it will throw every time you try to use it because there are no nodes in the list and the topology will never refresh. This makes it hard to detect.

Input Code

  • Java; this is just a basic connection - you must interrupt it as described above and introduce some manual chaos.
import java.util.Properties;

import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.Utf8StringCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;
import io.lettuce.core.resource.DirContextDnsResolver;

RedisURI redisUri = buildRedisUri(host, port, useSsl, password); // buildRedisUri is our own helper
ClientResources redisClientResources = DefaultClientResources.builder()
        .dnsResolver(new DirContextDnsResolver(true, false, new Properties()))
        .build();

RedisClient redisClient = RedisClient.create(redisClientResources, redisUri);

StatefulRedisMasterSlaveConnection<String, String> conn = MasterSlave.connect(redisClient, new Utf8StringCodec(), redisUri);
conn.setReadFrom(ReadFrom.MASTER_PREFERRED);

// !!! Throws at `io.lettuce.core.masterslave.MasterSlaveConnectionProvider.getMaster`
//               `throw new RedisException(String.format("Master is currently unknown: %s", knownNodes));`
String result = conn.sync().get("MyKey");

Expected behavior/code

I expected that if the nodes were unreachable, the connection would either:

  1. Fail to connect and throw an exception (we can catch this on our end and retry) or...
  2. Attempt to heal itself in the future, via a mechanism like periodic refresh (such as clustered mode does)

Environment

  • Lettuce version(s): 5.1.1.RELEASE
  • Redis version: 4.0.10

Possible Solution

I would suggest that the list of nodes initially discovered be cached somewhere, and retries be attempted later. Clustered clients do this with periodic and adaptive refreshes. I think reusing the initial discovery this way in non-clustered setups would produce more reliable long-term results and give the connection a chance to recover from this particular failure state.

Another solution would be to simply throw an error if no nodes are discovered, so we can catch it before the StatefulRedisMasterSlaveConnection instance is kept (just as a normal connection failure would). It may also be worthwhile to expose the current list of nodes from the connection so it can be inspected for issues (e.g. discovering only slave nodes, which may not be suitable for all cases).
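
As a caller-side illustration of pinning a known node list (rather than the internal caching suggested above): MasterSlave.connect also accepts a fixed collection of node URIs, so the node set comes from the list instead of being discovered from a single seed endpoint. The URIs below are placeholders; imports as in the reproduction snippet above, plus java.util.Arrays and java.util.List. Note that with ElastiCache the individual node addresses are typically not reachable, as discussed later in this thread.

// Hypothetical sketch: supply the node list statically instead of relying on autodiscovery.
List<RedisURI> staticNodes = Arrays.asList(
        RedisURI.create("redis-master-host", 6379),
        RedisURI.create("redis-replica-host", 6379));
StatefulRedisMasterSlaveConnection<String, String> staticConn =
        MasterSlave.connect(redisClient, new Utf8StringCodec(), staticNodes);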

Additional context

I hope this helps! Please let me know if I have left out any important details or if there is a workaround or expected behavior I'm clearly missing. I may be able to take suggestions and make them into a PR.


mp911de commented Oct 19, 2018

Thanks for this extensive bug report. Static configuration (without Sentinel) is meant to remain static, so we won't provide any means of an automated refresh. For Redis Cluster and Sentinel, we have at least a notion of a topology source that we can query regularly; Sentinel and Cluster give us some reliability about where to find that topology source.

For Master/Replica, that's different as changes to the static setup are not tied to any rules.

Regarding your suggestions:

  1. Fail to connect and throw an exception (we can catch this on our end and retry) or...
  2. Attempt to heal itself in the future, via a mechanism like periodic refresh (such as clustered mode does)

The assumption here is that StatefulRedisMasterSlaveConnection requires a master node to be functional. What MasterSlave currently guarantees is that it's able to discover at least one working node. Changing this guarantee would require a running master node. That can be a good thing or a bad thing; it pretty much depends on what you're trying to do.

We quite often face the requirement to allow connections to a partially-down Redis setup. Changing our behavior would raise the bar in terms of required Redis availability.

I would suggest that the list of nodes initially discovered be cached somewhere, and retries be attempted later.

This approach could work for a few users, but not for all. Some users might not want this behavior.


jocull commented Oct 19, 2018

A couple extra details - in case they help 🙂

  1. We're using ElastiCache, which only provides one endpoint URL for the entire replication group, regardless of how many masters or replicas there are.
  2. We would need a way to fail if a master node wasn't discovered. We really need a master node, so a sole replica won't work (it would be better to fail until the replica is promoted). But I'm not sure how to do this, since the list of discovered nodes is private as far as I can tell.

What MasterSlave currently guarantees is that it's able to discover at least one working node.

  1. Part of the issue, I think, is that this guarantee is failing: if timeouts happen at exactly the wrong time, no working node is discovered. Just wanted to make sure that was clear 😄

mp911de added the type: bug label and removed the status: waiting-for-triage label on Oct 19, 2018

mp911de commented Oct 19, 2018

Thanks for the update. ElastiCache indeed has no publicly discoverable endpoints. INFO REPLICATION returns the internal addresses, which are typically not reachable. There is also no reconfiguration signal that we could consume from Redis or a topology source.

IIRC there is a way to listen for changes via AWS SNS, but I'm not very involved with AWS.


jocull commented Oct 19, 2018

For now, I'm going to test the connection state before marking it as active. I'm not sure if this is exactly the right strategy, but essentially it's this:

  1. Make the StatefulRedisMasterSlaveConnection instance
  2. Set the connection to ReadFrom.MASTER
  3. Call ECHO to ensure that we can reach a master node
  4. If that works, we go back to ReadFrom.MASTER_PREFERRED and mark the connection instance as valid (see the sketch below)
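
A minimal sketch of that validation workflow, assuming the imports from the reproduction snippet above plus io.lettuce.core.RedisException; ECHO is just a cheap command that must reach the master:

// Steps 1-4: connect, force reads to the master, probe it, then relax the read preference.
StatefulRedisMasterSlaveConnection<String, String> conn =
        MasterSlave.connect(redisClient, new Utf8StringCodec(), redisUri);   // step 1
conn.setReadFrom(ReadFrom.MASTER);                                           // step 2
try {
    conn.sync().echo("health-check");                                        // step 3: throws RedisException if no master is known
    conn.setReadFrom(ReadFrom.MASTER_PREFERRED);                             // step 4: mark as valid
} catch (RedisException e) {
    conn.close();   // don't keep a connection that can never reach a master
    throw e;
}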

The second thing I am attempting is creating another StatefulRedisMasterSlaveConnection instance at a regular interval and swapping them out (paying close attention to thread safety and all that) so that we get a topology refresh from time to time like the clustered connections do.

Again, I'm not sure if that's the right way to approach this, but that is the best way I can see to do it on our end 😄
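
A rough sketch of that periodic-swap idea, assuming connectValidated(...) is the hypothetical helper performing the four-step check above and that callers always obtain the current connection via active.get(); imports from java.util.concurrent and java.util.concurrent.atomic:

// Rebuild the master/slave connection on a schedule so the topology is re-discovered,
// loosely emulating the periodic refresh of the clustered client.
AtomicReference<StatefulRedisMasterSlaveConnection<String, String>> active =
        new AtomicReference<>(connectValidated(redisClient, redisUri));

ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(() -> {
    try {
        StatefulRedisMasterSlaveConnection<String, String> fresh = connectValidated(redisClient, redisUri);
        StatefulRedisMasterSlaveConnection<String, String> old = active.getAndSet(fresh);
        old.close();   // closing may interrupt in-flight commands; real code would need a grace period
    } catch (RedisException e) {
        // keep the previous connection if the refresh fails and try again next interval
    }
}, 5, 5, TimeUnit.MINUTES);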


mp911de commented Nov 7, 2018

The above-mentioned workflow (testing the connection) seems about right.

I tried to reproduce a half-failed state. When the specified endpoint does not respond (DEBUG SLEEP), connection creation fails. When the master node does not respond in requestPing(…), requests to the master node fail with io.lettuce.core.RedisException: Master is currently unknown.

Both work as they should. MasterSlave Javadoc now explains the actual behavior.

mp911de added the type: documentation label and removed the type: bug label on Nov 7, 2018

mp911de commented Nov 7, 2018

Closing this ticket as there's nothing left to do here.

mp911de closed this as completed on Nov 7, 2018
mp911de added this to the 5.1.3 milestone on Nov 7, 2018
mp911de changed the title from "MasterSlaveTopologyRefresh can fail to discover any valid nodes and be left in a state that can't ever connect" to "Document MasterSlave connection behavior on partial node failures" on Nov 7, 2018

jocull commented Nov 7, 2018

Just to be clear, did you start the DEBUG SLEEP before attempting the first call to the connection? To trigger the error properly you must stall the connections in the middle of the connection creation process. That is to say:

  1. Set a breakpoint, thread sleep, or something right before requestPing(...)
  2. When the pause is hit, run DEBUG SLEEP on Redis so the nodes will fail to respond during the ping cycle only
  3. Continue execution - allow the requestPing to fail, leaving the persisted list of nodes completely empty.
  4. Attempt to use the master/slave connection instance -- it should always fail from this point forward

I really just want to make sure we're testing this the same way. I can possibly stitch a sample case together if it is too difficult to reproduce by hand.


mp911de commented Nov 7, 2018

Yes, I tried it with various scenarios (timeout 1 second):

  • DEBUG SLEEP 100 before connect
  • Breakpoint in requestPing, then DEBUG SLEEP 100, then continue (tried for both the master and the slave node)
