Set node to unknown state if timeout #159
Conversation
```diff
@@ -35,7 +35,11 @@ pub async fn send_envelope<T: CdrsTransport + 'static, CM: ConnectionManager<T>
             }
         }
     },
-    Err(error) => return Some(Err(error)),
+    Err(error) => {
+        eprintln!("Transport error: {:?}; for node {:?}", error, node);
```
Can you change it to warn!() or error!()?
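A sketch of the suggested change, assuming the crate already uses the `log` facade (the actual logging setup may differ):

```diff
-            eprintln!("Transport error: {:?}; for node {:?}", error, node);
+            // error! goes through the `log` facade, so library consumers
+            // control where and whether the message is emitted.
+            error!("Transport error: {:?}; for node {:?}", error, node);
```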
```rust
Err(error) => {
    eprintln!("Transport error: {:?}; for node {:?}", error, node);
    node.set_state_to_unknown();
    continue 'next_node;
```
Jumping to a new node will result in the error being lost.
But otherwise it will still think this node is good and UP.
We should find a way to temporarily set this node to down or unknown, and retry after a certain time or number of requests, in order not to hit a dead node again and again.
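The cool-down idea above could be sketched roughly as follows. This is a hypothetical illustration, not the driver's actual API: `TrackedNode`, `mark_down`, and `is_eligible` are invented names, and a real implementation would live inside the load-balancing/connection-manager state.

```rust
use std::time::{Duration, Instant};

// Hypothetical per-node state: a node hit by a transport error is marked
// down, and becomes eligible again once a cool-down period has elapsed.
#[derive(Debug)]
enum NodeState {
    Up,
    Down { since: Instant },
}

struct TrackedNode {
    state: NodeState,
    retry_after: Duration,
}

impl TrackedNode {
    fn new(retry_after: Duration) -> Self {
        Self { state: NodeState::Up, retry_after }
    }

    // Called when send_envelope hits a transport error on this node.
    fn mark_down(&mut self) {
        self.state = NodeState::Down { since: Instant::now() };
    }

    // A node is eligible if it is up, or has been down longer than the cool-down.
    fn is_eligible(&self) -> bool {
        match self.state {
            NodeState::Up => true,
            NodeState::Down { since } => since.elapsed() >= self.retry_after,
        }
    }
}

fn main() {
    let mut node = TrackedNode::new(Duration::from_millis(0));
    node.mark_down();
    // With a zero cool-down the node is immediately eligible for a retry.
    println!("eligible again: {}", node.is_eligible());
}
```

The "number of requests" variant would replace the `Instant` with a skip counter decremented on each routing decision.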
That's the core issue - if we run into a scenario where the node is in fact up, but unreachable by the clients, we'll never know when it becomes available, since there will never be any cluster event about it. In such a case we should retry the connection at some point. This, unfortunately, means there are conflicting requirements. I'll find out how the java driver handles such a situation.
The cpp driver handles it by setting the node temporarily down.
It can be done in two ways:
- setting the node to down when it times out and, every X amount of time, resetting all nodes;
- setting the node to down and, after X amount of time, checking in a different thread whether it is still down.
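The second approach above (a separate thread re-checking downed nodes) could be sketched like this. All names here (`NodeHealth`, `spawn_health_checker`, `probe`) are hypothetical, and a real driver would loop until shutdown and actually attempt a connection instead of the stub `probe`:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Hypothetical node health flag shared with a background checker thread.
struct NodeHealth {
    down: AtomicBool,
}

// Stand-in for a real connection attempt against the node.
fn probe(_node: &NodeHealth) -> bool {
    true
}

fn spawn_health_checker(node: Arc<NodeHealth>, interval: Duration) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // A real checker would loop until shutdown; one pass keeps the sketch finite.
        thread::sleep(interval);
        if node.down.load(Ordering::Relaxed) && probe(&node) {
            // The probe succeeded, so the node is marked up again.
            node.down.store(false, Ordering::Relaxed);
        }
    })
}

fn main() {
    let node = Arc::new(NodeHealth { down: AtomicBool::new(true) });
    let handle = spawn_health_checker(Arc::clone(&node), Duration::from_millis(10));
    handle.join().unwrap();
    println!("node down after check: {}", node.down.load(Ordering::Relaxed));
}
```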
I think simply setting the state to unknown is not enough. If we cannot establish a connection, that means either:
- the node is actually down, in which case a cluster event will eventually reflect its real state; or
- the node is up but unreachable by this client, in which case no cluster event about it will ever arrive.
The second case is problematic, since there will be nothing to change the state to up. This will result in the node being forever ignored. I think a better approach would be to keep the up state and try the next node, if available, or return the error otherwise. At some point the node will either be marked as down, or it will keep being skipped by jumping to the next one (this will increase latency, so a loud warning log should be present) until the intermittent problem clears up and the node becomes reachable.
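The "keep the up state, fall through to the next node, surface the error only when every node failed" approach could be sketched as below. This is an illustration only: `send_with_fallback` and `try_send` are invented stand-ins for the driver's real routing and transport code.

```rust
// Hypothetical sketch: on a transport error, warn loudly and fall through to
// the next node; return an error only when every node has been tried.
fn send_with_fallback(nodes: &[&str]) -> Result<String, String> {
    let mut last_err = None;
    for node in nodes {
        match try_send(node) {
            Ok(resp) => return Ok(resp),
            Err(e) => {
                // Loud warning: skipping a node increases latency.
                eprintln!("transport error on {node}: {e}; trying next node");
                last_err = Some(e);
            }
        }
    }
    Err(last_err.unwrap_or_else(|| "no nodes available".to_string()))
}

// Stand-in for the real send; the "dead" node simulates a dead connection.
fn try_send(node: &str) -> Result<String, String> {
    if node == "dead" {
        Err("connection refused".to_string())
    } else {
        Ok(format!("ok from {node}"))
    }
}

fn main() {
    let result = send_with_fallback(&["dead", "alive"]);
    println!("{}", result.unwrap());
}
```

Note that, unlike jumping to the next node with `continue 'next_node` alone, the last error is retained and returned when no node succeeds, so it is not lost.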
We tried hundreds of requests with one of the connections being dead. The connection manager always thought this node was up, and on average it hit the dead node as often as the other nodes, which caused us to lose all the responses sent to it. When we mark it as down, I didn't see any mechanism to check whether it can come back up again.
Closing due to internal changes to the reconnection mechanism.
Regarding #158.
This temporary solution sets the timed-out node to unknown.
However, I don't think there is a mechanism to set it back to UP when it is ready. If that is the case, there could be a scenario where all nodes are set to unknown and ignored.