Set node to unknown state if timeout #159

Closed
Matansegal wants to merge 2 commits

Conversation

Matansegal
Contributor

Regarding #158
This temporary solution sets a timed-out node to the unknown state.
However, I don't think there is a mechanism to set it back to UP once the node is reachable again. If that is the case, there could be a scenario where all nodes end up marked unknown and ignored.

@@ -35,7 +35,11 @@ pub async fn send_envelope<T: CdrsTransport + 'static, CM: ConnectionManager<T>
}
}
},
Err(error) => return Some(Err(error)),
Err(error) => {
eprintln!("Trasport error: {:?}; for node {:?}", error, node);
krojew (Owner)
Can you change it to warn!() or error!()?
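For reference, a minimal sketch of the requested change, assuming the log crate's error! macro (tracing's has the same call shape) is already imported in this module; this is not the final code:

Err(error) => {
    // same message as before, but routed through the logging facade instead of
    // stderr, so it respects the application's log level and formatting
    error!("Transport error: {:?}; for node {:?}", error, node);
    node.set_state_to_unknown();
    continue 'next_node;
}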

Err(error) => {
eprintln!("Trasport error: {:?}; for node {:?}", error, node);
node.set_state_to_unknown();
continue 'next_node;
krojew (Owner)

Jumping to a new node will result in the error being lost.

Matansegal (Contributor Author)

But otherwise the driver will still consider this node healthy and UP.
We should find a way to temporarily mark the node down or unknown and retry it after a certain amount of time or number of requests, so we don't hit a dead node again and again.
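A minimal sketch of the "retry after a number of requests" idea, using only hypothetical names (cdrs-tokio has no NodeCooldown type; this is just an illustration):

// Hypothetical sketch of a request-count cooldown, not cdrs-tokio API:
// a node that timed out is skipped until N further requests have gone by,
// after which it is offered one attempt again.
use std::sync::atomic::{AtomicU64, Ordering};

const RETRY_AFTER_REQUESTS: u64 = 100;

struct NodeCooldown {
    // value of the global request counter when the node last timed out;
    // u64::MAX means "never timed out"
    timed_out_at: AtomicU64,
}

impl NodeCooldown {
    fn new() -> Self {
        Self { timed_out_at: AtomicU64::new(u64::MAX) }
    }

    fn mark_timed_out(&self, current_request: u64) {
        self.timed_out_at.store(current_request, Ordering::Relaxed);
    }

    // true when the node was never marked, or enough requests have passed since the timeout
    fn usable(&self, current_request: u64) -> bool {
        let marked = self.timed_out_at.load(Ordering::Relaxed);
        marked == u64::MAX || current_request.saturating_sub(marked) >= RETRY_AFTER_REQUESTS
    }
}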

krojew (Owner)

That's the core issue: if we run into a scenario where the node is in fact up but unreachable by the clients, we'll never know when it becomes available, since there will never be any cluster event about it. In such a case we should retry the connection at some point. This, unfortunately, means there are conflicting requirements. I'll find out how the Java driver handles this situation.

Matansegal (Contributor Author)

The cpp driver handles this by setting the node temporarily down.
That can work in one of two ways:

  1. Setting the node to down when it times out, and every X amount of time resetting all down nodes back to up.
  2. Setting the node to down and, after X amount of time, checking in a different thread whether it is still down (a sketch of this approach follows below).
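A minimal sketch of the second approach, assuming a hypothetical NodeState type and a plain TCP connect as the health probe (neither is cdrs-tokio's actual API):

// Hypothetical sketch, not cdrs-tokio code: a background task that wakes up
// every `retry_interval`, probes nodes currently marked down, and flips them
// back to up when a TCP connection succeeds.
use std::net::SocketAddr;
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;
use tokio::net::TcpStream;

struct NodeState {
    addr: SocketAddr,
    is_down: AtomicBool,
}

async fn watch_down_nodes(nodes: Arc<Vec<NodeState>>, retry_interval: Duration) {
    loop {
        tokio::time::sleep(retry_interval).await;
        for node in nodes.iter().filter(|n| n.is_down.load(Ordering::Relaxed)) {
            // A plain TCP connect is used as a stand-in for a real health check
            // (a driver would more likely send an OPTIONS request or similar).
            if TcpStream::connect(node.addr).await.is_ok() {
                node.is_down.store(false, Ordering::Relaxed);
            }
        }
    }
}

Such a task could be spawned once per session, e.g. tokio::spawn(watch_down_nodes(nodes, Duration::from_secs(30))).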

krojew (Owner) commented Apr 11, 2023

I think simply setting the state to unknown is not enough. If we cannot establish a connection, that means either:

  • the node is actually down, which will result in a topology event or a full topology refresh
  • the node is up, but there's some intermittent problem

The second case is problematic, since there will be nothing to change the state back to up. This will result in the node being ignored forever. I think a better approach would be to keep the up state and try the next node, if available, or return the error otherwise. At some point the node will either be marked as down, or it will keep being skipped by jumping to the next one (this will increase latency, so a loud warning log should be present) until the intermittent problem clears up and the node becomes reachable again.
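A rough sketch of that approach, written against a caller-supplied send closure and the log crate's warn! macro rather than the driver's real types; it only illustrates "keep the state, skip the node, return the last error":

// Hypothetical sketch, not cdrs-tokio's real send_envelope: keep every node's
// state untouched on a transport error, log loudly, fall through to the next
// node in the query plan, and only return the error once all nodes have failed.
use std::future::Future;
use log::warn;

async fn try_each_node<N, F, Fut, T, E>(nodes: &[N], send: F) -> Result<T, E>
where
    N: std::fmt::Debug,
    F: Fn(&N) -> Fut,
    Fut: Future<Output = Result<T, E>>,
    E: std::fmt::Debug,
{
    let mut last_error = None;
    for node in nodes {
        match send(node).await {
            Ok(response) => return Ok(response),
            Err(error) => {
                // the node stays up; topology events decide when it really goes down
                warn!("Transport error: {:?}; for node {:?}, trying next node", error, node);
                last_error = Some(error);
            }
        }
    }
    // every node failed, so surface the last error instead of dropping it
    Err(last_error.expect("query plan should contain at least one node"))
}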

Matansegal (Contributor Author)

We ran hundreds of requests with one of the connections dead. The connection manager always considered that node up, and on average it hit the dead node as often as the other nodes, which caused us to lose all the responses routed to it.

When we marked it as down, I didn't see any mechanism that checks whether it can come back up again.

krojew closed this Apr 11, 2023
krojew (Owner) commented Apr 11, 2023

Closing due to internal changes to the reconnection mechanism.
