Set node to unknown state if timeout #159

Closed
Matansegal wants to merge 2 commits

Conversation

Matansegal
Contributor

Regarding #158
This temporary solution sets a timed-out node to the unknown state.
However, I don't think there is a mechanism to set it back to UP once the node is reachable again. If that is the case, there could be a scenario where all nodes end up marked unknown and ignored.

@@ -35,7 +35,11 @@ pub async fn send_envelope<T: CdrsTransport + 'static, CM: ConnectionManager<T>
}
}
},
Err(error) => return Some(Err(error)),
Err(error) => {
eprintln!("Trasport error: {:?}; for node {:?}", error, node);
krojew (Owner)
Can you change it to warn!() or error!()?
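For reference, a minimal sketch of the requested change, assuming the log crate's error! macro (tracing's has the same call shape) is already imported in this module; this is not the final code:

Err(error) => {
    // same message as before, but routed through the logging facade instead of
    // stderr, so it respects the application's log level and formatting
    error!("Transport error: {:?}; for node {:?}", error, node);
    node.set_state_to_unknown();
    continue 'next_node;
}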

Err(error) => {
eprintln!("Trasport error: {:?}; for node {:?}", error, node);
node.set_state_to_unknown();
continue 'next_node;
krojew (Owner)

Jumping to a new node will result in the error being lost.

Matansegal (Contributor Author)

But otherwise the driver will still consider this node healthy and UP.
We should find a way to temporarily mark the node down or unknown and retry it after a certain amount of time or number of requests, so we don't hit a dead node again and again.
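A minimal sketch of the "retry after a number of requests" idea, using only hypothetical names (cdrs-tokio has no NodeCooldown type; this is just an illustration):

// Hypothetical sketch of a request-count cooldown, not cdrs-tokio API:
// a node that timed out is skipped until N further requests have gone by,
// after which it is offered one attempt again.
use std::sync::atomic::{AtomicU64, Ordering};

const RETRY_AFTER_REQUESTS: u64 = 100;

struct NodeCooldown {
    // value of the global request counter when the node last timed out;
    // u64::MAX means "never timed out"
    timed_out_at: AtomicU64,
}

impl NodeCooldown {
    fn new() -> Self {
        Self { timed_out_at: AtomicU64::new(u64::MAX) }
    }

    fn mark_timed_out(&self, current_request: u64) {
        self.timed_out_at.store(current_request, Ordering::Relaxed);
    }

    // true when the node was never marked, or enough requests have passed since the timeout
    fn usable(&self, current_request: u64) -> bool {
        let marked = self.timed_out_at.load(Ordering::Relaxed);
        marked == u64::MAX || current_request.saturating_sub(marked) >= RETRY_AFTER_REQUESTS
    }
}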

krojew (Owner)

That's the core issue: if we run into a scenario where the node is in fact up but unreachable by the clients, we'll never know when it becomes available, since there will never be any cluster event about it. In such a case we should retry the connection at some point. This, unfortunately, means there are conflicting requirements. I'll find out how the Java driver handles this situation.

Matansegal (Contributor Author)

The cpp driver handles this by setting the node temporarily down.
That can work in one of two ways:

  1. Setting the node to down when it times out, and every X amount of time resetting all down nodes back to up.
  2. Setting the node to down and, after X amount of time, checking in a different thread whether it is still down (a sketch of this approach follows below).
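A minimal sketch of the second approach, assuming a hypothetical NodeState type and a plain TCP connect as the health probe (neither is cdrs-tokio's actual API):

// Hypothetical sketch, not cdrs-tokio code: a background task that wakes up
// every `retry_interval`, probes nodes currently marked down, and flips them
// back to up when a TCP connection succeeds.
use std::net::SocketAddr;
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;
use tokio::net::TcpStream;

struct NodeState {
    addr: SocketAddr,
    is_down: AtomicBool,
}

async fn watch_down_nodes(nodes: Arc<Vec<NodeState>>, retry_interval: Duration) {
    loop {
        tokio::time::sleep(retry_interval).await;
        for node in nodes.iter().filter(|n| n.is_down.load(Ordering::Relaxed)) {
            // A plain TCP connect is used as a stand-in for a real health check
            // (a driver would more likely send an OPTIONS request or similar).
            if TcpStream::connect(node.addr).await.is_ok() {
                node.is_down.store(false, Ordering::Relaxed);
            }
        }
    }
}

Such a task could be spawned once per session, e.g. tokio::spawn(watch_down_nodes(nodes, Duration::from_secs(30))).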

krojew (Owner) commented Apr 11, 2023

I think simply setting the state to unknown is not enough. If we cannot establish a connection, that means either:

  • the node is actually down, which will result in a topology event or a full topology refresh
  • the node is up, but there's some intermittent problem

The second case is problematic, since there will be nothing to change the state back to up. This will result in the node being ignored forever. I think a better approach would be to keep the up state and try the next node, if available, or return the error otherwise. At some point the node will either be marked as down, or it will keep being skipped by jumping to the next one (this will increase latency, so a loud warning log should be present) until the intermittent problem clears up and the node becomes reachable again.
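A rough sketch of that approach, written against a caller-supplied send closure and the log crate's warn! macro rather than the driver's real types; it only illustrates "keep the state, skip the node, return the last error":

// Hypothetical sketch, not cdrs-tokio's real send_envelope: keep every node's
// state untouched on a transport error, log loudly, fall through to the next
// node in the query plan, and only return the error once all nodes have failed.
use std::future::Future;
use log::warn;

async fn try_each_node<N, F, Fut, T, E>(nodes: &[N], send: F) -> Result<T, E>
where
    N: std::fmt::Debug,
    F: Fn(&N) -> Fut,
    Fut: Future<Output = Result<T, E>>,
    E: std::fmt::Debug,
{
    let mut last_error = None;
    for node in nodes {
        match send(node).await {
            Ok(response) => return Ok(response),
            Err(error) => {
                // the node stays up; topology events decide when it really goes down
                warn!("Transport error: {:?}; for node {:?}, trying next node", error, node);
                last_error = Some(error);
            }
        }
    }
    // every node failed, so surface the last error instead of dropping it
    Err(last_error.expect("query plan should contain at least one node"))
}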

Matansegal (Contributor Author)

We ran hundreds of requests with one of the connections dead. The connection manager always considered that node up, and on average it hit the dead node as often as the other nodes, which caused us to lose all the responses routed to it.

When we marked it as down, I didn't see any mechanism that checks whether it can come back up again.

krojew closed this Apr 11, 2023
krojew (Owner) commented Apr 11, 2023

Closing due to internal changes to the reconnection mechanism.
