Add mixnet retry mechanism #386

Merged · 58 commits into master · Sep 18, 2023

Conversation

al8n
Contributor

@al8n al8n commented Sep 11, 2023

This PR addresses #322 (comment). It introduces an ack mechanism to decide whether a packet should be retried or not.

youngjoon-lee and others added 30 commits August 21, 2023 22:52
* Add mixnet service and overwatch app

* remove #[tokio::main]

---------

Co-authored-by: Youngjoon Lee <taxihighway@gmail.com>
* add a connection pool
* move mixnet listening into separate task

* add exponential retry for insufficient peers in libp2p

* fix logging
* Fix MutexGuard across await

Holding a MutexGuard across an await point is not a good idea.
Removing that solves the issues we had with the mixnet test

* Make mixnode handle bodies coming from the same source concurrently (#372)

---------

Co-authored-by: Youngjoon Lee <taxihighway@gmail.com>
We now wait after the call to 'subscribe' to give the network
the time to register peers in the mesh before starting to
publish messages
Comment on lines 132 to 142
let arc_socket = mu.clone();
let mut socket = mu.lock().await;
let body = Body::new_sphinx(packet);
body.write(&mut *socket).await?;

tracing::debug!("Sent a Sphinx packet successfully to the node: {addr:?}");
let body = Body::SphinxPacket(packet);

if let Err(e) = body.write(&mut *socket).await {
    tokio::spawn(async move {
        mixnet_protocol::retry_backoff(addr, max_retries, retry_delay, body, arc_socket)
            .await;
    });
    return Err(e);
}
Contributor

I think we can simplify this code like a9be870, because

  • we don't need to spawn a task here.
  • this function shouldn't return Err if a retry succeeded.
  • this function should return Err only if all retries failed.

I think this rule should also be applied to the similar function in mixnet/node/src/lib.rs. Also, the behaviour of these functions is not consistent: the function in mixnet/node/src/lib.rs returns Ok even if the first try failed.
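A minimal sketch of the suggested control flow (illustrative only; the real retry_backoff lives in mixnet_protocol and its signature may differ, so the helper and parameter names here are assumptions):

use std::{error::Error, time::Duration};

// Hypothetical helper showing the rule above: retry inline (no spawned
// task), return Ok as soon as one attempt succeeds, and return Err only
// after every retry has failed. Assumes a tokio runtime for the sleep.
async fn send_with_retries<F, Fut>(
    max_retries: usize,
    retry_delay: Duration,
    mut attempt: F,
) -> Result<(), Box<dyn Error + Send + Sync + 'static>>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<(), Box<dyn Error + Send + Sync + 'static>>>,
{
    let mut last_err = None;
    for idx in 0..max_retries {
        match attempt().await {
            // A successful retry means the caller never sees an Err.
            Ok(()) => return Ok(()),
            Err(e) => {
                last_err = Some(e);
                // Wait before the next attempt (same linear backoff as the snippet above).
                tokio::time::sleep(retry_delay * (idx as u32)).await;
            }
        }
    }
    // All retries failed: propagate the last error to the caller.
    Err(last_err.expect("max_retries must be greater than zero"))
}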

Contributor Author

From my understanding, if we remove the spawn, we will block and retry the single message until it is sent successfully or max_retries is reached. Please correct me if I am wrong.

Comment on lines +131 to +134
// update the connection
if let Ok(tcp) = TcpStream::connect(peer_addr).await {
    *socket = tcp;
}
Contributor

I keep having a feeling that we need to adopt the message passing model instead of the locking model. We could do it in later PRs (maybe after freezing the first version of the testnet).

In the current model, this retry_backoff function can be executed by multiple async tasks that share the same TcpStream, so many TcpStream::connect(..) calls can run concurrently. That might not be a big issue, but it could sometimes cause "too many open files" errors or congestion. Even if not, only one of the reconnected connections will survive in the pool eventually.

In the message passing model, we can have a single worker per TcpStream, which exposes an MPSC channel to users and manages the socket writes and the lifecycle of the TcpStream. As I remember, Nym also uses this model (but I'm not sure). That would make the retry/reconnect mechanism much easier to implement.
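A rough sketch of that worker-per-connection idea (purely illustrative; the type and field names are assumptions, not code from this PR):

use tokio::{io::AsyncWriteExt, net::TcpStream, sync::mpsc};

// Hypothetical worker that owns a single TcpStream. Callers only hold the
// mpsc sender, so all writes are serialized through this task and
// reconnection happens in exactly one place.
struct ConnectionWorker {
    rx: mpsc::Receiver<Vec<u8>>,
    peer_addr: std::net::SocketAddr,
    socket: TcpStream,
}

impl ConnectionWorker {
    async fn run(mut self) {
        while let Some(bytes) = self.rx.recv().await {
            if self.socket.write_all(&bytes).await.is_err() {
                // The worker is the only owner of the socket, so this
                // reconnect cannot race with other tasks calling
                // TcpStream::connect for the same peer.
                if let Ok(tcp) = TcpStream::connect(self.peer_addr).await {
                    self.socket = tcp;
                    let _ = self.socket.write_all(&bytes).await;
                }
            }
        }
    }
}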

Contributor Author

I agree. Having a single worker per TcpStream that handles messages and communicates via channels is cleaner and more elegant.

Contributor

Agree with @youngjoon-lee; it's actually what is done pretty much everywhere else in the node.

Comment on lines 137 to 139
tracing::error!("Failed to send packet to {addr} with error: {e}. Retrying...");
return mixnet_protocol::retry_backoff(addr, max_retries, retry_delay, body, arc_socket)
    .await;
Contributor

Do we perhaps have to drop socket first before calling retry_backoff, as you did in the other place?

Contributor Author

Ah yes, thank you!

@al8n al8n changed the base branch from mixnet to master September 14, 2023 08:53
Contributor

@youngjoon-lee youngjoon-lee left a comment

It seems to work. It would be great if we had a test to check that the nomos network still works even if several mixnodes are down.

}
}
}

impl MixnetNodeConfig {
const fn default_connection_pool_size() -> usize {
Contributor

could they be consts?

Contributor Author

serde(default = "...") does not allow consts; we have to use a fn.
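For context, serde's default attribute takes a path to a function, which is why a const won't do; a rough sketch of the pattern (the field name and default value here are illustrative, not necessarily the PR's):

use serde::Deserialize;

#[derive(Deserialize)]
struct MixnetNodeConfig {
    // serde(default = "...") must name a function, not a const, hence the
    // const fn accessor instead of a plain constant.
    #[serde(default = "MixnetNodeConfig::default_connection_pool_size")]
    connection_pool_size: usize,
}

impl MixnetNodeConfig {
    const fn default_connection_pool_size() -> usize {
        256 // illustrative value, not necessarily the PR's default
    }
}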

| ErrorKind::Other => {}
_ => {
    // update the connection
    if let Ok(tcp) = TcpStream::connect(peer_addr).await {
Contributor

What's the rationale behind updating the connection upon receiving these errors? For example, Unsupported means the action can never succeed, so why retry doing the same thing as before?

Contributor Author

I am a little bit confused here. It seems that the code will not update the connection on those errors; do you mean we should return early when those errors happen?

ErrorKind::Unsupported
| ErrorKind::NotFound
| ErrorKind::PermissionDenied
| ErrorKind::Other => {}

Contributor

The fact that ErrorKind::Unsupported is specifically mentioned suggests we have some specific, appropriate handling for it, and given the nature of the error that would mean stopping the retries.
It's acceptable to retry if we don't want to differentiate errors, but if we do differentiate, then it only makes sense to do the right thing.

Contributor

Now we return early for ErrorKind::Other, why is that?

Contributor Author

Ah, I misunderstood. So we only need to return early for Unsupported here.

Contributor

My point was that if we explicitly mention an ErrorKind, then we need to handle it appropriately. Depending on the specific case, it could be that Other and Unsupported should be treated differently (like in this one).
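Concretely, the distinction could look something like this (a sketch; which ErrorKinds end up in each arm is a judgment call, not this PR's final code):

use std::io::ErrorKind;
use tokio::net::TcpStream;

// Illustrative handling: give up on errors that can never succeed on a
// retry, and only refresh the connection for errors that may be
// transient. Returns whether the caller should keep retrying.
async fn handle_write_error(
    kind: ErrorKind,
    peer_addr: std::net::SocketAddr,
    socket: &mut TcpStream,
) -> bool {
    match kind {
        // The operation can never succeed, so stop retrying early.
        ErrorKind::Unsupported => false,
        // Other errors may be caused by a broken connection: refresh it
        // and let the caller retry.
        _ => {
            if let Ok(tcp) = TcpStream::connect(peer_addr).await {
                *socket = tcp;
            }
            true
        }
    }
}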

Contributor Author

Makes sense. 👍

) -> Result<(), Box<dyn Error + Send + Sync + 'static>> {
    for idx in 0..max_retries {
        // backoff
        let wait = retry_delay * (idx as u32);
Contributor

we might want an exponential backoff
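For reference, an exponential variant of that wait would double the delay per attempt instead of scaling it linearly (a sketch; the cap is an assumption added to keep the delay bounded):

use std::time::Duration;

// Illustrative exponential backoff: retry_delay, 2x, 4x, ... capped at `cap`.
fn backoff_delay(retry_delay: Duration, attempt: u32, cap: Duration) -> Duration {
    // Clamp the shift so it cannot overflow for large attempt counts.
    let factor = 1u32 << attempt.min(16);
    retry_delay.saturating_mul(factor).min(cap)
}

// e.g. inside the retry loop shown above:
// let wait = backoff_delay(retry_delay, idx as u32, Duration::from_secs(30));
// tokio::time::sleep(wait).await;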

@al8n al8n requested a review from zeegomo September 18, 2023 09:57
Contributor

@zeegomo zeegomo left a comment

nit: "fix PR comment" is not a very helpful commit message

@al8n al8n merged commit 2429893 into master Sep 18, 2023
7 of 11 checks passed
@al8n al8n deleted the mixnet-retry branch September 18, 2023 10:27