Add mixnet retry mechanism #386

Merged · 58 commits into master · Sep 18, 2023

Conversation

al8n
Contributor

@al8n al8n commented Sep 11, 2023

This PR addresses #322 (comment). It introduces an ack mechanism to decide whether a packet should be retried or not.

youngjoon-lee and others added 30 commits August 21, 2023 22:52
* Add mixnet service and overwatch app

* remove #[tokio::main]

---------

Co-authored-by: Youngjoon Lee <taxihighway@gmail.com>
* add a connection pool
* move mixnet listening into separate task

* add exponential retry for insufficient peers in libp2p

* fix logging
* Fix MutexGuard across await

Holding a MutexGuard across an await point is not a good idea.
Removing that solves the issues we had with the mixnet test

* Make mixnode handle bodies coming from the same source concurrently (#372)

---------

Co-authored-by: Youngjoon Lee <taxihighway@gmail.com>
We now wait after the call to 'subscribe' to give the network
the time to register peers in the mesh before starting to
publish messages
Comment on lines 132 to 142
let arc_socket = mu.clone();
let mut socket = mu.lock().await;
let body = Body::new_sphinx(packet);
body.write(&mut *socket).await?;

tracing::debug!("Sent a Sphinx packet successfully to the node: {addr:?}");
let body = Body::SphinxPacket(packet);

if let Err(e) = body.write(&mut *socket).await {
    tokio::spawn(async move {
        mixnet_protocol::retry_backoff(addr, max_retries, retry_delay, body, arc_socket)
            .await;
    });
    return Err(e);
}
Contributor

I think we can simplify this code like a9be870, because

  • we don't need to spawn a task here.
  • this function shouldn't return Err if a retry succeeded.
  • this function should return Err only if all retries failed.

I think this rule should also be applied to the similar function in mixnet/node/src/lib.rs. Also, the behaviour of these functions is not consistent: the function in mixnet/node/src/lib.rs returns Ok even if the first try failed.
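A minimal sketch of the suggested control flow (illustrative only; the real retry_backoff lives in mixnet_protocol and its signature may differ, so the helper and parameter names here are assumptions):

use std::{error::Error, time::Duration};

// Hypothetical helper showing the rule above: retry inline (no spawned
// task), return Ok as soon as one attempt succeeds, and return Err only
// after every retry has failed. Assumes a tokio runtime for the sleep.
async fn send_with_retries<F, Fut>(
    max_retries: usize,
    retry_delay: Duration,
    mut attempt: F,
) -> Result<(), Box<dyn Error + Send + Sync + 'static>>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<(), Box<dyn Error + Send + Sync + 'static>>>,
{
    let mut last_err = None;
    for idx in 0..max_retries {
        match attempt().await {
            // A successful retry means the caller never sees an Err.
            Ok(()) => return Ok(()),
            Err(e) => {
                last_err = Some(e);
                // Wait before the next attempt (same linear backoff as the snippet above).
                tokio::time::sleep(retry_delay * (idx as u32)).await;
            }
        }
    }
    // All retries failed: propagate the last error to the caller.
    Err(last_err.expect("max_retries must be greater than zero"))
}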

Contributor Author

From my understanding, if we remove the spawn, we will block and retry the single message until it is sent successfully or max_retries is reached. Please correct me if I am wrong.

Comment on lines +131 to +134
// update the connection
if let Ok(tcp) = TcpStream::connect(peer_addr).await {
    *socket = tcp;
}
Contributor

I keep having a feeling that we need to adopt the message passing model instead of the locking model. We could do it in later PRs (maybe after freezing the first version of the testnet).

In the current model, this retry_backoff function can be executed by multiple async tasks that share the same TcpStream, so many TcpStream::connect(..) calls can run concurrently. That might not be a big issue, but it could sometimes cause "too many open files" errors or congestion. Even if not, only one of the reconnected connections will survive in the pool eventually.

In the message passing model, we can have a single worker per TcpStream, which exposes an MPSC channel to users and manages the socket writes and the lifecycle of the TcpStream. As I remember, Nym also uses this model (but I'm not sure). That would make the retry/reconnect mechanism much easier to implement.
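A rough sketch of that worker-per-connection idea (purely illustrative; the type and field names are assumptions, not code from this PR):

use tokio::{io::AsyncWriteExt, net::TcpStream, sync::mpsc};

// Hypothetical worker that owns a single TcpStream. Callers only hold the
// mpsc sender, so all writes are serialized through this task and
// reconnection happens in exactly one place.
struct ConnectionWorker {
    rx: mpsc::Receiver<Vec<u8>>,
    peer_addr: std::net::SocketAddr,
    socket: TcpStream,
}

impl ConnectionWorker {
    async fn run(mut self) {
        while let Some(bytes) = self.rx.recv().await {
            if self.socket.write_all(&bytes).await.is_err() {
                // The worker is the only owner of the socket, so this
                // reconnect cannot race with other tasks calling
                // TcpStream::connect for the same peer.
                if let Ok(tcp) = TcpStream::connect(self.peer_addr).await {
                    self.socket = tcp;
                    let _ = self.socket.write_all(&bytes).await;
                }
            }
        }
    }
}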

Contributor Author

I agree. Having a single worker per TcpStream that handles messages and communicates via channels is cleaner and more elegant.

Contributor

Agree with @youngjoon-lee; it's actually what is done pretty much everywhere else in the node.

Comment on lines 137 to 139
tracing::error!("Failed to send packet to {addr} with error: {e}. Retrying...");
return mixnet_protocol::retry_backoff(addr, max_retries, retry_delay, body, arc_socket)
    .await;
Contributor

Do we perhaps have to drop socket first before calling retry_backoff, as you did in the other place?

Contributor Author

Ah yes, thank you!

@al8n al8n changed the base branch from mixnet to master September 14, 2023 08:53
Contributor

@youngjoon-lee youngjoon-lee left a comment

It seems to work. It would be great if we had a test to check that the nomos network still works even if several mixnodes are down.

}
}
}

impl MixnetNodeConfig {
const fn default_connection_pool_size() -> usize {
Contributor

could they be consts?

Contributor Author

serde(default = "...") does not allow consts; we have to use a fn.
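For context, serde's default attribute takes a path to a function, which is why a const won't do; a rough sketch of the pattern (the field name and default value here are illustrative, not necessarily the PR's):

use serde::Deserialize;

#[derive(Deserialize)]
struct MixnetNodeConfig {
    // serde(default = "...") must name a function, not a const, hence the
    // const fn accessor instead of a plain constant.
    #[serde(default = "MixnetNodeConfig::default_connection_pool_size")]
    connection_pool_size: usize,
}

impl MixnetNodeConfig {
    const fn default_connection_pool_size() -> usize {
        256 // illustrative value, not necessarily the PR's default
    }
}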

| ErrorKind::Other => {}
_ => {
    // update the connection
    if let Ok(tcp) = TcpStream::connect(peer_addr).await {
Contributor

What's the rationale behind updating the connection upon receiving these errors? For example, Unsupported means the action can never succeed, so why retry doing the same thing as before?

Contributor Author

I am a little bit confused here. It seems that the code will not update the connection on those errors; do you mean we should return early when those errors happen?

ErrorKind::Unsupported
| ErrorKind::NotFound
| ErrorKind::PermissionDenied
| ErrorKind::Other => {}

Contributor

The fact that ErrorKind::Unsupported is specifically mentioned suggests we have some specific, appropriate handling for it, and given the nature of the error that would mean stopping the retries.
It's acceptable to retry if we don't want to differentiate errors, but if we do differentiate, then it only makes sense to do the right thing.

Contributor

Now we return early for ErrorKind::Other, why is that?

Contributor Author

Ah, I misunderstood. So we only need to return early for Unsupported here.

Contributor

My point was that if we explicitly mention an ErrorKind, then we need to handle it appropriately. Depending on the specific case, it could be that Other and Unsupported should be treated differently (like in this one).
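Concretely, the distinction could look something like this (a sketch; which ErrorKinds end up in each arm is a judgment call, not this PR's final code):

use std::io::ErrorKind;
use tokio::net::TcpStream;

// Illustrative handling: give up on errors that can never succeed on a
// retry, and only refresh the connection for errors that may be
// transient. Returns whether the caller should keep retrying.
async fn handle_write_error(
    kind: ErrorKind,
    peer_addr: std::net::SocketAddr,
    socket: &mut TcpStream,
) -> bool {
    match kind {
        // The operation can never succeed, so stop retrying early.
        ErrorKind::Unsupported => false,
        // Other errors may be caused by a broken connection: refresh it
        // and let the caller retry.
        _ => {
            if let Ok(tcp) = TcpStream::connect(peer_addr).await {
                *socket = tcp;
            }
            true
        }
    }
}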

Contributor Author

Makes sense. 👍

) -> Result<(), Box<dyn Error + Send + Sync + 'static>> {
    for idx in 0..max_retries {
        // backoff
        let wait = retry_delay * (idx as u32);
Contributor

we might want an exponential backoff
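For reference, an exponential variant of that wait would double the delay per attempt instead of scaling it linearly (a sketch; the cap is an assumption added to keep the delay bounded):

use std::time::Duration;

// Illustrative exponential backoff: retry_delay, 2x, 4x, ... capped at `cap`.
fn backoff_delay(retry_delay: Duration, attempt: u32, cap: Duration) -> Duration {
    // Clamp the shift so it cannot overflow for large attempt counts.
    let factor = 1u32 << attempt.min(16);
    retry_delay.saturating_mul(factor).min(cap)
}

// e.g. inside the retry loop shown above:
// let wait = backoff_delay(retry_delay, idx as u32, Duration::from_secs(30));
// tokio::time::sleep(wait).await;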

@al8n al8n requested a review from zeegomo September 18, 2023 09:57
Contributor

@zeegomo zeegomo left a comment

nit: "fix PR comment" is not a very helpful commit message

@al8n al8n merged commit 2429893 into master Sep 18, 2023
7 of 11 checks passed
@al8n al8n deleted the mixnet-retry branch September 18, 2023 10:27