Improve resilience to failed packets on ordered channels #3540

Closed
2 of 5 tasks
Tracked by #3217
ljoss17 opened this issue Aug 15, 2023 · 6 comments · Fixed by #3610


ljoss17 commented Aug 15, 2023

Closes: #3695

Summary

There is a risk that Hermes enters a state where it fails to relay all packets on an ordered channel until packet clearing is triggered.
This can happen when a packet is not successfully relayed but the failure goes unnoticed by Hermes. Hermes will then try to relay the subsequent packets, but they will all fail due to a sequence number mismatch.

Hermes shouldn't blindly keep trying to relay further packets on an ordered channel once it has failed to relay one.
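For context, here is an illustrative sketch (not the actual ibc-go or Hermes code; all names are hypothetical) of why ordered channels behave this way: the receiving side enforces strict sequencing, so a single unrelayed packet blocks every packet after it.

```rust
/// Illustrative only: the receiving chain's ordered-channel check (per ICS 4).
/// Names and types are hypothetical, not the actual ibc-go or Hermes code.
struct OrderedChannelEnd {
    next_sequence_recv: u64,
}

#[derive(Debug)]
enum RecvError {
    /// The packet's sequence does not match the expected `next_sequence_recv`.
    SequenceMismatch { expected: u64, got: u64 },
}

impl OrderedChannelEnd {
    fn recv_packet(&mut self, packet_sequence: u64) -> Result<(), RecvError> {
        if packet_sequence != self.next_sequence_recv {
            // Once packet N is skipped, every packet N+1, N+2, ... fails here
            // until N is relayed (e.g. by Hermes' packet clearing).
            return Err(RecvError::SequenceMismatch {
                expected: self.next_sequence_recv,
                got: packet_sequence,
            });
        }
        self.next_sequence_recv += 1;
        Ok(())
    }
}
```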


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@ljoss17 ljoss17 added A: bug Admin: something isn't working O: reliability Objective: cause to improve trustworthiness and consistent performing labels Aug 15, 2023
@romac romac changed the title Relaying issue for ordered channels Improve resilience to failed packets on ordered channels Aug 15, 2023

romac commented Aug 15, 2023

@soareschen suggested that we could maintain a priority queue of packets to be relayed, ordered by sequence numbers.

As I see it, we would update the queue whenever we get a new event batch. After successfully submitting a tx, we would remove the packets we just submitted from the queue and move on to the remaining ones. Whenever we fail to do so, we can refresh the queue by querying for the packets yet to be relayed on the channel.

Does that make sense to you, @ljoss17 @soareschen?
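A rough sketch of how such a queue could look, assuming pending packets are keyed by sequence number (the `Packet` and `PendingQueue` names are illustrative, not the existing `relay_path.rs` types):

```rust
use std::collections::BTreeMap;

/// Illustrative sketch of the suggested structure: pending packets kept
/// ordered by sequence number. `Packet` stands in for the real packet type.
#[derive(Debug, Clone)]
struct Packet {
    sequence: u64,
    // ...payload, timeout, etc.
}

#[derive(Default)]
struct PendingQueue {
    by_sequence: BTreeMap<u64, Packet>,
}

impl PendingQueue {
    /// Called on every new event batch: insert the newly seen packets.
    fn extend(&mut self, batch: impl IntoIterator<Item = Packet>) {
        for p in batch {
            self.by_sequence.insert(p.sequence, p);
        }
    }

    /// Called after a tx is successfully submitted: drop the packets it carried.
    fn remove_submitted(&mut self, sequences: &[u64]) {
        for s in sequences {
            self.by_sequence.remove(s);
        }
    }

    /// Called when submission fails: rebuild the queue from a fresh
    /// "unreceived packets" query against the chain.
    fn refresh(&mut self, unreceived: Vec<Packet>) {
        self.by_sequence.clear();
        self.extend(unreceived);
    }

    /// Next packets to relay, lowest sequence first.
    fn next_batch(&self, max: usize) -> Vec<Packet> {
        self.by_sequence.values().take(max).cloned().collect()
    }
}
```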

@romac romac added this to the v1.7 milestone Aug 15, 2023
@romac romac added the A: critical Admin: critical or important label Aug 22, 2023
@seanchen1991 seanchen1991 self-assigned this Aug 31, 2023
@ancazamfir

Why can't we trigger/call packet clearing if needed for seq < n, before we process the event for n?

@romac romac modified the milestones: v1.7, v1.7.1 Oct 2, 2023

romac commented Oct 17, 2023

Quick notes from the call with @ljoss17 and @seanchen1991:

diff --git a/crates/relayer/src/link/relay_path.rs b/crates/relayer/src/link/relay_path.rs
index dc01a71f7..e4b74a1b6 100644
--- a/crates/relayer/src/link/relay_path.rs
+++ b/crates/relayer/src/link/relay_path.rs
@@ -673,10 +673,40 @@ impl<ChainA: ChainHandle, ChainB: ChainHandle> RelayPath<ChainA, ChainB> {
                 }
                 Err(LinkError(error::LinkErrorDetail::Send(e), _)) => {
                     // This error means we could retry
-                    error!("error {}", e.event);
+                    error!("error {}", e.event); // TODO: better message here
+
+                    // if on ordered channel, do a packet clearing, but only if we are not doing
+                    // a packet clearing already.
+                    if self.ordered_channel() && i == 0 && !odata.tracking_id.is_clearing() {
+                        // Need to specify height for clearing?
+                        // Probably not since no progress will have been made
+                        self.relay_pending_packets(None)?;
+                    }
+
+                    // fine up to: 112
+                    // next expected: 113
+                    // submitting: 120, 121
+                    //
+                    // a1) pending: 114, 115, 116, 117, 118, 119, 120, 121
+                    //    cleared: 114, 115, 116, 117, 118, 119, 120, 121
+                    //
+                    // a2) pending: 114, 115, 116, 117, 118, 119
+                    //     cleared: 114, 115, 116, 117, 118, 119
+                    //
+                    // b1) pending: 114, 115, 116, xxx, 118, 119
+                    //     cleared: 114, 115, 116 (?)
+                    //
+                    // b2) pending: 114, 115, 116, xxx, 118, 119
+                    //     cleared: none
+                    //
+                    // c)  pending: 114, 115, 116, 117, 118, 119
+                    //     cleared: none
+
                     if i + 1 == MAX_RETRIES {
                         error!("{}/{} retries exhausted. giving up", i + 1, MAX_RETRIES)
                     } else {
+                        // TODO: output current retry step here in debug!
+
                         // If we haven't exhausted all retries, regenerate the op. data & retry
                         match self.regenerate_operational_data(odata.clone()) {
                             None => return Ok(S::Reply::empty()), // Nothing to retry

@romac romac modified the milestones: v1.7.1, v1.8 Nov 8, 2023
@devashishdxt

Can #3695 also be solved as a part of this?


mmsqe commented Nov 20, 2023

> Can #3695 also be solved as a part of this?

Yes, it seems to be resolved with this change.

@romac romac modified the milestones: v1.8, v1.9 Jan 16, 2024

mmsqe commented Feb 17, 2024

It seems #3610 does not resolve this case: a timeout is not triggered if an "exceeds max gas" error is raised after submitting large txs, as tested here. In contrast, the Go relayer will send a MsgTimeout to the controller chain and then a MsgChannelCloseConfirm to the host chain after the max gas error.
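For illustration only, a minimal sketch of the message ordering described above (hypothetical types, not the Hermes or Go relayer API): on an ordered channel, a packet timeout closes the channel, so the relayer submits MsgTimeout on the controller chain and then MsgChannelCloseConfirm on the host chain.

```rust
/// Hypothetical message type for the sketch; not a real relayer API.
enum Msg {
    Timeout { sequence: u64 },
    ChannelCloseConfirm { channel_id: String },
}

/// Returns the (destination chain, message) pairs in submission order.
fn ordered_channel_timeout_msgs(
    timed_out_sequence: u64,
    channel_id: &str,
) -> Vec<(&'static str, Msg)> {
    vec![
        // 1. Prove the timeout on the controller chain; on an ordered channel
        //    this also closes the channel on that side.
        ("controller", Msg::Timeout { sequence: timed_out_sequence }),
        // 2. Confirm the close on the host chain.
        ("host", Msg::ChannelCloseConfirm { channel_id: channel_id.to_string() }),
    ]
}
```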
