
Event monitor crashes when the RPC node goes away #863

Closed · 5 tasks
ancazamfir opened this issue Apr 28, 2021 · 8 comments · Fixed by #895
Assignee: adizere
Labels: A: bug (Admin: something isn't working) · E: gravity (External: related to Gravity DEX) · I: logic (Internal: related to the relaying logic)
Milestone: 04.2021

Comments

@ancazamfir (Collaborator)

Crate

relayer

Summary of Bug

This was discovered by the DEX team during load testing; it happens on hermes multi-path startup.

Trace is:

    The application panicked (crashed).
    Message: called `Result::unwrap()` on an `Err` value: Error { code: ClientInternalError, message: "Client internal error", data: Some("failed to hear back from WebSocket driver") }
    Location: relayer/src/chain/cosmos.rs:378
    Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it. Run with RUST_BACKTRACE=full to include source snippets.
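
For context, the panic points at a bare `Result::unwrap()` on the event-subscription result: when the node goes away, the WebSocket driver dies and the resulting `Err` aborts the whole relayer process. A minimal, self-contained sketch of the pattern and the obvious fix; all names are stand-ins, not the actual cosmos.rs code:

    #[derive(Debug)]
    struct RpcError(String);

    // Stand-in for the call that fails when the node goes away.
    fn connect_ws(node_up: bool) -> Result<(), RpcError> {
        if node_up {
            Ok(())
        } else {
            Err(RpcError("failed to hear back from WebSocket driver".into()))
        }
    }

    // Before: `connect_ws(node_up).unwrap()` panics and takes the whole
    // process down. After: the error is propagated with `?` so the caller
    // can decide what to do with the affected chain.
    fn subscribe(node_up: bool) -> Result<(), RpcError> {
        connect_ws(node_up)?;
        Ok(())
    }

    fn main() {
        match subscribe(false) {
            Ok(()) => println!("subscribed"),
            Err(e) => eprintln!("chain unreachable, skipping its paths: {:?}", e),
        }
    }

Propagating the error gives the caller the choice discussed further down in this thread: exit with an error, or keep the unaffected paths running.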

Version

Steps to Reproduce

Acceptance Criteria


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@ancazamfir ancazamfir added this to the 04.2021 milestone Apr 28, 2021
@ancazamfir ancazamfir added this to To do in Relayer v0.3 via automation Apr 28, 2021
@adizere adizere added A: bug Admin: something isn't working I: logic Internal: related to the relaying logic labels Apr 29, 2021
@adizere adizere self-assigned this Apr 29, 2021
@nodebreaker0-0

Hello, I'm the bug reporter.

To reproduce this crash:

  1. Set up at least 3 chains.

  2. Assume they are chain-a, chain-b, and chain-c.

  3. Connect a and b, then connect a and c:
    hermes -c config.toml create channel chain-a chain-b --port-a transfer --port-b transfer -o unordered
    hermes -c config.toml create channel chain-a chain-c --port-a transfer --port-b transfer -o unordered

  4. Run Hermes with start-multi.

  5. Stop chain-c.

The crash will then be reproduced.

@andynog andynog added the E: gravity External: related to Gravity DEX label Apr 29, 2021
@adizere (Member) commented Apr 29, 2021

Thank you for the feedback, @nodebreaker0-0.

I would like to know what, in your opinion, the required behavior should be when hermes start-multi is running and one of the chains crashes. Once that chain is down, Hermes cannot continue relaying packets on any path involving it. I can imagine two options:

  1. The hermes relayer process exits with an error.
  2. The hermes relayer process continues operating on the paths that are still alive (paths with live chains at both ends).

Any thoughts?
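
For illustration, option 2 roughly corresponds to isolating each path in its own worker and treating a dead chain as a per-worker error rather than a process-level panic. A minimal sketch with std threads; the names and structure are assumptions, not the actual relayer code:

    use std::thread;

    // Stand-in for a per-path relay worker; a path whose chain is down
    // returns an Err instead of panicking the whole process.
    fn relay_path(path: &'static str, chain_alive: bool) -> Result<(), String> {
        if !chain_alive {
            return Err(format!("{}: RPC node unreachable", path));
        }
        println!("{}: relaying packets", path);
        Ok(())
    }

    fn main() {
        // chain-c is down, so only the a<->c path should fail.
        let paths = vec![("chain-a <-> chain-b", true), ("chain-a <-> chain-c", false)];

        let handles: Vec<_> = paths
            .into_iter()
            .map(|(path, alive)| thread::spawn(move || relay_path(path, alive)))
            .collect();

        // A failed path is logged and dropped; the process survives and the
        // remaining healthy paths keep relaying.
        for h in handles {
            match h.join() {
                Ok(Ok(())) => {}
                Ok(Err(e)) => eprintln!("path stopped: {}", e),
                Err(_) => eprintln!("worker panicked"),
            }
        }
    }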

@ancazamfir (Collaborator, Author)

I would say 2.
Also, it is possible that the chain itself is fine (it continues to produce blocks) but the full node "crashes". In that case, once the full node is operational again, would hermes under option 2 (and with your fix) be able to resume relaying over the paths spanning that chain?

@romac (Member) commented Apr 29, 2021

We could perhaps introduce a thread to monitor the state of all configured chains and trigger a reload/restart when a crashed node comes back online.
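
A minimal sketch of that idea, assuming a hypothetical is_healthy probe (a real implementation would query each node's RPC /status endpoint and treat timeouts as "down"):

    use std::{thread, time::Duration};

    // Hypothetical probe. For the sake of the example, chain-c is down
    // for the first two ticks and then recovers, to exercise both the
    // pause and the restart transitions.
    fn is_healthy(chain_id: &str, tick: u32) -> bool {
        chain_id != "chain-c" || tick >= 2
    }

    fn main() {
        let chains = ["chain-a", "chain-b", "chain-c"];
        let mut down: Vec<&str> = Vec::new();

        // Monitor loop: poll every tick; on a down->up transition,
        // trigger a reload of the workers for that chain's paths
        // (represented here by a log line).
        for tick in 0..3 {
            for &chain in chains.iter() {
                let healthy = is_healthy(chain, tick);
                if healthy && down.contains(&chain) {
                    down.retain(|c| *c != chain);
                    println!("{}: back online, restarting its workers", chain);
                } else if !healthy && !down.contains(&chain) {
                    down.push(chain);
                    println!("{}: unreachable, pausing its workers", chain);
                }
            }
            thread::sleep(Duration::from_secs(1));
        }
    }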

@ancazamfir (Collaborator, Author)

As an initial fix, maybe we do 2 without resuming relaying for the affected chain?

@gamarin2

I confirm that option 2 is much preferred.

@nodebreaker0-0

In my opinion, option 2 is appropriate.

The RPC connection to each chain needs to be handled independently.

For example:

Fundamental fix

  1. Connect a and b; connect a and c.

  2. Chain-c dies.

  3. All relay processes associated with chain-c are terminated.

  4. But the relay process for a and b must stay alive (see the sketch after this list).
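
A minimal sketch of that per-path teardown, with a worker registry keyed by the path's two chain ends; the handles and names are stand-ins, not the actual relayer types:

    use std::collections::HashMap;

    fn main() {
        // Worker registry keyed by the two chain ends of each path; the
        // string values stand in for real worker handles.
        let mut workers: HashMap<(&str, &str), &str> = HashMap::new();
        workers.insert(("chain-a", "chain-b"), "worker-ab");
        workers.insert(("chain-a", "chain-c"), "worker-ac");

        // chain-c dies: terminate every worker whose path touches it,
        // but leave the a<->b worker running.
        let dead = "chain-c";
        workers.retain(|&(a, b), worker| {
            let keep = a != dead && b != dead;
            if !keep {
                println!("shutting down {}", worker);
            }
            keep
        });

        println!("still relaying: {:?}", workers.keys().collect::<Vec<_>>());
    }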

If this turns out to be very complicated to develop, there is a temporary measure, sketched below:

Temporary measure

Allow multiple RPC nodes per chain in config.toml.

Then, if node 1 cannot serve RPC requests, Hermes can try to connect to node 2 (or node 3, node 4, and so on).
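
A minimal sketch of that failover, assuming a hypothetical per-chain list of RPC addresses (today's config.toml takes a single rpc_addr per chain, so the list itself would be new config surface):

    // `try_connect` is a hypothetical stand-in for the real RPC/WebSocket
    // connection attempt; in this example node1 always refuses.
    fn try_connect(addr: &str) -> Result<String, String> {
        if addr.contains("node1") {
            Err(format!("{}: connection refused", addr))
        } else {
            Ok(format!("connected to {}", addr))
        }
    }

    // Walk the configured endpoints in order; return the first that works.
    fn connect_with_fallback(addrs: &[&str]) -> Result<String, String> {
        let mut last_err = String::from("no RPC addresses configured");
        for addr in addrs {
            match try_connect(addr) {
                Ok(conn) => return Ok(conn),
                Err(e) => {
                    eprintln!("{}; trying next endpoint", e);
                    last_err = e;
                }
            }
        }
        Err(last_err)
    }

    fn main() {
        let rpc_addrs = ["http://node1:26657", "http://node2:26657"];
        match connect_with_fallback(&rpc_addrs) {
            Ok(c) => println!("{}", c),
            Err(e) => eprintln!("all endpoints failed: {}", e),
        }
    }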

In conclusion, however, the fundamental problem must be solved.

@ancazamfir (Collaborator, Author)

What are the minimum requirements for recovery when the full node comes back? Is a manual restart ok, similar to when a new chain is added? Or do we need something like what @romac proposes in #863 (comment)?

@adizere adizere moved this from To do to In progress in Relayer v0.3 Apr 30, 2021
Relayer v0.3 automation moved this from In progress to Done May 6, 2021