
Explorer nodes can't catch up with latest blocks #3740

Closed
LeoHChen opened this issue Jun 7, 2021 · 11 comments
Labels: high priority (high priority issue with customer impact)

LeoHChen (Contributor) commented Jun 7, 2021

Describe the bug
Recently, with a higher volume of transactions happening on-chain, the explorer nodes, whether archival or non-archival, can't catch up with the latest blocks.

To Reproduce
We've seen this problem with build v6999-v4.0.0-66-g343dbe89 on mainnet.

The explorer nodes lag a few minutes behind the latest blocks, yet there is no sign of CPU/IO overload on the nodes. After I restarted the harmony process, the node caught up very quickly. So I don't think it is a resource issue; some logic is likely slow to catch up.

Expected behavior
Explorer nodes should be able to catch up with the latest block when there is no resource overload.

LeoHChen added the high priority label on Jun 7, 2021
LeoHChen pinned this issue on Jun 7, 2021
rlan35 (Contributor) commented Jun 7, 2021

On one explorer node I checked (35.81.82.117), the CPU is overloaded.
[screenshot attached]

rlan35 (Contributor) commented Jun 7, 2021

Another node (34.222.182.47) definitely isn't catching up with the latest block and keeps triggering the block sync logic to catch up (which has a 60s delay). It's very likely caused by too much CPU/memory load, where the blocks broadcast from consensus aren't processed in time (due to a missing parent block). So the lastMile block logic likely needs some optimization. @JackyWYX, you can check the log on 34.222.182.47; the issue is clear.

LeoHChen (Contributor, Author) commented Jun 8, 2021

@rlan35, 382% CPU is not overloaded; this node has 8 cores. You may use htop to take a look; the CPU load is under 50%.

LeoHChen (Contributor, Author) commented Jun 8, 2021

But I agree with you: I suspect the last-mile block sync should be a bit more aggressive in order to catch up, since we do have enough CPU power. Some parameters may need to be tuned in the last-mile catch-up. @JackyWYX

JackyWYX (Contributor) commented Jun 8, 2021

My current guess is as follows: the explorer node is heavy in p2p message handling, so the committed blocks handled at AddNewBlockForExplorer arrive out of order, which results in unknown-ancestor problems. This issue only affects explorer nodes; validator nodes already have a caching mechanism in the consensus module that handles this situation.

I will first try to confirm my guess, then apply the fix.
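
For illustration, a minimal Go sketch of the failure mode this guess describes; the Block type and function names are hypothetical stand-ins, not Harmony's actual code:

```go
package explorer

import "fmt"

// Block is an illustrative stand-in for the node's block type.
type Block struct {
	Number uint64
}

// insertForExplorer shows the suspected failure shape: a committed block
// that reaches the explorer before its parent has been inserted cannot be
// attached to the chain, so a strictly sequential insert must reject it.
func insertForExplorer(headNumber uint64, b *Block) error {
	if b.Number != headNumber+1 {
		// On a validator the consensus module caches such a block; the
		// explorer path has no cache, hence the unknown-ancestor error.
		return fmt.Errorf("unknown ancestor: block %d arrived while head is %d",
			b.Number, headNumber)
	}
	// ...normal chain insertion would happen here...
	return nil
}
```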

LeoHChen (Contributor, Author) commented Jun 8, 2021

Heavy p2p message load is reasonable, since those explorer nodes are also RPC endpoints: they receive a lot of RPC requests from users and send p2p messages to the network.

JackyWYX (Contributor) commented Jun 9, 2021

The issue was observed on one custom explorer node. It appears there is no caching mechanism for last-mile blocks on explorer nodes, so under heavy p2p message load the processing order of last-mile blocks is not guaranteed. I am currently working on a fix to add a cache to the explorer and sort the last-mile blocks before block insertion.
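
Reusing the Block stand-in from the sketch above, here is a minimal sketch of such a cache, assuming blocks are buffered by number and drained in strictly consecutive order (which subsumes sorting); all names are hypothetical, not the actual fix:

```go
package explorer

import "sync"

// lastMileCache buffers committed blocks that arrive out of order so they
// can be inserted strictly sequentially, avoiding unknown-ancestor errors.
type lastMileCache struct {
	mu      sync.Mutex
	pending map[uint64]*Block
}

func newLastMileCache() *lastMileCache {
	return &lastMileCache{pending: make(map[uint64]*Block)}
}

// Add stores a block until all of its predecessors have been inserted.
func (c *lastMileCache) Add(b *Block) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pending[b.Number] = b
}

// Drain removes and returns consecutive blocks starting at next, in order.
// A gap stops the drain, so insertion never skips a missing parent.
func (c *lastMileCache) Drain(next uint64) []*Block {
	c.mu.Lock()
	defer c.mu.Unlock()
	var out []*Block
	for {
		b, ok := c.pending[next]
		if !ok {
			break
		}
		out = append(out, b)
		delete(c.pending, next)
		next++
	}
	return out
}
```

On each incoming block, the handler would call Add, then Drain(head+1) and insert whatever comes back, so a late-arriving parent releases all of its buffered children at once.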

JackyWYX (Contributor) commented
After some more in-depth debugging, I found that the issue actually happens when writing explorer DB data.

On an explorer node, the Dump of explorer data takes 7~15s per block (api/service/explorer/storage.go:85). The reason is that the entire transaction history of each address is encoded and dumped into the DB on every block, which is far from efficient. I will focus on the following to try to fix it:

  1. Optimize the multi-threading logic with the current DB data structure to see how much that helps performance.
  2. Do not write an entry on every block; buffer several blocks and write once.
  3. Migrate the explorer DB data structure so we can use an iterator with the prefix addr_{OneAccr}_{txHash} (see the sketch after this list).
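
As a rough sketch of items 2 and 3, assuming a goleveldb-backed store (github.com/syndtr/goleveldb); the key layout follows the addr_{OneAccr}_{txHash} template above, while the function names and batching scheme are illustrative assumptions, not the actual implementation:

```go
package explorer

import (
	"fmt"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/util"
)

// txKey builds a per-transaction key, addr_{OneAccr}_{txHash}. One small
// record per tx means a new transaction appends one entry instead of
// re-encoding the address's whole history on every block.
func txKey(oneAddr, txHash string) []byte {
	return []byte(fmt.Sprintf("addr_%s_%s", oneAddr, txHash))
}

// flushBlocks writes the buffered entries of several blocks in a single
// batch (item 2), cutting per-block write amplification.
func flushBlocks(db *leveldb.DB, entries map[string][]byte) error {
	batch := new(leveldb.Batch)
	for k, v := range entries {
		batch.Put([]byte(k), v)
	}
	return db.Write(batch, nil)
}

// txHistory reads an address's transactions with a prefix scan (item 3)
// instead of decoding one monolithic history blob.
func txHistory(db *leveldb.DB, oneAddr string) ([][]byte, error) {
	iter := db.NewIterator(util.BytesPrefix(txKey(oneAddr, "")), nil)
	defer iter.Release()
	var keys [][]byte
	for iter.Next() {
		// Copy the key: the iterator reuses its buffers between calls.
		keys = append(keys, append([]byte{}, iter.Key()...))
	}
	return keys, iter.Error()
}
```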

LeoHChen (Contributor, Author) commented
Inspection of the system load on non-archival explorer nodes shows the disk IOPS can't keep up. So we will migrate the non-archival explorer nodes to m5d.2xlarge. Progress will be tracked in the following issue:
https://github.com/harmony-one/harmony-ops-priv/issues/37

LeoHChen (Contributor, Author) commented
We have upgraded the non-archival explorer nodes to m5d.2xlarge; iostat tps can now go beyond 300.

iostat
Linux 4.14.232-176.381.amzn2.x86_64 (ip-172-31-61-224.us-west-2.compute.internal)       06/16/2021      _x86_64_        (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          23.89    0.00    2.10    0.39    0.00   73.61

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme1n1         473.66        92.76     57667.85   14056667 8738529518
nvme0n1           0.73         2.33        16.87     352652    2556108

LeoHChen (Contributor, Author) commented
iostat on each RPC endpoint can go up to 750 tps, so m5d.2xlarge is the right choice.
