Explorer nodes can't catch up with latest blocks #3740
Comments
Another node (34.222.182.47) definitely isn't catching up with the latest block and keeps triggering the block sync logic (which has a 60s delay). It's very likely caused by too much CPU/memory load, where the blocks broadcast from consensus aren't processed in time (due to a missing parent block). So the lastMile block logic likely needs some optimization. @JackyWYX, you can check the log on 34.222.182.47; the issue is clear.
@rlan35, 382% CPU is not CPU overloaded; this node has 8 cores, you may use …
But I agree with you; I suspect the last mile block sync should be a bit more aggressive in order to catch up, since we do have enough CPU power. Some parameters may need to be tuned in the last mile catch-up. @JackyWYX
My current guess is as follows: the explorer node is heavy in p2p message handling, resulting in committed blocks being handled late. I will first try to confirm my guess, and apply the fix later.
Heavy p2p message load is reasonable, as those explorer nodes are also RPC endpoints; they receive a lot of RPC requests from users and send p2p messages to the network.
The issue was observed on one custom explorer node. It appears there is no caching mechanism for the last mile blocks on explorer nodes, so under heavy p2p message load, the processing order of last mile blocks is not guaranteed. I am currently working on a fix to add a cache to the explorer and sort the last mile blocks before block insertion.
After some more in-depth debugging, I found that the issue actually happens in writing explorer DB data. On an explorer node, the data for the explorer needs 7~15s of Dump (…
Inspection of the system load on non-archival explorer nodes shows the disk IOPS can't keep up, so we will migrate the non-archival explorer nodes to m5d.2xlarge. Progress will be tracked in the following issue.
We have upgraded the non-archival explorer nodes to m5d.2xlarge; iostat can now go beyond 300.
iostat on each RPC endpoint can go up to 750 tps, so m5d.2xlarge was the right choice.
Describe the bug
Recently, with a higher volume of transactions happening on-chain, the explorer nodes, whether archival or non-archival, can't catch up with the latest blockchain blocks.
To Reproduce
We've seen this problem with build v6999-v4.0.0-66-g343dbe89 on the mainnet.
The explorer nodes lag behind the latest blocks by a few minutes. However, there is no sign of CPU/IO overload on the nodes, and after I restarted the harmony process, they caught up very quickly. So I don't think it is a resource issue; there could be some logic that is slow to catch up.
Expected behavior
Explorer nodes should be able to catch up with the latest block when there is no resource overload.