Parachain node unable to synchronize relaychain #3550
We are experiencing the issue on all Interlay collators using our 1.25.3 and 1.25.4 releases.
My logs for Polkadot AssetHub for the last 24 hours.
Attached: 24 hours of polkadot bridgehub logs.
My collator logs for Assethub Polkadot, last 24h, with an upgrade from v1.7.0 to v4.0.0-ec7817e5adc.
The Composable parachain also halted more than 4 hours ago and is unable to produce further blocks. It is also on version 0.9.49.
The same error logs and issues were observed on some NeuroWeb collators. They are on version 0.9.40.
Judging by telemetry (https://telemetry.polkadot.io/#/0x91b171bb158e2d3848fa23a9f1c25182fb8e20313b2c1eb49219da7a70ce90c3), the vast majority of nodes appear stuck. Edit: interestingly, the stuck nodes still report on telemetry, which is kind of strange.
Here are my collator logs from asset-hub-polkadot; it is running 4.0.0-ec7817e5adc.
@Sudo-Whodo it seems your logs show that you are importing relay and para blocks fine.
On Centrifuge, we have also experienced the issue on one collator and several full nodes, all running Polkadot v0.9.43. Two legacy nodes on Polkadot v0.9.38 weren't affected, though that might be too small a sample size. Fixed by rolling back to a snapshot from last week and resyncing.
The failure happens here, in the BABE block import queue. It tries to load a weight from the auxiliary database that has already been removed. The weights of previous blocks are cleaned up when a new finality notification arrives, so usually this is not a problem during block import.
This seems to be the root cause. In the code linked above we check what the last best block is and try to load its weight. But because the finalized block has advanced past the best block, the weight in the aux db has already been cleaned up. I am not sure how we ended up in that situation; maybe some scenario occurred in which we are not setting the best block correctly. Since the issue appears on older versions, it is probably already fixed, otherwise it would have appeared on all the relay chain nodes too. As a workaround, resyncing the embedded relay chain should work, or using an external RPC node for the relay.
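The failure mode described above can be sketched with a toy model of the auxiliary store (all names here are hypothetical stand-ins, not the actual `sc-consensus-babe` API):

```rust
use std::collections::HashMap;

/// Toy stand-in for the auxiliary DB that stores per-block BABE weights.
struct AuxStore {
    weights: HashMap<u64, u128>, // block number -> cumulative weight
}

impl AuxStore {
    /// On a finality notification, weights of blocks below the finalized
    /// block are pruned (mirroring the cleanup described above).
    fn on_finality(&mut self, finalized: u64) {
        self.weights.retain(|&n, _| n >= finalized);
    }

    /// Block import looks up the best block's weight; `None` here is the
    /// analogue of the import error seen in the logs.
    fn weight(&self, block: u64) -> Option<u128> {
        self.weights.get(&block).copied()
    }
}

fn main() {
    let mut aux = AuxStore {
        weights: (1..=20).map(|n| (n, n as u128)).collect(),
    };
    let best = 16; // best block lagging behind...
    aux.on_finality(18); // ...the finalized block, as in this issue
    // Lookup for the stale best block now fails: its weight was pruned.
    assert_eq!(aux.weight(best), None);
    assert!(aux.weight(18).is_some());
    println!("weight for stale best block {best}: {:?}", aux.weight(best));
}
```

The point of the sketch: pruning keyed on the finalized block is only safe while the best block stays at or above it; once best falls behind finalized, any import that reads the best block's weight hits a pruned entry.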
Any solution for this issue? |
You need to delete the relay chain db and resync it. For resyncing you can use warp sync.
About the relay chain db: is there a specific folder or files to delete?
I mean, for a full node the size of the db is 1.6T, so is there any fast resync solution?
As I told you, you can use warp sync. You also only need to delete the relay chain db. Check your base path; somewhere in there you will find the polkadot folder.
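A minimal sketch of that workaround. The base path and db layout below are assumptions (collators typically keep the embedded relay chain data under a `polkadot` folder inside the base path), so verify the path on your own node before deleting anything:

```shell
# Hypothetical base path -- substitute your collator's actual --base-path.
BASE="$HOME/.local/share/my-collator"
# Assumed location of the embedded relay chain db; verify before removing.
RELAY_DB="$BASE/polkadot/chains/polkadot/db"

echo "relay chain db to remove: $RELAY_DB"
# rm -rf "$RELAY_DB"   # uncomment only after verifying the path (backup first)

# Then restart the collator; arguments after the lone `--` are passed to the
# embedded relay chain node, e.g.:
# my-collator --base-path "$BASE" ... -- --sync warp
```

Only the relay chain db is removed here; the parachain's own db stays intact, so the collator keeps its parachain state.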
Do you have any estimate of how long it will take to sync using warp sync mode?
I investigated a bit more what happened here. I arrived at the following situation:
1. A new fork arrives and is set as best block.
2. 19720320 arrives but is not set as best block (I don't know the exact reason; probably fewer primary slots there?). Additionally, 19720317 is finalized. The logic before 0.9.43 only set the best block to the finalized one if there was more than one leaf. That is the case here, since 19720316' is a fork. Since 19720317 is finalized, 19720316' is pruned.
3. 19720320 is finalized, but we have no forks, so the best block is now behind the finalized block. At this point we prune the BABE weight data in the aux db for blocks below the finalized block.
4. Now 19720321 arrives; we go into this else branch but cannot load the aux data for the best block -> the import error that we have seen.

This was fixed in 0.9.43 over a year ago: paritytech/substrate#14308. In recent versions we set the best block to the finalized one, even if there is only a single branch.
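The before/after behaviour of that fix can be sketched as a toy simulation (a sketch of the logic described above, not the actual substrate client code; names are hypothetical):

```rust
/// Simplified view of the client's chain info.
#[derive(Debug, PartialEq)]
struct ChainInfo {
    best: u64,
    finalized: u64,
}

/// Pre-fix behaviour (sketch): best is only reset to the finalized block
/// when there is more than one leaf, so with a single branch a stale best
/// block can stay behind finality.
fn finalize_old(info: &mut ChainInfo, finalized: u64, leaves: usize) {
    info.finalized = finalized;
    if leaves > 1 && info.best < finalized {
        info.best = finalized;
    }
}

/// Post-fix behaviour (sketch of paritytech/substrate#14308): best is
/// moved up to the finalized block whenever it lags, even on one branch.
fn finalize_fixed(info: &mut ChainInfo, finalized: u64) {
    info.finalized = finalized;
    if info.best < finalized {
        info.best = finalized;
    }
}

fn main() {
    // Single leaf left after the fork was pruned; best stuck at 19720316'
    // while 19720320 finalizes:
    let mut old = ChainInfo { best: 19_720_316, finalized: 19_720_317 };
    finalize_old(&mut old, 19_720_320, 1);
    assert!(old.best < old.finalized); // best behind finalized -> later import error

    let mut fixed = ChainInfo { best: 19_720_316, finalized: 19_720_317 };
    finalize_fixed(&mut fixed, 19_720_320);
    assert_eq!(fixed.best, fixed.finalized); // best keeps up with finality
    println!("old: {old:?}, fixed: {fixed:?}");
}
```

With the fixed logic the best block can never fall behind the finalized block, so the aux-db weight pruned at finalization is never needed again.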
This is hard to say and depends on your node's settings. Warp sync should not take long, but since you said your node db is 1.6T I assume it's an archive? Then it will take a while. EDIT: Was fixed in 1.0.0.
Now I rebooted with warp sync, and the node has already caught up to the latest block.
Really appreciate the postmortem! The linked fix is not part of Polkadot v0.9.43 though, but v1.0.0 instead:
I was skeptical because our clients at v0.9.43 were affected by this issue.
Yes, thanks for the correction! Edited the comment above.
Warp sync will quickly go to the tip of the chain and then respect your pruning settings for all new incoming blocks. For the blocks from genesis to the tip, however, it will only download headers and bodies, not reconstruct the states. So if you want an archive node you cannot use warp sync, as you correctly said.
I tried warp sync with `--blocks-pruning=50000 --state-pruning=50000`, but it seemed the node still only stored the state of the latest 256 blocks, while 256 is the default for state-pruning; older blocks will throw 'State already discarded for txhash'. The node version is v1.0.0.

`/usr/bin/polkadot --sync=warp --rpc-external --rpc-cors=all --rpc-max-connections=2048 --rpc-max-request-size=52500000 --rpc-max-response-size=52500000 --blocks-pruning=50000 --state-pruning=50000`

So does the pruning issue exist in warp sync mode? Or had I entered the wrong command to sync?
The node syncs to the tip of the chain. It will take 50000 blocks being added to the tip of the chain before all of them are in the pruning window. We don't download old blocks' state.
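That window behaviour can be illustrated with a simple availability check (a rough sketch of the rule just described, not the actual client logic; the function and parameter names are invented):

```rust
/// After warp sync, state exists only for blocks imported at the tip since
/// the warp target, and only within the pruning window behind the tip.
fn state_available(block: u64, tip: u64, warp_target: u64, window: u64) -> bool {
    block > warp_target && block + window > tip
}

fn main() {
    let (warp_target, window) = (19_000_000_u64, 50_000_u64);

    // Just after warp sync: only 1000 blocks imported since the target.
    let tip = warp_target + 1_000;
    // Historical blocks: "State already discarded", despite the large window.
    assert!(!state_available(18_500_000, tip, warp_target, window));
    // Recently imported blocks do have state.
    assert!(state_available(tip - 500, tip, warp_target, window));

    // Once 60000 more blocks have been imported, the full 50000-block
    // window is populated.
    let later_tip = warp_target + 60_000;
    assert!(state_available(later_tip - 49_999, later_tip, warp_target, window));
    println!("pruning window fills up only as new blocks arrive at the tip");
}
```

So `--state-pruning=50000` is not wrong; the window simply starts empty after warp sync and fills as new blocks are imported.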
Thanks very much! |
We received two reports from Acala node operators that their nodes stop finalizing blocks because they are unable to sync the relay chain, with the following error:
One report is from an Acala node with 0.9.43 and another one is using 1.3.0 (the latest released Acala version).