
Parachain node unable to synchronize relaychain #3550

Closed
xlc opened this issue Mar 2, 2024 · 25 comments

Comments

@xlc
Contributor

xlc commented Mar 2, 2024

We received two reports from Acala node operators that their nodes stop finalizing blocks because they are unable to sync the relay chain, with the following error:

2024-03-02 15:10:16 [Relaychain] 💔 Error importing block 0xaf540fcfaf1ace474c8a8fe96ca1d2501de72990d3f04e8639219e97f4b99436: consensus error: Chain lookup failed: No block weight for parent header.    
2024-03-02 15:10:17 [Relaychain] 💔 Error importing block 0xaf540fcfaf1ace474c8a8fe96ca1d2501de72990d3f04e8639219e97f4b99436: consensus error: Chain lookup failed: No block weight for parent header.    
2024-03-02 15:10:17 [Relaychain] 💔 Error importing block 0xaf540fcfaf1ace474c8a8fe96ca1d2501de72990d3f04e8639219e97f4b99436: consensus error: Chain lookup failed: No block weight for parent header.

One report is from an Acala node running 0.9.43 and the other one is using 1.3.0 (the latest released Acala version).

@xlc
Contributor Author

xlc commented Mar 2, 2024

aca.log

@nud3l

nud3l commented Mar 2, 2024

We are experiencing the issue on all Interlay collators using our 1.25.3 and 1.25.4 releases.

Interlay-collator-logs.txt

@matherceg

My log for Polkadot AssetHub for the last 24 hours.
log_assethub_polkadot24h.txt

@stakeworld
Contributor

stakeworld commented Mar 2, 2024

Attached 24 hours of polkadot bridgehub:
polkadot-bridgehub-collator.txt
Version: polkadot-parachain 1.7.0-97df9dd6554

@paulormart
Contributor

My collator logs for Assethub Polkadot, last 24h, with upgrade from v1.7.0 to v4.0.0-ec7817e5adc
polkadot_ahp_20240302.log

@JafarAz

JafarAz commented Mar 2, 2024

The Composable parachain also halted more than 4 hours ago and is unable to produce further blocks. It is also on version 0.9.49.

https://composable.subscan.io/block

@NZT48

NZT48 commented Mar 2, 2024

Same error logs and issues observed on some NeuroWeb collators. They are on version 0.9.40

@DamianStraszak

DamianStraszak commented Mar 2, 2024

Judging by telemetry https://telemetry.polkadot.io/#/0x91b171bb158e2d3848fa23a9f1c25182fb8e20313b2c1eb49219da7a70ce90c3 -- the vast majority of nodes stuck on 19720319 are on version <=0.9.43, while versions >=1 seem to be fine.

Edit: interestingly the stuck nodes report on telemetry

  • "finalized": 19720320
  • "(best) block": 19720319 or 19720317,

which is kind of strange, because finalized should never be ahead of best.

@Sudo-Whodo

Here are my collator logs from asset-hub-polkadot; it is running 4.0.0-ec7817e5adc
asset-hub-polkadot.txt

@dzmitry-lahoda
Contributor

@Sudo-Whodo it seems your logs show that you are importing relay and para blocks fine

@wischli

wischli commented Mar 2, 2024

On Centrifuge, we have also experienced the issue on one collator and several full nodes, all running on Polkadot v0.9.43. Two legacy nodes on Polkadot v0.9.38 weren't affected, though that might be too small a sample size. For us it was fixed by rolling back to a snapshot from last week and resyncing with fast sync.

@skunert
Contributor

skunert commented Mar 2, 2024

The failure happens here in the babe block import queue. It tries to load a weight from the auxiliary database that has already been removed. Weights for previous blocks are cleaned up when a new finality notification arrives, so usually this is not a problem during block import.

"finalized": 19720320
"(best) block": 19720319 or 19720317,

This seems to be the root cause. In the code linked above we check what the last best block is and try to load its weight. But because the finalized block has advanced past the best block, the weight in the aux db has already been cleaned up. I am not sure how we ended up in that situation. Maybe some scenario occurred in which we are not setting the best block correctly. Maybe we should use select_chain instead of the best block from the db in that code.

Since the issue appears on older versions, it is probably already fixed, otherwise it would have appeared on all the relay chain nodes too. As a workaround, resyncing the embedded relay chain should work, or using an external RPC node for the relay chain.

@lansehuiyi6

Any solution for this issue?

@bkchr
Member

bkchr commented Mar 4, 2024

You need to delete the relay chain db and resync it. For resyncing you can use warp sync with -- --sync warp.

@lansehuiyi6

You need to delete the relay chain db and resync it. For resyncing you can use warp sync with -- --sync warp.

About the relay chain db, is there a specific folder or set of files to delete?

@lansehuiyi6

You need to delete the relay chain db and resync it. For resyncing you can use warp sync with -- --sync warp.

I mean, for a full node the db is 1.6T, so is there any faster way to resync?

@bkchr
Member

bkchr commented Mar 5, 2024

As I told you, you can use warp sync. You also only need to delete the relay chain db. Check your base path; somewhere in there you will find the polkadot folder.

@lansehuiyi6

As I told you, you can use warp sync. You also only need to delete the relay chain db. Check your base path, there you somewhere find the polkadot folder.

Do you have any estimate time how long it will take to sync using warp sync mode?

@skunert
Contributor

skunert commented Mar 6, 2024

I investigated a bit more what happened here. I arrived at the following situation:

-19720315(fin) --> 19720316 --> 19720317 --> 19720318 --> 19720319(best)

New fork arrives and is set as best block.

                              /--> 19720316'(best)
                            /
-19720315(fin) --> 19720316 --> 19720317 --> 19720318 --> 19720319

19720320 arrives but is not set as best block (I don't know the exact reason; probably fewer primary slots there?). Additionally, 19720317 is finalized. The logic before 0.9.43 set the best block to the finalized one if there was more than one leaf. This is the case here since 19720316' is a fork. Since 19720317 is finalized, 19720316' is pruned.

-19720315 --> 19720316 --> 19720317(best)(fin) --> 19720318 --> 19720319 --> 19720320

19720320 is finalized, but we have no forks, so best block is now behind finalized block. At this point we prune the babe weight data in the aux db for the blocks below the finalized block.

-19720315 --> 19720316 --> 19720317(best)--> 19720318 --> 19720319 --> 19720320(fin) 

Now 19720321 arrives, we go into this else branch but cannot load the aux data for the best block -> the import error that we have seen.

This was fixed in 0.9.43 over a year ago: paritytech/substrate#14308, in recent versions we set the best block to the finalized one, even if there is only a single branch.

Do you have any estimate time how long it will take to sync using warp sync mode?

This is hard to say and depends on your node's settings. Warp sync should not take long, but since you said your node's db is 1.6T I assume it's an archive node? Then it will take a while.

EDIT: Was fixed in 1.0.0
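The bookkeeping failure walked through above can be sketched as a toy model. This is plain Python, not Substrate code; every name here (`ToyBabeImport`, `best_catches_up`, etc.) is made up for illustration, and the block numbers are shortened to their last two digits. It only reproduces the two invariants that collided: block import needs the stored weight of the current best block, and finality prunes weights below the finalized block.

```python
class ToyBabeImport:
    """Toy model of the BABE weight bookkeeping (illustrative, not Substrate code)."""

    def __init__(self):
        self.weights = {0: 0}  # stand-in for the aux db: block number -> weight
        self.best = 0
        self.finalized = 0

    def import_block(self, number):
        # Import needs the stored weight of the current best block.
        if self.best not in self.weights:
            raise RuntimeError(
                "Chain lookup failed: No block weight for parent header."
            )
        self.weights[number] = self.weights[self.best] + 1
        self.best = max(self.best, number)

    def finalize(self, number, best_catches_up):
        # On finality, weights strictly below the finalized block are pruned.
        # best_catches_up=True models the paritytech/substrate#14308 fix:
        # best is bumped to the finalized block if it fell behind.
        self.finalized = number
        if best_catches_up and self.best < number:
            self.best = number
        self.weights = {n: w for n, w in self.weights.items() if n >= number}


# Reproduce the scenario: best got stuck at ...17 while ...20 finalized.
node = ToyBabeImport()
for n in range(1, 21):                      # import blocks 1..20
    node.import_block(n)
node.best = 17                              # the fork shuffle left best behind
node.finalize(20, best_catches_up=False)    # pre-fix: weight of block 17 pruned
try:
    node.import_block(21)
except RuntimeError as err:
    print(err)                              # the error from the logs above
```

Running the same sequence with `best_catches_up=True` imports block 21 cleanly, which matches the post-fix behaviour of bumping best to the finalized block.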

@skunert skunert closed this as completed Mar 6, 2024
@lansehuiyi6

This is hard to say and depends on your node's settings. Warp sync should not take long, but since you said your node's db is 1.6T I assume it's an archive node? Then it will take a while.

Now I rebooted with warp sync, and the node has already caught up to the latest block.
But as you know, warp sync doesn't work for archive nodes, so do you know how many blocks warp mode will keep?

@wischli

wischli commented Mar 7, 2024

This was fixed in 0.9.43 over a year ago: paritytech/substrate#14308, in recent versions we set the best block to the finalized one, even if there is only a single branch.

Really appreciate the postmortem! The linked fix is not part of Polkadot v0.9.43 though, but of v1.0.0 instead:

I was skeptical because our clients @ v0.9.43 were affected by this issue.

@skunert
Contributor

skunert commented Mar 7, 2024

Really appreciate the postmortem! The linked fix is not part of Polkadot v0.9.43 though but v1.0.0 instead:

Yes, thanks for the correction! Edited the comment above.

Warp sync doesn't work for archive nodes, so do you know how many blocks will warp mode save?

Warp sync will quickly go to the tip of the chain and then respect your pruning settings for all new incoming blocks. For the blocks from genesis to the tip however, it will only download headers and bodies, not reconstruct the states. So if you want an archive node you cannot use warp, as you correctly said.

@lansehuiyi6

Warp sync will quickly go to the tip of the chain and then respect your pruning settings for all new incoming blocks. For the blocks from genesis to the tip however, it will only download headers and bodies, not reconstruct the states. So if you want an archive node you can not use warp as you correctly said.

I tried warp sync with '--blocks-pruning=50000 --state-pruning=50000', but it seemed the node still only stored the latest 256 blocks, while 256 is the default for state-pruning; older blocks will throw 'State already discarded for txhash'; and the node version is v1.0.0.

/usr/bin/polkadot --sync=warp --rpc-external --rpc-cors=all --rpc-max-connections=2048 --rpc-max-request-size=52500000 --rpc-max-response-size=52500000 --blocks-pruning=50000 --state-pruning=50000

So does the pruning issue exist with warp sync mode? Or had I entered the wrong command to sync?

@bkchr
Member

bkchr commented Mar 11, 2024

but seemed the node still only stored the latest 256 block while 256 is the

The node syncs to the tip of the chain. It will take 50000 blocks being added to the tip of the chain for all of them to be in the pruning window. We don't download old blocks' state.
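A rough model of that pruning-window behaviour (plain Python, illustrative only; the function and parameter names are made up, and this ignores finality lag and other details):

```python
def available_state_range(warp_target, current_tip, state_pruning):
    """Range of block numbers whose state a warp-synced node holds, under a
    simplified model: warp sync lands at `warp_target` with no historical
    state, and the node keeps state for at most `state_pruning` recent blocks."""
    oldest = max(warp_target, current_tip - state_pruning + 1)
    return oldest, current_tip


# Right after warp sync, only the few blocks since the warp target have state,
# no matter how large the pruning window is:
print(available_state_range(19_720_000, 19_720_300, 50_000))
# Once 50000+ blocks have been built on top, the full window is available:
print(available_state_range(19_720_000, 19_780_000, 50_000))
```

Queries below the returned `oldest` block are what produce the "State already discarded" error mentioned above.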

@lansehuiyi6

but seemed the node still only stored the latest 256 block while 256 is the

The node syncs to the tip of the chain. It will take 50000 blocks added to the tip of the chain to have all of them in the pruning window. We don't download old blocks state.

Thanks very much!
