
Parachain node unable to synchronize relaychain #3550

Closed
xlc opened this issue Mar 2, 2024 · 25 comments

Comments

@xlc
Contributor

xlc commented Mar 2, 2024

We received two reports from Acala node operators that their nodes stop finalizing blocks because they are unable to sync the relay chain, with the following error:

2024-03-02 15:10:16 [Relaychain] 💔 Error importing block 0xaf540fcfaf1ace474c8a8fe96ca1d2501de72990d3f04e8639219e97f4b99436: consensus error: Chain lookup failed: No block weight for parent header.    
2024-03-02 15:10:17 [Relaychain] 💔 Error importing block 0xaf540fcfaf1ace474c8a8fe96ca1d2501de72990d3f04e8639219e97f4b99436: consensus error: Chain lookup failed: No block weight for parent header.    
2024-03-02 15:10:17 [Relaychain] 💔 Error importing block 0xaf540fcfaf1ace474c8a8fe96ca1d2501de72990d3f04e8639219e97f4b99436: consensus error: Chain lookup failed: No block weight for parent header.

One report is from an Acala node running 0.9.43 and the other one is using 1.3.0 (the latest released Acala version).

@xlc
Contributor Author

xlc commented Mar 2, 2024

aca.log

@nud3l

nud3l commented Mar 2, 2024

We are experiencing the issue on all Interlay collators using our 1.25.3 and 1.25.4 releases.

Interlay-collator-logs.txt

@matherceg

My log for Polkadot AssetHub for the last 24 hours.
log_assethub_polkadot24h.txt

@stakeworld
Contributor

stakeworld commented Mar 2, 2024

Attached 24 hours of polkadot bridgehub:
polkadot-bridgehub-collator.txt
Version: polkadot-parachain 1.7.0-97df9dd6554

@paulormart
Contributor

My collator logs for Assethub Polkadot, last 24h, with upgrade from v1.7.0 to v4.0.0-ec7817e5adc
polkadot_ahp_20240302.log

@JafarAz

JafarAz commented Mar 2, 2024

The Composable parachain also halted more than 4 hours ago and is unable to produce further blocks. It is also on version 0.9.49.

https://composable.subscan.io/block

@NZT48

NZT48 commented Mar 2, 2024

Same error logs and issues observed on some NeuroWeb collators. They are on version 0.9.40

@DamianStraszak

DamianStraszak commented Mar 2, 2024

Judging by telemetry https://telemetry.polkadot.io/#/0x91b171bb158e2d3848fa23a9f1c25182fb8e20313b2c1eb49219da7a70ce90c3 -- the vast majority of nodes stuck on 19720319 are on version <=0.9.43, while versions >=1 seem to be fine.

Edit: interestingly the stuck nodes report on telemetry

  • "finalized": 19720320
  • "(best) block": 19720319 or 19720317,

which is kind of strange, because finalized should never be ahead of best.

@Sudo-Whodo

Here are my collator logs from asset-hub-polkadot; it is running 4.0.0-ec7817e5adc
asset-hub-polkadot.txt

@dzmitry-lahoda
Contributor

@Sudo-Whodo it seems your logs show that you are importing relay and para blocks fine

@wischli

wischli commented Mar 2, 2024

On Centrifuge, we have also experienced the issue on one collator and several full nodes, all running on Polkadot v0.9.43. Two legacy nodes on Polkadot v0.9.38 weren't affected, though that might be too small a sample size. For us it was fixed by rolling back to a snapshot from last week and resyncing with fast sync.

@skunert
Contributor

skunert commented Mar 2, 2024

The failure happens here in the babe block import queue. It tries to load a weight from the auxiliary database that has already been removed. Weights for previous blocks are cleaned up when a new finality notification arrives, so usually this is not a problem during block import.

"finalized": 19720320
"(best) block": 19720319 or 19720317,

This seems to be the root cause. In the code linked above we check what the last best block is and try to load its weight. But because the finalized block has advanced past the best block, the weight in the aux db has already been cleaned up. I am not sure how we ended up in that situation. Maybe some scenario occurred in which we are not setting the best block correctly. Maybe we should use select_chain instead of the best block from the db in that code.

Since the issue appears on older versions, it is probably already fixed, otherwise it would have appeared on all the relay chain nodes too. As a workaround, resyncing the embedded relay chain should work, or using an external RPC node for the relay chain.

@lansehuiyi6

Any solution for this issue?

@bkchr
Member

bkchr commented Mar 4, 2024

You need to delete the relay chain db and resync it. For resyncing you can use warp sync with -- --sync warp.

@lansehuiyi6

You need to delete the relay chain db and resync it. For resyncing you can use warp sync with -- --sync warp.

About the relay chain db, is there a specific folder or set of files to delete?

@lansehuiyi6

You need to delete the relay chain db and resync it. For resyncing you can use warp sync with -- --sync warp.

I mean, for a full node the db is 1.6T, so is there any faster way to resync?

@bkchr
Member

bkchr commented Mar 5, 2024

As I told you, you can use warp sync. You also only need to delete the relay chain db. Check your base path; somewhere in there you will find the polkadot folder.

@lansehuiyi6

As I told you, you can use warp sync. You also only need to delete the relay chain db. Check your base path, there you somewhere find the polkadot folder.

Do you have any estimate time how long it will take to sync using warp sync mode?

@skunert
Contributor

skunert commented Mar 6, 2024

I investigated a bit more what happened here. I arrived at the following situation:

-19720315(fin) --> 19720316 --> 19720317 --> 19720318 --> 19720319(best)

New fork arrives and is set as best block.

                              /--> 19720316'(best)
                            /
-19720315(fin) --> 19720316 --> 19720317 --> 19720318 --> 19720319

19720320 arrives but is not set as best block (I don't know the exact reason; probably fewer primary slots there?). Additionally, 19720317 is finalized. The logic before 0.9.43 set the best block to the finalized one if there was more than one leaf. This is the case here since 19720316' is a fork. Since 19720317 is finalized, 19720316' is pruned.

-19720315 --> 19720316 --> 19720317(best)(fin) --> 19720318 --> 19720319 --> 19720320

19720320 is finalized, but we have no forks, so best block is now behind finalized block. At this point we prune the babe weight data in the aux db for the blocks below the finalized block.

-19720315 --> 19720316 --> 19720317(best)--> 19720318 --> 19720319 --> 19720320(fin) 

Now 19720321 arrives, we go into this else branch but cannot load the aux data for the best block -> the import error that we have seen.

This was fixed in 0.9.43 over a year ago: paritytech/substrate#14308, in recent versions we set the best block to the finalized one, even if there is only a single branch.

Do you have any estimate time how long it will take to sync using warp sync mode?

This is hard to say and depends on your node's settings. Warp sync should not take long, but since you said your node's db is 1.6T I assume it's an archive node? Then it will take a while.

EDIT: Was fixed in 1.0.0
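The bookkeeping failure walked through above can be sketched as a toy model. This is plain Python, not Substrate code; every name here (`ToyBabeImport`, `best_catches_up`, etc.) is made up for illustration, and the block numbers are shortened to their last two digits. It only reproduces the two invariants that collided: block import needs the stored weight of the current best block, and finality prunes weights below the finalized block.

```python
class ToyBabeImport:
    """Toy model of the BABE weight bookkeeping (illustrative, not Substrate code)."""

    def __init__(self):
        self.weights = {0: 0}  # stand-in for the aux db: block number -> weight
        self.best = 0
        self.finalized = 0

    def import_block(self, number):
        # Import needs the stored weight of the current best block.
        if self.best not in self.weights:
            raise RuntimeError(
                "Chain lookup failed: No block weight for parent header."
            )
        self.weights[number] = self.weights[self.best] + 1
        self.best = max(self.best, number)

    def finalize(self, number, best_catches_up):
        # On finality, weights strictly below the finalized block are pruned.
        # best_catches_up=True models the paritytech/substrate#14308 fix:
        # best is bumped to the finalized block if it fell behind.
        self.finalized = number
        if best_catches_up and self.best < number:
            self.best = number
        self.weights = {n: w for n, w in self.weights.items() if n >= number}


# Reproduce the scenario: best got stuck at ...17 while ...20 finalized.
node = ToyBabeImport()
for n in range(1, 21):                      # import blocks 1..20
    node.import_block(n)
node.best = 17                              # the fork shuffle left best behind
node.finalize(20, best_catches_up=False)    # pre-fix: weight of block 17 pruned
try:
    node.import_block(21)
except RuntimeError as err:
    print(err)                              # the error from the logs above
```

Running the same sequence with `best_catches_up=True` imports block 21 cleanly, which matches the post-fix behaviour of bumping best to the finalized block.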

@skunert skunert closed this as completed Mar 6, 2024
@lansehuiyi6

This is hard to say and depends on your node's settings. Warp sync should not take long, but since you said your node's db is 1.6T I assume it's an archive node? Then it will take a while.

Now I rebooted with warp sync, and the node has already caught up to the latest block.
But as you know, warp sync doesn't work for archive nodes, so do you know how many blocks warp mode will keep?

@wischli

wischli commented Mar 7, 2024

This was fixed in 0.9.43 over a year ago: paritytech/substrate#14308, in recent versions we set the best block to the finalized one, even if there is only a single branch.

Really appreciate the postmortem! The linked fix is not part of Polkadot v0.9.43 though, but of v1.0.0 instead:

I was skeptical because our clients @ v0.9.43 were affected by this issue.

@skunert
Contributor

skunert commented Mar 7, 2024

Really appreciate the postmortem! The linked fix is not part of Polkadot v0.9.43 though but v1.0.0 instead:

Yes, thanks for the correction! Edited the comment above.

Warp sync doesn't work for archive nodes, so do you know how many blocks will warp mode save?

Warp sync will quickly go to the tip of the chain and then respect your pruning settings for all new incoming blocks. For the blocks from genesis to the tip however, it will only download headers and bodies, not reconstruct the states. So if you want an archive node you cannot use warp, as you correctly said.

@lansehuiyi6

Warp sync will quickly go to the tip of the chain and then respect your pruning settings for all new incoming blocks. For the blocks from genesis to the tip however, it will only download headers and bodies, not reconstruct the states. So if you want an archive node you can not use warp as you correctly said.

I tried warp sync with '--blocks-pruning=50000 --state-pruning=50000', but it seemed the node still only stored the latest 256 blocks, while 256 is the default for state-pruning; older blocks will throw 'State already discarded for txhash'; and the node version is v1.0.0.

/usr/bin/polkadot --sync=warp --rpc-external --rpc-cors=all --rpc-max-connections=2048 --rpc-max-request-size=52500000 --rpc-max-response-size=52500000 --blocks-pruning=50000 --state-pruning=50000

So does the pruning issue exist with warp sync mode? Or had I entered the wrong command to sync?

@bkchr
Member

bkchr commented Mar 11, 2024

but seemed the node still only stored the latest 256 block while 256 is the

The node syncs to the tip of the chain. It will take 50000 blocks being added to the tip of the chain for all of them to be in the pruning window. We don't download old blocks' state.
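A rough model of that pruning-window behaviour (plain Python, illustrative only; the function and parameter names are made up, and this ignores finality lag and other details):

```python
def available_state_range(warp_target, current_tip, state_pruning):
    """Range of block numbers whose state a warp-synced node holds, under a
    simplified model: warp sync lands at `warp_target` with no historical
    state, and the node keeps state for at most `state_pruning` recent blocks."""
    oldest = max(warp_target, current_tip - state_pruning + 1)
    return oldest, current_tip


# Right after warp sync, only the few blocks since the warp target have state,
# no matter how large the pruning window is:
print(available_state_range(19_720_000, 19_720_300, 50_000))
# Once 50000+ blocks have been built on top, the full window is available:
print(available_state_range(19_720_000, 19_780_000, 50_000))
```

Queries below the returned `oldest` block are what produce the "State already discarded" error mentioned above.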

@lansehuiyi6

but seemed the node still only stored the latest 256 block while 256 is the

The node syncs to the tip of the chain. It will take 50000 blocks added to the tip of the chain to have all of them in the pruning window. We don't download old blocks state.

Thanks very much!
