on restart, beaconchain initial-sync error shows beacon node doesn't have a parent in db with root... causing validator to keep waiting (prater/goerli) #11279
Comments
This is the log entry for what seems like the first time it happened:
Aug 20 07:19:08 nuc4 beacon-chain[370963]: time="2022-08-20 07:19:08" level=info msg="Peer summary" activePeers=40 inbound=0 outbound=40 prefix=p2p
@ptxptx2 Does this still fail even after a restart?
Yes, I've restarted a few times with the same result. I've attached the log entries at the restart.
Before the initial restart, was the node functioning fine?
Yes, I installed this on Aug 8, and it went through the prater/goerli merge fine and was running ok until the Aug 20 restart. Prior to the Aug 20 restart, there were other times where the system was restarted and it ran fine for those. One difference with the Aug 20 restart was that the node was down for about 1 hour - whereas the others were only down for hardware reboots (no more than 2-3 minutes).
To further clarify, the node was down for about 1 hour and started up on Aug 19 23:57:03, and it seemed to chug along fine until Aug 20 07:12. Here's the snippet of that in the log - I've attached the log for that time period. Here are some error and warning log entries in that time period:
Aug 20 07:13:29 nuc4 beacon-chain[370963]: time="2022-08-20 07:13:29" level=error msg="Could not handle p2p pubsub" error="could not process block: could not execute state transition: could not validate state root, wanted: 0xf4a2580a73964f9064b86aab0b722e8d2982475bb6dbba750552d2dffdbc4da6, received: 0x2bf2b6951cf0cc5eb3bfa439b5a58e792aba18fff1afa3f9e4b0526d5dab658f" prefix=sync topic="/eth2/c2ce3aa8/beacon_block/ssz_snappy"
Thank you, these last ones are very useful.
Running into the same issue on goerli with 2.1.4.
Let me know if you need other logs.
Just to make sure, about this sentence:
Does it mean that from approximately midnight of Aug 20 until 7am, the node was synced and functioning fine, and the validators were attesting fine?
@chpiatt Just to clarify, what are the errors you see on your node?
Yes, here are some entries prior to Aug 20 07:12 and overlapping into 07:13 where some error occurs:
Aug 20 07:11:49 nuc4 beacon-chain[370963]: time="2022-08-20 07:11:49" level=info msg="Finished applying state transition" attestations=128 payloadHash=0xa6083873a135 prefix=blockchain slot=3708059 syncBitsCount=437 txCount=40
time="2022-08-21 14:55:24" level=warning msg="Batch is not processed" error="beacon node doesn't have a parent in db with root: 0x45ab4d89e866c6ac9e16e0cd13e1afc8e552896da931f7d784ab26322f6b3fd2 (in processBatchedBlocks, slot=815488)" prefix=initial-sync |
I just checked the log on the validator, and to answer your specific question, yes, the validator submitted a new attestation at 07:15 (there is only one validator running):
Aug 20 07:12:48 nuc4 validator[370982]: time="2022-08-20 07:12:48" level=info msg="Attestation schedule" attesterDutiesAtSlot=1 prefix=validator pubKeys=[0x8428c056246d] slot=3708077 slotInEpoch=13 totalAttestersInEpoch=1
The validator failed for the subsequent epoch:
Aug 20 07:19:12 nuc4 validator[370982]: time="2022-08-20 07:19:12" level=info msg="Attestation schedule" attesterDutiesAtSlot=1 prefix=validator pubKeys=[0x8428c056246d] slot=3708119 slotInEpoch=23 totalAttestersInEpoch=1
Do you see the same roots as in this message?
No, I'm seeing different roots.
Working off of this genesis, if it helps: https://github.com/eth-clients/eth2-networks/raw/master/shared/prater/genesis.ssz
Do you see the same error message as the one I posted from @ptxptx2? If so, can you paste yours?
No, I don't see any.
Ahh ok, then you're experiencing a different issue not related to this one.
@ptxptx2 Could you tell me what exact CPU you run and what are the contents of
8 core - here's one of them...
processor : 7
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
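(The flag list above looks like /proc/cpuinfo output; assuming that is the file being asked about - the exact path wasn't captured here - it can be pulled with a couple of commands.)

```sh
# Assumption: the file asked about is /proc/cpuinfo.
grep -m1 'model name' /proc/cpuinfo   # exact CPU model
grep -m1 '^flags' /proc/cpuinfo       # feature flags of one core
nproc                                 # number of logical cores
```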
Thanks for that. That CPU is very well covered and tested, especially on Linux. Also, if you wouldn't mind running a memtest on that computer, that would help - we want to rule out hardware issues, because from here it looks as if your database has saved a bad root.
(It's taking me some time to get memtest going, working on it...)
I did get memtest running and it's still running - having just completed pass 1 (out of 4 passes). Pass 1 showed no errors.
What could have caused the bad root in the database, and how can that be repaired?
Thanks, I don't think there's a solution to that if you have a corrupt database. But it's important to get to the bottom of why you have this corrupt database. It could be hardware (memory corruption, disk problems, etc.) or it could be a bug in the software. We've spent a good few hours analyzing this and have come up with nothing yet. Would you be able to provide us with your beacon database for us to analyze?
I assume this is beaconchain.db. Did you need network-keys as well? Is there a place I can upload it to? I can try that once memtest finishes (it's still on pass 2).
On memtest, it shows 0 errors on all 4 passes. So, no memory issues detected.
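(Memtest covers the RAM side of the hardware concern raised above; to also rule out disk problems, a quick SMART health check is an option. A hedged sketch, assuming smartmontools is installed and the database lives on an NVMe drive - substitute the actual device name from lsblk.)

```sh
# Hedged sketch: check disk health with smartmontools.
# Replace /dev/nvme0 with the actual device (see `lsblk`).
sudo smartctl -H /dev/nvme0   # overall SMART health verdict
sudo smartctl -a /dev/nvme0   # full attribute/error-log dump
```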
@ptxptx2 Can you try uploading it here? Then share the link and we can take a look at it.
beaconchain.db is 99 GB - and it just failed when I tried to share via https://share.ipfs.io. I will see if I can set up an ipfs node for that. Or do you have other suggestions on sharing the file?
Ok, in the interest of time, maybe you can just provide us with your finalized state. Can you start up your node again with that db and fetch this object?
And then paste it here. You can run the node with these additional flags:
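(The exact flags and object referenced above were not captured in this thread. As a hedged sketch only: Prysm can serve the finalized BeaconState as SSZ over the standard Beacon API debug route once debug endpoints are enabled, so the fetch might look roughly like this, assuming the default HTTP gateway port 3500.)

```sh
# Hedged sketch, not the maintainer's exact instructions.
# Assumes the beacon node was restarted with debug endpoints enabled and a
# larger gRPC message cap, e.g.:
#   beacon-chain ... --enable-debug-rpc-endpoints --grpc-max-msg-size=268435456
# Then download the finalized state as SSZ via the standard debug route:
curl -H "Accept: application/octet-stream" \
  http://localhost:3500/eth/v2/debug/beacon/states/finalized \
  -o finalized.ssz
```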
It says the file size is too big - so I shared finalized.ssz as https://ipfs.io/ipfs/QmZmEjuoJRw5Fo3sAcqg84roXoe7eHVEZK7zRhF7fkvrh4 - hopefully you can access it properly? (I changed the url to add /ipfs in the path.)
I am not sure that ipfs thing is working - so I just uploaded finalized.ssz to my google drive: https://drive.google.com/file/d/1Gfb1pkJGz1Ux_vISv0Cc56xaBJbLebvI/view?usp=sharing
I'll figure out the ipfs thing in a while...
Ok, the ipfs url - https://ipfs.io/ipfs/QmZmEjuoJRw5Fo3sAcqg84roXoe7eHVEZK7zRhF7fkvrh4 - should be working now as well (for finalized.ssz).
I also have the database shared via ipfs - https://ipfs.io/ipfs/QmUncEf8BHyFL6JoQA2FWLcj3uzfbuqeuaNe6nZVmjMKNe
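(For reference - the exact commands used were not posted - the typical CLI workflow for publishing a large file over IPFS, assuming the go-ipfs/Kubo client, looks roughly like this.)

```sh
# Hedged sketch of a typical IPFS publishing workflow; not the exact commands
# used in this thread.
ipfs init                      # one-time local repository setup
ipfs daemon &                  # keep a node online so gateways can fetch the data
ipfs add --pin beaconchain.db  # prints a CID; share it as https://ipfs.io/ipfs/<CID>
```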
Thanks a lot, this is very helpful.
This is very helpful @ptxptx2, thank you for uploading it.
FYI, on my original problem, I've since resynced both execution and consensus layers (with v1.10.23 geth and v3.0.0 prysm) and that resynced system has been stable.
Hi, as posted on Discord, I might have encountered the same problem on Mainnet, with version 3.1.2, Ubuntu Server 22. Appending beaconchain logs. I did a checkpoint resync of the beacon chain, which fixed the issue. Let me know if you need any more details.
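(A checkpoint resync replaces the local beacon database and syncs from a trusted finalized checkpoint instead of replaying history, which is why it sidesteps a corrupted ancestor root. As a hedged sketch - the exact commands used here were not posted - with Prysm this is typically done via the checkpoint sync flags against a trusted Beacon API endpoint; the URL below is a placeholder.)

```sh
# Hedged sketch; https://checkpoint-provider.example is a placeholder for a
# trusted beacon node / checkpoint provider of your choice.
# 1) Stop the beacon node and clear its database under --datadir.
# 2) Restart with checkpoint sync enabled:
beacon-chain \
  --datadir=/path/to/consensus-data \
  --checkpoint-sync-url=https://checkpoint-provider.example \
  --genesis-beacon-api-url=https://checkpoint-provider.example \
  --execution-endpoint=http://localhost:8551 \
  --jwt-secret=/jwt.hex \
  --accept-terms-of-use
```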
Same problem here; after a restart the sync is not working anymore:
@skliarovartem Please update your execution client too for the hardfork.
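(As a hedged example of what that update looks like, assuming geth was installed from the official Ethereum PPA on Ubuntu; Docker or binary installs are upgraded differently.)

```sh
# Hedged example: upgrading geth installed from the Ethereum PPA on Ubuntu.
sudo apt-get update
sudo apt-get install --only-upgrade geth
geth version   # confirm the post-hardfork release is installed
```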
my current setup: |
Hi, I'm getting this on mainnet - maybe my CPU is too weak, causing geth I/O errors?
When resyncing from scratch it may work for 10 hours max and then the error is back. Should I upgrade my hardware?
Also seeing this issue while trying to sync from scratch. My machine: it's been syncing slowly for over a week. I'm not in a hurry, but after reading the comments I will maybe try to go back to v3.0.0. Current sync state: some logs:
Hope this helps!
🐞 Bug Report
I am running v2.1.4 of beaconchain and validator. The validator can't proceed because the beaconchain is not syncing properly.
Description
On restart of the beacon chain, it seems to be stuck trying to complete the initial-sync portion. It keeps repeating this in the log:
Aug 20 09:29:58 nuc4 beacon-chain[1945]: time="2022-08-20 09:29:58" level=info msg="Connected to new endpoint: http://localhost:8551" prefix=powchain
Aug 20 09:30:05 nuc4 beacon-chain[1945]: time="2022-08-20 09:30:05" level=info msg="Processing block batch of size 57 starting from 0x99d6671e... 3707969/3708750 - estimated time remaining 4m34s" blocksPerSecond=2.9 peers=45 prefix=initial-sync
Aug 20 09:30:05 nuc4 beacon-chain[1945]: time="2022-08-20 09:30:05" level=warning msg="Batch is not processed" error="could not process block in batch: could not process block: could not process block header: parent root 0x5e918c1757a5ae990d67d869f909a344199fe5e0448999782049e5bdf382eb18 does not match the latest block header signing root in state 0x9ef78c510be027cb8abebfe3910c4a3b82f21ce5446d9c541903b7ba25398369" prefix=initial-sync
Aug 20 09:30:05 nuc4 beacon-chain[1945]: time="2022-08-20 09:30:05" level=info msg="Processing block batch of size 58 starting from 0x90aa636a... 3708033/3708750 - estimated time remaining 2m4s" blocksPerSecond=5.8 peers=45 prefix=initial-sync
...
Aug 20 09:30:06 nuc4 beacon-chain[1945]: time="2022-08-20 09:30:06" level=error msg="Could not get reconstruct full bellatrix block from blinded body" error="could not fetch execution block with txs by hash 0x01d1075f83868e6eb900e4e255d83903d0c869748e858f2b809500f9d563c204: timeout from http.Client" prefix=sync
(see attached for the complete log fragment)
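(A quick way to confirm the node is genuinely stuck rather than progressing slowly - a hedged aside, assuming Prysm's default HTTP gateway port 3500 - is to poll the standard syncing endpoint and watch whether head_slot / sync_distance change between calls.)

```sh
# Standard Beacon API syncing status; values should change between calls
# if the node is actually making progress.
curl -s http://localhost:3500/eth/v1/node/syncing
```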
This seems similar to another issue with the same title - but that one was for ropsten, while this one is for prater/goerli. Maybe I should have combined issues? (Maybe the same issue as #11270.)
Has this worked before in a previous version?
Yes, this was working before. I was using 2.1.4-rc1 going into the merge. Last night, I stopped and restarted, and from there it kept repeating this in the log. I upgraded to 2.1.4 but it did not improve the situation.
🔬 Minimal Reproduction
🔥 Error
🌍 Your Environment
Operating System:
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
x86_64
What version of Prysm are you running? (Which release)
2.1.4
Anything else relevant (validator index / public key)?
Log Fragment:
repeated-log.txt
$ geth version
Geth
Version: 1.10.21-stable
Git Commit: 671094279e8d27f4b4c3c94bf8b636c26b473976
Architecture: amd64
Go Version: go1.18.4
Operating System: linux
GOPATH=
GOROOT=go
geth --goerli --http --cache=2048 --datadir --http.api eth,net,engine,admin --authrpc.vhosts="localhost" --authrpc.jwtsecret=/jwt.hex
/usr/local/bin/beacon-chain --datadir=
--http-web3provider=http://localhost:8551 --prater --jwt-secret=/jwt.hex --genesis-state=/genesis.ssz --accept-terms-of-use --suggested-fee-recipient=0x