gc-merge-devnet-1 TBD-issue Jul 3 #9
lodestar-nethermind-1 was the first host to experience issues:

- The beacon node (Lodestar) stalled at ~Jul 3 2022 06:00:00 GMT+0000. Note that Lodestar is mostly single-threaded, except for BLS signature verification.
- Something in the system kernel is taking 50% of CPU, so at Jul 3 2022 07:30:00 all cores are at 100% usage.
- Nethermind appears to go offline, as Prometheus can't scrape it.
- From the logs, some authentication error started happening at Jul-02 12:36:36, repeating every ~5 min at semi-random intervals until ~Jul-03 03:00:00. After that, authentication errors started happening more frequently, including for produceBlockV2.
Also, other errors appear indicating inconsistent communication and/or clock time between the beacon node and the validator client.
Some error with execution-layer communication?
At some point the beacon node is unable to process any blocks due to timeouts.
And later Lodestar gets stuck trying to execute block 47316.
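The ramp-up in authentication-error frequency described above can be made visible by bucketing log lines per hour. A minimal sketch, assuming a hypothetical `Mon-DD HH:MM:SS` timestamp prefix (real Lodestar log lines differ in format):

```python
from collections import Counter
from datetime import datetime

# Hypothetical log excerpts for illustration; not actual Lodestar output.
log_lines = [
    "Jul-02 12:36:36 error: Authentication error on engine API request",
    "Jul-03 03:12:01 error: Authentication error on engine API request",
    "Jul-03 03:14:55 error: Authentication error for produceBlockV2",
    "Jul-03 03:15:20 error: Authentication error for produceBlockV2",
]

def bucket_by_hour(lines):
    """Count auth errors per (month, day, hour) bucket to spot a frequency ramp."""
    counts = Counter()
    for line in lines:
        if "Authentication error" not in line:
            continue
        ts = datetime.strptime(line[:15], "%b-%d %H:%M:%S")
        counts[(ts.month, ts.day, ts.hour)] += 1
    return counts

result = bucket_by_hour(log_lines)
print(result)  # one count per hour bucket; the Jul-03 03:00 bucket dominates
```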
@michaelsproul lighthouse-nethermind-3 logs around the incident. Recurring logs that happen extremely often after ~Jul-03 00:00:00.
Some notable logs I found. For temporal reference, lighthouse-nethermind-3 stalls at Jul-04 15:00:00.
At this point it's just a wall of errors, as the EL appears offline. @michaelsproul Lighthouse seems to be downscoring peers as a result of the inability to validate blocks with the EL?
As a control test I'm looking at all debug logs from Jul-04 14:58:00, since at Jul-04 14:58:30 the EL was offline. The node seems to be processing gossip blocks fine. Does that mean the communication between CL and EL is okay? There is no explicit log about the EL responding OK.
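For the control test above, isolating a narrow time window from the debug logs is the main mechanical step. A sketch of that filter, assuming the same hypothetical `Mon-DD HH:MM:SS` timestamp prefix used for illustration (not the actual Lighthouse log format):

```python
from datetime import datetime

FMT = "%b-%d %H:%M:%S"

def in_window(line, start="Jul-04 14:58:00", end="Jul-04 15:00:00"):
    """Keep only log lines whose timestamp falls inside [start, end]."""
    try:
        ts = datetime.strptime(line[:15], FMT)
    except ValueError:
        return False  # line doesn't start with a timestamp
    return datetime.strptime(start, FMT) <= ts <= datetime.strptime(end, FMT)

# Hypothetical log excerpts for illustration.
lines = [
    "Jul-04 14:57:59 DEBG Gossip block processed",
    "Jul-04 14:58:30 ERRO Execution engine offline",
    "Jul-04 15:00:01 ERRO Execution engine offline",
]

matching = [l for l in lines if in_window(l)]
print(matching)  # only the 14:58:30 line falls inside the window
```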
@michaelsproul why does Lighthouse go looking for a block? At the next slot, it rejects the block at 71203 because of a missing parent, but it can request its parent and process it this time.
Note that the snooper does not acknowledge any request at
This was an issue in v2.3.0 but has since been fixed in this commit, which is part of v2.3.2-rc.0: sigp/lighthouse@f428719. I'd suggest running the latest release (candidate) or even just
Yeah that's right, although we should maybe add a debug log
There's a mostly harmless race where we'll go looking for a block on RPC if we haven't seen it on gossip, at around the same time as it arrives on gossip. This is improved by a new PR we haven't merged yet: sigp/lighthouse#3317
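The race described above can be sketched with a "seen on gossip" set consulted both before issuing the RPC lookup and again when the response arrives, so a gossip arrival that wins the race makes the RPC response harmlessly redundant. This is a hypothetical illustration, not Lighthouse's actual data structures:

```python
# Hypothetical sketch of the gossip-vs-RPC race; names are illustrative.
seen_on_gossip = set()  # block roots already received via gossip
pending_rpc = set()     # parent roots we have asked peers for

def on_gossip_block(root, parent_root):
    seen_on_gossip.add(root)
    if parent_root not in seen_on_gossip and parent_root not in pending_rpc:
        pending_rpc.add(parent_root)  # go looking for the parent on RPC
        return "rpc_lookup"
    return "process"

def on_rpc_response(root):
    pending_rpc.discard(root)
    if root in seen_on_gossip:
        return "redundant"  # gossip won the race; drop quietly, not an error
    seen_on_gossip.add(root)
    return "process"

# Parent arrives on gossip while the RPC request is still in flight:
r1 = on_gossip_block("0xb", "0xa")  # missing parent -> "rpc_lookup"
r2 = on_gossip_block("0xa", "0x9")  # the parent itself arrives via gossip
r3 = on_rpc_response("0xa")         # late RPC response -> "redundant"
print(r1, r3)
```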
Lighthouse caches the offline status, and will do an "upcheck" (hitting
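The cached-offline behaviour could be sketched as follows: while the EL is marked offline, requests fail fast from the cache, and an upcheck is only attempted after a cooldown. This is a hypothetical illustration (the interval value and names are assumptions, not Lighthouse's real implementation):

```python
import time

UPCHECK_INTERVAL = 30.0  # seconds between upcheck attempts (assumed value)

class EngineStatus:
    """Sketch of caching an execution engine's offline status."""

    def __init__(self, now=time.monotonic):
        self.now = now          # injectable clock for testing
        self.online = True
        self.last_upcheck = 0.0

    def mark_offline(self):
        """Called when an engine API request fails; starts the cooldown."""
        self.online = False
        self.last_upcheck = self.now()

    def should_upcheck(self):
        """Only probe the engine again after the cooldown has elapsed."""
        return (not self.online
                and self.now() - self.last_upcheck >= UPCHECK_INTERVAL)

# Demo with a fake clock so the cooldown is observable without sleeping.
clock = [0.0]
status = EngineStatus(now=lambda: clock[0])
status.mark_offline()
early = status.should_upcheck()   # cooldown not yet elapsed
clock[0] = 31.0
later = status.should_upcheck()   # cooldown elapsed; probe the EL again
print(early, later)
```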
Closing stale issue
Nethermind (user list) and Lighthouse (user list) devs can ssh to the hosts (ip list here):

ssh devops@<ip>

Same as Pari's deployments.

NOTE: Currently digging in the logs, will post updates.