Low Peer Numbers #6805
Closed
siladu opened this issue Mar 26, 2024 · 23 comments · Fixed by #7089
Assignees: macfarla
Labels: bug (Something isn't working), mainnet, peering, P2 High (ex: Degrading performance issues, unexpected behavior of core features (DevP2P, syncing, etc.))

Comments

@siladu
Contributor

siladu commented Mar 26, 2024

Multiple reports of peering issues, this is a placeholder to track and gather info.

Anecdotally, the issues seem to have begun after the Dencun fork on March 13th and are worse during initial sync.

  • @garyschulte suspected a Dencun fork id issue on March 22nd related to geth during snap server testing...it ultimately turned out to be an old version of geth, but there was still a lack of geth peers on our canary nodes.

  • Other devs internally have noticed issues connecting to geth peers while trying to sync Holesky.

    • This node syncs basically using 2 Geth nodes and disconnects all non-Geth nodes, because they are useless. After the sync, when we stop sending snap requests we do find peers, but out of a max of 21, I have 2 Geth nodes, 1 reth node, 1 besu node, and the rest are Nethermind. Something is not right here!

  • Issue reported on geth No Inbound peers after upgrade ethereum/go-ethereum#29312

User reports from the EthStaker #besu-helpdesk Discord:

  • GGubignou — 17/03/2024 19:51

Hi, I updated to the latest Besu via eth-docker and it now seems to struggle finding peers. I was previously at 25 peers constantly but now it oscillates between 8-13 peers. Has anything been changed with regards to port forwarding in the latest version?
(FYI my consensus client Teku finds peers as usual and I have otherwise not changed anything to my setup)

  • GGubignou — 18/03/2024 16:49

Hi @siladu, after a couple hours of struggle it eventually managed to go back to 25 peers and it seems fine now. I upgrade frequently so I was likely on the previous release before that.

  • ktmkancha — 23/03/2024 01:05

Hi Siladu - I've been having trouble with peer connections. I hadn't updated my node for a few months prior to the Dencun fork, so I was probably a few updates behind. I have been hovering around 9-11 peers in the last few hours. It hasn't really climbed up or down. Any advice?

  • ktmkancha — 26/03/2024 05:28

Since the Dencun update, my peer count has been single digits. I used to get 25 peers consistently prior to this, any advice? Thank you

  • ktmkancha — 28/03/2024 02:27

Hi @siladu Just wanted to report that after the Besu node synced up with the top of the chain, the peer count is back up to 15+. It's better than single digits, but compared to before the Dencun update it's still 10 fewer peers. I used to get 25 peers consistently. Will keep an eye out for any Besu updates! Thanks

  • MrPrezident — 24/03/2024 03:08

I noticed that while syncing besu, it is consistently using only 1 peer, whereas teku currently has 24 peers. Is that because I'm doing a checkpoint sync?

  • MrPrezident — 24/03/2024 06:32

Teku is connected to 28 peers, so I don't think it is having any problems. Only Besu is having issues with peers. Both Teku and Besu are being run on the same host using eth-docker.
The only firewall should be my pfSense router, but I have port forwarding turned on in pfSense.

  • MrPrezident — 24/03/2024 14:26

Just for testing purposes, I reconfigured my eth-docker to use geth/prysm and restarted syncing to see if it would have the same problem as besu/teku. It did finish syncing on both execution and consensus clients, but I noticed that the consensus client (prysm) is hovering at about 70 peers and the execution client (geth) is hovering at about 5 peers. That seems like a huge discrepancy. That is sort of in line with what I was seeing with besu/teku: Besu was only getting 1 or 2 peers vs about 20 peers with Teku. Is it normal to have fewer peers with the execution client? How many peers should I expect to see with the execution client?

  • MrPrezident — Yesterday at 00:51

This morning I looked at my dashboard and now I see 33 execution peers and 72 consensus peers (this is with geth/prysm). I didn't change anything. I just let it run overnight. I assume this is a normal range? It would be good to know how many peers I should expect. (this is on holesky)

  • scott — 26/03/2024 05:32

Also happening here, seems to be a known issue with an unknown root cause right now.

  • All Might — 26/03/2024 06:01

Same here, my peers have been at 5 the entire day. Has happened several times in the last 5 days and I don't know how to fix it

  • RemiRaznor — 28/03/2024 00:01

I'm having the same issue with low to no peer count. I'm trying to rebuild my Besu database after switching from goerli to holesky. For 2 days, I have gotten "Unable to find sync target. Currently checking 0 peers for usefulness" messages. Once or twice it found 1 or 2 peers, enough for the download progress to reach just 13.49%. I have restarted besu 4 or 5 times in the last 2 days with no success. This validator was working fine on goerli. All ports are open.

  • RemiRaznor — 28/03/2024 06:10

I finally got it to sync by adding the --p2p-port=30303 flag to the Besu service. After restarting the EL, I was able to connect with one peer just long enough to get the download started again, and then I got one more peer which got me to 57%, and then I lost the peers again and stalled. I started adding the other ports (RPC port, engine port and metrics port) one by one, and after adding the metrics port the sync started up again. Not sure if this will help but good luck...
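
For anyone triaging reports like the ones above, a quick first check is the execution client's own peer count via the standard net_peerCount JSON-RPC call, plus a reachability test of the P2P port. A minimal sketch, assuming Besu's HTTP JSON-RPC is enabled on the default port 8545 and the default P2P port 30303 (the public IP is a placeholder):

# Peer count as reported by Besu (hex string, e.g. "0x19" = 25 peers)
curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"net_peerCount","params":[],"id":1}' \
  http://localhost:8545

# From a machine outside your LAN, confirm the P2P TCP port is reachable
nc -vz <your-public-ip> 30303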

@pinges
Contributor

pinges commented Mar 26, 2024

Some examples of current peers on some mainnet nodes:

25 peers on dev-elc-besu-teku-mainnet-dev-stefan-BWS-2:

erigon/v2.59.0-0f0a4452/linux-amd64/go1.21.8","caps":["eth/68"]
Geth/v1.13.14-stable-2bd6bd01/linux-amd64/go1.21.6","caps":["eth/68","snap/1"]
Nethermind/v1.25.4+20b10b35/linux-x64/dotnet8.0.2","caps":["eth/66","nodedata/1","eth/67","eth/68"]
erigon/v2.58.2-125509e4/linux-amd64/go1.21.5","caps":["eth/66","eth/67"]
erigon/v2.58.1-f12e451c/linux-amd64/go1.21.4","caps":["eth/68"]
erigon/v2.58.1/linux-amd64/go1.21.6","caps":["eth/68"]
erigon/v2.58.1/linux-amd64/go1.21.6","caps":["eth/66","eth/67"]
erigon/v2.58.1-f12e451c/linux-amd64/go1.21.5","caps":["eth/66","eth/67"]
Nethermind/v1.25.4+20b10b35/linux-x64/dotnet8.0.2","caps":["eth/66","nodedata/1","eth/67","eth/68","snap/1"]
reth/v0.1.0-alpha.21","caps":["eth/68","eth/67","eth/66"]
erigon/v2.59.0-0f0a4452/linux-amd64/go1.21.6","caps":["eth/66"]
erigon/v2.58.1-f12e451c/linux-amd64/go1.21.5","caps":["eth/68"]
erigon/v2.59.0-0f0a4452/linux-amd64/go1.21.6","caps":["eth/66","eth/67"]
Nethermind/v1.25.4+20b10b35/linux-x64/dotnet8.0.2","caps":["eth/66","nodedata/1","eth/67","eth/68","snap/1"]
reth/v0.1.0-alpha.22-a9ac903ee/x86_64-unknown-linux-gnu","caps":["eth/68","eth/67","eth/66"]
reth/v0.1.0-alpha.21","caps":["eth/68","eth/67","eth/66"]
reth/v0.1.0-alpha.21","caps":["eth/68","eth/67","eth/66"]
reth/v0.1.0-alpha.21","caps":["eth/68","eth/67","eth/66"]
erigon/v2.59.0-0f0a4452/linux-amd64/go1.21.6","caps":["eth/68"]
reth/v0.1.0-alpha.21","caps":["eth/68","eth/67","eth/66"]
erigon/v2.58.0-9731f716/linux-amd64/go1.20.3","caps":["eth/66","eth/67"]
reth/v0.1.0-alpha.21","caps":["eth/68","eth/67","eth/66"]
erigon/v2.58.1-f12e451c/linux-amd64/go1.22.0","caps":["eth/66","eth/67"]
Nethermind/v1.26.0-exp.3+e02b744c/linux-x64/dotnet8.0.2","caps":["eth/66","eth/67","eth/68","nodedata/1","snap/1"]
erigon/v2.58.2-125509e4/linux-amd64/go1.20.14","caps":["eth/68"]

prd-elc-besu-teku-mainnet-bonsai-snap
Note: most of the Geth nodes connected here are version 1.1.8, released in 2020? Except for two Geth 1.13.x peers which are up to date.
I checked the head block for some of the Geth 1.1.8 peers and they were all after the Cancun hardfork.
The 1.1.8 nodes support more than just eth/68.

Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"] head:0x960cc4e7981bfc70ad8d4ad407d31339b2fd32f4dbf55950b18574effc2eca98
erigon/v2.58.1/linux-amd64/go1.21.7","caps":["eth/66","eth/67"] head:0x1315e4118e98b7caac01e590530bfe2a98e31c5fe0980a1794fd78a2d61a2c3d
Nethermind/v1.25.4+20b10b35/linux-x64/dotnet8.0.2","caps":["eth/66","nodedata/1","eth/67","eth/68"] head:0x03acdf02f882cfe1569fb2c19a75f03e8ec0d0dea99a5a7c512a57030be45a8b
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"] head:0x618862fd4423255f62918cc49fd4a6c5abda707f6f66b276dc7acc263a3946f6
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"] head:0x004c0539eb79a8a535c725c435dfedad16ee6c1c970b14bffbafca2a524a7890
reth/v0.1.0-alpha.10","caps":["eth/68","eth/67","eth/66"] head:0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"] head:0xa8efe59d83d87f93d0a2e61dec881ebbed5688127bf025799462024183b5a9d5
Nethermind/v1.25.4+20b10b35/linux-x64/dotnet8.0.2","caps":["eth/66","nodedata/1","eth/67","eth/68"] head:0x5a3e10c0fe9f6b29ce46d0b01f904906ddb243ca48bd4ea9fe37fed5d960d522
Nethermind/v1.25.4+20b10b35/linux-x64/dotnet8.0.2","caps":["eth/66","nodedata/1","eth/67","eth/68"] head:0xd631df240616de2d7e1baa8c8236d5b5711cd52310f47808eafcfc7d2ba650ec
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"] head:0x78cb37deca8107b53310e25a848ce097adec57be03481f5e2012f143e22b4f2e
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"] head:0xafb9d777feef06413496d16fb19fd21bec1c7d6d1f53ec784091f61f9e93355d
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"] head:0xa92c29408b50d5b4084d6d837f280ed0e6ec80a625770beb23627fc3e9583f48
erigon/v2.59.0-dev-729e82db/linux-amd64/go1.21.6","caps":["eth/66","eth/67"] head:0x7a33ad410557e24adcb96a72e427fed177f921f09dfdf08b27f9e409bdfeea49
erigon/v2.59.0-0f0a4452/linux-amd64/go1.20.14","caps":["eth/68"] head:0x6d5e4f3afba38fedd2cefab81fb0c503b75fcc4ca4ea4912de694ac2eddb4b31
Geth/v1.13.14-stable-d5a69be9/linux-amd64/go1.22.0","caps":["eth/68","snap/1"] head:0x19c969aa9e9058a2280b17545c4790f397f66e7e19a4e2e546fea9162d93de0e
Geth/v1.13.14-stable-2bd6bd01/linux-amd64/go1.21.6","caps":["eth/68","snap/1"] head:0xca3e61fa5986b66be3dc13b52b78088220537a6a693deadcc02054d5e0769c2c
erigon/v2.59.0-0f0a4452/linux-amd64/go1.21.6","caps":["eth/66","eth/67"] head:0x962fd27661c398b104074b9af8f775d039124f034008f713e74b1b4e42cd23ed
erigon/v2.59.0-7f28419f/linux-amd64/go1.20.14","caps":["eth/66"] head:0xa8efe59d83d87f93d0a2e61dec881ebbed5688127bf025799462024183b5a9d5
besu/v24.1.2/linux-x86_64/openjdk-java-17","caps":["eth/63","eth/64","eth/65","eth/66","eth/67","eth/68","snap/1"] head:0x30c53e9e18a555f9e8ecdf96a1212f1b39bae5ab64fb91d971b3f4b487235fe3
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"] head:0x6a8cab760d83df575477c51189a598c17903818cb045653a944650ff4a42c4a4
erigon/v2.58.2-125509e4/linux-amd64/go1.21.5","caps":["eth/68"] head:0x759e27a5069535949f0a7247ebc999367dbd77964d77ed004ffc8db3d4940248
besu/v24.3.0/linux-x86_64/openjdk-java-17","caps":["eth/63","eth/64","eth/65","eth/66","eth/67","eth/68","snap/1"] head:0x82e90be44ae7285a8ebe930361a5603cde6e385e262efc13dff884bb6023c9dc
erigon/v2.58.2-125509e4/linux-arm64/go1.20.14","caps":["eth/68"] head:0xf7e22e2e8c35037af82c95f39248ed8f507b855c6e9692917678fe93cc391f2b
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"] head:0x960cc4e7981bfc70ad8d4ad407d31339b2fd32f4dbf55950b18574effc2eca98
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"] head:0x64c3e5f9b4686fb9759881e14a76b28f7a8a7a9b5e5543d1add51026a1dfbae8

prd-elc-besu-nimbus-mainnet-nightly-bonsai-snap
Note: looks a lot like the last node!

Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"]head: 0x466cda14dce6f9a4eefb62ef6452df2e85a182b075a51231cfef396eadde0708"
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"]head: 0xfffc4c928eb7f2b58fbd6bd3d870d013c99835f53a866274795131bf8528a857"
besu/v24.3.1-SNAPSHOT/linux-x86_64/openjdk-java-17","caps":["eth/63","eth/64","eth/65","eth/66","eth/67","eth/68","snap/1"]head: 0xf90e0395632d3ea84ae198b20908966b7dd4a2258a1021116cd19a8b2c1a65c4"
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"]head: 0xa3513dfb4012f7bf2cbc17785de83552e08b0f02832eee39804bb6e917c6c658"
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"]head: 0x87e1abc77e5380e7ba60b8b01973bbae6c9436d6523fb5b34b2d85658cadaf08"
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"]head: 0x0f7c4af187e50e0f1ed0b725c4741aeed693ad7db13863780f0d56acba0ce65a"
reth/v0.1.0-alpha.22-a9ac903ee/x86_64-unknown-linux-gnu","caps":["eth/68","eth/67","eth/66"]head: 0xe8fa17e0eadd1cff120a979f3331d13d90ddf4383df4bcabd5e95e6f34261d7c"
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"]head: 0xe2ec61193fb6aabbf3ec080f308a92419c9cf813c585de68451cf68bfa639769"
erigon/v2.58.2/linux-amd64/go1.21.4","caps":["eth/66","eth/67"]head: 0x40a79ce0537068c6c96416520857e5002144738a8c9ad3ad77c656ee9e43be0b"
Nethermin/v1.25.4+20b10b35/linux-x64/dotnet8.0.2","caps":["eth/66","nodedata/1","eth/67","eth/68","snap/1"]head: 0x925f7a7d4c9a6cbfef1e309a7054c541e53b08e6104da61289ae6ed351ecf9e9"
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"]head: 0x14bf4dc31c9b81cd290506757d5b8717bc81ed8187a92b6ca83c97371da98040"
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"]head: 0xf90e0395632d3ea84ae198b20908966b7dd4a2258a1021116cd19a8b2c1a65c4"
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"]head: 0x000f215a3b5fa3c87eec3c786ceba11aa76d969831d3499d49577f2b4446c231"
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"]head: 0xb91faa6e8f45401eabe750758a260cbb02962956b34a56d49227a8187e4fcda9"
Geth/v1.13.13-stable/linux-amd64/go1.22.1","caps":["eth/68","snap/1"]head: 0x616903f8b03ca625ad93cc2901ba8503ed49036fa7401e82941f00e2f5fc83f5"
Nethermin/v1.25.4+20b10b35/linux-x64/dotnet8.0.2","caps":["eth/66","nodedata/1","eth/67","eth/68"]head: 0xda8af43b3c22e19b8d0572ceefaab2e45b636820c9a92e67f043265bdfabf65e"
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"]head: 0x000f215a3b5fa3c87eec3c786ceba11aa76d969831d3499d49577f2b4446c231"
reth/v0.1.0-alpha.22-a9ac903/x86_64-unknown-linux-gnu","caps":["eth/68","eth/67","eth/66"]head: 0xad48092a1c4210a4a1905fb53b89fd402fd973609257beb703691961b44a0c9c"
Geth/v1.13.14-stable-aaff12e3/linux-amd64/go1.22.0","caps":["eth/68","snap/1"]head: 0x6f42582ed73ed2927a8dfb5f6199dd4d51957b6422c87c610d225a6b81e2a53a"
Nethermin/v1.25.4+20b10b35/linux-x64/dotnet8.0.2","caps":["eth/66","nodedata/1","eth/67","eth/68","snap/1"]head: 0x3b2afe1a830ffbf5eb7603bfa80ed02fe8be40ecd25b6755a629f4ef5571ccd0"
erigon/v2.58.2-125509e4/linux-amd64/go1.20.2","caps":["eth/66","eth/67"]head: 0xd405b1a8c724d0b6bd2b837b76f1352063e69edc8c4ca611585613449e2d71bf"
besu/v24.3.0/linux-x86_64/openjdk-java-17","caps":["eth/63","eth/64","eth/65","eth/66","eth/67","eth/68","snap/1"]head: 0x8885dd6d5bde29931b63bed1f5e732575aab7d5c3ca0953f3378e83510392220"
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"]head: 0x23369871d0083896cf75f688173fd6f4a0fdd08ce5f051716ccd9f2dcec64ee2"
Geth/v1.1.8-859186f2/linux-amd64/go1.16.13","caps":["eth/66","eth/67","eth/68"]head: 0x000f215a3b5fa3c87eec3c786ceba11aa76d969831d3499d49577f2b4446c231"
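
For reference, peer listings like the ones above can be pulled from a running node with the admin_peers JSON-RPC method. A minimal sketch, assuming the ADMIN API has been enabled (e.g. --rpc-http-api=ETH,NET,ADMIN) and jq is installed; the exact field layout may vary slightly between versions:

# Print client id, capabilities and advertised head for each connected peer
curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"admin_peers","params":[],"id":1}' \
  http://localhost:8545 \
  | jq -r '.result[] | "\(.name) caps:\(.caps) head:\(.protocols.eth.head)"'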

@siladu
Contributor Author

siladu commented Mar 26, 2024

most of the Geth nodes connected here are version 1.1.8

I wonder if this is a misreported version, but if it's not then this is the commit hash...
ethereum/go-ethereum@859186f2

From Jan 2022.
Merge pull request #743 from binance-chain/develop !

I checked the head block for some of the Geth1.1.8 peers and they were all after the cancun hardfork

That lends more weight to this being a misreported version I guess

@non-fungible-nelson non-fungible-nelson added bug Something isn't working mainnet peering P1 Very High (ex: Security issues, unplanned backward incompatibility preventing network participation) labels Mar 26, 2024
@non-fungible-nelson
Contributor

Some discussion on #execution-dev ...

@siladu
Contributor Author

siladu commented Mar 28, 2024

Updated description with more reports. We are still actively looking into this, hopefully have an update soon.

@siladu
Contributor Author

siladu commented Mar 28, 2024

We have identified an issue related to blob transaction gossip that results in Geth peers sometimes disconnecting Besu. We are currently testing a fix. This should only be an issue once the node is in sync (since transaction gossip is paused during sync).

There may still be a separate issue related to peer discovery, which we are still investigating.

@macfarla
Contributor

macfarla commented Mar 29, 2024

The blob transaction gossip fix has been merged into main - more info: #6777

@Beanow

Beanow commented Mar 31, 2024

FWIW, I observed the same on Holesky with 24.3.0. I raised the peer limit from 25 to 40, but during snap sync there were essentially only 0-2 peers that I actually synced with. Towards the very end I started to get more peers. It never hit the 40 limit in the 12 hours it's been running.

I tried making sense of debug logs but don't have much context (new to the staking scene).
It's a little bit verbose and weighs in at 581MB 😂
Trimming some json RPC blocks it's still 230MB, so let me know if you need me to share it.

It was clearly trying though, lots of disconnects.

I presume the "Disconnected USEFULL peer" might be most important from those DCs.
Filtering like cat orig.log| grep --no-group-separator -A1 -E 'Disconnected USEFULL peer' > useful-dc.log.
useful-dc.log

I'll try again with yesterday's 24.3.1 release now.
Edit: this seems a lot better. With a synced client I no longer DC from useful peers.
2x Disconnect - Inbound - 0x04 TOO_MANY_PEERS and that's it.
Peer count recovered seemingly a lot faster too.
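
For reference, the peer limit mentioned at the top of this comment is Besu's --max-peers option, which defaults to 25. A minimal sketch of raising it, shown as a plain CLI invocation rather than eth-docker's templating:

# Allow up to 40 peers instead of the default 25
besu --network=holesky --max-peers=40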

@Beanow

Beanow commented Mar 31, 2024

With 24.3.1 I've not been able to resync on holesky 🥲
This seems like a regression from 24.3.0. The 2 bootnodes will support the snap syncing up to a certain point, then a flurry of issues shows up. Both sides seem to be saying USELESS_PEER and I'm down to 0 peers at block 3000 or 4000.

2024-03-31 19:36:04.672+00:00 | EthScheduler-Services-42 (importBlock) | TRACE | ImportBlocksStep | Imported block 99 (0x13a0dce9e9be317f1557d573fdc1fca28b1530f97454fae64367d87a56ff1a17)
2024-03-31 19:36:04.672+00:00 | EthScheduler-Services-42 (importBlock) | TRACE | IncrementalTimestampRule | Is block timestamp more recent that its parent? true, [block timestamp 1695904008, parent timestamp 1695903984]
2024-03-31 19:36:04.674+00:00 | EthScheduler-Services-42 (importBlock) | DEBUG | TrailingPeerLimiter | Enforcing trailing peers limit (min height 1255002, max trailing peers 0) by disconnecting 0xac906289e4b7f12d...... with height 1255001
2024-03-31 19:36:04.674+00:00 | EthScheduler-Services-42 (importBlock) | TRACE | EthPeer | handleDisconnect - EthPeer PeerId: 0xac906289e4b7f12d...... PeerReputation score: 109, timeouts: {}, useless: 0, validated? true, disconnected? true, client: Geth/v1.13.14-stable-2bd6bd01/linux-amd64/go1.21.7, [Connection with hashCode -1312496927 inboundInitiated? false initAt 1711913652228], enode://ac906289e4b7f12d...:30303
2024-03-31 19:36:04.674+00:00 | EthScheduler-Services-42 (importBlock) | TRACE | EthPeer | Received useless response for request type headers from peer 0xac906289e4b7f12d......
2024-03-31 19:36:04.674+00:00 | EthScheduler-Services-42 (importBlock) | TRACE | EthPeer | Received useless response for request type receipts from peer 0xac906289e4b7f12d......
2024-03-31 19:36:04.674+00:00 | EthScheduler-Workers-0 | DEBUG | DownloadHeaderSequenceTask | Downloading headers from 1401 to 1600.
2024-03-31 19:36:04.675+00:00 | EthScheduler-Services-42 (importBlock) | DEBUG | GetReceiptsForHeadersTask | Requesting bodies to complete 117 blocks, starting with 802.
2024-03-31 19:36:04.675+00:00 | EthScheduler-Services-42 (importBlock) | DEBUG | EthPeers | Disconnected USEFULL peer PeerId: 0xac906289e4b7f12d...... PeerReputation score: 109, timeouts: {}, useless: 2, validated? true, disconnected? true, client: Geth/v1.13.14-stable-2bd6bd01/linux-amd64/go1.21.7, [Connection with hashCode -1312496927 inboundInitiated? false initAt 1711913652228], enode://ac906289e4b7f12d...:30303
2024-03-31 19:36:04.675+00:00 | EthScheduler-Services-42 (importBlock) | DEBUG | EthProtocolManager | Disconnect - Outbound - 0x03 USELESS_PEER - 0xac906289e4b7f12d... - 2 peers left
2024-03-31 19:36:04.675+00:00 | EthScheduler-Services-42 (importBlock) | TRACE | EthProtocolManager | 2 EthPeers {
PeerId: 0xa3435a0155a3e837...... PeerReputation score: 110, timeouts: {}, useless: 0, validated? true, disconnected? false, client: Geth/v1.13.14-stable-2bd6bd01/linux-amd64/go1.21.7, [Connection with hashCode -555409451 inboundInitiated? false initAt 1711913651869], enode://a3435a0155a3e837c02f5e7f56...:30303, 
PeerId: 0x3bd1127b27212d2d...... PeerReputation score: 116, timeouts: {}, useless: 0, validated? true, disconnected? false, client: erigon/v2.58.1-f12e451c/linux-amd64/go1.21.7, [Connection with hashCode -975701323 inboundInitiated? false initAt 1711913652436], enode://3bd1127b27212d2d165ffaa7ae1e0efbdfdf...:30304}
2024-03-31 19:36:04.675+00:00 | EthScheduler-Services-42 (importBlock) | TRACE | AbstractPeerConnection | Writing DisconnectMessage{code=1, size=2} to PeerInfo{version=5, clientId='Geth/v1.13.14-stable-2bd6bd01/linux-amd64/go1.21.7', capabilities=[eth/68, snap/1], port=0, nodeId=0xac906289e4b7f12df423d654c5a962b6ebe5b3a74cc9e06292a85221f9a64a6f1cfdd6b714ed6dacef51578f92b34c60ee91e9ede9c7f8fadc4d347326d95e2b} via protocol null
2024-03-31 19:36:04.675+00:00 | EthScheduler-Services-42 (importBlock) | DEBUG | AbstractPeerConnection | Disconnecting connection -1312496927, peer 0xac906289e4b7f12d... reason 0x03 USELESS_PEER

Deleting all data and reverting to 24.3.0 I'm able to keep syncing with the bootnodes.

@siladu
Contributor Author

siladu commented Mar 31, 2024

Thanks for the analysis @Beanow! Will take a closer look at your logs. What you're describing is similar to what I've seen in my analysis so far: on Holesky at least, it can sync with the bootnodes but sometimes disconnects them due to empty responses, leading us to treat them as useless peers (correctly). This can happen if the peer is busy, as I expect a bootnode would be.

The problem during Holesky sync is that it isn't finding new peers, or maybe it finds and disconnects them quickly. This is intermittent...sometimes restarting Besu finds a new non-bootnode peer, but peer numbers still remain low until after syncing.

Mainnet is a different story: maybe lower numbers during sync too but >> 0. Primary issue on mainnet seems to be the post-sync blob tx gossip disconnects which 24.3.1 (24.4.0) should have resolved.

This seems like a regression from 24.3.0

I think it might just be intermittent/peering luck, as I've been investigating the same problem on 24.3.0 and 24.1.2. But I will try to confirm.

@siladu
Contributor Author

siladu commented Mar 31, 2024

Also worth noting that until 24.3.1/24.4.0, our Holesky config relied entirely on the bootnodes for finding new peers; we added the DNS discovery server during this investigation: 63a53aa

Despite this, Holesky found peers fine until recently.

@Beanow

Beanow commented Mar 31, 2024

It did seem more consistent than luck. I've tried to wipe the data and resync on 24.3.0 and 24.3.1, 3 or 4 times each.
And with the patch, the bootnodes DC after some time. I think that might be why it doesn't recover either: it sits at like 4k blocks and reports (on Prometheus at least) "hey, I'm 0 blocks behind, all synced!" Possibly that made new peers decline us?
While 24.3.0 is able to keep at least 1 of the 2 bootnodes for long enough to sync fully.

@siladu
Contributor Author

siladu commented Mar 31, 2024

@Beanow When you wipe the data do you also wipe the node key? Retaining the node key may cause us to remain on other peers' good or bad lists and produce more consistent results.

@Beanow

Beanow commented Mar 31, 2024

I believe the node key is in $DATA_PATH/key, right? Then yes, Rocket Pool's script discards the Docker volume that has the data path.
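
For reference, Besu keeps the node's private key (which determines its enode identity) in a file named key at the root of the data path, so discarding the whole data directory resets both the database and the identity. A minimal sketch, with an illustrative data path and Besu stopped:

# The node key sits next to the chain data
ls -l /var/lib/besu/key
# A full wipe removes the database and the node identity together
# rm -rf /var/lib/besu/*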

@siladu
Contributor Author

siladu commented Mar 31, 2024

it sits at like 4k blocks and reports (on prometheus at least) hey I'm 0 blocks behind, all synced!

Hopefully that's just a metric/dashboard being misleading...we should be sending the correct head information to the peers over the RLPx layer: https://github.com/ethereum/devp2p/blob/master/rlpx.md
...if our peering logic even makes it that far :)

@Beanow

Beanow commented Mar 31, 2024

The metric is ethereum_best_known_block_number and on 24.3.1 it seems to go up to the expected ~1.2 million region when we go from 0 to 1 peer. But when they leave, it drops back to what is reported as our own height by ethereum_blockchain_height.
[screenshot of the metrics attached]

Edit: and looking around the metrics code I spotted besu_synchronizer_in_sync, which also goes to 1 whenever we're at that low block height with 0 peers.
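
A quick way to watch those three metrics directly is to scrape Besu's Prometheus endpoint. A minimal sketch, assuming metrics are enabled (--metrics-enabled) and exposed on the default metrics port 9545:

# Pull the sync-related gauges out of the metrics endpoint
curl -s http://localhost:9545/metrics \
  | grep -E '^(ethereum_best_known_block_number|ethereum_blockchain_height|besu_synchronizer_in_sync)'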

@siladu
Contributor Author

siladu commented Mar 31, 2024

I've been analysing the logs of a holesky node that dropped all peers during sync and has been sitting on 0 peers for days.
Most interesting things I've found so far...

  1. We got "Failed to connect to peer" for 2200 unique peers during a period of ~4 hours with 0 peers on holesky.
    That's out of 112926 attempts in total, so on average 50 "retries" per peer.

About 45% of these are "Timed out waiting to establish connection with peer", which means we've found the peer to be on the correct network + forkId ("ForkId OK..."), but have waited 10 seconds between "Prepared ECIES handshake..." and "Failed to connect to peer":

{"@timestamp":"2024-03-30T09:40:00,227","level":"DEBUG","thread":"pool-8-thread-1","class":"EthProtocolManager","message":"ForkId OK or not available for peer 0xc13e9d36ae609738...","throwable":""}
{"@timestamp":"2024-03-30T09:40:00,227","level":"TRACE","thread":"pool-8-thread-1","class":"RlpxAgent","message":"Initiating connection to peer: enode://c13e9d36ae609738ada0b174972148205eb73d0301e517a6d5800ae74179a3918dd600b089c206a2aff7bf1252002fae5b37a287e0dc29989dbaadb28fed31ab@135.181.65.228:30303","throwable":""}
{"@timestamp":"2024-03-30T09:40:00,227","level":"TRACE","thread":"nioEventLoopGroup-3-7","class":"ECIESHandshaker","message":"Prepared ECIES handshake with node 0xc13e9d36ae609738ada0b17497214820... under INITIATOR role","throwable":""}
{"@timestamp":"2024-03-30T09:40:10,227","level":"DEBUG","thread":"nioEventLoopGroup-3-7","class":"RlpxAgent","message":"Failed to connect to peer 0xc13e9d36ae609738ada0b174972148205eb73d0301e517a6d5800ae74179a3918dd600b089c206a2aff7bf1252002fae5b37a287e0dc29989dbaadb28fed31ab: java.util.concurrent.TimeoutException: Timed out waiting to establish connection with peer: 0xc13e9d36ae609738ada0b174972148205eb73d0301e517a6d5800ae74179a3918dd600b089c206a2aff7bf1252002fae5b37a287e0dc29989dbaadb28fed31ab","throwable":""}

Two issues I want to confirm with this:
a. Are we starving the peering resources by waiting 10 seconds for this? We appear to only use 10 worker threads for this type of connection...

b. Why do we end up going through this with the same peers repeatedly...on average 50 times in 4 hours. The peer in the logs above is an outlier, we got "Failed to connect to peer" for enode://c13e9d36... 702 times in 4 hours :(

  2. Found an IP, 65.108.1.189, with 162 unique enodes attached...wondering if this could be a discovery attack. Most of them yield a TOO_MANY_PEERS disconnect so not sure it's malicious, but so many enodes resolving to one IP is a bit sus. Note these could be enodes intended for some other network, maybe not even Ethereum - I've seen Polygon peers in the logs, for example.
$ cat /var/log/besu/besu.log | grep -o 'enode://[^@]*@65.108.1.189:[0-9]*?discport=[0-9]*' | sort | uniq -c | sort
    190 enode://f19199dfc0e440eb7fbda76b909e654392fa6abc32bf7238774e7d6a39d0764e76d9301515b995b71dbe89f2c4b9d9b8fcc730419d1fc068815a361c574eacc8@65.108.1.189:30303?discport=30493
    272 enode://606161aaea58a7afd6db21cd8180a6df6e76f73b6ff219e4e2439250046e73baf24fd5680d446c94526a648c3491e634a85e605b1b07b409ca675b47a621d607@65.108.1.189:30303?discport=30503
...
  29769 enode://5508ea580420555129a3bd3c598b9a8d4cc86966987e9caeaaac989a7f07ca533a3d45a43aa535f7a63dccd0d55ba963b021b451f4752702042010f44766a316@65.108.1.189:30303?discport=30373
  93660 enode://e0046ba9621201bcfbaad95bc318a91770ec43b962eb6bbbbfb5501a1bab33974a6f84a19281a5bec7dd9f86d5e6c6da7f1a023eefe98fa5064ab4efc9fe622c@65.108.1.189:30303?discport=30318
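
A rough way to reproduce the per-peer retry counts above from the same logs, a minimal sketch assuming the JSON log format shown in the excerpt:

# Count "Failed to connect to peer" occurrences per node id, highest first
grep -o 'Failed to connect to peer 0x[0-9a-f]*' /var/log/besu/besu.log \
  | sort | uniq -c | sort -rn | head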

@Beanow

Beanow commented Mar 31, 2024

Seeing a lot of that Hetzner IP too, but haven't gotten a HELLO user agent yet. They do seem to report themselves as neighbors. Might also be a proxy or load balancer of sorts?

Still looking at the unique peers number you have and the numbers by DC reason, I think it should have gravitated towards good peers before long. As in, you find a few, and they will mention neighbors that are probably good too.

But the reasons are all over the place.
[screenshot of disconnect reasons attached]

so on average 50 "retries" per peer.

I haven't crunched the numbers on this yet, but I've also seen the same peers being tried over and over while they're on a different network ID, or have no capabilities overlap.

@siladu
Contributor Author

siladu commented Apr 5, 2024

Update: we have included several fixes in a release candidate, 24.4.0-RC2, which is currently undergoing testing and thus not ready for production; look out for announcements on Discord.

Notably #6777, which should hopefully stop in-sync nodes from disconnecting peers.

We are still investigating other peering issues, especially during Holesky sync. In our testing, Sepolia and mainnet peers are performing reasonably during sync using 24.4.0-RC2.

@macfarla macfarla added P2 High (ex: Degrading performance issues, unexpected behavior of core features (DevP2P, syncing, etc)) and removed P1 Very High (ex: Security issues, unplanned backward incompatibility preventing network participation) labels Apr 8, 2024
@macfarla macfarla assigned macfarla and unassigned jflo Apr 8, 2024
@siladu
Contributor Author

siladu commented Apr 10, 2024

Update: We're not yet confident enough in 24.4.0-RC2 to release it: it may have some issues unrelated to peering, which we are still investigating.

At the same time, we are preparing a hotfix release which contains the fix for the known mainnet peering issue related to blobs.

@siladu
Contributor Author

siladu commented Apr 18, 2024

Update: the https://github.com/hyperledger/besu/releases/tag/24.3.3 hotfix has been released, which should help Besu keep hold of Geth peers.

Note, there still appear to be some issues finding peers, particularly on the testnets. If you still experience low peers on mainnet for a prolonged period, please let us know. We would regard it as normal to take 30-60 minutes to find 25 peers on mainnet. From our testing, we are aware that some nodes are taking as long as 15 hours though.
Also note, low peer numbers are not expected to impact your ability to follow the chain and attest; if that is impacted, it is likely a separate issue.

We are still working on the next major release, which has more peering improvements but was found to introduce an unrelated syncing bug, so it has been delayed.

@shemnon
Contributor

shemnon commented Apr 18, 2024

Peering on testnets is always a problem, regardless of client; I've experienced issues with Geth and Reth as well. The nodes tend to be more static and fewer in number, resulting in more "cliques".

@siladu
Contributor Author

siladu commented Apr 18, 2024

Peering on testnets is always a problem, regardless of client; I've experienced issues with Geth and Reth as well. The nodes tend to be more static and fewer in number, resulting in more "cliques".

@shemnon Indeed. There did seem to be a marked difference shortly following Dencun (if not the fork itself, then perhaps the client updates around Dencun time). Mixed reports about success on Holesky especially. We have significant issues getting beyond the bootnodes on a lot of our new Holesky nodes. I'm sure some of it is noticing pre-existing issues now that we're looking more closely.
