Reports of Windows users losing peers #13431 #13936

Open
defeedme opened this issue Apr 30, 2024 · 4 comments
Labels
Bug Something isn't working

Comments

@defeedme

defeedme commented Apr 30, 2024

Describe the bug

I spoke too soon. I'm still having the same problem; I even changed ISPs and that didn't help. I've worked with nishant on this for a long time and we tried almost every trick in the book. At one point, after changing some settings on net-time (time drift), it worked perfectly for 2 days, which was the longest stretch. The only thing that halfway works now is to literally shut down the node every 4 hours and restart it using Task Scheduler. Not a very healthy way to run a validator.

Background
Opening this because it's nuanced and hasn't been widespread; however, it does randomly happen to some Windows users,
where the OS decides to drop all packets from Prysm for some reason. This results in a reduced peer count and, eventually, all peers lost. The root cause is currently unknown, but I'm opening this as a tracking issue.

Currently, a non-trivial number of Windows users have been complaining about their number of peers
slowly reducing to zero. v1.05 has had some important changes with regard to stream management and subnet search.
However, none of these components should have had any material impact on discovery. Windows machines, being more
susceptible to clock drift, suddenly stop being able to find new peers for some reason. Being minutes ahead of or behind the network time should by rights not have any effect on finding new peers.

related prysmaticlabs/documentation#891
related #8144

Has this worked before in a previous version?

It worked for almost 3 years with no issues. Then in December 2023, out of the blue, this issue started. Geth would not sync anymore, so I switched to Nethermind, which has been great.

🔬 Minimal Reproduction

  1. Start the Prysm node.

Error

All peers are lost somewhere between 1 hour and 1 day after starting the node.

Platform(s)

Windows 10 (x86)

What version of Prysm are you running? (Which release)

latest version

Anything else relevant (validator index / public key)?

https://beaconcha.in/validator/26606#attestations
almost perfect until dec 23

@defeedme defeedme added the Bug Something isn't working label Apr 30, 2024
@web3dev2023

I have been having this same problem of losing peers on Prysm and Geth for months. I have a Windows 10 PC and a Windows 11 PC, and both were running Prysm and Geth smoothly before January, when they began losing peers.

This bug does not seem to happen on every Windows machine. I have a Windows server in NYC with the same configs and even the same database files and everything works fine.

For those who are suffering from this issue, I have a temporary workaround until this bug is fixed (using Lighthouse and static peers):
#13431 (comment)

@defeedme
Author

defeedme commented May 7, 2024

> I have been having this same problem of losing peers on Prysm and Geth for months. I have a Windows 10 PC and a Windows 11 PC, and both were running Prysm and Geth smoothly before January, when they began losing peers.
>
> This bug does not seem to happen on every Windows machine. I have a Windows server in NYC with the same configs and even the same database files and everything works fine.
>
> For those who are suffering from this issue, I have a temporary workaround until this bug is fixed (using Lighthouse and static peers): #13431 (comment)

Thanks so much for the workaround! I've been suffering for months, lol. At least now I know I'm not crazy and this is definitely a real issue. Lighthouse, here we come.

@anderspatriksvensson

anderspatriksvensson commented May 18, 2024

Having the EXACT same issue. I've been validating on Windows without any issues since testnet and genesis. Now, all of a sudden, over the last few days peers drop to zero and I must restart the Prysm validator to fix it. Even with a restart, it fails within a day or so and then I must restart again. Tagging myself in here to see if a real solution comes along; I'm not a big fan of setting static nodes.

Edit: I read the other thread, which hinted that this could be time related. Since genesis I've been syncing time with Nettime without a single issue, always staying within milliseconds of actual time, so the time drift is never larger than 20-30 ms. So time drift doesn't seem to be the cause here? I'm trying a different time server (had Google, switched to 0.nettime.pool.ntp.org) to see if this resolves it. But right now I must restart every day to keep it attesting efficiently.
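
For anyone who wants to sanity-check a drift figure like the 20-30 ms above, here is a minimal sketch of the standard NTP offset and round-trip calculation from the four timestamps of one request/response exchange. It does not query a time server and is not taken from Prysm or Nettime; the names `NtpSample`, `clock_offset`, `round_trip_delay`, and `drift_exceeds`, and the tolerance values, are hypothetical and only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NtpSample:
    """Four timestamps, in seconds, from one NTP request/response exchange."""
    t0: float  # client clock when the request was sent
    t1: float  # server clock when the request arrived
    t2: float  # server clock when the reply was sent
    t3: float  # client clock when the reply arrived


def clock_offset(s: NtpSample) -> float:
    """Estimated server-minus-client clock offset (positive: local clock is behind)."""
    return ((s.t1 - s.t0) + (s.t2 - s.t3)) / 2.0


def round_trip_delay(s: NtpSample) -> float:
    """Network round-trip time, excluding the server's processing time."""
    return (s.t3 - s.t0) - (s.t2 - s.t1)


def drift_exceeds(s: NtpSample, tolerance_s: float) -> bool:
    """True if the absolute offset is larger than the given tolerance (assumed value)."""
    return abs(clock_offset(s)) > tolerance_s


if __name__ == "__main__":
    # Made-up timestamps for illustration only; a real check would fill these
    # in from an actual NTP exchange with the configured time server.
    sample = NtpSample(t0=10.0, t1=10.75, t2=10.75, t3=10.5)
    print("offset (s):", clock_offset(sample))
    print("round trip (s):", round_trip_delay(sample))
    print("exceeds 0.1 s tolerance:", drift_exceeds(sample, 0.1))
```

If the offset computed this way stays in the tens of milliseconds while peers still drop, that would be consistent with the observation that drift alone does not explain the problem.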

@hzysvilla
Contributor

#14025 is a new report of the same issue.
