OOM and Utilization Issues when using Prysm v5 #14020

Open
fhildeb opened this issue May 17, 2024 · 13 comments
Labels
Bug Something isn't working

Comments

@fhildeb

fhildeb commented May 17, 2024

Describe the bug

I'm running a Prysm validator on LUKSO (a Layer 1 EVM network, currently up to date with Shanghai-Capella).
Ahead of the upcoming Cancun-Deneb fork, other home stakers and I upgraded to Prysm v5.0.3.

Since upgrading, I have:

  • much higher CPU usage (11-30% instead of 3-5%) and temperatures (75-85°C instead of 35-40°C)
  • the occupied physical memory constantly grows until I get an OOM error

The EL client stayed the same the whole time (Geth v1.14.0).

After reaching the maximum memory, the CPU spikes up (to 75%) until Prysm crashes. Until the OOM error from the OS, there are no visible warnings or errors in the logs. I'm using 32GB of RAM, so the Prysm client crashes after 48-55 hours. Other LUKSO community members running Prysm validators reported similar errors after upgrading; for those with only 16GB of RAM, the client crashes after just around a day.

Every time it crashed, I reverted back one version to narrow down where the root cause was introduced. So far, I've gotten the same OOM issue on v5.0.3, v5.0.2, v5.0.1, and v5.0.0, leading me to conclude that it was introduced with v5. When downgrading to v4.2.1, everything returns to normal, and the physical memory of the validator and consensus client combined does not grow beyond 5GB.

Each time Prysm crashed, I started with a clean setup, removing all blockchain data gathered during the previous attempt, and used checkpoint sync to quickly get back online. It might therefore be that this memory issue only occurs while the EL client is syncing. However, I did not investigate this further, and it is purely speculative.
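
For context, checkpoint sync in Prysm is driven by the --checkpoint-sync-url and --genesis-beacon-api-url flags. A minimal sketch of the corresponding prysm.yaml entries is shown below; the endpoint is a placeholder, not the actual LUKSO checkpoint provider used in this setup:

```yaml
# Hypothetical prysm.yaml excerpt for checkpoint sync.
# The URL below is a placeholder, not the endpoint used here.
checkpoint-sync-url: "https://checkpoints.example.org"
genesis-beacon-api-url: "https://checkpoints.example.org"
```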

I've also seen other issues being opened about OOM lately:

As well as a draft PR about a potential memory bugfix:

Would love to know:

  • What's going on with the increased CPU usage, or whether it is related to the growing memory
  • Whether there are certain flags/configurations necessary to reduce resource usage

Monitoring v5.0.2

[Screenshot: monitoring_node_prysm_v5]

Returning to v4.2.1 after it crashed

[Screenshot 2024-05-17 at 11:35:59]

Has this worked before in a previous version?

Yes, v4.2.1.

🔬 Minimal Reproduction

  1. Start Geth v1.14.0 with these Geth parameters
  2. Start Prysm v5.0.0, v5.0.1, v5.0.2, or v5.0.3 with these Prysm and these Validator parameters
  3. Wait to see the physical memory grow indefinitely
  4. After using up all accessible physical memory, the client will crash

To simplify starting the clients, I've used the LUKSO CLI Tool to create a JWT & load the network configuration. However, it only starts the EL/CL clients and should not be relevant here.
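
For orientation (the reporter's actual parameter files are linked in step 2 above), a Prysm flag file passed via --config-file generally looks like the sketch below. The values are placeholders, and the two flags discussed later in this thread are included:

```yaml
# Hypothetical prysm.yaml sketch; values are placeholders, not the
# reporter's actual configuration (linked in step 2 above).
execution-endpoint: "http://localhost:8551"  # Geth authenticated engine API
jwt-secret: "/path/to/jwt.hex"               # shared EL/CL secret
p2p-max-peers: 250                           # peer limit (discussed below)
subscribe-all-subnets: true                  # subnet subscription (discussed below)
```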

Error

OS error: OOM (Out of Memory). The Prysm process crashed.

Platform(s)

Linux (x86)

What version of Prysm are you running? (Which release)

v5.0.0 and above

Anything else relevant (validator index / public key)?

Used OS/Hardware:

  • Operating System: Ubuntu 22.04.2 Server
  • Processor: Intel Core i7-10710U (4.7 GHz, 6 Cores, 12 Threads)
  • Motherboard: Intel NUC 10 (NUC10i7FNHN)
  • RAM: 32GB DDR4
@fhildeb added the Bug (Something isn't working) label on May 17, 2024
@prestonvanloon
Member

gm, what flags are you using to run Prysm?

@prestonvanloon
Member

Actually, I see your flags. Thanks

Try turning off --subscribe-all-subnets. That uses a huge amount of memory and is rarely necessary.

@prestonvanloon
Member

prestonvanloon commented May 17, 2024

Also lower your max peers to something sensible like 100. Both of those flags will require more and more memory.

We are still investigating the OOMs you have linked, but we know that --subscribe-all-subnets often doubles the memory requirement, and your peer count is really high, so I suspect these are the reasons your node is crashing.
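
Applied to a Prysm flag file, the two suggestions above would look roughly like this (a sketch, not the reporter's actual configuration):

```yaml
# Hypothetical prysm.yaml excerpt applying the advice above.
p2p-max-peers: 100            # cap the peer count at a sensible value
subscribe-all-subnets: false  # or simply omit the flag entirely
```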

@fhildeb
Author

fhildeb commented May 17, 2024

GM, wow, thanks for the quick response.

Yeah, my flags are included in the config files/parameters above: prysm.yaml and validator.yaml.

Actually, I've already adjusted the max peers to 100; I should've stated that 😅
But I will try turning off --subscribe-all-subnets and report back. 🙏🏻

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

@prestonvanloon
Member

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

v4 doesn't have subnets for blobs

@mxmar

mxmar commented May 17, 2024

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

v4 doesn't have subnets for blobs

Is this subnet required for Deneb validators?

@prestonvanloon
Member

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

v4 doesn't have subnets for blobs

Is this subnet required for Deneb validators?

Yes. Blobs are required in Deneb.

@rkapka
Contributor

rkapka commented May 27, 2024

I just wanted to say this is an awesome issue report @fhildeb

@fhildeb
Author

fhildeb commented May 27, 2024

Try turning off --subscribe-all-subnets. That uses a huge amount of memory and is rarely necessary.
Also lower your max peers to something sensible like 100. Both of those flags will require more and more memory.

I used a max peer count of 100 and also tried with just 50. I also turned --subscribe-all-subnets off.

The issue remains (used v5.0.3 in this case and waited 2 days again).

The LUKSO network does not have blobs yet, as it's only up to Shanghai-Capella (as stated in the report), so the configuration should not cause these increases compared to v4.2.1.

@externalman

Hey @prestonvanloon 👋 Is there any new update about this issue? Does https://github.com/prysmaticlabs/prysm/releases/tag/v5.0.4 solve this problem?

@MrKoberman

We are still experiencing the issue on v5.0.4

@git-ljm

git-ljm commented Sep 12, 2024

We are still experiencing the issue on v5.1.0

@nisdas
Member

nisdas commented Sep 12, 2024

What flags are you running this with, @git-ljm, and which network is this on?
