OOM and Utilization Issues when using Prysm v5 #14020

Open
fhildeb opened this issue May 17, 2024 · 13 comments
Labels
Bug Something isn't working

Comments

@fhildeb

fhildeb commented May 17, 2024

Describe the bug

I'm running a Prysm validator on LUKSO (a Layer 1 EVM network, currently up to date with Shanghai-Capella).
Ahead of the upcoming Cancun-Deneb fork, other home stakers and I upgraded to Prysm v5.0.3.

Since upgrading, I have:

  • much higher CPU usage (11-30% instead of 3-5%) and temperatures (75-85°C instead of 35-40°C)
  • the occupied physical memory constantly grows until I get an OOM error

The EL client stayed the same the whole time (Geth v1.14.0).

After reaching the maximum memory, the CPU spikes up (to 75%) until Prysm crashes. Until the OOM error from the OS, there are no visible warnings or errors in the logs. I'm using 32GB of RAM, so the Prysm client crashes after 48-55 hours. Other LUKSO community members running Prysm validators reported similar errors after upgrading; for those with only 16GB of RAM, the client crashes after just around a day.

Every time it crashed, I reverted back one version to narrow down where the root cause was introduced. So far, I've gotten the same OOM issue on v5.0.3, v5.0.2, v5.0.1, and v5.0.0, leading me to conclude that it was introduced with v5. When downgrading to v4.2.1, everything returns to normal, and the physical memory of the validator and consensus client combined does not grow beyond 5GB.

Each time Prysm crashed, I started with a clean setup, removing all blockchain data gathered during the previous attempt, and used checkpoint sync to quickly get back online. It might therefore be that this memory issue only occurs while the EL client is syncing. However, I did not investigate this further, and it is purely speculative.
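
For context, checkpoint sync in Prysm is driven by the --checkpoint-sync-url and --genesis-beacon-api-url flags. A minimal sketch of the corresponding prysm.yaml entries is shown below; the endpoint is a placeholder, not the actual LUKSO checkpoint provider used in this setup:

```yaml
# Hypothetical prysm.yaml excerpt for checkpoint sync.
# The URL below is a placeholder, not the endpoint used here.
checkpoint-sync-url: "https://checkpoints.example.org"
genesis-beacon-api-url: "https://checkpoints.example.org"
```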

I've also seen other issues being opened about OOM lately:

As well as a draft PR about a potential memory bugfix:

Would love to know:

  • What's going on with the increased CPU usage, or whether it is related to the growing memory
  • Whether there are certain flags/configurations necessary to reduce resource usage

Monitoring v5.0.2

[Screenshot: monitoring_node_prysm_v5]

Returning to v4.2.1 after it crashed

[Screenshot 2024-05-17 at 11:35:59]

Has this worked before in a previous version?

Yes, v4.2.1.

🔬 Minimal Reproduction

  1. Start Geth v1.14.0 with these Geth parameters
  2. Start Prysm v5.0.0, v5.0.1, v5.0.2, or v5.0.3 with these Prysm and these Validator parameters
  3. Wait to see the physical memory grow indefinitely
  4. After using up all accessible physical memory, the client will crash

To simplify starting the clients, I've used the LUKSO CLI Tool to create a JWT & load the network configuration. However, it only starts the EL/CL clients and should not be relevant here.
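
For orientation (the reporter's actual parameter files are linked in step 2 above), a Prysm flag file passed via --config-file generally looks like the sketch below. The values are placeholders, and the two flags discussed later in this thread are included:

```yaml
# Hypothetical prysm.yaml sketch; values are placeholders, not the
# reporter's actual configuration (linked in step 2 above).
execution-endpoint: "http://localhost:8551"  # Geth authenticated engine API
jwt-secret: "/path/to/jwt.hex"               # shared EL/CL secret
p2p-max-peers: 250                           # peer limit (discussed below)
subscribe-all-subnets: true                  # subnet subscription (discussed below)
```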

Error

OS error: OOM (Out of Memory). The Prysm process crashed.

Platform(s)

Linux (x86)

What version of Prysm are you running? (Which release)

v5.0.0 and above

Anything else relevant (validator index / public key)?

Used OS/Hardware:

  • Operating System: Ubuntu 22.04.2 Server
  • Processor: Intel Core i7-10710U (4.7 GHz, 6 Cores, 12 Threads)
  • Motherboard: Intel NUC 10 (NUC10i7FNHN)
  • RAM: 32GB DDR4
@fhildeb added the Bug (Something isn't working) label on May 17, 2024
@prestonvanloon
Member

gm, what flags are you using to run Prysm?

@prestonvanloon
Member

Actually, I see your flags. Thanks

Try turning off --subscribe-all-subnets. That uses a huge amount of memory and is rarely necessary.

@prestonvanloon
Member

prestonvanloon commented May 17, 2024

Also lower your max peers to something sensible like 100. Both of those flags will require more and more memory.

We are still investigating the OOMs you have linked, but we know that --subscribe-all-subnets often doubles the memory requirement, and your peer count is really high, so I suspect these are the reasons your node is crashing.
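
Applied to a Prysm flag file, the two suggestions above would look roughly like this (a sketch, not the reporter's actual configuration):

```yaml
# Hypothetical prysm.yaml excerpt applying the advice above.
p2p-max-peers: 100            # cap the peer count at a sensible value
subscribe-all-subnets: false  # or simply omit the flag entirely
```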

@fhildeb
Author

fhildeb commented May 17, 2024

GM, wow, thanks for the quick response.

Yeah, my flags are included in the config files/parameters above: prysm.yaml and validator.yaml.

Actually, I've already adjusted the max peers to 100; I should've stated that 😅
But I will try turning off --subscribe-all-subnets and report back. 🙏🏻

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

@prestonvanloon
Member

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

v4 doesn't have subnets for blobs

@mxmar

mxmar commented May 17, 2024

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

v4 doesn't have subnets for blobs

Is this subnet required for Deneb validators?

@prestonvanloon
Member

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

v4 doesn't have subnets for blobs

Is this subnet required for Deneb validators?

Yes. Blobs are required in Deneb.

@rkapka
Contributor

rkapka commented May 27, 2024

I just wanted to say this is an awesome issue report @fhildeb

@fhildeb
Author

fhildeb commented May 27, 2024

Try turning off --subscribe-all-subnets. That uses a huge amount of memory and is rarely necessary.
Also lower your max peers to something sensible like 100. Both of those flags will require more and more memory.

I used a max peer count of 100 and also tried with just 50. I also turned --subscribe-all-subnets off.

The issue remains (used v5.0.3 in this case and waited 2 days again).

The LUKSO network does not have blobs yet, as it's only up to Shanghai-Capella (as stated in the report), so the configuration should not cause these increases compared to v4.2.1.

@externalman

Hey @prestonvanloon 👋 Is there any new update about this issue? Does https://github.com/prysmaticlabs/prysm/releases/tag/v5.0.4 solve this problem?

@MrKoberman

We are still experiencing the issue on v5.0.4

@git-ljm

git-ljm commented Sep 12, 2024

We are still experiencing the issue on v5.1.0

@nisdas
Member

nisdas commented Sep 12, 2024

What flags are you running this with, @git-ljm, and which network is this on?
