Investigate RAM consumption during crash recovery #2139

beorn7 · 2016-10-31T12:49:39Z

We have received occasional reports of servers OOMing during crash recovery.

Obviously, the checkpoint has to be loaded in its entirety, but if more is loaded from disk, that could explain the OOMing, since no series maintenance or chunk eviction runs during crash recovery. After a quick check, I could only see chunk descs being loaded. In extreme cases, even the relatively small chunk descs might cause an OOM, so unloading chunk descs would definitely be one way to reduce RAM usage during crash recovery.

But there might be other code paths that load chunks. This has to be investigated more thoroughly.

Obviously, having #447 in place would come in handy.

@matthiasr as discussed earlier today.

beorn7 · 2016-11-24T14:52:09Z

Random observation: A beefy Prometheus server seemed to ramp up its RAM usage while rebuilding the metrics index (xxx metrics queued for indexing).

beorn7 · 2016-11-24T15:01:52Z

Wild guess: If LevelDB gets a lot of updates, it might run into trouble cleaning up and hog too much RAM.

beorn7 · 2017-04-03T14:21:33Z

I have decided not to tackle the LevelDB issues. That would be hairy at best, and LevelDB is going away in v2.0 anyway.
Evicting chunk descs, however, is low-hanging fruit. I'll create a PR shortly (for the 1.6 release).
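
To make the eviction idea concrete, here is a minimal sketch of dropping all but the newest chunk descriptors of a series once they have been loaded during recovery. It is not taken from the Prometheus code base: `chunkDesc`, `memorySeries`, and `evictChunkDescs` are hypothetical names chosen for illustration, and the actual change may work quite differently.

```go
package main

import "fmt"

// chunkDesc is a hypothetical stand-in for a chunk descriptor: lightweight
// metadata about a chunk, without the chunk's sample data.
type chunkDesc struct {
	firstTime, lastTime int64
}

// memorySeries is a hypothetical in-memory series that holds its chunk
// descriptors ordered from oldest to newest.
type memorySeries struct {
	chunkDescs []*chunkDesc
}

// evictChunkDescs drops all but the newest `keep` descriptors and returns
// how many were evicted. (Assumption: older descriptors can be reloaded
// from disk later if they are needed again.)
func (s *memorySeries) evictChunkDescs(keep int) int {
	if keep < 0 {
		keep = 0
	}
	n := len(s.chunkDescs) - keep
	if n <= 0 {
		return 0
	}
	// Copy the retained tail into a fresh slice so the old backing array,
	// and the evicted descriptors it references, can be garbage collected.
	s.chunkDescs = append([]*chunkDesc(nil), s.chunkDescs[n:]...)
	return n
}

func main() {
	s := &memorySeries{}
	// Simulate descriptors loaded for one series during crash recovery.
	for i := int64(0); i < 5; i++ {
		s.chunkDescs = append(s.chunkDescs, &chunkDesc{firstTime: i * 10, lastTime: i*10 + 9})
	}
	evicted := s.evictChunkDescs(1)
	fmt.Printf("evicted %d, kept %d\n", evicted, len(s.chunkDescs))
}
```

Evicting right after each series is processed, rather than once at the end of recovery, is what would keep peak memory bounded in a setup like this, since no background maintenance is running to do it later.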

lock · 2019-03-23T20:44:40Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

beorn7 added the kind/enhancement and component/local storage labels Oct 31, 2016

beorn7 self-assigned this Oct 31, 2016

beorn7 mentioned this issue Nov 14, 2016

Fix possible memory leak by defer inside loop #2184

Merged

beorn7 mentioned this issue Nov 21, 2016

crash recovery: Deal with un-open-able LevelDBs archived_fingerprint_to_timerange and archived_fingerprint_to_metric #2210

Closed

beorn7 mentioned this issue Apr 3, 2017

storage: Evict unused chunk.Descs in crash recovery #2561

Merged

beorn7 closed this as completed Apr 3, 2017

beorn7 mentioned this issue Aug 8, 2017

Crash recovery uses too much memory compared to target-heap-size #3038

Closed

estahn mentioned this issue Nov 7, 2018

Crash recovery OOM kills prometheus-server container #4833

Closed

lock bot locked and limited conversation to collaborators Mar 23, 2019
