Investigate RAM consumption during crash recovery #2139
Comments
Random observation: A beefy Prometheus server seemed to ramp up its RAM usage while rebuilding the metrics index (xxx metrics queued for indexing).
Wild guess: If LevelDB gets a lot of updates, it might run into trouble cleaning up and hog too much RAM.
I have decided not to tackle the LevelDB issues. This will be hairy at best, and LevelDB is going away in v2.0 anyway.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
We have received occasional reports of servers OOMing during crash recovery.
Obviously, the checkpoint has to be loaded in its entirety, but if more than that is loaded from disk, it could explain the OOMing, as no series maintenance or chunk eviction is running during crash recovery. After a quick check, I could only see chunk descs being loaded. In extreme cases, even the relatively small chunk descs might cause an OOM, so unloading chunk descs will definitely be a way to reduce RAM usage during crash recovery.
But there might be other code paths where chunks are loaded. This has to be investigated more thoroughly.
Obviously, having #447 in place would come in handy.
@matthiasr as discussed earlier today.