Crash recovery OOM kills prometheus-server container #4833
Comments
simonpasquier added the component/local storage label Nov 9, 2018
Do you have any logs before the crash? Do you have a copy of the
@simonpasquier Please find all logs for that day attached.
I don't see any errors in the logs indicating a bug. The WAL is almost an exact copy of the data currently held in RAM, so increased scraping load or high cardinality increases both memory usage and WAL size. If the RAM demand is so high that Prometheus is OOM killed, then at the next startup it loads the same WAL again, reaches the same RAM usage, and is killed again.
I also suspect label cardinality getting out of control. Is the issue recurring?
@krasi-georgiev @simonpasquier Indeed, I don't see any errors either. So the only thought I had was that maybe the data in the WAL is expanding to more than 6 times its size. After deleting the WAL directory it recovered nicely and we haven't had the issue since. I don't remember making any changes affecting label cardinality. If this cannot be solved or investigated with the current information, I will close this ticket and reopen it once it happens again.
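For anyone wanting to check the same suspicion on their own setup, here is a rough sketch that compares the on-disk WAL size against the container memory limit. The expansion factor of 6 is only the guess from the comment above, not a measured value, and the path `/prometheus/wal` is a hypothetical example; adjust both to your deployment.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (0 if path is missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def replay_may_exceed_limit(wal_bytes, limit_bytes, expansion_factor=6.0):
    """Return True if wal_bytes scaled by the assumed factor is above the limit.

    expansion_factor is an assumption taken from the discussion, not a
    property of Prometheus.
    """
    return wal_bytes * expansion_factor > limit_bytes


if __name__ == "__main__":
    wal_dir = "/prometheus/wal"      # hypothetical path
    limit = 30 * 1024 ** 3           # the 30GB limit mentioned in this issue
    size = dir_size_bytes(wal_dir)
    print(f"WAL size: {size} bytes, at risk: {replay_may_exceed_limit(size, limit)}")
```

This is only a coarse early warning: it says nothing about why the WAL grew, and the real memory needed for replay depends on the data.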
Closing it for now to keep the backlog manageable. Reopen if it happens again (having a copy of the buggy
estahn commented Nov 7, 2018 (edited)
Memory usage reaches its 30GB limit and then the container goes down due to an OOM kill.
Logs:
Configuration:

WAL:
Remediation:
Removing `wal` and restarting the container. We lost around 5h of data.
Possibly related to #2139
Originally posted by @estahn in #4047 (comment)
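A possible variation on the remediation above, as a sketch only: rename the `wal` directory instead of deleting it, so the files stay available for later inspection. The `data_dir` argument and the `.bak-<timestamp>` suffix are hypothetical choices, not anything Prometheus defines. This assumes Prometheus is stopped while the directory is moved. The server still starts without the WAL contents, so the data loss described above still applies.

```python
import os
import time


def move_wal_aside(data_dir, wal_name="wal"):
    """Rename <data_dir>/<wal_name> to a timestamped backup directory.

    Returns the backup path, or None if there is no WAL directory to move.
    Run this only while Prometheus is stopped.
    """
    wal_path = os.path.join(data_dir, wal_name)
    if not os.path.isdir(wal_path):
        return None
    backup_path = f"{wal_path}.bak-{int(time.time())}"
    os.rename(wal_path, backup_path)
    return backup_path
```

For example, `move_wal_aside("/prometheus")` (hypothetical data directory) would leave the old segments in a sibling directory that can be archived or attached to a bug report.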