Occasional unbounded memory growth (not disk growth) at large scales, leading to OOM #3640
Comments
On attempting to kill that subsequent restart, I see:

This process then ran out of memory and was restarted - captured a pprof dump.
There's the tsdb CLI, which can provide some insights, provided that you've got access to the data directory.
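For anyone following along, here is a minimal sketch of how that could be used, assuming the `tsdb` tool is built from the prometheus/tsdb repository and that `/prometheus-data` is the data directory (both paths are illustrative, not taken from this thread):

```sh
# Build the tsdb CLI from the prometheus/tsdb repo (assumes a working Go toolchain)
go get github.com/prometheus/tsdb/cmd/tsdb

# List the blocks in the data directory: block ULID, time range, and
# per-block sample/chunk/series counts
tsdb ls /prometheus-data
```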
nipuntalukdar commented Jan 2, 2018
@simonpasquier just curious, do we really need the directories with zero samples and zero chunks? I did a small fix to avoid creating them, but I'm not sure whether the problem of "empty sample" directories is that frequent.
@nipuntalukdar that is my local env, which is frequently started/stopped, so that explains the zero samples & chunks...
Here is the output of `tsdb ls` on the problematic cluster:
@pgier Why do you have 120s blocks? The default is 2 hours, and we don't expect users to change that.
That was legacy from a pre-2.0.0 release where the block size wasn't defaulted. Will test with it removed.
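For illustration, here is a sketch of the kind of change that implies, assuming the block size was set through the TSDB block-duration flags (the exact flags and paths used in this deployment aren't shown in the thread and are assumptions here):

```sh
# Legacy invocation: forces tiny 120s (2m) blocks, producing a very large
# number of blocks over a 3-day retention window
prometheus --storage.tsdb.path=/prometheus-data \
  --storage.tsdb.retention=3d \
  --storage.tsdb.min-block-duration=2m \
  --storage.tsdb.max-block-duration=2m

# Updated invocation: drop the block-duration flags so Prometheus falls back
# to its default 2h block size
prometheus --storage.tsdb.path=/prometheus-data \
  --storage.tsdb.retention=3d
```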
After removing the block size setting (to take the default), we have gone 4 days without this issue across 5 clusters (previously it was occurring once every day or two). Closing, thanks for the spot.
smarterclayton closed this on Jan 9, 2018
lock bot commented Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
smarterclayton commented Dec 30, 2017 (edited)
We run a number of large Prometheus 2.0.0 instances on several large Kubernetes clusters (4GB to 20GB RSS, 1M to 3M series, 400k to 1.5M samples scraped, 160 to 600 scrape targets, between 10GB and 50GB of disk use for 3-day retention) and occasionally see some of the instances transition from steady-state memory usage to memory growth that continues (without correspondingly large disk growth) until the server OOMs, at which point a restart shows the same growth behavior. The only "fix" at that point is to clear the historical data, after which the instance usually returns to a normal steady-state memory usage pattern.
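(For reference, a sketch of the self-metric queries that could be used to correlate the RSS growth with series count, assuming the standard Prometheus 2.x self-scrape reachable on localhost:9090 and a `job="prometheus"` label; both the address and the label are assumptions, not from this report:)

```sh
# Active series in the TSDB head block (per Prometheus instance)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_tsdb_head_series'

# Resident memory of the Prometheus process, to correlate with the series count
# (the job="prometheus" selector assumes the default self-scrape config)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=process_resident_memory_bytes{job="prometheus"}'
```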
This is a graph of Prometheus RSS across the 5 separate clusters over the last week - each spike corresponds to a transition to the crash-looping state (since these are running on Kubernetes, the instances are simply restarted every 5-10 minutes and will OOM continuously).

I suspect that this is due to unbounded series growth, or some other temporary phenomenon of the scrape targets, but I have not yet been able to capture a series count from a server that is experiencing the unbounded growth, and the logs are purged by the container after the first OOM restart. I have a 4GB snapshot of the data directory of an instance that is experiencing the issue, but I cannot share it publicly because it contains customer data.
On the largest server (after a restart with an empty data directory) I was seeing this recur, but the count query for series, `count({__name__=~".+"})`, appears to time out (even when queried locally). Is there a way to perform the same count from disk?

The logs from that restart (where memory began growing from 18GB to 24GB over a 10-minute window, but only 4GB on disk):
The panic appears to have occurred when querying the total series count. Subsequent series counts failed. When triggering a restart with a kill, I observed the following added to the logs:
but the process continued running without exiting. I ended up having to force-terminate the process. On restart, I experienced the same slow memory growth with high CPU (2+ cores vs. 0.5-0.75 normally). A series count query failed with