Retention cutoff failed leading to compaction failed/spiralling disk usage #3506
Comments
It is not recommended to use Prometheus with NFS, as we need a POSIX filesystem. We also expect to be the only thing creating files in our data directory.
@brian-brazil Prometheus is the only thing creating anything in the data directory. We have no option but to use NFS to get the IOPS required.
NFS is creating files in the directory. I'd suggest using a local SSD.
@brian-brazil Local SSD is not fast enough (or at least it wasn't for Prometheus v1; has the IOPS requirement dropped with the new backend?). Sorry, you'll have to excuse my ignorance on NFS, but if that file is created by NFS, why does Prometheus have it pinned open?
Yes, 2.0 has a new storage backend with a much reduced need for write IOPS. I'm not familiar enough with this peculiar detail of NFS to know what's going on.
https://serverfault.com/questions/201294/nfsxxxx-files-appearing-what-are-those
Found the above while reading around the subject. It seems the .nfsXXXX file is there because a deleted file is still held open, which suggests there may be an issue with files not being closed out. Not sure. Since we're running this on Kubernetes, I don't know that a local SSD is an option for us even if it's performant enough. The only way would be to pin the pod to a specific node and add the SSD there, but then we lose resiliency. Tricky.
@alxmk You can add an SSD as a persistent volume for your pods; Kubernetes then handles mounting the SSD on whichever node the Prometheus pod is running on. Did you consider that?
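For illustration, a minimal sketch of that approach, assuming a cloud provider with SSD-backed network volumes (the GCE pd-ssd provisioner is used as an example; the StorageClass name, claim name, and size are placeholders):

```yaml
# Example StorageClass backed by SSD persistent disks (GCE shown; adjust the
# provisioner and parameters for your provider).
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
---
# Claim that the Prometheus pod can mount as its data directory.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: prometheus-data
spec:
  storageClassName: ssd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi
```

The claim is then referenced as a volume in the Prometheus pod spec and mounted at the data directory.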
My understanding is that this still has the problem that if the node the SSD is physically attached to goes down, we lose the volume, even if the Prometheus pod is rescheduled elsewhere.
This should be fixed by prometheus/tsdb#213 and #3508 |
gouthamve closed this Nov 30, 2017
lock bot commented Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
alxmk commented Nov 23, 2017
What did you do?
I am running Prometheus 2.0.0 in Kubernetes with a 7 day retention interval.
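The retention is set with the --storage.tsdb.retention flag; a minimal sketch of the relevant part of the container spec (the image tag and paths are illustrative placeholders):

```yaml
# Relevant part of the Prometheus container spec; everything except the
# retention flag is an illustrative placeholder.
containers:
  - name: prometheus
    image: prom/prometheus:v2.0.0
    args:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention=7d"
```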
What did you expect to see?
After 7 days, Prometheus expires time series data older than that.
What did you see instead? Under which circumstances?
At 7 days and ~3 hours:
Subsequently seeing repeated:
And disk utilisation growing incredibly quickly (~250GB in 24 hours).
Directory looks like:
The file preventing the chunks directory from being deleted is held open by Prometheus:
Looks like a similar result to #3487 but potentially a different root cause.
Environment
Kubernetes 1.7.4
Not relevant
We have 100 jobs with around 10000 targets in total.
Not relevant
prom.log