Prometheus 2.0.0-beta5 doesn't recover nicely when running out of disk space #3283
Comments
CC @fabxc @gouthamve
smarterclayton
commented
Oct 19, 2017
Also seen this:
@smarterclayton Did this also happen as a result of Prometheus running out of disk space, or did it occur arbitrarily?
smarterclayton
commented
Oct 19, 2017
It was very close to being out of disk space (174GB used out of 187GB), so I suspect it was. But I didn't actually see usage get any closer to 187.
It may well be the case that you did hit the limit, though. Many file systems keep a certain percentage of free space reserved to prevent excessive block fragmentation.
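For anyone who wants to check this on their own node, a quick sketch (assuming an ext4 file system; /dev/sda1 is a placeholder for your actual device):
$ # show how many blocks the file system reserves (5% by default on ext4)
$ sudo tune2fs -l /dev/sda1 | grep -i 'reserved block count'
$ # optionally shrink the reservation to 1% to reclaim space
$ sudo tune2fs -m 1 /dev/sda1
Keep in mind the reservation exists partly to fight fragmentation, so lowering it trades that protection for usable space.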
@smarterclayton do you recall whether
smarterclayton
commented
Oct 23, 2017
Recreated: prom.walcrash.tar.gz. It includes both the list of files and the full log.
smarterclayton
commented
Oct 23, 2017
and files:
smarterclayton
referenced this issue
Oct 24, 2017
Closed
Some rate calculations on rc1 are incorrect #3345
kargakis
commented
Nov 2, 2017
I saw Prometheus cluttered with the following logs yesterday in an instance we run in our CI cluster.
Seems like prom was hosed and stopped collecting metrics. I couldn't access the console, but that may be due to our proxy. I've since restarted prom because of a different issue I wanted to debug, but I guess it is going to recur.
TimSimmons
commented
Nov 14, 2017
Note: this turned out to be a case of mistaken identity, but I saw this with Prometheus 2.0.0 today.
Presumably because the disk was reporting full, I saw the same errors, so I started looking into what was using disk space. I then looked at some of my other 2.0 servers and noticed the same gap in reporting. For example, looking at a 1.6.1 server:
Looking into that discrepancy led me to https://serverfault.com/questions/490704/ubuntu-server-hard-drive-always-full
However, I didn't find anything that way.
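For others chasing the same df/du discrepancy, the usual check from that serverfault thread is to look for deleted-but-still-open files; a sketch, assuming lsof is available and the data volume is mounted at /prometheus:
$ # open files whose link count is zero, i.e. deleted but still held open
$ sudo lsof +L1 /prometheus
$ # compare what the kernel reports as used vs. what the directory tree contains
$ df -h /prometheus
$ sudo du -sh /prometheus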
ankon
commented
Nov 16, 2017
I just hit the same issue with Prometheus 2.0.0: my disk ran full, Prometheus started to complain about that, and eventually got itself into the 'file already closed' loop, continuously reporting this error for the same file. I do have a log available if you need it.
brian-brazil
added
component/local storage
kind/bug
labels
Dec 8, 2017
Mattias-
commented
Jan 12, 2018
Saw this as well today.
The data dir filled up pretty fast.
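If it helps to see what grew, a quick way to rank the TSDB directories by size (a sketch, assuming the default /prometheus data path):
$ # largest block and WAL directories sort to the bottom
$ du -sh /prometheus/* | sort -h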
efahale
commented
Jan 24, 2018
Hi,
stefreak
commented
Feb 12, 2018
I have a similar issue.
Inode usage is also only 1%:
ncdu also reports:
Prometheus stopped collecting metrics with that error. We'll go ahead and test v2.1.0; there is a changelog entry that sounds like it could potentially be related.
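For reference, the inode figure above can be reproduced directly (a sketch; point it at your actual data mount):
$ # inode usage for the file system backing the data directory
$ df -i /prometheus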
shomron
commented
Feb 12, 2018
@stefreak this sounds like a separate issue I've encountered: exceeding the process ulimit for open files. Search your logs to confirm.
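A quick way to check whether the file-descriptor limit is the culprit (a sketch, assuming a single prometheus process on Linux):
$ # the limit the process inherited
$ grep -i 'open files' /proc/$(pgrep -x prometheus)/limits
$ # how many descriptors it currently holds
$ ls /proc/$(pgrep -x prometheus)/fd | wc -l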
Same issue, see:
Would be great if the health check could catch that (cc: @simonpasquier)
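For anyone wiring up probes in the meantime, Prometheus 2.x does expose liveness and readiness endpoints, though as this issue shows they can keep returning OK while WAL appends fail; the basic checks, assuming the default port:
$ # liveness: does the process respond at all?
$ curl -sf http://localhost:9090/-/healthy
$ # readiness: is it ready to serve traffic?
$ curl -sf http://localhost:9090/-/ready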
iksaif
referenced this issue
Apr 16, 2018
Open
WAL log samples: log series: write metrics/wal/000315: file already closed #317
hoffie
referenced this issue
May 26, 2018
Closed
Prometheus not recovering gracefully after disk fill event #4194
Replying to the original issue; everyone else, please open a new one and we will troubleshoot those as well.
This is fixed now in tsdb and the fix has also been pulled into Prometheus. I just tested this locally and it works as expected, deleting the empty WAL files.
Looking at the code, the errors are returned when making OS calls, so I don't think we can do anything more to handle this.
TroubleshooteRJ
commented
Aug 7, 2018
I'm running Prometheus version 2.3.1 and keep getting the following error:
msg="append failed" err="WAL log samples: log series: write /var/lib/prometheus/wal/001530: disk quota exceeded"
rakesh.jain@labs-monitor:/var/lib/prometheus/wal$ ls -ltr
rakesh.jain@labs-monitor:/var/lib/prometheus/wal$ du -sh *
Do I need to restart the Prometheus service?
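Since the error is 'disk quota exceeded' rather than a full disk, it may be worth checking the quota for the user running Prometheus first; a sketch, assuming quota tools are installed and prometheus is the service user:
$ # current usage vs. quota for the service user
$ sudo quota -s -u prometheus
$ # actual size of the data directory
$ sudo du -sh /var/lib/prometheus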
nipuntalukdar
referenced this issue
Nov 9, 2018
Closed
Fatal error handling (when writes to wal file fail) #247
VLZZZ
commented
Feb 8, 2019
Same here.
We ran out of space, then resized the Kubernetes volume.
Prometheus logs:
WAL file on the node:
After removing this file and restarting, Prometheus recovered.
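To locate such files without hunting by hand, a minimal sketch (assuming the /prometheus data path from the original report; stop Prometheus before deleting anything):
$ # list zero-byte WAL segment files
$ find /prometheus/wal -maxdepth 1 -type f -size 0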
Nothing yet; still waiting for @nipuntalukdar to find the time to tackle this in #3283, otherwise I might give it a try after clearing some other pending PRs.
Just tested prometheus/tsdb#582: once the file system has more free space, Prometheus will continue to write to the WAL. I am not sure whether we should make it hard-fail instead of retrying and flooding the error log with messages, which might cause other issues.
Option 1: continue retrying until it succeeds or Prometheus is terminated.
Option 2: hard-fail after some number of attempts.
I think I am leaning towards option 2.
EdSchouten
commented
Oct 12, 2017
What did you do?
I run a Prometheus server that briefly ran out of disk space earlier today. A colleague of mine made the volume larger.
What did you expect to see?
As soon as disk space becomes available, Prometheus should continue its business.
What did you see instead? Under which circumstances?
Prometheus was unable to scrape any targets from then on. The targets page showed "WAL log samples: log series: write /prometheus/wal/000024: file already closed" next to every target in the table.
I tried restarting Prometheus, but it then refused to start, terminating almost immediately with the message below:
/prometheus/wal/000025 was a zero-byte file. After doing an rm /prometheus/wal/000025, Prometheus continued as usual. In short, there may be two issues here:
- Prometheus does not handle running out of disk space gracefully.
- Prometheus fails to start when there are zero-byte files in the wal/ directory.
Environment
Linux 3.16.0-4-amd64 x86_64
2.0.0-beta5
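A minimal sketch of the zero-byte-file workaround described above, using this report's /prometheus data path (adjust to your own, and stop Prometheus before deleting anything):
$ # list zero-byte WAL segment files
$ find /prometheus/wal -maxdepth 1 -type f -size 0
$ # once Prometheus is stopped, remove them
$ find /prometheus/wal -maxdepth 1 -type f -size 0 -delete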