Metrics for local storage write failures #2091
Comments
Did …
Maybe there is a way to solve the problem in a more general case, such as monitoring available disk space and using Prometheus …
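node_exporter's filesystem collector already exposes available disk space, which is usually the simplest signal to alert on. Purely as an illustration of the idea, here is a minimal client_golang sketch that exposes free space on the storage volume from a small standalone exporter. The metric name, the /prometheus mount path, and the port are made up for this example, and syscall.Statfs is Linux-only.

```go
package main

import (
	"log"
	"net/http"
	"syscall"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Hypothetical gauge: bytes available on the volume that holds the
	// local storage directory (assumed here to be mounted at /prometheus).
	free := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "storage_volume_avail_bytes",
		Help: "Bytes available on the volume holding local storage.",
	}, func() float64 {
		var fs syscall.Statfs_t
		if err := syscall.Statfs("/prometheus", &fs); err != nil {
			return 0 // a real exporter would also count the error
		}
		return float64(fs.Bavail) * float64(fs.Bsize)
	})
	prometheus.MustRegister(free)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```

An alert on such a gauge (or on node_exporter's existing filesystem metrics) would fire well before writes start failing, which is the appeal of the more general approach.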
The relevant metric in the current metrics layout is indeed …
Persist errors appeared about 50 minutes after the first logged write failure. I was missing alert rules for it, but such a long delay makes me think that there's something else that could be done.
A logged write failure should definitely also increase a suitable counter. Unless it's from within LevelDB... different story. ;-/
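For illustration only, a minimal client_golang sketch of that pattern: count the failure in addition to logging it, so an alert rule has something to watch. The counter name and the persistChunk helper are invented for this example and are not actual Prometheus code.

```go
package storage

import (
	"io"
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// writeFailures is a hypothetical counter for this sketch; in Prometheus 1.x
// the role ends up being played by the existing persist-errors counter.
var writeFailures = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "prometheus",
	Subsystem: "local_storage",
	Name:      "write_failures_total",
	Help:      "Total number of failed writes to local storage.",
})

func init() {
	prometheus.MustRegister(writeFailures)
}

// persistChunk is a made-up helper: on a failed write it increments the
// counter as well as logging, instead of only logging.
func persistChunk(w io.Writer, buf []byte) error {
	if _, err := w.Write(buf); err != nil {
		writeFailures.Inc()
		log.Printf("error persisting chunk: %v", err)
		return err
	}
	return nil
}
```

With a counter like that in place, a simple rate over it is enough to drive an alert, rather than waiting for Alertmanager RPCs to start failing before anyone notices.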
beorn7 self-assigned this on Oct 18, 2016
beorn7 added the kind/bug and component/local storage labels on Oct 18, 2016
Might be that we just don't count failures to write a checkpoint in …
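That is the direction the eventual fix (#2583) takes: make a failed checkpoint increment the same persist-errors counter that other persist failures already update. A rough, self-contained sketch of that idea follows; the type and method names are stand-ins rather than the real 1.x code, and the metric name is assumed.

```go
package main

import (
	"errors"
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// memorySeriesStorage stands in for the real MemorySeriesStorage; only the
// pieces needed for this sketch are shown.
type memorySeriesStorage struct {
	persistErrors prometheus.Counter
}

// checkpoint is a placeholder for the real checkpointing work; here it fails
// the same way it does when the disk is full.
func (s *memorySeriesStorage) checkpoint() error {
	return errors.New("no space left on device")
}

// maybeCheckpoint counts a failed checkpoint as a persist error instead of
// only logging it, so the failure shows up in the storage metrics immediately.
func (s *memorySeriesStorage) maybeCheckpoint() {
	if err := s.checkpoint(); err != nil {
		log.Println("Error while checkpointing:", err)
		s.persistErrors.Inc()
	}
}

func main() {
	s := &memorySeriesStorage{
		// Metric name assumed for the 1.x persist-errors counter.
		persistErrors: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "prometheus_local_storage_persist_errors_total",
			Help: "Total number of errors while persisting to local storage.",
		}),
	}
	s.maybeCheckpoint()
}
```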
beorn7 added a commit that referenced this issue on Apr 5, 2017
beorn7 referenced this issue on Apr 5, 2017: storage: Increment s.persistErrors on all persist errors #2583 (Merged)
beorn7 added a commit that referenced this issue on Apr 6, 2017
beorn7 closed this in #2583 on Apr 6, 2017
lock bot commented on Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
aecolley commented on Oct 18, 2016

What did you do?
I ran Prometheus out of disk space.

What did you expect to see?
An increment in prometheus_local_storage_series_ops_errors_total on the /metrics of the struggling Prometheus. Alternatively, an equivalent change in the metrics which I could use to drive an alert when this happens again.

What did you see instead? Under which circumstances?
No clear signal appeared in the metrics. Instead, the problem was discovered when RPCs to Alertmanager failed often enough that I read the stderr log of the failing Prometheus and saw all the "Error while checkpointing[...] no space left on device" logs created by MemorySeriesStorage.loop.

Prometheus version:
1.1.3 (plus some local mods which aren't related)

Logs: