Metrics for local storage write failures #2091

Closed
aecolley opened this Issue Oct 18, 2016 · 7 comments

@aecolley commented Oct 18, 2016

What did you do?
I ran Prometheus out of disk space.

What did you expect to see?
An increment in prometheus_local_storage_series_ops_errors_total on the /metrics of the struggling Prometheus, or an equivalent change in some other metric that I could use to drive an alert when this happens again.

What did you see instead? Under which circumstances?
No clear signal appeared in the metrics. Instead, the problem was discovered when RPCs to Alertmanager failed often enough that I read the stderr log of the failing Prometheus and saw all the "Error while checkpointing[...] no space left on device" messages written by MemorySeriesStorage.loop.

  • Prometheus version:

    1.1.3 (plus some local mods which aren't related)

  • Logs:

time="2016-10-17T13:49:43-04:00" level=error msg="Error while checkpointing: write /var/opt/prometheus/state/heads.db.tmp: no space left on device" source="storage.go:1158"
time="2016-10-17T13:49:43-04:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:548"
@juliusv (Member) commented Oct 18, 2016

Did prometheus_local_storage_persist_errors_total go up at least?

@brancz (Member) commented Oct 18, 2016

Maybe there is a way to solve the problem more generally: monitor available disk space and use Prometheus's predict_linear to alert a few minutes/hours/days before the server runs out of disk, or when usage hits 80%, or whatever threshold makes sense in your case. I've seen this used in a similar way to monitor file descriptor saturation.
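
For illustration, a minimal sketch of such a rule in the Prometheus 1.x alerting-rule syntax, assuming node_exporter is scraped and exposes node_filesystem_avail for the filesystem holding the Prometheus data directory (the mountpoint label, lookback window, prediction horizon, and severity below are placeholders, not values from this thread):

    ALERT DiskWillFillInFourHours
      IF predict_linear(node_filesystem_avail{mountpoint="/var/opt/prometheus"}[1h], 4 * 3600) < 0
      FOR 10m
      LABELS { severity = "warning" }
      ANNOTATIONS {
        summary = "Filesystem on {{ $labels.instance }} is predicted to fill within four hours",
      }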

@beorn7 (Member) commented Oct 18, 2016

The relevant metric in the current metrics layout is indeed prometheus_local_storage_persist_errors_total.

@aecolley (Author) commented Oct 18, 2016

Persist errors appeared about 50 minutes after the first logged write failure. I was missing alert rules for it, but such a long delay makes me think that there's something else that could be done.
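
For reference, an alert on the metric discussed above could look like the following minimal sketch, again in the 1.x rule syntax; the alert name, rate window, and labels are placeholder choices:

    ALERT PrometheusPersistErrors
      IF rate(prometheus_local_storage_persist_errors_total[5m]) > 0
      FOR 5m
      LABELS { severity = "warning" }
      ANNOTATIONS {
        summary = "{{ $labels.instance }} is failing to persist samples to local storage",
      }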

@beorn7 (Member) commented Oct 18, 2016

A logged write failure should definitely also increase a suitable counter.

Unless it's from within LevelDB... different story. ;-/

@beorn7 (Member) commented Oct 18, 2016

It might be that we just don't count failures to write a checkpoint in prometheus_local_storage_persist_errors_total, and for some reason there weren't any chunks to persist in your setup. I'll check that out and fix as appropriate.

beorn7 added a commit that referenced this issue Apr 5, 2017

beorn7 added a commit that referenced this issue Apr 6, 2017

beorn7 closed this in #2583 Apr 6, 2017

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
