Metric tracking crash recoveries? #1918

Closed
mattbostock opened this Issue Aug 24, 2016 · 3 comments

mattbostock commented Aug 24, 2016

There's a metric to track inconsistencies in the local storage:

prometheus_local_storage_inconsistencies_total

dirtyCounter: prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: namespace,
	Subsystem: subsystem,
	Name:      "inconsistencies_total",
	Help:      "A counter incremented each time an inconsistency in the local storage is detected. If this is greater zero, restart the server as soon as possible.",
}),

...but it's not incremented (as far as I can tell) when the storage is found dirty during startup (and crash recovery is invoked):

defer func() {
	if p.dirty {
		log.Warn("Persistence layer appears dirty.")
		err = p.recoverFromCrash(fingerprintToSeries)
		if err != nil {
			sm = nil
		}
	}
}()

The help for the above metric says:

A counter incremented each time an inconsistency in the local storage is detected. If this is greater zero, restart the server as soon as possible.

So it probably doesn't make sense to use it to indicate that crash recovery took place when Prometheus started.

Should we have a separate metric dedicated to this purpose?

beorn7 commented Aug 26, 2016

As discussed in person, I think a metric like that would make sense. It's usually obvious from the logs that it happened, but you have to check the logs to find out. So the fact that you should check the logs would be something for a ticketing alert. Could be prometheus_local_storage_started_dirty or something.
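
For illustration, here is a rough sketch of how such a metric could be wired into the startup path. It is not the code from the eventual PR. The `gauge` type is a hypothetical stand-in for a Prometheus gauge so that the example needs only the standard library, and `persistence`, `startedDirty`, `recoverFn`, and `load` are simplified, assumed names rather than the real storage API.

```go
package main

import "fmt"

// gauge is a hypothetical stand-in for a Prometheus gauge, so that this
// sketch compiles with the standard library alone.
type gauge struct{ value float64 }

// Set stores the current value of the gauge.
func (g *gauge) Set(v float64) { g.value = v }

// persistence is a simplified, assumed model of the storage layer.
type persistence struct {
	dirty        bool         // true if the storage was not shut down cleanly
	startedDirty *gauge       // would be exposed as prometheus_local_storage_started_dirty
	recoverFn    func() error // stand-in for recoverFromCrash
}

// load records whether crash recovery is needed and, if so, runs it.
// The gauge is set before recovery so that it reflects the dirty start
// even when recovery itself fails.
func (p *persistence) load() error {
	if !p.dirty {
		p.startedDirty.Set(0)
		return nil
	}
	p.startedDirty.Set(1)
	return p.recoverFn()
}

func main() {
	p := &persistence{
		dirty:        true,
		startedDirty: &gauge{},
		recoverFn:    func() error { return nil },
	}
	if err := p.load(); err != nil {
		fmt.Println("crash recovery failed:", err)
		return
	}
	fmt.Println("started_dirty:", p.startedDirty.value)
}
```

A gauge that stays at 1 until the next clean start suits a ticketing alert better than a counter here, because the condition describes the current process rather than an accumulating number of events. That choice is an assumption of this sketch.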

mattbostock commented Aug 27, 2016

Thanks, I'll raise a PR soon.

mattbostock added a commit to mattbostock/prometheus that referenced this issue Aug 27, 2016

Storage: Add crash recovery metric 'started_dirty'
...to indicate when crash recovery was invoked during Prometheus
startup.

Fixes prometheus#1918.

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019
