
systemd.go: Added systemd health metric #113

Open · wants to merge 1 commit into main
Conversation

@jpds (Contributor) commented Nov 13, 2023

Fixes: #112
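
(The diff itself is not included in this thread, so as orientation only: below is a minimal sketch of the "keep serving /metrics and expose the failure as a metric" approach under discussion. The names systemd_exporter_healthy, systemdCollector, and collectUnits are assumptions for illustration; the actual identifiers in the PR may differ.)

```go
package collector

import (
	"github.com/prometheus/client_golang/prometheus"
)

// healthDesc is a hypothetical 0/1 gauge reporting whether the exporter could
// talk to systemd during the last scrape. The metric name is an assumption;
// the PR may use a different one.
var healthDesc = prometheus.NewDesc(
	"systemd_exporter_healthy",
	"Whether the last connection to systemd succeeded (1) or failed (0).",
	nil, nil,
)

type systemdCollector struct{}

func (c *systemdCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- healthDesc
}

// Collect keeps /metrics answering with HTTP 200 even when systemd is
// unreachable, and reports the failure through the health gauge instead.
func (c *systemdCollector) Collect(ch chan<- prometheus.Metric) {
	healthy := 1.0
	if err := c.collectUnits(ch); err != nil { // hypothetical helper doing the real scraping
		healthy = 0
	}
	ch <- prometheus.MustNewConstMetric(healthDesc, prometheus.GaugeValue, healthy)
}

// collectUnits stands in for the exporter's actual systemd/D-Bus scraping logic.
func (c *systemdCollector) collectUnits(ch chan<- prometheus.Metric) error {
	// ... dial systemd over D-Bus and emit per-unit metrics ...
	return nil
}
```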

@robryk commented Nov 15, 2023

Hm~

I understand that you might prefer to still serve /metrics in that case for some reasons, but I would like to understand those reasons.

The disadvantage I see in not simply stopping to serve is that this approach requires awareness from everyone who writes alerts based on this exporter: besides the standard alert on up being false, they need to write a separate, exporter-specific alert. If there were a convention across many exporters on how to report the health of the exporter itself (maybe we could start one?), using that would not have this problem; see prometheus-community/smartctl_exporter#91 (comment) for a similar issue, which I had forgotten about and have not yet filed as a separate bug as they requested.

Thoughts?

Fixes: prometheus-community#112

Signed-off-by: Jonathan Davies <jpds@protonmail.com>
@jpds (Contributor, Author) commented Nov 17, 2023

> I understand that you might prefer to still serve /metrics in that case for some reasons, but I would like to understand those reasons.

To me, a 5xx error would mean that someone made a serious programming error.

> The disadvantage I see in not simply stopping to serve is that this approach requires awareness from everyone who writes alerts based on this exporter: besides the standard alert on up being false, they need to write a separate, exporter-specific alert.

This is just a fact of life when it comes to different exporters exposing different metrics, and the vast majority use their name as a prefix: I can see http_requests_total (for Thanos), prometheus_http_requests_total, and caddy_http_requests_total, for example. You have to analyze what the exporters provide you and build alerts/Grafana dashboards from there.

}

ch <- prometheus.MustNewConstMetric(c.scrapeDurationDesc, prometheus.GaugeValue, time.Since(begin).Seconds(), namespace)
Contributor:
This doesn't make sense to me, as it's not actually exposing metrics per collector as far as I can tell. namespace is always systemd, so the label would always be collector="systemd".

This means it's no different than up and scrape_duration_seconds recorded by Prometheus.

@jpds (Contributor, Author):

We didn't have collectors up until this point - will fix that in a bit.
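
(For context, the per-collector pattern the reviewer is alluding to, as used by node_exporter for example, gives the collector label a value that actually varies. A rough sketch of that shape, with assumed metric names and a hypothetical runCollector helper rather than the PR's real code:)

```go
package collector

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Per-collector scrape metrics, keyed by a "collector" label whose value
	// varies per sub-collector (e.g. "units", "sockets", "timers"), so each
	// one can be observed independently. Metric names are assumptions here.
	scrapeDurationDesc = prometheus.NewDesc(
		"systemd_scrape_collector_duration_seconds",
		"Duration of a sub-collector scrape.",
		[]string{"collector"}, nil,
	)
	scrapeSuccessDesc = prometheus.NewDesc(
		"systemd_scrape_collector_success",
		"Whether a sub-collector scrape succeeded.",
		[]string{"collector"}, nil,
	)
)

// runCollector runs one named sub-collector and records how long it took and
// whether it succeeded, labelled with that sub-collector's name.
func runCollector(name string, fn func(chan<- prometheus.Metric) error, ch chan<- prometheus.Metric) {
	begin := time.Now()
	err := fn(ch)
	duration := time.Since(begin).Seconds()

	success := 1.0
	if err != nil {
		success = 0
	}
	ch <- prometheus.MustNewConstMetric(scrapeDurationDesc, prometheus.GaugeValue, duration, name)
	ch <- prometheus.MustNewConstMetric(scrapeSuccessDesc, prometheus.GaugeValue, success, name)
}
```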

@SuperQ (Contributor) commented Jan 11, 2024

> To me, a 5xx error would mean that someone made a serious programming error.

The standard for Prometheus is to return a 5xx error for failed scrapes. Some exporters do what you're proposing here, but in general it's not encouraged.
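
(For reference, the behaviour SuperQ describes as the standard can be implemented with client_golang alone: a collector reports its error via prometheus.NewInvalidMetric, and promhttp's HTTPErrorOnError option turns that into a 5xx response for the scrape. A minimal, self-contained sketch; failingCollector and the port are placeholders, not the exporter's actual code:)

```go
package main

import (
	"errors"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// failingCollector is a placeholder for a collector that cannot reach systemd.
type failingCollector struct{}

var dummyDesc = prometheus.NewDesc("systemd_dummy", "Placeholder metric.", nil, nil)

func (c failingCollector) Describe(ch chan<- *prometheus.Desc) { ch <- dummyDesc }

// Collect reports its failure as an invalid metric, which surfaces the error
// to the registry when it gathers.
func (c failingCollector) Collect(ch chan<- prometheus.Metric) {
	ch <- prometheus.NewInvalidMetric(dummyDesc, errors.New("couldn't connect to systemd"))
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(failingCollector{})

	// HTTPErrorOnError turns the gathering error into an HTTP 500, so the
	// scrape fails outright rather than returning a partial 200.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		ErrorHandling: promhttp.HTTPErrorOnError,
	}))
	log.Fatal(http.ListenAndServe(":8080", nil)) // arbitrary port for the sketch
}
```

With this handler, a scrape that cannot reach systemd fails and the standard alert on up == 0 fires, with no exporter-specific alert needed.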

Successfully merging this pull request may close these issues.

Silent failure on inability to connect to systemd