
Add new Network UPS Tools exporter to the exporters list #1749

Merged
brian-brazil merged 1 commit into prometheus:master from DRuggeri:master
Oct 1, 2020
Conversation

@DRuggeri
Contributor

Another COVID project as I work on the network and instrument more and more of it.

Feedback always welcome

Signed-off-by: Daniel Ruggeri <druggeri@primary.net>
@brian-brazil
Contributor

brian-brazil commented Sep 23, 2020

Could this supersede the apcupsd we currently list, or is it doing something different?

Some advice to help improve your exporter:
You're using direct instrumentation, which is generally racy and incorrect for exporters; you should be using MustNewConstMetric.
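A minimal sketch of that const-metric pattern with the Go client library, assuming a hypothetical upsCollector type; the metric name and value are illustrative, not taken from the exporter:

```go
package collectors

import "github.com/prometheus/client_golang/prometheus"

// upsCollector is a hypothetical collector used only for illustration.
type upsCollector struct {
	voltageDesc *prometheus.Desc
}

func NewUPSCollector() *upsCollector {
	return &upsCollector{
		voltageDesc: prometheus.NewDesc(
			"network_ups_tools_battery_voltage",
			"Battery voltage reported by the UPS",
			nil, nil,
		),
	}
}

func (c *upsCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.voltageDesc
}

// Collect reads fresh values on every scrape and emits them as const metrics,
// so no shared state is mutated across concurrent scrapes.
func (c *upsCollector) Collect(ch chan<- prometheus.Metric) {
	value := 13.2 // in a real collector, read this from the NUT daemon here
	ch <- prometheus.MustNewConstMetric(c.voltageDesc, prometheus.GaugeValue, value)
}
```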

variable should be part of the metric name, not a label.

The various last_scrape metrics aren't very useful, I'd suggest removing them. I'd also suggest not reporting an error with a gauge, but instead failing the scrape. You're also not catching when the 2nd RPC fails.

The _device metric is an info metric, so should end with _info. It also seems to be missing many of the labels I'd expect it to have given the NUT docs - I'd expect everything bar uptime.

Why not collect everything, rather than having the user provide a list?

@beorn7 added the "exporters and integrations" label (Requests for new entries in the list of exporters and integrations) on Sep 23, 2020
@DRuggeri
Contributor Author

  • Could this supersede the apcupsd we currently list, or is it doing something different?

Yes, though it may depend on the use case. That's why I created this exporter (I'm changing UPS vendors). The apcupsd_exporter does have access to data that is not visible to NUT - presumably due to proprietary or undocumented protocol data/levels of NUT support. Some examples I came across are time in seconds on UPS, number of transfers to battery, and remaining runtime on battery.

  • Some advice to help improve your exporter:
  • You're using direct instrumentation, which is generally racy and incorrect for exporters; you should be using MustNewConstMetric.

I know we discussed this in a separate PR and I guess I built this from an old template. I've switched to the Desc/MustNewConst pattern. I also made sure that all logic from an invocation of Collect is encapsulated to avoid race conditions (i.e. if there are concurrent scrapes, each results in its own session with the UPS daemon rather than stepping on a common session).

  • variable should be part of the metric name, not a label.

I suppose this could be done by gathering the requested variables during init and then building metrics dynamically from the list, but each UPS will have an unknown set of variables available depending on level of support, NUT driver, and UPS "openness". In the case of a hard-coded list of variables to export, that's not too hard. However, in the case that the administrator wants all variables by setting the nut.vars_enable value to an empty string, it would require during init to reach out to the UPS and find out what is available, right? That's doable, I guess, but can you say a bit more about the benefits of putting the variable name in the metric name rather than being a label dimension?

  • The various last_scrape metrics aren't very useful, I'd suggest removing them. I'd also suggest not reporting an error with a gauge, but instead failing the scrape.

OK - I've gone ahead and removed them. They are leftovers from the exporter that I built this one from. What is the idiomatic way to fail the scrape, though, and how does one detect the failure from the alertmanager side of things? The Collect method doesn't return an error, for example, and this target would still appear up in Prometheus, right?

  • You're also not catching when the 2nd RPC fails.

I think you're referring to the population of upsList. This should be OK since a failure there leaves an empty slice negating all the logic in the range loop. LMK if I'm not tracking to what you mean or if this was related to the scrape error metric which has since been deleted.

  • The _device metric is an info metric, so should end with _info. It also seems to be missing many of the labels I'd expect it to have given the NUT docs - I'd expect everything bar uptime.

I've renamed it in 3b8fe22e62d8b43afd40067db37d73cb6edbe273. Can you say more about the labels you expect, but are missing? These are held in the deviceLabels slice and are populated during each scrape. Currently the slice mirrors what is in the NUT doco ("model", "mfr", "serial", "type", "description", "contact", "location", "part", "macaddr", "uptime")

  • Why not collect everything, rather than having the user provide a list?

Oops - this is already possible, I just forgot to mention it in the README (fixed in 58135dbb3dce25790e1091a2772e8f2be2950048). This can be done by setting nut.vars_enable to an empty string. Generally speaking, I prefer to give the administrator the option of exporting only the time series data they care about (i.e. if they don't care about efficiency metrics, don't dump that into the TSDB).

@brian-brazil
Contributor

it would require during init to reach out to the UPS and find out what is available, right?

That's making it more complicated than it needs to be, you would usually build up a map as you go. However the Go client will do all of this for you when you're using MustNewConstMetric.
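A sketch of what that could look like inside Collect, assuming a hypothetical vars map of NUT variable names (e.g. "battery.charge") to values that has already been read from the daemon; no pre-registration of metrics is needed:

```go
package collectors

import (
	"strings"

	"github.com/prometheus/client_golang/prometheus"
)

// collectVars is illustrative only: it derives a metric name from each NUT
// variable name and emits a const metric for it on the fly.
func collectVars(ch chan<- prometheus.Metric, vars map[string]float64) {
	for name, value := range vars {
		metricName := "network_ups_tools_" + strings.ReplaceAll(name, ".", "_")
		desc := prometheus.NewDesc(metricName, "Value of NUT variable "+name, nil, nil)
		ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, value)
	}
}
```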

can you say a bit more about the benefits of putting the variable name in the metric name rather than being a label dimension?

You're making things much more difficult and error prone for users to query for no potential benefit.

What is the idiomatic way to fail the scrape, though, and how does one detect the failure from the alertmanager side of things?

https://www.robustperception.io/failing-a-scrape-with-the-prometheus-go-client
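The core of the approach described there, as a variant of the hypothetical upsCollector's Collect method sketched above (fetchFromNUT is a made-up stand-in for the real call to the NUT daemon):

```go
// Collect fails the whole scrape if data collection fails: sending an invalid
// metric makes the registry's Gather return an error, so the promhttp handler
// replies with HTTP 500 and Prometheus records up == 0 for the target.
func (c *upsCollector) Collect(ch chan<- prometheus.Metric) {
	vars, err := fetchFromNUT() // hypothetical helper that queries the NUT daemon
	if err != nil {
		ch <- prometheus.NewInvalidMetric(
			prometheus.NewDesc("network_ups_tools_scrape_error",
				"Error while talking to the NUT daemon", nil, nil),
			err,
		)
		return
	}
	collectVars(ch, vars) // emit the usual const metrics
}
```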

This should be OK since a failure there leaves an empty slice negating all the logic in the range loop.

Silently missing data is not good, it's simplest to fail the scrape if any part of the data collection fails.

Can you say more about the labels you expect, but are missing?

Everything bar uptime from that list, as uptime is both a number and changing over time.

Generally speaking, I prefer to give the administrator the option of exporting only the time series data they care about (i.e. if they don't care about efficiency metrics, don't dump that into the TSDB).

Personally, as there are only ~200 of them, which is small, I'd keep it simple and always expose everything. You're not avoiding any load on the target, you're not going to have that many UPSes, and if someone really wants to micro-optimise for some reason they can use metric relabelling.

You also should decide if this is a single-target or multi-target exporter, in particular why is Ups an option? If it should be configurable by the user then it should come from a URL parameter rather than a config file, and the target label device should not be added in, as service discovery is Prometheus's responsibility.

https://github.com/DRuggeri/nut_exporter/blob/master/collectors/nut_collector.go#L86

You should leave it as the empty string if it's missing, generally try to keep the data as raw as you can.

@DRuggeri
Contributor Author

it would require during init to reach out to the UPS and find out what is available, right?

That's making it more complicated than it needs to be, you would usually build up a map as you go. However the Go client will do all of this for you when you're using MustNewConstMetric.
Hah - yep - that makes sense. Thanks for connecting the dots. I've made this update.

can you say a bit more about the benefits of putting the variable name in the metric name rather than being a label dimension?

You're making things much more difficult and error prone for users to query for no potential benefit.

What is the idiomatic way to fail the scrape, though, and how does one detect the failure from the alertmanager side of things?

https://www.robustperception.io/failing-a-scrape-with-the-prometheus-go-client

This should be OK since a failure there leaves an empty slice negating all the logic in the range loop.

Silently missing data is not good, it's simplest to fail the scrape if any part of the data collection fails.
OK - I've updated the scrape to emit an invalid metric as described. How would this manifest on the Prometheus side? For example, I'd like to emit an alert if the scrape fails. Do I look for the invalid metric being set?

Can you say more about the labels you expect, but are missing?

Everything bar uptime from that list, as uptime is both a number and changing over time.
I guess I still don't understand. These ARE in place regardless of being set or not, and are set on every scrape. See example output:

# HELP network_ups_tools_ups_device_info UPS Device information
# TYPE network_ups_tools_ups_device_info gauge
network_ups_tools_ups_device_info{contact="",description="",location="",macaddr="",mfr="American Power Conversion",model="Back-UPS ES 750G",part="",serial="4B1538P29631",type="ups",ups="ups"} 1

What am I missing?

You also should decide if this is a single-target or multi-target exporter, in particular why is Ups an option? If it should be configurable by the user then it should come from a URL parameter rather than a config file, and the target label device should not be added in, as service discovery is Prometheus's responsibility.
I would say it makes sense as a multi-target exporter (I've only worked with single-target exporters before). I'm a bit unclear how I gain access to URL parameters during Collect, though. I was examining snmp_exporter's code and realized that the multi-target exporter consumes a config file rather than URL params. Any pointers on how to get access to query string parameters during a scrape?

https://github.com/DRuggeri/nut_exporter/blob/master/collectors/nut_collector.go#L86

You should leave it as the empty string if it's missing, generally try to keep the data as raw as you can.
OK, easy enough. This has been changed (see the example above).

@brian-brazil
Contributor

How would this manifest on the Prometheus side? For example, I'd like to emit an alert if the scrape fails. Do I look for the invalid metric being set?

Up will be 0, as for any failed scrape.

I guess I still don't understand. These ARE in place regardless of being set or not, and are set on every scrape.

That looks right, I misread the code.

I'm a bit unclear how I gain access to URL parameters during Collect, though. I was examining snmp_exporter's code and realized that the multi-target exporter consumes a config file rather than URL params. Any pointers on how to get access to query string parameters during a scrape?

By design, you don't. Instead do what the snmp exporter does, create a collector and registry per request.
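A sketch of that per-request pattern, assuming a hypothetical NewUPSCollector(ups string) constructor like the one above, here taking the UPS name; the paths and port are illustrative:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// upsHandler builds a fresh registry and collector for every scrape and takes
// the UPS name from the query string, e.g. /ups_metrics?ups=myups.
func upsHandler(w http.ResponseWriter, r *http.Request) {
	ups := r.URL.Query().Get("ups")

	registry := prometheus.NewRegistry()
	registry.MustRegister(NewUPSCollector(ups)) // hypothetical constructor taking the UPS name

	promhttp.HandlerFor(registry, promhttp.HandlerOpts{}).ServeHTTP(w, r)
}

func main() {
	http.HandleFunc("/ups_metrics", upsHandler)
	http.Handle("/metrics", promhttp.Handler()) // the exporter's own metrics
	log.Fatal(http.ListenAndServe(":9199", nil)) // port is illustrative
}
```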

@DRuggeri
Contributor Author

I'm a bit unclear how I gain access to URL parameters during Collect, though. I was examining snmp_exporter's code and realized that the multi-target exporter consumes a config file rather than URL params. Any pointers on how to get access to query string parameters during a scrape?

By design, you don't. Instead do what the snmp exporter does, create a collector and registry per request.

I see. OK - I've done this by instantiating a new collector during every scrape with a custom handler (all the base options set on command line are passed to the new collector, but the name of the ups is passed in the query string). This required a separate handler, so I've kept the UPS metrics at /metrics?ups=foo and left the usual promhttp.Handler() metrics at /exporter_metrics.

This may be a naive question, but is it an expensive operation to create a registry and handlers for each request or is that just splitting hairs?

@brian-brazil
Contributor

Generally /metrics is the metrics of the exporter itself, and some other path would have the ups metrics.

The "ups" label should be removed, that is the responsibility of Prometheus service discovery to provide. In general you should never have the same labelpair across all of a metrics page.

This may be a naive question, but is it an expensive operation to create a registry and handlers for each request or is that just splitting hairs?

It's going to be effectively nothing.

@DRuggeri
Contributor Author

Generally /metrics is the metrics of the exporter itself, and some other path would have the ups metrics.

OK, no worries. I've swapped them as proposed.

The "ups" label should be removed, that is the responsibility of Prometheus service discovery to provide. In general you should never have the same labelpair across all of a metrics page.

Sure - done - and I've added some notes in README as a helpful breadcrumb trail for those with multiple UPS devices configured in a single NUT server.

I think we should be good to go at this point - thanks for all the feedback!

@brian-brazil merged commit 0d74d5e into prometheus:master on Oct 1, 2020
@brian-brazil
Contributor

Thanks!
