Consolidate metrics names to adhere to our guidelines #150

Closed · grobie opened this issue Oct 29, 2015 · 26 comments

Comments

@grobie (Member) commented Oct 29, 2015

There are currently many metrics which don't follow our naming guidelines, e.g. they're missing _total suffixes or units like _bytes.

This will be a breaking change, but for the better. As node_exporter is one of the most popular exporters, it should also lead by example.

@juliusv @brian-brazil

@brian-brazil (Contributor)

One thing to keep in mind is that we're not going to be able to apply this to all metrics that we're exporting here, as it's not practical to determine which are/aren't counters and what the units are (e.g. netstat). There's a balance with all exporters between maintainability and exposition perfection. Anything procedurally generated usually ends up on the maintainability end of things. We should be able to fix up most of the core metrics though.

@mischief (Contributor)

It is quite unfortunate that differences exist between platforms even within a single collector, such as node_memory_Active from the Linux meminfo module vs node_memory_active from the FreeBSD meminfo module.

The longer the differences exist, the more painful it will be to switch...

@brian-brazil (Contributor)

@mischief That's one of the ones we can't fix unfortunately. I'd imagine memory is also one of the areas where the semantics will be subtly different.

@mischief (Contributor)

@brian-brazil What does 'can't fix' mean here? Not willing to change because it would break compatibility?

@brian-brazil (Contributor)

@mischief What we have to work with is /proc/meminfo, which has 42 fields that appear to be largely undocumented, and it's not clear how that file has changed in the past or will change in the future. In that sort of situation, the best thing to do is dump the fields out with minimal munging.

@juliusv (Member) commented Oct 30, 2015

What if in this case we at least converted CamelCase to under_scores (and downcase everything in general)? There are already some other underscores in that file, but the result should still be ok. Or too risky?

@brian-brazil (Contributor)

Generally it's a little risky to go to underscores for a dump like that, as the original name is no longer apparent for searching online. In this case we also have a few that would end up messy, like NFS_Unstable, HugePages_Total, Hugepagesize, and DirectMap2M.

@grobie (Member, Author) commented Oct 30, 2015

The original name can be put in the Help text. I'd vote to export metrics according to our naming standards.

@juliusv (Member) commented Oct 30, 2015

Hmm, even with the name in the help text, huge_pages_total vs. hugepage_size would still look messy, and the _total would also make it look like a counter.

@brian-brazil (Contributor)

> I'd vote to export metrics according to our naming standards.

The challenge is that this requires hand-picking metric names in the case of exporters, so it's not practical when more than a handful of metrics are involved (and that presumes they don't change much over time).

This problem is not unique to the node exporter, it's a tradeoff we have to make with all exporters.

@discordianfish (Member)

Hmm, I thought I had commented after setting this to accepted:
After thinking about this again and again, I also think we should expose them according to our naming standards. I know that it requires some maintenance to keep this up to date, but I don't think it's that much work either, and I believe those changes are usually straightforward and something people can easily fix and contribute back.

The remaining question is just when and how to change it. It's a lot of work and IMO requires some test infra to make sure all collectors still work.

@brian-brazil (Contributor)

There are over 450 metrics exposed by default, and we don't even know the names of all of them in advance. That's more than a little maintenance.

@discordianfish (Member)

@brian-brazil We don't necessarily need to change all of them, but we should do it at least for the most common ones (cpu/memory/disk/net).

@discordianfish (Member)

Part of the consolidation should be having consistent metric names for the same things, see #414

@brian-brazil (Contributor)

We can only sanely fix things where we fully decide the name in the first place. CPU is one of those, memory is not.

@justinclift

Would it be useful to investigate how some of the older host monitoring solutions (Zabbix, etc.) do this? They may have figured out a practical approach which also lends itself well to cross-platform dashboards.

@brian-brazil (Contributor)

Based on what I've seen, I imagine they're hand picking 10-20 metrics. We're at a scale where that doesn't work.

Cross-platform is very unlikely to work for anything beyond the simplest of metrics. Even something like load average varies significantly across platforms.

@justinclift

No worries at all. 😄

@schweikert

I'd like to point out that the decision to rename node exporter metrics just to better follow naming conventions might make sense from a developer point of view, but from a user point of view it is going to be a catastrophic experience. I wish that not breaking things for users were considered more important than cosmetic considerations.

How are people supposed to deploy these changes? Deploy new dashboards and new node exporters on all hosts at the same time?

@SuperQ (Member) commented Mar 20, 2018

@schweikert Dashboards like Grafana can be configured to make multiple queries. You can set them up to query both the old and new names, which will provide a reasonably seamless transition.

The same goes for recording rules: you can set up rules to deal with both names, and even record the new name into the old rule name.

For example:

  rules:
  # The count of CPUs per node, useful for getting CPU time as a percent of total.
  # Rule based on the old metric name:
  - record: instance:node_cpus:count
    expr: count(node_cpu{mode="idle"}) without (cpu,mode)
  # The same rule based on the new metric name:
  - record: instance:node_cpus:count
    expr: count(node_cpu_seconds_total{mode="idle"}) without (cpu,mode)

@brian-brazil (Contributor)

You could also use or in a single expression.
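
For instance, something along these lines should work (a sketch mirroring the rule above, with the new name taking precedence; the record name is just illustrative):

  rules:
  - record: instance:node_cpus:count
    expr: >
      count(node_cpu_seconds_total{mode="idle"}) without (cpu,mode)
      or
      count(node_cpu{mode="idle"}) without (cpu,mode)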

@schweikert

@SuperQ, @brian-brazil: thanks for the tips, that's good to know. I still worry about how users are going to take this. The reputation cost there is going to be quite significant.

@juliusv (Member) commented Mar 20, 2018

@schweikert It's definitely painful, but it's a tradeoff between having broken metric names forever and having this temporary pain once (and it's best done while this is still a 0.x version; it might not be possible later, depending on the metric stability guarantees of a 1.x version).

From a usage and teaching perspective the node_cpu metric has always annoyed me, as I had to explain to people why it was named that way, and that it's not how you're supposed to name metrics in Prometheus... and yet it's one of the most commonly used metrics.

@schweikert

Maybe it would help the transition to have an option to publish the old metric names as well, in addition to the new ones.

I am thinking of our company, where we publish a node exporter RPM package for consumption by various teams. If we published a new package with the new metric names in the yum repo, it would immediately start breaking things for our users.

If we had an option to expose both metric names, we could communicate that there is a transition period, and remove that option from our package only a few months later.

@brian-brazil (Contributor)

Doubling the load on users' Prometheus servers would likely take many of them out. There's always metric_relabel_configs if you want a smoother transition.
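
For instance, a metric_relabel_configs entry on the node scrape job could rewrite the new name back to the old one during the transition (a sketch; the job name and target are illustrative, and this renames the series rather than exposing both):

  scrape_configs:
  - job_name: node
    static_configs:
    - targets: ['localhost:9100']
    metric_relabel_configs:
    # Rewrite the new metric name back to the old one so existing
    # dashboards and recording rules keep working for now.
    - source_labels: [__name__]
      regex: node_cpu_seconds_total
      target_label: __name__
      replacement: node_cpu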

@grobie (Member, Author) commented Mar 21, 2018

The metric_relabel_configs option does not provide the properties we want for a smooth transition: we need to keep our dashboards working against historical data. I re-opened #830 and added a comment with a few solutions I see to support the upgrade process.
