Consolidate metrics names to adhere to our guidelines #150

Closed · grobie opened this issue Oct 29, 2015 · 26 comments

Comments

@grobie (Member) commented Oct 29, 2015

There are currently many metrics which don't follow our naming guidelines, e.g. they're missing _total suffixes or units like _bytes.

This will be a breaking change, but for the better. As node_exporter is one of the most popular exporters, it should also lead by example.

@juliusv @brian-brazil

@brian-brazil (Contributor)

One thing to keep in mind is that we're not going to be able to apply this to all metrics that we're exporting here, as it's not practical to determine which are/aren't counters and what the units are (e.g. netstat). There's a balance with all exporters between maintainability and exposition perfection. Anything procedurally generated usually ends up on the maintainability end of things. We should be able to fix up most of the core metrics though.

@mischief (Contributor)

It is quite unfortunate that differences exist between platforms even within a single collector, such as node_memory_Active from the Linux meminfo module vs node_memory_active from the FreeBSD meminfo module.

The longer the differences exist, the more painful it will be to switch...

@brian-brazil (Contributor)

@mischief That's one of the ones we can't fix unfortunately. I'd imagine memory is also one of the areas where the semantics will be subtly different.

@mischief (Contributor)

@brian-brazil What does 'can't fix' mean here? Not willing to change because it would break compatibility?

@brian-brazil (Contributor)

@mischief What we have to work with is /proc/meminfo, which has 42 fields that appear to be largely undocumented, and it's not clear how that file has changed in the past or will change in the future. In that sort of situation, the best thing to do is dump the fields out with minimal munging.

@juliusv (Member) commented Oct 30, 2015

What if in this case we at least converted CamelCase to under_scores (and downcase everything in general)? There are already some other underscores in that file, but the result should still be ok. Or too risky?

@brian-brazil (Contributor)

Generally it's a little risky to go to underscores for a dump like that, as the original name is no longer apparent for searching online. In this case we also have a few that would end up messy, like NFS_Unstable, HugePages_Total, Hugepagesize, and DirectMap2M.

@grobie (Member, Author) commented Oct 30, 2015

The original name can be put in the Help text. I'd vote to export metrics according to our naming standards.

@juliusv (Member) commented Oct 30, 2015

Hmm, even with the name in the help text, huge_pages_total vs. hugepage_size would still look messy, and the _total would also make it look like a counter.

@brian-brazil (Contributor)

> I'd vote to export metrics according to our naming standards.

The challenge is that this requires hand-picking metric names in the case of exporters, so it's not practical when more than a handful of metrics are involved (and that presumes they don't change much over time).

This problem is not unique to the node exporter, it's a tradeoff we have to make with all exporters.

@discordianfish (Member)

Hmm, I thought I had commented after setting this to accepted:
After thinking about this again and again, I also think we should expose them according to our naming standards. I know that it requires some maintenance to keep this up to date, but I don't think it's that much work either, and I believe those changes are usually straightforward and something people can easily fix and contribute back.

The remaining question is just when and how to change it. It's a lot of work and IMO requires some test infra to make sure all collectors still work.

@brian-brazil (Contributor)

There are over 450 metrics exposed by default, and we don't even know the names of all of them in advance. That's more than a little maintenance.

@discordianfish (Member)

@brian-brazil We don't necessarily need to change all of them, but we should do it at least for the most common ones (cpu/memory/disk/net).

@discordianfish (Member)

Part of the consolidation should be having consistent metric names for the same things, see #414

@brian-brazil (Contributor)

We can only sanely fix things where we fully decide the name in the first place. CPU is one of those, memory is not.

@justinclift

Would it be useful to investigate how some of the older host monitoring solutions (Zabbix, etc.) do this? They may have figured out a practical approach which also lends itself well to cross-platform dashboards.

@brian-brazil (Contributor)

Based on what I've seen, I imagine they're hand picking 10-20 metrics. We're at a scale where that doesn't work.

Cross-platform is very unlikely to work for anything beyond the simplest of metrics. Even something like load average varies significantly across platforms.

@justinclift

No worries at all. 😄

@schweikert

I'd like to point out that the decision to rename node exporter metrics just to better follow naming conventions might make sense from a developer point of view, but from a user point of view it is going to be a catastrophic experience. I wish that not breaking things for users were considered more important than cosmetic considerations.

How are people supposed to deploy these changes? Deploy new dashboards and new node exporters on all hosts at the same time?

@SuperQ (Member) commented Mar 20, 2018

@schweikert Dashboards like Grafana can be configured to make multiple queries. You can set them up to query both the old and new names, which will provide a reasonably seamless transition.

The same goes for recording rules: you can set up rules to deal with both names, and even record the new name into the old rule name.

For example:

  rules:
  # The count of CPUs per node, useful for getting CPU time as a percent of total.
  # Rule based on the old metric name:
  - record: instance:node_cpus:count
    expr: count(node_cpu{mode="idle"}) without (cpu,mode)
  # The same rule based on the new metric name:
  - record: instance:node_cpus:count
    expr: count(node_cpu_seconds_total{mode="idle"}) without (cpu,mode)

@brian-brazil (Contributor)

You could also use or in a single expression.
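
For instance, something along these lines should work (a sketch mirroring the rule above, with the new name taking precedence; the record name is just illustrative):

  rules:
  - record: instance:node_cpus:count
    expr: >
      count(node_cpu_seconds_total{mode="idle"}) without (cpu,mode)
      or
      count(node_cpu{mode="idle"}) without (cpu,mode)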

@schweikert

@SuperQ, @brian-brazil: thanks for the tips, that's good to know. I still worry about how users are going to take this. The reputation cost there is going to be quite significant.

@juliusv (Member) commented Mar 20, 2018

@schweikert It's definitely painful, but it's a tradeoff between having broken metric names forever and having this temporary pain once (and it's best done while this is still a 0.x version; it might not be possible later, depending on the metric stability guarantees of a 1.x version).

From a usage and teaching perspective the node_cpu metric has always annoyed me, as I had to explain to people why it was named that way, and that it's not how you're supposed to name metrics in Prometheus... and yet it's one of the most commonly used metrics.

@schweikert

Maybe it would help the transition to have an option to publish the old metric names as well, in addition to the new ones.

I am thinking of our company, where we publish a node exporter RPM package for consumption by various teams. If we published a new package with the new metric names in the yum repo, it would immediately start breaking things for our users.

If we had an option to expose both metric names, we could communicate that there is a transition period, and remove that option from our package only a few months later.

@brian-brazil (Contributor)

Doubling the load on users' Prometheus servers would likely take many of them out. There's always metric_relabel_configs if you want a smoother transition.
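
For instance, a metric_relabel_configs entry on the node scrape job could rewrite the new name back to the old one during the transition (a sketch; the job name and target are illustrative, and this renames the series rather than exposing both):

  scrape_configs:
  - job_name: node
    static_configs:
    - targets: ['localhost:9100']
    metric_relabel_configs:
    # Rewrite the new metric name back to the old one so existing
    # dashboards and recording rules keep working for now.
    - source_labels: [__name__]
      regex: node_cpu_seconds_total
      target_label: __name__
      replacement: node_cpu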

@grobie (Member, Author) commented Mar 21, 2018

The metric_relabel_configs option does not provide the properties we want for a smooth transition: we need to keep our dashboards working against historical data. I re-opened #830 and added a comment with a few solutions I see to support the upgrade process.
