More flexible label name pattern #1178

Closed
fabxc opened this Issue Oct 23, 2015 · 11 comments

@fabxc
Member

fabxc commented Oct 23, 2015

Prometheus and Kubernetes have common ground in their usage of labels, which makes them a very good fit for each other. However, Kubernetes has a more flexible pattern for label names. Prometheus only allows alphanumeric characters and '_'. Kubernetes additionally allows prefixing using . and / (detailed description).
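
Roughly, the two patterns compare as follows (the Prometheus regex is the documented one; the Kubernetes pattern is a simplified approximation for illustration only):

```go
package main

import (
	"fmt"
	"regexp"
)

// Prometheus label names: ASCII letters, digits and underscores, not starting with a digit.
var promLabelName = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// Kubernetes label keys: an optional DNS-subdomain-style prefix ending in "/",
// followed by a name of alphanumerics, '-', '_' and '.' (simplified here).
var k8sLabelKey = regexp.MustCompile(`^([a-z0-9][a-z0-9.-]*/)?[a-zA-Z0-9][a-zA-Z0-9._-]*$`)

func main() {
	for _, k := range []string{"app_name", "prometheus.io/metric_name", "kubernetes.io/hostname"} {
		fmt.Printf("%-28s prometheus=%v kubernetes=%v\n",
			k, promLabelName.MatchString(k), k8sLabelKey.MatchString(k))
	}
}
```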

When retrieving labels/annotations from Kubernetes via service discovery, we currently convert all . and / characters to _ to fit into our pattern.
This partially breaks an idiomatic interface between the two and may cause ambiguity and obfuscation.
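
As a minimal sketch of the conversion described above (the helper name is made up; this is not the actual service-discovery code):

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeLabelName is a hypothetical helper that folds the characters
// Kubernetes allows but Prometheus does not ('.' and '/') into '_'.
func sanitizeLabelName(name string) string {
	return strings.NewReplacer(".", "_", "/", "_").Replace(name)
}

func main() {
	fmt.Println(sanitizeLabelName("prometheus.io/scrape")) // prometheus_io_scrape
}
```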

Suboptimal integration with a single third-party component is of course not enough reason to change our model. However, I think adopting the Kubernetes pattern has advantages for Prometheus as well.

In our retrieval layer we have a multitude of labels with well-defined semantics that are used internally and set by the user in their configuration (specifically relabeling rules).
The Alertmanager is a component that should ideally have zero knowledge of Prometheus. Currently, this is only partially true because certain labels or annotations (currently named 'payload') are interpreted in a specific way. The semantics are implied which inhibits flexibility to integrate well with other systems via templating.
More broadly, Prometheus currently has no way to expose labels with well-defined meaning to other systems without them implicitly interpreting their source. On the same note, it offers no way for other systems to feed those labels into Prometheus explicitly.

We are currently achieving, more or less, a subset of those possibilities with our internal labels, which are prefixed with __. For seamless integration with other systems, a global namespace such as prometheus.io/<label> is superior.

There are some implications that would have to be solved either by breaking changes or by simply sticking with 'legacy' names, e.g. __name__ would more appropriately be prometheus.io/metric_name and alertname would be prometheus.io/alert_name. But that can be handled as far as I can see.

Normal metric labels are generally not affected by this, even though they could be.

Adopting the Kubernetes approach would make it easier for us to keep track of labels within our ecosystem that have a well-defined meaning, and it would make their usage in third-party components more explicit.

@brian-brazil

Member

brian-brazil commented Oct 23, 2015

I think we should be very wary of extending what we allow, as it limits our ability to add new syntax in the future.

Currently, this is only partially true because certain labels or annotations (currently named 'payload') are interpreted in a specific way. The semantics are implied which inhibits flexibility to integrate well with other systems via templating.

This is an acknowledged flaw in how the current Prometheus<->Alertmanager API developed; this will be resolved as part of the new alertmanager.

More broadly, Prometheus currently has no way to expose labels with well-defined meaning to other systems without them implicitly interpreting their source.

The problem here is much broader - Prometheus has no knowledge of what the meaning of any label is, as this is going to be almost entirely defined by the context of the particular organisation that's using it.

Adding first-class namespace support doesn't really add anything for our users: they just end up with longer label names, and for the things that have more standard names (e.g. job and alert summary) the semantics are no clearer.

On the same note, it offers no way for other systems to feed those labels into Prometheus explicitly.

The labelmap relabel action was added for exactly this purpose.
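
As a rough sketch of what labelmap does (illustrative only, not the real relabel code; the meta label name is just an example):

```go
package main

import (
	"fmt"
	"regexp"
)

// labelMap copies every label whose name matches re to a new label whose name
// is the expanded replacement, roughly what the labelmap relabel action does.
func labelMap(labels map[string]string, re *regexp.Regexp, replacement string) {
	mapped := map[string]string{}
	for name, value := range labels {
		if re.MatchString(name) {
			mapped[re.ReplaceAllString(name, replacement)] = value
		}
	}
	for name, value := range mapped {
		labels[name] = value
	}
}

func main() {
	labels := map[string]string{"__meta_kubernetes_label_app": "web"}
	labelMap(labels, regexp.MustCompile(`^__meta_kubernetes_label_(.+)$`), "$1")
	fmt.Println(labels["app"]) // web
}
```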

@grobie

Member

grobie commented Oct 23, 2015

I think this is not the first time this has come up. While a dot should be fine, a / clashes with the division operator. foo/bar would become ambiguous.

@brian-brazil

Member

brian-brazil commented Oct 23, 2015

While a dot should be fine

I want to use the dot so that metric.job is syntactic sugar for metric{job="job"}. When you have multiple groups and jobs sharing a Prometheus, having a job label helps avoid inadvertent clashes, so I believe almost all rules should use a job label and thus it should be syntactically easy.

a / clashes with the division operator. foo/bar would become ambiguous.

I've used systems in the past where / was a valid part of a name, and it caused exactly the problem you describe. Here it should work out parsing-wise, but anything we do to label names is likely to eventually propagate to metric names (for consistency and due to similar arguments), which would be problematic.

@fabxc

Member Author

fabxc commented Oct 23, 2015

I think we should be very wary of extending what we allow, as it limits our ability to add new syntax in the future.

We should always carefully consider any changes. That's the point of this issue. The statement can never be false though, so it's not an argument.

Currently, this is only partially true because certain labels or annotations (currently named 'payload') are interpreted in a specific way. The semantics are implied which inhibits flexibility to integrate well with other systems via templating.

This is an acknowledged flaw in how the current Prometheus<->Alertmanager API developed; this will be resolved as part of the new alertmanager.

Please elaborate in which way this is solved. The new API has annotations, which is just another word for payload. We still get sets of key/value pairs. Any assumptions we make about those are an implicit acknowledgement of their source.

The problem here is much broader - Prometheus has no knowledge of what the meaning of any label is, as this is going to be almost entirely defined by the context of the particular organisation that's using it.

The using organisation and the components it uses. That is exactly the point of the proposal, though.

Adding first-class namespace support doesn't really add anything for our users: they just end up with longer label names, and for the things that have more standard names (e.g. job and alert summary) the semantics are no clearer.

There are also users of our users and users of components Prometheus uses (AM), who benefit greatly from any reliable definition of labels that have a broader meaning.

The labelmap relabel action was added for exactly this purpose.

It's a tool to transform labels, nothing more.

@fabxc

Member Author

fabxc commented Oct 23, 2015

I want to use the dot so that metric.job is syntactic sugar for metric{job="job"}. When you have multiple groups and jobs sharing a Prometheus, having a job label helps avoid inadvertent clashes, so I believe almost all rules should use a job label and thus it should be syntactically easy.

Sorry, this has been an ongoing pseudo-argument for many months now. I absolutely do not see why syntactic sugar, for which there has been no need over an extended time, justifies blocking things that have actual value.

Also, this does not work by definition, as a job name can contain any Unicode character.

I see that there are potential problems when this propagates into metric names. One could limit this to label names for external communication.

@brian-brazil

Member

brian-brazil commented Oct 23, 2015

Please elaborate in which way this is solved. The new API has annotations, which is just another word for payload. We still get sets of key/value pairs. Any assumptions we make about those are an implicit acknowledgement of their source.

We added new things with specific meaning to the alertmanager code, rather than making them a separate structured field. generatorURL is the most obvious one there. With the new API we won't do so; anything which the alertmanager depends on at a code level should be an explicit field in the API.

The using organisation and the components it uses. That is exactly the point of the proposal, though.

Did you lose some words here?

There are also users of our users and users of components Prometheus uses (AM), who benefit greatly from any reliable definition of labels that have a broader meaning.

My argument is that this broader meaning doesn't exist in any useful sense. It's hard to get two teams in the same company to agree on what label taxonomies to use, let alone between companies. Target and alert labels tend to be deployment- and team-specific (though there are a few things like "severity" that are a de-facto standard, and within a company there tends to be some notion of "datacenter").

In my experience alert routing ends up being handled on a team-by-team basis. Every team ends up with their own part of the routing tree that they set up according to their particular needs.
Put another way, each Prometheus config and associated alertmanager config(s) on the other end is effectively private to that team. Asking them to put "my_team/" in front of their labels doesn't help them.

The labelmap relabel action was added for exactly this purpose.
It's a tool to transform labels, nothing more.

It allows you to feed labels defined by other systems into Prometheus as labels. Did you mean something else?

Sorry, this has been an ongoing pseudo-argument for many months now. I absolutely do not see why syntactic sugar, for which there has been no need over an extended time, justifies blocking things that have actual value.

I've used this and seen it used to great benefit previously; multi-tenant Prometheus setups and setups used by multiple groups who aren't all experts pose special challenges. This helps prevent one group inadvertently taking out another's monitoring (or their own). If you don't have such a setup you're mostly fine, but if you do this is essential.

I consider most of the examples of recording rules we have to be dangerous due to lacking a job label, as we're setting users up for an outage. It's too easy to inadvertently muck up some other job's data.

One could limit this to label names for external communication.

Hmm, what are you thinking here? I can see potential uses for remote storage, but I'm not seeing it for alerting or federation.

@fabxc

Member Author

fabxc commented Oct 23, 2015

We added new things with specific meaning to the alertmanager code, rather than making them a separate structured field. generatorURL is the most obvious one there. With the new API we won't do so; anything which the alertmanager depends on at a code level should be an explicit field in the API.

No, we explicitly agreed after long back-and-forth that there will be labels and annotations as the only sources for data (aside from start and end times for an alert). I hope we do not have to reopen this discussion.

My argument is that this broader meaning doesn't exist in any useful sense. It's hard to get two teams in the same company to agree on what label taxonomies to use, let alone between companies. Target and alert labels tend to be deployment- and team-specific (though there are a few things like "severity" that are a de-facto standard, and within a company there tends to be some notion of "datacenter").

In my experience alert routing ends up being handled on a team-by-team basis. Every team ends up with their own part of the routing tree that they set up according to their particular needs.
Put another way, each Prometheus config and associated alertmanager config(s) on the other end is effectively private to that team. Asking them to put "my_team/" in front of their labels doesn't help them.

Okay, I sense there might be some misunderstanding.

I do not want to promote prefixing all labels everywhere. I care about labels that affect the processing behavior of components. They are not free-form key/value pairs to slice and dice by, but parameters.

One could limit this to label names for external communication.
Hmm, what are you thinking here? I can see potential uses for remote storage, but I'm not seeing it for alerting or federation.

The Alertmanager is designed so that alerts can be pushed to it from any system. Prometheus alerts certainly have some annotation data that may be worth submitting to the AM and that can be used in notification templates.
It may not be meaningful to interpret them in the same manner if they are coming from another system. Having them namespaced ensures that a label has exactly the meaning I think it has.

There is also potential for more than one hop. Say I set up an alert that is dispatched to a webhook that performs some automatic action: the system behind that webhook has a parameter I define in my alert via labels or annotations.
Now I have to ensure that all systems in my chain do not clash on any label with well-defined meaning, which is not possible in general if I cannot globally define that meaning for that label.
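
For illustration, a sketch of how a namespaced key could make that meaning unambiguous across hops (the struct only mirrors the label/annotation shape of a pushed alert, and all keys below are invented):

```go
package main

import "fmt"

// Alert mirrors just the label/annotation shape of an alert pushed to the
// Alertmanager; start/end times and everything else are omitted.
type Alert struct {
	Labels      map[string]string
	Annotations map[string]string
}

func main() {
	a := Alert{
		Labels: map[string]string{
			"prometheus.io/alert_name": "HighErrorRate", // hypothetical namespaced key
			"severity":                 "page",
		},
		Annotations: map[string]string{
			"example.com/restart_target": "web-frontend", // parameter for a downstream webhook system
		},
	}
	// A system further down the chain can key off the namespaced entries
	// without guessing which component in the chain defined them.
	fmt.Println(a.Labels["prometheus.io/alert_name"], a.Annotations["example.com/restart_target"])
}
```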

I see that the line gets blurry for labels local to an organizational setup, e.g. severity. But I think for systems used across orgs in unknown setups it makes sense.

@brian-brazil

Member

brian-brazil commented Oct 23, 2015

No, we explicitly agreed after long back-and-forth that there will be labels and annotations as the only sources for data (aside from start and end times for an alert). I hope we do not have to reopen this discussion.

That agreement was for what comes from the alert rule itself. GeneratorURL belongs as a separate field similar to the times; I don't believe there are any others (the HA might require some, but the user shouldn't be exposed to them).

Having them namespaced ensures that a label has exactly the meaning I think it has.

I see where you're coming from now. The way I'd handle that is that there'd be some labelset that restricts which alerts the system behind the webhook looks at, whether via the alertmanager routing config or in the system itself, so you know they have the intended meaning.
Generally I'd handle such namespacing by a label or set of labels, and then a prefix like my_system_name_ on the particular label you're talking about.
Taking it a step further, you only want this system to look at alerts for your team, so you add in whatever label your organisation uses for that, and then only that cluster etc.
I don't see this use case being sufficiently different to require new syntax; what you need is some way to identify the alerts/time series you want that have the required semantics. Which piece of code/system it comes from is likely only one of several labels you need to specify.
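
A sketch of that alternative selection, with all label names invented for illustration:

```go
package main

import "fmt"

// wantedByWebhookSystem mimics the selection described above: the system
// behind the webhook only acts on alerts carrying its organisational labels
// and reads its parameter from a conventionally prefixed label.
func wantedByWebhookSystem(labels map[string]string) (action string, ok bool) {
	if labels["team"] != "platform" || labels["cluster"] != "eu-west-1" {
		return "", false
	}
	action, ok = labels["my_system_name_action"]
	return action, ok
}

func main() {
	labels := map[string]string{
		"team":                  "platform",
		"cluster":               "eu-west-1",
		"my_system_name_action": "restart",
	}
	if action, ok := wantedByWebhookSystem(labels); ok {
		fmt.Println("acting on:", action)
	}
}
```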

Now I have to ensure that all systems in my chain do not clash on any label with well-defined meaning, which is not possible in general if I cannot globally define that meaning for that label.

Per the above, the problem isn't just the clash of that particular label; it's a clash within the label space of your organisation. I can see how what you're proposing would help, but it's only solving one part of your label namespacing. You still need to do the rest for all the org-specific labels, so you're not saving much overall.

@fabxc

Member Author

fabxc commented Oct 28, 2015

To me, acknowledging something as a problem and then providing a workaround that will 'mostly be fine' feels uneasy; structuring something that is conceptually structured is not bad. As you said elsewhere, there is such a thing as being too generic.
But it's also how we approach the namespacing problem for metric names. In our instrumentation we acknowledge the hierarchical pattern of a metric but then opt to completely obfuscate it, e.g. prometheus_local_storage_chunk_ops_total – what's local or chunked here?
To what extent this is a practical problem is certainly up to personal judgement.

To make this short: as the discussion lacks opinions and participation, it is fair to say that the far-reaching changes required to implement this are just unrealistic. So we can prune this one out to focus on other things.

Since you brought up the shortcut for job again, it might be worth thinking about existing instrumentation patterns. If a job roughly corresponds to a binary, it is currently already baked into the metric name most of the time, aside from some general metrics provided through instrumentation libraries, the JMX exporter, or similar.

@fabxc fabxc closed this Oct 28, 2015

@beorn7

Member

beorn7 commented Nov 8, 2015

(/me still in catch-up mode. Just as an afterthought.)

Allowing anything that has a different meaning in PromQL as part of a label name would be a truly fundamental syntax change. "dot-job" notation would be a fundamental change of another kind.
But real namespacing is much more than a prefix with a unique separator. It's really about scope and all that…

In any case, all three things I've mentioned above are different problems (albeit not completely orthogonal), and any single one would require pretty deep changes. None of them is on the agenda right now.
