Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent and confusing regex anchoring behavior #1200

Closed
grobie opened this Issue Nov 4, 2015 · 9 comments

Comments

Projects
None yet
5 participants
@grobie
Copy link
Member

grobie commented Nov 4, 2015

Regex anchoring got introduced by @juliusv and @brian-brazil in #996 in config files as well as for the label_replace() function in promql.

It was not changed anywhere else in promql, so label regex matchers are still not anchored. See this example: node_boot_time{instance=~"robustperception"}.

As I'm not aware of a single linux tool, programming langue or other regex interface which does anchoring (please give examples otherwise), I'm questioning the benefit of anchoring. It's a non-standard behavior, requires additional knowledge and therefore, makes it harder for people to get started with Prometheus. I'm able to write a regex in almost any tool immediately without the need to look up the documentation, but with prometheus it took me several tries to understand why my label_replace function did not work as expected.

I heard the argument that anchored regexs are shorter, they're not in all cases (^4 vs. 4.*). In general I don't think this should be an argument over consistency with the majority of other regex implementations.

I also heard that anchoring makes it harder for people to write wrong regular expressions. I checked tens of expressions using regular matchers at soundcloud and couldn't find a single one using ambiguous regexps. I'd argue that the majority of developers knows how to use regexps and disagree that it's Prometheus' role to prevent the minority of people from doing something stupid.

I see 4 options to solve this issue:

  1. Leave the current state (anchoring behavior: config, label_replace; standard behavior: rest of promql)
  2. Revert the label_replace change (anchoring: config; standard: promql)
  3. Revert anchoring everywhere.
  4. Update all places without exception to use anchoring.

While I personally don't see the benefit of anchoring, my realistic vote is 2. It does not require another disruptive change to configs and provides a clear rule for user: anchoring in config, standard behavior in promql.

@fabxc @beorn7

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented Nov 4, 2015

Option 2. ist roughly what my discussion about this with @beorn7 would have lead me. I don't have an issue with doing anchoring myself and find the wrapping odd. But people form different backgrounds may feel the opposite.

In configs there may be the risk of people not anchoring properly out of laziness, which could lead to conflicts later on, though unlikely. As queries are more volatile, I don't feel the same reasoning applies there – certainly not partially for the relabel functions only.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Nov 4, 2015

FWIW, as part of #996 I also meant to make regex label matchers anchored back then (it's even stated as a goal in the issue), but apparently forgot that part. Up until today, I thought that I had made them anchored too. The intent was to make everything consistent, although that part was forgotten.

As for usages within SC, I've seen some wrong ones in the past (assuming anchored matches when they were not), but they are probably fixed due to user education by now.

As for 2), wouldn't anchoring some regexes and not others (config) be even more confusing to users than having anchoring everywhere?

What I like about anchoring is that if I want to match multiple full label values (common use case), I just have to write foo|bar|baz instead of ^(foo|bar|baz)$. I agree that unanchored usage is what's common in Unix tools though.

Anyways, I trust y'all to do the right thing. Just remember, if you do decide for 2), it'll be a second breaking change to label_replace(), reverting to the old behavior after breaking it once to be anchored. A bit of a weird back and forth.

Personally I prefer 4), but that might just be what I like myself.

/cc @brian-brazil

@brian-brazil

This comment has been minimized.

Copy link
Member

brian-brazil commented Nov 4, 2015

As I'm not aware of a single linux tool, programming langue or other regex interface which does anchoring (please give examples otherwise),

There's many that do beginning anchors, the problem is they aren't usually at all documented in this regard. The most obvious would be match vs search in python.

I'm able to write a regex in almost any tool immediately without the need to look up the documentation

I always end up having to explicitly anchor, as otherwise I'd end up having to read the code to determine if this particular interface anchors.

Due to the wide inconsistencies in this area I'm for 4) as that's probably what the user wants and had thought selectors already worked this way.

I'd argue that the majority of developers knows how to use regexps and disagree that it's Prometheus' role to guard the minority of people from doing something stupid.

This is different from my experience. Sysadmins tend to know it reasonably well, and developer knowledge tends not to be too great.

@grobie

This comment has been minimized.

Copy link
Member Author

grobie commented Nov 5, 2015

Due to the wide inconsistencies in this area

Please give examples for this claim. You presented one function in Python where a similar function is also available with partial match behavior. To my knowledge at least, the standard libraries in Java, Go or Ruby, the cmdline tools grep, sed, awk, test or the databases influxdb, graphite, opentsdb do all partial matching. Just to name a few in the related fields.

I'm for 4) as that's probably what the user wants and had thought selectors already worked this way.

Maybe time for a survey. I checked a few of the expressions written at SoundCloud and found examples for half anchored expressions (code=~"^5") as well as explicit partial matches (job=~"cassandra" with values "{namespace}-{db}-{suffix}"). So I'm at least not alone to expect partial matching behavior.

As I can not see your described "wide inconsistencies", my vote is to not get into a guessing game what a user might want here, but to follow the most common implementation (which seems to be partial matching by far) unless we have evidence that the absolute majority of Prometheus users indeed prefers anchored matching.

@brian-brazil

This comment has been minimized.

Copy link
Member

brian-brazil commented Nov 5, 2015

Please give examples for this claim. You presented one function in Python where a similar function is also available with partial match behavior.

For clarity please don't use the term "partial match", that has a completely different meaning when we're talking about text search. All the regexes we're talking about here are fully matched. The correct term is anchoring.

In my experience it's a coin toss as whether a given Python program or interface will use unanchored or beginning anchored regexes, so I gave up tracking which was which many years ago.

Go does not document if it is anchored (reading between the lines it's unanchored). Java is documented as anchored. Ruby is documented as unanchored. Given that OpenTSDB is written in Java and is backed by Cassandra which is also in Java I'd presume that it's also anchored (the docs give no indication, there's a hint that it's fully anchored).

There is no standard here. It boils down to which language(s) you're used to using.

Maybe time for a survey.

We've had several users get confused by the previously unanchored behaviour of Prometheus configs, particularly with relabelling. Looking at the end result of a code base doesn't tell you about what user expectations are, it just tells you what the user ended up doing after an unknown amount of trial and error.

I agree with @juliusv that most users are looking for anchored matches. In the case of properly used labels you're going to almost always want fully anchored matches, and if you think you don't then you probably haven't thought things all the way through.

Taking your second example, "{namespace}-{db}-{suffix}" is not a proper use of labels as that should be three separate labels. Due to this suboptimal label choice, you've also got an incorrect regex of job=~"cassandra". This needs to be anchored to be safe, and you also need to allow for collisions with other naming schemes that happen to contain 'cassandra' in their name. The correct regex is going to end up something like ^[^-]+-cassandra-.*$.

If that was three separate labels then job=~"cassandra" would still be wrong, as you'd pick up other jobs that happened to contain that string which I'm pretty sure isn't intended (if it was beginning anchored that'd be different as then there's fairly obvious namespacing). A more realistic example would be job!~"cassandra|someotherjob" which I'd expect to be equivalent to job!="cassandra",job!="someotherjob"

Any completely unanchored use of regexes with labels is likely incorrect. You need anchors to be safe.

@beorn7

This comment has been minimized.

Copy link
Member

beorn7 commented Nov 5, 2015

First of all: +1 for consistency. (Good that we clarified that the current inconsistency was an oversight and not a deliberate decision.)

WRT "standard" behavior: grep, sed, awk all work line-based, where non-anchored matching makes sense (because you want to find part of a line). Matching label values is a different thing because you are usually interested in the whole label. As a Unix command line counter-example look at expr, which does anchoring at the beginning.

Library review:

Python

Has search, fullmatch, and match (which is non-anchored, fully anchored, anchored at the beginning, respectively). The most "naturally" named function is thus implementing a behavior that is not present at all in Prometheus.

C++

The popular RE2 library has FullMatch and PartialMatch (for fully anchored and non-anchored, respectively), so no "standard" at all.

Boost has regex_match for fully anchored and regex_search for non-anchored.

Java

java.util.regex.Matcher has matches, find, lookingAt for fully anchored, non-anchored, anchored at the beginning, respectively.

Conclusion

It doesn't make sense to talk about "standard" behavior. Libraries either do not acknowledge a standard or call anchored matching just "match" and non-anchored matching something like "find" or "search", implying anchored matching is the more "natural" matching. Command line tools usually do non-anchored matching.

As said initially, I first and foremost want consistency. I don't feel too strongly about the choice, but my preference in the context of Prometheus would be full anchoring.

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented Nov 5, 2015

Going with a completely unified approach is easiest in the long run.
Most things that are not lists do indeed indicate that one is dealing with a label that should have been split into multiple ones.

I do think this will be breaking rules for a few people. As one is usually aware of the possible label values, skipping any leading and trailing .*-variants was just very convenient so far.

fabxc added a commit that referenced this issue Nov 5, 2015

Anchor regexes in vector matching
This commit makes the regex behavior of vector matching consistent with
configuration and label_replace() by anchoring it.

Fixes #1200

@fabxc fabxc closed this in #1201 Nov 5, 2015

@grobie

This comment has been minimized.

Copy link
Member Author

grobie commented Nov 5, 2015

Alright. Consistency inside of Prometheus is probably the most important thing, so the solution taken makes sense.

p.s. @brian-brazil The example was for a job label which has to be unique. As we currently don't support label definition for specific sd_config groups, each sd address is its own job with non-optimal naming.

wrouesnel pushed a commit to wrouesnel/prometheus that referenced this issue Jan 5, 2016

Anchor regexes in vector matching
This commit makes the regex behavior of vector matching consistent with
configuration and label_replace() by anchoring it.

Fixes prometheus#1200
@lock

This comment has been minimized.

Copy link

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.